# ask-community
k
Hey all – how can I get Spark driver/executor logs to show up in Dagit?
s
Hi Kevin - how are you running Spark? I think getting the executor logs to show up in Dagit would be tough (other than maybe a link), but in many cases it shouldn't be hard to get the driver logs to show up.
k
Hey Sandy – I'm starting a new container per asset build with the `DockerRunLauncher`, and then using `dagster_pyspark` to get a Spark session.
It gets started up and destroyed on the same host, so I think there should actually only be driver logs.
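A minimal sketch of that launcher setup in `dagster.yaml`, assuming the standard `DockerRunLauncher` config (the image name is a placeholder):
```yaml
# dagster.yaml – launch each run in its own Docker container
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    image: my-spark-dagster-image  # placeholder: image with Spark + Python deps
```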
I tried adding everything I could think of under `managed_python_loggers`, but I haven't been able to capture the stdout / what gets printed in the Docker container:
```yaml
python_logs:
  python_log_level: INFO
  managed_python_loggers:
    - py4j.clientserver
    - py4j.java_gateway
    - py4j.protocol
    - py4j.java_collections
    - py4j.finalizer
    - py4j.signals
    - py4j
    - pyspark
    - pyspark.sql
    - dagster_pyspark
    - dagster-pyspark
```
s
Have you checked the compute logs to see if it's there? Along with the "managed logs", Dagster also records stdout and stderr from the processes it manages. There's a toggle to access these on the run page.
k
yep. the only thing I see there is output from the Python loggers I enabled in that block above, but it's not the typical nicely formatted Spark logs – it's a bunch of control statements from py4j, like this:
```
DEBUG:py4j.clientserver:__ASSET_JOB_0 - 8283175e-b8ca-4e80-abae-4884af8242db - [redacted] - Answer received: !yv
DEBUG:py4j.clientserver:__ASSET_JOB_0 - 8283175e-b8ca-4e80-abae-4884af8242db - [redacted] - Command to send: c
o62
collectToPython
e
```
whereas in the Docker container, I'm seeing stuff like this:
```
2023-06-07 18:29:49 23/06/07 22:29:49 INFO Executor: Running task 26.0 in stage 1.0 (TID 78)
2023-06-07 18:29:49 23/06/07 22:29:49 INFO Executor: Finished task 21.0 in stage 1.0 (TID 73). 1964 bytes result sent to driver
2023-06-07 18:29:49 23/06/07 22:29:49 INFO TaskSetManager: Starting task 27.0 in stage 1.0 (TID 79) (462fb47a4b12, executor driver, partition 27, PROCESS_LOCAL, 7475 bytes)
```
s
Just to make sure – you're seeing that when you click the icon with the "Raw compute logs" tooltip? Asking because I would expect the `managed_python_loggers` setting to only affect the logs that show up when the other icon ("Structured event logs") is selected.
k
yep, they appear in both for me actually
s
got it - this is not my area of expertise, but forwarding this question to the experts
k
thanks, appreciate it! For the other connectors that you guys have that talk w/ Spark (Databricks, Snowflake, etc.) – do they give back a stream of logs from the driver + executor that you're explicitly piping into Dagit? Wondering if we could do the same.
@sandy any luck on looping in people here?
a
are you just using the default compute log manager? https://docs.dagster.io/deployment/dagster-instance#compute-log-storage I believe you either need to set up volume mounts to ensure the files written out for log capture are available outside the container, or use one of the blob-storage-backed log managers
though, if the logs were simply lost or unavailable to Dagit due to containerization, I would expect nothing – and it appears you are getting some
are you using a specific executor, or just the default? Log capture works by capturing the output of the processes that Dagster is managing, and any created subprocesses that inherit its file descriptors for stderr/stdout. My personal Spark knowledge is minimal – is the PySpark driver a separate process that gets shared between steps?
k
standard compute log manager right now, yes. Would it be right to say that if the logs aren't showing up in Dagit (because the host process is not Dagster-managed, etc.), the compute log manager won't make a difference – that it's just where the files are stored? Or is the compute log manager actually responsible for capturing logs as well? I'm using the `DockerRunLauncher` and a Docker image that includes Spark and all our Python dependencies (I believe this is in line w/ the recommendation in the docs for k8s/docker deployments).

I believe part of the complexity in using PySpark is that the Spark backend is a Java process regardless of which language you write your transforms in – so both the execution and the logging are handled by Java; PySpark is just a Python wrapper around them. It makes calls to the Java API through py4j, and logging goes through log4j (a Java library). The few logs I've gotten to show up so far came from declaring py4j as a Dagster-managed logger, but it's not very useful output. With that said – if these logs are making it to stdout in my Docker container, I feel like we should be able to capture them somehow, and PySpark is widely used across the supported Dagster integrations, so hopefully many will benefit if we can figure this out.
a
the compute log manager is responsible for how logs are captured as well. The issue with the default one and the `DockerRunLauncher` is that the default one stores the captured logs on the filesystem, which – unless you set up volume mounts – exists only ephemerally within the container launched for the run. You'll need to either add config to the run launcher to set up a mapping from the container's filesystem to the host filesystem using volume mounts, or just switch to a compute log manager that uploads to blob storage, e.g. S3.
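For reference, a minimal sketch of the blob-storage route in `dagster.yaml`, using the `S3ComputeLogManager` from `dagster_aws` (the bucket and prefix are placeholders):
```yaml
# dagster.yaml – upload captured stdout/stderr to S3 so the logs survive
# the run container and are readable from the Dagit container
compute_logs:
  module: dagster_aws.s3.compute_log_manager
  class: S3ComputeLogManager
  config:
    bucket: "my-compute-log-bucket"  # placeholder
    prefix: "dagster-compute-logs"   # placeholder
```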
f
Hi Alex, do you have an example of how to configure the Docker run launcher to make the compute logs available in the UI? How do you set up the container volume? Thanks!
a
I do not have an example on hand. Here are some more specific docs links – how to set the directory for compute logs: https://docs.dagster.io/deployment/dagster-instance#localcomputelogmanager and an example of volume mounting via container kwargs on the run launcher: https://docs.dagster.io/deployment/guides/docker#mounting-volumes
e
Thanks for the help Alex, I've figured it out!
👌 1
p
@Edo Do you mind sharing how you managed? I'm still stuck here. So far, I've mounted the volume in Docker and added this volume to the respective containers. In my dagster.yaml, I then added the following:
```yaml
compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir: /opt/dagster/logs
```
`/opt/dagster/logs` being my mounted volume. I can see my volume in Docker, and all of the subfolders being created by each run, but I don't see any files.
Ah! Actually, now I got it – I still need to add the volume that I've defined in the docker-compose file to `DockerRunLauncher.container_kwargs` 👀
👍 1
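Putting the thread's resolution together – a minimal sketch of the relevant `dagster.yaml` pieces, assuming a placeholder host directory `/var/dagster/logs` that is also mounted into the Dagit container so the UI can read the captured files:
```yaml
# dagster.yaml – local compute log capture plus the volume mount that
# makes the captured files visible outside the run container
compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir: /opt/dagster/logs

run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    container_kwargs:
      volumes:
        # host_path:container_path – host path is a placeholder; mount the
        # same host path into the Dagit container as well
        - /var/dagster/logs:/opt/dagster/logs
```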