# ask-community
d
Hi guys, we want to implement logging across all scripts that are used in Dagster, and also get the logs that come from imported libraries, as we have some self-written, private ones in our organization. Additionally, we are running one of our assets directly in Databricks via the `pyspark_step_launcher`. How can we achieve that? I know the documentation on Python logging; however, we do not want to adapt every submodule with `logger = get_dagster_logger()`, because we also use the modules outside of Dagster. The second option is to use the `logging` library and specify a normal logger: `logger = logging.getLogger("my_logger")`. Additionally, you need to make some modifications to the `dagster.yaml` file. However, if we choose this second option, we cannot gather any logs from Databricks. Meaning, the libraries we use in the job started in Databricks output some logs via the `logging` module, but they do not arrive in Dagit. It almost seems like the specification in `dagster.yaml` is ignored when Dagster is executed on Databricks (via the `pyspark_step_launcher`). So, how do we manage to receive all logs (incl. imported libraries) in Dagit, even when the scripts are executed in Databricks? Some help would be very much appreciated! Thanks
r
This is still an ongoing question of configurability, as currently it is not ergonomic to configure the logging system. I recommend that you add your use case here: https://github.com/dagster-io/dagster/discussions/12495
d
Thanks @rex! I will definitely add it there. However, do we have a solution to the problem in the meantime?
r
@owen mind weighing in here? This should have been made possible by https://github.com/dagster-io/dagster/pull/6046, right?
o
hi @David Weber -- because using the `databricks_pyspark_step_launcher` causes your Dagster code to be executed in an environment that does not have direct access to your DagsterInstance, it's correct that certain config (like the logging config) is ignored. So currently, it's not possible to capture `logging` calls directly into the structured event log when the step is executed on Databricks. However, the unstructured stdout/stderr streams are captured automatically, and provided you have a compute log manager set up, these should be available in Dagit. If you configure your logging (in Python, not in the Dagster instance) such that these logs are emitted to stdout/stderr, then they should be captured and viewable in Dagit. The added benefit of keeping these in stdout/stderr is that it prevents your database from being flooded with lots of log messages (and instead writes them to a more suitable, unstructured storage location).
d
Hi @owen, thank you for this suggestion! This is something we hadn't thought about. We will try to implement this idea tomorrow and I will report back to you. Thanks again!
Hi @owen, we did the following steps now:
• Set up a compute log manager in `dagster.yaml` which points to an Azure blob container.
• Set up the Python logging with a `StreamHandler` so that it outputs to `sys.stdout` (from where Dagster then puts things into the Azure container).
This works fine for everything that runs locally! But for the asset that is executed via the `pyspark_step_launcher` in Databricks, we get some weird behavior. We can even confirm that `dagster.yaml` is completely ignored (or at least the compute log manager) within the `pyspark_step_launcher`, because introducing mistakes into the `dagster.yaml` does not produce an error there; only when executed locally does Dagster throw an error. We are still troubleshooting and I will post an update once we have one.
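(For context, a sketch of the compute log manager setup described in the first bullet above — captured stdout/stderr uploaded to an Azure blob container. Field names are from memory of the dagster-azure docs and all values are placeholders; double-check against your installed version.)

```yaml
# dagster.yaml -- sketch: Azure blob compute log manager
compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account: mystorageaccount     # placeholder
    container: compute-logs               # placeholder
    secret_key:
      env: AZURE_STORAGE_KEY              # placeholder env var
    local_dir: /tmp/dagster-compute-logs  # local staging dir before upload
    prefix: dagster
```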
To support this with some Dagit screenshots: the first one shows how the logs are successfully captured when the asset is executed locally. The second image shows that (1) the INFO message from the first one and (2) the `stdout` logs are written directly inside Dagit, instead of in a file. Additionally, the `stdout` files in Azure blob seem to be empty (0 B), but the `stderr` files contain logs.
So local execution is just fine:
Execution via `pyspark_step_launcher` is weird:
o
ah sorry, I was mistaken about the behavior -- to demystify things slightly, the databricks pyspark step launcher works as follows:
• it ships a copy of your code into DBFS, along with some serialized state
• it executes that single step in-process on Databricks (by executing databricks_step_main.py)
• this process does not have access to your real DagsterInstance, so it creates a temporary one purely within Databricks (this is why your `dagster.yaml` does not impact execution)
• as the step executes, pickled events are written to a file
• back in the host process (outside of Databricks), this file is continually read from, and these events are processed as if they originated from the host process
• finally, when the step ends in Databricks, it writes its stdout/stderr to separate files
• the host process reads from these files
Originally, I mistakenly believed that it read these files and then emitted their contents to the stdout/stderr streams, but it turns out that they are logged into the structured event log. This isn't ideal behavior, and there's definitely a case for updating this to just do `sys.stdout.write`. If you're interested in filing an issue (or even contributing!), this would be a fairly straightforward fix
d
Hi @owen, I am Döme; we are working with David on this Dagster project. I have tested your proposed solution (changing `log.info(stdout)` to `sys.stdout.write(stdout)`), and it indeed solves the issue of the stdout on Databricks not being written to the `.out` file in the Azure container; however, this way we lose the final stdout logging in Dagit. So the solution seems to be to simply add the `sys.stdout.write(stdout)` call after `log.info(stdout)`. I have created a GitHub issue for this case: https://github.com/dagster-io/dagster/issues/14519
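(A rough sketch of the change being described: keep the existing structured logging of the captured output so it still appears in Dagit, and additionally write it to the real streams so the compute log manager's `.out`/`.err` files get populated. The surrounding function is hypothetical; only the `log.info(...)` / `sys.stdout.write(...)` calls come from the discussion and the linked PR.)

```python
import sys

def _forward_captured_output(log, stdout: str, stderr: str) -> None:
    # Hypothetical helper illustrating the host-process side of the step
    # launcher, after the stdout/stderr files written by Databricks are read back.
    log.info(stdout)          # existing behavior: shows up in Dagit's event log
    log.info(stderr)
    sys.stdout.write(stdout)  # proposed addition: lets the compute log manager
    sys.stderr.write(stderr)  # capture the output into the .out/.err files
```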
Furthermore, I have created a pull request with the aforementioned small modification; however, it is still not properly tested, so please suggest a way to test it! Pull request: https://github.com/dagster-io/dagster/pull/14521