# ask-community
d
Hi guys, we want to implement logging across all scripts that are used in Dagster, and also get the logs that come from imported libraries, as we have some self-written, private ones in our organization. Additionally, we are running one of our assets directly in Databricks via the `pyspark_step_launcher`. How can we achieve that? I know the documentation on Python logging; however, we do not want to adapt every submodule with `logger = get_dagster_logger()`, because we also use the modules outside of Dagster. The second option is to use the `logging` library and specify a normal logger: `logger = logging.getLogger("my_logger")`. Additionally, you need to make some modifications to the `dagster.yaml` file. However, if we choose this second option, we cannot gather any logs from Databricks. Meaning, the libraries we use in the job started in Databricks output some logs via the `logging` module, but they do not arrive in Dagit. It almost seems like the specification in `dagster.yaml` is ignored when Dagster is executed on Databricks (via the `pyspark_step_launcher`). So, how do we manage to receive all logs (incl. imported libraries) in Dagit, even when the scripts are executed in Databricks? Some help would be very much appreciated! Thanks
r
This is still an ongoing question of configurability, as currently it is not ergonomic to configure the logging system. I recommend that you add your use case here: https://github.com/dagster-io/dagster/discussions/12495
d
Thanks @rex! I will definitely add it there. However, do we have a solution to the problem in the meantime?
r
@owen mind weighing in here? This should have been made possible by https://github.com/dagster-io/dagster/pull/6046, right?
o
hi @David Weber -- because using the `databricks_pyspark_step_launcher` causes your Dagster code to be executed in an environment that does not have direct access to your DagsterInstance, it's correct that certain config (like the logging config) is ignored. So currently, it's not possible to capture `logging` calls directly into the structured event log when the step is executed on Databricks. However, the unstructured stdout/stderr streams are captured automatically, and provided you have a compute log manager set up, these should be available in Dagit. If you configure your logging (in Python, not in the Dagster instance) such that these logs are emitted to stdout/stderr, then they should be captured and viewable in Dagit. The added benefit of keeping these in stdout/stderr is that it prevents your database from being flooded with lots of log messages (and instead writes them to a more suitable, unstructured storage location).
d
Hi @owen, thank you for this suggestion! This is something we hadn't thought about. We will try to implement this idea tomorrow and I will report back to you. Thanks again!
Hi @owen, we did the following steps now:
• Set up a compute log manager in `dagster.yaml` which points to an Azure blob container.
• Set up the Python logging with a `StreamHandler` so that it outputs to `sys.stdout` (from where Dagster then puts things into the Azure container).
This works fine for everything that runs locally! But for the asset that is executed via the `pyspark_step_launcher` in Databricks, we get some weird behavior. We can even confirm that `dagster.yaml` is completely ignored (or at least the compute log manager) within the `pyspark_step_launcher`, because introducing mistakes into the `dagster.yaml` does not produce an error there; only when executed locally does Dagster throw an error. We are still troubleshooting and I will post an update once we have one.
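(For context, a sketch of the compute log manager setup described in the first bullet above — captured stdout/stderr uploaded to an Azure blob container. Field names are from memory of the dagster-azure docs and all values are placeholders; double-check against your installed version.)

```yaml
# dagster.yaml -- sketch: Azure blob compute log manager
compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account: mystorageaccount     # placeholder
    container: compute-logs               # placeholder
    secret_key:
      env: AZURE_STORAGE_KEY              # placeholder env var
    local_dir: /tmp/dagster-compute-logs  # local staging dir before upload
    prefix: dagster
```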
To support this with some Dagit screenshots: the first one shows how the logs are successfully captured when the asset is executed locally. The second image shows that (1) the INFO message from the first one and (2) the `stdout` logs are written directly inside Dagit, instead of in a file. Additionally, the `stdout` files in Azure blob seem to be empty (0 B), but the `stderr` files contain logs.
So local execution is just fine:
Execution via `pyspark_step_launcher` is weird:
o
ah sorry, I was mistaken about the behavior -- to demystify things slightly, the databricks pyspark step launcher works as follows:
• it ships a copy of your code into DBFS, along with some serialized state
• it executes that single step in-process on Databricks (by executing databricks_step_main.py)
• this process does not have access to your real DagsterInstance, so it creates a temporary one purely within Databricks (this is why your `dagster.yaml` does not impact execution)
• as the step executes, pickled events are written to a file
• back in the host process (outside of Databricks), this file is continually read from, and these events are processed as if they originated from the host process
• finally, when the step ends in Databricks, it writes its stdout/stderr to separate files
• the host process reads from these files
Originally, I mistakenly believed that it read these files and then emitted their contents to the stdout/stderr streams, but it turns out that they are logged into the structured event log. This isn't ideal behavior, and there's definitely a case for updating this to just do `sys.stdout.write`. If you're interested in filing an issue (or even contributing!), this would be a fairly straightforward fix
d
Hi @owen, I am Döme; we are working with David on this Dagster project. I have tested your proposed solution (changing `log.info(stdout)` to `sys.stdout.write(stdout)`), and it indeed solves the issue of the stdout on Databricks not being written to the `.out` file in the Azure container; however, this way we lose the final stdout logging in Dagit. So the solution seems to be to simply add the `sys.stdout.write(stdout)` call after `log.info(stdout)`. I have created a GitHub issue for this case: https://github.com/dagster-io/dagster/issues/14519
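(A rough sketch of the change being described: keep the existing structured logging of the captured output so it still appears in Dagit, and additionally write it to the real streams so the compute log manager's `.out`/`.err` files get populated. The surrounding function is hypothetical; only the `log.info(...)` / `sys.stdout.write(...)` calls come from the discussion and the linked PR.)

```python
import sys

def _forward_captured_output(log, stdout: str, stderr: str) -> None:
    # Hypothetical helper illustrating the host-process side of the step
    # launcher, after the stdout/stderr files written by Databricks are read back.
    log.info(stdout)          # existing behavior: shows up in Dagit's event log
    log.info(stderr)
    sys.stdout.write(stdout)  # proposed addition: lets the compute log manager
    sys.stderr.write(stderr)  # capture the output into the .out/.err files
```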
Furthermore, I have created a pull request with the aforementioned small modification; however, it is still not properly tested, so please suggest a way to test it! Pull request: https://github.com/dagster-io/dagster/pull/14521