David Weber
05/24/2023, 1:53 PM
pyspark_step_launcher.
How can we achieve that? I know the documentation on python logging; however, we do not want to adapt every submodule with logger = get_dagster_logger(), because we also use these modules outside of Dagster.
The second option is to use the logging library and specify a normal logger: logger = logging.getLogger("my_logger").
Additionally, you need to make some modifications to the dagster.yaml file.
However, if we choose this second option, we cannot gather any logs from Databricks. That is, the libraries we use in the job that is started on Databricks emit logs via the logging module, but those logs never arrive in Dagit. It almost seems as if the specification in dagster.yaml is ignored when Dagster is executed on Databricks (via the pyspark_step_launcher).
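For context, the dagster.yaml modification described for this second option is usually along these lines (a sketch; the logger name and level are illustrative, and the exact keys should be checked against the docs for your Dagster version):

```yaml
# dagster.yaml (sketch): ask Dagster to capture records emitted by
# named python loggers into its structured event log
python_logs:
  managed_python_loggers:
    - my_logger
  python_log_level: INFO
```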
So, how can we get all logs (including those from imported libraries) into Dagit, even when the scripts are executed on Databricks?
Some help would be very much appreciated! Thanks

rex
05/24/2023, 2:00 PM

David Weber
05/24/2023, 2:03 PM

rex
05/24/2023, 3:07 PM

owen
05/24/2023, 3:26 PM
Because the databricks_pyspark_step_launcher
causes your Dagster code to be executed in an environment that does not have direct access to your DagsterInstance, it's correct that certain config (like the logging config) is ignored. So currently, it's not possible to capture logging calls directly into the structured event log when the step is executed on Databricks.
However, the unstructured stdout/stderr streams are captured automatically, and provided you have a compute log manager set up, these should be available in dagit. If you configure your logging (in python, not in the dagster instance) such that these logs are emitted to stdout/stderr, then they should be captured and viewable in dagit.
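This suggestion can be sketched as follows (the logger name and format are illustrative, not from the thread); note that StreamHandler() with no argument defaults to sys.stderr, so stdout must be passed explicitly:

```python
import logging
import sys

# Route a plain python logger to stdout so that the compute log manager,
# which captures the stdout/stderr streams, picks the records up.
logger = logging.getLogger("my_logger")
logger.setLevel(logging.INFO)

# StreamHandler() without an argument writes to sys.stderr by default;
# pass sys.stdout explicitly so records land in the stdout capture.
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

logger.info("this record is written to stdout, not stderr")
```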
The added benefit of keeping these logs in stdout/stderr is that it prevents your database from being flooded with lots of log messages (and instead writes them to a more suitable, unstructured storage location).

David Weber
05/24/2023, 3:37 PM

David Weber
05/25/2023, 12:01 PM
dagster.yaml, which points to an Azure blob container.
• Setup the python logging with a StreamHandler in a way that it outputs to sys.stdout (where Dagster then puts things into the Azure container).
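The dagster.yaml part of this setup presumably looks something like the following (a sketch with placeholder account and container names; field names are per dagster-azure's AzureBlobComputeLogManager and should be checked against the version in use):

```yaml
# dagster.yaml (sketch): upload the captured stdout/stderr of each
# step to an Azure blob container
compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account: mystorageaccount   # placeholder
    container: compute-logs             # placeholder
    secret_key:
      env: AZURE_STORAGE_KEY
    local_dir: /tmp/dagster-compute-logs
    prefix: compute-logs
```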
This works fine for everything that runs locally! But for the asset that is executed via the pyspark_step_launcher
in Databricks, we get some weird behavior.
We can even confirm that dagster.yaml (or at least the compute log manager config) is completely ignored within the pyspark_step_launcher, because introducing mistakes into dagster.yaml does not produce an error there; only when executed locally does Dagster throw an error.
We are still troubleshooting and will post an update if we have one.

David Weber
05/25/2023, 1:34 PM
The stdout logs are written directly inside Dagit, instead of in a file.
Additionally, the stdout files in Azure blob seem to be empty (0 B), but the stderr files contain logs.

David Weber
05/25/2023, 3:35 PM

David Weber
05/25/2023, 3:36 PM
pyspark_step_launcher is weird:

owen
05/25/2023, 6:09 PM
sys.stdout.write. If you're interested in filing an issue (or even contributing!) this would be a fairly straightforward fix.

Döme Lőrinczy
05/30/2023, 1:09 PM

Döme Lőrinczy
05/30/2023, 2:20 PM