Zach
05/31/2022, 2:46 PM
A few comments regarding the job runs view with hundreds of ops:
• log loading performance is pretty poor: it takes a few minutes to load the logs, and they're completely unqueryable until they're all loaded
• all logs are re-fetched every time the job run page is opened, so if you accidentally close the job runs tab you have to go get a cup of coffee while it loads again. having the logs cached somehow would help mitigate the initial load time
• it would be nice to be able to pan and zoom around the op view. we have a job with hundreds of ops, and in particular the ability to zoom in and out on different parts of the job would be really useful for exploring the generated op graph
👍 1

rex
05/31/2022, 2:51 PM
For your third bullet, I believe you can already pan/zoom in the op view using the slider in the top pane
you can also use op selection within the runs view to highlight different parts of your job

Zach
05/31/2022, 2:58 PM
ah okay, I missed that. I think the issue then is that the slider doesn't really let you zoom out to get a high-level overview - the default is already the most zoomed out. op selection is awesome, and I could probably use it more to help navigate larger graphs. I didn't realize until now that you can do some pattern-matching within an op's name, which is super handy for subselecting large fan-outs. seems like it's just basic wildcard matching that's supported?
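[Editor's note: a minimal sketch of Dagster's op-selection syntax for readers following along. The op names are hypothetical; besides plain names (and the basic wildcard matching on names mentioned above), the selection language supports graph-traversal operators, usable in Dagit's op-selection input and, as far as I know, via the `op_selection` argument in the Python API.]

```python
# Hypothetical op names; the traversal operators are Dagster's documented
# op-selection syntax.
selections = [
    "process_chr_1",    # just this op
    "process_chr_1*",   # this op and everything downstream of it
    "*process_chr_1",   # this op and everything upstream of it
    "+process_chr_1",   # this op and its direct upstream neighbors
    "process_chr_1+",   # this op and its direct downstream neighbors
]
```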

rex
05/31/2022, 3:03 PM

Zach
05/31/2022, 3:04 PM
also seems like the slider is only available in the timed view

rex
05/31/2022, 3:05 PM
> I think the issue then is that the slider doesn't really let you zoom out to get a high-level overview - the default is already the most zoomed out.
To the left of the slider, there are other views. Have you seen the waterfall view? I wonder if that meets your needs for an overview
gotcha, you want zoom/pan even on the waterfall view

Zach
05/31/2022, 3:15 PM
yeah I think so, let me explain the use case a little more - we've got a dynamic graph with a few hundred ops in the same fan-out, so basically a big vertically-stacked list of ops in the UI. our fan-out ops are generated as clusters of ops that are responsible for processing a domain entity (a chromosome). because Dagit displays the ops in sort order, we're able to index them alphanumerically and have the ops responsible for processing the same chromosome displayed together. what would be helpful is a way to quickly see whether there are clusters of ops failing, which may indicate a failure specific to a particular chromosome that requires targeted tuning or diagnostics. being able to zoom out farther on both the timed view and the waterfall view might help identify these clusters of failures from a high vantage point, and then I could zoom in or subselect them to start reviewing logs.
I do think I could probably also use the summary on the right a little more to get some of this information
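[Editor's note: a rough sketch of the graph shape described above, using Dagster's dynamic outputs. All names (the chromosome list, op names) are hypothetical, and the real job presumably maps a cluster of several ops per chromosome rather than a single one; this is only meant to illustrate the fan-out pattern, not the actual pipeline.]

```python
from dagster import DynamicOut, DynamicOutput, job, op

# Hypothetical list of domain entities to fan out over.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


@op(out=DynamicOut())
def fan_out_chromosomes():
    # One DynamicOutput per chromosome; the mapping_key becomes part of the
    # generated step names, so sort order groups steps for the same chromosome.
    for chrom in CHROMOSOMES:
        yield DynamicOutput(chrom, mapping_key=chrom)


@op
def process_chromosome(context, chrom: str) -> str:
    context.log.info(f"processing {chrom}")
    return chrom


@op
def summarize(results: list):
    # Fan back in over all mapped results.
    return len(results)


@job
def chromosome_job():
    summarize(fan_out_chromosomes().map(process_chromosome).collect())
```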

alex
05/31/2022, 4:28 PM
what setup do you have for logging to generate this volume of log output? Are you configuring things so all Python logging is sent to Dagster's structured log? Have you considered leveraging the "raw" stdout/stderr log viewing?

Zach
05/31/2022, 4:40 PM
right now everything is using Dagster's default logger. I hadn't really considered using the raw viewer. do you mean having logs not go to the standard Dagster logger and instead having regular Python loggers whose output can be viewed in the raw viewer?

alex
05/31/2022, 5:01 PM
> right now everything is using Dagster's default logger
just to make sure I understand precisely: how exactly are you configuring things to achieve this?
> I hadn't really considered using the raw viewer. do you mean having logs not go to the standard Dagster logger and instead having regular Python loggers whose output can be viewed in the raw viewer?
Ya, this is what I was referring to. The product experience around this is lacking, so I was curious whether you even knew it was there, let alone had considered it. The structured event log that you are adding all these log messages to is built with a focus on structured events; you can see in the open source implementation that each entry maps to a row in the database. In the long term we'd ideally handle all of this much more seamlessly, but in the short term it's unlikely these things will change drastically, and your best lever for improving the experience is to only include high-signal log messages in the main stream and leverage the raw viewer for high-volume, low-signal information.
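[Editor's note: a minimal sketch of the split alex is suggesting, assuming default settings where un-managed Python loggers are not forwarded into Dagster's structured event log. The logger name, op name, and loop are hypothetical.]

```python
import logging

from dagster import op

# Assumes a handler that writes to stderr; with default settings this output
# only shows up in the raw compute-log viewer, not the structured event log.
logging.basicConfig(level=logging.DEBUG)
raw_log = logging.getLogger("my_pipeline.raw")  # hypothetical logger name


@op
def crunch_numbers(context):
    context.log.info("starting crunch_numbers")   # structured event log (one DB row)
    for i in range(100_000):
        raw_log.debug("processed record %s", i)   # stderr -> raw viewer only
    context.log.info("finished crunch_numbers")   # structured event log (one DB row)
```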

Zach
05/31/2022, 5:13 PM
right now there's zero configuration of the Dagster logger; everything within ops is being logged using the logger provided on the OpExecutionContext. so it does make sense that the logs could be somewhat overloading things. to be clear, is it any log event emitted to the Dagster structured event logger that will cause this strain, or just those above the configured log level? that workaround makes sense, and I'll think about how I might be able to separate these log streams a bit more. is this mostly achieved through logging levels? or is there a way to get Dagster to pick up logging output from another logger external to Dagster and only display it in the raw viewer?

alex
05/31/2022, 6:15 PM
> is it any log event emitted to the Dagster structured event logger that will cause this strain, or just those above the configured log level?
all `context.log` calls will cause this strain
> is this mostly achieved through logging levels? or is there a way to get Dagster to pick up logging output from another logger external to Dagster and only display it in the raw viewer?
any output to stdout/stderr should be caught by log capture, so that could be direct use of a Python `logger`, `print`, or even the output of non-Python subprocesses.
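[Editor's note: a small sketch to illustrate what compute-log capture picks up, per alex's description. The op name and external command are hypothetical.]

```python
import subprocess

from dagster import op


@op
def run_external_tool(context):
    # Everything written to stdout/stderr during the op body is picked up by
    # Dagster's compute-log capture and shown in the raw viewer.
    print("about to launch the external tool")             # stdout -> raw viewer
    subprocess.run(["samtools", "--version"], check=True)  # hypothetical subprocess; its output -> raw viewer
    context.log.info("external tool finished")             # structured event log
```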

Zach
05/31/2022, 6:19 PM
okay, that all helps me understand quite a bit better how logging works under the hood. thanks so much, I think I've got some paths forward for mitigating this a bit
I had a `context.log.debug()` statement in a long polling loop on every op in my big fan-out 😳 I bet that's a huge contributor to, if not the whole cause of, the problem here, as otherwise the ops really aren't logging that much even though there are a lot of them.
:blob_woah: 1
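[Editor's note: a minimal sketch of one way to defuse that pattern, following alex's advice: keep the per-poll message out of the structured event log and only emit structured messages on state changes. `check_job_status`, the statuses, and the polling interval are all hypothetical.]

```python
import time

from dagster import op


def check_job_status(job_id: str) -> str:
    # Hypothetical stand-in for polling an external system.
    return "SUCCEEDED"


@op
def wait_for_external_job(context, job_id: str) -> str:
    context.log.info(f"waiting for external job {job_id}")        # structured, once
    while True:
        status = check_job_status(job_id)
        print(f"poll {job_id}: {status}")                         # raw stdout only, cheap per iteration
        if status in ("SUCCEEDED", "FAILED"):
            context.log.info(f"job {job_id} finished: {status}")  # structured, once
            return status
        time.sleep(30)
```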

alex
05/31/2022, 6:36 PM
ya, that sounds like it has a good chance of being the sole cause of it being so unreasonably bad
thanks for the follow-up. We should surface this kind of information in a more obvious way to make issues like this easier to catch.

Zach
05/31/2022, 7:09 PM
no problem, I really appreciate you being so responsive as always. it's a tough thing to catch on the platform side since it's more of a user-side issue. I think even a little banner that appeared when components were taking a long time to load, with some contextual info about the GraphQL calls that are taking longer than normal, might help. tough to know where that threshold is, though

rex
06/01/2022, 7:29 AM
we'll be able to catch this in the future (and make other users more aware) by surfacing the count of events - thanks to dish and https://github.com/dagster-io/dagster/pull/8141
:partydagster: 1