# dagster-plus
e
Hey, we're getting "Agent e4cc09c5 detected that run worker failed: The code location server that was hosting this run is not responding. It may have crashed or been OOM killed." when trying to materialize assets in a job by running said job manually. However, if we try to materialize asset by asset, all is well. How can we debug this?
Other times, when running the same job, we're getting "[DagsterApiServer] Run execution process for af60a491-8955-45fc-86c9-3654677b35e9 unexpectedly exited with exit code -9." followed by "This run has been marked as failed from outside the execution context.".
d
Hi Emilja - it sounds from this error message like the job needs more memory - you may want to turn off non-isolated runs which will make the run take a bit longer to start up, but give it a lot more memory. There are instructions here for turning it off for a particular run from the launchpad or for all runs: https://docs.dagster.io/dagster-cloud/deployment/serverless#run-isolation
e
hmm, the assets in question aren't that heavy, though - just dbt (which takes up to 5 mins) and some external API (Airbyte), and when I select all the assets manually from the asset window and click materialize, all is well.
d
Is that something you’ve reproduced multiple times? it’s consistently crashing when you press materialize all, but never crashing when you select all the assets manually and materialize them all in a single run?
e
for selecting assets under a job, yes, tried several times. for materializing all assets via the Asset tab, only tried once. Waiting for a few of our own internal jobs to finish, then will try again.
I'm not sure on the best way to get help for this 😅 I could send a link to the job in Dagster Cloud for reference. Should I do it here or using the chat in Dagster Cloud?
d
Posting or DMing a link to the run in cloud here is fine. I'm pretty confident that turning off non-isolated runs when you launch the run will fix this; we can dig into why it's using more memory than you expect, though.
e
alright, will see about turning off non-isolated runs. As for the memory usage, might be good to know in the future, but if the other solution works then it's probably not as urgent.
d
We did some digging and I think we have a lead on why this is using so much memory - we think it's running more steps in parallel at once by default than it should be. We'll keep you posted when we have a fix out to make the default more memory-safe.
e
Thanks!
d
In the meantime, you could launch it with a config like this:
```yaml
execution:
  config:
    multiprocess:
      max_concurrent: 2
```
(Right now I think it is trying to do them all in parallel at once - even if each one individually doesn't use large amounts of memory, running ~15 of them at once could still cause memory problems)
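To illustrate why capping concurrency caps peak memory, here is a minimal stdlib sketch (not Dagster's actual executor - the step names and worker pool are hypothetical stand-ins): bounding the number of workers bounds how many steps hold memory at the same time, the same effect `max_concurrent: 2` has on the multiprocess executor.

```python
# Illustrative sketch only: ThreadPoolExecutor's max_workers plays the
# role of max_concurrent. At most two "steps" run (and hold memory) at
# once, instead of all ~15 in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_step(step_name: str) -> str:
    # Stand-in for a real step (a dbt model, an Airbyte sync, ...).
    return f"{step_name}: done"

steps = [f"step_{i}" for i in range(15)]

# max_workers=2 mirrors max_concurrent: 2 from the run config above.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_step, steps))

print(results[0])
print(len(results))
```

Raising `max_workers` trades memory headroom for wall-clock time, which is the same trade-off being discussed for `max_concurrent`.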
e
2 concurrent steps will take too long for the job to execute 🫤 at least for now, but we're gonna try to reduce the number to something more reasonable
d
Isolated runs give you much more memory, so if running 15 steps in parallel is what's needed then that may be the way to go
e
Disabling non-isolated runs helped. Thanks!