# dagster-plus
e
Hey, we're getting "Agent e4cc09c5 detected that run worker failed: The code location server that was hosting this run is not responding. It may have crashed or been OOM killed." when trying to materialize assets in a job by running said job manually. However, if we try to materialize asset by asset, all is well. How can we debug this?
Other times, when running the same job, we're getting "[DagsterApiServer] Run execution process for af60a491-8955-45fc-86c9-3654677b35e9 unexpectedly exited with exit code -9." followed by "This run has been marked as failed from outside the execution context.".
d
Hi Emilja - it sounds from this error message like the job needs more memory - you may want to turn off non-isolated runs which will make the run take a bit longer to start up, but give it a lot more memory. There are instructions here for turning it off for a particular run from the launchpad or for all runs: https://docs.dagster.io/dagster-cloud/deployment/serverless#run-isolation
e
hmm, the assets in question aren't that heavy, though - just dbt (which takes up to 5 mins) and some external API (Airbyte), and when I select all the assets manually from the asset window and click materialize, all is well.
d
Is that something you’ve reproduced multiple times? it’s consistently crashing when you press materialize all, but never crashing when you select all the assets manually and materialize them all in a single run?
e
for selecting assets under a job, yes, tried several times. for materializing all assets via the Asset tab, only tried once. Waiting for a few of our own internal jobs to finish, then will try again.
I'm not sure on the best way to get help for this 😅 I could send a link to the job in Dagster Cloud for reference. Should I do it here or using the chat in Dagster Cloud?
d
Posting or DMing a link to the run in cloud here is fine. I'm pretty confident that turning off non-isolated runs when you launch the run will fix this; we can dig into why it's using more memory than you expect, though.
e
alright, will see about turning off non-isolated runs. As for the memory usage, might be good to know in the future, but if the other solution works then it's probably not as urgent.
d
We did some digging and I think we have a lead on why this is using so much memory - we think it's running more steps in parallel at once by default than it should be. We'll keep you posted when we have a fix out to make the default more memory-safe.
e
Thanks!
d
In the meantime, you could launch it with a config like this:
```yaml
execution:
  config:
    multiprocess:
      max_concurrent: 2
```
(Right now I think it is trying to do them all in parallel at once - even if each one individually doesn't use large amounts of memory, running ~15 of them at once could still cause memory problems)
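To illustrate why capping concurrency caps peak memory, here is a minimal stdlib sketch (not Dagster's actual executor - the step names and worker pool are hypothetical stand-ins): bounding the number of workers bounds how many steps hold memory at the same time, the same effect `max_concurrent: 2` has on the multiprocess executor.

```python
# Illustrative sketch only: ThreadPoolExecutor's max_workers plays the
# role of max_concurrent. At most two "steps" run (and hold memory) at
# once, instead of all ~15 in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_step(step_name: str) -> str:
    # Stand-in for a real step (a dbt model, an Airbyte sync, ...).
    return f"{step_name}: done"

steps = [f"step_{i}" for i in range(15)]

# max_workers=2 mirrors max_concurrent: 2 from the run config above.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_step, steps))

print(results[0])
print(len(results))
```

Raising `max_workers` trades memory headroom for wall-clock time, which is the same trade-off being discussed for `max_concurrent`.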
e
2 concurrent steps will take too long for the job to execute 🫤 at least for now, but we're gonna try to reduce the number to something more reasonable
d
Isolated runs give you much more memory, so if running 15 steps in parallel is what's needed then that may be the way to go
e
Disabling non-isolated runs helped. Thanks!