# dagster-feedback
g
Hi. I think Dagster is really missing something like a `MultithreadedExecutor`, which would allow executing steps in parallel within the same process. Currently, when running on Kubernetes using the `MultiprocessExecutor`, there's a massive cold-start delay for each subprocess being started because, from what I understand, it has to reload the entire project and assets, which can lead to extremely long jobs even when the tasks are very lightweight.
c
I asked a related question in #dagster-support yesterday: https://dagster.slack.com/archives/C01U954MEER/p1678979490771579 I'm not quite sure what the cold start is about either, but I think it's a bug.
g
My understanding is that it's not a bug; it's how executors are designed. For each of the assets in your graph, meaning for each subprocess, your entire project (all of the assets) is loaded. Hence the proposal of introducing a multithreaded executor instead, where everything is loaded once.
👀 1
a
The tracking issue is https://github.com/dagster-io/dagster/issues/4041. One mitigation for heavy import-time cost that's possible in the interim is to use the `forkserver` start method: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
👍 1
g
I tried that option earlier but saw only a very marginal speed gain (a handful of seconds).
d
I saw mentioned here that we might need to add `preload_modules` in the config if the default doesn't load the necessary modules as expected: https://github.com/dagster-io/dagster/discussions/7338 Is that in the `forkserver` start method config? How can we determine which modules were loaded correctly via `forkserver` and which we need to list explicitly?
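One way to see which modules dominate startup is to time imports directly with the standard library; a minimal sketch (here `json` is just a stand-in for a heavy project dependency, and `timed_import` is a hypothetical helper, not a Dagster API). Python's `-X importtime` interpreter flag gives a more complete per-module breakdown.

```python
import importlib
import sys
import time

def timed_import(modname):
    # Measure the wall-clock cost of importing a module. Cached modules
    # return almost instantly, so run this in a fresh process.
    start = time.perf_counter()
    importlib.import_module(modname)
    return time.perf_counter() - start

# `json` is a stand-in for one of your project's heavy dependencies.
elapsed = timed_import("json")

# sys.modules lists everything the current process has loaded, so
# logging it inside a forkserver child shows what was actually preloaded.
loaded = sorted(m for m in sys.modules if m.startswith("json"))
print(f"import json took {elapsed:.4f}s; loaded: {loaded}")
```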
g
Just did a test with:

```yaml
execution:
  config:
    multiprocess:
      start_method:
        forkserver:
          preload_modules:
          - dagster
          - dagster_dbt
          - dagster_k8s
          - dagster_postgres
          - ...
```

I see no difference...
Actually, that's not true. I did some more tests, and I do see up to a 20-25% reduction in the delay between steps when listing modules. But in absolute terms, the cold start between steps is still annoyingly long.
a
The preload module resolution is here: https://github.com/dagster-io/dagster/blame/master/python_modules/dagster/dagster/_core/executor/multiprocess.py#L154-L171 The best-case scenario is to preload the module that defines all your definitions, assuming the common pattern of defining them at import time (and that dependent libraries are fork-safe). In the end, I agree that an in-process async/threaded executor is really what is needed to make lightweight concurrency work better.
h
Out of interest: I know you can use async ops, but how are they handled in Dagster? I assume they are just passed individually to `asyncio.run`.
a
> just passed individually to `asyncio.run`

correct
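That behavior is easy to demonstrate with plain `asyncio`: calling `asyncio.run` per coroutine serializes the work, while scheduling the coroutines on one event loop overlaps the waits. A minimal sketch, with 0.05 s sleeps standing in for I/O-bound op bodies:

```python
import asyncio
import time

async def op(name, delay):
    # Stand-in for an I/O-bound async op body.
    await asyncio.sleep(delay)
    return name

# Each op handed individually to asyncio.run: a fresh event loop per
# call, so the sleeps cannot overlap.
start = time.perf_counter()
sequential_results = [asyncio.run(op(n, 0.05)) for n in ("a", "b", "c")]
sequential = time.perf_counter() - start

# One loop running all coroutines together: the sleeps overlap.
async def run_all():
    return await asyncio.gather(op("a", 0.05), op("b", 0.05), op("c", 0.05))

start = time.perf_counter()
concurrent_results = asyncio.run(run_all())
concurrent = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

This is the gap an in-process async/threaded executor would close: one loop for all lightweight steps instead of one loop (or one subprocess) per step.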
a
@alex I wasn't able to understand how to make use of preload definitions. Our code location code is pretty full-on, and it takes a lot of time at the start of each process. Can I make use of what you're referring to here? For reference: https://dagster.slack.com/archives/C02LJ7G0LAZ/p1679446117427139?thread_ts=1679377131.710799&cid=C02LJ7G0LAZ
a
Module preloading is specific to the multiprocess executor.
a
We had to switch to `in_process`, as multiprocess was taking too long for each step to execute. If not preloading, would you have any other suggestion for the issue I raised in my comment?
c
On my side, I had some higher-priority work, but I'm back on this and going to try out the `forkserver`. Is this the correct YAML config? I'm not sure about the `null`:
```yaml
execution:
  config:
    multiprocess:
      max_concurrent: 15
      start_method:
        forkserver: null
```
> The best case scenario is to pre-load the module that defines all your definitions

What do you mean by definitions? And do you have a recommendation for knowing whether or not dependent libraries are fork-safe? @alex
a
> would you have any other suggestion for the issue I have raised in my comment?

You probably need to consider optimizations specific to how you are building up your code location.

> forkserver: null

Not sure about `null` working; `forkserver: {}` should work.

> what do you mean by definitions?

`@op`, `@asset`, etc. https://docs.dagster.io/concepts/code-locations

> know whether or not dependent libraries are fork safe?

No special advice; try it out and/or search the web.
👌 1
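Putting that correction together with the earlier snippet, the working config would presumably look like this (a sketch; `max_concurrent: 15` is carried over from the original attempt):

```yaml
execution:
  config:
    multiprocess:
      max_concurrent: 15
      start_method:
        forkserver: {}
```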
c
`forkserver` gave us a 30-40% faster run. Pretty nice. Multithreading would be nicer, but for now this will do for our use case.
j
Hi! 👋 I face a similar problem where, at the beginning of each execution, the whole project is reloaded, and this is quite costly (40 seconds of waiting). My project has about 50 repositories (1 repository per client) of variable size (between 5 and 70 jobs, and between 20 and 1000 assets). I am aware that the ideal solution would be to use several workspaces to get much shorter reloads. In reality, I don't really see any reason to do that except to reduce the load, especially since I would have to use a more powerful cluster in terms of CPU and memory resources. @alex Do you know if, in the future, it is planned that this reloading will not be done anymore, or that only a smaller part will be reloaded (just the repository and not the whole workspace, for example)?
a
> Do you know if in the future it is planned that this reloading will not be done anymore or that the reloading will be for a smaller part (just the repository and not the whole workspace for example)?

Directionally, we are moving away from `repository` toward `Definitions`, which are limited to one per code location (https://docs.dagster.io/concepts/code-locations, https://github.com/dagster-io/dagster/discussions/10772), so I don't expect much change in `repository`.

> since I would have to use a more powerful cluster in terms of CPU and memory resources

I wouldn't expect much total CPU cost shift going from one to N code servers. The fixed memory overhead will accumulate across N servers, but I would still be surprised if this was meaningful relative to any data computation happening in ops/assets (unless none of the assets/ops do meaningful local compute).