# ask-community
Hi Team, I am currently using dagster's API to process a graph with ~15 ops. I am observing a significant performance drop when switching to a persistent dagster instance with SQLite/Postgres-based run & event log storages. The execution time increases to 4 seconds for a run that was earlier completing much faster. The executor is still the in-process one, so I guess it's the DB writes that are causing this overhead. Is that expected? Below are screenshots for execution times.
@Sandeep Aggarwal - based on your description of the situation, it does sound like the event log storage is a likely source of the slowdown. @daniel - do you know if this is typical?
You could use a profiler like `py-spy` to determine exactly where the slowdown is.
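For readers following along, here is a rough, self-contained sketch of the same profiling idea using only the standard library's `cProfile` (the `run_job` function below is a made-up stand-in for the real pipeline entry point, not Dagster code):

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for the real pipeline entry point.
def run_job():
    total = 0
    for i in range(200_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
run_job()
profiler.disable()

# Print the five most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

`py-spy` itself is a sampling profiler run from outside the process, e.g. `py-spy record --format speedscope -o profile.ss.json -- python run_job.py`, which is how the speedscope `.ss.json` files attached later in this thread are produced.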
The details here will have a big impact. Is it sqlite or postgres? If postgres, where is the DB running?
Thanks for your reply @sandy @alex. I am currently trying different configurations on my local system. So far I have been running the whole pipeline using the default executor, but would like to switch to dask as we get ready to deploy to prod. I tried profiling using py-spy as you suggested, and the speedscope-format profile data is attached. I profiled all the configurations:
1. Ephemeral - ephemeral.ss.json
2. Persistent with SQLite - persistent.sl.ss.json
3. Persistent with Postgres - persistent.pg.ss.json
A quick look shows one call taking significantly more time with persistent storage, 800ms - 1200ms, compared to in-memory. You might have more insights. I am attaching the files for your reference. Can you please take a look?
Thanks for taking the time to capture and send over these samples. These results are not terribly surprising: we have not optimized the system for very small/fast `op`s. The common usage patterns we observe typically have runtimes on the order of minutes or even hours, so these hundreds-of-milliseconds overheads per event have not yet been a focus.
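To make the per-event overhead concrete, here is a toy, self-contained sketch (not Dagster internals; the table schema and loop are invented for illustration) of why committing one small row per event to SQLite costs time per event rather than per byte:

```python
import os
import sqlite3
import tempfile
import time

# Toy event log: one tiny row committed per "event", roughly what a
# persistent run/event log does. Each commit forces a disk sync, so
# the cost scales with the number of events, not their size.
path = os.path.join(tempfile.mkdtemp(), "events.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE event_log (body TEXT)")

start = time.perf_counter()
for i in range(100):
    conn.execute("INSERT INTO event_log VALUES (?)", (f"event-{i}",))
    conn.commit()  # one transaction per event
elapsed = time.perf_counter() - start

count = conn.execute("SELECT COUNT(*) FROM event_log").fetchone()[0]
conn.close()
print(f"{count} events committed in {elapsed:.3f}s")
```

Even a few milliseconds per committed write, multiplied across the many events a run emits, lands in the range discussed in this thread; a Postgres round trip over a network adds comparable per-event latency.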
Sure @alex, that makes sense. My use-case involves running on-demand workflows with minimum latency. So far I have been able to achieve that using Dagster's in-process executor. The only thing I am lacking is parallel execution of tasks, which I hoped to achieve by integrating the Dask executor. However, it looks like in order to use the dask executor, I need to trigger workflows using Dagster's GraphQL API, and that is the reason I was trying to set up a persistent Dagster instance. Is that right, or am I missing something here? Basically, what I want to know is whether it is possible to use multiprocess_executor or dask_executor with an ephemeral Dagster instance?
> is it possible to use multiprocess_executor or dask_executor with an ephemeral Dagster instance?
Not easily, and I expect you would hit further latency issues from the per-process overhead. I believe https://github.com/dagster-io/dagster/issues/4041 is what you would need for your use case.
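The per-process overhead mentioned here is easy to observe in isolation. A minimal sketch, independent of Dagster, that just times a fresh Python interpreter starting and exiting:

```python
import subprocess
import sys
import time

# Each op run by a multiprocess-style executor pays at least a
# process-startup cost; timing a bare interpreter launch gives a
# lower bound for that overhead.
start = time.perf_counter()
subprocess.run([sys.executable, "-c", "pass"], check=True)
startup_cost = time.perf_counter() - start
print(f"interpreter startup: {startup_cost * 1000:.1f} ms")
```

On typical machines this alone is tens of milliseconds before any imports run, so for a graph of ~15 short ops, per-process launches can dominate end-to-end latency.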
Exactly, this is what I am looking for. Gave it my thumbs up. Thanks.
If you want to share some context on that issue about your use case and what value you are getting out of dagster there, it would be appreciated.