#announcements

Cris

07/02/2020, 3:53 PM
Hi! I was wondering if there's a way to limit the maximum number of processes running solids in Dagster when using the multiprocess executor. I noticed that when running multiple pipelines on schedules, you can end up with many more processes than CPUs. This is a particular issue for us because we end up overloading our DB with multiple heavy queries that wait forever and never finish.

alex

07/02/2020, 3:59 PM
Nothing in the multiprocess executor yet, though maybe in the future. For now, using celery is likely your best bet. You can control the number of workers, and even set up different queues and tag solids to submit themselves into those queues.
Oh wait, to clarify: there is no global limit in the multiprocess executor, but you can set the maximum number of simultaneous subprocesses per run.
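For example, the per-run cap is set in the run config for the multiprocess executor. A sketch for a mid-2020 Dagster version (the exact config layout may differ in your release):

```yaml
execution:
  multiprocess:
    config:
      max_concurrent: 4   # at most 4 solid subprocesses at once, for this run only
```

Note this only bounds a single run; two scheduled runs overlapping can still use up to 8 processes between them.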
Another technique, if you are executing all these pipelines on the same box, is to build a resource that uses the local filesystem to coordinate

Cris

07/02/2020, 4:06 PM
Yes, setting it per pipeline does not work too well, because if one scheduled run of the pipeline is in progress and another starts, the new run spawns its own processes anyway
Could you elaborate on the resource? I believe one huge issue we have is that each time the DB resource is created, it opens a new connection. It would be nice to have some kind of pool. However, it seems that when using the multiprocess executor, resources are not reused; a new instance is created each time

alex

07/02/2020, 4:09 PM
in the multiprocess executor each step happens in its own isolated process

Cris

07/02/2020, 4:11 PM
I see. How would you suggest going about a shared resource that could limit the connections created to the DB?

alex

07/02/2020, 4:11 PM
so the idea behind using the local filesystem to coordinate is to use something like a file lock to ensure that only one (or up to N) processes can connect to the DB at the same time
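A minimal sketch of the single-connection version of that idea, using only the Python standard library (`fcntl` advisory locks, so POSIX systems only; the lock path is an arbitrary placeholder):

```python
import fcntl
import os


class FileLock:
    """Cross-process mutual exclusion backed by an advisory file lock."""

    def __init__(self, path):
        self.path = path
        self._fd = None

    def __enter__(self):
        self._fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        # blocks until no other process holds the exclusive lock
        fcntl.flock(self._fd, fcntl.LOCK_EX)
        return self

    def __exit__(self, *exc):
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
```

A resource could then open its DB connection inside `with FileLock("/tmp/db.lock"):` so only one step process talks to the database at a time; allowing N at once needs a counting primitive such as a named semaphore instead of a plain lock.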

Cris

07/02/2020, 4:11 PM
also, is there a way to prevent a scheduled pipeline run from starting if the previous one has not finished?

alex

07/02/2020, 4:13 PM
there is a should_execute function you can override on the schedule, and in there you can query the instance to see if there is a run still going. A related example can be seen here: https://github.com/dagster-io/dagster/blob/master/examples/legacy_examples/dagster_examples/schedules.py

Cris

07/02/2020, 4:13 PM
I see, would you have an example at hand? I follow what it should do, but I have no idea how it might be implemented
The message did not send earlier, but I was referring to the file lock
Also, thanks for the should_execute example

Danny

07/02/2020, 4:25 PM

alex

07/02/2020, 4:27 PM
this library has a cross-process / named Semaphore if you want to allow N simultaneous things to go: http://semanchuk.com/philip/posix_ipc/. Otherwise there are lots of file lock libraries; portalocker is one I saw repeatedly
```python
from dagster import resource
from posix_ipc import O_CREAT, Semaphore

MAX_CONCURRENT_CONNECTIONS = 4


@resource
def my_db(init_context):
    # POSIX semaphore names must start with '/' and contain no other slashes;
    # O_CREAT creates the semaphore with the initial value if it doesn't exist
    with Semaphore("/my-db-sem", flags=O_CREAT, initial_value=MAX_CONCURRENT_CONNECTIONS):
        yield make_database_connection()
```

Cris

07/02/2020, 4:49 PM
Thanks guys! I'll take a look. This semaphore, I did not see it when searching on Google haha