#announcements

Cris

07/02/2020, 3:53 PM
Hi! I was wondering if there's a way to limit the maximum number of processes running solids in Dagster when using the multiprocess executor. I noticed that when running multiple pipelines on schedules, you can end up with many more processes than CPUs. This is a particular issue for us because we end up overloading our DB with multiple heavy queries that wait forever and never finish.

alex

07/02/2020, 3:59 PM
Nothing in the multiprocess executor yet, though maybe in the future. For now, using celery is likely your best bet. You can control the number of workers, and even set up different queues and tag solids to submit themselves into those queues.
Oh wait, to clarify: there is no global limit in the multiprocess executor, but you can set the maximum number of simultaneous subprocesses per run.
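For example, the per-run cap is set in the run config for the multiprocess executor. A sketch for a mid-2020 Dagster version (the exact config layout may differ in your release):

```yaml
execution:
  multiprocess:
    config:
      max_concurrent: 4   # at most 4 solid subprocesses at once, for this run only
```

Note this only bounds a single run; two scheduled runs overlapping can still use up to 8 processes between them.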
Another technique, if you are executing all these pipelines on the same box, is to build a resource that uses the local filesystem to coordinate

Cris

07/02/2020, 4:06 PM
Yes, setting it per pipeline does not work too well, because if one scheduled run of the pipeline is in progress and another starts, the new run spawns its own processes anyway
Could you elaborate on the resource? I believe one huge issue we have is that each time the DB resource is created, it opens a new connection. It would be nice to have some kind of pool. However, it seems that when using the multiprocess executor, resources are not reused; a new instance is created each time

alex

07/02/2020, 4:09 PM
in the multiprocess executor each step happens in its own isolated process

Cris

07/02/2020, 4:11 PM
I see. How would you suggest going about a shared resource that could limit the connections created to the DB?

alex

07/02/2020, 4:11 PM
so the idea behind using the local filesystem to coordinate is to use something like a file lock to ensure that only one (or up to N) processes can connect to the DB at the same time
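A minimal sketch of the single-connection version of that idea, using only the Python standard library (`fcntl` advisory locks, so POSIX systems only; the lock path is an arbitrary placeholder):

```python
import fcntl
import os


class FileLock:
    """Cross-process mutual exclusion backed by an advisory file lock."""

    def __init__(self, path):
        self.path = path
        self._fd = None

    def __enter__(self):
        self._fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        # blocks until no other process holds the exclusive lock
        fcntl.flock(self._fd, fcntl.LOCK_EX)
        return self

    def __exit__(self, *exc):
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
```

A resource could then open its DB connection inside `with FileLock("/tmp/db.lock"):` so only one step process talks to the database at a time; allowing N at once needs a counting primitive such as a named semaphore instead of a plain lock.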

Cris

07/02/2020, 4:11 PM
also, is there a way to prevent a scheduled pipeline run from starting if the previous one has not finished?

alex

07/02/2020, 4:13 PM
there is a should_execute function you can override on the schedule, and in there you can query the instance to see if there is a run still going. A related example can be seen here: https://github.com/dagster-io/dagster/blob/master/examples/legacy_examples/dagster_examples/schedules.py

Cris

07/02/2020, 4:13 PM
I see, would you have an example at hand? I follow what it should do, but I have no idea how it might be implemented
The message did not send earlier, but I was referring to the file lock
Also, thanks for the should_execute example

Danny

07/02/2020, 4:25 PM

alex

07/02/2020, 4:27 PM
this library has a cross-process / named Semaphore if you want to allow N simultaneous things to go: http://semanchuk.com/philip/posix_ipc/. Otherwise there are lots of file lock libraries; portalocker is one I saw repeatedly
```python
from dagster import resource
from posix_ipc import O_CREAT, Semaphore

MAX_CONCURRENT_CONNECTIONS = 4


@resource
def my_db(init_context):
    # POSIX semaphore names must start with '/' and contain no other slashes;
    # O_CREAT creates the semaphore with the initial value if it doesn't exist
    with Semaphore("/my-db-sem", flags=O_CREAT, initial_value=MAX_CONCURRENT_CONNECTIONS):
        yield make_database_connection()
```

Cris

07/02/2020, 4:49 PM
Thanks guys! I'll take a look. This semaphore, I did not see it when searching on Google haha