# ask-community
f
I'm trying to process, with a sensor and one op per file, a folder that contains about 50k files, but Dagster seems to stop at 195. Each file ranges from a few KB to 1 MB, I'd say averaging about 20 KB. I also got this error logged in the console after the list of requestKeys: https://sqlalche.me/e/14/e3q8
o
hi @Francesco Piccoli! what computation is happening in the sensor and what computation is happening in the op? sensors have a timeout of 60 seconds to complete their execution, so if any heavy computation is happening in there, I could imagine things breaking down.
f
Thanks @owen, my sensor just iterates over the folder and builds a run_request for each file it finds.
Should I count the 60 seconds up to the yield of each request, or over all the 10,000s of files it can go through?
o
hi @Francesco Piccoli! here's a bit of information detailing how to break up the work that your sensor does on each iteration: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
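The cursor approach from those docs boils down to: list the folder in a stable order, skip everything at or before the stored cursor, emit one small batch, and persist the new cursor for the next tick. Here's a plain-Python sketch of that bookkeeping (the function name, batch limit, and file names are illustrative, not Dagster APIs; in an actual sensor you'd read `context.cursor` and call `context.update_cursor`):

```python
import os
import tempfile

def files_after_cursor(folder, cursor, limit=25):
    # List files in a stable (sorted) order, drop everything at or
    # before the stored cursor, and return one small batch plus the
    # cursor value to persist for the next sensor tick.
    names = sorted(os.listdir(folder))
    if cursor:
        names = [n for n in names if n > cursor]
    batch = names[:limit]
    new_cursor = batch[-1] if batch else cursor
    return batch, new_cursor

# usage: each call picks up where the previous tick left off
with tempfile.TemporaryDirectory() as d:
    for name in ("a.csv", "b.csv", "c.csv"):
        open(os.path.join(d, name), "w").close()
    first, cur = files_after_cursor(d, cursor=None, limit=2)   # ["a.csv", "b.csv"]
    rest, cur = files_after_cursor(d, cursor=cur, limit=2)     # ["c.csv"]
```

This keeps each sensor evaluation well under the 60-second timeout, since a tick only touches `limit` files instead of all 50k.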
f
Thanks @owen. I also noticed that a simple read-count-write op on a small file can take almost 1s, is this due to some Dagster overhead?
o
hm yeah, if you're using the (default) multiprocess_executor, then each op runs in its own subprocess (which has some overhead). I'd recommend batching the processing of multiple files together into a single op
f
Do you suggest I do it at the sensor level, i.e. batch there and send batched run requests? Is there any alternative where I can preserve the 1-to-1 op-to-document relation? It would make the pipeline easier for me to handle.
o
yeah, the original suggestion would be to create run requests that look something like
RunRequest(run_config={"ops": {"process_op": {"config": {"files_to_process": [...]}}}})
. Wanting to preserve the 1-to-1 relationship is totally fair, but in that case there will be that subprocess initialization overhead for each op, unfortunately.
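Producing that run_config shape for a 50k-file folder could look like the following sketch; `process_op` and `files_to_process` are the placeholder names from the example above, and the batch size of 100 is arbitrary:

```python
def batched_run_configs(files, batch_size=100):
    # Chunk the file list into run_config dicts of the shape shown
    # above, one dict per batch of `batch_size` files. Each dict would
    # then be passed to RunRequest(run_config=...) in the sensor.
    # "process_op" / "files_to_process" are illustrative names.
    for start in range(0, len(files), batch_size):
        yield {
            "ops": {
                "process_op": {
                    "config": {
                        "files_to_process": files[start:start + batch_size]
                    }
                }
            }
        }

# usage: 250 files in batches of 100 -> 3 run configs
configs = list(batched_run_configs([f"file_{i}.csv" for i in range(250)]))
```

This trades the 1-to-1 op-per-file mapping for far fewer subprocess launches: one op invocation now amortizes its startup cost over a whole batch.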
f
Yes, having the 1-to-1 would make many things simpler; I need to figure out whether the cost is worth it or whether it's better to go with the batching solution (thanks for that). Am I right in saying that processing many small parallel jobs in a pipeline is not a use case Dagster targets at the moment?
o
I think it'd be fair to say that dagster isn't designed around highly parallel low-latency (~a few seconds) operations. It can handle highly parallel operations, but there are latency limitations at the moment, where the overhead of launching a bunch of tasks can add up if each individual task is quite fast.
so 1000 tasks at once where each task takes a minute to complete: yes; 1000 tasks at once where each task takes a second to complete: not ideal
f
Apart from latency, let's say we're in the case of 10k tasks that each take a minute: is the UI experience for handling those runs considered, or is Dagster's main use case usually in the low 100s of tasks?
I'm asking because I wonder whether, even if I'm willing to pay the overhead price, I'd then have difficulty handling that number of tasks from a user-experience perspective.
o
100s of tasks is definitely more of Dagster's main focus. It can handle running 10k or so concurrent tasks, but at that scale you'd need to beef up your instance database and likely use something like the k8s executor to handle that amount of compute (as you're probably beyond the realm of a single machine's capabilities regardless of the orchestration framework). In terms of the UI experience, that'd also get somewhat tough: views would likely be slow to load and aren't particularly geared towards visualizing info at that scale
f
Thanks @owen, that's really useful information. Do you know if there are other orchestrators that try to address exactly this use case of lots of small jobs? Or does Dagster plan to address this use case in the future?