# ask-community
f
I'm trying to process, with a sensor and one op per file, a folder that contains about 50k files, but Dagster seems to stop at 195. Each file ranges from a few KB to 1 MB, I'd say averaging about 20 KB. I also got this error logged in the console after the list of requestKeys: https://sqlalche.me/e/14/e3q8
o
hi @Francesco Piccoli! what computation is happening in the sensor and what computation is happening in the op? sensors have a timeout of 60 seconds to complete their execution, so if any heavy computation is happening in there, I could imagine things breaking down.
f
Thanks @owen, my sensor just iterates over the folder and builds a run_request for each file it finds.
Should I count the 60 seconds up to the yield of each request, or over all the 10,000s of files it can go through?
o
hi @Francesco Piccoli! here's a bit of information detailing how to break up the work that your sensor does on each iteration: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
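The cursor approach from those docs boils down to: list the folder in a stable order, skip everything at or before the stored cursor, emit one small batch, and persist the new cursor for the next tick. Here's a plain-Python sketch of that bookkeeping (the function name, batch limit, and file names are illustrative, not Dagster APIs; in an actual sensor you'd read `context.cursor` and call `context.update_cursor`):

```python
import os
import tempfile

def files_after_cursor(folder, cursor, limit=25):
    # List files in a stable (sorted) order, drop everything at or
    # before the stored cursor, and return one small batch plus the
    # cursor value to persist for the next sensor tick.
    names = sorted(os.listdir(folder))
    if cursor:
        names = [n for n in names if n > cursor]
    batch = names[:limit]
    new_cursor = batch[-1] if batch else cursor
    return batch, new_cursor

# usage: each call picks up where the previous tick left off
with tempfile.TemporaryDirectory() as d:
    for name in ("a.csv", "b.csv", "c.csv"):
        open(os.path.join(d, name), "w").close()
    first, cur = files_after_cursor(d, cursor=None, limit=2)   # ["a.csv", "b.csv"]
    rest, cur = files_after_cursor(d, cursor=cur, limit=2)     # ["c.csv"]
```

This keeps each sensor evaluation well under the 60-second timeout, since a tick only touches `limit` files instead of all 50k.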
f
Thanks @owen. I also noticed that a simple read-count-write op on a small file can take almost 1s, is this due to some Dagster overhead?
o
hm yeah, if you're using the (default) multiprocess_executor, then each op runs in its own subprocess (which has some overhead). I'd recommend batching the processing of multiple files together into a single op
f
Do you suggest I do it at the sensor level, i.e. batch there and send batched run requests? Is there any alternative where I can preserve the 1-to-1 op-to-document relation? It would make the pipeline easier for me to handle.
o
yeah, the original suggestion would be to create run requests that look something like
RunRequest(run_config={"ops": {"process_op": {"config": {"files_to_process": [...]}}}})
. Wanting to preserve the 1-to-1 relationship is totally fair, but in that case there will be that subprocess initialization overhead for each op, unfortunately.
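Producing that run_config shape for a 50k-file folder could look like the following sketch; `process_op` and `files_to_process` are the placeholder names from the example above, and the batch size of 100 is arbitrary:

```python
def batched_run_configs(files, batch_size=100):
    # Chunk the file list into run_config dicts of the shape shown
    # above, one dict per batch of `batch_size` files. Each dict would
    # then be passed to RunRequest(run_config=...) in the sensor.
    # "process_op" / "files_to_process" are illustrative names.
    for start in range(0, len(files), batch_size):
        yield {
            "ops": {
                "process_op": {
                    "config": {
                        "files_to_process": files[start:start + batch_size]
                    }
                }
            }
        }

# usage: 250 files in batches of 100 -> 3 run configs
configs = list(batched_run_configs([f"file_{i}.csv" for i in range(250)]))
```

This trades the 1-to-1 op-per-file mapping for far fewer subprocess launches: one op invocation now amortizes its startup cost over a whole batch.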
f
Yes, having the 1-to-1 would make many things simpler; I need to figure out whether the cost is worth it or whether it's better to go with the batching solution (thanks for that). Am I right in saying that processing many small parallel jobs in a pipeline is not a use case Dagster targets at the moment?
o
I think it'd be fair to say that dagster isn't designed around highly parallel low-latency (~a few seconds) operations. It can handle highly parallel operations, but there are latency limitations at the moment, where the overhead of launching a bunch of tasks can add up if each individual task is quite fast.
so 1000 tasks at once where each task takes a minute to complete: yes; 1000 tasks at once where each task takes a second to complete: not ideal
f
Apart from latency, let's say we're in the case of 10k tasks that each take a minute: is the UI experience for handling those runs considered, or is Dagster's main use case usually in the low 100s of tasks?
I'm asking because I wonder whether, even if I'm willing to pay the overhead price, I'd then have difficulty handling that number of tasks from a user-experience perspective.
o
100s of tasks is definitely more of Dagster's main focus. It can handle running 10k or so concurrent tasks, but at that scale you'd need to beef up your instance database and likely use something like the k8s executor to handle that amount of compute (as you're probably beyond the realm of a single machine's capabilities regardless of the orchestration framework). In terms of the UI experience, that'd also get somewhat tough: views would likely be slow to load and aren't particularly geared towards visualizing info at that scale
f
Thanks @owen, that's really useful information. Do you know if there are other orchestrators that try to address exactly this use case of lots of small jobs? Or does Dagster plan to address this use case in the future?