#announcements

borgdrone7

06/19/2020, 11:35 AM
Hi, if I have a list of data in an initial solid and I want to chunk it into groups of X rows and send each chunk to the next solid for processing (for example, process many chunks in parallel with that solid), how can I do it? I created a simple test with get_data and chunk_data solids. The chunk_data solid receives data from get_data and yields X rows at a time. However, I cannot connect the output of chunk_data to a process_chunk solid: Dagster doesn't allow that setup, since each yield seems to be treated as a separate output of the solid. I can't declare multiple output definitions because I don't know until runtime how many there will be, and it would be inconvenient anyway. So I guess I am approaching the whole problem the wrong way. What would be the correct way to do this? I want to speed up processing of 50+ million records that I need to read from txt files and push through several steps. The steps can run in parallel since the records are independent of each other, but I shouldn't process the same record more than once.
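For reference, the chunk-by-X-rows step described here can be written as a plain Python generator (no Dagster API involved; the function name and the 3-row chunk size are just for illustration):

```python
from itertools import islice

def chunk_rows(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# e.g. list(chunk_rows(range(10), 3)) -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

The question in this thread is how to wire each yielded chunk to a downstream solid, which is exactly what Dagster (as of this conversation) does not support.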

sandy

06/19/2020, 5:50 PM
Dagster currently doesn't support this pattern directly; our recommendation right now would be to have a solid launch a job in a parallel compute framework like Spark or Dask. We may add support for this pattern in the future, but we're also wary of the slippery slope of rebuilding Spark/Dask.
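The suggestion above (a single solid hands the heavy lifting to a parallel framework) can be sketched with the standard library standing in for Spark/Dask; process_chunk, process_all, and the pool size are illustrative names, not Dagster API:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Placeholder for the real per-chunk work (parsing, transforming, ...)
    return sum(chunk)

def process_all(chunks, max_workers=4):
    # A single pipeline step fans the chunks out to a worker pool and
    # collects the results, so the fan-out stays invisible to Dagster.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunks))

# e.g. process_all([[1, 2], [3, 4]]) -> [3, 7]
```

In a real deployment the pool would be a Spark or Dask cluster rather than a local executor, but the shape is the same: the dynamic fan-out lives inside one solid.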

borgdrone7

06/19/2020, 6:47 PM
Thank you for your reply. So basically, if I understand correctly, I need to process the whole set of data in one solid, then pass it to another, and so on. However, I see people are using Dagster with Celery and other frameworks to execute solids in parallel, and I actually consider this the biggest advantage of using Dagster over simply chaining regular functions. If there is no way to chunk big data and process it in parallel, is everyone's parallel-processing use case actually doing different things on the same set of data at the same time, rather than making a big input faster by splitting it up? If I, for example, first prepared the data and made 50 files out of one big file, could I run 50 instances of the same pipeline in parallel, each processing a different file? I ask because in our case running different tasks in parallel would not speed things up much; chunking the data and processing each chunk in parallel is what would give us the speed gain.

sandy

06/19/2020, 11:47 PM
Yes - if you split apart your data ahead of time and define a solid that processes one file, your pipeline definition function could then invoke that solid 50 times, one for each split.
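The pre-splitting step described above (one big file into N part files, each then fed to its own invocation of the processing solid) might look like this sketch; the part-file naming and line count are assumptions, and the function is plain Python rather than Dagster API:

```python
import os

def split_file(src_path, out_dir, lines_per_file):
    """Split one large text file into numbered part files of at most
    lines_per_file lines each; return the list of new paths."""
    paths, out, count = [], None, 0
    with open(src_path) as src:
        for line in src:
            if count % lines_per_file == 0:
                # Start a new part file every lines_per_file lines.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part_{len(paths):04d}.txt")
                paths.append(path)
                out = open(path, "w")
            out.write(line)
            count += 1
    if out is not None:
        out.close()
    return paths
```

Each returned path can then be passed to one invocation of the file-processing solid in the pipeline definition, giving the 50-way parallelism discussed here.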

borgdrone7

06/20/2020, 12:17 PM
Thanks sandy!