Binoy Shah

03/03/2023, 2:32 PM
Hi, I need to process a large number of data files in S3. Although these files are scattered across different paths, they can still be grouped in various ways. In the past I achieved this with Celery, but it felt like going in blind, with no observability. Since Dagster will give me good observability on the smaller chunks too, I am looking for recommendations on the approach. Foremost, I am looking to collect the files every hour, then combine *[bulk via JSONL]* them on a per-day basis, and then process each day’s bulked files. I need to be able to check on each hour’s execution and finally validate counts on the bulk file. Would assets + partitions be more suitable for such work, or would it be better with ops? I had asked this question here before, but it got buried under other messages
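
A minimal sketch of the assets + partitions shape being asked about, for the hourly collection step; the bucket name, prefix layout, and start date are placeholders, and the boto3 call is only illustrative:

```python
import boto3
from dagster import HourlyPartitionsDefinition, asset

hourly = HourlyPartitionsDefinition(start_date="2023-01-01-00:00")

@asset(partitions_def=hourly)
def hourly_files(context):
    """Collect the S3 keys that landed during this hour's partition."""
    # "my-bucket" and the raw/<partition-key>/ prefix are hypothetical;
    # adapt to however the files are actually laid out in S3.
    s3 = boto3.client("s3")
    prefix = f"raw/{context.partition_key}/"
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]
```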

Tobias Pankrath

03/03/2023, 2:41 PM
I am trying to do something similar to this with a DynamicPartition. I think you currently cannot have custom partition dependencies, but as long as you have 1-to-1 dependencies it should work.
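
For reference, a minimal sketch of the 1-to-1 case described here, assuming a Dagster version with named dynamic partitions: two assets share the same DynamicPartitionsDefinition, so each downstream partition depends on the upstream partition with the same key. The partition-set and asset names are placeholders.

```python
from dagster import DynamicPartitionsDefinition, asset

# Hypothetical partition set; keys would be registered elsewhere, e.g. from a
# sensor via instance.add_dynamic_partitions("file_groups", [...]).
file_groups = DynamicPartitionsDefinition(name="file_groups")

@asset(partitions_def=file_groups)
def raw_group(context):
    context.log.info(f"loading group {context.partition_key}")

@asset(partitions_def=file_groups)
def processed_group(context, raw_group):
    # Sharing the same partitions_def gives the default identity (1-to-1)
    # partition dependency between upstream and downstream.
    context.log.info(f"processing group {context.partition_key}")
```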

Binoy Shah

03/03/2023, 2:44 PM
By 1-to-1 dependencies, you mean 1-to-1 dependencies between upstream and downstream assets?

Tobias Pankrath

03/03/2023, 2:51 PM
yes
Although you mentioned hourly/daily partitions, I think that's already implemented in Dagster itself.

Binoy Shah

03/03/2023, 2:53 PM
Yeah, I see that it has hourly and daily partitions, but does it appear as a fan-in, as in 24 hourly upstream assets to 1 daily downstream asset? I am still building the strategy to parse and process my data files
Seems to be the case
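
A minimal sketch of that fan-in, assuming the built-in time-window partitions and the default IO manager, which loads a dependency on multiple upstream partitions as a dict keyed by partition key; asset names, start dates, and record contents are placeholders:

```python
from dagster import DailyPartitionsDefinition, HourlyPartitionsDefinition, asset

hourly = HourlyPartitionsDefinition(start_date="2023-01-01-00:00")
daily = DailyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=hourly)
def hourly_records(context):
    # Placeholder body: in practice, parse this hour's S3 files into records.
    context.log.info(f"collecting records for hour {context.partition_key}")
    return []

@asset(partitions_def=daily)
def daily_bulk(context, hourly_records):
    # By default, one daily partition here maps onto the 24 hourly partitions
    # it covers, and the built-in IO manager passes them in as a dict of
    # hourly partition key -> that hour's records.
    combined = [rec for hour in sorted(hourly_records) for rec in hourly_records[hour]]
    context.log.info(
        f"{context.partition_key}: {len(hourly_records)} hours, {len(combined)} records"
    )
    return combined
```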

Binoy Shah

03/03/2023, 2:55 PM
Oh cool thank you, let me check that out