Binoy Shah

03/03/2023, 2:32 PM
Hi, I need to process a large number of data files in S3. Although these files are scattered across different paths, they can still be grouped in various ways. In the past I achieved this with Celery, but it felt like going in blind, with no observability. Since Dagster will give me good observability on the smaller chunks too, I am looking for recommendations on the approach. Foremost, I am looking to collect the files every hour, then combine *[bulk via JSONL]* them on a per-day basis, and then process each day’s bulked files. I need to be able to check on each hour’s execution and finally validate counts on the bulk file. Would assets + partitions be more suitable for such work, or would it be better with ops? I had asked this question here before, but it got buried under other messages
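
A minimal sketch of the assets + partitions shape being asked about, for the hourly collection step; the bucket name, prefix layout, and start date are placeholders, and the boto3 call is only illustrative:

```python
import boto3
from dagster import HourlyPartitionsDefinition, asset

hourly = HourlyPartitionsDefinition(start_date="2023-01-01-00:00")

@asset(partitions_def=hourly)
def hourly_files(context):
    """Collect the S3 keys that landed during this hour's partition."""
    # "my-bucket" and the raw/<partition-key>/ prefix are hypothetical;
    # adapt to however the files are actually laid out in S3.
    s3 = boto3.client("s3")
    prefix = f"raw/{context.partition_key}/"
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]
```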

Tobias Pankrath

03/03/2023, 2:41 PM
I am trying to do something similar to this with a DynamicPartition. I think you currently cannot have custom partition dependencies, but as long as you have 1-to-1 dependencies it should work.
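
For reference, a minimal sketch of the 1-to-1 case described here, assuming a Dagster version with named dynamic partitions: two assets share the same DynamicPartitionsDefinition, so each downstream partition depends on the upstream partition with the same key. The partition-set and asset names are placeholders.

```python
from dagster import DynamicPartitionsDefinition, asset

# Hypothetical partition set; keys would be registered elsewhere, e.g. from a
# sensor via instance.add_dynamic_partitions("file_groups", [...]).
file_groups = DynamicPartitionsDefinition(name="file_groups")

@asset(partitions_def=file_groups)
def raw_group(context):
    context.log.info(f"loading group {context.partition_key}")

@asset(partitions_def=file_groups)
def processed_group(context, raw_group):
    # Sharing the same partitions_def gives the default identity (1-to-1)
    # partition dependency between upstream and downstream.
    context.log.info(f"processing group {context.partition_key}")
```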

Binoy Shah

03/03/2023, 2:44 PM
By 1-to-1 dependencies, you mean 1-to-1 dependencies between upstream and downstream assets?

Tobias Pankrath

03/03/2023, 2:51 PM
yes
Although you mentioned hourly/daily partitions, I think that's already implemented in Dagster itself.

Binoy Shah

03/03/2023, 2:53 PM
Yeah, I see that it has hourly and daily partitions, but does it appear as a fan-in, as in 24 hourly upstream assets to 1 daily downstream asset? I am still building the strategy to parse and process my data files
Seems to be the case
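
A minimal sketch of that fan-in, assuming the built-in time-window partitions and the default IO manager, which loads a dependency on multiple upstream partitions as a dict keyed by partition key; asset names, start dates, and record contents are placeholders:

```python
from dagster import DailyPartitionsDefinition, HourlyPartitionsDefinition, asset

hourly = HourlyPartitionsDefinition(start_date="2023-01-01-00:00")
daily = DailyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=hourly)
def hourly_records(context):
    # Placeholder body: in practice, parse this hour's S3 files into records.
    context.log.info(f"collecting records for hour {context.partition_key}")
    return []

@asset(partitions_def=daily)
def daily_bulk(context, hourly_records):
    # By default, one daily partition here maps onto the 24 hourly partitions
    # it covers, and the built-in IO manager passes them in as a dict of
    # hourly partition key -> that hour's records.
    combined = [rec for hour in sorted(hourly_records) for rec in hourly_records[hour]]
    context.log.info(
        f"{context.partition_key}: {len(hourly_records)} hours, {len(combined)} records"
    )
    return combined
```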

Binoy Shah

03/03/2023, 2:55 PM
Oh cool thank you, let me check that out