# ask-community
b
Hi, I need to process a large number of data files in S3. Although the files are scattered across different paths, they can still be grouped in various ways. In the past I achieved this with Celery, but it felt like going in blind with no observability. Since Dagster gives me good observability on smaller chunks too, I am looking for recommendations on an approach. Foremost, I want to collect the files every hour, then combine them *(bulk via JSONL)* on a per-day basis, and then process each day's bulked file. I need to be able to check on each hour's execution and finally validate counts on the bulk file. Would assets + partitions be more suitable for this kind of work, or would it be better with ops? I had asked this question here before, but it got buried under other messages.
t
I am trying to do something similar to this with a DynamicPartition. I think you currently cannot have custom partition dependencies, but as long as you have 1-to-1 dependencies it should work.
b
By 1-to-1 dependencies, you mean 1-to-1 dependencies between upstream and downstream assets?
t
yes
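For reference, a minimal sketch of what that 1-to-1 setup could look like: two assets sharing one DynamicPartitionsDefinition, so each downstream partition depends only on the upstream partition with the same key. The asset and partition names here are illustrative, not from this thread.

```python
from dagster import AssetExecutionContext, DynamicPartitionsDefinition, asset

# Partition keys (e.g. S3 prefixes / file groups) are added at runtime.
file_groups = DynamicPartitionsDefinition(name="file_groups")


@asset(partitions_def=file_groups)
def raw_group(context: AssetExecutionContext) -> list[dict]:
    # context.partition_key is the dynamic partition being materialized.
    context.log.info(f"Collecting files for group {context.partition_key}")
    return []  # placeholder: would read this group's files from S3


@asset(partitions_def=file_groups)
def processed_group(context: AssetExecutionContext, raw_group: list[dict]) -> int:
    # Same partitions_def on both assets gives the implicit identity
    # (1-to-1) mapping: this partition only loads the matching
    # raw_group partition.
    return len(raw_group)
```

Partition keys would typically be registered from a sensor, e.g. via `context.instance.add_dynamic_partitions("file_groups", [...])`.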
Although you mentioned hourly/daily partitions, I think those are already implemented in Dagster itself.
b
Yeah, I see that it has hourly and daily partitions, but does it appear as a fan-in, as in 24 hourly upstream assets to 1 daily downstream asset? I am still building the strategy to parse and process my data files.
t
Seems to be the case.
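For reference, a minimal sketch of that fan-in using the built-in hourly and daily partition definitions. The asset names, the JSONL/count steps, and the dict-style input (which assumes the default filesystem I/O manager) are illustrative, not from this thread; the hourly-to-daily mapping itself is the default between time-window partitioned assets.

```python
from dagster import (
    AssetExecutionContext,
    DailyPartitionsDefinition,
    HourlyPartitionsDefinition,
    asset,
)

hourly = HourlyPartitionsDefinition(start_date="2024-01-01-00:00")
daily = DailyPartitionsDefinition(start_date="2024-01-01")


@asset(partitions_def=hourly)
def hourly_files(context: AssetExecutionContext) -> list[dict]:
    # One run per hour: collect that hour's files from S3 (placeholder).
    context.log.info(f"Collecting S3 files for hour {context.partition_key}")
    return []


@asset(partitions_def=daily)
def daily_bulk(
    context: AssetExecutionContext, hourly_files: dict[str, list[dict]]
) -> int:
    # Upstream is hourly, this asset is daily, so each daily partition
    # depends on the 24 hourly partitions in its window; with the default
    # filesystem I/O manager they are loaded as a dict keyed by hourly
    # partition key.
    record_count = sum(len(records) for records in hourly_files.values())
    context.log.info(f"{record_count} records across {len(hourly_files)} hours")
    # placeholder: write the combined records to one JSONL file, then
    # validate record_count against the expected total for the day
    return record_count
```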
b
Oh cool, thank you, let me check that out.