Hi,
I need to process a large number of data files in S3. Although the files are scattered across different paths, they can still be grouped in various ways.
In the past I have done this with Celery, but it felt like going in blind with no observability. Since Dagster gives me good observability on smaller chunks too, I'm looking for recommendations on an approach.
First and foremost, I am looking to collect the files every hour, then combine them *(bulk via JSONL)* on a per-day basis, and then process each day's bulked file.
I need to be able to check on each hour’s execution and finally validate counts on the bulk file.
Would assets + partitions be more suitable for this kind of work, or would it be better to use ops?
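In case it helps frame the question, here's a rough sketch of the shape I'm imagining if I go the assets + partitions route. This is not a working design: the bucket name, the `raw/<date>/<hour>/` prefix layout, the output key, and the "count > 0" check are just placeholders I made up, and the asset-check part assumes a reasonably recent Dagster version.

```python
# Sketch only: bucket name, prefix layout, and the check's expectation are placeholders.
import boto3
from dagster import (
    AssetCheckResult,
    AssetExecutionContext,
    AssetIn,
    DailyPartitionsDefinition,
    HourlyPartitionsDefinition,
    TimeWindowPartitionMapping,
    asset,
    asset_check,
)

hourly = HourlyPartitionsDefinition(start_date="2024-01-01-00:00")
daily = DailyPartitionsDefinition(start_date="2024-01-01")

BUCKET = "my-data-bucket"  # hypothetical bucket


@asset(partitions_def=hourly)
def hourly_s3_keys(context: AssetExecutionContext) -> list[str]:
    """Collect the S3 keys that landed in this hour's partition window."""
    window = context.partition_time_window
    # Assumes keys are prefixed by landing hour, e.g. raw/2024-01-01/13/...
    prefix = f"raw/{window.start:%Y-%m-%d/%H}/"
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix)
    return [obj["Key"] for page in pages for obj in page.get("Contents", [])]


@asset(
    partitions_def=daily,
    ins={
        # Map the 24 hourly partitions of the day onto this one daily partition.
        "hourly_s3_keys": AssetIn(partition_mapping=TimeWindowPartitionMapping())
    },
)
def daily_bulk_jsonl(
    context: AssetExecutionContext, hourly_s3_keys: dict[str, list[str]]
) -> str:
    """Combine the day's files into one JSONL object and return its key."""
    s3 = boto3.client("s3")
    out_key = f"bulk/{context.partition_key}.jsonl"
    lines: list[str] = []
    for keys in hourly_s3_keys.values():  # dict: hourly partition key -> keys
        for key in keys:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            # Assumes each source file is already line-delimited JSON.
            lines.extend(body.decode().splitlines())
    s3.put_object(Bucket=BUCKET, Key=out_key, Body="\n".join(lines).encode())
    context.add_output_metadata({"record_count": len(lines)})
    return out_key


@asset_check(asset=daily_bulk_jsonl)
def bulk_count_matches(daily_bulk_jsonl: str) -> AssetCheckResult:
    """Validate counts on the daily bulk file (placeholder expectation)."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=BUCKET, Key=daily_bulk_jsonl)["Body"].read()
    n = len(body.decode().splitlines())
    return AssetCheckResult(passed=n > 0, metadata={"record_count": n})
```

The idea would be that each hourly partition gives me a per-hour run I can inspect, the `TimeWindowPartitionMapping` rolls the 24 hourly partitions into one daily run, and the asset check is where I'd validate counts on the bulk file. Does that look like the right use of assets + partitions, or is this better modelled with ops?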
I had asked this question here earlier, but it got buried under other messages.