Hello there I m wondering how to best create a pipeline that dagster #ask-community

Hello there, I'm wondering how to best create a pi...

moritz

03/05/2023, 9:34 AM

Hello there, I'm wondering how to best create a pipeline that performs some work on each new element (like a new entry in a database) but then I want to run downstream analyses on an aggregate over all such entries. Is the way to go to have a sensor that creates a run request for each new entry, that run request starts a job that somehow summarizes this entry and appends the summary to the aggregate table? Then I define an asset over this aggregate table that my downstream depends on? I'm struggling a little bit with seeing the best way to make updates to a large asset without recomputing it as a whole every time.

Mark

03/05/2023, 1:12 PM

Do you want process data in streaming mode? I think dagster more suitable for batch processing. In batch mode you can use sensor to load delta of data using something like id, load_date or updated_at fields (dbt will be useful in this way)

moritz

03/05/2023, 2:32 PM

No, I wouldn't call it streaming. It is a growing data set (new entries every couple of days), however, most important to me is to only ever process the batches of new entries but then add their result to an overall list (table) of results. So I don't necessarily want to partition the data but just keep adding small batches of new results to a growing table.

chris

03/06/2023, 6:46 PM

might be a good fit for the new experimental dynamic partitions fxnality

claire

03/06/2023, 8:31 PM

Yep, sounds like dynamic partitions might work for you. You could represent each new entry as a partition, and create a run request for each new partition.

moritz

03/06/2023, 11:50 PM

Interesting, thank you. I'll check that out.

7 Views

Open in Slack

Previous Next