I'm wondering how best to create a pipeline that performs some work on each new element (like a new entry in a database), but then runs downstream analyses on an aggregate over all such entries. Is the way to go to have a sensor that creates a run request for each new entry, where that run request starts a job that summarizes the entry and appends the summary to an aggregate table? Then I define an asset over this aggregate table that my downstream assets depend on? I'm struggling a bit to see the best way to make updates to a large asset without recomputing it as a whole every time.
03/05/2023, 1:12 PM
Do you want to process data in streaming mode? I think Dagster is more suitable for batch processing. In batch mode you can use a sensor to load the delta of new data using something like an id, load_date, or updated_at field (dbt can be useful for this).
03/05/2023, 2:32 PM
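The delta-loading idea from the reply (track a high-watermark on an `updated_at` field and fetch only rows changed since the last run) might look like the sketch below. This is plain Python, not Dagster: in a real sensor the watermark would be stored in the sensor's cursor, and `rows` would come from a database query. The field names and data are made up for illustration.

```python
from datetime import datetime, timezone

# Hypothetical source rows carrying an updated_at field, as the reply suggests.
rows = [
    {"id": 1, "value": 10, "updated_at": datetime(2023, 5, 1, tzinfo=timezone.utc)},
    {"id": 2, "value": 20, "updated_at": datetime(2023, 5, 2, tzinfo=timezone.utc)},
    {"id": 3, "value": 30, "updated_at": datetime(2023, 5, 3, tzinfo=timezone.utc)},
]

def load_delta(rows, high_watermark):
    """Return only rows changed since the last run, plus the new watermark."""
    delta = [r for r in rows if r["updated_at"] > high_watermark]
    new_watermark = max((r["updated_at"] for r in delta), default=high_watermark)
    return delta, new_watermark

# Last run saw everything up to May 1, so only ids 2 and 3 are in the delta.
delta, wm = load_delta(rows, datetime(2023, 5, 1, tzinfo=timezone.utc))
```

An `updated_at` watermark has one advantage over a plain id cursor: it also picks up rows that were modified after they were first loaded, in which case the downstream step should upsert into the aggregate rather than blindly append.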
No, I wouldn't call it streaming. It is a growing data set (new entries every couple of days), but what matters most to me is to only ever process the batches of new entries and then add their results to an overall table of results. So I don't necessarily want to partition the data, just keep appending small batches of new results to a growing table.