The cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability.

dagster

Is it a good idea to write something to the database in each solid? I have a feature extraction pipeline where a depends on b and c. I want to calculate b and write it to the db, calculate c and write it to the db, then query all changed items in the db and calculate a. Or should I calculate b and c, then query the missing data to complement them (i.e. if b was calculated for 10 items but c for 22, I'd still have to query the 12 missing ones)?

I'm not sure I could suggest one over the other without more details. I will say abstractly that while it's possible to pass the data directly from solid to solid in dagster, it can be very useful to instead pass a "pointer" of sorts: some custom type referencing where the real data lives, plus any other useful metadata.
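To make the "pointer" idea concrete, here's a rough sketch in plain Python with an in-memory SQLite db. Everything in it is made up for illustration (`FeatureRef`, the `feature_b`/`feature_c` tables, the placeholder feature math), and it deliberately leaves out dagster itself; in a real pipeline each function would be the body of a solid and the `FeatureRef` would be what flows between them.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRef:
    """Hypothetical pointer type: says where the rows live, not what they are."""
    table: str
    batch_id: str
    row_count: int


def write_feature(conn, table, batch_id, values):
    """Write {item_id: value} to `table` and return a pointer to the rows.

    `table` is a trusted constant here; never interpolate user input into SQL.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(batch_id TEXT, item_id INTEGER, value REAL)"
    )
    rows = [(batch_id, item_id, value) for item_id, value in values.items()]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    conn.commit()
    return FeatureRef(table=table, batch_id=batch_id, row_count=len(rows))


def compute_b(conn, batch_id, item_ids):
    # Placeholder computation standing in for the real feature b.
    return write_feature(conn, "feature_b", batch_id, {i: i * 2.0 for i in item_ids})


def compute_c(conn, batch_id, item_ids):
    # Placeholder computation standing in for the real feature c.
    return write_feature(conn, "feature_c", batch_id, {i: i + 0.5 for i in item_ids})


def compute_a(conn, b_ref, c_ref):
    """Resolve both pointers with a join; only items present in b AND c get an a."""
    query = (
        f"SELECT b.item_id, b.value + c.value "
        f"FROM {b_ref.table} AS b "
        f"JOIN {c_ref.table} AS c ON b.item_id = c.item_id "
        "WHERE b.batch_id = ? AND c.batch_id = ?"
    )
    return dict(conn.execute(query, (b_ref.batch_id, c_ref.batch_id)).fetchall())


conn = sqlite3.connect(":memory:")
b_ref = compute_b(conn, "run-1", [1, 2, 3])
c_ref = compute_c(conn, "run-1", [2, 3, 4])
a = compute_a(conn, b_ref, c_ref)  # covers items 2 and 3 only
```

The downstream step only ever sees two small `FeatureRef` objects, and the db does the work of figuring out which items have both b and c.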

It also depends on what database you're using. With "big data" datastores like Presto, BigQuery, etc., you're better off batching your workloads into very coarse-grained, large batches, whereas if you're using a traditional DB like PG/MySQL, doing small incremental writes might be fine.

I'd also consider erring towards larger batches because it's easier to recover when something goes wrong: you can just re-run the entire job and know that the data is in a good state, vs. having to carefully craft the right set of rows to update.
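One way to get that "just re-run it" property is to have each run overwrite its whole batch in a single transaction. This is only a sketch under assumptions: the `features` table, the `batch_date` column and the `replace_batch` helper are all hypothetical, and SQLite stands in for whatever db you actually use.

```python
import sqlite3


def replace_batch(conn, batch_date, rows):
    """Overwrite every row of one batch inside a single transaction.

    `rows` is an iterable of (item_id, value) pairs.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM features WHERE batch_date = ?", (batch_date,))
        conn.executemany(
            "INSERT INTO features (batch_date, item_id, value) VALUES (?, ?, ?)",
            [(batch_date, item_id, value) for item_id, value in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (batch_date TEXT, item_id INTEGER, value REAL)")

batch = [(1, 0.5), (2, 0.7)]
replace_batch(conn, "2020-01-01", batch)
replace_batch(conn, "2020-01-01", batch)  # re-running the job does not duplicate rows

count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]
```

Because the delete and the insert commit together, a failed run leaves the previous batch untouched, and a successful re-run leaves exactly one copy of the data.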