Is it a good idea to write something to the database in each solid? I have a feature extraction pipeline where a depends on b and c. I want to calculate b and write it to the db, calculate c and write it to the db, then query all changed items in the db and calculate a. Or should I calculate b and c, then query the missing data to complement them (i.e. b was calculated for 10 items but c for 22, so I still have to query the 12 missing ones)?
09/30/2019, 3:19 PM
I'm not sure I could suggest one over the other without more details. I will say abstractly that while it's possible to pass the data directly from solid to solid in dagster, it can be very useful to instead pass a "pointer" of sorts: some custom type referencing where the real data lives, plus any other useful metadata.
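To make the "pointer" idea concrete, here is a rough sketch in plain Python rather than dagster's API. Everything in it is an assumption for illustration: the `TableRef` type, the `feature_b` table, the placeholder feature (`item_id * 2.0`), and the use of SQLite. The two functions stand in for solids; the first writes its results and returns only a reference, and the second reads through that reference.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Hypothetical pointer type: says where the data lives, not the data itself."""
    db_path: str
    table: str
    row_count: int


def compute_b(db_path, item_ids):
    """Stand-in for the solid that computes b: write rows, return a pointer."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS feature_b "
            "(item_id INTEGER PRIMARY KEY, value REAL)"
        )
        # Placeholder feature: value = item_id * 2.0
        conn.executemany(
            "INSERT OR REPLACE INTO feature_b VALUES (?, ?)",
            [(i, i * 2.0) for i in item_ids],
        )
    conn.close()
    return TableRef(db_path=db_path, table="feature_b", row_count=len(item_ids))


def compute_a(b_ref):
    """Stand-in for the downstream solid: follow the pointer and read the data."""
    conn = sqlite3.connect(b_ref.db_path)
    # The table name comes from our own TableRef, not from user input.
    (total,) = conn.execute(f"SELECT SUM(value) FROM {b_ref.table}").fetchone()
    conn.close()
    return total
```

Only the small `TableRef` travels between steps, so the downstream step decides how much of the real data to load and the metadata (here, `row_count`) is available without a query.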
09/30/2019, 3:31 PM
It also depends on what database you’re using. With “big data” datastores like Presto, BigQuery, etc., you’re better off batching your workloads into very coarse-grained, large batches, whereas if you’re using a traditional DB like PG/MySQL, doing small incremental writes might be fine.
I’d also consider erring towards larger batches because it’s easier to recover when something goes wrong: you can just re-run the entire job and know that the data is in a good state, vs. having to carefully craft the right set of rows to update.
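As a rough illustration of the "just re-run the entire job" point, here is a minimal sketch, assuming SQLite and a hypothetical `feature_b` table. The function name `rebuild_feature_table` is made up for this example. It replaces the table contents in a single transaction, so a re-run gives the same result and a failed run leaves the previous data in place.

```python
import sqlite3


def rebuild_feature_table(conn, rows):
    """Replace all rows of the (hypothetical) feature_b table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_b "
        "(item_id INTEGER PRIMARY KEY, value REAL)"
    )
    with conn:  # commits on success, rolls back the DELETE and INSERTs on error
        conn.execute("DELETE FROM feature_b")
        conn.executemany("INSERT INTO feature_b VALUES (?, ?)", rows)
```

With incremental writes you would instead have to work out which rows a failed run touched before retrying.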