Would be great if Dagster employees could blog a little more about good pipeline / Data Engineering practice:
Things like:
• How partitions help you achieve idempotency, and as a result more robust pipelines that are easier to debug and refactor (for performance improvements with identical expected data outputs), plus parallelised execution for backfills.
• How software-defined assets enable you to increase the transparency of your pipelines, so that data consumers can debug / understand problems and their downstream consequences.
• How extracting and loading to S3, with a separate sensor to run the load into the target data source when the data becomes available, can lead to more efficient resource use (the extraction step, if it's mostly just running a slow DB query, doesn't use much RAM but can still block other processes if the two pipelines are kept together).
• The value of type annotations and mypy for catching bugs early (I don't think Prefect has that?)
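To make the first bullet concrete, here's a minimal, framework-agnostic sketch of partition-keyed idempotent writes (the `extract` / `write_partition` names and the in-memory `STORE` are made up for illustration, not Dagster API): because each run overwrites exactly its own partition's slot, reruns are deterministic and backfills can safely run per-partition in parallel.

```python
# Sketch of idempotency via partition-keyed, overwrite-by-key writes.
# STORE stands in for a partitioned table or S3 prefix.
STORE: dict[str, list[int]] = {}

def extract(partition_key: str) -> list[int]:
    # Pretend this queries only the rows for one day's partition.
    return [len(partition_key), 42]

def write_partition(partition_key: str, rows: list[int]) -> None:
    # Overwrite by key: running twice leaves the identical end state.
    STORE[partition_key] = rows

def run(partition_key: str) -> None:
    write_partition(partition_key, extract(partition_key))

run("2023-01-01")
snapshot = dict(STORE)
run("2023-01-01")  # rerun the same partition: no duplicates, same state
assert STORE == snapshot
```

The point is that the rerun is a no-op at the data level, which is what makes refactoring safe: you can rerun any slice and diff it against the previous output.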
These things were not obvious to me when I started, and I'm not even sure they're all correct/true now. I'd write more about it myself, but I'm not confident what I'd write would be correct. One of the things Dagster has helped me with most is how to structure work so that it is maintainable.
Edit: + more thought leadership on how to test pipelines. You've done a lot for unit testing in data engineering, but data engineers also tend to run Prod databases -> Dev data warehouse to check that the whole pipeline works properly, which your YAML configs handle very nicely. Or ways to check that an asset is the same after a refactor (a checksum / some form of hashing?)
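For the checksum idea, something like this could work as a quick sanity check after a refactor (just a sketch; `asset_fingerprint` is a made-up helper, and real assets would need canonical serialisation of types/NaNs etc.): hash a sorted serialisation of the rows so that row order doesn't matter, then compare fingerprints before and after.

```python
import hashlib

def asset_fingerprint(rows) -> str:
    # Order-insensitive content hash: serialise each row, sort the
    # serialised rows, then feed them to SHA-256 in that stable order.
    h = hashlib.sha256()
    for serialised in sorted(repr(r) for r in rows):
        h.update(serialised.encode("utf-8"))
    return h.hexdigest()

before = [("a", 1), ("b", 2)]
after = [("b", 2), ("a", 1)]  # refactor changed only the row order
assert asset_fingerprint(before) == asset_fingerprint(after)
assert asset_fingerprint(before) != asset_fingerprint([("a", 1)])
```

If the fingerprints match on a representative backfill, you have decent evidence the refactor preserved the data contract.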