
dagster

Hey Dagster team, very inspirational Medium post. We're a new ML startup, and it discussed many of our primary pain points. Looking for advice/examples to get up and running. Our needs:
• Primary data: Postgres + Spark
• Secondary data flow is ingestion of a large corpus of scanned text pages (on the order of tens of millions). Need to OCR, extract, de-identify, and store as unstructured data (evaluating Spark Tesseract and GCP DLP)
• Evaluating Spark NLP for further text preprocessing / reduction
• Spark SQL DataFrames for the aggregation pipeline (on the order of millions or tens of millions of aggregate rows generated, offline batch)
• Have not selected a "learning framework" yet, but likely TFX or Torch. TFX runs on Airflow, so it could be a fit.
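
For readers following along, the staged flow described above (OCR, de-identify, aggregate) can be pictured as a chain of small steps. The sketch below is only an illustration in plain Python: every function name is hypothetical, the steps are toy stand-ins rather than Tesseract, GCP DLP, or Spark SQL calls, and it does not use any Dagster API.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Sequence

# All names here are hypothetical placeholders, not Dagster, Spark, or GCP DLP APIs.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def ocr(pages: Iterable[bytes]) -> List[str]:
    """Stand-in for an OCR step; here it only decodes bytes to text."""
    return [page.decode("utf-8", errors="replace") for page in pages]


def deidentify(texts: List[str]) -> List[str]:
    """Stand-in for a de-identification service; masks email addresses only."""
    return [EMAIL.sub("[REDACTED]", text) for text in texts]


def aggregate(texts: List[str]) -> Counter:
    """Stand-in for the batch aggregation; counts lowercase word tokens."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def run_pipeline(pages, steps: Sequence[Callable]):
    """Feed the output of each step into the next, in order."""
    data = pages
    for step in steps:
        data = step(data)
    return data


result = run_pipeline(
    [b"Contact jane@example.com today", b"today is sunny"],
    [ocr, deidentify, aggregate],
)
```

Keeping each stage a separate function with explicit inputs and outputs is the part that carries over to an orchestrator: each one could later be swapped for the real Spark or DLP call without changing the shape of the chain.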

our most substantive multi-tech example like this is the “airline demo”

we still need to write some prose describing what is going on

[Three screenshots attached: Screenshot 2019-07-13 10.39.10.png, Screenshot 2019-07-13 10.39.27.png, Screenshot 2019-07-13 10.38.55.png]

and the code lives in examples/dagster_examples/airline_demo