Hey Dagster team, very inspirational Medium post. We're a new ML startup, and the post covered many of our primary pain points. We're looking for advice and examples to get up and running. Our needs:
• Primary data: Postgres + Spark
• Secondary data flow: ingestion of a large corpus of scanned text pages (on the order of tens of millions). We need to OCR, extract, de-identify, and store them as unstructured data (evaluating Spark Tesseract and GCP DLP)
• Evaluating Spark NLP for further text preprocessing / reduction
• Spark SQL DataFrames for the aggregation pipeline (on the order of millions or tens of millions of aggregate rows generated, offline batch)
• We have not selected a "learning framework" yet, but it will likely be TFX or Torch. TFX runs on Airflow, so that could be a fit.
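
To make the secondary data flow concrete, here is a rough sketch in plain Python of the stage order described above (OCR, extract, de-identify, store). Everything in it is a placeholder: the `Page` type, the function names, and the in-memory dict used as the store are all hypothetical, the OCR body is a stub standing in for an engine such as Tesseract, and the de-identification body only masks SSN-like patterns where a service such as GCP DLP would go. It shows the shape of the flow only, not Dagster, Spark, or any real library API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass
class Page:
    """One scanned page moving through the flow (hypothetical type)."""
    page_id: str
    raw: bytes      # scanned page bytes
    text: str = ""  # filled in by the OCR stage


def ocr(page: Page) -> Page:
    # Stub: a real stage would call an OCR engine here.
    # For illustration the raw bytes are simply decoded as text.
    return Page(page.page_id, page.raw, page.raw.decode("utf-8", errors="ignore"))


def extract(page: Page) -> Page:
    # Placeholder extraction: collapse runs of whitespace into single spaces.
    return Page(page.page_id, page.raw, " ".join(page.text.split()))


# Assumption: only SSN-like patterns are masked in this sketch.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def deidentify(page: Page) -> Page:
    # Stub: a real stage would delegate to a de-identification service.
    return Page(page.page_id, page.raw, SSN_LIKE.sub("[REDACTED]", page.text))


Stage = Callable[[Page], Page]


def run_batch(
    pages: Iterable[Page],
    stages: Sequence[Stage] = (ocr, extract, deidentify),
) -> Dict[str, str]:
    """Run every page through the stages in order and store the resulting text."""
    store: Dict[str, str] = {}  # stand-in for the unstructured store
    for page in pages:
        for stage in stages:
            page = stage(page)
        store[page.page_id] = page.text
    return store
```

Each stage takes a page and returns a new page, so the stages stay independent and could in principle be mapped onto separate steps in an orchestrator or onto Spark transformations.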