Hey folks, I’m new to Dagster, learning a lot, and enjoying the experience so far!
I have a CI pipeline where many jobs output .csv files as artifacts. These pipelines are usually triggered manually, resulting in hundreds of jobs, each with a small summary csv artifact (all with the same schema). I want to use Dagster to extract these artifacts from GitLab, store them as one global Parquet file in S3 as a data lake, then scrub the data and present a curated mart. A few questions:
• Are Dagster sensors well suited for this kind of work?
• Do I need some event-driven framework (e.g., Kafka) to handle appending records to the global Parquet file on S3 (or another landing location for new data)? Maybe this is overkill, but I’m confused about how to present a new batch of data to Dagster in this flow so that the pipeline is always fed with data as it arrives at the landing location.
This is the first ELT-style pipeline I’ve built, so I’m looking forward to some great feedback from all the experts here 🙂
03/06/2023, 8:38 PM
Sensors can be well suited for this - a sensor can be set up to trigger run requests whenever new files (or new GitLab jobs) appear in your landing location
Regarding how to feed data into Dagster - what about partitions? You can use this concept to represent the latest “chunk” of data in Dagster, and make sure that the processing code only retrieves data relevant to that partition. What you’re partitioning on probably depends on how you want to model things, but I could see either time-window-based partitions or dynamic partitions making sense here