# announcements
j
Can Dagster be used for Spark streaming jobs? We're currently using Dagster for our analytics report pipelines, and it works like a charm; debugging is easy and straightforward. We'll be building a bunch of internal data products with it in the near future, so I'll update you all.
a
Hmm, how do you imagine using Dagster in the context of a streaming job?
Dagster is quite batch-oriented at this point, so it's hard for me to imagine how this would work, but I think it depends on how exactly you're using the streaming jobs.
You could have Dagster pick up the batched chunks that come out of the streaming job and manage the subsequent processing.
j
I'm using the Spark Structured Streaming API, so it's essentially a continuously running batch job done in increments. I have a job that pulls data from AWS SQS, destructures the JSON, and then loads the data into a Delta table on S3.
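Roughly, that kind of job might look like the minimal sketch below. Everything here (bucket paths, the schema) is made up, and since SQS isn't a built-in Structured Streaming source, the sketch assumes the messages have already landed as files on S3:

```python
# Minimal sketch of the streaming job described above: read incoming JSON,
# destructure it, and append to a Delta table on S3. Paths and schema are
# hypothetical; SQS is not a native source, so this reads files that have
# already landed on S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("sqs-to-delta").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("ts", TimestampType()),
])

# Each micro-batch picks up whatever new files have appeared since the last one.
raw = spark.readStream.format("text").load("s3://my-bucket/incoming/")

# Destructure the JSON body into columns.
events = raw.select(from_json(col("value"), schema).alias("e")).select("e.*")

# Append each micro-batch to the Delta table.
(events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
    .start("s3://my-bucket/delta/events/"))
```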
Hmm, how would Dagster pick up the batched chunks?
a
Depends on how you have Dagster deployed. `dagit` has a GraphQL API you can use to start runs; you can also invoke runs via the CLI.
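For example, a rough sketch of launching a run through that API, e.g. from whatever process notices a new batch. The `launchPipelineExecution` mutation follows the pre-1.0 dagster-graphql API (field names shifted between versions), and all repository/pipeline names here are hypothetical:

```python
# Rough sketch: kick off a Dagster run via dagit's GraphQL endpoint.
# The mutation shape follows the pre-1.0 dagster-graphql API; treat the
# selector fields and names as illustrative, not exact.
import requests

LAUNCH_MUTATION = """
mutation LaunchRun($executionParams: ExecutionParams!) {
  launchPipelineExecution(executionParams: $executionParams) {
    __typename
  }
}
"""

resp = requests.post(
    "http://localhost:3000/graphql",  # wherever dagit is serving
    json={
        "query": LAUNCH_MUTATION,
        "variables": {
            "executionParams": {
                "selector": {
                    "repositoryLocationName": "my_location",  # hypothetical
                    "repositoryName": "my_repo",              # hypothetical
                    "pipelineName": "process_batch",          # hypothetical
                },
                "mode": "default",
                "runConfigData": {},
            }
        },
    },
)
resp.raise_for_status()
print(resp.json())
```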
Another option would be to have a scheduled job that uses `should_execute` to skip when there are no batches present and otherwise runs, then set the schedule to run as frequently as desired.
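That pattern might look something like the sketch below, using the pre-1.0 `ScheduleDefinition` API; `has_pending_batches` and the pipeline name are hypothetical:

```python
# Minimal sketch of the should_execute approach: the schedule ticks every
# five minutes, but the run is skipped whenever there is nothing new.
from dagster import ScheduleDefinition

def has_pending_batches(context):
    # Hypothetical check: replace with something real, e.g. listing new
    # objects under a staging prefix on S3 or peeking at the SQS queue.
    return True

process_batch_schedule = ScheduleDefinition(
    name="process_batch_every_5_min",
    cron_schedule="*/5 * * * *",
    pipeline_name="process_batch",   # hypothetical pipeline
    run_config={},
    should_execute=has_pending_batches,  # skip the tick when nothing is new
)
```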
j
Ahh right. So are GraphQL subscriptions and streaming pipelines on your vision/timetable?
a
`dagit` already uses GraphQL subscriptions to stream the progress of a pipeline run.
As for how we will support very long-lived streaming pipelines: nothing in the plans at this time.
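For reference, that run-log subscription can also be consumed outside dagit. This is a rough sketch over the subscriptions-transport-ws ("graphql-ws") protocol that dagit's server spoke at the time; the subscription name, URL, and run id reflect the pre-1.0 API as best I recall and should be treated as illustrative:

```python
# Rough sketch: tail a run's event log via dagit's GraphQL subscription.
# Uses the subscriptions-transport-ws message types (connection_init /
# start / data). Names below are illustrative of the pre-1.0 API.
import asyncio
import json

import websockets

RUN_LOGS_SUBSCRIPTION = """
subscription RunLogs($runId: ID!) {
  pipelineRunLogs(runId: $runId) {
    __typename
  }
}
"""

async def tail_run(run_id):
    async with websockets.connect(
        "ws://localhost:3000/graphql", subprotocols=["graphql-ws"]
    ) as ws:
        await ws.send(json.dumps({"type": "connection_init", "payload": {}}))
        await ws.send(json.dumps({
            "id": "1",
            "type": "start",
            "payload": {
                "query": RUN_LOGS_SUBSCRIPTION,
                "variables": {"runId": run_id},
            },
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "data":
                print(msg["payload"])  # one payload per streamed event batch

asyncio.run(tail_run("some-run-id"))  # hypothetical run id
```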