# announcements
As I understand it, Dagster does not (yet?) have support for streaming pipelines. Nevertheless, I'm looking to deploy Dagster in a production environment where I need to process inputs as they arrive. I can think of two ways of doing this. One is a scheduled pipeline that runs periodically to process all new inputs. The other is a separate Python script running as a daemon that listens for new inputs and invokes Dagster to execute a pipeline as inputs arrive. Does anyone have experience with either of these approaches, or a suggestion of their own? I think Dagster's great and I'd like to use it in production if possible.
The downside of having a separate daemon Python script is that it's a separate system whose status I can't easily monitor within the Dagit UI.
Ideally, I would be able to write a solid that subscribes to some input source and yields outputs as they become available, but I guess we can't do that just yet.
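For what it's worth, the daemon approach mentioned above could be sketched roughly like this. This is plain Python, not a real Dagster API: `poll_source` and `execute_pipeline_for` are hypothetical callables, the latter standing in for however you kick off a run (e.g. Dagster's Python API or the GraphQL endpoint):

```python
import time


def run_daemon(poll_source, execute_pipeline_for, poll_interval=5.0,
               max_iterations=None):
    """Poll an input source and launch one pipeline run per new input.

    poll_source() -> list of new inputs (hypothetical callable).
    execute_pipeline_for(item) -> launches a run for one input
    (hypothetical; in practice this would call into Dagster).
    max_iterations is only for testing; None means run forever.
    """
    launched = []
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        # Hand each newly arrived input off to its own pipeline run.
        for item in poll_source():
            execute_pipeline_for(item)
            launched.append(item)
        iterations += 1
        if max_iterations is None or iterations < max_iterations:
            time.sleep(poll_interval)
    return launched
```

As the thread notes, the catch is that this loop lives outside Dagster, so its health is invisible to Dagit.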
That all sounds right; I'm curious what environment you're hoping to deploy to.
I think I would recommend the "fast" running scheduler and using the function on to check if new inputs are available.
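A minimal sketch of that kind of check: a frequently-running schedule whose predicate fires only when unseen inputs exist. `list_keys` (e.g. an S3 prefix listing) and the `processed` record are hypothetical stand-ins, not Dagster APIs:

```python
def make_should_execute(list_keys, processed):
    """Build a should_execute-style predicate for a fast-running schedule.

    list_keys() -> iterable of all input keys currently visible
    (hypothetical; e.g. a listing of an S3 prefix).
    processed -> a set of keys already handled, maintained elsewhere.
    The schedule only kicks off a run when there is something new.
    """
    def should_execute():
        # Fire only if at least one key has not been processed yet.
        return bool(set(list_keys()) - processed)
    return should_execute
```

The nice property over the daemon approach is that the polling itself lives inside the scheduler, so runs (and skips) stay visible in Dagit.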
On a slightly longer time horizon, we've played around in the past with notions of deploying Dagster pipelines to FaaS platforms, where some of the streaming/trigger logic would be handled by platform facilities, e.g. triggering a pipeline based on an S3 bucket change.
would love your thoughts on whether something like that would be useful
A somewhat related example of using a schedule to control a backfill can be seen here.
Ah, I'm going to deploy it to a Kubernetes cluster with a bunch of Celery workers, and the input sources in my pipelines would be either files in cloud storage (S3) or rows in a database table.
๐Ÿ‘ 2
And on the horizon, I see a use case for inputs arriving over a TCP connection, but I don't think I can use Dagster directly for that. It looks like I'll need an intermediate server to accept data from that connection and store it somewhere before it's processed by a pipeline.
max: the FaaS stuff would be useful, but it depends on how people subscribe to new files on S3. For instance, they might choose to have new-file notifications sent to an SQS queue, in which case we need a pipeline driven off that queue. Based on my own research, I've also found that people have some general discomfort with things like AWS Lambda, mostly over its reliability and testability.
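Driving runs off such a queue could look roughly like this. The `receive_message`/`delete_message` calls match boto3's SQS client API; `handle_message` is a hypothetical placeholder for launching a pipeline run from the S3 event body:

```python
def drain_queue(sqs_client, queue_url, handle_message, max_batches=1):
    """Poll an SQS queue of S3 event notifications and process each one.

    sqs_client -> a boto3 SQS client (or compatible object).
    handle_message(body) -> hypothetical hook; in practice this would
    launch a pipeline run for the object referenced in the event.
    Messages are deleted only after handling succeeds, so a crash
    mid-handling leaves the message to be redelivered.
    """
    handled = 0
    for _ in range(max_batches):
        resp = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling to reduce empty receives
        )
        for msg in resp.get("Messages", []):
            handle_message(msg["Body"])
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            handled += 1
    return handled
```

Note SQS is at-least-once delivery, so whatever `handle_message` does should tolerate the occasional duplicate event.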
alex: thanks for the references! I'll go take a look
max: to go a little further on the FaaS discussion, if people have concerns over reliability (e.g. "what if S3 fails to trigger my lambda? How do I make sure I never miss anything?"), they might decide to keep a record of processed files and also turn on S3 inventory reports. That way, they can periodically get a list of all the objects in their bucket and compare it with their records to catch files that were never processed. It'll be interesting to think about how Dagster fits in here. I'm not suggesting Dagster has to (or should!) solve every problem under the sun; I'm just pointing this out in case it helps the overall design of the feature you have in mind.
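The reconciliation step described here is essentially a set difference; a tiny sketch, with `inventory_keys` and `processed_keys` as hypothetical inputs (the parsed S3 inventory report and your own processed-files record):

```python
def find_missed(inventory_keys, processed_keys):
    """Reconcile an S3 inventory listing against processed-file records.

    Anything present in the inventory but absent from the record was
    missed (e.g. a dropped S3 event notification) and should be
    re-enqueued for a pipeline run. Returned sorted for determinism.
    """
    return sorted(set(inventory_keys) - set(processed_keys))
```

A periodic "reconciliation" schedule running this over the latest inventory report would be a natural complement to the event-driven path.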