# announcements
a
As I understand it, Dagster does not (yet?) have support for a streaming pipeline. Nevertheless, I'm looking to deploy Dagster in a production environment where I need to process inputs as they arrive. I can think of two ways of doing this. One is to have a scheduled pipeline that runs periodically to process all new inputs. The other is to have a separate Python script running as a daemon that listens for new inputs and invokes Dagster to execute a pipeline as inputs arrive. Does anyone have experience with either of these approaches, or a suggestion of their own? I think Dagster is great and I'd like to use it in production if possible.
🤔 1
The downside of having a separate daemon Python script is that it's a separate system and I can't monitor its status easily within the Dagit UI.
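To make that second approach concrete, here's a rough, untested sketch of the kind of daemon I mean, assuming new inputs show up under an S3 prefix. The pipeline (my_ingest_pipeline), bucket, prefix, and run_config shape are all placeholders, and on older Dagster versions the keyword is environment_dict rather than run_config:

```python
import time

import boto3
from dagster import execute_pipeline

from my_pipelines import my_ingest_pipeline  # hypothetical pipeline definition

s3 = boto3.client("s3")
seen = set()  # in practice this record would need to be persisted somewhere

while True:
    # Poll the bucket and launch one run per previously unseen object.
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key in seen:
            continue
        seen.add(key)
        execute_pipeline(
            my_ingest_pipeline,
            run_config={"solids": {"ingest": {"config": {"key": key}}}},
        )
    time.sleep(30)
```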
Ideally, I would be able to write a solid that subscribes to some input source and yields outputs as they become available, but I guess we can't do that just yet.
m
That all sounds right; I'm curious what environment you're hoping to deploy to.
a
I think I would recommend the "fast" running scheduler and using the should_execute function on ScheduleDefinition to check if new inputs are available.
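Something roughly along these lines (an untested sketch: the cron interval, pipeline name, bucket, and prefix are placeholders, and depending on the Dagster version should_execute may receive a schedule context or no arguments):

```python
import boto3
from dagster import ScheduleDefinition

s3 = boto3.client("s3")


def new_inputs_available(_context):
    # Only let the scheduled run go ahead when unprocessed objects exist.
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    return resp.get("KeyCount", 0) > 0


new_inputs_schedule = ScheduleDefinition(
    name="new_inputs_schedule",
    cron_schedule="*/5 * * * *",  # "fast" schedule, e.g. every five minutes
    pipeline_name="new_inputs_pipeline",
    should_execute=new_inputs_available,
)
```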
✅ 1
m
On a slightly longer time horizon, we've played around in the past with notions of deploying Dagster pipelines to FaaS platforms, where some of the streaming/trigger logic would be available from platform facilities, e.g. triggering a pipeline based on an S3 bucket change.
would love your thoughts on whether something like that would be useful
a
A sort-of-related example of using a schedule to control a backfill can be seen here: https://dagster.phacility.com/source/dagster/browse/master/examples/dagster_examples/schedules.py$9-66
a
Ah, I'm going to deploy it to a Kubernetes cluster with a bunch of Celery workers, and the input sources in my pipelines would be either files on some cloud storage (S3) or rows in a database table.
๐Ÿ‘ 2
And on the horizon, I see a use case for inputs coming in via a TCP connection, but I don't think I can use Dagster directly for that. It looks like I will need an intermediate server to accept data from that connection and store it somewhere before it's processed by a pipeline.
max: the FaaS stuff would be useful, but it depends on how people subscribe to new files on S3. For instance, they might choose to have new-file notifications sent to an SQS queue, in which case we need a pipeline driven off that queue. Based on my own research, I've also found that people have some general discomfort with things like AWS Lambda, mostly concerns about its reliability and testability.
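For the SQS case, I'm picturing something roughly like this (untested sketch; the queue URL, pipeline, and run_config shape are all placeholders, and run_config may be environment_dict on older Dagster versions):

```python
import json

import boto3
from dagster import execute_pipeline

from my_pipelines import my_ingest_pipeline  # hypothetical pipeline definition

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"  # placeholder

while True:
    # Long-poll for S3 "object created" notifications delivered to the queue.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            key = record["s3"]["object"]["key"]
            execute_pipeline(
                my_ingest_pipeline,
                run_config={"solids": {"ingest": {"config": {"key": key}}}},
            )
        # Remove the message only after the run has been launched.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```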
alex: thanks for the references! I'll go take a look
max: to go a little further on the FaaS discussion, if people have concerns over reliability (e.g. "what if S3 fails to trigger my Lambda? How do I make sure I never miss anything?"), they might decide to keep a record of processed files and also turn on S3 inventory reports. That way, they can periodically get a list of all the objects in their bucket and compare it with their records to see whether any files were never processed. It'll be interesting to think about how Dagster fits in here. I'm not suggesting Dagster has to (or should!) solve every problem under the sun; I'm just pointing this out in case it helps the overall design of the feature you have in mind.
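For illustration, the reconciliation step is basically a set difference between the inventory listing and the record of processed files. The CSV column layout here is an assumption, since it depends on how the inventory report is configured:

```python
import csv


def inventory_keys(inventory_csv_path):
    # S3 inventory reports delivered as CSV typically list the bucket in the
    # first column and the object key in the second; adjust for your setup.
    with open(inventory_csv_path, newline="") as f:
        return {row[1] for row in csv.reader(f)}


def missed_keys(inventory_csv_path, processed_keys):
    """Keys that appear in the inventory but were never processed."""
    return inventory_keys(inventory_csv_path) - set(processed_keys)


# Anything returned here could be fed back into a catch-up pipeline run.
for key in missed_keys("inventory.csv", processed_keys={"incoming/a.csv"}):
    print("never processed:", key)
```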
👀 1