# ask-community
m
can anyone here confirm that dagster is not a suitable platform for streaming and real-time use cases?
🤖 1
o
hi @Matt Fysh, that's basically correct. Dagster is not designed for constantly-running streaming tasks, so for use cases which cannot be adapted to micro batches it won't be a great fit.
m
ok thanks @owen - it's a shame, I really enjoyed learning Dagster these past few days (especially the lineage features and the Dagit UI), but streaming is my primary requirement. Just curious, when you say micro-batches, do you mean each micro-batch is run on a partition?
o
I don't think you'd need to use partitions (although that's often a good idea). I could also imagine a cursor-based approach, where a sensor runs every N seconds, and launches a job to process the new events that occurred in those N seconds.
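very rough sketch of the shape I'm imagining -- the names, the config, and the 30-second interval are just placeholders, not a recommendation:

```python
import time

from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op(config_schema={"since": float, "until": float})
def process_new_events(context):
    # placeholder: fetch and process whatever events landed in this time window
    context.log.info(
        f"processing events from {context.op_config['since']} to {context.op_config['until']}"
    )


@job
def micro_batch_job():
    process_new_events()


# minimum_interval_seconds is a lower bound between sensor evaluations, not a strict schedule
@sensor(job=micro_batch_job, minimum_interval_seconds=30)
def micro_batch_sensor(context: SensorEvaluationContext):
    since = float(context.cursor) if context.cursor else 0.0
    until = time.time()
    yield RunRequest(
        run_key=f"window-{int(until)}",
        run_config={
            "ops": {"process_new_events": {"config": {"since": since, "until": until}}}
        },
    )
    # persist the high-water mark so the next evaluation only picks up newer events
    context.update_cursor(str(until))
```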
m
oh right, is there a minimum for N? I had tried this with a sensor plus `define_asset_job`, but every time the sensor detected new files and the job ran, it would overwrite the assets generated by previous runs. I assume this is my fault though, as I'm new to dataops, so I probably didn't set things up correctly
o
Ah I see -- there's no hard minimum for N, but the default IO manager behavior will not work well once runs start overlapping with each other (as you observed). To avoid the overwriting issue, using partitions would be the quick fix (as each partition of the asset gets written to a different filepath), but I could definitely be off base depending on your end goal here. Do you mind giving a quick description of your use case (what type of data you're passing through the steps, latency requirements)?
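for example, something roughly like this -- the daily partition is just a stand-in, yours might be keyed by date + domain or by incoming file, and the names are made up:

```python
from dagster import DailyPartitionsDefinition, asset, define_asset_job

# placeholder partitioning scheme; pick whatever key matches how your data arrives
daily = DailyPartitionsDefinition(start_date="2022-01-01")


@asset(partitions_def=daily)
def http_transactions(context):
    # the default IO manager writes each partition to its own filepath,
    # so runs for different partitions no longer overwrite one another
    context.log.info(f"materializing partition {context.partition_key}")
    return []  # placeholder: the parsed transactions for this partition


transactions_job = define_asset_job(
    "http_transactions_job",
    selection=[http_transactions],
    partitions_def=daily,
)
```

then your sensor can request a run targeting whichever partition(s) the newly-detected files belong to, rather than always re-materializing the same key.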
m
Sure thing. I'm uploading archives to S3 that contain network packets (pcap). As these come in, they need to be decompressed, decrypted, parsed, and converted into a list of HTTP transactions, partitioned by date + domain and indexed by URL. I then have a dynamic number of services that *might* be interested in the indexed data as it comes in, and those convert semi-structured HTML/JSON/etc. into structured documents. Then there is another layer above that, listening for structured documents coming in, which performs things like entity normalization, ML-driven matching, etc. Re: latency, I'm hoping for these to be near real-time all the way through to the final layer.
o
So what are the assets here, in your mind? Would it be a) the list of http transactions b) whatever structured documents are created and c) some output of the normalization / matching layer? Also, do the steps downstream of the list of http transactions require all of the historical data, or just the new stuff since the last run?
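(just to make sure we're picturing the same thing, I'm imagining a chain of software-defined assets roughly like this, with made-up names:)

```python
from dagster import asset


@asset
def http_transactions():
    # (a) list of HTTP transactions parsed out of the uploaded pcap archives
    ...


@asset
def structured_documents(http_transactions):
    # (b) semi-structured html/json converted into structured documents
    ...


@asset
def normalized_entities(structured_documents):
    # (c) output of the entity normalization / ML-driven matching layer
    ...
```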
m
All three are assets and would need to be persisted to disk, because new ETL jobs can be added at any time at the (b) and (c) layers. E.g. an ETL job could be introduced at (b) at any time, such as a conversion from HTML that extracts a list of `ld+json` entities. Once it's added, it would be run against all historical data (in this case, scoped to all text/html traffic), beginning with the most recent HTTP traffic
and I guess it's not just “three assets”, it's a dynamic tree of downstream assets; perhaps it's better to think of it as three layers of asset types?
o
makes sense -- and by "dynamic", would this still involve a code change to add a new consumer, or would there be some different process?
m
on that I’m not too sure, I haven’t reached that part yet… I imagine the data platform having an API for creating new jobs at runtime?
ok, it looks like Prefect will support my use case, but if you’re interested in chatting more on Dagster or Prefect I’m happy to write up some feedback
👍 1
o
that makes sense -- I think for real-time streaming workloads, you'd end up battling against the Dagster API a fair amount. I do think it would be possible to get something that basically worked, but it'd likely be pretty challenging, and Prefect is probably better suited for that use case (although I'm not very familiar with it myself, believe it or not 🙂). Definitely interested in your feedback regardless!
j
@owen are there any plans to support streaming use cases in the near-ish future? I love the software-defined asset approach; however, I have certain assets that need a “real-time” freshness of ~1s