# ask-community
m
can anyone here confirm that dagster is not a suitable platform for streaming and real-time use cases?
🤖 1
o
hi @Matt Fysh, that's basically correct. Dagster is not designed for constantly-running streaming tasks, so for use cases which cannot be adapted to micro batches it won't be a great fit.
m
ok thanks @owen - it's a shame, I really enjoyed learning Dagster these past few days (especially the lineage features and the Dagit UI), but streaming is my primary requirement. Just curious, when you say micro-batches, do you mean each micro-batch is run on a partition?
o
I don't think you'd need to use partitions (although that's often a good idea). I could also imagine a cursor-based approach, where a sensor runs every N seconds, and launches a job to process the new events that occurred in those N seconds.
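very rough sketch of the shape I'm imagining -- the names, the config, and the 30-second interval are just placeholders, not a recommendation:

```python
import time

from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op(config_schema={"since": float, "until": float})
def process_new_events(context):
    # placeholder: fetch and process whatever events landed in this time window
    context.log.info(
        f"processing events from {context.op_config['since']} to {context.op_config['until']}"
    )


@job
def micro_batch_job():
    process_new_events()


# minimum_interval_seconds is a lower bound between sensor evaluations, not a strict schedule
@sensor(job=micro_batch_job, minimum_interval_seconds=30)
def micro_batch_sensor(context: SensorEvaluationContext):
    since = float(context.cursor) if context.cursor else 0.0
    until = time.time()
    yield RunRequest(
        run_key=f"window-{int(until)}",
        run_config={
            "ops": {"process_new_events": {"config": {"since": since, "until": until}}}
        },
    )
    # persist the high-water mark so the next evaluation only picks up newer events
    context.update_cursor(str(until))
```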
m
oh right, is there a minimum for N? I had tried this with a sensor plus `define_asset_job`, but every time the sensor detected new files and the job ran, it would overwrite the assets generated by previous runs. I assume this is my fault though, as I'm new to dataops, so I probably didn't set things up correctly
o
Ah I see -- there's no hard minimum for N, but the default IO manager behavior will not work well once runs start overlapping with each other (as you observed). To avoid the overwriting issue, using partitions would be the quick fix (as each partition of the asset gets written to a different filepath), but I could definitely be off base depending on your end goal here. Do you mind giving a quick description of your use case (what type of data you're passing through the steps, latency requirements)?
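for example, something roughly like this -- the daily partition is just a stand-in, yours might be keyed by date + domain or by incoming file, and the names are made up:

```python
from dagster import DailyPartitionsDefinition, asset, define_asset_job

# placeholder partitioning scheme; pick whatever key matches how your data arrives
daily = DailyPartitionsDefinition(start_date="2022-01-01")


@asset(partitions_def=daily)
def http_transactions(context):
    # the default IO manager writes each partition to its own filepath,
    # so runs for different partitions no longer overwrite one another
    context.log.info(f"materializing partition {context.partition_key}")
    return []  # placeholder: the parsed transactions for this partition


transactions_job = define_asset_job(
    "http_transactions_job",
    selection=[http_transactions],
    partitions_def=daily,
)
```

then your sensor can request a run targeting whichever partition(s) the newly-detected files belong to, rather than always re-materializing the same key.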
m
Sure thing. I'm uploading archives to S3 that contain network packets (pcap). As these come in, they need to be decompressed, decrypted, parsed, and converted into a list of HTTP transactions, partitioned by date + domain and indexed by URL. I then have a dynamic number of services that *might* be interested in the indexed data as it comes in, and those convert semi-structured HTML/JSON/etc. into structured documents. Then there is another layer above that, listening for structured documents coming in, which performs things like entity normalization, ML-driven matching, etc. Re: latency, I'm hoping for these to be near real-time all the way through to the final layer.
o
So what are the assets here, in your mind? Would it be a) the list of http transactions b) whatever structured documents are created and c) some output of the normalization / matching layer? Also, do the steps downstream of the list of http transactions require all of the historical data, or just the new stuff since the last run?
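(just to make sure we're picturing the same thing, I'm imagining a chain of software-defined assets roughly like this, with made-up names:)

```python
from dagster import asset


@asset
def http_transactions():
    # (a) list of HTTP transactions parsed out of the uploaded pcap archives
    ...


@asset
def structured_documents(http_transactions):
    # (b) semi-structured html/json converted into structured documents
    ...


@asset
def normalized_entities(structured_documents):
    # (c) output of the entity normalization / ML-driven matching layer
    ...
```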
m
All three are assets and would need to be persisted to disk, because new ETL jobs can be added at any time at the (b) and (c) layers. E.g. an ETL job could be introduced at (b) at any time, such as a conversion from HTML that extracts a list of `ld+json` entities. Once it's added, it would be run against all historical data (in this case, scoped to all text/html traffic), beginning with the most recent HTTP traffic
and I guess it's not just “three assets”, it's a dynamic tree of downstream assets; perhaps it's better to think of it as three layers of asset types?
o
makes sense -- and by "dynamic", would this still involve a code change to add a new consumer, or would there be some different process?
m
on that I’m not too sure, I haven’t reached that part yet… I imagine the data platform having an API for creating new jobs at runtime?
ok, it looks like Prefect will support my use case, but if you’re interested in chatting more on Dagster or Prefect I’m happy to write up some feedback
👍 1
o
that makes sense -- I think for real-time streaming workloads, you'd end up battling against the Dagster API a fair amount. I do think it would be possible to get something that basically worked, but it'd likely be pretty challenging, and Prefect is probably better suited for that use case (although I'm not very familiar with it myself, believe it or not 🙂). Definitely interested in your feedback regardless!
j
@owen are there any plans to support streaming use cases in the near-ish future? I love the software-defined asset approach; however, I have certain assets that need a “real-time” freshness of ~1s