# ask-community
j
Hello there community! Thanks for the great work in the framework. I have a question regarding scaling Dagster:
• As an example, I want to process a dataframe/table and produce a different output for each row (i.e., one file per row)
• I have an op that finds the universe of rows to be calculated and produces one DynamicOutput for each row
• Each row requires a pretty complex series of operations, takes quite some time, can fail in different steps, etc. This is why I'd like to keep each row as a separate unit of work, giving more visibility and fault isolation between rows
• Ideally I'd like to process millions of isolated rows in flows that can take a long time to run (>24h)
When I implement a prototype to test this, just the collection of all the dynamic outputs takes a very long time, which blocks the execution of the flow during that time. The UI struggles a bit as well, but that is not a problem at all. One obvious answer here could be to manually partition these rows and use one op per partition, at the expense of some visibility and isolation/fault tolerance, but I first want to understand if there are other ways to look at the issue. In essence:
• Is there a working way to implement dynamic graphs that fan out into millions of ops using Dagster? Does this type of workload not suit Dagster? Is there a suggested way to do something like this?
I've seen some light discussion around this in GitHub issues but no obvious conclusions. Thanks a lot for the help in advance
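For reference, here's roughly the shape of my prototype (op names and the per-row work are simplified placeholders, not my real code):
```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out_rows():
    # In the real job this queries the universe of rows to calculate;
    # here it's just a placeholder range.
    for i in range(1_000_000):
        yield DynamicOutput(i, mapping_key=str(i))


@op
def process_row(row):
    # Stand-in for the complex multi-step processing; each row can fail independently.
    return row * 2


@op
def summarize(context, results):
    context.log.info(f"processed {len(results)} rows")


@job
def per_row_job():
    # One mapped op invocation per row, then collect everything at the end.
    results = fan_out_rows().map(process_row)
    summarize(results.collect())
```
It's the collection step and the sheer event volume from this fan-out that become the bottleneck.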
z
I also struggle with the performance of dynamic outputs. My team has jobs that scale to tens of thousands of dynamic outputs and it takes forever just to generate the outputs, and it makes the UI mostly unusable due to the number of events that get generated and presented. I think you'll have better luck by doing one dynamic output per partition. Honestly in my opinion Dagster struggles to work well with really large data or flows where you want to have really granular ops / assets. It seems more geared toward the average use case, which is totally understandable from a design standpoint
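Roughly what I mean, as a sketch (the chunk size and names are made up; tune them to your workload):
```python
from dagster import DynamicOut, DynamicOutput, job, op

CHUNK_SIZE = 10_000  # arbitrary; pick something that keeps the fan-out count small


@op(out=DynamicOut())
def fan_out_chunks():
    rows = list(range(1_000_000))  # stand-in for the real universe of rows
    for start in range(0, len(rows), CHUNK_SIZE):
        yield DynamicOutput(rows[start:start + CHUNK_SIZE], mapping_key=f"chunk_{start}")


@op
def process_chunk(chunk):
    # Loop over the rows inside a single op: you lose per-row visibility and retries,
    # but the event volume and UI load stay manageable.
    return [row * 2 for row in chunk]  # stand-in for the real per-row work


@job
def chunked_job():
    fan_out_chunks().map(process_chunk)
```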
j
Yep, that's my impression from my experiments. I'd love to hear whether you've looked into any other scheduling/workflow engines with support for handling larger quantities of tasks?
z
I haven't, my team has been able to make it work for our more extreme workloads and most of our workloads don't need tons of tasks. We feel the metadata and software engineering principles baked into the tool are worth a not-perfect experience for large jobs
j
Thanks, appreciate the feedback. After further tests with other tools I'm starting to realize that I'm probably approaching this with the wrong mental model for this kind of toolkit. I'll wait for a few more answers, and then I'll post some conclusions in this thread from trying equally large datasets (1,000,000 very simple tasks in Dagster, Prefect and Airflow)
b
Likewise, I'd view Dagster as an orchestration tool first, and a high-throughput scheduler second.
z
Excited to hear the results! It's really common for folks to want to use ops as super-granular replacements for vanilla Python functions (@Ion Koutsouris has been after the same thing), but the reality of the overhead Dagster introduces is that really granular ops just don't work very well. I think of ops as higher-level containers for describing business processes, not a replacement for every operation in a transformation
b
I'd definitely look for other solutions (that can still use Dagster for broad orchestration) to handle high-granularity or high-throughput workloads.
z
Yeah I'm really curious to know what's out there as well for high granularity / throughput workflows. Particularly tools that might have more first-class support for big data frameworks like spark
b
To be honest, there's probably space in the market for a high throughput equivalent of dagster with similar abstractions. I suspect python is not the language for this.
z
Yeah, I agree; I find it unlikely such a tool could be implemented efficiently in Python. Maybe Rust masquerading as Python via bindings, but just the initialization overhead of Dagster has made me much more aware of how slow Python is at scale.
b
I've used PyO3 before and it's great. I'd definitely do it that way
The dream would be Dagster-style definitions in Python with Rust bindings to convert that into something more scalable across many partitions.
(which is where we all hit the scaling limits)
I don't imagine it's easy though.
i
Dagster should rewrite some of these core components in rust to improve concurrency
z
Yeah if it were easy we'd probably see more tools that can scale to that level. The reality is that most teams don't need to scale super large, so most tools don't get developed to handle super large scale
i
I work on delta-rs, and it's pretty fun to build Rust stuff that interacts with Python
👍 1
z
I agree it'd be great if they rewrote some of it in rust. Or any other faster language. But it's a lot easier to say that from the sidelines than it is to grow a company while doing a huge refactor into a new language to serve a small portion of your user base
👍 1
a
Your use case looks more like a stream-processing use case to me. In the Python ecosystem, none of the workflow engines I've tested (Airflow, Prefect, Dagster) provide stream processing. I know that Apache NiFi is well suited for these use cases; it is designed for efficient stream and batch processing. But from my (little) experience, it is more limited in terms of UI, and it also lives in the Java ecosystem. There is also Spring Cloud Data Flow, which shares approximately the same strengths and limitations as Apache NiFi, plus the con that it is extremely difficult to dig into when you're not familiar with Spring tooling.