# random
s
Hey, I'm enjoying using Dagster so far, but a question has come up of comparing Dagster to AWS Step Functions and the pros and cons of each. Does anybody have experience with both, or know of some less vendor-influenced documentation comparing the two?
z
I've used Step Functions a bit, and Lambda a lot more, but it's been a little while. Some things that stand out to me:

• Step Functions is more of a general task-orchestration framework. Dagster is a very opinionated data-orchestration framework that can also orchestrate tasks. The difference is that Dagster has strong first-class support for abstractions that generalize common problems in data engineering and ML ops (the whole Software-Defined Asset paradigm, things like partitioned datasets and auto-materialization; there's a small sketch of this below). This is a very big difference in tooling and ethos, but it may or may not matter for your use case.
• Step Functions will have much lower task-startup latency than Dagster (at least when orchestrating Lambda tasks).
• Step Functions inherits Lambda's limits on maximum CPU/memory when Lambda is the compute. Dagster is only limited by the technology you deploy it on: ECS Fargate and Kubernetes are popular, and deploying on EC2 or compute on other platforms is also supported (Lambda being the exception).
• Step Functions integrates deeply with different AWS services and has a myriad of ways to trigger them in event-based workflows. Dagster doesn't have a true event-based trigger mechanism, although you can simulate one by building external tools that react to events and call the GraphQL API to submit jobs. In Dagster you can also respond to events with a polling mechanism (Sensors; see the sensor sketch below).
• Dagster integrates your data-processing logic with your workflow logic. In Step Functions you're generally operating at the level of orchestrating services, inside of which your data processing occurs.
• I believe Step Functions allows cycles in workflows, which Dagster does not (hence the DAG part of the name).
• Step Functions is language agnostic; Dagster is mostly for running Python code, although it's possible to orchestrate arbitrary Docker images, Kubernetes jobs, and shell commands.

The first difference is really the biggest by far. I think of Step Functions as a tool with no real opinions on how you use it: it doesn't provide abstractions as part of a framework for achieving specific tasks or goals. Dagster, by contrast, is a highly opinionated framework for processing data, with abstractions built to push your projects toward common best practices in software and data engineering (for example, dependency injection, implemented through Dagster's Resource mechanism, and validation of data and configuration, implemented through the config system and type loaders). Dagster also lets characteristics of your data influence scheduling, such as scheduling jobs based on data partitioning. In other words, Dagster is specialized for data engineering, whereas Step Functions is extremely generalized.
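To make the Software-Defined Asset / partitioning point concrete, here's a rough sketch of what that looks like. Purely illustrative: the asset names, bucket path, and start date are made up, and the actual storage would be whatever your IO manager handles.

```python
import pandas as pd
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

# One logical table, materialized one day at a time.
daily = DailyPartitionsDefinition(start_date="2024-01-01")  # hypothetical start date


@asset(partitions_def=daily)
def raw_events(context: AssetExecutionContext) -> pd.DataFrame:
    # Dagster hands each run the partition it should fill in.
    day = context.partition_key  # e.g. "2024-06-01"
    # Hypothetical source path; swap in your real storage.
    return pd.read_parquet(f"s3://my-bucket/events/{day}.parquet")


@asset(partitions_def=daily)
def daily_event_counts(raw_events: pd.DataFrame) -> pd.DataFrame:
    # A downstream asset declares its dependency just by naming the upstream asset.
    return raw_events.groupby("event_type").size().reset_index(name="count")
```

With something like that you get partition-aware backfills for free, and auto-materialization can keep the downstream asset current as new partitions land; Step Functions has no analogous concept.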
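And for comparison with Step Functions' native event triggers, here's roughly what the Sensor polling approach looks like. Again a sketch: the job, op config shape, and bucket name are hypothetical.

```python
import boto3
from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op(config_schema={"key": str})
def process_new_file(context):
    context.log.info(f"processing s3 object {context.op_config['key']}")


@job
def process_file_job():
    process_new_file()


# Dagster evaluates this on an interval (polling), rather than being pushed events.
@sensor(job=process_file_job, minimum_interval_seconds=30)
def s3_new_file_sensor(context: SensorEvaluationContext):
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket="my-bucket")  # hypothetical bucket
    for obj in objects.get("Contents", []):
        key = obj["Key"]
        # run_key dedupes: Dagster skips keys it has already launched a run for.
        yield RunRequest(
            run_key=key,
            run_config={"ops": {"process_new_file": {"config": {"key": key}}}},
        )
```

The `minimum_interval_seconds` is the knob here: since it's polling, worst-case reaction latency is the interval, versus near-instant with an EventBridge-triggered state machine.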
s
Wow, thanks! This was extremely insightful.
From a data engineering standpoint, do you see them as toolkits that can be used in parallel? Or are their differences in philosophy too vast to be bridged?
z
I don't see any specific reason why one couldn't integrate both tools; it's just a matter of the level of complexity you and your team want to take on. There's a lot of redundancy between the two with regards to task management, so I think it'd be important to delineate early on which types of tasks you orchestrate with Dagster and which with Step Functions. If you decide to use both, you might also consider a third-party monitoring/observability solution like Datadog to get a unified view of tasks orchestrated across both tools, as jumping back and forth between Dagster and different AWS service consoles to track workflows sounds like a pain.

But I could certainly imagine Dagster triggering Step Functions workflows or vice versa, or even just using them for different types of tasks (sketch below). One concrete example I'm imagining: a Step Functions workflow for low-latency response to high-volume S3 events (or other EventBridge events), with the result of that processing then ingested into Dagster as an asset for your data lake / data warehouse. I could also see the two tools used in parallel to support a Lambda architecture (although I suggest you don't go down that rabbit hole unless you absolutely need a real-time view).

Personally, I'd rather choose one or the other: they'll both require pretty deep knowledge and training, and using two orchestration tools adds significantly to cognitive and coordination overhead. In my view Dagster seems to be able to do almost anything Step Functions can do; it just might take some custom code to integrate with services Step Functions supports natively. True event-based workflows, streaming pipelines, and tasks with low-latency requirements might be the exception, although as I stated in my previous message, most users seem to be able to get what they need there with Dagster's Sensor concept.

The opposite does not seem to be true, though. Step Functions doesn't natively support many of the value-adds Dagster brings to the table: things like partitioned assets/jobs, assets in general, first-class dependency-injection support (Resources), automatic data validation, etc. Replicating those would require essentially rebuilding Dagster on top of Step Functions, whereas replicating an AWS service integration in Dagster is usually just a couple of boto3 API calls (if an integration doesn't already exist for it).
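For the "Dagster triggering Step Functions" direction, the glue really is small. A minimal sketch, assuming you already have a state machine deployed; the ARN, op name, and input shape here are all made up:

```python
import json

import boto3
from dagster import OpExecutionContext, job, op


@op
def start_enrichment_state_machine(context: OpExecutionContext):
    # Plain boto3 call; no Dagster-specific integration needed.
    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        # Hypothetical state machine ARN; replace with your own.
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:enrich",
        input=json.dumps({"triggered_by_run": context.run_id}),
    )
    context.log.info(f"started execution {response['executionArn']}")


@job
def kick_off_step_functions():
    start_enrichment_state_machine()
```

Going the other way (Step Functions → Dagster) would mean having a task in the state machine, e.g. a Lambda, call Dagster's GraphQL API to launch a run, which is also not much code.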