# deployment-kubernetes
b
does dagster, deployed on a user's kubernetes installation, really work? i am evaluating dagster versus airflow to improve my company's analytics product. i'm a sophisticated user. my experience with Elastic and TimescaleDB is that a lot of vendors have kind of farcical support for community users. as soon as someone does something with even a tiny bit of scale or reality to it, the default configuration never supports it, the right configuration is really arcane, and it feels basically designed to break and get people to acquiesce to the cloud offering. i'm sort of looking for a straight answer that dagster Really Works for more than toy workloads in its default configuration. airflow has been around long enough and used by enough free users that, by comparison, it has arcane configuration that is at least discoverable.
s
We just ran a backfill spanning 70,000 jobs on our EKS cluster with no issues yesterday. I know what you're talking about wrt false advertising / immaturity from vendors, but in my experience, Dagster is pretty battle-hardened. Definitely fewer teams have pushed it to crazy limits compared to Airflow, just from a sheer adoption perspective, but I'm pretty sure they are out there. There's a lot of possible configuration through the k8s_executor, op tags, and the helm deployment chart.
❤️ 1
D 1
👍 1
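For reference, a minimal sketch of what "configuration through op tags" can look like, assuming the dagster and dagster-k8s packages (where the executor mentioned above is exposed as k8s_job_executor); the op/job names and resource values are purely illustrative, not recommendations:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op(
    tags={
        # Per-op Kubernetes overrides: this op's step pod gets its own
        # resource requests/limits instead of the chart-wide defaults.
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"cpu": "500m", "memory": "1Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                }
            }
        }
    }
)
def heavy_transform():
    ...


# With the k8s executor each op runs in its own pod; deployment-wide defaults
# (image, service account, env secrets, etc.) come from the helm chart values.
@job(executor_def=k8s_job_executor)
def nightly_job():
    heavy_transform()
```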
The UI is probably the place where things would break down before anything else (e.g. partition backfill screens, job selection menu, search)
s
I'm going to give you an answer that might not be as cheery, a little more heterodox. I'm in the midst of migrating a rather large-scale DAG (not humongous, but large) to Dagster (for context, we've outgrown single k8s clusters in gke). After evaluating other orchestrators, I found that Dagster was the most promising in its ability to scale, both in UX and conceptual design, and I actively evangelized it internally. However, with the load and configurations I'm on, it's apparent that it's not all smooth sailing. Synthetic large-scale tests went swimmingly, but combine all sorts of data issues, network failures, stuck threads, zombie subprocesses, index errors, and loading speeds, and there's a whole host of issues that only become apparent at scale. At the moment, I'm still bullish on Dagster's ability to meet our needs, in part because of how responsive the folks in this slack group are, and in part because I'm betting they will resolve the bugs as they come. Where I'll be in a couple months is another story - either I'll be one of Dagster's top fans, or I'll be a rather bitter critic. Hard to tell when you're in the storm.
👍 1
s
yeah, i have no calibration for "tiny bit of scale or reality". however it turns out, @Simon Frid, i'll buy your memoir
hat tip 1
a
I can also add my personal +1 for Dagster. I’ve been working with this software for the past ~2 years and have deployed quite a few pipelines on it (nothing like the 70k jobs mentioned, but still some large high-resolution geospatial analysis). Things have only gotten better since we started. The team is responsive, issues get solved fast, the community is growing, and many design decisions just make sense. Keep in mind that Dagster is only as good as your underlying infrastructure in terms of processing power. Bad design decisions on the infrastructure side will inevitably make Dagster runs fail and become unreliable, like any other software. With data we are very often pushing the limits of IO and network. That’s why good infra and good pipelines are core to a healthy orchestrator.
3
👍 2
b
but combining all sorts of data issues, network failures, stuck threads, zombie subprocesses, index errors, loading speeds
@Simon Frid how much of this do you attribute to (1) dagster specifically (2) something else, like python (3) and the alternative, dagster's cloud offering? is 3 knowable?
my experience with Elastic is that the modern version's default configuration doesn't work for an application. it is essentially lead gen for their cloud offering
same with timescale. all the things that were around to make it look like it was easy to deploy / encoded sane configuration were a lie
s
This is a great thread and thanks for all the testimonials and context. @Ben Berman we are definitely committed to supporting our OSS community and we have many pure OSS users that operate at large scale. However, I would frame it this way: OSS dagster gives you the opportunity to operate at scale, but as @Andrea Giardini said, it can only work as well as your underlying infrastructure, and it requires a lot of sophistication and work. For larger organizations, becoming a Cloud user is the way to ensure a successful outcome at scale. This isn’t because we don’t support our OSS users or because it is “designed to break” — we invest a lot in that and do the best we can — but when you are a Cloud user we have complete visibility into operational issues with daemons/metadata db etc. in a way that is not possible at all for OSS users, and we can fix those on your behalf directly.
b
i appreciate the note
part of what i'm reacting to is, i see there's a helm chart. i know i am signing up for a "low budget kubernetes operator", there's only so much it can do. i see that it has configuration. you guys will also be using kubernetes and automating your deployments somehow. and if you're not doing that already, you're migrating to it. if the application only "works" with a certain configuration, which is my expectation, is that at least reflected in e.g. the helm charts? or somewhere?
that is what i found most frustrating about elastic. there's clearly a file that they use for their configuration, which is not at all proprietary. it would cost them nothing to share
For larger organizations becoming a Cloud user is the way to ensure a successful outcome at scale.
we're not at a point yet where we know if there's any ROI to any of this. if we did, yeah, i'd just use the cloud offering, because then the math would be straightforward
i have heard good things btw from my buddies who use dagster in biotech too, who followed up with me
s
In terms of numbers, on our current orchestrator (not dagster), one of our devs just initiated a backfill triggering ~100k pods in the span of 30 min. The load is way too much for our current framework, and our larger design is way too coupled to the orchestrator. I haven't run anything of this scale on Dagster yet, and I don't intend to replicate a 1-to-1 mapping of how it's currently implemented, but I'm wary of where the edges are.
The challenge with any scaling paradigm is that for every 2-3 orders of magnitude, new problems emerge in places you once took for granted - that goes for both data volume/velocity and UX-based considerations. It's the creep that usually gets ya.
I may have shared this already, but I remember when I first got elasticsearch v0.* running in '13, pre-k8s, early docker. It was JVM errors galore - Dagster is nowhere near that.
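As a side note on that kind of fan-out, a minimal sketch of the per-run throttle that applies here, assuming the dagster-k8s package: the k8s executor's max_concurrent option caps how many step pods a single run keeps in flight, while concurrent runs are throttled separately via the queued run coordinator settings in the helm chart. Names and the value below are illustrative only:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op
def process_partition():
    # Stand-in for whatever per-partition work a backfill fans out over.
    ...


# max_concurrent limits the number of step pods a single run launches in
# parallel; the cap on concurrent *runs* lives in the run coordinator
# configuration (helm chart values / dagster.yaml), not here.
@job(executor_def=k8s_job_executor.configured({"max_concurrent": 25}))
def backfill_friendly_job():
    process_partition()
```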
a
that is what i found most frustrating about elastic. there’s clearly a file that they use for their configuration, which is not at all proprietary. it would cost them nothing to share
IMO it won’t be very helpful for them to share. Sure, they have a config file, and sure it’s optimized, but it’s optimized for their infrastructure and their setup. The same config file in your infra will most probably perform badly or not work at all. I worked for open-source companies before (companies doing OSS software with an enterprise offering), and this is the most difficult message to get across to the OSS community: when you pay for a cloud solution, you are paying for a highly-optimized environment built by the same developers that are providing you the software. These people know the ins and outs of the software and know how to tune every single knob. They simply know the conditions in which it will perform best, and they know where to look whenever things go south. Could they share the configuration of their cloud offering with the OSS community? Sure. Would it be helpful or useful to somebody? Most definitely not, because your infrastructure is different