# community-showcase

Tobias Südkamp

01/21/2022, 3:13 PM
Hi everyone, I've read the documentation and played with some test DAGs, but I currently struggle to imagine what an enterprise-scale Dagster repo looks like. Are there any public repositories on GitHub where I can see how others have managed bigger repos? I'm thinking of a couple hundred ops and a hundred jobs with heavy usage of global parameters and input/output parameters.

Alex Service

01/21/2022, 3:30 PM
I don’t have an example for you, but here are my thoughts in case they’re helpful: when deploying Dagster, pipelines can be grouped by repo, which in practice are called User Code Deployments, and each one results in a gRPC server being run. As a result, I’d expect the repos themselves to be created in some logical manner that prevents any one repo from becoming too big (or too small, unless your ops team likes high cloud costs 🙂). For an individual repo, my general approach has been to create a submodule for each of the main Dagster concepts, namely `resources`, `repo`, and `graphs`. For configs at small-to-medium scale, I’ve found it sufficient to use a naming convention where each config file is named the same as its job, so I can just point at a folder and get the appropriate config (rough sketch below). That approach could be extended to merge with global YAML (or Python-defined) configs.
❤️ 3
Again, haven’t worked at enterprise scale, so take it all with a grain of salt 🙂
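To make that layout concrete, here is a minimal sketch under a few assumptions: the `my_project` package structure, the example ops, and the `load_run_config` helper are illustrative only, not an official Dagster pattern; only `@op`, `@graph`, `to_job`, and `@repository` are real Dagster APIs.

```python
# A minimal sketch of the layout described above; module names and the
# load_run_config helper are illustrative, not part of Dagster itself.
#
# my_project/
#   resources/   <- shared resource definitions
#   graphs/      <- graphs and the jobs built from them
#   repo/        <- repository definitions (one per user code deployment)
#   configs/     <- one YAML file per job, named after the job

from pathlib import Path

import yaml  # PyYAML, already a Dagster dependency
from dagster import graph, op, repository


@op
def extract():
    return [1, 2, 3]


@op
def load(rows):
    print(f"loaded {len(rows)} rows")


@graph
def etl():
    load(extract())


def load_run_config(job_name: str) -> dict:
    """Naming convention: configs/<job_name>.yaml holds that job's run config."""
    return yaml.safe_load(Path("configs", f"{job_name}.yaml").read_text())


# Turn the graph into a job, picking up its default run config by convention.
etl_job = etl.to_job(name="etl_job", config=load_run_config("etl_job"))


@repository
def analytics_repo():
    # In a real project these jobs would be imported from the graphs submodule.
    return [etl_job]
```

The convention buys predictability: given a job name you know exactly which file configures it, and a global config could be merged on top of the per-job file if needed.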

Marc Keeling

01/21/2022, 10:58 PM
I have been looking at this for some guidance. It's a bit over my head, but I am learning as I go. https://docs.dagster.io/guides/dagster/example_project
Here is the Dagster examples folder, which has TONS of examples as well. https://github.com/dagster-io/dagster/tree/master/examples
👍 1

Huib Keemink

01/27/2022, 10:11 PM
I've been pushing hard to get Dagster adopted at large corporates (with little luck), but I can talk a bit about the usual problems these orgs have with Airflow. This is relevant (I think) because Dagster solves some of the problems larger orgs have, but it also suffers from some of the same ones, so you'd probably end up in a similar situation.

The larger Airflow installs I've seen eventually ran into problems with the application being centrally hosted by a platform team (because hosting Airflow is hard). That team got frustrated when data science teams (it's usually the scientists) broke the instance by doing something like using an "unsupported" version of pandas. The responses I've seen come in one of two flavours: "we need more control over the user code", usually via a rigid CI/CD pipeline and process, or "not my problem", usually via self-service Airflow, where you request an installation per team and then you're on your own. That's not all bad, because it means you only see and control your own jobs, but it's also a downside: you risk implementing the same jobs over and over again in different teams. I believe Airflow has made some progress in this area, but I don't think they've completely solved it yet (could be wrong here).

With Dagster the separation between dagster/dagit and the user code deployments is much stricter, which makes it very hard to crash the scheduler with a faulty job. However, since there is no hierarchy in the jobs that are shown, no access control for teams, etc., I think a separate install per team probably still makes sense. For instance: one install for the customer behaviour team, one for the financial analytics team, one for the engineering team doing the integrations with the legacy systems, and so on. This has the added benefit that the individual installs stay relatively manageable, and things like naming conventions can actually work (because you can walk up to the person who just called their job "process data").
☝️ 3
So my biggest takeaway: when your scheduler gets used by many people, it might make sense to break it up. You will lose some synergy, but it gets more manageable.
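To illustrate the per-team split and the naming-convention point, here is a hedged sketch: the team prefix, the example job, and the `enforce_prefix` helper are made up for illustration, and the actual split into separate user code deployments happens in your workspace/Helm configuration, not in this file.

```python
# A hedged sketch of the "one deployment per team" idea described above.
# Team names, job names, and enforce_prefix are hypothetical; only @op, @job,
# and @repository are real Dagster APIs.

from dagster import job, op, repository

TEAM_PREFIX = "cust_behaviour"  # hypothetical team namespace


@op
def score_segments():
    ...


@job(name=f"{TEAM_PREFIX}__daily_segmentation")
def daily_segmentation():
    score_segments()


def enforce_prefix(jobs, prefix):
    """Fail fast at import time if a job escapes the team's naming convention,
    so a generic name like "process data" never reaches the shared UI."""
    for j in jobs:
        if not j.name.startswith(prefix):
            raise ValueError(f"job {j.name!r} does not start with {prefix!r}")
    return jobs


@repository
def customer_behaviour_repo():
    # This module would be served by its own user code deployment (its own
    # gRPC server), so a bad dependency here can't take down the scheduler.
    return enforce_prefix([daily_segmentation], TEAM_PREFIX)
```

Because the check runs at import time, a badly named job only fails the load of that team's user code deployment; dagit and the daemon keep running, which is the isolation property the discussion above relies on.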