# ask-community

Nathan Skone

05/17/2022, 10:48 PM
Hello everyone! We are jumping headfirst into a greenfield data pipeline built on Dagster and have a few questions about designing the whole thing. Our goals:
• One git monorepo containing many (50+) data ETL jobs created by multiple teams
• Each job can define its own unique dependencies via docker/poetry/etc
• Each job can have independent least-privilege access permissions via AWS IAM
• Ops will be a mix of pure Python, Spark, DBT, and potentially others
• Dagster deployed on k8s (AWS EKS)
Questions:
• Do we need one persistent User Code Deployment (gRPC) pod running for each Dagster job, given that we want each job to define its own unique dependencies?
• Can jobs contain a heterogeneous mix of Python, Spark, DBT, and other types of ops with independent requirements?
• Are we thinking about this correctly, or missing something?
(tagging my team members: @Sanjay Sagar @Eddie Carlson)

daniel

05/18/2022, 3:08 AM
Hi Nathan - you'll need a unique user code deployment for each Python environment/Docker image that you want to run your code in. So if each job has its own Python environment/Dockerfile, then each job would have its own user code deployment, yeah.
Jobs can absolutely include a variety of ops of different types. If a particular op needs to use a different Python environment or a different image, it's possible, if a bit tricky, to specify different Docker images for ops within a job by tagging each op with the image you want to use. You'll need to set it up so that the job uses the k8s_job_executor, which runs each op in its own k8s pod, and all the images need to include the same Dagster repository definition at the same path. If you go with this approach, you'll definitely want to look at some CI/CD process to help with this.
On the ops you want to override, you can specify this tag:
from dagster import op

@op(
    tags={
        # Tell the k8s_job_executor to run this op in a different image
        'dagster-k8s/config': {
            'container_config': {
                'image': 'new-image',
            },
        },
    },
)
def my_op(context):
    ...
(If by 'independent requirements' you meant something different than different Python environments, let me know)
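To make that concrete, here's a rough end-to-end sketch, assuming the k8s_job_executor from the dagster-k8s library; the op names and image names below are just placeholders:

from dagster import job, op
from dagster_k8s import k8s_job_executor

@op
def extract(context):
    # Runs in the job's default image
    context.log.info("extracting")
    return 1

@op(
    tags={
        'dagster-k8s/config': {
            'container_config': {
                # Placeholder image containing this op's own dependencies
                'image': 'my-registry/spark-step:latest',
            },
        },
    },
)
def transform(context, value):
    # Runs in the overridden image specified above
    return value * 2

@job(executor_def=k8s_job_executor)
def mixed_environment_job():
    transform(extract())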

geoHeil

05/18/2022, 7:03 AM
Could this be alleviated by using conda envs? But perhaps separate containers are better / offer better isolation

Nathan Skone

05/18/2022, 3:56 PM
@daniel Thank you for the detailed answer! One item I do not currently understand is why we need persistent (always running) user code deployments for each job if the individual ops can run in on-demand k8s pods with unique Docker images. What is the purpose of the user code deployment in that model?

Eddie Carlson

05/18/2022, 3:58 PM
thanks for the responses! ha, i was just about to ask the same question, nathan. having separate code deployments for each python environment does offer nice isolation, but the cluster may become a bit cluttered if we have many (we expect 50+ eventually) long-lived code deployment pods. could an alternative be to run a single code deployment server with the dagster api grpc flag --use-python-environment-entry-point and have jobs specify their environment on each run (or am i misunderstanding how that works)? this option has its own problems (poor isolation with one mega-image for all user code), though neither choice seems a perfect fit.
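for reference, a rough sketch of what i'm picturing with the single-server option (the module and job names here are hypothetical - one repository aggregating everyone's jobs, baked into one shared image):

from dagster import repository

# hypothetical team packages, all installed into the shared image
from team_a.jobs import ingest_job
from team_b.jobs import dbt_models_job

@repository
def monorepo():
    return [ingest_job, dbt_models_job]

# the single user code server could then be started with something like:
#   dagster api grpc -m repo_module -a monorepo --use-python-environment-entry-point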

daniel

05/18/2022, 6:34 PM
Yeah, it's a great question and a totally fair point that the current architecture results in a ton of standing servers for setups like this. There are a few things the system does that expect quick access to your code via the server, but this doesn't mean the system can't be made more serverless:
• Sensors are one big one - those run your code, often pretty frequently, and spinning up a new k8s pod for each tick could increase latency a lot (that said, not every job has a sensor)
• In Dagit, when you launch a run, we do some validation against your code, to verify things like that any environment variables you specified are actually present
Here's a question for you - how would you feel about a world where it spun up one of these servers for you lazily whenever one of those operations that required access to it ran, then left it up for a while with a TTL before spinning it back down? The big downside there would be additional latency on the first call while it waits to spin up the server, but repeated calls would stay fast. We don't have a setup like that today, but it could be possible to add (particularly in our cloud product, where we have an agent service that manages the user code servers rather than relying on a Helm chart to spin them up for you).

Nathan Skone

05/18/2022, 6:49 PM
@daniel Interesting. Thank you again for a thoughtful answer. I think that would definitely help for our case, and in particular for jobs that only need to run once a day. Do you have any thoughts on @Eddie Carlson’s question about using a single code deployment server for lots of different and heterogeneous jobs?

daniel

05/18/2022, 6:50 PM
Oh, right, missed that, sorry - that would help you create fewer images (you could have a different entrypoint for each code location, and they could share an image), but won't help you create fewer servers unfortunately. Each server corresponds to one environment.