# dagster-plus
s
Hi Dagster Team! 👋🏻 I had a few questions regarding Serverless Cloud deployments.
1. Which cloud provider and region is the instance hosted in? Would we be able to choose?
2. Looking at the Limitations mentioned, does the "4500 step-minutes per day" limit account for compute time across the 4 vCPUs? (~19 step-hours per vCPU?)
3. Looking at the 2nd pricing FAQ, what is the unit for "up to 10K", "10K to 100K", "100K+"?
Thanks for your input!
🤖 1
d
Hi Soroush - happy to answer these.
1. These are currently hosted in AWS in us-west-2. We'd like to add multi-region support in the future, but it's not currently available.
2. vCPUs are not currently taken into consideration when computing step-minutes; it's just the total amount of time the step takes.
3. The units there are step-minutes.
s
Thanks for the quick clarification @daniel With the step-minutes, if I understood correctly, there should be a maximum of 1440 (24*60) step-minutes per day, correct? Or is the definition different? I'm wondering how 4500 or 100K step-minutes could occur per day 🤔
d
That’s the number of minutes in a day, but you can have multiple jobs happening simultaneously, or multiple steps happening at the same time within a single run
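To make the arithmetic concrete, here's a small sketch of how concurrent steps push the daily total past 1440. All of the run counts and durations below are made-up illustrative numbers, not figures from Dagster's docs:

```python
# Step-minutes are summed across *all* steps, including ones running
# at the same time, so a day can accumulate far more than 24 * 60 = 1440.
# All numbers below are hypothetical, chosen only for illustration.

runs_per_day = 20           # a schedule firing 20 times a day
parallel_steps_per_run = 5  # each run executes 5 steps concurrently
minutes_per_step = 45       # wall-clock time per step

total_step_minutes = runs_per_day * parallel_steps_per_run * minutes_per_step
print(total_step_minutes)  # 20 * 5 * 45 = 4500
```

With these (hypothetical) numbers, a single daily schedule already exhausts a 4500 step-minute quota despite each step fitting comfortably inside a day.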
s
Ah yes, I overlooked that! So, with the 4500 step-minute limitation, why does the pricing mention step-minute counts greater than that (e.g. 10K step-minutes)?
d
That pricing also includes hybrid deployment, which doesn’t have a limit, and you can also request a quota increase for a serverless deployment.
s
👍🏻 Thanks Daniel for your help!
:condagster: 1
j
Apologies for hijacking this thread, but I wanted to clarify #2: I believe a step could take a varying amount of time depending on how much resource is available.
a) Are we guaranteed at least 1 vCPU, or could a step be sharing a single vCPU with other steps?
b) If our function runs threads in the Rust or C++ layer, would it be able to utilize more than 1 vCPU? And if so, is it limited by an ECS/EKS pod limit or an EC2 limit?
c) What is the max memory the process has?
d) Are there any SLAs on how quickly you can scale up/out steps if I require burst capacity to run 1000 steps (e.g. partitions) simultaneously?
e) Is there a limit to how many steps I can run concurrently?
d
Each run happens in its own isolated ECS Fargate task with these limitations (https://docs.dagster.io/dagster-cloud/deployment/serverless#limitations), and each step happens in a subprocess within that task. Based on that information, I believe the answers to your questions are:
a) Steps can share CPUs with other steps within the same run, since they are subprocesses within the same ECS task.
b) and c) The CPU and memory limits are in the link I shared.
d) In serverless, it all has to fit within that ECS task. In hybrid Kubernetes you have more options, e.g. running each step in its own Kubernetes pod.
e) No limit on the number of steps specifically; the limit is on the overall memory/CPU usage.
There's also a limit of 50 concurrent in-progress runs in serverless deployments
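A rough analogy in plain Python for the model described above: one run maps to one task, and each step is a subprocess inside it, sharing that task's CPUs and memory. This is an illustrative sketch only, not Dagster's actual executor:

```python
# Illustrative sketch: each "step" runs as a real OS subprocess, and the
# parent "run" launches several of them concurrently within one process
# (the way steps share a single ECS task's resources). Not Dagster code.
from concurrent.futures import ThreadPoolExecutor
import subprocess
import sys

def run_step(name: str) -> str:
    # Launch the step as a subprocess of this "run".
    out = subprocess.run(
        [sys.executable, "-c", f"print('{name} done')"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

step_names = [f"step_{i}" for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order in its results
    results = list(pool.map(run_step, step_names))
print(results)  # ['step_0 done', 'step_1 done', 'step_2 done']
```

The key property this mimics is that the subprocesses have no resource boundaries between them: they contend for whatever CPU and memory the parent's environment (here, the single ECS task) provides.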
j
Thanks! To clarify:
a) If I had 100 partitions, only 50 of them would run at a time, correct? But each of them would have a 4 vCPU / 16 GB RAM process to work with?
b) Apologies, I can't find the definition of a step. It doesn't appear in the Jobs or Ops sections within Concepts (and a search for "step" produces too much noise).
d
a) I think this is actually an option you can select in the UI when doing a backfill (see "Single run" vs. "Multiple runs" here: https://docs.dagster.io/concepts/partitions-schedules-sensors/backfills#launching-backfills).
b) A step is synonymous with an op for the purpose of these questions (technically, the op is the thing you write in code; the step is what is executed).
❤️ 1
but by default, each partition is its own run, yeah
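Combining that default with the 50-run concurrency limit mentioned earlier, the backfill from the question works out roughly like this (a back-of-envelope sketch using the numbers from the thread):

```python
# With one run per partition and a cap on concurrent runs, a backfill
# proceeds in "waves". Numbers are the ones discussed in the thread.
import math

partitions = 100           # hypothetical backfill size from the question
concurrent_run_limit = 50  # default serverless concurrent-run quota

waves = math.ceil(partitions / concurrent_run_limit)
print(waves)  # 100 partitions / 50 concurrent runs -> 2 waves
```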
j
Thank you ❤️ Sorry for belabouring the point: so theoretically, in Dagster Serverless, I could spawn 50 runs, each with 4 vCPUs, giving me 200 vCPUs to run my workload simultaneously if I structured my job correctly?
d
That's correct, yeah. And the 50-run quota is potentially liftable; it's just a conversation with support to bump it.
❤️ 1
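As a sanity check on the arithmetic above (the per-run figures come from this thread and the linked limitations page; treat them as a sketch, since quotas can change or be raised):

```python
# Back-of-envelope peak compute for a serverless deployment,
# using the figures discussed in the thread.
concurrent_runs = 50  # default quota (liftable via support)
vcpus_per_run = 4     # per-ECS-task limit discussed above

peak_vcpus = concurrent_runs * vcpus_per_run
print(peak_vcpus)  # 200
```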
j
So, to reword my original question in (d) above: how quickly will my 50 runs each receive an ECS task to run in when they're triggered by a sensor? Presumably there is some delay for the ECS autoscaler to provision 50 pods with my image.
d
ECS can take some time to provision a new task, yeah. I think the average is about 30 seconds to a minute, but I've seen it take a few minutes in the worst case. If run start latency is a concern, then you could also consider a hybrid deployment running in Kubernetes, which generally has lower task startup times.
j
Could you share your ECS auto-scaling settings so I have an idea of roughly what I'm working with? I'd very much prefer to use the Serverless offering if possible, as we're a lean team. Under a minute in the average case and under 5 minutes in the worst case is acceptable.
d
ECS Fargate doesn't actually have auto-scaling settings per se
🫣 1
j
🙃 Oops. Let me tap my contacts at AWS, who can tell me more about Fargate then. Thank you! I take it that means you're just using their "defaults"?
d
I think everybody who is using Fargate is using the defaults; they don't really expose configuration options there. The main thing that I've seen affect Fargate startup time is the size of the Docker image being used.