Hi has anyone observed any issues when scaling up User Code dagster #ask-community

Tomas Gatial

01/10/2023, 12:15 PM

Hi, has anyone observed any issues when scaling up User Code Deployments? After scaling my UC Deployment to 3 replicas, I observe the following:
• In Dagit, I periodically get a "Definitions Reloaded" popup.
• In Dagit, I also see this error message: Enum 'LocationStateChangeEventType' cannot represent value: <LocationStateChangeEventType instance>
• Dagit becomes less responsive.
• I also notice a slow, linear increase in the memory footprint of the dagit and daemon pods.
(Helm Deployment v1.1.9, tested in 2 different clusters: Azure and Rancher Desktop)

Adam Bloom

01/10/2023, 2:26 PM

Curious what your use case for running multiple replicas of user code is. The helm chart does not make the number of replicas configurable. I’ve always assumed more than 1 is not supported (or necessary) https://github.com/dagster-io/dagster/blob/master/helm/dagster/charts/dagster-user-deployments/templates/deployment-user.yaml#L13

Tomas Gatial

01/10/2023, 3:37 PM

Enabling the DefaultRunLauncher lets me handle low-resource, high-frequency, time-sensitive jobs directly on the user code, without the overhead of the k8s orchestrator. I am aiming for a robust Dagster setup that can handle workloads on both ephemeral and non-ephemeral resources, as discussed here: https://dagster.slack.com/archives/C01U954MEER/p1669137566368989?thread_ts=1668960151.191789&cid=C01U954MEER The documentation (page Deployment -> Open Source) says code location replicas are supported: https://docs.dagster.io/deployment/overview#long-running-services

daniel

01/10/2023, 3:39 PM

I think adding the following to dagsterApiGrpcArgs will help with most of these issues (but replicas on the user code deployments aren't officially supported and I can't promise you won't run into other weirdness)

--fixed-server-id <some unique string for your user code deployment here>

daniel

01/10/2023, 3:39 PM

looking into that error now, which is not expected

daniel

01/10/2023, 3:40 PM

setting the fixed-server-id field will help indicate to dagit that each of the replicas represents the same location - right now it's getting confused because each replica has its own server ID, so it thinks the code is constantly updating
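
For reference, a rough sketch of where that flag might go in the Helm values, assuming the user code deployment is configured through the dagster-user-deployments subchart. The deployment name, image, file path, and server ID string below are hypothetical placeholders; only dagsterApiGrpcArgs and --fixed-server-id come from this thread.

```yaml
# Sketch only: name, image, path, and the server ID value are hypothetical.
dagster-user-deployments:
  deployments:
    - name: "my-user-code"                      # hypothetical deployment name
      image:
        repository: "my-registry/my-user-code"  # hypothetical image
        tag: "latest"
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/app/repo.py"                        # hypothetical path to the code
        # Every replica of this deployment then reports the same server ID,
        # so Dagit should treat them as one location, not as constant reloads.
        - "--fixed-server-id"
        - "my-user-code-server"                 # any string unique to this deployment
      port: 3030
```

Each flag and its value are separate list items, since the list is passed to the gRPC server as individual arguments. As noted above, replicas on user code deployments are not officially supported, so this may not resolve every issue.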

Tomas Gatial

01/10/2023, 3:42 PM

Thanks Daniel! I am testing the arg now.

daniel

01/10/2023, 3:43 PM

the other big downside i think you'll run into right now if you use the default run launcher with replicas is that any runs that are still happening whenever you upgrade your code will be interrupted

Tomas Gatial

01/10/2023, 3:50 PM

Thanks for noting! Will the sensor runs be interrupted too?

daniel

01/10/2023, 4:01 PM

sensors should be fine

daniel

01/10/2023, 4:02 PM

er sorry - to clarify, any runs would be interrupted, yeah, including runs launched from sensors

daniel

01/10/2023, 4:02 PM

but running the sensors themselves should be fine - they will stop too but can pick up where they left off
