# ask-community
Hi, has anyone observed any issues when scaling up User Code Deployments? After scaling my UC Deployment to 3 replicas, I observe the following:
• In Dagit, I periodically get a Definitions Reloaded popup.
• In Dagit, I also observe this error message: Enum 'LocationStateChangeEventType' cannot represent value: <LocationStateChangeEventType instance>
• Dagit becomes less responsive.
• I also notice a slow linear increase in the memory footprint of pods (Helm Deployment v1.1.9, tested in 2 different clusters: Azure and Rancher Desktop).
Curious what your use case for running multiple replicas of user code is. The helm chart does not make the number of replicas configurable. I’ve always assumed more than 1 is not supported (or necessary) https://github.com/dagster-io/dagster/blob/master/helm/dagster/charts/dagster-user-deployments/templates/deployment-user.yaml#L13
Enabling DefaultRunLauncher lets me handle low-resource / high-frequency / time-sensitive jobs directly on the user code deployment, without the overhead of the k8s orchestrator. I am aiming for a robust Dagster setup that can handle workloads on both ephemeral and non-ephemeral resources, as discussed here: https://dagster.slack.com/archives/C01U954MEER/p1669137566368989?thread_ts=1668960151.191789&cid=C01U954MEER The documentation (Deployment -> Open Source page) says code location replicas are supported: https://docs.dagster.io/deployment/overview#long-running-services
I think adding the following to dagsterApiGrpcArgs will help with most of these issues (but replicas on the user code deployments aren't officially supported and I can't promise you won't run into other weirdness)
--fixed-server-id <some unique string for your user code deployment here>
looking into that error now, which is not expected
setting the fixed-server-id field will help indicate to dagit that each of the replicas represents the same location - right now it's getting confused because each replica has its own server ID, so it thinks the code is constantly updating
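For anyone finding this thread later, here is a rough sketch of where that flag might go in the Helm values for the user code deployments. The deployment name, image, file path, and server ID string are hypothetical placeholders, and the exact layout of your values file may differ (in the umbrella dagster chart these settings typically sit under the dagster-user-deployments key), so treat this as an illustration rather than a verified config.

```yaml
# Sketch of a values.yaml fragment for the dagster-user-deployments chart.
# All names, paths, and the server ID below are placeholders (assumptions).
deployments:
  - name: "my-user-code"                        # hypothetical deployment name
    image:
      repository: "my-registry/my-user-code"    # hypothetical image
      tag: "latest"
      pullPolicy: Always
    dagsterApiGrpcArgs:
      - "--python-file"
      - "/path/to/repo.py"                      # hypothetical path to your definitions
      - "--fixed-server-id"
      - "my-user-code-server"                   # same string for every replica of this deployment
    port: 3030
```

The point of the fixed ID is that every replica of one deployment reports the same server ID, while different user code deployments should each get their own string.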
Thanks Daniel! I am testing the arg now.
the other big downside i think you'll run into right now if you use the default run launcher with replicas is that any runs that are still happening whenever you upgrade your code will be interrupted
Thanks for noting! Will the sensor runs be interrupted too?
sensors should be fine
er sorry - to clarify, any runs would be interrupted, yeah, including runs launched from sensors
but running the sensors themselves should be fine - they will stop too but can pick up where they left off