# ask-community
m
Hi team 👋 We have recently deployed Dagster to our k8s cluster using Helm. The deployment went very smoothly and we are really loving the product! One thing we noticed is that our `dagster-dagit` deployments are slowly eating up their working memory. Initially the pods have about 50% free working memory, but this continuously decreases over time. The decrease is more pronounced when more jobs are running, which seems to suggest that job-related artifacts (e.g. logs) are kept in memory. For the storage databases (event_log, run_storage, schedule_storage) we use Postgres. For the compute logs we use the noop compute log manager, and we don't persist the Python logs. Additionally, we have set the `DAGSTER_HOME` environment variable, so the event log storage should not be kept in-memory. Any ideas what might be causing this memory decrease? 🙂
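As a sanity check for that setup, here is a minimal sketch (not from the thread) of how one could confirm, from inside a pod, what the instance actually resolved from the mounted `dagster.yaml`. `DagsterInstance.get()` is the documented loader; the exact accessor names below are assumptions worth checking against your installed Dagster version:

```python
# Hedged sketch: confirm that the running instance picked up Postgres-backed
# storages and the no-op compute log manager from DAGSTER_HOME/dagster.yaml.
import os

from dagster import DagsterInstance

print(os.environ["DAGSTER_HOME"])  # raises KeyError if the variable is not set

instance = DagsterInstance.get()   # loads the instance config from DAGSTER_HOME
print(type(instance.run_storage))          # expected: a Postgres run storage
print(type(instance.event_log_storage))    # expected: a Postgres event log storage
print(type(instance.compute_log_manager))  # expected: NoOpComputeLogManager
```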
s
@prha - any idea what could be causing dagit to increase its memory consumption over time?
p
Hmm, is this on the latest version of Dagster? We’d need to spend some time digging to see what might be leaking memory (cc @alex)
a
> Any ideas what might be causing this memory decrease?

We do cache artifacts related to job structure and metadata in memory, but everything else should be a function of requests made to the server.
> 50% free working memory, but this continuously decreases over time

What are the absolute values here for reference? What further details can you share?
> The decrease is more pronounced when more jobs are running

Is usage of the webserver increased during this time as well, i.e. users watching their actively executing jobs?
m
Hi all, thanks for investing your time! Currently we are running Dagster `0.14.3`. In terms of absolute values, we have 3 replicas of the Dagit deployment. Each requests 100m CPU and 128Mi memory, with a limit of 200m CPU and 256Mi memory. I think this might be on the low side, but considering that the memory decrease is rather consistent, increasing the requested resources will only help for some time. See the image for the available memory of a single Dagit pod. Generally we are not actively observing the jobs; the more pronounced decrease during running jobs definitely also occurs when we are not using the UI.
> We do cache artifacts related to job structure and metadata in memory, but everything else should be a function of requests made to the server.

Do you have a (rough) estimate of the memory footprint of these artifacts? Let me know if you need any more details 🙏
a
> Do you have a (rough) estimate of the memory footprint of these artifacts?

It's a function of your workspace: the number and complexity of all of your jobs, schedules, etc. I don't have a rough ballpark top of mind. Are the user code deployments constant through this time, or are you redeploying them? The graph you have here does seem to indicate a leak of some kind; I don't observe the same pattern in our internal deployments. This will be difficult to track down if we can't reproduce it.
> Generally we are not actively observing the jobs

One thing to note is that before `0.14.4`, pages would poll even if the tabs were not focused, so someone with a random background tab could be causing traffic.
Do you or anyone on your team have experience using memory profiling tools?
> the more pronounced decrease during running jobs definitely also occurs when we are not using the UI

What's kicking off the job? Unless it's manually from dagit, the webserver should not be involved in the execution at all. Are you certain this memory graph is just for the dagit pod, or is it the underlying node?
m
> Are the user code deployments constant through this time or are you redeploying them?

They are almost constant, at most one redeployment in a day.
> One thing to note is that before `0.14.4`, pages would poll even if the tabs were not focused, so someone with a random background tab could be causing traffic.

That is good to know! Thanks for the heads-up 🙂
> Do you or anyone on your team have experience using memory profiling tools?

I have looked into some general profiling tools like `ps` and `top`, and also this Python-specific profiler. The results showed that it's the general Python process that requires more and more virtual memory. However, I did not find anything specific that could help me pinpoint the problem.
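For what it's worth, a minimal sketch of how the standard library's `tracemalloc` could help pinpoint growth, under the assumption that you can run it inside the suspect Python process for a while; diffing two snapshots lists the call sites whose allocations grew in between:

```python
# Hedged sketch: compare two tracemalloc snapshots to find the allocation
# sites that grow over time. This must run inside the leaking Python process.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()

# ... let the process handle traffic for a while, then take a second snapshot ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)  # the ten call sites with the largest memory growth
```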
> What's kicking off the job?

We launch the jobs generally using the Python GraphQL client. And I'm sure it's the memory consumption of the specific pod (and even the Python process in the pod). I understand that this is also very difficult for you to do anything about, since you cannot reproduce the problem. Although the situation is not ideal, it's also not causing immediate problems, so I will let the problem rest for the time being. Thanks a lot for your help, and please let me know if you ever find something that might be causing these issues 💪💪
a
> We launch the jobs generally using the Python GraphQL client

Ok interesting, those will be generating requests against dagit. Can you go into more detail about what interactions you do from the GraphQL client? Do you do any custom queries?
m
We use the two available endpoints from the provided `DagsterGraphQLClient`:
• `submit_job_execution`: we use this to start the job. At the moment we have around 6 different jobs that we execute from here.
• `get_run_status`: we use this to get the status of running jobs; we send a request for every running job once a second. Approximately, these jobs run for 1 minute.

So at the moment we don't use any custom queries.
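For reference, a minimal sketch of that pattern (not the poster's actual code), assuming a placeholder host `dagit.example.internal` on port 80 and a placeholder job name `my_job`; the status enum import matches recent Dagster releases and may be named differently on older 0.14.x versions:

```python
# Hedged sketch of the described usage: submit a job through dagit's GraphQL
# API, then poll its status once a second until it reaches a terminal state.
import time

from dagster import DagsterRunStatus
from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagit.example.internal", port_number=80)

# Kick off the job (placeholder name, empty run config).
run_id = client.submit_job_execution("my_job", run_config={})

TERMINAL_STATES = {
    DagsterRunStatus.SUCCESS,
    DagsterRunStatus.FAILURE,
    DagsterRunStatus.CANCELED,
}
while client.get_run_status(run_id) not in TERMINAL_STATES:
    time.sleep(1)  # one status request per running job per second, as described
```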
a
hmmm, nothing surprising being fetched in those
> we send a request for every running job once a second

I assume this code is pretty straightforward? No possibility these polling requests are running longer than expected?
m
Nope, after the jobs are finished the polling stops 🙂