Mathijs Pieters04/11/2022, 10:00 AM
Our deployments are slowly eating up working memory. Initially the pods have about 50% free working memory, but this continuously decreases over time. The decrease is more pronounced when more jobs are running. This seems to suggest that job-related artifacts are kept in memory (e.g. logs). For the storage databases (event_log, run_storage, schedule_storage) we use Postgres. For the compute logs we use the noop compute log manager, and we don't persist the Python logs. Additionally, we have set the environment variable, thus the event log storage should not be kept in-memory. Any ideas what might be causing this memory decrease? 🙂
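(For context, the "noop compute log manager" mentioned above is typically enabled via the instance's dagster.yaml. This is a sketch of what that section usually looks like, not the poster's actual config:)

```yaml
# Assumed dagster.yaml fragment: NoOpComputeLogManager discards compute logs
# instead of capturing or persisting them.
compute_logs:
  module: dagster.core.storage.noop_compute_log_manager
  class: NoOpComputeLogManager
```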
sandy04/11/2022, 3:28 PM
prha04/11/2022, 4:02 PM
alex04/11/2022, 4:41 PM
> Any ideas what might be causing this memory decrease?
We do cache artifacts related to job structure and metadata in memory, but everything else should be a function of requests made to the server.
> 50% free working memory, but this continuously decreases over time
What are the absolute values here for reference? What further details can you share?
> The decrease is more pronounced when more jobs are running
Is usage of the webserver increased during this time as well, i.e. users watching their actively executing jobs?
Mathijs Pieters04/12/2022, 7:32 AM
In terms of absolute values, we have 3 replicas of the Dagit deployment. Each requests 100m CPU and 128Mi memory, with a limit of 200m CPU and 256Mi memory. I think this might be on the low side, but considering that the memory decrease is rather consistent, increasing the requested resources will only help for some time. See the image for the available memory of a single Dagit pod. Generally we are not actively observing the jobs; the more pronounced decrease during running jobs definitely also occurs when we are not using the UI.
> We do cache artifacts related to job structure and metadata in memory, but everything else should be a function of requests made to the server.
Do you have a (rough) estimate of the memory footprint of these artifacts? Let me know if you need any more details 🙏
alex04/12/2022, 4:08 PM
> Do you have a (rough) estimate of the memory footprint of these artifacts?
It's a function of your workspace — the number and complexity of all of your jobs, schedules, etc. I don't have a rough ballpark top of mind. Are the user code deployments constant through this time, or are you redeploying them? The graph you have here does seem to indicate a leak of some kind. I don't observe the same pattern in our internal deployments. This will be difficult to track down if we can't reproduce it.
> Generally we are not actively observing the jobs
One thing to note is that before , pages would poll even if the tabs were not focused, so someone with a random background tab could be causing traffic.
> the more pronounced decrease during running jobs definitely also occurs when we are not using the UI
What's kicking off the job? Unless it's manually from Dagit, the webserver should not be involved in the execution at all. Are you certain this memory graph is just for the Dagit pod, or is it the underlying node?
Mathijs Pieters04/14/2022, 10:41 AM
> Are the user code deployments constant through this time or are you redeploying them?
They are almost constant, at most one redeployment in a day.
> One thing to note is that before , pages would poll even if the tabs were not focused, so someone with a random background tab could be causing traffic.
That is good to know! Thanks for the heads up 🙂
> Do you or anyone on your team have experience using memory profiling tools?
I have looked into some general profiling tools like , and also this Python-specific profiler. The results showed that it's the general Python process that requires more and more virtual memory. However, I did not find anything specific that could help me pinpoint the problem.
> What's kicking off the job?
We launch the jobs, generally using the Python GraphQL client. And I'm sure it's the memory consumption of the specific pod (and even the Python process in the pod). I understand that this is also very difficult for you to do anything about, since you cannot reproduce the problem. Although the situation is not ideal, it's not causing immediate problems, so I will let it rest for the time being. Thanks a lot for your help, and please let me know if you ever find something that might be causing these issues 💪💪
alex04/14/2022, 2:36 PM
> We launch the jobs, generally using the Python GraphQL client
Ok interesting, those will be generating requests against Dagit. Can you go into more detail about what interactions you do from the GraphQL client? Do you do any custom queries?
Mathijs Pieters04/14/2022, 2:42 PM
• submit_job_execution: we use this to start the job. At the moment we have around 6 different jobs that we execute from here.
• get_run_status: we use this to get the status of running jobs; we send a request for every running job once a second. Approximately, these jobs run for 1 minute.
So at the moment we don't use any custom queries.
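(The per-second polling described above can be sketched roughly as follows. The `wait_for_run` helper, its parameters, and the `.value` handling of the status enum are assumptions for illustration, not the poster's actual code; `client` stands for something exposing the `get_run_status(run_id)` method of the dagster-graphql client. A bounded timeout also guards against polls running longer than expected:)

```python
import time

# Terminal run states; polling stops once one of these is reached.
TERMINAL = {"SUCCESS", "FAILURE", "CANCELED"}

def wait_for_run(client, run_id, poll_interval=1.0, timeout=120.0):
    """Poll the run status once per interval until it reaches a terminal state.

    `client.get_run_status(run_id)` is assumed to issue one GraphQL request
    per call and return an enum-like object whose .value is e.g. "SUCCESS".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.get_run_status(run_id)  # one request per poll
        if status.value in TERMINAL:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
```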
alex04/14/2022, 3:08 PM
> we send a request for every running job once a second
I assume this code is pretty straightforward? No possibility these polling requests are running longer than expected?
Mathijs Pieters04/14/2022, 3:10 PM