# ask-community
m
Hi team 👋 We have recently deployed Dagster to our k8s cluster using Helm. The deployment went very smoothly and we are really loving the product! One thing we noticed is that our `dagster-dagit` deployments are slowly eating up their working memory. Initially the pods have about 50% free working memory, but this continuously decreases over time. The decrease is more pronounced when more jobs are running, which seems to suggest that job-related artifacts (e.g. logs) are kept in memory. For the storage databases (event_log, run_storage, schedule_storage) we use Postgres. For the compute logs we use the noop compute log manager, and we don't persist the Python logs. Additionally, we have set the `DAGSTER_HOME` environment variable, so the event log storage should not be kept in-memory. Any ideas what might be causing this memory decrease? 🙂
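As a sanity check for that setup, here is a minimal sketch (not from the thread) of how one could confirm, from inside a pod, what the instance actually resolved from the mounted `dagster.yaml`. `DagsterInstance.get()` is the documented loader; the exact accessor names below are assumptions worth checking against your installed Dagster version:

```python
# Hedged sketch: confirm that the running instance picked up Postgres-backed
# storages and the no-op compute log manager from DAGSTER_HOME/dagster.yaml.
import os

from dagster import DagsterInstance

print(os.environ["DAGSTER_HOME"])  # raises KeyError if the variable is not set

instance = DagsterInstance.get()   # loads the instance config from DAGSTER_HOME
print(type(instance.run_storage))          # expected: a Postgres run storage
print(type(instance.event_log_storage))    # expected: a Postgres event log storage
print(type(instance.compute_log_manager))  # expected: NoOpComputeLogManager
```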
s
@prha - any idea what could be causing dagit to increase its memory consumption over time?
p
Hmm, is this on the latest version of Dagster? We’d need to spend some time digging to see what might be leaking memory (cc @alex)
a
> Any ideas what might be causing this memory decrease?

We do cache artifacts related to job structure and metadata in memory, but everything else should be a function of requests made to the server.
> 50% free working memory, but this continuously decreases over time

What are the absolute values here for reference? What further details can you share?
> The decrease is more pronounced when more jobs are running

Is usage of the webserver increased during this time as well, i.e. users watching their actively executing jobs?
m
Hi all, thanks for investing your time! Currently we are running Dagster `0.14.3`. In terms of absolute values, we have 3 replicas of the Dagit deployment. Each requests 100m CPU and 128Mi memory, with a limit of 200m CPU and 256Mi memory. I think this might be on the low side, but considering that the memory decrease is rather consistent, increasing the requested resources will only help for some time. See the image for the available memory of a single Dagit pod. Generally we are not actively observing the jobs; the more pronounced decrease during running jobs definitely also occurs when we are not using the UI.
> We do cache artifacts related to job structure and metadata in memory, but everything else should be a function of requests made to the server.

Do you have a (rough) estimate of the memory footprint of these artifacts? Let me know if you need any more details 🙏
a
> Do you have a (rough) estimate of the memory footprint of these artifacts?

It's a function of your workspace: the number and complexity of all of your jobs, schedules, etc. I don't have a rough ballpark top of mind. Are the user code deployments constant through this time, or are you redeploying them? The graph you have here does seem to indicate a leak of some kind; I don't observe the same pattern in our internal deployments. This will be difficult to track down if we can't reproduce it.
> Generally we are not actively observing the jobs

One thing to note is that before `0.14.4`, pages would poll even if the tabs were not focused, so someone with a random background tab could be causing traffic.
Do you or anyone on your team have experience using memory profiling tools?
> the more pronounced decrease during running jobs definitely also occurs when we are not using the UI

What's kicking off the job? Unless it's manually from dagit, the webserver should not be involved in the execution at all. Are you certain this memory graph is just for the dagit pod, or is it the underlying node?
m
> Are the user code deployments constant through this time or are you redeploying them?

They are almost constant, at most one redeployment in a day.
> One thing to note is that before `0.14.4`, pages would poll even if the tabs were not focused, so someone with a random background tab could be causing traffic.

That is good to know! Thanks for the heads-up 🙂
> Do you or anyone on your team have experience using memory profiling tools?

I have looked into some general profiling tools like `ps` and `top`, and also this Python-specific profiler. The results showed that it's the general Python process that requires more and more virtual memory. However, I did not find anything specific that could help me pinpoint the problem.
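For what it's worth, a minimal sketch of how the standard library's `tracemalloc` could help pinpoint growth, under the assumption that you can run it inside the suspect Python process for a while; diffing two snapshots lists the call sites whose allocations grew in between:

```python
# Hedged sketch: compare two tracemalloc snapshots to find the allocation
# sites that grow over time. This must run inside the leaking Python process.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()

# ... let the process handle traffic for a while, then take a second snapshot ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)  # the ten call sites with the largest memory growth
```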
> What's kicking off the job?

We launch the jobs generally using the Python GraphQL client. And I'm sure it's the memory consumption of the specific pod (and even the Python process in the pod). I understand that this is also very difficult for you to do anything about, since you cannot reproduce the problem. Although the situation is not ideal, it's also not causing immediate problems, so I will let the problem rest for the time being. Thanks a lot for your help, and please let me know if you ever find something that might be causing these issues 💪💪
a
> We launch the jobs generally using the Python GraphQL client

Ok interesting, those will be generating requests against dagit. Can you go into more detail about what interactions you do from the GraphQL client? Do you do any custom queries?
m
We use the two available endpoints from the provided `DagsterGraphQLClient`:
• `submit_job_execution`: we use this to start the job. At the moment we have around 6 different jobs that we execute from here.
• `get_run_status`: we use this to get the status of running jobs; we send a request for every running job once a second. Approximately, these jobs run for 1 minute.

So at the moment we don't use any custom queries.
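For reference, a minimal sketch of that pattern (not the poster's actual code), assuming a placeholder host `dagit.example.internal` on port 80 and a placeholder job name `my_job`; the status enum import matches recent Dagster releases and may be named differently on older 0.14.x versions:

```python
# Hedged sketch of the described usage: submit a job through dagit's GraphQL
# API, then poll its status once a second until it reaches a terminal state.
import time

from dagster import DagsterRunStatus
from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagit.example.internal", port_number=80)

# Kick off the job (placeholder name, empty run config).
run_id = client.submit_job_execution("my_job", run_config={})

TERMINAL_STATES = {
    DagsterRunStatus.SUCCESS,
    DagsterRunStatus.FAILURE,
    DagsterRunStatus.CANCELED,
}
while client.get_run_status(run_id) not in TERMINAL_STATES:
    time.sleep(1)  # one status request per running job per second, as described
```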
a
hmmm, nothing surprising being fetched in those
> we send a request for every running job once a second

I assume this code is pretty straightforward? No possibility these polling requests are running longer than expected?
m
Nope, after the jobs are finished the polling stops 🙂