# ask-community
a
Hi folks, my company has been using Dagster for about a year now and we're considering moving some of our workloads out of Dagster because it doesn't feel like the right fit. Hoping for some validation of that feeling or some guidance on what we can try to get around our biggest pain point.

We run about 10k jobs per day in our Dagster instance, which generates about 100k - 150k `run_tags` rows per day and so many `event_logs` entries that querying them takes a very long time. We self-host and use ECS / RDS / S3 to do so. We have found that we need to delete from the `runs`, `run_tags`, and `event_logs` tables in RDS in order to keep the volume of data small enough that Dagit/webserver can query it without timing out. We keep 2 weeks' worth of data in those tables and delete the rest on a nightly basis (in a Dagster job that targets its own DB). This cleanup job used to take 5-10 minutes, but now takes about an hour to run each night, due to the volume of data it is responsible for deleting. It feels like this won't scale very well if we continue to push the Dagster DB to its limits. The DB is a db.m6i.xlarge instance in RDS.

One of the reasons we're considering going a different direction is that we have 2 jobs that run very frequently in order to poll an external data source for new data and then run ML models on that new data. The expectation is that the ML model output will reflect the most up-to-date data in near-real-time. Our feeling is that since these jobs run so frequently, they make more sense as continuously-running services reading off of a queue, rather than discrete job runs. Thoughts?
đź‘€ 1
a
• What dagster version are you on, and have you been running `dagster instance migrate` to keep up on schema / index changes?
• Do you have the retention settings configured for cleaning up the “ticks” tables? https://docs.dagster.io/deployment/dagster-instance#data-retention
• If you have Performance Insights enabled on RDS, you can see the most expensive queries and share an `explain (analyze, buffers)` to see how they are being executed. It's possible that there are indexes that would improve the situation that we should include by default, or that would make sense to apply yourself if they are unique to specific usage characteristics. You could also use the queries that are timing out as the source for those explains.

> make more sense as continuously-running services reading off of a queue, rather than discrete job runs
Yeah, it does seem like there may be a more optimal set-up for this type of work. As you are experiencing, Dagster records a lot of explicit event / metadata to provide a rich experience for understanding what has happened, which may not be the right trade-off when the volume greatly exceeds how often it's useful.
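For the explain piece, a minimal sketch of capturing one of those plans from Python (the DSN and the query below are placeholders; substitute one of the queries that is actually timing out, e.g. pulled from Performance Insights or `pg_stat_activity`):

```python
import psycopg2

# Placeholder DSN for the Dagster RDS database -- replace with your own.
DSN = "postgresql://dagster:***@your-rds-host:5432/dagster"

# Stand-in for a slow query; paste in one that is actually timing out.
SLOW_QUERY = """
    SELECT *
    FROM event_logs
    WHERE run_id = %s
    ORDER BY id DESC
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        # EXPLAIN output comes back as one text row per plan line.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + SLOW_QUERY, ("some-run-id",))
        for (line,) in cur.fetchall():
            print(line)
```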
p
I don’t know how your cleanup job is implemented, but if it’s issuing a lot of requests and not reusing the database connection somehow, then this may be caused by the lack of application-level connection pooling. Adding an external connection pool might help.

Also related is this comment I made a while ago about how the vast majority of the requests sent to my instance were not productive, which is probably caused by the same thing.

Lastly, for Postgres (not sure if that’s what you’re using) running on SSDs, it is usually a good idea to lower the `random_page_cost` setting (here). Not sure if RDS does this by default, but Cloud SQL (GCP) doesn’t. I did this a while ago, and if I remember correctly, it made a good difference.
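For reference, a rough sketch of checking and lowering `random_page_cost` from Python. The DSN and database name are placeholders, 1.1 is just the commonly suggested value for SSD-backed storage, and on RDS the same change can also be made through the DB parameter group:

```python
import psycopg2

# Placeholder DSN -- replace with your own.
DSN = "postgresql://dagster:***@your-rds-host:5432/dagster"

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW random_page_cost")
        print("current random_page_cost:", cur.fetchone()[0])  # Postgres default is 4

        # Lower the planner's random-I/O cost estimate for SSD-backed storage.
        # Assumes the connected role owns the `dagster` database (e.g. the RDS
        # master user); only new sessions pick up the change.
        cur.execute("ALTER DATABASE dagster SET random_page_cost = 1.1")
```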
a
we're on dagster version 1.3.11, so a couple months out of date, but not too far. we have been keeping up with `dagster instance migrate` commands whenever we do version updates. we don't have the `ticks` retention settings configured; can definitely look into that. we also don't have performance monitoring enabled for this instance, I can look into that. and to answer Philippe's question, it is a Postgres DB on version 11.18
if either of you were curious, here's how the DB cleanup job is implemented:
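roughly, it's a loop that calls `delete_run` for each expired run. The sketch below is just an illustration rather than the actual code (the job name, 14-day window, and chunk size are placeholders):

```python
from datetime import datetime, timedelta, timezone

from dagster import OpExecutionContext, RunsFilter, job, op

RETENTION = timedelta(days=14)
CHUNK = 500


@op
def purge_old_runs(context: OpExecutionContext) -> None:
    """Delete runs older than the retention window, one run at a time."""
    instance = context.instance
    cutoff = datetime.now(timezone.utc) - RETENTION

    total = 0
    while True:
        old_runs = instance.get_runs(
            filters=RunsFilter(created_before=cutoff), limit=CHUNK
        )
        if not old_runs:
            break
        for run in old_runs:
            # delete_run removes the row in `runs` plus the associated
            # `run_tags` and `event_logs` entries for that run.
            instance.delete_run(run.run_id)
        total += len(old_runs)
        context.log.info(f"deleted {total} runs created before {cutoff}")


@job
def db_cleanup_job():
    purge_old_runs()
```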
p
I’m not a dagster engineer, but according to this, `delete_run` requires a connection on every call. As per the issue mentioned above, since Dagster doesn’t do connection pooling, this will likely need to open a new connection every time. On top of that, Dagster issues a lot of queries before doing useful work on new connections, making this prohibitively expensive. So yeah, I suspect that the issue you’re having is caused by the lack of connection pooling. IMO, if you batch your queries, you should see a dramatic performance difference.
Adding an external connection pool might help as well…
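To make the batching idea concrete, a rough sketch of deleting in chunks over a single connection. The DSN is a placeholder, the column names assume the stock Dagster Postgres schema (`runs.create_timestamp`, `run_tags.run_id`, `event_logs.run_id`), and it's worth testing against a copy of the DB first:

```python
from datetime import datetime, timedelta, timezone

import psycopg2

# Placeholder DSN -- replace with your own.
DSN = "postgresql://dagster:***@your-rds-host:5432/dagster"
CUTOFF = datetime.now(timezone.utc) - timedelta(days=14)
BATCH = 1000

# One statement per batch: pick a chunk of expired run_ids, then delete the
# matching run_tags, event_logs, and runs rows together.
DELETE_BATCH = """
    WITH doomed AS (
        SELECT run_id
        FROM runs
        WHERE create_timestamp < %(cutoff)s
        LIMIT %(batch)s
    ),
    _tags AS (
        DELETE FROM run_tags WHERE run_id IN (SELECT run_id FROM doomed)
    ),
    _events AS (
        DELETE FROM event_logs WHERE run_id IN (SELECT run_id FROM doomed)
    )
    DELETE FROM runs WHERE run_id IN (SELECT run_id FROM doomed)
"""

with psycopg2.connect(DSN) as conn:  # a single connection for the whole cleanup
    with conn.cursor() as cur:
        while True:
            cur.execute(DELETE_BATCH, {"cutoff": CUTOFF, "batch": BATCH})
            deleted = cur.rowcount  # rows removed from `runs` in this batch
            conn.commit()
            if deleted == 0:
                break
```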