# ask-community
a
Hi folks, my company has been using Dagster for about a year now and we're considering moving some of our workloads out of Dagster because it doesn't feel like the right fit. Hoping for some validation of that feeling or some guidance on what we can try to get around our biggest pain point.

We run about 10k jobs per day in our Dagster instance, which generates about 100k - 150k `run_tags` rows per day and so many `event_logs` entries that querying them takes a very long time. We self-host and use ECS / RDS / S3 to do so. We have found that we need to delete from the `runs`, `run_tags`, and `event_logs` tables in RDS in order to keep the volume of data small enough that Dagit/webserver can query it without timing out. We keep 2 weeks' worth of data in those tables and delete the rest on a nightly basis (in a Dagster job that targets its own DB). This cleanup job used to take 5-10 minutes, but now takes about an hour to run each night, due to the volume of data it is responsible for deleting. It feels like this won't scale very well if we continue to push the Dagster DB to its limits. The DB is a db.m6i.xlarge instance in RDS.

One of the reasons we're considering going a different direction is that we have 2 jobs that run very frequently in order to poll an external data source for new data and then run ML models on that new data. The expectation is that the ML model output will reflect the most up-to-date data in near-real-time. Our feeling is that since these jobs run so frequently, they make more sense as continuously-running services reading off of a queue, rather than discrete job runs. Thoughts?
đź‘€ 1
a
• What dagster version are you on, and have you been running `dagster instance migrate` to keep up on schema / index changes?
• Do you have the retention settings configured for cleaning up the “ticks” tables? https://docs.dagster.io/deployment/dagster-instance#data-retention
• If you have Performance Insights enabled on RDS, you can see the most expensive queries and share an `explain (analyze, buffers)` to see how they are being executed. It's possible that there are indexes that would improve the situation that we should include by default, or that would make sense to apply yourself if they are unique to specific usage characteristics. You could also use the queries that are timing out as the source for those explains.

> make more sense as continuously-running services reading off of a queue, rather than discrete job runs
Yeah, it does seem like there may be a more optimal set-up for this type of work. As you are experiencing, Dagster records a lot of explicit event / metadata to provide a rich experience for understanding what has happened, which may not be the right trade-off when the volume greatly exceeds how often it's useful.
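For the explain piece, a minimal sketch of capturing one of those plans from Python (the DSN and the query below are placeholders; substitute one of the queries that is actually timing out, e.g. pulled from Performance Insights or `pg_stat_activity`):

```python
import psycopg2

# Placeholder DSN for the Dagster RDS database -- replace with your own.
DSN = "postgresql://dagster:***@your-rds-host:5432/dagster"

# Stand-in for a slow query; paste in one that is actually timing out.
SLOW_QUERY = """
    SELECT *
    FROM event_logs
    WHERE run_id = %s
    ORDER BY id DESC
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        # EXPLAIN output comes back as one text row per plan line.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + SLOW_QUERY, ("some-run-id",))
        for (line,) in cur.fetchall():
            print(line)
```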
p
I don’t know how your cleanup job is implemented, but if it’s issuing a lot of requests and not reusing the database connection somehow, then this may be caused by the lack of application-level connection pooling. Adding an external connection pool might help.

Also related is this comment I made a while ago about how the vast majority of the requests sent to my instance were not productive, which is probably caused by the same thing.

Lastly, for Postgres (not sure if that’s what you’re using) running on SSDs, it is usually a good idea to lower the `random_page_cost` setting (here). Not sure if RDS does this by default, but Cloud SQL (GCP) doesn’t. I did this a while ago, and if I remember correctly, it made a good difference.
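For reference, a rough sketch of checking and lowering `random_page_cost` from Python. The DSN and database name are placeholders, 1.1 is just the commonly suggested value for SSD-backed storage, and on RDS the same change can also be made through the DB parameter group:

```python
import psycopg2

# Placeholder DSN -- replace with your own.
DSN = "postgresql://dagster:***@your-rds-host:5432/dagster"

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW random_page_cost")
        print("current random_page_cost:", cur.fetchone()[0])  # Postgres default is 4

        # Lower the planner's random-I/O cost estimate for SSD-backed storage.
        # Assumes the connected role owns the `dagster` database (e.g. the RDS
        # master user); only new sessions pick up the change.
        cur.execute("ALTER DATABASE dagster SET random_page_cost = 1.1")
```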
a
we're on dagster version 1.3.11, so a couple months out of date, but not too far. we have been keeping up with `dagster instance migrate` commands whenever we do version updates. we don't have the `ticks` retention settings configured; can definitely look into that. we also don't have performance monitoring enabled for this instance, I can look into that. and to answer Philippe's question, it is a Postgres DB on version 11.18
if either of you were curious, here's how the DB cleanup job is implemented:
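roughly, it's a loop that calls `delete_run` for each expired run. The sketch below is just an illustration rather than the actual code (the job name, 14-day window, and chunk size are placeholders):

```python
from datetime import datetime, timedelta, timezone

from dagster import OpExecutionContext, RunsFilter, job, op

RETENTION = timedelta(days=14)
CHUNK = 500


@op
def purge_old_runs(context: OpExecutionContext) -> None:
    """Delete runs older than the retention window, one run at a time."""
    instance = context.instance
    cutoff = datetime.now(timezone.utc) - RETENTION

    total = 0
    while True:
        old_runs = instance.get_runs(
            filters=RunsFilter(created_before=cutoff), limit=CHUNK
        )
        if not old_runs:
            break
        for run in old_runs:
            # delete_run removes the row in `runs` plus the associated
            # `run_tags` and `event_logs` entries for that run.
            instance.delete_run(run.run_id)
        total += len(old_runs)
        context.log.info(f"deleted {total} runs created before {cutoff}")


@job
def db_cleanup_job():
    purge_old_runs()
```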
p
I’m not a dagster engineer, but according to this, `delete_run` requires a connection on every call. As per the issue mentioned above, since Dagster doesn’t do connection pooling, this will likely need to open a new connection every time. On top of that, Dagster issues a lot of queries before doing useful work on new connections, making this prohibitively expensive. So yeah, I suspect that the issue you’re having is caused by the lack of connection pooling. IMO, if you batch your queries, you should see a dramatic performance difference.
Adding an external connection pool might help as well…
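To make the batching idea concrete, a rough sketch of deleting in chunks over a single connection. The DSN is a placeholder, the column names assume the stock Dagster Postgres schema (`runs.create_timestamp`, `run_tags.run_id`, `event_logs.run_id`), and it's worth testing against a copy of the DB first:

```python
from datetime import datetime, timedelta, timezone

import psycopg2

# Placeholder DSN -- replace with your own.
DSN = "postgresql://dagster:***@your-rds-host:5432/dagster"
CUTOFF = datetime.now(timezone.utc) - timedelta(days=14)
BATCH = 1000

# One statement per batch: pick a chunk of expired run_ids, then delete the
# matching run_tags, event_logs, and runs rows together.
DELETE_BATCH = """
    WITH doomed AS (
        SELECT run_id
        FROM runs
        WHERE create_timestamp < %(cutoff)s
        LIMIT %(batch)s
    ),
    _tags AS (
        DELETE FROM run_tags WHERE run_id IN (SELECT run_id FROM doomed)
    ),
    _events AS (
        DELETE FROM event_logs WHERE run_id IN (SELECT run_id FROM doomed)
    )
    DELETE FROM runs WHERE run_id IN (SELECT run_id FROM doomed)
"""

with psycopg2.connect(DSN) as conn:  # a single connection for the whole cleanup
    with conn.cursor() as cur:
        while True:
            cur.execute(DELETE_BATCH, {"cutoff": CUTOFF, "batch": BATCH})
            deleted = cur.rowcount  # rows removed from `runs` in this batch
            conn.commit()
            if deleted == 0:
                break
```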