Danny Jackowitz
02/16/2022, 6:17 PM
event_logs and compute_logs. We’ve encountered a case where every ~12m our Dagster database CPUs are nearly pegged for ~10m, and we’ve quickly tracked it down to the following query:
SELECT event_logs.id, event_logs.event
FROM event_logs ORDER BY event_logs.timestamp DESC, event_logs.id DESC
LIMIT ?
I see a fix for this just got merged yesterday(!) (there’s no index on timestamp, so makes sense that the current query is so, so slow):
https://github.com/dagster-io/dagster/pull/6620
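(For anyone else hitting this before the fix ships: an index that matches the sort, something like the sketch below, should let the database serve the ORDER BY ... LIMIT without scanning the whole table. I haven’t checked whether this is what the PR actually adds, and the index name here is made up.)

-- sketch only: an index matching the ORDER BY so the LIMIT can stop early;
-- CONCURRENTLY avoids blocking writes while it builds (Postgres)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_event_logs_timestamp_id
ON event_logs (timestamp DESC, id DESC);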
While investigating, though, I noticed that our event_logs table has accumulated many millions of rows, seemingly retained forever. I see this related issue as well:
https://github.com/dagster-io/dagster/issues/4497
We’ve also noticed similar seemingly-infinite retention of our compute_logs (we use S3).
Finally, my question. Until such cleanup is a first-class feature of Dagster, what is safe for us to prune out-of-band? Can we just periodically delete rows older than a given timestamp from event_logs and use S3 lifecycle rules for compute_logs? Or will attempting to do so violate internal consistency assumptions made by Dagster and result in a bad time? Thanks for any guidance here.
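(Concretely, for event_logs I’m imagining a periodic job running something along these lines; the 90-day window and the batch size are just placeholders:)

-- sketch only: prune event log rows older than a retention window,
-- deleting in batches so a single statement does not hold locks for too long
DELETE FROM event_logs
WHERE id IN (
  SELECT id
  FROM event_logs
  WHERE event_logs.timestamp < NOW() - INTERVAL '90 days'
  ORDER BY id
  LIMIT 10000
);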
prha
02/16/2022, 6:59 PM
If the underlying S3 objects are gone, you’ll just see a “no compute logs available” message in Dagit.
The event log table powers two main views: the individual Run view, and the asset details view. The first one is self-explanatory: we need to fetch the events to display on the run page, including all of the op timing. The second one is all the cross-run materialization events going back in time.
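(Very roughly, and glossing over the storage layer, the asset details view boils down to something like the query below; the dagster_event_type and asset_key columns are shown purely for illustration.)

-- illustrative only: cross-run materialization events for a single asset, newest first
SELECT event_logs.id, event_logs.event
FROM event_logs
WHERE event_logs.dagster_event_type = 'ASSET_MATERIALIZATION'
  AND event_logs.asset_key = ?
ORDER BY event_logs.timestamp DESC, event_logs.id DESC;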
Danny Jackowitz
02/16/2022, 7:09 PM
event_logs and compute_logs are strictly for “human” consumption via the Dagit UI? As in, Dagster isn’t using them to make any scheduling decisions? That’s the particular case that I’m worried about (more for event_logs), where we DELETE some old rows and then scheduling goes haywire because Dagster needs the full event history from the beginning of time to decide what to do. (For compute_logs the concern is more whether there’s also some associated metadata, so Dagster/Dagit thinks there should be compute logs and then fails when it can’t find them because the S3 objects were deleted, but not the metadata.)
prha
02/16/2022, 7:11 PM

Danny Jackowitz
02/16/2022, 7:18 PM

prha
02/16/2022, 7:19 PM
Danny Jackowitz
02/16/2022, 7:23 PM
(… the event_logs query). That said, we’re still seeing many GiB/month of growth in event_logs, and presumably that will only ramp up further as we migrate more jobs (we’re currently only running a tiny fraction within Dagster), so having an official, out-of-the-box way to manage retention would be a great feature to have.
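(For anyone wanting to check their own instance, the on-disk size is easy to sample with something like the query below; Postgres-specific, and the table name assumes the default schema.)

-- Postgres: total on-disk size of the event_logs table, including its indexes
SELECT pg_size_pretty(pg_total_relation_size('event_logs')) AS event_logs_total_size;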
prha
02/16/2022, 7:27 PM

Danny Jackowitz
02/16/2022, 7:39 PM