Danny Jackowitz

02/16/2022, 6:17 PM
👋 Question regarding retention for event_logs and compute_logs. We’ve encountered a case where every ~12m our Dagster database CPUs are nearly pegged for ~10m and have quickly tracked it down to the following query:
SELECT event_logs.id, event_logs.event
FROM event_logs ORDER BY event_logs.timestamp DESC, event_logs.id DESC
LIMIT ?
I see a fix for this just got merged yesterday(!) (there’s no index on timestamp, so makes sense that the current query is so, so slow): https://github.com/dagster-io/dagster/pull/6620
While investigating, though, I noticed that our event_logs table has accumulated many millions of rows, seemingly retained forever. I see this related issue as well: https://github.com/dagster-io/dagster/issues/4497 We’ve also noticed similar seemingly-infinite retention of our compute_logs (we use S3).
Finally, my question. Until such cleanup is a first-class feature of Dagster, what is safe for us to prune out-of-band? Can we just periodically delete rows older than a given timestamp from event_logs and use S3 lifecycle rules for compute_logs? Or will attempting to do so violate internal consistency assumptions made by Dagster and result in a bad time? Thanks for any guidance here.
Also noting that I have searched here and have seen the previous threads regarding how to use Dagster itself to periodically run the delete operations, but my question here is less the “how?” and more about “is this a safe operation that we should be doing?”.
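For illustration, a minimal sketch of the kind of out-of-band pruning being asked about here, assuming a Postgres-backed Dagster instance and the event_logs schema implied by the query above (id, event, timestamp); the connection string and retention window are placeholders:

import psycopg2

# Assumptions: Postgres storage, an event_logs table with a timestamp column
# (as in the query above), and placeholder connection details.
RETENTION = "90 days"  # hypothetical retention window

conn = psycopg2.connect("dbname=dagster user=dagster host=localhost")  # placeholder DSN
try:
    with conn, conn.cursor() as cur:
        # Delete event log rows older than the retention window. On a table with
        # millions of rows you would likely batch this to avoid long-held locks.
        cur.execute(
            "DELETE FROM event_logs WHERE timestamp < now() - interval %s",
            (RETENTION,),
        )
        print(f"deleted {cur.rowcount} rows")
finally:
    conn.close()

Whether deleting these rows is actually safe is what the replies below address.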

prha

02/16/2022, 6:59 PM
The answer to this kind of depends on your jobs and what views you’re relying on in dagit. It’s generally pretty safe to delete compute logs if you don’t have a need to read the stdout/stderr. In dagit, this will result in just a “no compute logs available” message. The event log table powers two main views: the individual Run view and the asset details view. The first one is self-explanatory: we need to fetch the events to display on the run page, including all of the op timing. The second one shows all of the cross-run materialization events going back in time.

Danny Jackowitz

02/16/2022, 7:09 PM
Thanks, @prha. I think that answers my question, but just to confirm, both event_logs and compute_logs are strictly for “human” consumption via the Dagit UI? As in, Dagster isn’t using them to make any scheduling decisions? That’s the particular case that I’m worried about (more for event_logs), where we DELETE some old rows and then scheduling goes haywire because Dagster needs the full event history from the beginning of time to decide what to do.
(For compute_logs the concern is more whether there’s also some associated metadata, so Dagster/Dagit thinks there should be compute logs and then fails when it can’t find them because the S3 objects were deleted, but not the metadata.)
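For reference, a minimal sketch of the S3 lifecycle approach mentioned above, assuming compute logs are written under a bucket/prefix of your choosing; the bucket name, prefix, and retention period here are all placeholders to be matched to your compute log manager configuration:

import boto3

s3 = boto3.client("s3")

# Assumption: compute logs live under this (placeholder) bucket/prefix; verify
# against your configured compute log manager before applying anything.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dagster-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-compute-logs",
                "Filter": {"Prefix": "dagster/compute-logs/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 90},  # hypothetical retention window
            }
        ]
    },
)

This only removes the S3 objects; per the replies in this thread, Dagit just shows a “no compute logs available” message when it can’t fetch them.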

prha

02/16/2022, 7:11 PM
For compute logs, I think we’re robust to failures in fetching them from S3.
For event_logs, they shouldn’t ordinarily affect schedules unless your schedules have custom logic to read from them. We do have schedule reconciliation, so that if the last 5 schedule ticks did not successfully fire (e.g. the daemon was stopped for some reason), we try to “catch up” by kicking off runs for the last 5 ticks. But that mechanism reads from run storage, not event log storage.
One common instigation method that does read from the event log is asset sensors. But they are typically querying for new materialization events for an asset. If you have an asset sensor that is listening for asset materializations and you delete the corresponding event logs before the sensor is able to fire, then you might skip a job execution.
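For context, a minimal sketch of the kind of asset sensor being described, i.e. one that reads new materialization events for an asset out of the event log and kicks off a job; the asset key and job here are hypothetical:

from dagster import AssetKey, RunRequest, asset_sensor, job, op


@op
def process_update():
    ...


@job
def downstream_job():
    process_update()


# Fires when a new AssetMaterialization event for "my_table" appears in the
# event log. If those events are deleted before the sensor evaluates them,
# the sensor never sees them and this run would be skipped.
@asset_sensor(asset_key=AssetKey("my_table"), job=downstream_job)
def my_table_sensor(context, asset_event):
    yield RunRequest(run_key=context.cursor)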

Danny Jackowitz

02/16/2022, 7:18 PM
Ohh, that’s good to be aware of, I’ll definitely have to check for that case before actually deleting anything. Thanks, @prha.

prha

02/16/2022, 7:19 PM
All that being said, aside from the sheer storage, we are striving to make dagster / dagit performant so that you will not have to delete metadata in order to keep it functionally operational
(hence adding the correct indices to keep queries performant)

Danny Jackowitz

02/16/2022, 7:23 PM
Great to hear! I think https://github.com/dagster-io/dagster/pull/6620 will help with the immediate issue that surfaced this for us (that 100%-CPU-for-10-minutes event_logs query). That said, we’re still seeing many GiB/month of growth in event_logs, and presumably that will only ramp up further as we migrate more jobs (we’re currently only running a tiny fraction within Dagster), so having an official, out-of-the-box way to manage retention would be a great feature to have.
(It is also quite possible that we are doing something incorrect that is resulting in that growth rate, of course. 😅)
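As a quick sanity check on where that growth is coming from, something like the following reports the on-disk size and row count of event_logs, assuming a Postgres-backed instance (connection details are placeholders):

import psycopg2

# Assumption: Postgres-backed Dagster storage; placeholder connection details.
conn = psycopg2.connect("dbname=dagster user=dagster host=localhost")
try:
    with conn.cursor() as cur:
        # Total on-disk size of event_logs, including its indexes and TOAST data.
        cur.execute("SELECT pg_size_pretty(pg_total_relation_size('event_logs'))")
        (size,) = cur.fetchone()
        cur.execute("SELECT count(*) FROM event_logs")
        (count,) = cur.fetchone()
        print(f"event_logs: {count} rows, {size}")
finally:
    conn.close()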

prha

02/16/2022, 7:27 PM
We have this issue that talks about pruning old runs / event logs, that you could upvote and track: https://github.com/dagster-io/dagster/issues/4100

Danny Jackowitz

02/16/2022, 7:39 PM
Upvoted.