# ask-community
b
Hi, how do we prevent unbounded growth of our PostgreSQL Dagster backend? The event_logs table in our production environment grew by 150 GB in one month. We're heavily dependent on Dagster now for many of our ETL pipelines.
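For anyone sizing up the same problem, a quick way to check the on-disk footprint (indexes and TOAST included), assuming the default table name event_logs:
select pg_size_pretty(pg_total_relation_size('event_logs'));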
a
https://github.com/dagster-io/dagster/discussions/12047. Reducing use of context.log, or any related log capturing that reports into the event stream (and writes to the event log table), can help.
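To see what's actually driving the growth, a breakdown by event type can help; a sketch assuming the standard event_logs columns (event holds the serialized payload):
-- row count and serialized payload size per event type
select dagster_event_type, count(*) as n,
       pg_size_pretty(sum(pg_column_size(event))) as payload_size
from event_logs
group by dagster_event_type
order by n desc;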
b
@alex thank you for your reply. This might help slow down the growth, so we'll definitely make changes based on this recommendation. But to stop the unbounded growth we also need a way to archive old event_logs entries without losing information about materialized asset partitions. I should mention that we rely on time-partitioned assets quite a bit, so deleting old runs is not an option. Is there anything you can recommend to that end?
We need a way to archive/delete old event_logs entries without losing information about materialized asset partitions. Can anyone advise on whether the solution proposed here, or running a SQL query directly against the event_logs table, will impact Dagster's "knowledge" of materialized asset partitions?
a
ya, deleting entire runs will remove the asset materialization events, which currently directly power the knowledge of which partitions have been materialized. we are early on in planning some changes to replicate that information out of the event_logs table to address this. in the meantime, if you are willing to target with SQL directly, the events you need to leave alone in event_logs:
• have dagster_event_type = 'ASSET_MATERIALIZATION' (or 'ASSET_OBSERVATION' if you are using those)
• have non-null values for asset_key
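Before deleting anything, it can help to preview what must survive; a sanity-check sketch against those criteria:
-- rows that power partition materialization status; leave these alone
select count(*)
from event_logs
where dagster_event_type in ('ASSET_MATERIALIZATION', 'ASSET_OBSERVATION')
  and asset_key is not null;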
b
Awesome, thank you for the SQL tip, Alex!
c
I took a shot at translating that to Postgres SQL:
-- drop events older than 7 days that are not tied to an asset
delete from event_logs
where (dagster_event_type is null
       or dagster_event_type not in ('ASSET_OBSERVATION', 'ASSET_MATERIALIZATION'))
  and asset_key is null
  and timestamp < (now() - interval '7 days');
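If you're nervous about running that cold, one option is a transactional dry run (a sketch; the reported row count is what you want to inspect):
begin;
-- same delete as above; note how many rows it reports
delete from event_logs
where (dagster_event_type is null
       or dagster_event_type not in ('ASSET_OBSERVATION', 'ASSET_MATERIALIZATION'))
  and asset_key is null
  and timestamp < (now() - interval '7 days');
rollback;  -- switch to commit once satisfied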
You can follow that with additional cleanup for historical runs, but it needs a condition that preserves the latest events for each (asset_key, partition), so you don't accidentally lose sight of long-standing failures.
-- using CTEs for faster deletes; keep 30 days of history before the last materialization
with watermarks as (
  -- latest materialization/observation per (asset_key, partition)
  select asset_key, partition, max(timestamp) as max_timestamp
  from event_logs
  where dagster_event_type in ('ASSET_OBSERVATION', 'ASSET_MATERIALIZATION')
  group by asset_key, partition
),
rows_to_delete as (
  -- asset events more than 30 days older than that watermark
  select id from event_logs l inner join watermarks on (
    (l.asset_key = watermarks.asset_key)
    and (l.partition is not distinct from watermarks.partition)
    and (l.timestamp < watermarks.max_timestamp - interval '30 days')
  )
)
delete from event_logs
using rows_to_delete
where event_logs.id = rows_to_delete.id;
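Worth noting: DELETE alone leaves dead tuples behind and doesn't shrink the file on disk, so after a big purge you'd typically follow up with a vacuum (a sketch):
-- reclaims dead tuples for reuse and refreshes stats (doesn't return disk space to the OS)
vacuum (verbose, analyze) event_logs;
-- vacuum full event_logs;  -- rewrites the table and returns space, but takes an exclusive lock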
b
Thanks Chris. Just for the benefit of others, here is the exact query we're using (IS DISTINCT FROM allows us to match on NULL as in your query):
DELETE FROM event_logs
WHERE timestamp < NOW() - INTERVAL '14 days'
  AND asset_key IS NULL
  AND dagster_event_type IS DISTINCT FROM 'ASSET_MATERIALIZATION'
  AND dagster_event_type IS DISTINCT FROM 'ASSET_OBSERVATION';
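If you want this to run unattended, one option is to schedule it in the database itself; a sketch assuming the pg_cron extension is installed (the job name here is made up):
-- run the cleanup every day at 03:00
SELECT cron.schedule(
  'dagster-event-log-cleanup',  -- hypothetical job name
  '0 3 * * *',
  $$DELETE FROM event_logs
    WHERE timestamp < NOW() - INTERVAL '14 days'
      AND asset_key IS NULL
      AND dagster_event_type IS DISTINCT FROM 'ASSET_MATERIALIZATION'
      AND dagster_event_type IS DISTINCT FROM 'ASSET_OBSERVATION'$$
);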