# ask-community
b
Hi, how do we prevent unbounded growth of our PostgreSQL Dagster backend? The event_logs table in our production environment grew by 150 GB in one month. We're heavily dependent on Dagster now for many of our ETL pipelines.
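For anyone sizing up the same problem, a quick way to check the on-disk footprint (indexes and TOAST included), assuming the default table name event_logs:
select pg_size_pretty(pg_total_relation_size('event_logs'));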
a
https://github.com/dagster-io/dagster/discussions/12047. Reducing use of context.log, or any related log capturing that reports into the event stream (and writes to the event log table), can help.
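To see what's actually driving the growth, a breakdown by event type can help; a sketch assuming the standard event_logs columns (event holds the serialized payload):
-- row count and serialized payload size per event type
select dagster_event_type, count(*) as n,
       pg_size_pretty(sum(pg_column_size(event))) as payload_size
from event_logs
group by dagster_event_type
order by n desc;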
b
@alex thank you for your reply. This might help slow down the growth, so we'll definitely make changes based on this recommendation. But to stop the unbounded growth we also need a way to archive old event_logs entries without losing information about materialized asset partitions. I should mention that we rely on time-partitioned assets quite a bit, so deleting old runs is not an option. Is there anything you can recommend to that end?
We need a way to archive/delete old event_logs entries without losing information about materialized asset partitions. Can anyone advise on whether the solution proposed here, or running a SQL query directly against the event_logs table, will impact Dagster's "knowledge" of materialized asset partitions?
a
ya, deleting entire runs will remove the asset materialization events, which currently directly power the knowledge of which partitions have been materialized. we are early on in planning some changes to replicate that information out of the event_logs table to address this. in the meantime, if you are willing to target with SQL directly, the events you need to leave alone in event_logs:
• have dagster_event_type = 'ASSET_MATERIALIZATION' (or 'ASSET_OBSERVATION' if you are using those)
• have non-null values for asset_key
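Before deleting anything, it can help to preview what must survive; a sanity-check sketch against those criteria:
-- rows that power partition materialization status; leave these alone
select count(*)
from event_logs
where dagster_event_type in ('ASSET_MATERIALIZATION', 'ASSET_OBSERVATION')
  and asset_key is not null;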
b
Awesome, thank you for the SQL tip, Alex!
c
I took a shot at translating that to Postgres SQL:
-- drop events older than 7 days that are not tied to an asset
delete from event_logs
where (dagster_event_type is null
       or dagster_event_type not in ('ASSET_OBSERVATION', 'ASSET_MATERIALIZATION'))
  and asset_key is null
  and timestamp < (now() - interval '7 days');
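If you're nervous about running that cold, one option is a transactional dry run (a sketch; the reported row count is what you want to inspect):
begin;
-- same delete as above; note how many rows it reports
delete from event_logs
where (dagster_event_type is null
       or dagster_event_type not in ('ASSET_OBSERVATION', 'ASSET_MATERIALIZATION'))
  and asset_key is null
  and timestamp < (now() - interval '7 days');
rollback;  -- switch to commit once satisfied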
You can follow that with additional cleanup for historical runs, but it needs a condition that preserves the latest events for each (asset_key, partition), so you don't accidentally lose sight of long-standing failures.
-- using CTEs for faster deletes; keep 30 days of history before the last materialization
with watermarks as (
  -- latest materialization/observation per (asset_key, partition)
  select asset_key, partition, max(timestamp) as max_timestamp
  from event_logs
  where dagster_event_type in ('ASSET_OBSERVATION', 'ASSET_MATERIALIZATION')
  group by asset_key, partition
),
rows_to_delete as (
  -- asset events more than 30 days older than that watermark
  select id from event_logs l inner join watermarks on (
    (l.asset_key = watermarks.asset_key)
    and (l.partition is not distinct from watermarks.partition)
    and (l.timestamp < watermarks.max_timestamp - interval '30 days')
  )
)
delete from event_logs
using rows_to_delete
where event_logs.id = rows_to_delete.id;
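Worth noting: DELETE alone leaves dead tuples behind and doesn't shrink the file on disk, so after a big purge you'd typically follow up with a vacuum (a sketch):
-- reclaims dead tuples for reuse and refreshes stats (doesn't return disk space to the OS)
vacuum (verbose, analyze) event_logs;
-- vacuum full event_logs;  -- rewrites the table and returns space, but takes an exclusive lock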
b
Thanks Chris. Just for the benefit of others, here is the exact query we're using (IS DISTINCT FROM allows us to match on NULL as in your query):
DELETE FROM event_logs
WHERE timestamp < NOW() - INTERVAL '14 days'
  AND asset_key IS NULL
  AND dagster_event_type IS DISTINCT FROM 'ASSET_MATERIALIZATION'
  AND dagster_event_type IS DISTINCT FROM 'ASSET_OBSERVATION';
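If you want this to run unattended, one option is to schedule it in the database itself; a sketch assuming the pg_cron extension is installed (the job name here is made up):
-- run the cleanup every day at 03:00
SELECT cron.schedule(
  'dagster-event-log-cleanup',  -- hypothetical job name
  '0 3 * * *',
  $$DELETE FROM event_logs
    WHERE timestamp < NOW() - INTERVAL '14 days'
      AND asset_key IS NULL
      AND dagster_event_type IS DISTINCT FROM 'ASSET_MATERIALIZATION'
      AND dagster_event_type IS DISTINCT FROM 'ASSET_OBSERVATION'$$
);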