Bartek Marczak
09/01/2023, 4:20 PMevent_logs
table in our production environment grew by 150 GB in one month. We’re heavily dependent on Dagster now for many of our ETL pipelines.alex
09/01/2023, 5:27 PMcontext.log
or any related log capturing that reports in to the event stream (and writes to the event log table) can helpBartek Marczak
09/04/2023, 11:55 AMevent_logs
entries without losing information about materialised asset partitions - and I should mention that we rely on time-partitioned assets quite a bit - so it seems that deleting old runs is not an option. Is there anything you can recommend to that end?Bartek Marczak
09/07/2023, 3:44 PMevent_logs
entries without losing information about materialised asset partitions. Can anyone advise on whether the solution proposed here or running a SQL query directly against the event_logs
table will impact dagster’s “knowledge” of materialized asset partitions?alex
09/07/2023, 8:28 PMevent_logs
:
• have dagster_event_type = 'ASSET_MATERIALIZATION
(or ASSET_OBSERVATION
if you are using those)
• have non null values for asset_key
Bartek Marczak
09/08/2023, 10:38 AMChris Comeau
10/04/2023, 7:13 PMdelete from event_logs
where (dagster_event_type is null or dagster_event_type not in ('ASSET_OBSERVATION','ASSET_MATERIALIZATION'))
and asset_key is null
and timestamp < (now() - INTERVAL '7 days');
Can follow that with additional cleanup for historical runs, but it needs a condition to preserve the latest events for each (asset_key, partition), so you don't accidentally lose sight of long-standing failures.
-- using CTE for faster deletes, keep 30 days of history before last materialization
with watermarks as (
select asset_key, partition, max(timestamp) as max_timestamp from event_logs
where dagster_event_type in ('ASSET_OBSERVATION','ASSET_MATERIALIZATION')
group by asset_key, partition
),
rows_to_delete as (
select id from event_logs l inner join watermarks on (
(l.asset_key = watermarks.asset_key)
and (l.partition is not distinct from watermarks.partition)
and (l.timestamp < watermarks.max_timestamp - interval '30 days')
)
)
delete from event_logs
using rows_to_delete
where event_logs.id = rows_to_delete.id;
Bartek Marczak
10/13/2023, 9:03 AMIS DISTINCT FROM
allows us to match on NULL
as in your query):
DELETE FROM event_logs
WHERE timestamp < NOW() - interval '14 DAYS'
AND asset_key IS NULL
AND dagster_event_type IS DISTINCT FROM 'ASSET_MATERIALIZATION' AND dagster_event_type IS DISTINCT FROM 'ASSET_OBSERVATION';