# ask-community
p
I'm looking into setting up data retention limits. Upon further reading it sounds like these settings only apply to schedule and sensor tick data, but looking at my Postgres db it's pretty clear that we have a lot more data in `event_logs` than in `job_ticks` (which is where I believe the tick data is stored - but either way, `event_logs` uses about 21GB of space, whereas the next largest table is `asset_event_tags` at 1.2GB). It sounds like the only way to clean up `event_logs` is to create a Dagster job that either executes a SQL statement directly on the table or calls `context.instance.delete_run(...)` based on some list of run IDs that I would generate from data in `event_logs` - are there other options I missed? I'd like to keep the db from growing unnecessarily large, and I'm pretty sure we don't really need to look at run data that's older than a few weeks.
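(For reference, one way to generate that run-ID list without raw SQL is the public instance API. A minimal sketch, assuming a recent Dagster release where `RunsFilter` supports `created_before`; the helper name is made up:)

```python
from datetime import datetime, timedelta, timezone

from dagster import DagsterInstance, RunsFilter

# Hypothetical helper: collect the IDs of runs older than a retention cutoff
# via the public instance API, rather than querying the runs table directly.
def old_run_ids(instance: DagsterInstance, weeks: int = 6) -> list[str]:
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=weeks)
    records = instance.get_run_records(filters=RunsFilter(created_before=cutoff))
    return [record.dagster_run.run_id for record in records]
```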
As far as deleting using SQL goes, I see these `dagster_event_type` values in the table:
```
  count  |      dagster_event_type
---------+-------------------------------
       6 | HOOK_COMPLETED
     202 | HOOK_SKIPPED
     330 | PIPELINE_CANCELING
     342 | PIPELINE_CANCELED
    2998 | PIPELINE_ENQUEUED
    6524 | STEP_SKIPPED
    9896 | PIPELINE_SUCCESS
   10788 | PIPELINE_FAILURE
   19334 | PIPELINE_START
   19398 | PIPELINE_STARTING
   49717 | STEP_FAILURE
  148834 | ENGINE_EVENT
  584840 | STEP_INPUT
  586052 | LOADED_INPUT
  669303 | ASSET_OBSERVATION
  829815 | ASSET_MATERIALIZATION
  831232 | HANDLED_OUTPUT
  831407 | STEP_OUTPUT
  835837 | STEP_SUCCESS
  885650 | STEP_START
  885653 | LOGS_CAPTURED
  885656 | RESOURCE_INIT_SUCCESS
  885841 | RESOURCE_INIT_STARTED
  885955 | STEP_WORKER_STARTED
  894480 | STEP_WORKER_STARTING
 1109948 | ASSET_MATERIALIZATION_PLANNED
 2556012 |
```
Sort of surprised to see so many null rows, which I think are produced by `context.log()`? I didn't think we did that much logging of our own, but I should check that. It does seem that deleting based on event type would leave behind a fragmented run history, so it's probably better to delete by run, for data consistency at least.
p
This is right. It should generally be safe to delete the null `dagster_event_type` rows… they do correspond to the `context.log` calls. The input/output events are used to power the “Retry” feature, so those might be okay to remove after some amount of time. The asset events (materialization, observation) power the asset catalog / asset history, so deleting those will have an impact on history.
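(A minimal sketch of that raw-SQL option, limited to the null rows discussed above. It assumes direct access to the Dagster storage database; the table and column names match the default Postgres schema, but verify them against your own db and test on a copy before running anything destructive:)

```python
import psycopg2  # assumption: direct access to the Dagster storage database

# Purge the NULL dagster_event_type rows (i.e. the context.log() entries)
# older than a cutoff. The 6-week window is illustrative.
PURGE_NULL_EVENTS = """
    DELETE FROM event_logs
    WHERE dagster_event_type IS NULL
      AND timestamp < NOW() - INTERVAL '6 weeks'
"""

def purge_log_events(dsn: str) -> int:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(PURGE_NULL_EVENTS)
            deleted = cur.rowcount  # number of rows removed
        conn.commit()
        return deleted
    finally:
        conn.close()
```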
p
Thanks for the info! It feels a little hacky to be running SQL statements behind the scenes - going through a published interface like `context.instance.delete_run()`, running it on a weekly or monthly basis, and feeding in run IDs obtained from the `runs` table is probably a better solution. Which event types does `delete_run()` remove? I haven't had a chance to look at the code.
p
Calling `delete_run` will delete all the events for the run (including materialization history, etc).
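(A sketch of what that weekly cleanup could look like as a scheduled Dagster job. The job/schedule names and the 6-week window are hypothetical, and it assumes a recent Dagster version where `RunsFilter` accepts `created_before` and run records expose `dagster_run`:)

```python
from datetime import datetime, timedelta, timezone

from dagster import RunsFilter, job, op, schedule

@op
def delete_old_runs(context):
    # Assumed retention window: runs older than ~6 weeks are purged.
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=6)
    records = context.instance.get_run_records(
        filters=RunsFilter(created_before=cutoff),
        limit=500,  # work in batches so a single cleanup run stays bounded
    )
    for record in records:
        # delete_run removes the runs row and every event_logs row for the
        # run, including materialization/observation history (per the note
        # above), so size the cutoff accordingly.
        context.instance.delete_run(record.dagster_run.run_id)
    context.log.info(f"deleted {len(records)} runs created before {cutoff}")

@job
def run_retention_job():
    delete_old_runs()

@schedule(cron_schedule="0 3 * * 0", job=run_retention_job)
def run_retention_schedule():
    return {}
```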
p
Still mulling this one over - I could probably live without materialization history for runs older than a month or two. It is nice to see which partitions were materialized successfully, but in practice I typically only look at the more recent partition materialization history, and at the end of the day our source of truth is what gets uploaded to S3.
Related to this, could you elaborate on how schedule and sensor data purging works? When is the purging executed and which tables does the purge delete from?
p
Sensor/schedule purging just deletes ticks…
p
And those are not stored in the database?
p
They are, in the `job_ticks` table.
p
Right, ok. When does the tick deletion actually run?
p
The tick deletion happens with each iteration of the sensor/schedule daemon.
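(For reference, the tick retention settings mentioned at the top of the thread are configured in the instance's `dagster.yaml`; something along these lines, with illustrative values, controls how long the daemon keeps rows in `job_ticks`:)

```yaml
retention:
  schedule:
    purge_after_days: 90   # delete schedule ticks older than 90 days
  sensor:
    purge_after_days:      # sensor retention can be set per tick status
      skipped: 7
      failure: 30
      success: -1          # -1 keeps success ticks indefinitely
```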