# ask-community
p
I'm looking into setting up data retention limits. Upon further reading it sounds like these settings only apply to schedule and sensor tick data, but looking at my Postgres db it's pretty clear that we have a lot more data in `event_logs` than in `job_ticks` (which is where I believe the tick data is stored - but either way, `event_logs` uses about 21GB of space, whereas the next largest table is `asset_event_tags` at 1.2GB). It sounds like the only way to clean up `event_logs` is to create a Dagster job that either executes a SQL statement directly on the table or calls `context.instance.delete_run(...)` based on some list of run IDs that I would generate from data in `event_logs` - are there other options I missed? I'd like to keep the db from growing unnecessarily large, and I'm pretty sure we don't really need to look at run data that's older than a few weeks.
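(For reference, one way to generate that run-ID list without raw SQL is the public instance API. A minimal sketch, assuming a recent Dagster release where `RunsFilter` supports `created_before`; the helper name is made up:)

```python
from datetime import datetime, timedelta, timezone

from dagster import DagsterInstance, RunsFilter

# Hypothetical helper: collect the IDs of runs older than a retention cutoff
# via the public instance API, rather than querying the runs table directly.
def old_run_ids(instance: DagsterInstance, weeks: int = 6) -> list[str]:
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=weeks)
    records = instance.get_run_records(filters=RunsFilter(created_before=cutoff))
    return [record.dagster_run.run_id for record in records]
```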
As far as deleting using SQL goes, I see these `dagster_event_type` values in the table:
```
  count  |      dagster_event_type
---------+-------------------------------
       6 | HOOK_COMPLETED
     202 | HOOK_SKIPPED
     330 | PIPELINE_CANCELING
     342 | PIPELINE_CANCELED
    2998 | PIPELINE_ENQUEUED
    6524 | STEP_SKIPPED
    9896 | PIPELINE_SUCCESS
   10788 | PIPELINE_FAILURE
   19334 | PIPELINE_START
   19398 | PIPELINE_STARTING
   49717 | STEP_FAILURE
  148834 | ENGINE_EVENT
  584840 | STEP_INPUT
  586052 | LOADED_INPUT
  669303 | ASSET_OBSERVATION
  829815 | ASSET_MATERIALIZATION
  831232 | HANDLED_OUTPUT
  831407 | STEP_OUTPUT
  835837 | STEP_SUCCESS
  885650 | STEP_START
  885653 | LOGS_CAPTURED
  885656 | RESOURCE_INIT_SUCCESS
  885841 | RESOURCE_INIT_STARTED
  885955 | STEP_WORKER_STARTED
  894480 | STEP_WORKER_STARTING
 1109948 | ASSET_MATERIALIZATION_PLANNED
 2556012 |
```
Sort of surprised to see so many null rows, which I think are produced by `context.log()`? I didn't think we did that much logging of our own, but I should check that. It does seem that deleting based on event type would leave behind a fragmented run history, so it's probably better to delete by run, for data consistency at least.
p
This is right. It should generally be safe to delete the null `dagster_event_type` rows… they do correspond to the `context.log` calls. The input/output events are used to power the “Retry” feature, so those might be okay to remove after some amount of time. The asset events (materialization, observation) power the asset catalog / asset history, so deleting those will have an impact on history.
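(A minimal sketch of that raw-SQL option, limited to the null rows discussed above. It assumes direct access to the Dagster storage database; the table and column names match the default Postgres schema, but verify them against your own db and test on a copy before running anything destructive:)

```python
import psycopg2  # assumption: direct access to the Dagster storage database

# Purge the NULL dagster_event_type rows (i.e. the context.log() entries)
# older than a cutoff. The 6-week window is illustrative.
PURGE_NULL_EVENTS = """
    DELETE FROM event_logs
    WHERE dagster_event_type IS NULL
      AND timestamp < NOW() - INTERVAL '6 weeks'
"""

def purge_log_events(dsn: str) -> int:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(PURGE_NULL_EVENTS)
            deleted = cur.rowcount  # number of rows removed
        conn.commit()
        return deleted
    finally:
        conn.close()
```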
p
Thanks for the info! It feels a little hacky to be running SQL statements behind the scenes - going through a published interface like `context.instance.delete_run()`, running it on a weekly or monthly basis, and feeding in run IDs obtained from the `runs` table is probably a better solution. Which event types does `delete_run()` remove? I haven't had a chance to look at the code.
p
Calling `delete_run` will delete all the events for the run (including materialization history, etc).
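(A sketch of what that weekly cleanup could look like as a scheduled Dagster job. The job/schedule names and the 6-week window are hypothetical, and it assumes a recent Dagster version where `RunsFilter` accepts `created_before` and run records expose `dagster_run`:)

```python
from datetime import datetime, timedelta, timezone

from dagster import RunsFilter, job, op, schedule

@op
def delete_old_runs(context):
    # Assumed retention window: runs older than ~6 weeks are purged.
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=6)
    records = context.instance.get_run_records(
        filters=RunsFilter(created_before=cutoff),
        limit=500,  # work in batches so a single cleanup run stays bounded
    )
    for record in records:
        # delete_run removes the runs row and every event_logs row for the
        # run, including materialization/observation history (per the note
        # above), so size the cutoff accordingly.
        context.instance.delete_run(record.dagster_run.run_id)
    context.log.info(f"deleted {len(records)} runs created before {cutoff}")

@job
def run_retention_job():
    delete_old_runs()

@schedule(cron_schedule="0 3 * * 0", job=run_retention_job)
def run_retention_schedule():
    return {}
```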
p
Still mulling this one over - I could probably live without materialization history for runs older than a month or two. It is nice to see which partitions were materialized successfully, but in practice I typically only look at the more recent partition materialization history, and at the end of the day our source of truth is what gets uploaded to S3.
Related to this, could you elaborate on how schedule and sensor data purging works? When is the purging executed and which tables does the purge delete from?
p
Sensor/schedule purging just deletes ticks…
p
And those are not stored in the database?
p
They are, in the `job_ticks` table.
p
Right, ok. When does the tick deletion actually run?
p
The tick deletion happens with each iteration of the sensor/schedule daemon.
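(For reference, the tick retention settings mentioned at the top of the thread are configured in the instance's `dagster.yaml`; something along these lines, with illustrative values, controls how long the daemon keeps rows in `job_ticks`:)

```yaml
retention:
  schedule:
    purge_after_days: 90   # delete schedule ticks older than 90 days
  sensor:
    purge_after_days:      # sensor retention can be set per tick status
      skipped: 7
      failure: 30
      success: -1          # -1 keeps success ticks indefinitely
```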