A question on partitioning... We have time series data coming in for a given physical asset which we are storing in a time partitioned parquet file. The thing is that within this time series there are discrete subclasses of event type. We might have something like the following:
Inactive: 1st Jan, 12:15
Loading: 1st Jan, 23:45
Active: 2nd Jan, 01:15
Inactive: 4th Jan, 14:00
The challenge is that we don't know the length of the events in advance, and need an event to have completed in order to process it properly. Within a given days data we'd be able to detect the transition between events, but being able to then fetch the requisite data to process that event is challenging.
In effect we'd want to on each day be able to say "Has an event ended today, and if so, create the corresponding 'Event' asset", which should then be processed appropriately. Sketch attached!
If anyone has any suggestions/initial thoughts on how to structure something like this I'm all ears!
08/25/2022, 7:05 PM
So is the time series data coming in being managed by dagster? Or is this some external service that you're reading from, and trying to interpret via dagster?
One thought that comes to mind is representing the events you see end as
objects - you can have assets / jobs downstream that then process when you have said observation.
08/30/2022, 8:06 AM
Depends. We have control of the upstream pipelines so could theoretically do the partitioning there. That said, it would probably be preferable to separate out pure ingestion logic from this which is a bit more interpretation based.
In our case, working out whether something is active isn't just a case of checking for a flag, but doing some rolling window type analysis to detect certain features in the time series data.
AssetObservations seems like an interesting approach, I'll have a play!