Appending when re-materializing an asset
# ask-community
c
Appending when re-materializing an asset. I have a data asset that accumulates data over time at stochastic intervals, which prevents me from using regularly spaced partitions. I'd like to use a schedule to check for new data and, if there is any, add it to the asset materialization. My question is: what would the recommended pattern for this behaviour be? I'm assuming I'll be writing something like the following (pseudocode):
import pandas as pd
from dagster import asset

@asset
def foo() -> pd.DataFrame:
    if asset_exists():  # pseudocode: has this asset been materialized before?
        data = load_data()  # pseudocode: load the existing materialization
        new_data = fetch_new_data_since(data.last_time)
        # pandas removed DataFrame.append; pd.concat is the idiomatic replacement
        data = pd.concat([data, new_data])
    else:
        data = fetch_all_data()  # pseudocode: first run, fetch everything
    return data
Does anyone have any suggestions on how to approach this type of problem?
v
I’m not sure if I understand it entirely, but a workaround I’ve used for assets that have irregular time partitions is to use DynamicPartitions. You could then aggregate everything into an unpartitioned downstream asset that runs whatever further calculations you might need.
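A minimal sketch of that setup, assuming a recent Dagster release; fetch_data_for_period is a hypothetical loader, and loading all upstream partitions as a dict assumes the default IO manager's behaviour for unpartitioned downstream assets:

import pandas as pd
from dagster import AssetExecutionContext, DynamicPartitionsDefinition, asset

# One partition per irregular "activity period"; keys are added at runtime.
activity_periods = DynamicPartitionsDefinition(name="activity_periods")

@asset(partitions_def=activity_periods)
def activity_data(context: AssetExecutionContext) -> pd.DataFrame:
    # The partition key identifies which activity period to fetch.
    return fetch_data_for_period(context.partition_key)  # hypothetical loader

@asset
def all_activity(activity_data: dict[str, pd.DataFrame]) -> pd.DataFrame:
    # Unpartitioned downstream asset: with the default IO manager, Dagster
    # loads every upstream partition as a dict of partition key -> value.
    return pd.concat(activity_data.values())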
c
The object I'm collecting data on has periods of activity, sometimes none in a day, sometimes many. Sometimes the activity flows over from one day to the next.
Using a dynamic partition to define each period of activity sounds like the right solution. If I find I have extra data in the latest partition, is it possible to load in what I have and add more? If so, how is that done?
I could use the start of the active period as the partition key.
v
Yep, that’s exactly what I’ve been doing. I have an asset that aggregates our budget/forecasting process, which needs to be historized but is updated very irregularly (sometimes multiple times a day, sometimes none in weeks). Whenever the sensor detects changes in the data, it will add a partition and kick off a run for that partition.
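A sketch of such a sensor, reusing the activity_periods and activity_data definitions above; detect_new_periods is a hypothetical helper that returns the start times of any newly observed activity periods:

from dagster import RunRequest, SensorResult, sensor

@sensor(asset_selection=[activity_data])
def activity_sensor(context):
    new_keys = detect_new_periods()  # hypothetical: start times of new periods
    if not new_keys:
        return SensorResult(run_requests=[])
    return SensorResult(
        # Register each new period as a partition, then materialize it.
        dynamic_partitions_requests=[activity_periods.build_add_request(new_keys)],
        run_requests=[RunRequest(partition_key=k) for k in new_keys],
    )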
c
Thanks @Vinnie. I'll try this out.
v
I think one major drawback is that the asset isn’t using information about the partition within its processing. In my case, at least, accidentally backfilling would effectively make all partitions show the same data, so you should have some safeguards for that.
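One possible safeguard, sketched under the same assumptions: derive the query bounds from the partition key inside the asset, so backfilling an old partition re-fetches only that period. fetch_data_between and end_of_period are hypothetical helpers:

@asset(partitions_def=activity_periods)
def activity_data(context: AssetExecutionContext) -> pd.DataFrame:
    # Bounding the fetch by the partition key keeps backfills idempotent:
    # re-running an old partition cannot overwrite it with the latest data.
    start = pd.Timestamp(context.partition_key)
    return fetch_data_between(start, end_of_period(start))  # hypothetical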
c
Yes, and I'll need to somehow check if the newest data is part of a continuing activity band or the beginning of a new one. Would you suggest this part of the logic live in the sensor?
That is where the dynamic add_partition portion will live if I understand everything correctly.
v
Here’s a related thread from a few months back: https://dagster.slack.com/archives/C01U954MEER/p1676890440544719
I think I could see a case for either (logic in the sensor or in the asset), but it’s hard to tell without fully understanding the use case. My initial impulse is to say that logic should be in the sensor, but I’d probably still build another type of check within the asset just to make sure everything is running as intended.
🙌 1
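A sketch of how that continuing-band-vs-new-band decision could live in the sensor, reusing the definitions above; fetch_rows_since and continues_existing_band are hypothetical, and the partition key is assumed to be the start time of the band:

@sensor(asset_selection=[activity_data])
def band_aware_sensor(context):
    keys = context.instance.get_dynamic_partitions("activity_periods")
    latest = max(keys) if keys else None
    rows = fetch_rows_since(latest)  # hypothetical: rows newer than the latest band
    if rows.empty:
        return SensorResult(run_requests=[])
    if latest is not None and continues_existing_band(rows, latest):  # hypothetical
        # New data extends the current activity band: re-materialize that partition.
        return SensorResult(run_requests=[RunRequest(partition_key=latest)])
    new_key = str(rows["time"].min())  # key = start of the new band (assumed column)
    return SensorResult(
        dynamic_partitions_requests=[activity_periods.build_add_request([new_key])],
        run_requests=[RunRequest(partition_key=new_key)],
    )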