
Chris Roth

04/14/2023, 1:33 PM
Appending when re-materializing an asset. I have a data asset that accumulates data over time at stochastic intervals, which rules out partitions at regular intervals. I'd like to use a schedule to check for new data and, if there is any, add it to the asset materialization. What is the recommended pattern for this behaviour? I'm assuming I'll be writing something like the following (pseudocode):
import pandas as pd

@asset(...)
def foo() -> pd.DataFrame:
    if asset_exists():  # hypothetical helper
        data = load_data()  # load the existing materialization
        new_data = fetch_new_data_since(data.last_time)
        # DataFrame.append is deprecated; use pd.concat instead
        data = pd.concat([data, new_data], ignore_index=True)
    else:
        data = fetch_all_data()
    return data
Does anyone have any suggestions on how to approach this type of problem?
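A minimal, non-Dagster sketch of the idempotent-append step the pseudocode above is reaching for, assuming the data has a `time` column; `append_new_rows` is a hypothetical helper, not Dagster API:

import pandas as pd

def append_new_rows(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append only rows from `new` that are later than anything in
    `existing`, so re-running the same fetch is idempotent."""
    if existing.empty:
        return new
    last_time = existing["time"].max()
    fresh = new[new["time"] > last_time]
    return pd.concat([existing, fresh], ignore_index=True)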

Vinnie

04/14/2023, 1:43 PM
I’m not sure if I understand it entirely, but a workaround I’ve used for assets that have irregular time partitions is to use DynamicPartitions. You could then aggregate everything into an unpartitioned downstream asset that runs whatever further calculations you might need
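A rough sketch of what that can look like, assuming Dagster's `DynamicPartitionsDefinition`; the asset name and partition naming scheme here are illustrative, not from the thread:

from dagster import DynamicPartitionsDefinition, asset

# One dynamic partition per activity period, added at runtime by a sensor.
activity_periods = DynamicPartitionsDefinition(name="activity_periods")

@asset(partitions_def=activity_periods)
def activity_data(context) -> None:
    # The partition key identifies which activity period this run covers,
    # e.g. the period's start timestamp.
    period_start = context.partition_key
    # fetch and store the data for this activity period...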

Chris Roth

04/14/2023, 1:47 PM
The object I'm collecting data on has periods of activity: sometimes none in a day, sometimes many, and sometimes an activity period flows over from one day to the next.
Using a dynamic partition to define each period of activity sounds like the right solution. If I find I have extra data for the latest partition, is it possible to load what I have and add more? If so, how is that done?
I could use the start of the active period as the partition key.

Vinnie

04/14/2023, 1:50 PM
Yep, that’s exactly what I’ve been doing. I have an asset that aggregates our budget/forecasting process, which needs to be historized but is updated very irregularly (sometimes multiple times a day, sometimes none in weeks). Whenever the sensor detects changes in the data, it will add a partition and kick off a run for that partition.

Chris Roth

04/14/2023, 1:53 PM
Thanks @Vinnie. I'll try this out.

Vinnie

04/14/2023, 1:54 PM
I think one major drawback is that the asset isn’t using information about the partition within its processing, meaning (at least in my case) accidentally backfilling would effectively make all partitions show the same data, so you should have some safeguards for that.

Chris Roth

04/14/2023, 1:55 PM
Yes, and I'll also need to check whether the newest data is part of a continuing activity band or the beginning of a new one. Would you suggest that part of the logic live in the sensor?
That's where the dynamic add_partition call will live, if I understand everything correctly.
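Putting that together, a hedged sketch of the sensor side, assuming Dagster's `SensorResult` and `build_add_request` (available in recent Dagster versions); `detect_new_periods` is a hypothetical helper standing in for the "is this a new activity band?" logic:

from dagster import (
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    define_asset_job,
    sensor,
)

activity_periods = DynamicPartitionsDefinition(name="activity_periods")
activity_job = define_asset_job("activity_job", selection="activity_data")

@sensor(job=activity_job)
def activity_sensor(context):
    # detect_new_periods() is a hypothetical helper returning the start
    # times of activity periods not yet registered as partitions.
    new_keys = detect_new_periods()
    if not new_keys:
        return SensorResult(run_requests=[])
    return SensorResult(
        # Register the new partitions and request one run per period.
        dynamic_partitions_requests=[activity_periods.build_add_request(new_keys)],
        run_requests=[RunRequest(partition_key=k) for k in new_keys],
    )

Keeping the detection in the sensor keeps the asset body simple, but (as noted below in the thread) a sanity check inside the asset is a reasonable second line of defence.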

Vinnie

04/14/2023, 1:56 PM
Here’s a related thread from a few months back: https://dagster.slack.com/archives/C01U954MEER/p1676890440544719
I think I could see a case for either (logic in the sensor or in the asset), but it’s hard to tell without fully understanding the use case. My initial impulse is to say the logic should be in the sensor, but I’d probably still build another check into the asset just to make sure everything is running as intended