
Chris Roth

04/14/2023, 1:33 PM
Appending when re-materializing an asset. I have a data asset that accumulates data over time at stochastic intervals, which rules out partitions at regular intervals. I'd like to use a schedule to check for new data and, if there is any, add it to the asset materialization. What is the recommended pattern for this behaviour? I'm assuming I'll be writing something like the following (pseudocode):
import pandas as pd

@asset(...)
def foo() -> pd.DataFrame:
    if asset_exists():  # hypothetical helper
        data = load_data()  # load the existing materialization
        new_data = fetch_new_data_since(data.last_time)
        # DataFrame.append is deprecated; use pd.concat instead
        data = pd.concat([data, new_data], ignore_index=True)
    else:
        data = fetch_all_data()
    return data
Does anyone have any suggestions on how to approach this type of problem?
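A minimal, non-Dagster sketch of the idempotent-append step the pseudocode above is reaching for, assuming the data has a `time` column; `append_new_rows` is a hypothetical helper, not Dagster API:

import pandas as pd

def append_new_rows(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append only rows from `new` that are later than anything in
    `existing`, so re-running the same fetch is idempotent."""
    if existing.empty:
        return new
    last_time = existing["time"].max()
    fresh = new[new["time"] > last_time]
    return pd.concat([existing, fresh], ignore_index=True)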

Vinnie

04/14/2023, 1:43 PM
I’m not sure if I understand it entirely, but a workaround I’ve used for assets that have irregular time partitions is to use DynamicPartitions. You could then aggregate everything into an unpartitioned downstream asset that runs whatever further calculations you might need
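A rough sketch of what that can look like, assuming Dagster's `DynamicPartitionsDefinition`; the asset name and partition naming scheme here are illustrative, not from the thread:

from dagster import DynamicPartitionsDefinition, asset

# One dynamic partition per activity period, added at runtime by a sensor.
activity_periods = DynamicPartitionsDefinition(name="activity_periods")

@asset(partitions_def=activity_periods)
def activity_data(context) -> None:
    # The partition key identifies which activity period this run covers,
    # e.g. the period's start timestamp.
    period_start = context.partition_key
    # fetch and store the data for this activity period...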

Chris Roth

04/14/2023, 1:47 PM
The object I'm collecting data on has periods of activity: sometimes none in a day, sometimes many, and sometimes an activity period flows over from one day to the next.
Using a dynamic partition to define each period of activity sounds like the right solution. If I find I have extra data for the latest partition, is it possible to load what I have and add more? If so, how is that done?
I could use the start of the active period as the partition key.

Vinnie

04/14/2023, 1:50 PM
Yep, that’s exactly what I’ve been doing. I have an asset that aggregates our budget/forecasting process, which needs to be historized but is updated very irregularly (sometimes multiple times a day, sometimes none in weeks). Whenever the sensor detects changes in the data, it will add a partition and kick off a run for that partition.

Chris Roth

04/14/2023, 1:53 PM
Thanks @Vinnie. I'll try this out.

Vinnie

04/14/2023, 1:54 PM
I think one major drawback is that the asset isn’t using information about the partition within its processing, meaning (at least in my case) accidentally backfilling would effectively make all partitions show the same data, so you should have some safeguards for that.

Chris Roth

04/14/2023, 1:55 PM
Yes, and I'll also need to check whether the newest data is part of a continuing activity band or the beginning of a new one. Would you suggest that part of the logic live in the sensor?
That's where the dynamic add_partition call will live, if I understand everything correctly.
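Putting that together, a hedged sketch of the sensor side, assuming Dagster's `SensorResult` and `build_add_request` (available in recent Dagster versions); `detect_new_periods` is a hypothetical helper standing in for the "is this a new activity band?" logic:

from dagster import (
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    define_asset_job,
    sensor,
)

activity_periods = DynamicPartitionsDefinition(name="activity_periods")
activity_job = define_asset_job("activity_job", selection="activity_data")

@sensor(job=activity_job)
def activity_sensor(context):
    # detect_new_periods() is a hypothetical helper returning the start
    # times of activity periods not yet registered as partitions.
    new_keys = detect_new_periods()
    if not new_keys:
        return SensorResult(run_requests=[])
    return SensorResult(
        # Register the new partitions and request one run per period.
        dynamic_partitions_requests=[activity_periods.build_add_request(new_keys)],
        run_requests=[RunRequest(partition_key=k) for k in new_keys],
    )

Keeping the detection in the sensor keeps the asset body simple, but (as noted below in the thread) a sanity check inside the asset is a reasonable second line of defence.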

Vinnie

04/14/2023, 1:56 PM
Here’s a related thread from a few months back: https://dagster.slack.com/archives/C01U954MEER/p1676890440544719
I think I could see a case for either (logic in the sensor or in the asset), but it’s hard to tell without fully understanding the use case. My initial impulse is to say the logic should be in the sensor, but I’d probably still build another check into the asset just to make sure everything is running as intended