# ask-community
n
I think I'm coming around to an answer to my question from yesterday. Basically, it's to divide my pipeline into two stages: "ingestion" and "transformation". All the data ingestion logic will be represented as jobs and ops, to avoid the complexity of handling source-system limitations with the software-defined asset model. I'll then define my "source assets" to be the outputs of jobs in the ingestion stage. Those will be used as inputs for defining assets in the transformation stage. In other words: use the more primitive op/job concepts for flexibility when dealing with source systems, and use assets for the transformation stuff that is more under my control. Sound about right? Anyone else using the "source asset" construct this way?
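A minimal sketch of that split, assuming Dagster's op/job/SourceAsset/asset APIs; the names `fetch_records`, `raw_records`, and `cleaned_records` are hypothetical:

```python
from dagster import AssetKey, SourceAsset, asset, job, op


# --- ingestion stage: plain ops and jobs, so pagination, rate limits, and ---
# --- other source-system quirks can be handled imperatively ---

@op
def fetch_records():
    # placeholder for the paginated source-system call
    return [{"id": 1}, {"id": 2}]


@op
def write_raw_records(records):
    # persist the records wherever the "raw_records" source asset is expected to live
    return records


@job
def ingestion_job():
    write_raw_records(fetch_records())


# --- transformation stage: software-defined assets ---

# declare the ingestion job's output as a source asset...
raw_records = SourceAsset(key=AssetKey("raw_records"))


# ...and build downstream assets on top of it; the input name matching the
# source asset's key is what creates the dependency
@asset
def cleaned_records(raw_records):
    return [r for r in raw_records if r.get("id") is not None]
```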
d
Is there a reason that the custom io_manager can't do the difference computation and store that somewhere? For the rate limits, you might be able to put those in your resources, if you can't do it at the job level
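A rough sketch of what putting the rate limit in a resource could look like, assuming Dagster's @resource/@op/@job APIs; the `rate_limiter` name, the `min_interval_seconds` config, and `fetch_page` are hypothetical:

```python
import time

from dagster import job, op, resource


@resource(config_schema={"min_interval_seconds": float})
def rate_limiter(init_context):
    """Hypothetical resource that enforces a minimum delay between source-system calls."""
    min_interval = init_context.resource_config["min_interval_seconds"]
    last_call = {"t": 0.0}

    def wait():
        elapsed = time.monotonic() - last_call["t"]
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        last_call["t"] = time.monotonic()

    return wait


@op(required_resource_keys={"rate_limiter"})
def fetch_page(context):
    context.resources.rate_limiter()  # block until the next call is allowed
    # ...call the source API for one page here...


@job(resource_defs={"rate_limiter": rate_limiter.configured({"min_interval_seconds": 1.0})})
def paginated_ingest_job():
    fetch_page()
```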
n
What do you mean by difference computation here?
d
From what it sounds like, you want to do change data capture, but your data isn't stored in a data system that supports that. Is that right?
n
Roughly, yes, although computing the diff is something I was planning to do in a downstream job. Hadn't thought of using a custom IO manager for that, although I was thinking I might need one for a completely different reason: my "asset" is effectively parameterized by two things, a `creation_date` and an `observation_date`. Each month I need to observe and store all records, so they can later be compared with previous observations.
I was initially thinking of having a monthly asset partition around `creation_date`, and using a custom IO manager to update the file path based on the `observation_date`.
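A minimal sketch of such an IO manager, assuming monthly `creation_date` partitions and local JSON storage; the class name, path layout, and observation-date stamping are hypothetical:

```python
import json
import os
from datetime import date

from dagster import IOManager, io_manager


class ObservationSnapshotIOManager(IOManager):
    """Hypothetical IO manager: paths are keyed by the creation_date partition
    plus the observation date at write time, so monthly snapshots are all kept."""

    def __init__(self, base_dir):
        self._base_dir = base_dir

    def _partition_dir(self, creation_month):
        return os.path.join(self._base_dir, f"creation={creation_month}")

    def handle_output(self, context, obj):
        creation_month = context.asset_partition_key          # e.g. "2022-07-01"
        observation_month = date.today().strftime("%Y-%m")    # when this run observed the records
        path = os.path.join(self._partition_dir(creation_month), f"observed={observation_month}.json")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(obj, f)

    def load_input(self, context):
        # load the latest observation for the requested creation_date partition
        partition_dir = self._partition_dir(context.asset_partition_key)
        latest = sorted(os.listdir(partition_dir))[-1]
        with open(os.path.join(partition_dir, latest)) as f:
            return json.load(f)


@io_manager(config_schema={"base_dir": str})
def observation_snapshot_io_manager(init_context):
    return ObservationSnapshotIOManager(init_context.resource_config["base_dir"])
```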
d
is creation date just when you ran the job?
n
When the source object came into existence
d
so observation_date is when the job ran?
n
Yep
A user profile would be a good analogy here. I'm paginating over all records by when they were created, and observing monthly so I can capture changes
d
What do you need to have happen if there are two different runs for the same month?
n
I would prefer keeping all data, but am also happy to have it overwrite the current month if that makes things easier
d
I think this is getting too complicated. Maybe this will work?
• Create a partitioned asset by `creation_date` (asset1)
  ◦ If you need/want to include the current time as part of the path or track that somewhere, you can do that as part of the `handle_output` in the `io_manager`
• If you need to create the difference objects, create a partitioned asset that depends on asset1, where the partitions are mapped (https://app.slack.com/client/TCDGQDUKF/C01U954MEER/thread/C01U954MEER-1658345537.413359) (see the sketch below)
• If you need to do complicated transformations, multiple stages, etc., one or both of these assets can be graph-backed or multi-assets
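A minimal sketch of that layout, assuming Dagster's MonthlyPartitionsDefinition and @asset APIs; `raw_snapshot` and `monthly_diff` are hypothetical names, and the partition mapping shown is just the identity mapping you get by default when both assets share a partitions definition:

```python
from dagster import MonthlyPartitionsDefinition, asset

# monthly partitions keyed by creation_date (hypothetical start date)
creation_partitions = MonthlyPartitionsDefinition(start_date="2022-01-01")


def fetch_records_created_in(month: str) -> list[dict]:
    # placeholder for the paginated, rate-limited source-system call
    return []


@asset(partitions_def=creation_partitions)
def raw_snapshot(context) -> list[dict]:
    # asset1: every record whose creation_date falls in this partition's month,
    # as observed by this run; a custom io_manager (like the sketch above) can
    # stamp the observation date into the storage path in handle_output
    return fetch_records_created_in(context.partition_key)


@asset(partitions_def=creation_partitions)
def monthly_diff(context, raw_snapshot: list[dict]) -> list[dict]:
    # depends on the same creation_date partition of raw_snapshot (identity
    # mapping is the default when both assets share a partitions definition);
    # the comparison against the prior observation would happen here
    previous: list[dict] = []  # placeholder: load the previous observation from storage
    return [r for r in raw_snapshot if r not in previous]
```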
n
Thanks for this. I'll let you know how it goes!