# ask-community
n
I think I'm coming around to an answer to my question from yesterday. Basically, it's to divide my pipeline into two stages: "ingestion" and "transformation". All the data ingestion logic will be represented as jobs and ops, to avoid the complexity of handling source-system limitations with the software-defined asset model. I'll then define my "source assets" to be the outputs of jobs in the ingestion stage. Those will be used as inputs for defining assets in the transformation stage. In other words: use the more primitive op/job concepts for flexibility when dealing with source systems, and use assets for the transformation stuff that is more under my control. Sound about right? Anyone else using the "source asset" construct this way?
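A minimal sketch of that split, assuming Dagster's op/job/SourceAsset/asset APIs; the names `fetch_records`, `raw_records`, and `cleaned_records` are hypothetical:

```python
from dagster import AssetKey, SourceAsset, asset, job, op


# --- ingestion stage: plain ops and jobs, so pagination, rate limits, and ---
# --- other source-system quirks can be handled imperatively ---

@op
def fetch_records():
    # placeholder for the paginated source-system call
    return [{"id": 1}, {"id": 2}]


@op
def write_raw_records(records):
    # persist the records wherever the "raw_records" source asset is expected to live
    return records


@job
def ingestion_job():
    write_raw_records(fetch_records())


# --- transformation stage: software-defined assets ---

# declare the ingestion job's output as a source asset...
raw_records = SourceAsset(key=AssetKey("raw_records"))


# ...and build downstream assets on top of it; the input name matching the
# source asset's key is what creates the dependency
@asset
def cleaned_records(raw_records):
    return [r for r in raw_records if r.get("id") is not None]
```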
d
Is there a reason that the custom io_manager can't do the difference computation and store that somewhere? For the rate limits, you might be able to put those in your resources, if you can't do it at the job level
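A rough sketch of what putting the rate limit in a resource could look like, assuming Dagster's @resource/@op/@job APIs; the `rate_limiter` name, the `min_interval_seconds` config, and `fetch_page` are hypothetical:

```python
import time

from dagster import job, op, resource


@resource(config_schema={"min_interval_seconds": float})
def rate_limiter(init_context):
    """Hypothetical resource that enforces a minimum delay between source-system calls."""
    min_interval = init_context.resource_config["min_interval_seconds"]
    last_call = {"t": 0.0}

    def wait():
        elapsed = time.monotonic() - last_call["t"]
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        last_call["t"] = time.monotonic()

    return wait


@op(required_resource_keys={"rate_limiter"})
def fetch_page(context):
    context.resources.rate_limiter()  # block until the next call is allowed
    # ...call the source API for one page here...


@job(resource_defs={"rate_limiter": rate_limiter.configured({"min_interval_seconds": 1.0})})
def paginated_ingest_job():
    fetch_page()
```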
n
What do you mean by difference computation here?
d
From what it sounds like, you want to do change data capture, but your data isn't stored in a data system that supports that. Is that right?
n
Roughly, yes, although computing the diff is something I was planning to do in a downstream job. Hadn't thought of using a custom IO manager for that, although I was thinking I might need one for a completely different reason: my "asset" is effectively parameterized by two things, a `creation_date` and an `observation_date`. Each month I need to observe and store all records, so they can later be compared with previous observations.
I was initially thinking of having a monthly asset partition around `creation_date`, and using a custom IO manager to update the file path based on the `observation_date`.
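A minimal sketch of such an IO manager, assuming monthly `creation_date` partitions and local JSON storage; the class name, path layout, and observation-date stamping are hypothetical:

```python
import json
import os
from datetime import date

from dagster import IOManager, io_manager


class ObservationSnapshotIOManager(IOManager):
    """Hypothetical IO manager: paths are keyed by the creation_date partition
    plus the observation date at write time, so monthly snapshots are all kept."""

    def __init__(self, base_dir):
        self._base_dir = base_dir

    def _partition_dir(self, creation_month):
        return os.path.join(self._base_dir, f"creation={creation_month}")

    def handle_output(self, context, obj):
        creation_month = context.asset_partition_key          # e.g. "2022-07-01"
        observation_month = date.today().strftime("%Y-%m")    # when this run observed the records
        path = os.path.join(self._partition_dir(creation_month), f"observed={observation_month}.json")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(obj, f)

    def load_input(self, context):
        # load the latest observation for the requested creation_date partition
        partition_dir = self._partition_dir(context.asset_partition_key)
        latest = sorted(os.listdir(partition_dir))[-1]
        with open(os.path.join(partition_dir, latest)) as f:
            return json.load(f)


@io_manager(config_schema={"base_dir": str})
def observation_snapshot_io_manager(init_context):
    return ObservationSnapshotIOManager(init_context.resource_config["base_dir"])
```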
d
is creation date just when you ran the job?
n
When the source object came into existence
d
so observation_date is when the job ran?
n
Yep
A user profile would be a good analogy here. I'm paginating over all records by when they were created, and observing monthly so I can capture changes
d
What do you need to have happen if there are two different runs for the same month?
n
I would prefer keeping all data, but am also happy to have it overwrite the current month if that makes things easier
d
I think this is getting too complicated. Maybe this will work?
• Create a partitioned asset by `creation_date` (asset1)
  ◦ If you need/want to include the current time as part of the path or track that somewhere, you can do that as part of the `handle_output` in the `io_manager`
• If you need to create the difference objects, create a partitioned asset that depends on asset1, where the partitions are mapped (https://app.slack.com/client/TCDGQDUKF/C01U954MEER/thread/C01U954MEER-1658345537.413359) (see the sketch below)
• If you need to do complicated transformations, multiple stages, etc., one or both of these assets can be graph-backed or multi-assets
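A minimal sketch of that layout, assuming Dagster's MonthlyPartitionsDefinition and @asset APIs; `raw_snapshot` and `monthly_diff` are hypothetical names, and the partition mapping shown is just the identity mapping you get by default when both assets share a partitions definition:

```python
from dagster import MonthlyPartitionsDefinition, asset

# monthly partitions keyed by creation_date (hypothetical start date)
creation_partitions = MonthlyPartitionsDefinition(start_date="2022-01-01")


def fetch_records_created_in(month: str) -> list[dict]:
    # placeholder for the paginated, rate-limited source-system call
    return []


@asset(partitions_def=creation_partitions)
def raw_snapshot(context) -> list[dict]:
    # asset1: every record whose creation_date falls in this partition's month,
    # as observed by this run; a custom io_manager (like the sketch above) can
    # stamp the observation date into the storage path in handle_output
    return fetch_records_created_in(context.partition_key)


@asset(partitions_def=creation_partitions)
def monthly_diff(context, raw_snapshot: list[dict]) -> list[dict]:
    # depends on the same creation_date partition of raw_snapshot (identity
    # mapping is the default when both assets share a partitions definition);
    # the comparison against the prior observation would happen here
    previous: list[dict] = []  # placeholder: load the previous observation from storage
    return [r for r in raw_snapshot if r not in previous]
```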
n
Thanks for this. I'll let you know how it goes!