hi loving the dynamic partitions feature I m working with we dagster #ask-community

Harrison Conlin

03/16/2023, 11:24 AM

hi! loving the dynamic partitions feature. I'm working with web APIs to do some governance and oversee the use of the business intelligence platform my team manages at my day job. There's a mix of API calls: some return all items with basic metadata (e.g. get all the reports in the organisation), but other times I need to query per item (e.g. get the developers for a report ID). As such I've got an asset `reports` which gets all the reports + basic metadata and updates the dynamic partitions for the individual asset `report`. `report` depends on `reports`, and I then materialise all the `report` partitions, which gets its metadata entry. It's a bit slow and painful. Ideally I'd get rid of the `reports` asset, but I want to keep IO managers, so my plan was to move the API call into a job, loop through the results, create an `OutputContext` via `dagster.build_output_context()` for each report, pass that to the IO manager's `handle_output` function and fire off `AssetMaterialization`s. Happy times, I was hoping, but as `build_output_context()` doesn't create a step context, `OutputContext.has_asset_partitions` fails. Admittedly I am going down an ugly route, but can you see any alternatives?

claire

03/16/2023, 11:59 PM

Hi Harrison. Is there a reason why you are explicitly outputting asset materializations? Wondering if it's possible instead for you to:
• use a sensor to query the API
• update the partitions per report asset
• kick off a run request for each new partition

Harrison Conlin

03/18/2023, 1:40 AM

Hi @claire, I was explicitly outputting asset materializations because the initial API call has all the data I need and, due to some aggressive rate limiting, I can't necessarily afford to launch a new request for every new partition. However, I like the sensor idea. Would it be un-dagsteric (think pythonic but dagster) to save the output of the API call to a temporary directory and have the report asset read from it, if it exists? That way I can have my report asset call the GetGroup api when needed, but if the results of GetGroups are available, it can use those.
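The cache-then-fallback idea described here could look something like the sketch below. It is plain Python with no Dagster involved; the function names, the one-JSON-file-per-report layout, and the `fetch_one` callback are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def save_bulk_result(reports: dict, cache_dir: Path) -> None:
    """Write each report from the bulk call to its own JSON file (assumed layout)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for report_id, payload in reports.items():
        (cache_dir / f"{report_id}.json").write_text(json.dumps(payload))


def load_report(report_id: str, cache_dir: Path, fetch_one: Callable[[str], dict]) -> dict:
    """Use the cached bulk result if present; otherwise call the per-report API."""
    cached = cache_dir / f"{report_id}.json"
    if cached.exists():
        return json.loads(cached.read_text())
    # No cached copy: fall back to the (rate-limited) per-item request.
    return fetch_one(report_id)
```

Nothing in this sketch removes the cached files, so something else would have to clean the directory once the runs that need it have finished.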

claire

03/20/2023, 9:30 PM

Ahhh I see. I think generally saving the output of the API call to a temp dir may be tricky as you'll also have to find a way to delete the contents after all of the runs conclude. Thinking about this more, I think a cleaner way to do this would be similar to what you initially implemented:
• define an unpartitioned `reports` asset that queries for the initial API call, yielding all the data you need as output and creating all the dynamic partitions you need via `context.instance.add_dynamic_partitions(...)`
• have each `report` asset with its own dynamic partitions def depend on the `reports` asset
• in a schedule, update the `reports` asset as frequently as desired based on the rate limiting
• in a sensor, check whenever the `reports` asset is materialized, and then kick off a run request for each different `report` asset
This approach I think is cleaner and will allow you to update all of the downstream `report` assets after you update `reports` automatically. And you'll be able to load the result of the latest API call in each `report` asset.

👍 1

Harrison Conlin

03/22/2023, 5:35 AM

yeah, I was thinking about it more away from the computer and I think you're right. I think part of me was just trying to find a way to reduce the penalty that is spinning up a new process for each run.
