# ask-community
j
Hi everyone, I'm seeking advice on how to structure a workflow. I have a job that transforms the data in a single file and dumps the output into AWS S3 via an IO manager. The job is similar to the example in the "Without software-defined assets" section here: https://docs.dagster.io/guides/dagster/enriching-with-software-defined-assets#not-all-ops-produce-assets

On a regular basis, I want to aggregate the job outputs by some metadata that rides with the files -- I'll call this variable `category`. I've been able to set up a dynamic partition for `category`, populate the partition using a separate software-defined asset, run the jobs with the correct partition keys, and add metadata to the output of those jobs via `context.add_output_metadata`.

I am thinking about defining a software-defined asset to do the aggregation step, since the output will essentially represent the endpoint that our data scientists will query by `category`, but I am not sure what the best approach is for aggregating the outputs of the jobs within the new SDA. I've thought of a couple of approaches so far:

1. In the new SDA, find the job outputs by partition or metadata using some method of the `OpExecutionContext` instance, or possibly the `AssetSelection` class?
2. Modify the IO manager for the job so that `category` is encoded within the AWS S3 keys, and use those S3 keys within the new SDA to do the aggregation.

What is the best approach for this use case? I imagine there are other solutions that I haven't thought of. Also, are there any obvious problems with the architecture of this workflow?
s
> I have a job that transforms the data in a single file and dumps the output into AWS S3 via an IO manager.
Does this job use software-defined assets? Does it run on the same file each time? A new file each time?
If it's a different file each time, the typical way to model that use case with software-defined assets would be to have a dynamically partitioned asset with a partition for each file.
For the downstream asset, the easiest thing to do would be to make it a non-partitioned asset that depends on all partitions of the upstream asset (see the sketch below).
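A minimal sketch of that modeling, assuming one dynamic partition per file and an IO manager (such as the built-in UPath-based filesystem/S3 IO managers) that loads all upstream partitions into the downstream asset as a dict keyed by partition. The asset and partition names are illustrative.

```python
from dagster import AssetIn, DynamicPartitionsDefinition, OpExecutionContext, asset

# One partition per incoming file (illustrative names throughout).
files_partitions = DynamicPartitionsDefinition(name="files")


@asset(partitions_def=files_partitions)
def transformed_file(context: OpExecutionContext) -> dict:
    # Placeholder transform of a single file, keyed by its partition (file name).
    return {"file": context.partition_key, "rows": []}


@asset(ins={"transformed_file": AssetIn()})
def aggregated_output(transformed_file: dict) -> dict:
    # Non-partitioned downstream asset: with an IO manager that supports
    # multi-partition loads, `transformed_file` arrives as a dict mapping
    # partition keys (file names) to their stored values.
    return {"n_files": len(transformed_file)}
```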
j
Thanks for following up, Sandy. The job to transform the files does not currently use software-defined assets -- it is the composition of two ops triggered by a sensor, and it runs on a new file each time. Translating the job into a software-defined asset with a dynamic partition for each file makes sense. Getting to the downstream asset is still a little fuzzy to me, as I'd like the downstream asset to be partitioned by the `category` metadata associated with the files, which is also generated dynamically. It's almost as if I want the first asset to build up a partition dictionary so that the second, downstream asset knows how the categories and files are linked and can aggregate them by category:
```
{'category1': ['file1', 'file2', 'file3'],
 'category2': ['file4'],
 ...}
```
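One way to get that mapping without materializing the dictionary explicitly is the second approach from the original question: encode `category` in the S3 key, and let a `category`-partitioned downstream asset list its own files. A rough sketch, where the bucket name and key layout are purely illustrative assumptions:

```python
import boto3
from dagster import DynamicPartitionsDefinition, OpExecutionContext, asset

category_partitions = DynamicPartitionsDefinition(name="category")

BUCKET = "my-data-bucket"  # illustrative


@asset(partitions_def=category_partitions)
def aggregated_by_category(context: OpExecutionContext) -> dict:
    # Assumes the transform job's IO manager writes objects under
    # s3://my-data-bucket/outputs/<category>/<file>, so the category -> files
    # mapping is recoverable from the key prefix alone.
    s3 = boto3.client("s3")
    prefix = f"outputs/{context.partition_key}/"
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    # Placeholder aggregation: just record which files fed this category.
    return {"category": context.partition_key, "files": keys}
```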