# ask-community
j
Hi everyone, I'm seeking advice on how to structure a workflow. I have a job that transforms the data in a single file and dumps the output into AWS S3 via an IO manager. The job is similar to the example in the "Without software-defined assets" section here: https://docs.dagster.io/guides/dagster/enriching-with-software-defined-assets#not-all-ops-produce-assets

On a regular basis, I want to aggregate the job outputs by some metadata that rides with the files -- I'll call this variable `category`. I've been able to set up a dynamic partition for `category`, populate the partition using a separate software-defined asset, run the jobs with the correct partition keys, and add metadata to the output of those jobs via `context.add_output_metadata`.

I am thinking about defining a software-defined asset to do the aggregation step, since the output will essentially represent the endpoint that our data scientists will query by `category`, but I am not sure what the best approach is for aggregating the outputs of the jobs within the new SDA. I've thought of a couple of approaches so far:

1. In the new SDA, find the job outputs by partition or metadata using some method of the `OpExecutionContext` instance, or possibly the `AssetSelection` class?
2. Modify the IO manager for the job so that `category` is encoded within the AWS S3 keys, and use those S3 keys within the new SDA to do the aggregation.

What is the best approach for this use case? I imagine there are other solutions that I haven't thought of. Also, are there any obvious problems with the architecture of this workflow?
s
> I have a job that transforms the data in a single file and dumps the output into AWS S3 via an IO manager.
Does this job use software-defined assets? Does it run on the same file each time? A new file each time?
If it's a different file each time, the typical way to model that use case with software-defined assets would be to have a dynamically partitioned asset with a partition for each file.
For the downstream asset, the easiest thing to do would be to make it a non-partitioned asset that depends on all partitions of the upstream asset (see the sketch below).
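A minimal sketch of that modeling, assuming one dynamic partition per file and an IO manager (such as the built-in UPath-based filesystem/S3 IO managers) that loads all upstream partitions into the downstream asset as a dict keyed by partition. The asset and partition names are illustrative.

```python
from dagster import AssetIn, DynamicPartitionsDefinition, OpExecutionContext, asset

# One partition per incoming file (illustrative names throughout).
files_partitions = DynamicPartitionsDefinition(name="files")


@asset(partitions_def=files_partitions)
def transformed_file(context: OpExecutionContext) -> dict:
    # Placeholder transform of a single file, keyed by its partition (file name).
    return {"file": context.partition_key, "rows": []}


@asset(ins={"transformed_file": AssetIn()})
def aggregated_output(transformed_file: dict) -> dict:
    # Non-partitioned downstream asset: with an IO manager that supports
    # multi-partition loads, `transformed_file` arrives as a dict mapping
    # partition keys (file names) to their stored values.
    return {"n_files": len(transformed_file)}
```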
j
Thanks for following up, Sandy. The job to transform the files does not currently use software-defined assets -- it is the composition of two ops triggered by a sensor, and it runs on a new file each time. Translating the job into a software-defined asset with a dynamic partition for each file makes sense. Getting to the downstream asset is still a little fuzzy to me, as I'd like the downstream asset to be partitioned by the `category` metadata associated with the files, which is also generated dynamically. It's almost as if I want the first asset to build up a partition dictionary so that the second, downstream asset knows how the categories and files are linked and can aggregate them by category:
```
{'category1': ['file1', 'file2', 'file3'],
 'category2': ['file4'],
 ...}
```
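One way to get that mapping without materializing the dictionary explicitly is the second approach from the original question: encode `category` in the S3 key, and let a `category`-partitioned downstream asset list its own files. A rough sketch, where the bucket name and key layout are purely illustrative assumptions:

```python
import boto3
from dagster import DynamicPartitionsDefinition, OpExecutionContext, asset

category_partitions = DynamicPartitionsDefinition(name="category")

BUCKET = "my-data-bucket"  # illustrative


@asset(partitions_def=category_partitions)
def aggregated_by_category(context: OpExecutionContext) -> dict:
    # Assumes the transform job's IO manager writes objects under
    # s3://my-data-bucket/outputs/<category>/<file>, so the category -> files
    # mapping is recoverable from the key prefix alone.
    s3 = boto3.client("s3")
    prefix = f"outputs/{context.partition_key}/"
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    # Placeholder aggregation: just record which files fed this category.
    return {"category": context.partition_key, "files": keys}
```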