Justin Taylor
06/05/2023, 4:39 PM
…category. I've been able to set up a dynamic partition for category, populate the partition using a separate software-defined asset, run the jobs with the correct partition keys, as well as add metadata to the output of those jobs via context.add_output_metadata.
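For reference, a minimal sketch of the setup described above, assuming a dynamic partitions definition named category; the asset names and the example metadata are illustrative assumptions, not the actual code:
```
from dagster import DynamicPartitionsDefinition, asset

# Dynamic partitions keyed by category (definition name assumed).
category_partitions = DynamicPartitionsDefinition(name="category")

@asset
def discover_categories(context):
    # Hypothetical population step; in practice the keys come from the data.
    new_categories = ["category1", "category2"]
    # Register the keys on the instance so partitioned runs can target them.
    context.instance.add_dynamic_partitions("category", new_categories)
    return new_categories

@asset(partitions_def=category_partitions)
def transformed_file(context):
    # One run per category partition; transform the file for this key.
    result = f"transformed data for {context.partition_key}"
    # Attach metadata to the output, as described above.
    context.add_output_metadata({"category": context.partition_key})
    return result
```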
I am thinking about defining a software-defined asset to do the aggregation step, since the output will essentially represent the endpoint that our data scientists will query by category, but I am not sure what the best approach is for aggregating the outputs of the jobs within the new SDA. I've thought of a couple of approaches so far:
1. In the new SDA, find the job outputs by partition or metadata using some method of the OpExecutionContext instance, or possibly the AssetSelection class?
2. Modify the IO manager for the job so that category is encoded within the AWS S3 keys, and use those keys within the new SDA to do the aggregation (rough sketches of both ideas follow below).
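Hedged sketches of both ideas; the upstream asset name transformed_file, the bucket layout, and the IO manager class are assumptions for illustration. For approach 1, the new SDA could query the event log through the instance for materializations of the upstream asset and group them by partition key:
```
from dagster import AssetKey, DagsterEventType, EventRecordsFilter, asset

@asset
def aggregated_by_category(context):
    # Fetch materialization events for the (assumed) upstream asset.
    records = context.instance.get_event_records(
        EventRecordsFilter(
            event_type=DagsterEventType.ASSET_MATERIALIZATION,
            asset_key=AssetKey("transformed_file"),
        )
    )
    by_category = {}
    for record in records:
        mat = record.event_log_entry.asset_materialization
        # Group output metadata by the dynamic partition key (the category).
        by_category.setdefault(mat.partition, []).append(dict(mat.metadata))
    return by_category
```
For approach 2, a custom IO manager could encode the partition key into the S3 object key so the aggregation step can address outputs per category:
```
import pickle

import boto3
from dagster import ConfigurableIOManager, InputContext, OutputContext

class CategoryS3IOManager(ConfigurableIOManager):
    s3_bucket: str
    s3_prefix: str = "outputs"  # assumed layout

    def _key(self, context) -> str:
        # e.g. outputs/<category>/<asset name>
        return f"{self.s3_prefix}/{context.asset_partition_key}/{'_'.join(context.asset_key.path)}"

    def handle_output(self, context: OutputContext, obj) -> None:
        boto3.client("s3").put_object(
            Bucket=self.s3_bucket, Key=self._key(context), Body=pickle.dumps(obj)
        )

    def load_input(self, context: InputContext):
        # Assumes the downstream input is also partitioned by category.
        obj = boto3.client("s3").get_object(
            Bucket=self.s3_bucket, Key=self._key(context)
        )
        return pickle.loads(obj["Body"].read())
```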
What is the best approach for this use case? I imagine there are other solutions that I haven't thought of. Also, are there any obvious problems with the architecture of this workflow?

sandy
06/06/2023, 3:24 PM
> I have a job that transforms the data in a single file and dumps the output into AWS S3 via an IO manager.
Does this job use software-defined assets? Does it run on the same file each time? A new file each time?
Justin Taylor
06/06/2023, 4:10 PM
…category metadata associated with the files, which is also generated dynamically. It's almost like I want the first asset to build up a partition dictionary so that the second, downstream asset knows how the categories and files are linked and can aggregate them by category:
{'category1': ['file1', 'file2', 'file3'],
'category2': ['file4'],
...
}
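A minimal sketch of that two-asset shape, assuming the first asset both builds the dictionary and registers the dynamic partition keys, and the second asset is partitioned by category; all names here are illustrative:
```
from dagster import DynamicPartitionsDefinition, asset

category_partitions = DynamicPartitionsDefinition(name="category")

@asset
def category_to_files(context) -> dict:
    # Hypothetical discovery step: derive the category -> files mapping
    # from the incoming data, then register the keys as partitions.
    mapping = {
        "category1": ["file1", "file2", "file3"],
        "category2": ["file4"],
    }
    context.instance.add_dynamic_partitions("category", list(mapping))
    return mapping

@asset(partitions_def=category_partitions)
def category_endpoint(context, category_to_files: dict):
    # Each partitioned run aggregates the files linked to one category.
    return category_to_files[context.partition_key]
```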