Chaitya
01/25/2023, 6:12 PM
Binoy Shah
01/25/2023, 7:28 PM
Chaitya
01/25/2023, 7:57 PM
root/
  folder1/
    data.csv
    data1.csv
  folder2/
    data.csv
    data1.csv
op1 is defined as:
for each folder, read folder/data.csv,
transform it, and output folder/op1_out.csv
op2 is defined as:
for each folder, read folder/data1.csv
and folder/op1_out.csv,
perform a join, and output folder/op2_out.csv
I essentially want to run this DAG on a set of N folders.
Binoy Shah
01/26/2023, 2:31 AM
@ops
Chaitya
01/26/2023, 2:48 AM
Binoy Shah
01/26/2023, 2:54 AM
@asset (Source) --> @asset (Target)
That way you'll be able to trace the freshness, success/failure, dependencies, and lineage of your folders.
Not sure how you're naming your folder paths, but they can follow a naming pattern, and that pattern can become your partition scheme.
Chaitya
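One way to read that suggestion as code (a minimal stdlib sketch, not Dagster's actual partition API; `partition_key_for` is a hypothetical helper): derive the partition key from the folder name, assuming each partition folder sits directly under `root/`.

```python
from pathlib import PurePosixPath

def partition_key_for(path: str) -> str:
    """Hypothetical mapping from a file path to a partition key:
    the folder name directly under root/ identifies the partition."""
    return PurePosixPath(path).parts[1]

# Each folder becomes one independently trackable partition.
print(partition_key_for("root/folder1/data.csv"))   # folder1
print(partition_key_for("root/folder2/data1.csv"))  # folder2
```

In Dagster terms, the set of such keys would feed a static partitions definition, so each folder's materialization is tracked on its own.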
01/26/2023, 2:57 AM
Binoy Shah
01/26/2023, 3:00 AM
Chaitya
01/26/2023, 3:07 AM
root/
  folder1/
    data.csv
    data1.csv
  folder2/
    data.csv
    data1.csv
with the above ops:
op1:
    for folder in root:
        data = open(folder/"data.csv")
        write(transform(data), folder/"op1_out.csv")
op2:
    for folder in root:
        data1 = open(folder/"data1.csv")
        op1_out = open(folder/"op1_out.csv")
        write(join(data1, op1_out), folder/"op2_out.csv")
The execution graph would simply be:
op1()
op2()
and the filesystem would look like this after op1:
root/
  folder1/
    data.csv
    data1.csv
    op1_out.csv
  folder2/
    data.csv
    data1.csv
    op1_out.csv
and after op2:
root/
  folder1/
    data.csv
    data1.csv
    op1_out.csv
    op2_out.csv
  folder2/
    data.csv
    data1.csv
    op1_out.csv
    op2_out.csv
So your suggestion is:
define a partitioned source asset for folder/data.csv
and folder/data1.csv
and use those paths as the partition scheme?
Binoy Shah
01/26/2023, 8:04 PM
• all-data/2023/01/20/data_category1/data_subcategory2/data_00.csv
• all-data/2023/01/21/data_category1/data_subcategory2/data_00.csv
• all-data/2023/01/22/data_category1/data_subcategory2/data_00.csv
• all-data/2023/01/23/data_category1/data_subcategory2/data_00.csv
so then you can do DailyPartitions only, or a MultiPartition with Daily + Static, and have dates + categories injected into your asset definition, and thereby track each partition independently
if you're doing all at once, it can still process each "date"/"category" in a separate process, independently and in parallel
Chaitya
01/26/2023, 8:14 PM
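The daily + static scheme described above can be sketched in plain Python to show how the `all-data/YYYY/MM/DD/category/` layout maps to one key per date × category. The helper names here are hypothetical; in practice Dagster's own partition definitions would play this role.

```python
from datetime import date, timedelta
from itertools import product

def multi_partition_keys(start, end, categories):
    """One (date, category) key per independently trackable unit of work."""
    days = [(start + timedelta(n)).isoformat()
            for n in range((end - start).days + 1)]
    return [f"{d}|{c}" for d, c in product(days, categories)]

def path_for(key):
    # Hypothetical mapping from a partition key back to the folder layout.
    day, category = key.split("|")
    y, m, d = day.split("-")
    return f"all-data/{y}/{m}/{d}/{category}/data_00.csv"

keys = multi_partition_keys(date(2023, 1, 20), date(2023, 1, 23),
                            ["data_category1/data_subcategory2"])
# Four dates x one category = four partitions, each processable in parallel.
for k in keys:
    print(path_for(k))
```

Each key identifies exactly one of the `data_00.csv` paths listed in the thread, so each date/category pair can be materialized and tracked on its own.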