# ask-community
c
Hi folks, I'm trying to wrap my mind around how to best leverage Dagster assets for my problem. I found some options and want to see if anyone has dealt with similar problems before and might have a solution. I'm trying to represent N folders on my filesystem in Dagster. The operations I want to run operate on each folder independently and produce new versions of the data in each folder. On the Dagster docs, a lot of the examples use <10 assets, and I'm curious how to scale when I've got >1000 folders in my ecosystem. A few of the options I've seen so far:
1. Partitioned asset -- maybe one single asset represents the root directory, and the individual folders are represented by partitions.
2. Asset factory -- as far as I understand, this wouldn't scale past 1000 folders (Dagster has only been tested to handle ~1000 assets, as far as I understand?).
Has anyone tried this before? Is there a better way to frame the problem?
b
Your intentions are not clear. Do you have folders on a local filesystem that you want to treat as pre-existing assets? Is there an actual scalability concern? Are you facing a performance or feasibility problem?
c
I guess I'm not clear on how to leverage Dagster assets to solve the problem. In plain terms, I want to apply a DAG of data transformations to N folders containing M files each. In terms of scalability, I'd plan to eventually run these data transformations on 100k folders via the same DAG. I'm unsure whether to treat each folder as an asset, or treat the group of them as a single asset partitioned by directory, or something similar.
Here's a more concrete example. Let's say I have a DAG: op1 -> op2. My directory structure is:
root/
   folder1/
       data.csv
       data1.csv
   folder2/
       data.csv
       data1.csv
Op1 is defined as: for each folder, read folder/data.csv, transform it, and output folder/op1_out.csv.
Op2 is defined as: for each folder, read folder/data1.csv and folder/op1_out.csv, perform a join, and output folder/op2_out.csv.
I essentially want to run this DAG on a set of N folders.
My question is really, how should I represent these folders in terms of dagster assets?
Maybe I shouldn't use the asset system at all and use the op/job orchestration system to access my assets in the external filesystem?
b
You can answer that yourself: do you need them as assets?
• Are these “folders” really assets for you, i.e. entities your organization identifies as unique?
• Do you care about each of them individually/categorically?
• Do you need to manage and maintain relations between the folders?
• Do you care about their hierarchy, or want to see their lineage at any given time?
If none of the answers is “yes”, then you probably just need workflows to manipulate datasets, and that's okay too. If they are not such “assets”, then the processing of each “folder” is a DAG-like workflow and would better fit under
@ops
c
Let me refer to the top-level “folders” as entities from here on out. Each folder contains some unique data specific to that entity. Generally, the data transformations run on these entities are independent, i.e. if I run my op on N entities, I get N results, one per entity. So, to answer those questions specifically:
• Each folder is indeed a unique entity.
• We would care about them individually (i.e. if an op is run on a particular folder and produces some output, we should know about that output independent of the other entities).
• There aren't usually relationships between the folders/entities.
• We do care about lineage: I'd want to trace any downstream data back to the original entity that produced it.
b
So treat the folder's CSV files as
@asset
(source) -->
@asset
(target). That way you'll be able to trace the freshness, success/failure, dependencies, and lineage of your folders. I'm not sure how you're naming your folder paths, but if they follow a naming pattern, that pattern can become your partition scheme.
c
Hmm, how does source -> target work? Is there more information about this relationship in the docs? I've read a little bit about source assets. Are you saying that I should define a source asset for each CSV file in a particular entity, and then use partitioned assets to include all of the entities with the same structure?
b
Yup, check out the docs and the examples they reference; run each example to see it in action. And yes, but I'm basing that on the names of the CSV files you mentioned: one had “out” in its name while the other didn't.
c
Right. In that example data flow, this might be my input (with any number of similar folders that contain unique data):
root/
  folder1/
    data.csv
    data1.csv
  folder2/
    data.csv
    data1.csv
With the above ops, op1 is:
for folder in root:
    data = open(folder/"data.csv")
    write(transform(data), folder/"op1_out.csv")
op2:
for folder in root:
    data1 = open(folder/"data1.csv")
    op1_out = open(folder/"op1_out.csv")
    write(join(data1, op1_out), folder/"op2_out.csv")
The execution graph would simply be:
op1()
op2()
and the filesystem would look like this after op1:
root/
  folder1/
    data.csv
    data1.csv
    op1_out.csv
  folder2/
    data.csv
    data1.csv
    op1_out.csv
and after op2:
root/
  folder1/
    data.csv
    data1.csv
    op1_out.csv
    op2_out.csv
  folder2/
    data.csv
    data1.csv
    op1_out.csv
    op2_out.csv
So your suggestion is: define a partitioned source asset for folder/data.csv and folder/data1.csv, and use those paths as the partition scheme?
b
Yes. So, for example, depending on your case, the folder names might be:
• all-data/2023/01/20/data_category1/data_subcategory2/data_00.csv
• all-data/2023/01/21/data_category1/data_subcategory2/data_00.csv
• all-data/2023/01/22/data_category1/data_subcategory2/data_00.csv
• all-data/2023/01/23/data_category1/data_subcategory2/data_00.csv
So then you can use daily partitions only, or a multi-partition scheme of daily + static, and have dates + categories injected into your asset definition, thereby tracking each partition independently. Even if you're materializing all at once, each “date”/“category” can still be processed in a separate process, independently and in parallel.
And when these files are located in cloud storage like S3/Azure Blob Storage/GCS, you can pull down data with high parallelism and process each sub-part simultaneously (depending on your server capacity).
c
gotcha, thanks!