Jesper Bagge

10/25/2022, 6:55 AM
Hi all! I’m looking for good design patterns for ingesting data from several files in the same S3 bucket into a data warehouse. The only thing I need to do in my Software-Defined Asset is load the file into a pandas DataFrame and let my IOManager do the rest. It feels weird and against all DRY principles to write 20 functions that are identical in all but their names. Would it be possible to design an SDA factory and instantiate the assets through job/sensor definitions?
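For illustration, one such asset might look like the minimal sketch below (the bucket name, key, and the boto3-backed s3 resource from dagster-aws are assumptions); the problem is repeating it 20 times with only the name changed:

import pandas as pd
from dagster import asset

# hypothetical single-file asset; the "s3" resource is assumed to be
# the boto3-client-backed s3_resource from dagster-aws
@asset(required_resource_keys={"s3"})
def my_file_one(context) -> pd.DataFrame:
    obj = context.resources.s3.get_object(Bucket="my-bucket", Key="my_file_one.csv")
    return pd.read_csv(obj["Body"])  # the IOManager writes the DataFrame to the warehouse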

jamie

10/25/2022, 2:29 PM
Hi @Jesper Bagge, an asset factory would look something like this:
from dagster import asset

list_of_files = ["my_file_one", "my_file_two", ... ]
all_assets = []

for file in list_of_files:
    @asset(
        name=file,
        # other params as needed
    )
    def file_asset():
        # read `file` from s3 and make a df
        # (note: `file` is captured by closure here, so bind it explicitly,
        #  e.g. via a default argument or a helper factory, before using it)
        df = ...
        return df

    all_assets.append(file_asset)
Where you would run into trouble is instantiating the assets from a sensor or job. The issue is that the assets need to be in the Dagster repository in order to be picked up by dagit or materialized, and a sensor or job can’t directly add new assets to the repository. There might be some kind of hacky workaround with global variables and manually reloading the repository to pick up new assets, but I’m not sure whether that would work or whether it would introduce more problems.
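Getting the generated assets into the repository would look something like this minimal sketch (assuming the all_assets list built in the loop above):

from dagster import repository

@repository
def my_repository():
    # everything returned here is what dagit can see and materialize
    return [*all_assets]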

Jesper Bagge

10/26/2022, 9:36 AM
Thanks @jamie! I managed to work around the instantiation issue by keeping the list of files in my repository.py. I did an asset factory like:
from dagster import asset

def make_asset(name):
    @asset(name=name)
    def file_asset(context):
        # load the file from s3 into a DataFrame
        data = ...
        return data

    return file_asset
I then called this from the repository, looped over the same list to create jobs, and finally wrote a similar factory for sensors that I could also call from the repository.
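Putting it together, the repository.py wiring might look roughly like this sketch (file names, job/sensor names, and the unconditional RunRequest are placeholders, and make_asset is the factory above):

from dagster import (
    AssetSelection,
    RunRequest,
    define_asset_job,
    repository,
    sensor,
)

FILES = ["my_file_one", "my_file_two"]  # hypothetical list of files in the bucket

def make_job(name):
    # one job per asset, selecting that single asset by key
    return define_asset_job(name=f"{name}_job", selection=AssetSelection.keys(name))

def make_sensor(name, job):
    @sensor(name=f"{name}_sensor", job=job)
    def file_sensor(context):
        # placeholder tick: a real sensor would check S3 and only yield
        # a RunRequest when there is a new file to ingest
        yield RunRequest(run_key=None)

    return file_sensor

@repository
def my_repository():
    assets = [make_asset(name) for name in FILES]
    jobs = [make_job(name) for name in FILES]
    sensors = [make_sensor(name, job) for name, job in zip(FILES, jobs)]
    return [*assets, *jobs, *sensors]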