Hi I m working on NMT Neural Machine Translation I wanted to dagster #ask-community

Chitreddy Sairam

05/24/2023, 5:16 AM

Hi, I'm working on NMT (Neural Machine Translation). I want to use Dagster to automate things, from dataset creation through training. At present I am stuck on loading datasets. The input training data I receive comes in one of two formats:

1. Moses format - a folder with two text files, one with source-language sentences and the other with target-language sentences.
2. Excel format - an Excel file with multiple columns, e.g. English, Spanish, French, Italian.

If I receive Moses-format data, I do basic checks and upload the data to the cloud. If I receive Excel-format data, I have to convert it into multiple Moses-format datasets, e.g. English-Spanish, English-French. The languages in the Excel file vary from file to file.

In the code below:

1. load_moses - takes a configuration dictionary, performs checks, uploads the data to the cloud, and stores metadata in Mongo.
2. load_excel - takes a configuration dictionary, performs checks, converts the Excel data into multiple Moses datasets, and creates multiple Moses configurations that have to be passed to the load_moses graph above.
3. load - branches conditionally depending on whether the format is Excel or Moses.

I am stuck at the final stage: I want to call load_moses on each of the configuration dictionaries created, but it looks like I can't use loops inside a graph. It looks like something that could be solved with dynamic partitioning, but I am not sure how.

@graph
def load_moses(configuration):
    folder = get_moses_folder(configuration)
    zip_file_path = zip_folder(configuration,folder)
    s3_location = upload_asset_to_s3(zip_file_path=zip_file_path)
    mongo_id = upload_config_to_mongo(configuration=configuration,s3_location=s3_location)
    return mongo_id

@graph
def load_excel(configuration):
    xl_file_path = get_xl_file(configuration)
    zip_file_path = zip_folder(configuration,xl_file_path)
    s3_location = upload_asset_to_s3(zip_file_path=zip_file_path)
    xl_id = upload_config_to_mongo(configuration=configuration,s3_location=s3_location)
    moses_job_configs = convert_xl_to_moses(xl_file_path,xl_id,configuration)
    return moses_job_configs
   
@graph
def load():#load_dataset
    configuration = load_config()
    moses_config,excel_config = get_format(configuration)
    mongo_id = load_moses(moses_config)
    moses_job_configs = load_excel(excel_config)
    ## This is the place where I am facing issue
    [load_moses(config) for config in moses_job_configs]

owen

05/24/2023, 11:06 PM

Hi @Chitreddy Sairam! I think dynamic graphs are exactly what you want here. Your code is almost there: you'd just want to model load_excel as a dynamic op that yields one dynamic output per Moses job config (rather than a single output that's a list of job configs). Then you can call moses_job_configs.map(load_moses) to run load_moses once per dynamic output.

🙏 1

Chitreddy Sairam

05/26/2023, 7:45 AM

Hi @owen, thank you for the solution. I will explore dynamic graphs.
