Hi, I'm working on NMT (Neural Machine Translation). I want to use Dagster to automate the whole pipeline, from dataset creation through training.
At present I am stuck on loading datasets.
The input training data I receive comes in one of two formats:
1. Moses format - a folder with 2 text files: one with source-language sentences and the other with target-language sentences
2. Excel format - an Excel file with multiple columns, e.g. English, Spanish, French, Italian etc.
If I receive Moses-format data, I do basic checks and upload the data to the cloud.
If I receive Excel-format data, I have to convert it into multiple Moses-format datasets, e.g. English-Spanish, English-French etc.
The languages in the Excel file vary from file to file.
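To make the Excel case concrete, here is a rough sketch of the conversion I have in mind. It is simplified and makes assumptions that may not hold for every file: the rows are already read into a list of dicts keyed by column name, English is always the source column, and rows with a missing cell are skipped. `split_into_moses_pairs` is a hypothetical helper, not part of my actual code.

```python
def split_into_moses_pairs(rows, source="English"):
    """Split multi-language rows into one (source, target) sentence list per pair.

    Assumption: every row is a dict keyed by language column name, and the
    source column is the same for every pair.
    """
    # Every column other than the source is treated as a target language.
    targets = [col for col in rows[0] if col != source]
    pairs = {}
    for target in targets:
        src_lines, tgt_lines = [], []
        for row in rows:
            src, tgt = row.get(source), row.get(target)
            if src and tgt:  # skip rows where either side is missing
                src_lines.append(src.strip())
                tgt_lines.append(tgt.strip())
        # Moses format keeps the two sides as parallel, line-aligned files.
        pairs[f"{source}-{target}"] = (src_lines, tgt_lines)
    return pairs


rows = [
    {"English": "Hello", "Spanish": "Hola", "French": "Bonjour"},
    {"English": "Thank you", "Spanish": "Gracias", "French": None},
]
moses_pairs = split_into_moses_pairs(rows)
```

Each entry in `moses_pairs` would then be written out as two line-aligned text files and get its own Moses configuration, so the number of resulting datasets is only known at run time.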
In the code below:
1. load_moses - takes a configuration dictionary, performs checks, uploads the data to the cloud and stores metadata in Mongo
2. load_excel - takes a configuration dictionary, performs checks, converts the Excel data into multiple Moses datasets and creates multiple Moses configurations, which have to be passed to the load_moses graph above
3. load - the top-level graph; depending on whether the format is Excel or Moses, it branches conditionally
I am stuck at the final stage: I want to call load_moses on each of the configuration dictionaries created by load_excel, but it looks like I can't use loops inside a graph. It looks like something that could be solved with dynamic partitioning, but I am not sure how.
@graph
def load_moses(configuration):
    folder = get_moses_folder(configuration)
    zip_file_path = zip_folder(configuration, folder)
    s3_location = upload_asset_to_s3(zip_file_path=zip_file_path)
    mongo_id = upload_config_to_mongo(configuration=configuration, s3_location=s3_location)
    return mongo_id
@graph
def load_excel(configuration):
    xl_file_path = get_xl_file(configuration)
    zip_file_path = zip_folder(configuration, xl_file_path)
    s3_location = upload_asset_to_s3(zip_file_path=zip_file_path)
    xl_id = upload_config_to_mongo(configuration=configuration, s3_location=s3_location)
    moses_job_configs = convert_xl_to_moses(xl_file_path, xl_id, configuration)
    return moses_job_configs
@graph
def load():  # load_dataset
    configuration = load_config()
    moses_config, excel_config = get_format(configuration)
    mongo_id = load_moses(moses_config)
    moses_job_configs = load_excel(excel_config)
    # This is where I am facing the issue: looping over the op output does not work inside a graph
    [load_moses(config) for config in moses_job_configs]