Eric
03/03/2020, 10:34 PM
read_excel and read_csv). For example, being able to define options for read_csv in a YAML file like this is great. Having IntelliSense for config in Dagit makes it a really useful tool for beginners and others on the team developing pipelines:
solids:
  employees_csv:
    config:
      csv:
        header: true
        date_format: '%m/%d/%Y'
        sep: '|'
        ...
However, this is something I keep running into. One of the arguments to read_csv is converters, which is defined like this:
converters: dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
How would you represent an argument like this in a YAML file? Does it even belong in a YAML file, despite being a kwarg like the rest of the args? If it doesn't belong in a YAML config file, isn't it strange that some arguments fit neatly in the config while others don't?
I understand these are somewhat loaded questions, but what I'm getting at is: is there any reason not to use Python files instead of YAML to define the environment_dict? Using Python dicts as the config would allow the creation of any config dictionary, including the converters argument above. Thoughts?
schrockn
03/03/2020, 10:37 PM
Eric
03/03/2020, 10:57 PM
employee id|first name|last name|full name
0001|Bob|Jones|Jones, Bob
0002|Suzi|James|James@ Suzi
0003|Wendy|Smith|Smith, Wendy
0004|Dave|Johnson|Johnson@ Dave
Note the garbage @ characters in some of the names.
To clean this up we use a converter when reading the csv for that column. The example looks like this:
import pandas as pd

def remove_junk_char(s):
    return s.replace("@", ",")

df = pd.read_csv("employees.csv", sep="|", converters={"full name": remove_junk_char})
Which gives the cleaned-up result:
   employee id first name last name      full name
0            1        Bob     Jones     Jones, Bob
1            2       Suzi     James    James, Suzi
2            3      Wendy     Smith   Smith, Wendy
3            4       Dave   Johnson  Johnson, Dave
schrockn
03/03/2020, 10:58 PM
Eric
03/03/2020, 11:01 PM
environment_dict, correct? If that's the case, I think it would (for us and our project) make more sense to use a Python dict, since it might get messy defining some variables in YAML and others, like this case, in a Python dict. Yes?
schrockn
03/03/2020, 11:01 PM
solids:
  employees_csv:
    config:
      csv:
        header: true
        date_format: '%m/%d/%Y'
        sep: '|'
        convertors:
          - column: "full name"
            convertor:
              remove_junk_chars:

solids:
  employees_csv:
    config:
      csv:
        header: true
        date_format: '%m/%d/%Y'
        sep: '|'
        convertors:
          - column: "full name"
            convertor:
              convertor_with_args:
                arg_one: "foo"
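One way to read the two config shapes above: each `convertor` entry is a one-key mapping whose key names a function and whose value carries that function's arguments, if any. The sketch below turns such entries into the `{column: callable}` dict that `read_csv` expects for `converters`. It assumes the YAML has already been parsed into a plain dict; `AVAILABLE_CONVERTORS`, `build_converters`, the behaviour of `convertor_with_args`, and the second column entry are all hypothetical, invented only for illustration.

```python
from functools import partial


def remove_junk_chars(value):
    """Replace the stray '@' with a comma."""
    return value.replace("@", ",")


def convertor_with_args(value, arg_one):
    """Hypothetical converter with one configured argument: append it as a suffix."""
    return f"{value}{arg_one}"


# Hypothetical lookup table from config names to Python functions.
AVAILABLE_CONVERTORS = {
    "remove_junk_chars": remove_junk_chars,
    "convertor_with_args": convertor_with_args,
}


def build_converters(csv_config):
    """Return {column: callable} from the 'convertors' list in the config."""
    converters = {}
    for entry in csv_config.get("convertors", []):
        # Each 'convertor' is a one-key mapping: {function name: args or None}.
        name, args = next(iter(entry["convertor"].items()))
        # Bind the configured arguments now; the cell value is supplied later.
        converters[entry["column"]] = partial(AVAILABLE_CONVERTORS[name], **(args or {}))
    return converters


# The dict a YAML parser would produce for the 'csv' section (assumed shape).
csv_config = {
    "sep": "|",
    "convertors": [
        {"column": "full name", "convertor": {"remove_junk_chars": None}},
        {"column": "last name", "convertor": {"convertor_with_args": {"arg_one": "foo"}}},
    ],
}

converters = build_converters(csv_config)
```

The resulting dict could then be passed along as `pd.read_csv(path, sep="|", converters=converters)`. Using `functools.partial` keeps the configured arguments separate from the per-cell value that the reader passes in.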
Eric
03/03/2020, 11:03 PM
schrockn
03/03/2020, 11:11 PM
Eric
03/03/2020, 11:12 PM
schrockn
03/03/2020, 11:13 PM
Eric
03/05/2020, 12:13 AM
convertors:
  - column: "full name"
    fn: "my_converter_func"
    convertor_with_args:
      arg_one: "foo"
I'm not sure I follow how to express this in Dagster. How would I take a string from a YAML file and treat it like a Python function?
repository.yaml
schrockn
03/05/2020, 12:18 AM
for convertor in cfg['convertors']:
    if convertor['fn'] == 'convertor_with_args':
        # call the right thing
        pass
Eric
03/05/2020, 12:25 AM
# call the right thing "just work"? Something like context.solid_config["fn"]()? To me that seems like I'm trying to execute a string instead of a function, but I need to treat the string as a function name.
In a Python file somewhere:
def my_cool_converter(s):
    return s.replace('@', ',')
In the YAML file:
convertors:
  - column: "full name"
    fn: "my_cool_converter"
    convertor_with_args:
      arg_one: <the "full name" string for each row in a pandas df?>
Perhaps I'm confusing myself here.
schrockn
03/05/2020, 12:29 AM
for convertor in cfg['convertors']:
    if convertor['fn'] == 'my_cool_converter':
        my_cool_converter(convertor['args']['s'])
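The if-chain above can also be written as a dictionary lookup, which may help with the "string versus function" confusion: the config only ever holds a name, and Python code maps that name to a real function. Note that pandas calls a converter once per cell with the cell's value, so the config does not need to supply the "full name" string itself. This is a minimal sketch; `CONVERTER_REGISTRY` and `resolve_converters` are hypothetical names, and `cfg` stands in for whatever dict the solid config provides.

```python
def my_cool_converter(s):
    return s.replace("@", ",")


# Hypothetical registry: the one place where config strings meet Python functions.
CONVERTER_REGISTRY = {"my_cool_converter": my_cool_converter}


def resolve_converters(cfg):
    """Map each configured column to the function named by its 'fn' string."""
    resolved = {}
    for convertor in cfg["convertors"]:
        fn_name = convertor["fn"]
        if fn_name not in CONVERTER_REGISTRY:
            # Fail loudly on typos instead of silently skipping the column.
            raise ValueError(f"unknown converter {fn_name!r}")
        resolved[convertor["column"]] = CONVERTER_REGISTRY[fn_name]
    return resolved


cfg = {"convertors": [{"column": "full name", "fn": "my_cool_converter"}]}
converters = resolve_converters(cfg)

# Simulate what the CSV reader does: call the converter once per cell value.
cleaned = [converters["full name"](cell) for cell in ["Jones, Bob", "James@ Suzi"]]
```

The `converters` dict has the `{column label: callable}` shape that `read_csv` accepts, so it can be passed through unchanged.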