# announcements
e
hey guys, this is a bit of a repeat but I'm working through creating config for solids that wrap some of the pandas functions (
read_excel
and
read_csv
). For example, being able to define options for
read_csv
in a yaml file like this is great. Having the intellisense in Dagit with config makes it a really useful tool for beginners and others on the team developing pipelines:
solids:
    employees_csv:
      config:
        csv:
          header: true
          date_format: '%m/%d/%Y'
          sep: '|'
          ...
However, this is something I keep running into repeatedly. One of the arguments to
read_csv
is
converters
which is defined like this:
converters: dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
How would you represent an argument like this in a yaml file? Does it even belong in a yaml file, despite being a kwarg like the rest of the args? If it doesn't belong in a yaml config file, isn't it strange that some arguments fit neatly in the config while others don't? I understand this is a bit of a loaded question, but what I'm getting at is: is there any reason not to use python files instead of yaml to define the
environment_dict
? Using python dicts as the config instead of yaml would allow the creation of any config dictionary, including the case of the
converters
argument above. Thoughts?
s
Can you give an example of the usage of converters in your codebase?
e
sure, say we have an employees.csv that looks like this and is pipe separated:
employee id|first name|last name|full name
0001|Bob|Jones|Jones, Bob
0002|Suzi|James|James@ Suzi
0003|Wendy|Smith|Smith, Wendy
0004|Dave|Johnson|Johnson@ Dave
Note some of the garbage
@
characters in some of the names. To clean this up we use a converter when reading the csv for that column. The example looks like this:
import pandas as pd

def remove_junk_char(s):
    return s.replace("@", ",")

df = pd.read_csv("employees.csv", sep="|", converters={"full name": remove_junk_char})
Which gives the cleaned up result of:
employee id	first name	last name	full name
0	1	Bob	Jones	Jones, Bob
1	2	Suzi	James	James, Suzi
2	3	Wendy	Smith	Smith, Wendy
3	4	Dave	Johnson	Johnson, Dave
s
got it
e
I don't think something like that is possible with a yaml file but would be perfectly fine with a python dict in the
environment_dict
correct? If that's the case, I think it would (for us and our project) make more sense to use a python dict everywhere, since it might get messy defining some variables in yaml and others, like this case, in a python dict. yes?
s
at first blush, I think the most straightforward way of doing this is to customize the config. You want the config in the end to look something like (it depends on the universe of things you want to support)
solids:
    employees_csv:
      config:
        csv:
          header: true
          date_format: '%m/%d/%Y'
          sep: '|'
        convertors:
          - column: "full name"
            convertor:
              remove_junk_chars:
for one with args, i would strongly type the args:
solids:
    employees_csv:
      config:
        csv:
          header: true
          date_format: '%m/%d/%Y'
          sep: '|'
        convertors:
          - column: "full name"
            convertor:
              convertor_with_args:
                arg_one: "foo"
if all the convertors were column-based you could lead with column names which is probably a little more understandable for someone relying on the typeahead
e
ahh, yes. I see.
what would the definition look like in the config for converters (just high level)? all of the config I've done thus far has been a flat one to one but this seems like it would be its own "config type"? Something along the lines of a named tuple like the "DbInfo" type from the dagster examples.
seems like a nested config I guess
s
yes nested config and Selectors
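For readers following along, here is a minimal plain-Python sketch of what "nested config and Selectors" implies for the shape above: each list item names a column, and its `convertor` block must pick exactly one option. This is not Dagster's config API; `parse_converter_entry` and the second entry on `"last name"` are hypothetical, used only to illustrate the shape.

```python
def parse_converter_entry(entry):
    """Validate one item of the `convertors` list.

    Returns (column, converter_name, converter_args).
    Hypothetical helper, not part of Dagster.
    """
    column = entry["column"]
    choices = entry["convertor"]
    # Selector semantics: exactly one option may be chosen per entry.
    if len(choices) != 1:
        raise ValueError(
            f"expected exactly one convertor for {column!r}, got {sorted(choices)}"
        )
    (name, args), = choices.items()
    # A key with no value (e.g. `remove_junk_chars:`) loads from YAML as None.
    return column, name, args or {}


# The YAML from the messages above, as it would look once loaded into Python.
config = {
    "convertors": [
        {"column": "full name", "convertor": {"remove_junk_chars": None}},
        # Illustrative second entry; the column choice is an assumption.
        {"column": "last name", "convertor": {"convertor_with_args": {"arg_one": "foo"}}},
    ]
}

parsed = [parse_converter_entry(e) for e in config["convertors"]]
```

In Dagster itself the config system would do this validation from the declared config types, so the solid body only ever sees a blob that already has this shape.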
e
got it. I see now. this was really helpful. thanks nick !
s
feel free to chime in with more questions. this is a really interesting use case for config and it would be cool to showcase someone using it for (or base an example off of) something like this
e
hey nick, I'm still tinkering with this. I'm having some trouble connecting the dots between how the config in the yaml file should evaluate. For example, if I have a yaml that looks like your example, or something like this:
convertors:
  - column: "full name"
    fn: "my_converter_func"
    convertor_with_args:
      arg_one: "foo"
I'm not sure I follow how to express this in dagster. How would I take a string from a yaml file and treat it like a python function?
come to think of it, I think you've already written this piece. It would essentially be the same thing as looking for the function name for defining the repo in
repository.yaml
s
oh just within the solid you would consume the config
and call the appropriate function
for convertor in cfg['convertors']:
    if convertor['fn'] == 'convertor_with_args':
       # call the right thing
e
will that
# call the right thing
"just work" ? something like
context.solid_config["fn"]()
? To me that seems like I'm trying to execute a string instead of a function? but I need to treat the string as a function name. in a python file somewhere
def my_cool_converter(s):
  return s.replace('@', ',')
in the yaml file
convertors:
  - column: "full name"
    fn: "my_cool_converter"
    convertor_with_args:
      arg_one: <the "full name" string for each row in a pandas df?>
perhaps I'm confusing myself here.
s
yeah i'm just saying in the body of the solid have code like:
for convertor in cfg['convertors']:
    if convertor['fn'] == 'my_cool_converter':
        my_cool_converter(convertor['args']['s'])
so it is python code's job to take the verified config blob and call the correct function
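One way to avoid a growing chain of `if` checks is a lookup table from config strings to functions. The sketch below is an assumption about how the solid body could be organised, not something from Dagster: `CONVERTER_REGISTRY`, `build_converters`, `replace_char`, and the `args` key are hypothetical names. Static arguments from the config are bound with `functools.partial`, so each resulting function still takes only the cell value.

```python
from functools import partial


def my_cool_converter(s):
    return s.replace("@", ",")


def replace_char(s, old, new):
    # Hypothetical converter that needs extra arguments from config.
    return s.replace(old, new)


# Maps the string used in the YAML file to the actual Python function.
CONVERTER_REGISTRY = {
    "my_cool_converter": my_cool_converter,
    "replace_char": replace_char,
}


def build_converters(cfg):
    """Turn the validated config blob into a {column: callable} dict."""
    converters = {}
    for entry in cfg["convertors"]:
        fn = CONVERTER_REGISTRY[entry["fn"]]  # KeyError on an unknown name
        args = entry.get("args") or {}
        # Bind config-supplied arguments; the cell value is passed later.
        converters[entry["column"]] = partial(fn, **args) if args else fn
    return converters


cfg = {
    "convertors": [
        {"column": "full name", "fn": "my_cool_converter"},
        {"column": "last name", "fn": "replace_char", "args": {"old": "-", "new": " "}},
    ]
}
converters = build_converters(cfg)
```

The result has the `{column: function}` shape that the `converters` argument of `read_csv` expects, so it could be passed along as `pd.read_csv("employees.csv", sep="|", converters=converters)`. pandas supplies each cell value to the function itself, which is why the per-row "full name" string never needs to appear in the yaml.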