hey guys this is a bit of a repeat but am going through crea dagster #announcements

Eric

03/03/2020, 10:34 PM

hey guys, this is a bit of a repeat but am going through creating config for solids to wrap some of the pandas functions (`read_excel` and `read_csv`). For example, being able to define options for `read_csv` in a yaml file like this is great. Having the intellisense in Dagit with config makes it a really useful tool for beginners and others on the team developing pipelines:

solids:
  employees_csv:
    config:
      csv:
        header: true
        date_format: '%m/%d/%Y'
        sep: '|'
        # ...

However, this is something I keep running into repeatedly. One of the arguments to `read_csv` is `converters`, which is defined like this:

converters: dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

How would you represent an argument like this in a yaml file? Does it even belong in a yaml file, despite it being a kwarg like the rest of the args? If it doesn't belong in a yaml config file, isn't it strange that some arguments fit neatly in the config while others don't? I understand this is a bit of a loaded question, but what I'm getting at is: instead of yaml files, is there any reason python files should not be used to define the `environment_dict`? Using python dicts as the config instead of yaml would allow the creation of any config dictionary, including covering the case of the `converters` argument above. Thoughts?
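
To make the question concrete, here is a minimal sketch of why `converters` does not fit in a config file: a YAML or JSON document can only carry plain data (strings, numbers, lists, mappings), while a converter is a Python callable. The snippet uses the standard-library `json` module as a stand-in for YAML (an assumption, made so the example has no third-party dependency), and `is_plain_data` is a hypothetical helper, not part of dagster or pandas.

```python
import json


def remove_junk_char(s):
    # Example converter: swap a stray "@" for a comma.
    return s.replace("@", ",")


def is_plain_data(config):
    """Return True if `config` survives a round trip through a data-only format."""
    try:
        return json.loads(json.dumps(config)) == config
    except TypeError:
        # Raised when a value (such as a function) has no data representation.
        return False


# Options made of plain values fit in a config file.
plain_options = {"sep": "|", "date_format": "%m/%d/%Y"}

# A dict holding a function object does not.
options_with_callable = {"sep": "|", "converters": {"full name": remove_junk_char}}

print(is_plain_data(plain_options))          # True
print(is_plain_data(options_with_callable))  # False
```

A plain Python dict can hold the callable directly, which is the trade-off being asked about: more expressive, but no longer something a data-only file can describe.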

schrockn

03/03/2020, 10:37 PM

Can you give an example of the usage of converters in your codebase?

Eric

03/03/2020, 10:57 PM

sure, say we have an employees.csv that looks like this and is pipe separated:

employee id|first name|last name|full name
0001|Bob|Jones|Jones, Bob
0002|Suzi|James|James@ Suzi
0003|Wendy|Smith|Smith, Wendy
0004|Dave|Johnson|Johnson@ Dave

Note the garbage `@` characters in some of the names. To clean this up we use a converter when reading the csv for that column. The example looks like this:

import pandas as pd

def remove_junk_char(s):
    return s.replace("@", ",")

df = pd.read_csv("employees.csv", sep="|", converters = {"full name": remove_junk_char})

Which gives the cleaned up result of:

   employee id  first name  last name      full name
0            1         Bob      Jones     Jones, Bob
1            2        Suzi      James    James, Suzi
2            3       Wendy      Smith   Smith, Wendy
3            4        Dave    Johnson  Johnson, Dave
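
For readers without pandas at hand, the effect of `converters` can be approximated with the standard library. This is only a sketch of the idea (run a per-column function over every value in that column); `read_with_converters` is a hypothetical helper, not a pandas function, and unlike pandas it leaves `employee id` as a string.

```python
import csv
import io

SAMPLE = """employee id|first name|last name|full name
0001|Bob|Jones|Jones, Bob
0002|Suzi|James|James@ Suzi
0003|Wendy|Smith|Smith, Wendy
0004|Dave|Johnson|Johnson@ Dave
"""


def remove_junk_char(s):
    return s.replace("@", ",")


def read_with_converters(text, sep, converters):
    """Parse delimited text and run each column's converter on its values."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    rows = []
    for row in reader:
        for column, convert in converters.items():
            row[column] = convert(row[column])
        rows.append(row)
    return rows


rows = read_with_converters(SAMPLE, "|", {"full name": remove_junk_char})
for row in rows:
    print(row["employee id"], row["full name"])
```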

schrockn

03/03/2020, 10:58 PM

got it

Eric

03/03/2020, 11:01 PM

I don't think something like that is possible with a yaml file but would be perfectly fine with a python dict in the `environment_dict`, correct? If that's the case, I think it would (for us and our project) make more sense to use a python dict, since it might get messy defining some variables in yaml and others, like this case, in a python dict. yes?

schrockn

03/03/2020, 11:01 PM

at first blush, I think the most straightforward way of doing this is to customize the config. You want the config in the end to look something like (it depends on the universe of things you want to support)

solids:
  employees_csv:
    config:
      csv:
        header: true
        date_format: '%m/%d/%Y'
        sep: '|'
      convertors:
        - column: "full name"
          convertor:
            remove_junk_chars:

schrockn

03/03/2020, 11:02 PM

for one with args, i would strongly type the args:

solids:
  employees_csv:
    config:
      csv:
        header: true
        date_format: '%m/%d/%Y'
        sep: '|'
      convertors:
        - column: "full name"
          convertor:
            convertor_with_args:
              arg_one: "foo"

schrockn

03/03/2020, 11:03 PM

if all the convertors were column-based you could lead with column names which is probably a little more understandable for someone relying on the typeahead

Eric

03/03/2020, 11:03 PM

ahh, yes. I see.

Eric

03/03/2020, 11:10 PM

what would the definition look like in the config for converters (just high level)? all of the config I've done thus far has been a flat one to one but this seems like it would be its own "config type"? Something along the lines of a named tuple like the "DbInfo" type from the dagster examples.

Eric

03/03/2020, 11:10 PM

seems like a nested config I guess

schrockn

03/03/2020, 11:11 PM

yes nested config and Selectors
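
As a rough illustration of what "nested config and Selectors" implies for each list entry: a selector means exactly one of a fixed set of options is chosen, and each option carries its own typed arguments. The sketch below expresses that rule in plain Python. It is not dagster's API (in dagster the config system performs this validation from the declared schema), and `CONVERTOR_SCHEMA` and `validate_convertor_entry` are hypothetical names.

```python
# Hypothetical schema: convertor name -> set of argument names it accepts.
CONVERTOR_SCHEMA = {
    "remove_junk_chars": set(),
    "convertor_with_args": {"arg_one"},
}


def validate_convertor_entry(entry):
    """Check one list item of the form {"column": ..., "convertor": {name: args}}."""
    if set(entry) != {"column", "convertor"}:
        raise ValueError("expected exactly the keys 'column' and 'convertor'")
    choice = entry["convertor"]
    # Selector rule: exactly one of the known options may be chosen.
    if not isinstance(choice, dict) or len(choice) != 1:
        raise ValueError("choose exactly one convertor")
    ((name, args),) = choice.items()
    if name not in CONVERTOR_SCHEMA:
        raise ValueError(f"unknown convertor: {name}")
    args = args or {}  # a bare `name:` in YAML loads as None
    if set(args) != CONVERTOR_SCHEMA[name]:
        raise ValueError(f"unexpected arguments for {name}: {sorted(args)}")
    return name, args


entry = {"column": "full name", "convertor": {"convertor_with_args": {"arg_one": "foo"}}}
print(validate_convertor_entry(entry))
```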

Eric

03/03/2020, 11:12 PM

got it. I see now. this was really helpful. thanks nick !

schrockn

03/03/2020, 11:13 PM

feel free to chime in with more questions. this is a really interesting use case for config and it would be cool to showcase someone using it for (or base an example off of) something like this

Eric

03/05/2020, 12:13 AM

hey nick, I'm still tinkering with this. I'm having some trouble connecting the dots between how the config in the yaml file should evaluate. For example, if I have a yaml that looks like your example, or something like this:

convertors:
  - column: "full name"
    fn: "my_converter_func"
    convertor_with_args:
      arg_one: "foo"

I'm not sure I follow on how to express this in dagster. How would I take a string from a yaml file and treat it like a python function?

Eric

03/05/2020, 12:17 AM

come to think of it, I think you've already written this piece. It would essentially be the same thing as looking for the function name for defining the repo in `repository.yaml`

schrockn

03/05/2020, 12:18 AM

oh just within the solid you would consume the config

schrockn

03/05/2020, 12:18 AM

and call the appropriate function

schrockn

03/05/2020, 12:18 AM

for convertor in cfg['convertors']:
    if convertor['fn'] == 'convertor_with_args':
        # call the right thing
        pass

Eric

03/05/2020, 12:25 AM

will that `# call the right thing` "just work"? something like `context.solid_config["fn"]()`? To me that seems like I'm trying to execute a string instead of a function, but I need to treat the string as a function name. in a python file somewhere:

def my_cool_converter(s):
  return s.replace('@', ',')

in the yaml file

convertors:
  - column: "full name"
    fn: "my_cool_converter"
    convertor_with_args:
      arg_one: <the "full name" string for each row in a pandas df?>

perhaps I'm confusing myself here.

schrockn

03/05/2020, 12:29 AM

yeah i'm just saying in the body of the solid have code like:

for convertor in cfg['convertors']:
    # the name must match the 'fn' string used in the yaml file
    if convertor['fn'] == 'my_cool_converter':
        my_cool_converter(convertor['args']['s'])

schrockn

03/05/2020, 12:29 AM

so it is python code's job to take the verified config blob and call the correct function
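
One way to sketch "python code's job" without a growing chain of `if` statements is a registry dict that maps the names allowed in config to real functions. The names `CONVERTERS`, `build_converters`, and `replace_char`, and the `args` key, are assumptions for illustration; inside a solid, `cfg` would be the validated `context.solid_config`. Note that the value being converted never comes from the yaml file: pandas passes each cell to the function, so config only supplies the function name and any extra arguments.

```python
from functools import partial


def my_cool_converter(s):
    return s.replace("@", ",")


def replace_char(s, old, new):
    # Hypothetical converter that takes extra arguments from config.
    return s.replace(old, new)


# Registry: the only names a config file is allowed to refer to.
CONVERTERS = {
    "my_cool_converter": my_cool_converter,
    "replace_char": replace_char,
}


def build_converters(convertor_cfg):
    """Turn config entries into the dict that read_csv's `converters` expects."""
    converters = {}
    for entry in convertor_cfg:
        fn = CONVERTERS[entry["fn"]]  # KeyError for names not in the registry
        args = entry.get("args") or {}
        # Bind config-supplied arguments; the cell value is passed later, per row.
        converters[entry["column"]] = partial(fn, **args) if args else fn
    return converters


cfg = {
    "convertors": [
        {"column": "full name", "fn": "my_cool_converter"},
        {"column": "last name", "fn": "replace_char", "args": {"old": "J", "new": "j"}},
    ]
}

converters = build_converters(cfg["convertors"])
print(converters["full name"]("James@ Suzi"))  # James, Suzi
print(converters["last name"]("Jones"))        # jones
```

The resulting dict could then be handed to pandas as `pd.read_csv(path, sep="|", converters=converters)`. Looking names up in an explicit registry, rather than evaluating the string, keeps the set of callable functions under the pipeline author's control.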
