# announcements

Eric

02/19/2020, 10:01 PM
I'm playing around with creating a CSV reader like the one shown in the Dagster tutorial, except using the pandas `read_csv` function. The "Parametrizing solids with config" section of the tutorial (https://dagster.readthedocs.io/en/0.7.0/sections/tutorial/config.html) explains:
> We obviously don't want to have to write a separate solid for each permutation of these parameters that we use in our pipelines – especially because, in more realistic cases like configuring a Spark job or even parametrizing the read_csv function from a popular package like Pandas, we might have dozens or hundreds of parameters like these.
>
> But hoisting all of these parameters into the signature of the solid function as inputs isn't the right answer either:
>
> ...
>
> The solution is to define a config schema for our solid:

```python
import csv

from dagster import Bool, Field, Int, String, solid


@solid(
    config={
        'delimiter': Field(
            String,
            default_value=',',
            is_required=False,
            description=('A one-character string used to separate fields.'),
        ),
        'doublequote': Field(
            Bool,
            default_value=False,
            is_required=False,
            description=(
                'Controls how instances of quotechar appearing inside a field '
                'should themselves be quoted. When True, the character is '
                'doubled. When False, the escapechar is used as a prefix to '
                'the quotechar.'
            ),
        ),
        'escapechar': Field(
            String,
            default_value='\\',
            is_required=False,
            description=(
                'On reading, the escapechar removes any special meaning from '
                'the following character.'
            ),
        ),
        'quotechar': Field(
            String,
            default_value='"',
            is_required=False,
            description=(
                'A one-character string used to quote fields containing '
                'special characters, such as the delimiter or quotechar, '
                'or which contain new-line characters.'
            ),
        ),
        'quoting': Field(
            Int,
            default_value=csv.QUOTE_MINIMAL,
            is_required=False,
            description=(
                'Controls when quotes should be generated by the writer and '
                'recognised by the reader. It can take on any of the '
                'csv.QUOTE_* constants'
            ),
        ),
        'skipinitialspace': Field(
            Bool,
            default_value=False,
            is_required=False,
            description=(
                'When True, whitespace immediately following the delimiter '
                'is ignored. The default is False.'
            ),
        ),
        'strict': Field(
            Bool,
            default_value=False,
            is_required=False,
            description=('When True, raise exception on bad CSV input.'),
        ),
    }
)
def read_csv(context, csv_path: str):
    with open(csv_path, 'r') as fd:
        lines = [
            row
            for row in csv.DictReader(
                fd,
                delimiter=context.solid_config['delimiter'],
                doublequote=context.solid_config['doublequote'],
                escapechar=context.solid_config['escapechar'],
                quotechar=context.solid_config['quotechar'],
                quoting=context.solid_config['quoting'],
                skipinitialspace=context.solid_config['skipinitialspace'],
                strict=context.solid_config['strict'],
            )
        ]

    context.log.info('Read {n_lines} lines'.format(n_lines=len(lines)))

    return lines
```
This example seems to work great, but what about parameters with multiple types? For example, the `index_col` keyword argument of pandas `read_csv` takes several types (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
```
index_col : int, str, sequence of int / str, or False, default None
```
Is using the `Any` type really appropriate for these cases?

max

02/19/2020, 10:03 PM
yep, i was looking for a good familiar example of a function with too many parameters 😉
very interested in your thoughts on what feels right here
one thing you can do for multiply-typed inputs, in general, is use a `Selector`, though that might be too clunky for something like this
i'm interested to know what it feels like you should be able to write in a case like this
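
(For concreteness, here is a rough sketch of what a `Selector`-based field for `index_col` could look like against the 0.7-era config API. The branch names `by_position` and `by_name` are invented for illustration, and the unpacking idiom is just one way to consume the selected branch.)

```python
from dagster import Field, Int, Selector, String, solid


@solid(
    config={
        # The run config must choose exactly one branch of the Selector,
        # e.g. {'index_col': {'by_name': 'country'}}.
        'index_col': Field(
            Selector({'by_position': Field(Int), 'by_name': Field(String)}),
            is_required=False,
        ),
    }
)
def read_csv_indexed(context, csv_path: str):
    index_col = None
    if 'index_col' in context.solid_config:
        # Unpack whichever branch was selected in the run config.
        ((_branch, index_col),) = context.solid_config['index_col'].items()
    # ... pass index_col through to the underlying reader ...
```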

Eric

02/19/2020, 10:06 PM
This was discussed previously; that approach doesn't seem any better, since you would still have to duplicate all the arguments in the nested function. But that aside, I was curious about how to handle arguments with multiple types, like many of the pandas keywords for `read_csv` and `read_excel`.

max

02/19/2020, 10:07 PM
it's true that writing the strongly typed schema is a bit of a hassle, but then the library solid can be reused however many times -- if you're reading a csv more than once, it's probably overall a win
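
(As an illustration of that reuse, something like the following should work with the 0.7-era `environment_dict` shape, assuming the `read_csv` solid from the tutorial excerpt above is in scope. The pipeline name, delimiter, and `cereal.csv` path are placeholders.)

```python
from dagster import execute_pipeline, pipeline


@pipeline
def csv_pipeline():
    read_csv()


# The same library solid, driven by run config rather than code changes.
execute_pipeline(
    csv_pipeline,
    environment_dict={
        'solids': {
            'read_csv': {
                'config': {'delimiter': '|', 'skipinitialspace': True},
                'inputs': {'csv_path': {'value': 'cereal.csv'}},
            }
        }
    },
)
```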

Eric

02/19/2020, 10:10 PM
agreed. and I completely understand the need for having to define them in the first place, but I'm curious how this would play out if, as the tutorial mentions, you had the task of replicating all of the Spark APIs and their parameters
max

but it is definitely a tension
there are other places where we have deviated from the underlying APIs in search of concision and ease of use

Eric

02/19/2020, 10:11 PM
interesting. could the same be applied to arbitrary functions to turn them into solids with a wrapper?

max

02/19/2020, 10:12 PM
definitely -- we have seen some proposals to do that and expect that people will write tools for this purpose
there are some thorny issues that are hard to resolve in the general case in a way that feels straightforwardly correct
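
(To make one of those thorny issues concrete: a naive wrapper can hoist a function's defaulted keyword arguments into an `Any`-typed config schema, but it throws away type information in the process. Everything below, including the `solidify` name, is hypothetical.)

```python
import inspect

from dagster import Any, Field, solid


def solidify(fn):
    # Hypothetical helper: expose each defaulted keyword argument of
    # `fn` as an optional, Any-typed config field. Required positional
    # arguments are not handled at all, and meaningful defaults lose
    # their types, which is part of why the general case is hard.
    config = {
        name: Field(Any, is_required=False)
        for name, param in inspect.signature(fn).parameters.items()
        if param.default is not inspect.Parameter.empty
    }

    @solid(name=fn.__name__, config=config)
    def _wrapped(context):
        return fn(**context.solid_config)

    return _wrapped
```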

Eric

02/19/2020, 10:14 PM
yep, basically this ^ is exactly what I was getting at.
it seems the issue is imposing a typed schema on parameters that could have multiple types.

max

02/19/2020, 10:19 PM
narrowly, for this particular problem, you can either call those params `Any`, or we do have the `Selector` facility
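
(The `Any` route is the low-ceremony version. A minimal sketch; the solid name and the pandas call are invented for illustration.)

```python
import pandas as pd

from dagster import Any, Field, solid


@solid(
    config={
        # Any accepts an int, str, list, or bool here and defers all
        # validation of the value to pandas itself.
        'index_col': Field(Any, is_required=False),
    }
)
def read_csv_any(context, csv_path: str):
    return pd.read_csv(csv_path, index_col=context.solid_config.get('index_col'))
```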

Eric

02/19/2020, 10:20 PM
so like you mentioned, if you wanted to be specific, perhaps the `Selector` would help out, or the `Any` type.

max

02/19/2020, 10:20 PM
i'd be interested in how, in a perfect world, you'd like to be able to represent that multiple typing

Eric

02/19/2020, 10:20 PM
ya, ok. apologies for the duplicative and lengthy post. mostly just thinking out loud.

max

02/19/2020, 10:21 PM
yep np

schrockn

02/19/2020, 10:21 PM
this post is great

max

02/19/2020, 10:21 PM
many eyes

schrockn

02/19/2020, 10:21 PM
exactly the type of thing we want to hash out with user feedback and questions
@Eric one point here is that the config system is totally opt-in, and there's nothing that prevents you from writing a solid that takes in a file path and just executes arbitrary code to load a dataframe, using whatever APIs you see fit
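
(A minimal sketch of that opt-out; the solid name and the `index_col=0` argument are arbitrary choices for illustration.)

```python
import pandas as pd

from dagster import solid


@solid
def load_dataframe(_context, csv_path: str):
    # No config schema at all: the solid just calls pandas directly
    # with whatever arguments this particular pipeline needs.
    return pd.read_csv(csv_path, index_col=0)
```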

Eric

02/19/2020, 10:24 PM
Something like this would be great from an end-user-experience standpoint. Obviously, a bit more involved implementation-wise.
```python
'index_col': Field(
    [int, str, Array],
    default_value=(1, '1', []),
    is_required=False,
    description=(
        'Column(s) to use as the row labels of the DataFrame, either '
        'given as string name or column index. If a sequence of '
        'int / str is given, a MultiIndex is used.'
    ),
),
```
@schrockn you're absolutely right. This is largely an attempt to get all my ducks in a row before implementation. I think Dagster is a great framework that's maturing quickly and has been a missing link from this space for far too long.

schrockn

02/19/2020, 10:34 PM
that’s great to hear
and we are still working out exact best practices for what to put in config and what to put in code, so thanks for bearing with us on this!