Eric
02/19/2020, 10:01 PM
We obviously don't want to have to write a separate solid for each permutation of these parameters that we use in our pipelines – especially because, in more realistic cases, like configuring a Spark job or parametrizing the read_csv function from a popular package like pandas, we might have dozens or hundreds of parameters like these.
But hoisting all of these parameters into the signature of the solid function as inputs isn't the right answer either:
...
The solution is to define a config schema for our solid:
import csv

from dagster import Bool, Field, Int, String, solid


@solid(
    config={
        'delimiter': Field(
            String,
            default_value=',',
            is_required=False,
            description='A one-character string used to separate fields.',
        ),
        'doublequote': Field(
            Bool,
            default_value=False,
            is_required=False,
            description=(
                'Controls how instances of quotechar appearing inside a field '
                'should themselves be quoted. When True, the character is '
                'doubled. When False, the escapechar is used as a prefix to '
                'the quotechar.'
            ),
        ),
        'escapechar': Field(
            String,
            default_value='\\',
            is_required=False,
            description=(
                'On reading, the escapechar removes any special meaning from '
                'the following character.'
            ),
        ),
        'quotechar': Field(
            String,
            default_value='"',
            is_required=False,
            description=(
                'A one-character string used to quote fields containing '
                'special characters, such as the delimiter or quotechar, '
                'or which contain new-line characters.'
            ),
        ),
        'quoting': Field(
            Int,
            default_value=csv.QUOTE_MINIMAL,
            is_required=False,
            description=(
                'Controls when quotes should be generated by the writer and '
                'recognised by the reader. It can take on any of the '
                'csv.QUOTE_* constants.'
            ),
        ),
        'skipinitialspace': Field(
            Bool,
            default_value=False,
            is_required=False,
            description=(
                'When True, whitespace immediately following the delimiter '
                'is ignored. The default is False.'
            ),
        ),
        'strict': Field(
            Bool,
            default_value=False,
            is_required=False,
            description='When True, raise exception on bad CSV input.',
        ),
    }
)
def read_csv(context, csv_path: str):
    with open(csv_path, 'r') as fd:
        lines = [
            row
            for row in csv.DictReader(
                fd,
                delimiter=context.solid_config['delimiter'],
                doublequote=context.solid_config['doublequote'],
                escapechar=context.solid_config['escapechar'],
                quotechar=context.solid_config['quotechar'],
                quoting=context.solid_config['quoting'],
                skipinitialspace=context.solid_config['skipinitialspace'],
                strict=context.solid_config['strict'],
            )
        ]
    context.log.info('Read {n_lines} lines'.format(n_lines=len(lines)))
    return lines
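For reference, a run config supplying some of these fields might look like the following sketch. The nesting under `solids:` follows Dagster's standard run-config layout; the solid name `read_csv` comes from the code above, while the file name `cereal.csv` is just an illustrative placeholder:

```yaml
solids:
  read_csv:
    config:
      delimiter: '|'
      skipinitialspace: true
    inputs:
      csv_path:
        value: 'cereal.csv'
```

Any field omitted here (doublequote, quoting, etc.) falls back to the default_value declared in its Field, since each is marked is_required=False.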
This example seems to work great, but what about parameters with multiple types? For example, the index_col keyword arg of the pandas read_csv has multiple types:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

index_col : int, str, sequence of int / str, or False, default None

Is using the Any type really appropriate for these cases?

max
02/19/2020, 10:03 PM
Selector
Eric
02/19/2020, 10:06 PM

max
02/19/2020, 10:07 PM

Eric
02/19/2020, 10:10 PM

max
02/19/2020, 10:10 PM

Eric
02/19/2020, 10:11 PM

max
02/19/2020, 10:12 PM

Eric
02/19/2020, 10:14 PM

max
02/19/2020, 10:19 PM
Any, or we do have the Selector facility

Eric
02/19/2020, 10:20 PM
Selector would help out, or the Any type.

max
02/19/2020, 10:20 PM

Eric
02/19/2020, 10:20 PM

max
02/19/2020, 10:21 PM

schrockn
02/19/2020, 10:21 PM

max
02/19/2020, 10:21 PM

schrockn
02/19/2020, 10:21 PM

Eric
02/19/2020, 10:24 PM
02/19/2020, 10:24 PM'index_col': Field(
[int, str, Array],
default_value=(1, '1', []),
is_required=False,
description=(
'Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.'),
),
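To make the Selector idea concrete: a selector is a one-of choice, where the user picks exactly one named variant and supplies a value of that variant's type. The following is a minimal plain-Python sketch of that resolution logic for index_col; the variant names ('int_col', 'str_col', 'col_list', 'no_index') are hypothetical illustrations, not Dagster API:

```python
def resolve_index_col(selector_config):
    """Resolve a one-of ("selector") config dict to an index_col value.

    Exactly one variant key must be present, mirroring how a selector
    schema validates its input.
    """
    if len(selector_config) != 1:
        raise ValueError('Exactly one variant must be selected')
    variant, value = next(iter(selector_config.items()))
    if variant == 'int_col':
        return int(value)      # a single column index
    if variant == 'str_col':
        return str(value)      # a single column name
    if variant == 'col_list':
        return list(value)     # a sequence of int / str -> MultiIndex
    if variant == 'no_index':
        return False           # explicitly no index column
    raise ValueError('Unknown variant: {}'.format(variant))


print(resolve_index_col({'int_col': 0}))           # 0
print(resolve_index_col({'col_list': [0, 'id']}))  # [0, 'id']
```

Because each variant is named and typed separately, the schema stays self-documenting, unlike an Any field, where the permitted shapes live only in the description string.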
schrockn
02/19/2020, 10:34 PM