Daniel Michaelis
09/06/2022, 8:37 AM

```python
@multi_asset(
    config_schema={
        "train_size": Field(float, is_required=True),
        "test_size": Field(float, is_required=True),
    },
    ins={"df": AssetIn(dagster_type=DataFrame, key="df")},
    outs={
        "df_train": Out(dagster_type=DataFrame),
        "df_test": Out(dagster_type=DataFrame),
        "df_validation": Out(dagster_type=DataFrame),
    },
)
def train_test_validation_split(
    context, df
) -> Tuple[Output[DataFrame], Output[DataFrame], Output[DataFrame]]:
    ...
```
The train and test size should be required such that 0 < train_size < train_size + test_size <= 1 (the validation set is optional; its size is 1 - train_size - test_size). I would expect Dagster / Dagit to raise an error if someone tries to materialize the assets without defining the required config parameters, and to show the correct config parameters and types in the description of the assets and the op. However, the description doesn't show the parameters, it lists the config as "Any", and the materialization can be started without setting the parameters, resulting in an error when the op tries to access the parameters from the context.
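(For reference, a plain-Python sketch of the split the op is meant to perform, independent of Dagster; the function name and the list-based input are illustrative, real code would slice a pandas DataFrame:)

```python
# Illustrative stand-in for the op body: validates the sizes and splits a
# sequence three ways. The validation part is whatever remains after train/test.
def three_way_split(rows, train_size, test_size):
    if not (0 < train_size < train_size + test_size <= 1):
        raise ValueError("need 0 < train_size < train_size + test_size <= 1")
    n_train = int(len(rows) * train_size)
    n_test = int(len(rows) * test_size)
    return (
        rows[:n_train],                  # train
        rows[n_train:n_train + n_test],  # test
        rows[n_train + n_test:],         # validation (may be empty)
    )
```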
1. Am I missing some difference in how assets / multi-assets should be configured as opposed to ops? How can I enforce that the parameters are required and shown correctly?
2. This is less important but could be helpful too: would it be possible to define checks for valid configuration (e.g. in this case assert 0 < train_size < train_size + test_size <= 1) which are run during the initialization of a run instead of inside the respective op, so that an invalid configuration is detected as early as possible instead of running all upstream ops first and then failing in the middle of the run?

yuhan
09/06/2022, 11:27 PM

You can define a `type_check_fn` on `DagsterType` and supply that to `ins` or `out`. Here's an example: https://docs.dagster.io/concepts/types#defining-a-dagster-type. Besides, we also have a `dagster-pandera` integration which allows you to specify finer-grained constraints on data frames. Here's a guide: https://docs.dagster.io/integrations/pandera

yuhan
09/06/2022, 11:49 PM

Daniel Michaelis
09/07/2022, 12:13 PM

A `type_check_fn` on `DagsterType` is indeed very helpful for ins/outs. However, I was wondering if this sort of check function could also be applied to config parameters. I gave it a try and defined a parameter in the config schema with a self-defined `DagsterType` with a `type_check_fn`, but this doesn't seem to work, as only a limited set of types is supported for config parameters.
So coming back to the multi-asset example, that could mean defining a `config_check_fn` that asserts that 0 < train_size < train_size + test_size <= 1, and based on that it would be nice to have the following behavior:

```yaml
ops:
  train_test_validation_split:
    config:
      train_size: 0.7
      test_size: 0.2
```
results in no error and the materialization can be started, while

```yaml
ops:
  train_test_validation_split:
    config:
      train_size: 0
      test_size: 0.2
```
results in an error for invalid config and disables the "Launch Run" or "Materialize" button. This would also mean that the check for valid config can be isolated from the business logic in the op/asset, and errors are detected as early as possible.
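(To make the wish concrete, here is a self-contained mock of what such a hook could look like; `config_check_fn` and everything else below is hypothetical, not a real Dagster API:)

```python
# Hypothetical sketch only: nothing here is a real Dagster API. It mocks a
# decorator that runs a config check before the op body, so invalid config
# fails before any business logic executes.
def config_check_fn(check):
    def wrap(op_fn):
        def inner(config, *args, **kwargs):
            check(config)  # fail fast on bad config
            return op_fn(config, *args, **kwargs)
        return inner
    return wrap

def valid_split_config(config):
    train, test = config["train_size"], config["test_size"]
    if not (0 < train < train + test <= 1):
        raise ValueError("need 0 < train_size < train_size + test_size <= 1")

@config_check_fn(valid_split_config)
def split_op(config):
    # stand-in for the actual split logic
    return "split done"
```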
I'm not sure if this is technically doable on your side, especially when jobs contain dynamic steps or config mappings, but if it's possible, I think it would be a nice feature to have.

yuhan
09/08/2022, 1:18 AM

Two workarounds:
• pass the values as op inputs instead of config, e.g.:

```yaml
ops:
  train_test_validation_split:
    inputs:
      train_size:
        value: 0.7
      test_size:
        value: 0.2
```
• check the config values in the op body and fail the run if they violate the expectation. You can raise `Failure` in the op body (example).

But yeah, neither of these workarounds would disable the Launch Run or Materialize button.
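(A sketch of that second workaround, with a stand-in `Failure` class so it runs without Dagster; in real code you would raise `dagster.Failure` at the top of the op body:)

```python
# Stand-in for dagster.Failure so the sketch stays self-contained.
class Failure(Exception):
    pass

# Validate config at the start of the op body and fail the run immediately
# if the required parameters are missing or inconsistent.
def check_split_config(config):
    train = config.get("train_size")
    test = config.get("test_size")
    if train is None or test is None:
        raise Failure("train_size and test_size are required in the run config")
    if not (0 < train < train + test <= 1):
        raise Failure("need 0 < train_size < train_size + test_size <= 1")
    return train, test
```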