# ask-community
d
Hi, I'm trying to write a configurable multi-asset. As an example, this is the decorator I'm trying to use for a train-test-validation split of a DataFrame:
```python
from typing import Tuple

from dagster import AssetIn, Field, Out, Output, multi_asset
from dagster_pandas import DataFrame  # assumption: the DataFrame dagster type comes from dagster-pandas


@multi_asset(
    config_schema={"train_size": Field(float, is_required=True), "test_size": Field(float, is_required=True)},
    ins={"df": AssetIn(dagster_type=DataFrame, key="df")},
    outs={
        "df_train": Out(dagster_type=DataFrame),
        "df_test": Out(dagster_type=DataFrame),
        "df_validation": Out(dagster_type=DataFrame),
    },
)
# Bare Output in the annotation: dagster_pandas.DataFrame is a DagsterType instance, not a class, so it can't parameterize Output[...].
def train_test_validation_split(context, df) -> Tuple[Output, Output, Output]:
    # Illustrative body: slice the frame by the configured fractions.
    n_train = int(len(df) * context.op_config["train_size"])
    n_test = int(len(df) * context.op_config["test_size"])
    return (
        Output(df.iloc[:n_train], output_name="df_train"),
        Output(df.iloc[n_train:n_train + n_test], output_name="df_test"),
        Output(df.iloc[n_train + n_test:], output_name="df_validation"),
    )
```
The train and test size should be required such that `0 < train_size < train_size + test_size <= 1` (the validation set is optional and its size is `1 - train_size - test_size`). I would expect Dagster / Dagit to raise an error if someone tries to materialize without defining the required config parameters, and to show the correct config parameters and types in the description of the assets and the op. However, the description doesn't show the parameters, it lists the config as "Any", and the materialization can be started without setting the parameters, resulting in an error when the op tries to access them from the context.
1. Am I missing some difference in how assets / multi-assets should be configured, as opposed to ops? How can I enforce that the parameters are required and shown correctly?
2. This is less important but could be helpful too: would it be possible to define checks for valid configuration (e.g. in this case `assert 0 < train_size < train_size + test_size <= 1`) which run during the initialization of a run instead of inside the respective op, so that an invalid configuration is detected as early as possible, instead of running all the upstream ops first and then failing in the middle of the run?
y
1. Let me try to repro on my end.
2. Yes, you can specify a `type_check_fn` on a `DagsterType` and supply that to `ins` or `out`. Here's an example: https://docs.dagster.io/concepts/types#defining-a-dagster-type. Besides that, we also have a `dagster-pandera` integration which allows you to specify finer-grained constraints on data frames. Here's a guide: https://docs.dagster.io/integrations/pandera
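For reference, a minimal sketch of that approach; the type name `NonEmptyDataFrame` and the non-empty check are illustrative assumptions, not from the thread:

```python
import pandas as pd
from dagster import DagsterType, In, Out, TypeCheck, op

# Illustrative custom type: accepts only non-empty pandas DataFrames.
NonEmptyDataFrame = DagsterType(
    name="NonEmptyDataFrame",
    type_check_fn=lambda _context, value: TypeCheck(
        success=isinstance(value, pd.DataFrame) and not value.empty,
        description="must be a non-empty pandas DataFrame",
    ),
)


@op(ins={"df": In(dagster_type=NonEmptyDataFrame)}, out=Out(dagster_type=NonEmptyDataFrame))
def passthrough(df):
    # The type check runs on the input and again on the output at execution time.
    return df
```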
Re 1: confirmed on my end. Filed a bug report: https://github.com/dagster-io/dagster/issues/9607
d
Thanks for filing the bug report! The `type_check_fn` on a `DagsterType` is indeed very helpful for ins/outs. However, I was wondering whether this sort of check function could also be applied to config parameters. I gave it a try and defined a parameter in the config schema with a self-defined `DagsterType` with a `type_check_fn`, but this doesn't seem to work, as only a limited set of types is supported for config parameters. So, coming back to the multi-asset example, that could mean defining a `config_check_fn` that asserts `0 < train_size < train_size + test_size <= 1`, and based on that it would be nice to have the following behavior:
```yaml
ops:
  train_test_validation_split:
    config:
      train_size: 0.7
      test_size: 0.2
```
results in no error and the materialization can be started, while
```yaml
ops:
  train_test_validation_split:
    config:
      train_size: 0
      test_size: 0.2
```
results in an error for invalid config and disables the "Launch Run" / "Materialize" button. This would also mean that the check for valid config is isolated from the business logic in the op/asset, and errors are detected as early as possible. I'm not sure whether this is technically doable on your side, especially when jobs contain dynamic steps or config mappings, but if it is, I think it would be a nice feature to have.
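There is no `config_check_fn` today, but a rough approximation of the early check can be sketched with Dagster's existing `@config_mapping` (the job wiring and names here are assumptions): raising inside the mapping function fails the run when the config is resolved, before any ops execute, although it does not disable the launch buttons.

```python
from dagster import config_mapping, job, op


@op(config_schema={"train_size": float, "test_size": float})
def train_test_validation_split(context):
    # Simplified stand-in for the multi_asset above.
    ...


# The mapping runs when the run config is resolved; raising here fails the run
# before any step executes.
@config_mapping(config_schema={"train_size": float, "test_size": float})
def checked_split_config(val):
    train_size, test_size = val["train_size"], val["test_size"]
    if not (0 < train_size < train_size + test_size <= 1):
        raise ValueError(
            f"Expected 0 < train_size < train_size + test_size <= 1, "
            f"got train_size={train_size}, test_size={test_size}"
        )
    return {"ops": {"train_test_validation_split": {"config": val}}}


@job(config=checked_split_config)
def split_job():
    train_test_validation_split()
```

Note that with a job-level config mapping, the user-facing run config becomes the mapping's schema (`train_size` / `test_size` at the top level) instead of the `ops:` block shown above.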
y
Yea, I don't think checks on configs are supported at the moment. But for your use case, there are some workaround options:
• model the sizes as inputs so you can perform checks there; then your config will look like
```yaml
ops:
  train_test_validation_split:
    inputs:
      train_size:
        value: 0.7
      test_size:
        value: 0.2
```
• check the config values in the op body and fail the run if they violate the expectation. You can `raise Failure` in the op body (example).
But yea, neither of these workarounds would disable the "Launch Run" or "Materialize" button.
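A minimal sketch of both workarounds (the type name, the loader, and the exact checks are assumptions; the thread doesn't spell out the code):

```python
from dagster import DagsterType, Failure, In, TypeCheck, dagster_type_loader, op


# A loader lets values of the custom type be supplied via run config
# (inputs: ...: value: 0.7), as in the first workaround above.
@dagster_type_loader(float)
def _fraction_loader(_context, value):
    return value


# Workaround 1: model the sizes as inputs whose dagster type checks each value.
SplitFraction = DagsterType(
    name="SplitFraction",
    type_check_fn=lambda _ctx, value: TypeCheck(
        success=isinstance(value, float) and 0 < value < 1,
        description="split fractions must lie strictly between 0 and 1",
    ),
    loader=_fraction_loader,
)


@op(ins={"train_size": In(SplitFraction), "test_size": In(SplitFraction)})
def split_from_inputs(train_size, test_size):
    ...


# Workaround 2: validate config values in the op body and raise Failure.
@op(config_schema={"train_size": float, "test_size": float})
def split_from_config(context):
    train_size = context.op_config["train_size"]
    test_size = context.op_config["test_size"]
    if not (0 < train_size < train_size + test_size <= 1):
        raise Failure(f"Invalid split: train_size={train_size}, test_size={test_size}")
    ...
```

Note that the input-level type check validates each size independently; the joint constraint `train_size + test_size <= 1` still needs an in-body check like the one in the second op.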