Daniel Michaelis
09/06/2022, 8:37 AM

```python
@multi_asset(
    config_schema={
        "train_size": Field(float, is_required=True),
        "test_size": Field(float, is_required=True),
    },
    ins={"df": AssetIn(dagster_type=DataFrame, key="df")},
    outs={
        "df_train": Out(dagster_type=DataFrame),
        "df_test": Out(dagster_type=DataFrame),
        "df_validation": Out(dagster_type=DataFrame),
    },
)
def train_test_validation_split(
    context, df
) -> Tuple[Output[DataFrame], Output[DataFrame], Output[DataFrame]]:
    ...
```
The train and test size should be required such that 0 < train_size < train_size + test_size <= 1 (the validation set is optional; its size is 1 - train_size - test_size). I would expect Dagster / Dagit to raise an error if someone tries to materialize the assets without defining the required config parameters, and to show the correct config parameters and types in the description of the assets and the op. However, the description doesn't show the parameters, it lists the config as "Any", and the materialization can be started without setting the parameters, resulting in an error when the op tries to access the parameters from the context.
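(For reference, a plain-Python sketch of the split the op is meant to perform, independent of Dagster; the function name and the list-based input are illustrative, real code would slice a pandas DataFrame:)

```python
# Illustrative stand-in for the op body: validates the sizes and splits a
# sequence three ways. The validation part is whatever remains after train/test.
def three_way_split(rows, train_size, test_size):
    if not (0 < train_size < train_size + test_size <= 1):
        raise ValueError("need 0 < train_size < train_size + test_size <= 1")
    n_train = int(len(rows) * train_size)
    n_test = int(len(rows) * test_size)
    return (
        rows[:n_train],                  # train
        rows[n_train:n_train + n_test],  # test
        rows[n_train + n_test:],         # validation (may be empty)
    )
```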
1. Am I missing some difference in how assets / multi-assets should be configured as opposed to ops? How can I enforce that the parameters are required and shown correctly?
2. This is less important but could be helpful too: would it be possible to define checks for valid configuration (e.g. in this case assert 0 < train_size < train_size + test_size <= 1) which are run during the initialization of a run instead of inside the respective op, so that an invalid configuration is detected as early as possible instead of running all upstream ops first and then failing in the middle of the run?

yuhan
09/06/2022, 11:27 PM

You can define a `type_check_fn` on `DagsterType` and supply that to `ins` or `out`. Here's an example: https://docs.dagster.io/concepts/types#defining-a-dagster-type. Besides, we also have a `dagster-pandera` integration which allows you to specify finer-grained constraints on data frames. Here's a guide: https://docs.dagster.io/integrations/pandera

yuhan
09/06/2022, 11:49 PM

Daniel Michaelis
09/07/2022, 12:13 PM

A `type_check_fn` on `DagsterType` is indeed very helpful for ins/outs. However, I was wondering if this sort of check function could also be applied to config parameters. I gave it a try and defined a parameter in the config schema with a self-defined `DagsterType` with a `type_check_fn`, but this doesn't seem to work, as only a limited set of types is supported for config parameters.
So coming back to the multi-asset example, that could mean defining a `config_check_fn` that asserts that 0 < train_size < train_size + test_size <= 1, and based on that it would be nice to have the following behavior:

```yaml
ops:
  train_test_validation_split:
    config:
      train_size: 0.7
      test_size: 0.2
```
results in no error and the materialization can be started, while

```yaml
ops:
  train_test_validation_split:
    config:
      train_size: 0
      test_size: 0.2
```
results in an error for invalid config and disables the "Launch Run" or "Materialize" button. This would also mean that the check for valid config can be isolated from the business logic in the op/asset, and errors are detected as early as possible.
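(To make the wish concrete, here is a self-contained mock of what such a hook could look like; `config_check_fn` and everything else below is hypothetical, not a real Dagster API:)

```python
# Hypothetical sketch only: nothing here is a real Dagster API. It mocks a
# decorator that runs a config check before the op body, so invalid config
# fails before any business logic executes.
def config_check_fn(check):
    def wrap(op_fn):
        def inner(config, *args, **kwargs):
            check(config)  # fail fast on bad config
            return op_fn(config, *args, **kwargs)
        return inner
    return wrap

def valid_split_config(config):
    train, test = config["train_size"], config["test_size"]
    if not (0 < train < train + test <= 1):
        raise ValueError("need 0 < train_size < train_size + test_size <= 1")

@config_check_fn(valid_split_config)
def split_op(config):
    # stand-in for the actual split logic
    return "split done"
```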
I'm not sure if this is technically doable on your side, especially when jobs contain dynamic steps or config mappings, but if it's possible, I think it would be a nice feature to have.

yuhan
09/08/2022, 1:18 AM

Two workarounds:
• pass the values as op inputs instead of config, e.g.:

```yaml
ops:
  train_test_validation_split:
    inputs:
      train_size:
        value: 0.7
      test_size:
        value: 0.2
```
• check the config values in the op body and fail the run if they violate the expectation. You can raise `Failure` in the op body (example).

But yeah, neither of these workarounds would disable the Launch Run or Materialize button.
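(A sketch of that second workaround, with a stand-in `Failure` class so it runs without Dagster; in real code you would raise `dagster.Failure` at the top of the op body:)

```python
# Stand-in for dagster.Failure so the sketch stays self-contained.
class Failure(Exception):
    pass

# Validate config at the start of the op body and fail the run immediately
# if the required parameters are missing or inconsistent.
def check_split_config(config):
    train = config.get("train_size")
    test = config.get("test_size")
    if train is None or test is None:
        raise Failure("train_size and test_size are required in the run config")
    if not (0 < train < train + test <= 1):
        raise Failure("need 0 < train_size < train_size + test_size <= 1")
    return train, test
```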