John Conwell
06/02/2021, 5:22 PM@pipeline(mode_defs=[cluster_mode_def])
def build_models():
# determine if we're building or copying the training data set from production
if config['BUILD_DATA_SET']:
training_data_set = cmds.build_training_data_set(cmds.training_data_config())
else:
training_data_set = cmds.copy_training_data_set()
# build vectors for each feature family, build full vector space, then build the model
cmds.build_model(
cmds.build_feature_space(
cmds.build_feature_vectors(
training_data_set
)
)
)
Based on some config state I’d want to either build my training dataset or just copy it from a production environment. But since pipelines don’t execute, just build the solid DAG structure and don’t have access to any configuration information, how would I do this?alex
06/02/2021, 6:19 PMsolid_selection
arguments in most places runs are kicked off, but based on your description i think using the optional outputs may be good for your case@solid(
config_schema={Selector({"build": build_config(), "copy": copy_config()})},
output_defs=[
OutputDefinition("build_info", is_required=False),
OutputDefinition("copy_info", is_required=False),
],
)
def determine_training_data_source(context):
if context.solid_config.get("build"):
yield Output(context.solid_config["build"], output_name="build_info")
else:
yield Output(context.solid_config["copy"], output_name="copy_info")
@pipeline(mode_defs=[cluster_mode_def])
def build_models():
# determine if we're building or copying the training data set from production
build_info, copy_info = determine_training_data_source()
# only one of these paths will be taken at run time
copied_training_data_set = cmds.copy_training_data(copy_info)
built_training_data_set = cmds.build_training_data(build_info)
# build vectors for each feature family, build full vector space, then build the model
cmds.build_model(
cmds.build_feature_space(
cmds.build_feature_vectors(
# here we "fan-in" to prevent the skipping behavior
# of optional outputs from propagating further.
# we can assert in this solid (or an added intermediate one)
# that we have a list of only one item
traning_data_set=[built_training_data_set, copied_training_data_set]
)
)
)
John Conwell
06/02/2021, 6:40 PMif
branching statements. The pipeline structure ends up looking more like a graphical programming language vs a graphical DSL, which is what I’m really looking for.alex
06/02/2021, 7:08 PMJohn Conwell
06/02/2021, 7:47 PMalex
06/02/2021, 8:08 PMMegan Beckett
10/06/2022, 11:39 AMalex
10/06/2022, 2:48 PM