# announcements
I have a somewhat similar question: we are currently working on a classification pipeline in Dagster. We have solids for reading in different inputs, let's say inputs A, B, and C. We also have different solids for training different classifiers, let's say classifiers X and Y. Currently we build separate pipelines for training classifier Y with inputs A and B, for training X with B and C, and so on, so every pipeline uses different solids. We are debating whether it would be preferable to build one giant pipeline where the inputs and classifiers used are controlled through the config file. What would be the best way to approach this? Pipeline construction or changes via the config file are not possible, AFAIK, so we cannot have, e.g., a composite solid that chooses between solids A, B, and C for input generation based on a configuration file, right? Should we always have all solids connected and control their behavior via output parameters? Or should we move all functionality, like reading inputs and initializing classifiers, out of the solids and only make the right function calls inside the solids based on the configuration? Or is the single-pipeline approach superior?
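The "right function calls inside the solids" option from the question can be sketched framework-free. All names below (`READERS`, `read_input`, the config key) are invented for illustration; this is plain Python, not Dagster API:

```python
# Hypothetical sketch of the "dispatch inside the solid" option: the
# reading logic lives in plain functions, and a single solid body picks
# among them via a lookup table keyed by config. All names are made up.
READERS = {
    "a": lambda: ["a1", "a2"],          # stand-in for reading input A
    "b": lambda: ["b1", "b2"],          # stand-in for reading input B
    "both": lambda: READERS["a"]() + READERS["b"](),
}

def read_input(config):
    """Body of a hypothetical input-reading solid."""
    return READERS[config["chosen_input"]]()

print(read_input({"chosen_input": "both"}))  # ['a1', 'a2', 'b1', 'b2']
```

The dependency graph stays fixed (one reading solid); only the call made inside it varies with config.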
Did you see how you can have multiple and/or optional outputs?
This should help you achieve the single-pipeline approach.
As for which approach is "better", it's hard to say without having all the context.
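A framework-free sketch of that multiple/optional-outputs idea (plain Python, not actual Dagster API): a solid yields only the outputs the config asks for, and downstream solids wired to the skipped outputs never fire.

```python
# Framework-free sketch of the optional-outputs pattern: yield
# (output_name, value) pairs the way a Dagster solid yields Output
# objects, emitting only the branches the config asks for.
# The names and values are made up for illustration.
def generate_inputs(config):
    if config["chosen_input"] in ("a", "both"):
        yield "input_a", ["a1", "a2"]   # stand-in for preprocessing input A
    if config["chosen_input"] in ("b", "both"):
        yield "input_b", ["b1", "b2"]   # stand-in for preprocessing input B

emitted = dict(generate_inputs({"chosen_input": "a"}))
print(emitted)  # {'input_a': ['a1', 'a2']} -- the "b" branch never fired
```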
@Philipp G I've thought about the use case of training a set of models, but with the end goal of storing all results as an ongoing comparison/analysis. How would that differ from your use case? My thought process is that I could have a "training" solid which takes a training set as input and loops over training some (tractable) number of models, then stores the resulting params/hyperparams/metrics in some kind of database (handled by downstream solids). Iterating like that within a single solid is perhaps not atomic enough for the functional data engineering approach we're trying to take here, though 🤷‍♀️
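That single "training" solid could look roughly like this sketch; the grid, the "model", and the metric are all placeholders, not a real training setup:

```python
# Sketch of one solid looping over a small hyperparameter grid and
# collecting params/metrics for downstream storage solids to persist.
# "Training" is faked with arithmetic; names are made up.
def train_all_models(training_set, hyperparam_grid):
    results = []
    for params in hyperparam_grid:
        # Stand-in for fitting a real model with these hyperparams.
        metric = sum(training_set) * params["scale"]
        results.append({"params": params, "metric": metric})
    return results

rows = train_all_models([1, 2, 3], [{"scale": 1}, {"scale": 2}])
print(rows)  # two result rows, metrics 6 and 12
```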
Thanks for the answers. Regarding the single-pipeline approach, I will try to break my question down into a simple example:
```python
import functions as f
from dagster import composite_solid, lambda_solid, pipeline

@lambda_solid
def generate_input_a():
    input_a = f.preprocess_some_more()
    return input_a  # This is some kind of list

@lambda_solid
def generate_input_b():
    input_b = f.preprocess_even_more()
    return input_b  # This is some kind of list

@lambda_solid
def concat_two_lists(list_a, list_b):
    # This is some function that turns two lists
    # into one long list.
    return f.concat_lists(list_a, list_b)

@composite_solid
def generate_input():
    if config["chosen_input"] == "a":
        input = generate_input_a()
    elif config["chosen_input"] == "b":
        input = generate_input_b()
    elif config["chosen_input"] == "both":
        input = concat_two_lists(generate_input_a(), generate_input_b())
    return input

@pipeline
def train_model():
    input = generate_input()
    model = define_model()  # Let's pretend we have this solid
    trained_model = train_model(input, model)
```
This example is not functional, of course, and please excuse any mistakes, but maybe it gets the idea across. We have two solids that generate inputs for a model. The interesting part is the composite solid: based on some configuration (e.g. the standard configuration provided via the config file), it decides whether to call the solid for input A, the solid for input B, or both with the inputs concatenated into one. If I remember correctly, such an if-clause is not possible in composite solids, right? How can I make a pipeline like the one shown above, where the configuration determines which solids are used?
@Tommy Naugle Your question is somewhat related to my question, yes. You could build a solid that itself has a configuration where you set the models you wish to train. If this all happens in one solid, everything is fine. But once training the models gets more complicated and you move the logic out into different solids, you run into a problem similar to the one I described above.
such an if-clause is not possible in composite_solids, right?
Correct, composition functions are invoked at init time to determine dependency graphs; nothing runs at run time. Here is a fleshed-out working example of the approach you have above. We have to move the selection into a solid, since solids are the only thing invoked at run time: https://dagster.phacility.com/P21
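The shape of that approach, sketched without Dagster (the selector body is illustrative and is not the exact P21 code): the input solids all produce values, and one run-time solid picks or combines them based on config.

```python
# Sketch of "move the selection into a solid": upstream solids feed
# both inputs in, and the selector decides at run time which to pass
# downstream. Names and the config key are made up for illustration.
def select_input(config, input_a, input_b):
    """Body of a hypothetical run-time selector solid."""
    choice = config["chosen_input"]
    if choice == "a":
        return input_a
    if choice == "b":
        return input_b
    if choice == "both":
        return input_a + input_b
    raise ValueError(f"unknown chosen_input: {choice!r}")

print(select_input({"chosen_input": "both"}, [1, 2], [3]))  # [1, 2, 3]
```

The trade-off, as noted below, is that the upstream solids run regardless of the choice.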
There's maybe another approach where you use `solid_subset`, the ability to subselect solids for execution (most easily achieved using `PresetDefinition`s), to turn different input nodes on and off. That seems a little trickier given your example of wanting to choose A, B, or combined [A, B], but it is another option.
I will look into both options, thank you for the insights!
Thank you very much for the work you put into it!
Here is what the solid_subset version could look like https://dagster.phacility.com/P22
Hi Alex, just so I understand correctly: in your first example (P21), all solids will always run, and A and B will be generated even if they are not needed, correct? In your second example (P22), we are not able to use selective_pipeline.build_sub_pipeline in conjunction with the config, i.e. to build a sub-pipeline based on config values, right?
But I guess in the first case one could make the generation of A and B dependent on a config value as well. It's a bit of a hack, but it should at least prevent paying for data that is not needed.
A and B will be generated even if they are not needed, correct?
Correct, but you could imagine adding per-solid config and using config mapping in the composite to turn off generation completely, without having that "leak" out of the created composite.
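That per-solid-config idea could be sketched like this (plain Python; the `enabled` flag and the mapping function are invented for illustration, not Dagster's config-mapping API):

```python
# Sketch of a composite's config mapping: one outer config value fans
# out to per-solid config, so unneeded generation is switched off
# inside the composite without the choice "leaking" out of it.
def map_config(outer_config):
    choice = outer_config["chosen_input"]
    return {
        "generate_input_a": {"enabled": choice in ("a", "both")},
        "generate_input_b": {"enabled": choice in ("b", "both")},
    }

def generate_input_a(solid_config):
    # Skip the expensive read entirely when disabled.
    return ["a1", "a2"] if solid_config["enabled"] else []

inner = map_config({"chosen_input": "b"})
print(generate_input_a(inner["generate_input_a"]))  # [] -- generation off
```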
build a sub-pipeline based on config values, right?
Correct, though you could create `PresetDefinition`s for the different set-ups, which would still give a reasonable dev experience. That assumes the permutations of what you want to turn on/off are small.
Let me know if any of that needs any extra clarification
Cool, SGTM