# announcements
a
Hi all! I'm currently studying Dagster and trying to use it for tidying up ML training. I got kind of stuck on the following problem. A complex solid (`train`) in a pipeline needs to be run several times with somewhat different input, but the number of these runs is determined by an integer value in a global config (ML folds). I see only three solutions:
• generate the pipeline on the fly based on the original config: more code = more bugs, difficult to use tools like Dagit, seems like an abuse of the framework;
• iterate over inputs in a single complex `train` solid: looks good in Dagit, but limited parallelism;
• just generate the pipeline with a fixed number of parallel `train` solids (i.e. a fixed maximum number of runs), then use optional outputs: seems sort of ugly and relies on artificial constants.
How could this problem be solved in Dagster in an idiomatic way?
m
Hi Alex! Can you expand a little bit on how the inputs vary between the different runs of the `train` solid? Is this a question of iterating the same solid on its own input n times, or something else?
a
Basically I run neural network training on multiple subsets of a training dataset. A description of such a dataset is an input to the `train` solid. The number of subsets is given in an initial (human-edited) config. All the other parameters are the same for each run; they are expected to be passed as inputs to `train` as well. Visually, it should look like this:
```
generate_training_data_subsets
     /         |         \
 train_1    train_2    train_3
     \         |         /
        collect_metrics
```
`train` might be pretty resource-intensive, therefore it is important to be able to parallelize `train` runs with e.g. dask.
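A minimal sketch of the third option (a fixed maximum of three `train` solids with optional outputs), written against Dagster's legacy solid/pipeline API -- `make_fold` and `train_model` are hypothetical stand-ins, and depending on the Dagster version the optional-output flag is spelled `is_required=False` or `is_optional=True`:
```python
from dagster import Output, OutputDefinition, pipeline, solid

MAX_FOLDS = 3  # the artificial upper bound this option requires
N_FOLDS = 2    # stand-in for the integer from the human-edited global config


def make_fold(i, n):
    # hypothetical helper: build the description of one training subset
    return {"fold": i, "of": n}


def train_model(fold):
    # hypothetical helper: the actual (expensive) training routine
    return {"loss": 0.0, **fold}


@solid(
    output_defs=[
        OutputDefinition(name="fold_%d" % i, is_required=False)
        for i in range(MAX_FOLDS)
    ]
)
def generate_training_data_subsets(context):
    # yield only as many of the optional outputs as the config asks for
    for i in range(N_FOLDS):
        yield Output(make_fold(i, N_FOLDS), "fold_%d" % i)


@solid
def train(context, fold):
    return train_model(fold)


@solid
def collect_metrics(context, m0, m1, m2):
    context.log.info(str([m0, m1, m2]))


@pipeline
def training_pipeline():
    folds = generate_training_data_subsets()
    # a train alias downstream of an un-yielded optional output is skipped,
    # so collect_metrics only runs when all MAX_FOLDS outputs exist --
    # part of why this pattern feels constrained
    results = [
        train.alias("train_%d" % i)(getattr(folds, "fold_%d" % i))
        for i in range(MAX_FOLDS)
    ]
    collect_metrics(*results)
```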
m
gotcha
We are tracking this general class of problem at https://github.com/dagster-io/dagster/issues/462 -- we've been very hesitant to rush to implement something that is likely to be subtly wrong in general.
I think that if performance is the driving concern, the most idiomatic way to do this right now is probably to use a single solid, and pass it the folds over which to train
That solid logic can then itself parallelize in whatever way seems most appropriate, e.g., using Dask from within the solid compute function
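A minimal sketch of that single-solid approach, using `dask.delayed` inside the compute function; `train_one_fold` is a hypothetical stand-in for the real training routine:
```python
import dask
from dagster import solid


def train_one_fold(fold):
    # hypothetical per-fold training routine
    return {"fold": fold, "loss": 0.0}


@solid
def train_all_folds(context, folds):
    # one lazy task per fold; dask decides how to schedule them
    tasks = [dask.delayed(train_one_fold)(fold) for fold in folds]
    # scheduler="threads" keeps the sketch local; a distributed
    # client would let the same code fan out across a cluster
    return list(dask.compute(*tasks, scheduler="threads"))
```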
This is an interesting kind of sequence, because it's small-n or constrained-n -- like, I can imagine you running 10 train/test folds, but likely not 10,000.
an alternative is to build a pipeline that runs N times and records the metrics somewhere -- this is less satisfactory if you need mutually exclusive folds
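A rough sketch of that alternative, with a driver script outside Dagster collecting the metrics; this is written against a later legacy API, where older releases spell `run_config` as `environment_dict` and `config_schema` as `config`:
```python
from dagster import execute_pipeline, pipeline, solid


@solid(config_schema={"fold": int})
def train(context):
    # stand-in for real training on the configured fold
    return {"fold": context.solid_config["fold"], "loss": 0.0}


@pipeline
def single_fold_pipeline():
    train()


# one pipeline run per fold, metrics recorded by the driver itself
all_metrics = []
for fold in range(3):  # 3 stands in for the configured fold count
    result = execute_pipeline(
        single_fold_pipeline,
        run_config={"solids": {"train": {"config": {"fold": fold}}}},
    )
    all_metrics.append(result.result_for_solid("train").output_value())
```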
a
Ideally, I want to keep the size of the code (and the potential bug count) to a minimum and to leverage a general parallelization approach using dagster-dask.
This way of thinking was inspired by the Ruffus library: there you can declare a solid input that is a list of values, so the solid runs separately for each input value, optionally with multiprocessing enabled. That makes it easier to write pipelines that map over a list of input data.
Well, I guess for the moment I will go with the third option (a fixed maximum number of `train` solids). Anyway, thanks a lot for looking into this problem!
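For reference, the Ruffus pattern described above looks roughly like this (file names invented for illustration):
```python
from ruffus import originate, pipeline_run, suffix, transform

FOLDS = ["fold_0.data", "fold_1.data", "fold_2.data"]


@originate(FOLDS)
def make_fold(output_file):
    # write one subset description per fold
    with open(output_file, "w") as f:
        f.write("subset description\n")


@transform(make_fold, suffix(".data"), ".model")
def train(input_file, output_file):
    # Ruffus calls this once per input file -- the implicit map
    with open(output_file, "w") as f:
        f.write("trained on %s\n" % input_file)


pipeline_run(multiprocess=3)  # run independent task invocations in parallel
```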
b
@max - Alexander’s Ruffus suggestion sounds not-terrible. I would love it if I could mark an input to a solid as the one to map over. ofc, I’m just looking at this through the lens of my particular (simple, embarrassingly parallel) use case
m
yep, we actually have a draft implementation of this approach: https://github.com/dagster-io/dagster/pull/699