# announcements
a
Hi all! I'm currently studying Dagster and trying to use it for tidying up ML training. I got kind of stuck on the following problem. A complex solid (`train`) in a pipeline needs to be run several times with somewhat different input, but the number of these runs is determined by an integer value in a global config (ML folds). I see only three solutions:
• generate the pipeline on the fly based on the original config: more code = more bugs, difficult to use tools like Dagit, seems like an abuse of the framework;
• iterate over inputs in a single complex `train` solid: looks good in Dagit, but limited parallelism;
• just generate the pipeline with a fixed number of parallel `train` solids (i.e. a fixed maximum number of runs), then use optional outputs: seems sort of ugly and relies on artificial constants.
How could this problem be solved in Dagster in an idiomatic way?
m
Hi Alex! Can you expand a little bit on how the inputs vary between the different runs of the `train` solid? Is this a question of iterating the same solid on its own input n times, or something else?
a
Basically I run neural network training on multiple subsets of a training dataset. A description of such a dataset is an input to the `train` solid. The number of subsets is given in an initial (human-edited) config. All the other parameters are the same for each run; they are expected to be passed as inputs to `train` as well. Visually, it should look like this:
```
generate_training_data_subsets
     /         |         \
 train_1    train_2    train_3
     \         |         /
        collect_metrics
```
`train` might be pretty resource-intensive, therefore it is important to be able to parallelize `train` runs with e.g. dask.
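A minimal sketch of the third option (a fixed maximum of three `train` solids with optional outputs), written against Dagster's legacy solid/pipeline API -- `make_fold` and `train_model` are hypothetical stand-ins, and depending on the Dagster version the optional-output flag is spelled `is_required=False` or `is_optional=True`:
```python
from dagster import Output, OutputDefinition, pipeline, solid

MAX_FOLDS = 3  # the artificial upper bound this option requires
N_FOLDS = 2    # stand-in for the integer from the human-edited global config


def make_fold(i, n):
    # hypothetical helper: build the description of one training subset
    return {"fold": i, "of": n}


def train_model(fold):
    # hypothetical helper: the actual (expensive) training routine
    return {"loss": 0.0, **fold}


@solid(
    output_defs=[
        OutputDefinition(name="fold_%d" % i, is_required=False)
        for i in range(MAX_FOLDS)
    ]
)
def generate_training_data_subsets(context):
    # yield only as many of the optional outputs as the config asks for
    for i in range(N_FOLDS):
        yield Output(make_fold(i, N_FOLDS), "fold_%d" % i)


@solid
def train(context, fold):
    return train_model(fold)


@solid
def collect_metrics(context, m0, m1, m2):
    context.log.info(str([m0, m1, m2]))


@pipeline
def training_pipeline():
    folds = generate_training_data_subsets()
    # a train alias downstream of an un-yielded optional output is skipped,
    # so collect_metrics only runs when all MAX_FOLDS outputs exist --
    # part of why this pattern feels constrained
    results = [
        train.alias("train_%d" % i)(getattr(folds, "fold_%d" % i))
        for i in range(MAX_FOLDS)
    ]
    collect_metrics(*results)
```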
m
gotcha
We are tracking this general class of problem at https://github.com/dagster-io/dagster/issues/462 -- we've been very hesitant to rush to implement something that is likely to be subtly wrong in general.
I think that if performance is the driving concern, the most idiomatic way to do this right now is probably to use a single solid, and pass it the folds over which to train
That solid logic can then itself parallelize in whatever way seems most appropriate, e.g., using Dask from within the solid compute function
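A minimal sketch of that single-solid approach, using `dask.delayed` inside the compute function; `train_one_fold` is a hypothetical stand-in for the real training routine:
```python
import dask
from dagster import solid


def train_one_fold(fold):
    # hypothetical per-fold training routine
    return {"fold": fold, "loss": 0.0}


@solid
def train_all_folds(context, folds):
    # one lazy task per fold; dask decides how to schedule them
    tasks = [dask.delayed(train_one_fold)(fold) for fold in folds]
    # scheduler="threads" keeps the sketch local; a distributed
    # client would let the same code fan out across a cluster
    return list(dask.compute(*tasks, scheduler="threads"))
```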
This is an interesting kind of sequence, because it's small-n or constrained-n -- like, I can imagine you running 10 train/test folds, but likely not 10,000.
an alternative is to build a pipeline that runs N times and records the metrics somewhere -- this is less satisfactory if you need mutually exclusive folds
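A rough sketch of that alternative, with a driver script outside Dagster collecting the metrics; this is written against a later legacy API, where older releases spell `run_config` as `environment_dict` and `config_schema` as `config`:
```python
from dagster import execute_pipeline, pipeline, solid


@solid(config_schema={"fold": int})
def train(context):
    # stand-in for real training on the configured fold
    return {"fold": context.solid_config["fold"], "loss": 0.0}


@pipeline
def single_fold_pipeline():
    train()


# one pipeline run per fold, metrics recorded by the driver itself
all_metrics = []
for fold in range(3):  # 3 stands in for the configured fold count
    result = execute_pipeline(
        single_fold_pipeline,
        run_config={"solids": {"train": {"config": {"fold": fold}}}},
    )
    all_metrics.append(result.result_for_solid("train").output_value())
```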
a
Ideally, I want to keep the size of the code (and the potential bug count) to a minimum and to leverage a general parallelization approach using dagster-dask.
This way of thinking was inspired by the Ruffus library: there you can declare a solid input that is a list of values, so the solid runs separately for each input value, optionally with multiprocessing enabled. That makes it easier to write pipelines that map over a list of input data.
Well, I guess for the moment I will go with the third option (a fixed maximum number of `train` solids). Anyway, thanks a lot for looking into this problem!
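For reference, the Ruffus pattern described above looks roughly like this (file names invented for illustration):
```python
from ruffus import originate, pipeline_run, suffix, transform

FOLDS = ["fold_0.data", "fold_1.data", "fold_2.data"]


@originate(FOLDS)
def make_fold(output_file):
    # write one subset description per fold
    with open(output_file, "w") as f:
        f.write("subset description\n")


@transform(make_fold, suffix(".data"), ".model")
def train(input_file, output_file):
    # Ruffus calls this once per input file -- the implicit map
    with open(output_file, "w") as f:
        f.write("trained on %s\n" % input_file)


pipeline_run(multiprocess=3)  # run independent task invocations in parallel
```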
b
@max - Alexander’s Ruffus suggestion sounds not-terrible. I would love it if I could mark an input to a solid as the one to map over. ofc, I’m just looking at this through the lens of my particular (simple, embarrassingly parallel) use case
m
yep, we actually have a draft implementation of this approach: https://github.com/dagster-io/dagster/pull/699