# ask-community
Hey everyone, I'm exploring Dagster as a potential platform to migrate my existing bioinformatics pipeline. After going through the extensive documentation and examples, I'm having a hard time breaking my pipeline down into assets, ops, graphs, and jobs. Can you point me in the right direction?

Pipeline summary: The pipeline is broken down into multiple stages, and each stage works on the output of the previous stage.
- Input: a set of raw data files, which change for each run of the pipeline
- Output: CSVs from each stage, corresponding to each input file

Mainly the pipeline reads the input files, runs a few filtration steps per line, and stores the line in a CSV. Some stages involve only one filtration step; some stages have multiple.

Questions:
1. How do I deal with the input files? What should I classify those as?
2. How do I deal with user input? Essentially I would like the user to add the filepaths as input.
3. How do I trigger a set of jobs? Is it as simple as defining a job that calls all the other jobs?
4. Is there any documentation on SLURM integration?
hi @Vinayak Malviya! A few follow-up questions: How do you know which input files will be used? Is it purely user input (i.e. "run this processing on this set of files")? Are these files generally similar in shape? I.e. do you just point the processing at a directory full of CSVs, or are there a bunch of different file formats in multiple different locations?

At a high level, it sounds like you generally want to be working in the realm of assets. You could model your input files as an asset with a config schema. This config schema could allow the user to configure a list of filepaths they want to process. The body of the asset would read those files in and return them in whatever serializable format you want to work with (or maybe just return the filepaths themselves, so downstream assets know which files they should be working on).

Each stage could be its own asset, with the ones with multiple filtering steps potentially being modeled as a graph-backed asset if you so choose.
regarding triggering a set of jobs, are you thinking of doing this as a manual process (i.e. user goes in and wants to specify multiple different sets of filepaths), or in some automated way?
and for 4, there's no dagster / SLURM integration at the moment
How do you know which input files will be used? Is it purely user input (i.e. "run this processing on this set of files")?
Yes it completely depends on the user input
Are these files generally similar in shape?
No, these could be files located in random places, but mainly of 2 or 3 types.
Using a config schema makes sense, I'll take a deeper look into it
regarding triggering a set of jobs
This is purely manual at the moment, although automating it is one of the reasons I'm looking at dagster
Regarding SLURM, is there a way I can write a custom runner or daemon which can trigger the jobs?
Thank you so much for dealing with these questions!
so if you set things up with a config schema, you can use a sensor or a schedule to automatically kick off runs of your job with different configuration (for different sets of files)
I'm not familiar with SLURM, to be honest, so it's hard for me to say what the natural way of interacting with it would be, but in general anything with a programmatic interface can be integrated with dagster. I.e. if you're sending work to something, you can use a resource as a wrapper over that API, or if you're getting requests for work from something, you can use a sensor to poll for new requests and kick off jobs in response
Sure no worries. I can interact with SLURM programmatically so I'll take a look at the resource docs in detail. Again, thank you so much for taking the time to help me out 🙇‍♂️