Hi, I'm new to dagster and want to make sure I'm headed in the right direction. A typical use case for me is that I have an ingest/ directory and a product/ directory, and I'm processing data files (or sets of data files -- foo.dat, foo.hdr, foo.json) to generate product directories under product/.
For each ingest file, which is often an HDF5 or netCDF4 file, I create an output directory structure under product/ based on the filename (or, better, some metadata in the file). E.g., /ingest/foo.nc gets turned into /product/foo/stage1, /product/foo/stage2, and so on. The stages contain intermediate products that the next stage depends on (i.e., materializations of assets) in some binary format (like netCDF/HDF5), as well as some readable-outside-Python products like HTML, PNGs, CSV, etc.
In a batch mode of operation, the ingest directory is prepopulated. In another mode we process as data is generated: files arrive one at a time and get processed as they arrive (often in parallel, with a new file arriving before the previous one is complete).
I've written pipelines with a bunch of code for multiprocessing, exception handling, logging, reading/writing files, etc., and dagster looks like it could abstract that away in a way that closely matches how we think about our processing. That would let us focus more on algorithms to turn raw data into products and less on "plumbing." It could also potentially help support our ML workflow on this data.
For the problem described above, I think we can back software-defined assets with ops that take in xarray Datasets (data structures representing netCDF datasets -- think of an n-dimensional dataframe) and output xarray Datasets, plus other ops that write out summaries as PNGs, HTML, CSV, etc. Roughly like the sketch below.
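To make that concrete, here's what I'm picturing for one granule. This is only a minimal sketch under my own assumptions: the asset names stage1/stage2, the raw_granule source asset, the netcdf_io_manager key, and the transforms are placeholders I made up, not anything from the docs.

```python
import xarray as xr
from dagster import SourceAsset, asset

# The raw ingest file, loaded by an IO manager so the computation never sees paths.
raw_granule = SourceAsset(key="raw_granule", io_manager_key="netcdf_io_manager")

@asset(io_manager_key="netcdf_io_manager")
def stage1(raw_granule: xr.Dataset) -> xr.Dataset:
    # Placeholder first-stage transform on the raw dataset.
    return raw_granule.sortby("time")

@asset(io_manager_key="netcdf_io_manager")
def stage2(stage1: xr.Dataset) -> xr.Dataset:
    # Placeholder derived product built from stage1.
    return stage1.mean(dim="time")
```

Alongside these there would be additional assets that render the PNG/HTML/CSV summaries from the same upstream Datasets.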
What I'm wondering about is how to handle the fact that my DAG is being applied in parallel to different filenames/output directories (and I'd rather the assets not be aware of filenames), and that I want to use netCDF rather than pickle, both for efficiency and so the asset materializations are readable outside Python.
I think what I'm supposed to be doing is using partitions that are somehow file-aware and a custom IO manager to handle saving and loading netCDF -- something like the sketch after this paragraph? Is that correct? Are there examples of this sort of file-based partitioning? (Sorry if I've missed them.)
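For reference, this is the kind of IO manager I was imagining. It's just a sketch of my guess: the NetCDFIOManager name, the product_root config key, and the convention that the partition key is the ingest filename stem (e.g. "foo" for /ingest/foo.nc) are all my own assumptions, not something I found in the docs.

```python
import os
import xarray as xr
from dagster import IOManager, io_manager

class NetCDFIOManager(IOManager):
    """Store each asset materialization as netCDF under product/<partition>/<asset>.nc."""

    def __init__(self, product_root: str):
        self.product_root = product_root

    def _path(self, context) -> str:
        # Assumption: the partition key is the ingest filename stem, e.g. "foo" for foo.nc.
        partition = context.partition_key if context.has_partition_key else "unpartitioned"
        filename = "_".join(context.asset_key.path) + ".nc"
        return os.path.join(self.product_root, partition, filename)

    def handle_output(self, context, obj: xr.Dataset) -> None:
        # Write the asset's Dataset to its stage directory as netCDF.
        path = self._path(context)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        obj.to_netcdf(path)

    def load_input(self, context) -> xr.Dataset:
        # Read back the upstream asset's netCDF file for the downstream asset.
        return xr.open_dataset(self._path(context.upstream_output))

@io_manager(config_schema={"product_root": str})
def netcdf_io_manager(init_context):
    return NetCDFIOManager(init_context.resource_config["product_root"])
```

If that's roughly the right shape, the part I'm least sure about is whether per-file partitions are the intended way to drive the parallel, per-granule runs.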