Hi, I'm new to dagster and want to make sure I'm h...
# ask-community
Hi, I'm new to dagster and want to make sure I'm headed in the right direction. A typical use case for me is that I have an ingest directory ingest/ and a product/ directory and I'm processing data files (or sets of data files -- foo.dat, foo.hdr, foo.json) to generate product directories in product/ For each ingest file, which is often an hdf5 or netcdf4 file, I create an output directory structure under product/ based on the filename (or better some metadata in the file). E.g /ingest/foo.nc gets turned into /product/foo/stage1, /product/foo/stage2, …. The stages contain intermediate products that the next stage depends on (i.e., materializations of assets) in the some binary format (like netcdf/h5) as well as some readable-outside-python products like html, pngs, csv, etc. In a batch mode of operation, the ingest directory is prepopulated. And in another case we process as data is generated, the files arrive one at a time and get processed as they arrive (often in parallel with a new file arriving before the other is complete). I've written pipelines with a bunch of code for multiprocessing, exception handling, logging, reading/writing files, etc. and dagster looks like it could abstract out that stuff in a way that really closely matches how we think about our processing. That would let us focus more on algorithms to turn raw data into products and less on "plumbing." It could also potentially help support our ML workflow on this data. For the problem described above, I think we can back software defined assets with ops that take in xarray datasets (data structures representing netcdf datasets -- think of an nd-array dataframe) and output xarray datasets and other ops to write out summaries in pngs, html, csv, etc. What I'm wondering about is how to handle the fact my DAG is being applied in parallel to different filenames/output directories (and I'd rather not have it be aware of filenames) and I want to use netcdf rather than pickle for efficiency and so the asset materialization are readable outside python. I think what I'm supposed to be doing is using partitions that are somehow file aware and a customer IO manager to handle saving and load netcdf? Is that correct? Are there examples of this sort of file-based partitioning? (Sorry if I've missed them.)
what about the idea of treating the entire
directory as an asset? I don't know how many files you have, but it seems like it would get pretty hard to visually manage a separate partition for all the files that might be getting added. What you could do is construct a graph using dynamic outputs to process each file in parallel. Then turn that into a graph-backed asset
Thanks so much for taking the time to reply! I'll take a look at this. It looks like it could work. As for how many files, it would typically on the order of 100 input files in ingest and a few thousand resulting output files. I supposed treating something the ingest and product directories as assets makes sense when the config governing the transformation between the two is constant for a given transformation, which is typically the case. Thinking it through, the way we think of the stages above are processing levels. level 0, to level 1 to level 2.... And there are a set of levels for each level 0 file. It probably makes sense to treat the set of files for a given level as an asset since something sometimes we'll want to start at level 2 and go to level 3 with a different config. A lot to think about. Thanks again for your help!