# ask-community
f
Hi all, I have another modelling question. In our pipeline, most of the inputs and outputs are simply files. We don't return things like pandas dataframes or numpy arrays. We're working mostly with GIS data, so we produce LAS files, shapefiles, images and so on. So I have been wondering whether the best modelling approach for us would be assets without I/O: https://docs.dagster.io/guides/dagster/non-argument-deps I can imagine use cases for the serialized results we see in all the dagster examples, but I guess that would be a corner case for us.
a
Can't you just set the dagster type of the asset to the special Nothing type? That essentially turns off the IO manager, as @owen taught me a while ago; see here https://dagster.slack.com/archives/C01U954MEER/p1676310924565129?thread_ts=1676257567.623189&channel=C01U954MEER&message_ts=1676310924.565129
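A minimal sketch of that pattern, assuming (as in the "assets without I/O" guide) that a None return annotation gives the output the Nothing type so no IO manager runs; the asset names and the write_las_to_s3 / read_las_from_s3 helpers are hypothetical:

from dagster import asset

@asset
def raw_point_cloud() -> None:
    # annotating the return type as None means nothing is handed to an IO manager
    write_las_to_s3("s3://my-bucket/raw/points.las")  # hypothetical helper

@asset(non_argument_deps={"raw_point_cloud"})
def classified_point_cloud() -> None:
    # the dependency only provides ordering/lineage; nothing is loaded,
    # so the downstream asset must re-derive the same (static) path itself
    las = read_las_from_s3("s3://my-bucket/raw/points.las")  # hypothetical helper
    write_las_to_s3("s3://my-bucket/classified/points.las")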
f
yep, that's pretty much what the Assets without I/O guide suggests ✔️ I was just wondering if that would be the best way to handle these things.
for instance, I believe it would make sense to return the paths, if we want to specify at run time where the files are stored 🤔
l
The paths are part of the metadata, not the assets themselves.
f
in the assets without I/O example the paths are all hardcoded and aren't even shared in metadata: https://docs.dagster.io/guides/dagster/non-argument-deps#using-an-asset-without-loading-it
rephrasing my question: I would like to know whether it makes sense to return the path(s) to the file (or files) generated by an asset. The benefit is that downstream assets could then figure out the paths from the output of the previous step
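A minimal sketch of what "returning the path" would look like (asset names and the write/classify helpers are hypothetical); the path string itself becomes the value the default IO manager serializes and hands to the downstream asset:

from dagster import asset

@asset
def raw_point_cloud() -> str:
    path = "s3://my-bucket/raw/points.las"   # could also be computed at run time
    write_las_to_s3(path)                    # hypothetical helper
    return path                              # the string is what the IO manager stores

@asset
def classified_point_cloud(raw_point_cloud: str) -> str:
    # the IO manager loads the upstream value, so this argument is just the path string
    out_path = "s3://my-bucket/classified/points.las"
    classify_las(raw_point_cloud, out_path)  # hypothetical helper
    return out_path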
but still, it seems a bit awkward to have an I/O manager serialize a Path object just to read it back later 😬 Maybe I am confused about the concepts...
l
I too think assets w/o IO make sense for your asset definitions. I'd think you can get the path from the AssetMaterialization event, or from OpExecutionContext.get_output_metadata by specifying the output name (from the previous step).
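For the upstream side, a minimal sketch of recording the path on the materialization with context.add_output_metadata (the asset name and the write_las_to_s3 helper are hypothetical):

from dagster import asset

@asset
def raw_point_cloud(context) -> None:
    path = f"s3://my-bucket/raw/{context.run_id}/points.las"  # a run-dependent path
    write_las_to_s3(path)                                     # hypothetical helper
    # record the path on this materialization so downstream code (or someone
    # looking at the asset in Dagit) can find out where the file ended up
    context.add_output_metadata({"path": path})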
a
@Fabio Picchi But how do downstream assets access the path returned from your asset if it's not serialized? Assets don't (necessarily) share memory, and you also need persistent output for reruns and the like.
f
yep, that was my main doubt, but @Le Yang said I could read it from the output metadata of the previous step. That would be a form of serialization, but I wouldn't be storing a pickle of a Python string in s3 (which I imagine would happen if I simply returned an s3 path from one of my assets). I assumed from his message that you can attach metadata to the asset even if it doesn't produce an output 🤔 I can't quite follow. I guess I need to spend a bit more time on this and come back with a simple example...
we basically read files from s3, transform them somehow and write them back to s3 in a different path
some steps keep the number of files, while others receive m files and output n files. I would like to track that relationship, but at first I thought of just keeping track of the s3 prefixes where these files are stored
o
hi @Fabio Picchi! good questions -- I think you have a pretty good picture of how things work, but to confirm a few points:
• Returning a string (representing a filepath) from the body of an asset will indeed result in that string needing to be serialized somewhere (e.g. s3 if using the s3 IOManager).
• Using the Nothing type output will result in no serde behavior, meaning there will be no communication between upstream and downstream assets as to the location of files.
• This is fine in the case that the filepath is a function of the asset key + other static metadata about the asset, as both the upstream and downstream can run that same function to figure out where to read/write the file. However, if the filepath might change at runtime, this pattern won't work.
• You can also add metadata to a specific materialization event at runtime, but this materialization event is not (by default) available to you in the body of the downstream asset. So theoretically you could use context.add_output_metadata in the upstream asset to indicate the file location, then in the downstream asset query the dagster instance database to get the upstream event (along with that metadata), and use that to determine where to read from. This would avoid having to serialize this information to an external system like s3, but would require a bit of tinkering inside any of the assets that you wanted to use this pattern for (a rough sketch of this is below).
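A rough sketch of that last point, assuming the upstream asset attached a "path" metadata entry as in the earlier example; the exact event/metadata access can differ slightly between Dagster versions, and the asset names are hypothetical:

from dagster import AssetKey, DagsterEventType, EventRecordsFilter, asset

@asset(non_argument_deps={"raw_point_cloud"})
def classified_point_cloud(context) -> None:
    # fetch the most recent materialization event for the upstream asset
    records = context.instance.get_event_records(
        EventRecordsFilter(
            event_type=DagsterEventType.ASSET_MATERIALIZATION,
            asset_key=AssetKey("raw_point_cloud"),
        ),
        limit=1,
    )
    materialization = records[0].event_log_entry.asset_materialization
    upstream_path = materialization.metadata["path"].value  # the entry added upstream
    ...  # read the file at upstream_path, write the classified result somewhere else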
d
Hi @owen, I’ve stumbled upon the same question. To me, it looks like Dagster has misleading abstractions for supporting assets that work with files. I read “Assets without I/O” as assets that do not require any interaction with I/O, i.e. persistent storage. But in the question from @Fabio Picchi I think it’s clear that the assets are actually backed by I/O operations - the pipeline is based on storing and reading files from some storage. So the asset, in this case, is the data stored within the file - it is materialized outside of the typical Dagster IO manager flow. A common pipeline in the industry:
1) download an archive file from the internet
2) store it in a staging area in s3/gcs
3) read the file, parse / transform, and store a derived asset as a file or a structured table in the database.
I don’t think Dagster has a clear path for supporting this type of pipeline. To me, it would make sense to have a special type of IO manager that explicitly handles these file-based assets. Please let me know what you think. Thanks!
❤️ 1
o
Hi @Dmitry Ustimov! I think that makes sense -- one option for that sort of setup would be something like:
from dagster import asset

@asset(io_manager_key="custom_s3_bytes_io_manager")
def staged_file():
    file_bytes = download_file_from_internet()
    return file_bytes

@asset(io_manager_key="snowflake_pandas_io_manager")
def table(staged_file):
    # staged_file is just raw bytes, which can be parsed / transformed as desired
    # parse it into a dataframe or something
    return parsed_dataframe
In this case, you're correct that we don't have a native IOManager built for this type of operation, but making your own custom IOManager is generally not too difficult (for example, you would just need to slightly modify the existing s3 pickle io manager to remove the pickling step if you wanted to just store raw bytes: https://sourcegraph.com/github.com/dagster-io/dagster/-/blob/python_modules/libraries/dagster-aws/dagster_aws/s3/io_manager.py?L41)
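For reference, a minimal sketch of such a raw-bytes IO manager, not the built-in one; the bucket name and client wiring are placeholders, and the definition would still need to be bound to the custom_s3_bytes_io_manager resource key:

import boto3
from dagster import IOManager, io_manager

class S3BytesIOManager(IOManager):
    """Stores each asset's output as a raw object at s3://<bucket>/<asset key path>."""

    def __init__(self, s3_client, bucket):
        self._s3 = s3_client
        self._bucket = bucket

    def _key(self, context):
        return "/".join(context.asset_key.path)

    def handle_output(self, context, obj):
        # obj is expected to be bytes; upload it as-is, with no pickling step
        self._s3.put_object(Bucket=self._bucket, Key=self._key(context), Body=obj)

    def load_input(self, context):
        key = self._key(context.upstream_output)
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

@io_manager
def custom_s3_bytes_io_manager(_init_context):
    return S3BytesIOManager(boto3.client("s3"), "my-bucket")  # placeholder bucket name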
❤️ 1
d
thanks @owen! This is, of course, something that can be used to solve the task. Ideally, though, I believe this solution could be improved further to avoid loading downloaded files into memory before dumping them to persistent storage... many APIs naturally download files, and ideally we would want the IO manager to get a file handle and efficiently persist it to the corresponding permanent storage like s3 - doing that in chunks or using some other file-optimized mechanism
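One possible way to approximate that today: have the asset return a local file path and write an IO manager that streams the file to s3 with boto3's managed transfer (which uploads in chunks / multipart) instead of holding the bytes in memory. A rough sketch under that assumption, with the bucket supplied via config:

import os
import tempfile

import boto3
from dagster import IOManager, io_manager

class S3FileIOManager(IOManager):
    """Assets return a local file path; the manager streams the file to/from S3."""

    def __init__(self, bucket):
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def _key(self, context):
        return "/".join(context.asset_key.path)

    def handle_output(self, context, obj):
        # obj is a local file path; upload_file uses chunked / multipart transfer
        self._s3.upload_file(Filename=str(obj), Bucket=self._bucket, Key=self._key(context))

    def load_input(self, context):
        key = self._key(context.upstream_output)
        local_path = os.path.join(tempfile.mkdtemp(), os.path.basename(key))
        self._s3.download_file(Bucket=self._bucket, Key=key, Filename=local_path)
        return local_path

@io_manager(config_schema={"bucket": str})
def s3_file_io_manager(init_context):
    return S3FileIOManager(init_context.resource_config["bucket"])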