#dagster-support

drogozin

10/26/2022, 8:21 AM
Hi all. We are using the GDAL library to process images (masking, cropping, translating to a different format). The GDAL API takes a path to a file instead of a data stream (buffer); for example, gdal_merge accepts a list of paths to the files that should be merged into one. The challenge I am facing is that I don't understand how to fit this into the concept of assets in Dagster. For example, I have a method that takes a list of paths to files that we want to mask, crop and save as TIFF. This method doesn't return anything; it just saves the file on disk (it's more like a side effect of the function). As I mentioned, I struggle to understand how to fit this into the Dagster model. As I understand it, the ideal scenario for Dagster is working with pure functions, but in my case the third-party API doesn't allow that. Any help is appreciated, thank you.
from typing import List

def process_file(file_names: List[str]) -> None:
    # mask files
    # gdal_merge saves the file behind the scenes and doesn't return anything.
    gdal_merge("-o", "output.tif", file_names)

jamie

10/26/2022, 2:12 PM
hi @drogozin one option for you could be to write a custom IO manager. In it you could have the IO manager just return the path to the file instead of loading the file. I'd also need to understand a bit more about your use case, but it's possible that using ops + jobs might be a better fit. Are you trying to write a pipeline where you can repeatedly pass in a list of different files and Dagster will run the GDAL operations you need? If you can share a bit more about the overall process you're creating, I can help figure out whether assets or ops will be a better fit.

drogozin

10/27/2022, 9:23 AM
Hi @jamie, thanks for your reply. To give you some context: we have a MODIS image collection, a set of files downloaded from the MODIS satellite. New files are downloaded by another system and saved on S3. So, given a set of files stored in S3 (these files are tiles), we want to reproject them, apply different masks, and finally combine them, so that we have TIFF files of a specific area, let's say grouped by state. These processed files, stored as TIFF files of specific bands, are later used in different pipelines. So my idea was to define the TIFF file as an asset: the input asset for it will be the paths to the MODIS data saved on S3, and the output is the processed files (apparently also stored on S3). My problem is that in all the examples, assets are defined as a function that returns something, and that "something" is used by other jobs or assets. In my case the simplified example would be:
from dagster import AssetIn, asset

@asset(ins={"modis_data": AssetIn()})
def my_asset(modis_data):
    output = process_modis_data(modis_data)
    return output
where output represents the processed files. That would be the ideal scenario. Unfortunately, GDAL functions don't return values but store the processed file as a side effect: you pass input_path and output_path, and after the function completes, the file at output_path has been created. So the GDAL library doesn't fit the ideal scenario I've described above; it's more like this (and I don't know how to handle it):
@asset(ins={"modis_data": AssetIn()})
def my_asset(modis_data):
    process_modis_data(modis_data, outPath)
    # NO OUTPUT HERE

jamie

10/27/2022, 1:26 PM
ok thank you for the clarification, this really helps. I agree that assets make sense for your use case. You aren't strictly required to return anything from an asset. For example, the assets in this docs snippet don't return. Are you running into issues with the asset that doesn't return anything? I believe the asset should work, but the issue will come when you want to use the processed images in a downstream asset. That's where you have a couple of options:
1. Return the filepaths where the images are stored and use those in downstream assets.
2. Return an explicit `Output` with metadata of the filepaths, and write a custom IO manager to supply those paths (or load the images) to downstream assets.
For 2 it would look something like
from dagster import Output, asset

@asset
def my_asset(modis_data):
    # output_path is wherever process_data should write its result
    process_data(modis_data, output_path)
    return Output(None, metadata={"filepath": output_path})
Here is our documentation for writing an IO manager, and I'm happy to walk you through that process if you decide to go that route! Personally, I think option 1 sounds easiest, but I may be missing more context/requirements that make 2 the better option.
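A minimal sketch of option 1 in plain Python, assuming nothing about the real pipeline: the upstream step runs the side-effecting call and returns the path it wrote to, and the downstream step receives only that path. `write_merged` is a hypothetical stand-in for the GDAL call (it just concatenates bytes), and `preprocess` and `use_result` are illustrative names. The `@asset` decorators are left out so the sketch runs without Dagster installed.

```python
import tempfile
from pathlib import Path
from typing import List


def write_merged(input_paths: List[str], output_path: str) -> None:
    """Hypothetical stand-in for a GDAL call: writes a file, returns nothing."""
    with open(output_path, "wb") as out:
        for p in input_paths:
            out.write(Path(p).read_bytes())


def preprocess(input_paths: List[str], output_dir: str) -> str:
    """Upstream step: run the side-effecting call, then return where the result lives."""
    output_path = str(Path(output_dir) / "merged.tif")
    write_merged(input_paths, output_path)
    return output_path


def use_result(merged_path: str) -> int:
    """Downstream step: receives only the path and opens the file itself."""
    return Path(merged_path).stat().st_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tile = Path(tmp) / "tile.bin"
        tile.write_bytes(b"abc")
        print(use_result(preprocess([str(tile)], tmp)))
```

The point of the design is that the value handed between steps is small (a string), so whatever stores step outputs only ever has to persist the path, never the image.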

drogozin

10/31/2022, 5:24 PM
Hello @jamie, many thanks for your help! I can imagine it's pretty challenging for you to understand what I'm trying to achieve. Your comments were very helpful! For now, I went with the first path you described. I made some progress: what I have now is a preproc_asset that uses a custom IOManager, accepts an array of filepaths (that is a SourceAsset), does the processing and stores the files on the file system. The files are processed the way I want. I'm using partitioning, so I can specify the range of files, and everything works as I imagined. The only problem right now is exactly what you mentioned: the usage of this asset. The preproc_asset returns an output path, but for some reason the downstream asset that uses it gets the paths of the input files, not the output files. In the pseudo code below, that is the "path" variable from the upstream asset.
from dagster import AssetIn, OpExecutionContext, asset

@asset(
    ins={
        "modis_input_paths": AssetIn("modis_asset"),
    },
    io_manager_key="modis_asset_io_manager",
)
def modis_preproc_asset(
    context: OpExecutionContext,
    modis_input_paths,
):
    path = process_logic(modis_input_paths, *params)
    return path


@asset()
def modis_usage(context, modis_preproc_asset):
    context.log.info(modis_preproc_asset)  # this doesn't log the path from the function above
I believe I am misusing the io_manager here. As I overrode the "load_input" method, it works as expected. But for handle_output, I do nothing, just logging. Because the output should be just a path of the output files, but this method doesn't return anything. My next approach will be to override "handle_output" of the IO manager, and just store the output path somewhere, so the downstream asset can read it.

jamie

10/31/2022, 5:45 PM
Based on that, I think what's happening is that the `modis_usage` asset is using the default IO manager (the filesystem IO manager unless you specify otherwise in your run config). The FS IO manager stores the output of an asset on your file system, and when that output is loaded as an input in another asset, it reads from that file. So in `modis_usage` the FS IO manager is trying to read from a file where it thinks the upstream output was stored. But (based on my understanding) the IO manager for the upstream output (`modis_asset_io_manager`) is only logging info when `handle_output` is called. Instead, it would need to store the filepaths in a file that the downstream asset can find. You can change how you're specifying the IO manager for the first asset so that it is only responsible for loading the input asset, by using the `input_manager_key` parameter for AssetIn. This lets you specify an IO manager that is only used to load the corresponding asset. So if you set `modis_asset_io_manager` as the `input_manager_key`, then the default IO manager would still be used to store the output of `modis_preproc_asset` and load it as the input to `modis_usage`. Let me know if any of that doesn't make sense or doesn't work for you!
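The mismatch described above can be modelled in a few lines of plain Python. This is a toy model under stated assumptions, not Dagster's classes: `LoggingOnlyManager`, `PickleFileManager` and `run` are hypothetical names, and the real filesystem IO manager's storage layout is not reproduced. It only shows that a value stored by a manager that persists nothing cannot be loaded by a different manager downstream, and that storing and loading through the same file-backed manager works. In this model the mismatch surfaces as a missing file; in a real run the symptom can differ.

```python
import pickle
import tempfile
from pathlib import Path


class LoggingOnlyManager:
    """Toy model of an IO manager whose handle_output only logs."""

    def __init__(self):
        self.log = []

    def handle_output(self, asset_name, obj):
        # Nothing is persisted, so no other manager can find this value later.
        self.log.append(f"{asset_name} produced {obj!r}")


class PickleFileManager:
    """Toy model of a file-backed IO manager: one pickle file per asset."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def handle_output(self, asset_name, obj):
        (self.base_dir / asset_name).write_bytes(pickle.dumps(obj))

    def load_input(self, upstream_asset_name):
        return pickle.loads((self.base_dir / upstream_asset_name).read_bytes())


def run(output_manager, downstream_manager):
    """Store the upstream result with one manager, load it downstream with another."""
    output_path = "processed/modis.tif"  # illustrative value returned by the upstream asset
    output_manager.handle_output("modis_preproc_asset", output_path)
    return downstream_manager.load_input("modis_preproc_asset")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fs = PickleFileManager(tmp)
        try:
            run(LoggingOnlyManager(), fs)
        except FileNotFoundError:
            print("mismatch: nothing was stored for the downstream asset to read")
        print(run(fs, fs))
```

Restricting the custom manager to input loading, as `input_manager_key` does, corresponds to the second call: the output side falls back to the manager that the downstream asset also reads from.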

drogozin

11/02/2022, 9:42 AM
@jamie you are 100% correct. My idea was that I had to override the handle_output method or use the default method from LocalFileIOmanager. But it's very handy that I can set up the io_manager only for the AssetIn parameter, which I did, and everything worked as expected. Jamie, great kudos for your help. Frankly, in the very beginning I had doubts that I was following the right path. After some time playing with Dagster, it all makes sense, but in the very beginning it's quite hard to fit all the new concepts in your head at once. Thanks for the support and great examples 👍