# ask-community
Jan
Hello Dagster community! I'm new to Dagster and I've been struggling with creating a job that copies .zip files from one location to another. The situation is as follows:
• I have a bunch of relatively heavy .zip files on a mounted, remote NAS.
• I want to process these files, but I don't want to read them directly from the remote location (the connection is pretty slow).
• The idea is to define a step that copies these files from the remote location into a local, possibly temporary, one.

So far I've tried defining everything as assets. Since I don't know the exact number and names of these files, I've defined an asset that validates the file names in the remote directory and adds the valid file paths as dynamic partitions. A second asset uses this `partitions_def` and references the first asset in `non_argument_deps`, so the two steps are linked together in the graph. So far so good: the two assets appear connected in the UI and I can materialize the first one successfully.

The problem is how to deal with the second step. I'm okay with using the default `fs_io_manager` for the second asset's output, but I'm not sure how to actually perform the copy operation. I tried reading the file content as binary with `io.BytesIO` (one file at a time, since the second asset's partitions are individual file paths) and returning the buffer. That seems to materialize the asset successfully, but I can't find the saved file anywhere. I'm also not sure how to load these local files later on in another asset. Ideally, I'd like the second asset to materialize all the partitions and then start a task for the third asset that receives the paths to all the local files as input. Is my approach so far reasonable? Is there a recommended way to do this? Thank you in advance for any advice!
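A minimal sketch of the asset-based setup described above, under some assumptions: the names (`zip_files`, `validate_remote_zips`, `local_zip`) and the `REMOTE_DIR` path are hypothetical, and the second asset returns raw `bytes` rather than a `BytesIO` buffer so the default IO manager can pickle it:

```python
import os

from dagster import DynamicPartitionsDefinition, asset

# Hypothetical dynamic partitions definition: one partition per valid zip file
zip_partitions = DynamicPartitionsDefinition(name="zip_files")

REMOTE_DIR = "/mnt/nas/zips"  # assumed NAS mount point


@asset
def validate_remote_zips(context) -> None:
    """Scan the remote directory and register each valid .zip as a partition."""
    valid = [f for f in os.listdir(REMOTE_DIR) if f.endswith(".zip")]
    context.instance.add_dynamic_partitions("zip_files", valid)


@asset(partitions_def=zip_partitions, non_argument_deps={"validate_remote_zips"})
def local_zip(context) -> bytes:
    """Copy one remote zip (the current partition) by returning its bytes.

    With the default fs_io_manager the returned value is pickled to local
    storage (by default under $DAGSTER_HOME/storage/), which is why the
    copied file can be hard to find as a plain file on disk.
    """
    with open(os.path.join(REMOTE_DIR, context.partition_key), "rb") as f:
        return f.read()
```

Note the comment on `local_zip`: the `fs_io_manager` pickles return values into Dagster's storage directory rather than writing a plain .zip next to your code, which would explain not finding the saved file.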
Claire
Hi Jan. I think this approach sounds reasonable, though the part where you'd like to materialize all of the partitions together before the second step makes me think that dynamic outputs might be a better fit for your use case. You could:
• In an upstream op, read and validate the file names from the remote directory and create a temporary directory to store your copied files. Yield a dynamic output for each file path.
• Map an op over the dynamic outputs, copying each file into your temporary directory.
• Add a downstream collect operation. It would accept all of the file paths as inputs, and then you can perform whatever operation you need on the copied files in the temp dir.
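A minimal sketch of the op/job version Claire outlines, with assumed names and paths (`REMOTE_DIR`, `LOCAL_DIR`, and all op names are hypothetical). The mapping key is sanitized because Dagster restricts mapping keys to letters, digits, and underscores:

```python
import os
import re
import shutil
from typing import List

from dagster import DynamicOut, DynamicOutput, job, op

REMOTE_DIR = "/mnt/nas/zips"    # assumed NAS mount point
LOCAL_DIR = "/tmp/zip_staging"  # assumed local staging directory


@op(out=DynamicOut(str))
def list_remote_zips():
    """Validate file names on the NAS and yield one dynamic output per zip."""
    for name in sorted(os.listdir(REMOTE_DIR)):
        if name.endswith(".zip"):
            yield DynamicOutput(
                os.path.join(REMOTE_DIR, name),
                # Mapping keys may only contain letters, digits, and underscores
                mapping_key=re.sub(r"\W", "_", name),
            )


@op
def copy_zip(remote_path: str) -> str:
    """Copy a single zip into the local staging directory; return the local path."""
    os.makedirs(LOCAL_DIR, exist_ok=True)
    return shutil.copy(remote_path, LOCAL_DIR)


@op
def process_local_zips(local_paths: List[str]) -> None:
    """Runs once every copy has finished, with all local paths collected as input."""
    for path in local_paths:
        print(f"processing {path}")  # placeholder for the real processing step


@job
def stage_and_process():
    process_local_zips(list_remote_zips().map(copy_zip).collect())
```

Here `.map(copy_zip)` fans out one copy op per discovered file, and `.collect()` fans the results back in, so `process_local_zips` only starts after every file has been staged locally and receives all the local paths at once.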
Jan
Hi Claire, thank you so much for your suggestion! I will definitely look into dynamic outputs; I haven't gotten to that topic in the documentation yet!