Curious for a bit of guidance on the best way to copy data from one s3 bucket to another | dagster #ask-community

Paige Moody

06/08/2023, 8:36 PM

👋 Curious for a bit of guidance on the best way to copy data from one s3 bucket to another as part of a job... Previously, for a small number of files we've used multi-assets, but for this case there are >1k items in the source bucket. Would an op be a better choice for this kind of operation? Thanks in advance for any insight!

Zach

06/08/2023, 8:41 PM

Do you want to model the files all separately as different data assets? Or are you more interested in just the transfer of the data to a new location, on which more ops / steps will operate?

Paige Moody

06/08/2023, 8:49 PM

We are more interested in just the transfer - no need to have each be an individual asset

Paige Moody

06/08/2023, 8:51 PM

(want to avoid each file being a new asset, don't want to blow up the asset lineage UI since it's useful for other assets)

Zach

06/08/2023, 8:55 PM

Okay in that case it seems like a dynamic graph might be a good fit here - just map over the list of files you want to download, and have the mapped op be the transfer op

sandy

06/08/2023, 11:32 PM

Dynamic graphs are great if you want parallelism in the transfers. If you don't care about parallelism, you could just represent the collection of all the files as one logical asset.

Paige Moody

06/09/2023, 1:08 PM

Thank you @Zach + @sandy ! In this case we aren't super concerned with parallelism 👍

Paige Moody

06/09/2023, 1:10 PM

@sandy - could you say a bit more about how we might represent a collection of files as an asset? Would this differ from the multi-asset approach I mentioned above that we are using for a handful of other files getting transferred?

sandy

06/09/2023, 2:56 PM

basically:

from dagster import asset

@asset
def my_files() -> None:
    # list_files / copy_file stand in for your own S3 helpers
    input_files = list_files(source_bucket)
    for file in input_files:
        copy_file(file, target_bucket)
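The `list_files` and `copy_file` helpers in the snippet above are left undefined. One way to fill them in, sketched here under the assumption that `s3` is a boto3 S3 client (e.g. `boto3.client("s3")`); the function names `list_keys` and `copy_bucket` are made up for illustration. Pagination matters for the >1k case because a single `list_objects_v2` call returns at most 1,000 keys:

```python
from typing import Any, Iterator


def list_keys(s3: Any, bucket: str) -> Iterator[str]:
    """Yield every object key in the bucket, one page at a time."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from the response when a page has no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def copy_bucket(s3: Any, source_bucket: str, target_bucket: str) -> int:
    """Server-side copy of every object; returns the number copied."""
    copied = 0
    for key in list_keys(s3, source_bucket):
        # copy_object copies within S3, so the data never passes through
        # the machine running the asset.
        s3.copy_object(
            Bucket=target_bucket,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
        copied += 1
    return copied
```

Inside the asset function this would be called as something like `copy_bucket(boto3.client("s3"), source_bucket, target_bucket)`. Taking the client as a parameter keeps the helper easy to test with a fake.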

Paige Moody

06/09/2023, 3:34 PM

gotcha - this might be based on a misunderstanding, but I thought that an asset had to (or, was supposed to) return something. Perhaps that thinking is tied to our use of IO managers 🤔

sandy

06/09/2023, 3:36 PM

ah - assets don't have to return something. here's more info on this: https://docs.dagster.io/tutorial/managing-your-own-io#tutorial-part-7-managing-your-own-io

James O'Toole

06/09/2023, 3:46 PM

what's the right way to think about when to use an asset that doesn't accept an input or return anything vs. when to use an op? i'm very new to dagster, and thought that ops were meant to be used in scenarios like this where we just want to execute arbitrary code rather than pass a data structure around

sandy

06/09/2023, 3:51 PM

Here's our general recommendation on when to use assets vs. ops: https://docs.dagster.io/guides/dagster/how-assets-relate-to-ops-and-graphs#when-should-i-use-assets-or-ops-and-graphs Let me know if it's still unclear

James O'Toole

06/09/2023, 4:05 PM

thx sandy! saw that—i think my confusion stems from the docs saying that an important distinction between use cases for ops vs. assets is that SDAs "couple an asset to the function and upstream assets that are used to produce its contents." in the case we're describing here, where it's just an s3 copy operation, there are no upstream or downstream dependencies, nor a return value that could be the subject of additional computation

sandy

06/09/2023, 4:06 PM

the way we look at it, the asset is the location in S3 that you're copying the data to, and the function is responsible for producing the data that gets stored at that location (even if an IO manager isn't shepherding that process)

Paige Moody

06/09/2023, 4:32 PM

ah this is interesting! to check my understanding - the asset is the location in s3, not the file once it exists in that location?

sandy

06/09/2023, 5:25 PM

right - each time the asset is materialized, it overwrites the data that's stored at the same location
