:wave: Curious for a bit of guidance on the best w...
# ask-community
p
👋 Curious for a bit of guidance on the best way to copy data from one s3 bucket to another as part of a job... Previously, for a small number of files we've used multi-assets, but for this case there are >1k items in the source bucket. Would an op be a better choice for this kind of operation? Thanks in advance for any insight!
z
Do you want to model the files all separately as different data assets? Or are you more interested in just the transfer of the data to a new location, on which more ops / steps will operate?
p
We are more interested in just the transfer - no need to have each be an individual asset
(want to avoid each file being a new asset, don't want to blow up the asset lineage UI since it's useful for other assets)
z
Okay in that case it seems like a dynamic graph might be a good fit here - just map over the list of files you want to download, and have the mapped op be the transfer op
s
Dynamic graphs are great if you want parallelism in the transfers. If you don't care about parallelism, you could just represent the collection of all the files as one logical asset.
p
Thank you @Zach + @sandy ! In this case we aren't super concerned with parallelism 👍
@sandy - could you say a bit more about how we might represent a collection of files as an asset? Would this differ from the multi-asset approach I mentioned above that we are using for a handful of other files getting transferred?
s
basically:
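from dagster import asset

@asset
def my_files() -> None:
    # list_files and copy_file are placeholder helpers, not Dagster APIs
    input_files = list_files(source_bucket)
    for file in input_files:
        copy_file(file, target_bucket)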
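The `list_files` and `copy_file` helpers in the snippet above are never defined in the thread. A minimal sketch of what they might look like, assuming a boto3-style S3 client is passed in explicitly (the extra `s3` and `prefix` parameters and the `copy_all` wrapper are assumptions made for illustration, not anything from the thread or from Dagster):

```python
from typing import Any, Iterator


def list_files(s3: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key in `bucket`, following pagination."""
    # Assumes `s3` behaves like boto3.client("s3").
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matched.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def copy_file(s3: Any, key: str, source_bucket: str, target_bucket: str) -> None:
    """Copy one object server-side, keeping the same key in the target bucket."""
    s3.copy_object(
        Bucket=target_bucket,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )


def copy_all(s3: Any, source_bucket: str, target_bucket: str) -> int:
    """Copy every object from source to target and return how many were copied."""
    count = 0
    for key in list_files(s3, source_bucket):
        copy_file(s3, key, source_bucket, target_bucket)
        count += 1
    return count
```

Inside the asset body, this would presumably be called as `copy_all(boto3.client("s3"), source_bucket, target_bucket)`. Note that a single `copy_object` request is limited to objects of up to 5 GB, so larger files would need a multipart or managed copy instead.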
p
gotcha - this might be based on a misunderstanding, but I thought that an asset had to (or, was supposed to) return something. Perhaps that thinking is tied to our use of IO managers 🤔
s
ah - assets don't have to return something. here's more info on this: https://docs.dagster.io/tutorial/managing-your-own-io#tutorial-part-7-managing-your-own-io
j
what's the right way to think about when to use an asset that doesn't accept an input or return anything vs. when to use an op? i'm very new to dagster, and thought that ops were meant to be used in scenarios like this where we just want to execute arbitrary code rather than pass a data structure around
s
Here's our general recommendation on when to use assets vs. ops: https://docs.dagster.io/guides/dagster/how-assets-relate-to-ops-and-graphs#when-should-i-use-assets-or-ops-and-graphs Let me know if it's still unclear
j
thx sandy! saw that—i think my confusion stems from the docs saying that an important distinction between use cases for ops vs. assets is that SDAs "couple an asset to the function and upstream assets that are used to produce its contents." in the case we're describing here, where it's just an s3 copy operation, there are no upstream or downstream dependencies, nor a return value that could be the subject of additional computation
s
the way we look at it, the asset is the location in S3 that you're copying the data to, and the function is responsible for producing the data that gets stored at that location (even if an IO manager isn't shepherding that process)
p
ah this is interesting! to check my understanding - the asset is the location in s3, not the file once it exists in that location?
s
right - each time the asset is materialized, it overwrites the data that's stored at the same location