When creating assets at runtime how can their line...
# ask-community
t
When creating assets at runtime, how can their lineage be derived? We have the following ops connected in a job:
1. look for new files on an SFTP server
2. download files from SFTP to the local filesystem
3. uncompress any zip archives
4. upload files from the local filesystem to an S3 location
Since we do not know which files, or how many, will exist until runtime, we are creating assets within the ops via AssetMaterialization. This is working, but when viewing the assets in Dagit there is no lineage. Is there a way we can link these so that the lineage shows sftp file asset --> local compressed file --> local uncompressed file(s) --> s3 file? Is there a better approach to replicate this pattern using assets instead of ops?
s
Hi Tom, good question. I am not 100% sure, but I think that asset lineage is only supported for software-defined assets (SDAs), as opposed to assets created within ops. cc @sandy for confirmation, and maybe advice as to whether your dynamic asset creation needs can be accommodated within SDAs.
t
We created a lot of these ops a while back and with the recent emphasis on SDAs we are a little unsure how/if we can port them over to SDAs. Any advice would be appreciated
s
Hey Tom - we only recommend using SDAs in cases where you know what the assets will be ahead of time. Ops are and will remain fully supported for cases where the assets aren't known until runtime. I filed an issue to track adding lineage for assets in these cases: https://github.com/dagster-io/dagster/issues/9056 Something that we've discussed adding for SDAs is "runtime asset partitions" - i.e. the set of SDAs would still be determined at definition time, but, within an SDA, you could add new partitions at runtime. So e.g. you'd be able to have a single asset that has a partition for each of your S3 files. Would that be a way that you'd be open to modeling your assets? Here's the issue where we're tracking this: https://github.com/dagster-io/dagster/issues/7943.
t
Thanks for the response, Sandy, this is interesting. If I'm understanding you correctly regarding runtime asset partitions, would the SDA be something like "vendor_abc_s3_files" with a 1:1 relationship between the asset partitions and S3 files? Would a similar pattern in ops be logging multiple materializations for "vendor_abc_s3_files" with a partition date and more specific file info attached via metadata? Something like
context.log_event(
    AssetMaterialization(
        asset_key="vendor_abc_s3_files",
        partition=partition_date,
        metadata={
            # ...attributes about the file...
        }
    )
)
We currently derive runtime asset keys like job_name/run_id/op_name/filename, but I'm wondering if that is too narrowly scoped and a more generalized asset key with multiple partitions is a better approach.
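To make the two key schemes concrete, here is a minimal sketch in plain Python. The helper names (`per_file_key`, `partitioned_materialization_kwargs`) and the sample values are hypothetical and not part of Dagster; the second helper only assembles the keyword arguments that would be passed to `AssetMaterialization`, so it runs without Dagster installed.

```python
from typing import Any, Dict, List


def per_file_key(job_name: str, run_id: str, op_name: str, filename: str) -> List[str]:
    """Narrow scheme: one distinct asset key per file per run (hypothetical helper)."""
    return [job_name, run_id, op_name, filename]


def partitioned_materialization_kwargs(
    asset_key: str, partition_date: str, file_metadata: Dict[str, Any]
) -> Dict[str, Any]:
    """Generalized scheme: one shared asset key, with the partition and
    metadata carrying the per-file detail (hypothetical helper)."""
    return {
        "asset_key": asset_key,
        "partition": partition_date,
        "metadata": dict(file_metadata),  # copy so the caller's dict is not shared
    }


# Sample values below are made up for illustration.
kwargs = partitioned_materialization_kwargs(
    "vendor_abc_s3_files", "2022-07-26", {"filename": "a.csv"}
)
# Inside an op this would be used as:
#     context.log_event(AssetMaterialization(**kwargs))
```

With the generalized scheme the asset catalog holds a single entry per vendor feed, and individual files are found through partitions and metadata rather than through separate keys.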
s
"If I'm understanding you correctly regarding runtime asset partitions, would the SDA be something like "vendor_abc_s3_files" with a 1:1 relationship between the asset partitions and s3 files? Would a similar pattern in ops be logging multiple materializations for "vendor_abc_s3_files" with a partition date and more specific file info attached via metadata?"
Exactly
k
Hi. One way to do this is to provide the asset_key as a list of strings instead of an AssetKey(). This creates a kind of lineage (asset inside asset). At the end of every op in the chain, you can create an asset by adding step names to the list incrementally:
op1 -> AssetMaterialization(asset_key=["sftp file asset"])
op2 -> AssetMaterialization(asset_key=["sftp file asset", "local compressed file"])
op3 -> AssetMaterialization(asset_key=["sftp file asset", "local compressed file", "local uncompressed file(s)"])
op4 -> AssetMaterialization(asset_key=["sftp file asset", "local compressed file", "local uncompressed file(s)", "s3 file"])
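The incremental key paths described here could be generated rather than typed out by hand. The sketch below is plain Python; `STAGES` and `key_path_for_stage` are hypothetical names, and the stage labels are the ones used in this thread. Note that this nests the keys as a path hierarchy in the asset catalog, which approximates lineage rather than producing a true dependency graph.

```python
from typing import List

# Stage labels taken from the pipeline discussed in this thread.
STAGES = [
    "sftp file asset",
    "local compressed file",
    "local uncompressed file(s)",
    "s3 file",
]


def key_path_for_stage(stages: List[str], stage_index: int) -> List[str]:
    """Return the cumulative key path for the op at stage_index (0-based).

    Each op's key path is its own label appended to the labels of all
    upstream ops, so later stages nest under earlier ones.
    """
    if not 0 <= stage_index < len(stages):
        raise IndexError(f"stage_index {stage_index} is out of range")
    return stages[: stage_index + 1]


# Inside the second op this would be used as:
#     context.log_event(
#         AssetMaterialization(asset_key=key_path_for_stage(STAGES, 1))
#     )
```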
h
For lineage, if I create an AssetGroup with SourceAsset that has an asset key pointing to an asset generated by another job (within same repo), would the lineage work correctly?
s
@Hebo Yang to make sure I'm understanding correctly, the other job would not be an asset job?
h
the other job is also an asset job. However, all of our jobs are generated functionally/dynamically. I am hoping to reference other jobs by asset key, without holding a reference to the job
s
I see. In 0.15, if they're in the same repository and they're both asset jobs, you shouldn't need a SourceAsset - the lineage should just work. In 0.14, I believe you would need a SourceAsset in the downstream AssetGroup to represent the asset in the upstream asset group.
h
Awesome! Thanks Sandy!