# ask-community
a
Is there a best practice for defining parents for an s3 stage in snowflake?
j
Hi Alec, what do you mean by parents? Dagster Ops?
a
Hey Johann, what I'm wondering is: if I have an external stage for my s3 bucket in snowflake, how can I tell dagster that these two are linked as assets?
an asset is produced in s3 and an asset is produced in snowflake when copying from the stage
@johann
j
@owen may have thoughts here
o
is the operation that puts data in s3 orchestrated by dagster, or does it just arrive there from some external process? if the former, you would just create another asset whose body is that s3 computation, and have the snowflake asset (which does the copy operation) list this s3 asset as its input. if the latter, your snowflake asset would still be defined exactly the same way, but instead of supplying a fully-specified s3 asset, you would create a SourceAsset to represent that entity: https://docs.dagster.io/concepts/assets/software-defined-assets#source-assets---representing-assets-are-generated-somewhere-else
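for reference, a minimal sketch of that SourceAsset variant (the key "raw_s3_data" and the group name are made up for illustration, and the COPY logic is left as a placeholder):
from dagster import AssetGroup, AssetKey, SourceAsset, asset

# represents data that lands in s3 via some external process -- no computation attached
raw_s3_data = SourceAsset(key=AssetKey("raw_s3_data"))

# non_argument_deps records the dependency without actually loading anything from s3
@asset(non_argument_deps={AssetKey("raw_s3_data")})
def snowflake_table() -> None:
    # run COPY INTO <table> FROM @<external stage> with your snowflake client here
    ...

snowflake_group = AssetGroup(assets=[snowflake_table], source_assets=[raw_s3_data])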
a
It is the former, so I guess what I'm wondering is where the external stage fits into this. The flow is... data is written to s3 in JSON format, partitioned by date. It is copied into snowflake using an external stage that is connected to my s3 bucket.
all orchestrated by dagster
Does it make sense to separate these into multiple jobs?
1. data source to s3
2. copy from external stage
Where 2 is somehow triggered by a sensor?
o
haven't used snowflake much myself, but to the best of my knowledge, the external stage is just a pointer to an s3 bucket, not a copy of the data therein, is that right? If that's the case, I don't think the fact that there's an external stage involved has any impact on the asset layer (there's still only two assets, s3 data and snowflake table); it becomes just an operational detail. so I would structure it something like
from dagster import AssetGroup, asset

@asset
def s3_asset():
    # placeholder helper: writes the partitioned JSON data to the s3 bucket
    write_data_to_s3(...)

@asset
def snowflake_asset(s3_asset) -> None:
    # placeholder helper: COPY INTO the snowflake table from the external stage
    copy_from_stage_to_table()

do_both_job = AssetGroup([s3_asset, snowflake_asset]).build_job("do_both")
a
correct, the external stage is just a pointer to an s3 bucket
o
there's some question of what data you actually want to pass between these two assets. in this code sample, I assume you don't really need to pass any data, and the s3_asset argument is just used to tell dagster that there's a dependency between these two steps
but I could imagine wanting to pass a bucket name or something like that. IOManagers https://docs.dagster.io/concepts/io-management/io-managers#io-managers give you a lot of flexibility with those sorts of decisions
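for example, a rough sketch of an io manager that just hands a bucket name downstream instead of storing data (names here are made up for illustration, not a full s3 io manager):
from dagster import IOManager, io_manager

class S3PointerIOManager(IOManager):
    """passes a bucket name between assets rather than the data itself"""

    def __init__(self, bucket: str):
        self._bucket = bucket

    def handle_output(self, context, obj):
        # the s3 asset already wrote its own data, so there's nothing to persist here
        context.log.info(f"asset data lives in s3://{self._bucket}")

    def load_input(self, context):
        # the snowflake asset receives the bucket name to build its COPY statement
        return self._bucket

@io_manager(config_schema={"bucket": str})
def s3_pointer_io_manager(init_context):
    return S3PointerIOManager(init_context.resource_config["bucket"])
you could then wire it in per environment, e.g. pass resource_defs={"io_manager": s3_pointer_io_manager.configured({"bucket": "dev-bucket"})} to the AssetGroup in dev and point it at a different bucket in prod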
a
Yeah, I had originally done this. I don't need to pass data between the two right now but perhaps an IO manager would be interesting here
Say if I want to swap buckets for certain environments
thanks for the info!
o
no problem 🙂