I've used Dagster with Snowflake before but am now using SDAs dagster #integration-snowflake


Seth Kimmel

02/06/2023, 5:57 PM

I've used Dagster with Snowflake before, but am now using SDAs. I'm trying to make sure I'm thinking about and using them in the right way. When I'm running an op that strictly uses data within Snowflake (i.e. just running a query within Snowflake that creates a new table), should I be using an asset? Or is this simply an op?

clay

02/06/2023, 6:19 PM

I was wondering about that the other day and ended up creating Assets that return None and using `non_argument_deps` to make sure dependencies were properly tracked.

clay

02/06/2023, 6:19 PM

It works fine, but feels a bit like an abuse of the term Asset, how Dagster defines it

Seth Kimmel

02/06/2023, 7:31 PM

Yeah... doesn't really seem right, as it's no longer "tracked" within the asset ecosystem. @jamie can you advise?

clay

02/06/2023, 7:35 PM

Seems like it would be ok if the resulting table(s) could be tagged as Assets at the end of the function that creates them

Stephen Bailey

02/06/2023, 7:35 PM

i do something like @clay in much of our models. I describe it as the "weak asset" approach to using assets -- basically assets sans IO Manager and argument-level dependencies.

jamie

02/06/2023, 7:40 PM

i don’t know if i’d go so far as to say executing a query within an asset and returning None is an abuse of assets. the “data asset” is the new table that’s created and it’s represented in dagster by the “software defined asset” that executes the query to create the table. I think that approach makes a lot of sense in a lot of cases. There are other cases when you may need to pull the data into a pandas dataframe and do stuff with it, and in that case, using the IO manager abstractions can be a good way to potentially reduce code duplication and abstract away the IO stuff

it’s no longer “tracked” within the asset ecosystem

can you elaborate a bit more on what you mean by this? keeping the snowflake query as an op would lose some of the additional features around assets, but just want to make sure that’s what you’re referring to

jamie

02/06/2023, 7:44 PM

obviously i don’t know what your whole use case is, but if you’re transitioning to SDAs for your other dagster code, doing what clay recommended with `non_argument_deps` and executing the query directly in snowflake seems like the way to go. that’ll allow you to integrate the asset in with the rest of your assets and set it as an upstream dependency, take advantage of freshness policies, asset reconciliation, etc
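A minimal sketch of the "run the query in the warehouse and return None" pattern described above. It uses only the Python standard library so it runs anywhere: `sqlite3` stands in for Snowflake, and the table names `raw_orders` and `daily_orders` are made up for illustration. In Dagster, `build_daily_orders` would be the body of an `@asset` that declares `raw_orders` through `non_argument_deps`; the decorator is left out here so the example has no Dagster dependency.

```python
import sqlite3


def build_daily_orders(conn: sqlite3.Connection) -> None:
    """Create daily_orders from raw_orders entirely inside the database.

    Nothing is pulled into Python and nothing is returned: the table
    itself is the data asset.
    """
    conn.execute("DROP TABLE IF EXISTS daily_orders")
    conn.execute(
        """
        CREATE TABLE daily_orders AS
        SELECT order_date, COUNT(*) AS n_orders, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY order_date
        """
    )
    conn.commit()


# Hypothetical upstream table, standing in for an existing Snowflake table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("2023-02-06", 10.0), ("2023-02-06", 5.0), ("2023-02-07", 7.5)],
)

result = build_daily_orders(conn)
rows = conn.execute(
    "SELECT order_date, n_orders, revenue FROM daily_orders ORDER BY order_date"
).fetchall()
```

The function returns `None`, yet the new table exists afterwards and can be queried by anything downstream that knows its name.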

Seth Kimmel

02/06/2023, 7:57 PM

Thanks for the help here! I think I'll probably go with the non-argument deps and return None approach. When I say not "tracked", I mean that from what I can tell in the docs, dagster has some notion of the contents of the resulting output from a given SDA in most cases. Since the output never passes through an IO manager, I assume that dagster has no way of knowing about the results of the query you execute. Therefore, the only thing being tracked is the operation itself, not the resulting output, which feels more appropriate to call an "op".

Stephen Bailey

02/06/2023, 8:18 PM

instead of returning None, you can return metadata about the object, too. things like `{"name": "..."}`, etc. can be useful for downstream operations to take advantage of, especially when you get into cross-system dependencies -- for example, a downstream asset that is going to need to know the name of the snowflake table, but not the actual `df` of its contents.
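A rough sketch of that idea, again with `sqlite3` standing in for Snowflake and with hypothetical names (`users`, `active_users`, `export_job_spec`). The upstream function creates the table in the database and returns a small dict describing it; the downstream function only ever sees that dict. In Dagster both functions would be `@asset`s, with the downstream parameter name matching the upstream asset name; the decorators are omitted so the example runs without Dagster installed.

```python
import sqlite3


def build_active_users(conn: sqlite3.Connection) -> dict:
    """Create the table in the database and return a small description of it."""
    table = "active_users"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(
        f"CREATE TABLE {table} AS SELECT id, email FROM users WHERE active = 1"
    )
    conn.commit()
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"name": table, "row_count": row_count}


def export_job_spec(active_users: dict) -> dict:
    """Downstream step: needs the table name, never the rows themselves."""
    return {"source_table": active_users["name"], "format": "csv"}


# Hypothetical source data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a@example.com", 1), (2, "b@example.com", 0), (3, "c@example.com", 1)],
)

info = build_active_users(conn)
spec = export_job_spec(info)
```

Only the small dict crosses the boundary between the two steps, so whatever storage sits between them never has to hold the table contents.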

❤️ 1

Seth Kimmel

02/06/2023, 8:21 PM

ah clever!

Stephen Bailey

02/06/2023, 8:22 PM

we have a sagemaker pipeline that is entirely built on external compute, so dagster never actually touches the objects themselves. it looks like:


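from dagster import asset

# `sagemaker` is a placeholder client module here; its method names are illustrative.
@asset(non_argument_deps={"snowflake_table_1_key", "snowflake_table_2_key"})
def training_job():
    job_id = sagemaker.execute_training_job(...)
    results_dict = sagemaker.get_training_job(job_id)
    return results_dict

# The parameter name must match the upstream asset name (training_job).
@asset
def model(training_job):
    model_id = sagemaker.create_model(training_job_id=training_job["name"])
    results_dict = sagemaker.get_model(model_id)  # assumed helper on the placeholder client
    return results_dict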

@asset
def endpoint(model):
    ...
    return results_dict

keeping the lineage clean is really useful, as it lets you build cross-system lineage, which is where assets really pay off imo

Seth Kimmel

02/06/2023, 8:26 PM

Cool, very helpful. I think it would be helpful if the docs said something about the notion of using a pattern that tracks metadata/asset lineage and how that's distinguished from the actual assets themselves. cc: @jamie

👀 1

sandy

02/07/2023, 12:17 AM

Even if it returns `None`, we consider it a software-defined asset because it defines how to produce a particular data asset. By using the `@asset` decorator with a `None` output, the developer is kind of agreeing to a "contract" with the framework that they will materialize the asset when the decorated function is invoked. IO managers make it more ergonomic to write the code that materializes that data asset, but they're not fundamental to the paradigm. Here's a little more on this subject: https://docs.dagster.io/tutorial/assets/non-argument-deps#assets-without-arguments-and-return-values
