# ask-community
b
👋 Heya folks, I'm working through Snowflake and Dagster SDAs and I've got a question - it's highly likely that I'm failing to understand something, but in cases where I want to read in a table that already exists in Snowflake, how do I use that asset later on? The example gives the following:
```python
import pandas as pd
from dagster import asset

@asset(
    key_prefix=["my_schema"]  # will be used as the schema in Snowflake
)
def my_table() -> pd.DataFrame:  # the name of the asset will be the table name
    ...
```
j
@Bojan you'll need to specify an IO manager to replace the default filesystem IO manager (which just stores the pickled value that the function returns). dagster-snowflake provides an IO manager that does exactly what you're describing.
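(For reference, a minimal sketch of wiring that up, assuming the dagster-snowflake-pandas package's `SnowflakePandasIOManager`; the account, database, and warehouse values are placeholders:)
```python
from dagster import Definitions, EnvVar
from dagster_snowflake_pandas import SnowflakePandasIOManager

defs = Definitions(
    assets=[my_table],  # my_table as defined above
    resources={
        # replaces the default filesystem IO manager for these assets
        "io_manager": SnowflakePandasIOManager(
            account="abc12345.us-east-1",  # placeholder
            user=EnvVar("SNOWFLAKE_USER"),
            password=EnvVar("SNOWFLAKE_PASSWORD"),
            database="MY_DATABASE",
            warehouse="MY_WAREHOUSE",
        )
    },
)
```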
b
Oh, I did all of that; the issue is that I can't really use the initial table in the example above - I don't know how to reference it.
j
@Bojan if you're using the IO manager from dagster-snowflake, that IO manager provides you the contents of the table as an input to the downstream asset.
If you instead want to operate on the table directly, you might need to define a custom IO manager that provides a table reference instead of the contents.
Almost all of our IO managers are custom because of similar needs
(e.g., we don't want to fetch full tables all of the time).
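(A rough sketch of that reference-passing idea, purely illustrative - the class name, config keys, and naming scheme below are made up for this example:)
```python
from dagster import IOManager, io_manager

class TableRefIOManager(IOManager):
    """Hands downstream assets a table *reference* rather than its contents."""

    def __init__(self, database: str, schema: str):
        self._database = database
        self._schema = schema

    def handle_output(self, context, obj):
        # Writing is out of scope for this sketch; assume the table is
        # populated elsewhere (or by another IO manager).
        pass

    def load_input(self, context):
        # Downstream assets receive a fully-qualified table name as a string
        table = context.upstream_output.asset_key.path[-1]
        return f"{self._database}.{self._schema}.{table}"

@io_manager(config_schema={"database": str, "schema": str})
def table_ref_io_manager(init_context):
    return TableRefIOManager(**init_context.resource_config)
```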
b
Gotcha, I'm still testing this out, but I suppose I'm getting some of these asset concepts wrong - e.g. even when I'm using Dagster's official Snowflake IO manager. Let's say I want to read in the whole table as a DataFrame, slice it around, and then pass it downstream. For that initial table read, should I try to create an asset (and if so, how would I go about it) or should I use an op? E.g.:
```python
@asset(
    key_prefix=["my_schema"]  # will be used as the schema in Snowflake
)
def scrub_my_table() -> pd.DataFrame:
    return my_table.dropna()  # This obviously doesn't work, but how would I read in my_table initially?
```
j
You could do 2 things:
1. Create an asset to represent the table and pass it downstream. The IO manager for your table asset will then be responsible for how the table is presented to the downstream asset.
2. Just configure your downstream asset with a resource to access the table and do the reading in the op (the function of the SDA).
If you need to ensure ordering between the table and any downstream assets (e.g., you want to update the table then consume it) - I would go with option 1.
If the table is static, or you don't need to worry about ordering, option 2 is much simpler.
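(A minimal sketch of option 1, assuming a Snowflake IO manager like the one above is bound as `io_manager` - the asset and schema names are just the ones from this thread:)
```python
import pandas as pd
from dagster import AssetIn, asset

@asset(key_prefix=["my_schema"])
def my_table() -> pd.DataFrame:
    # Whatever produces the table; the IO manager writes the returned
    # DataFrame to my_schema.my_table in Snowflake.
    ...

@asset(
    key_prefix=["my_schema"],
    ins={"my_table": AssetIn(key_prefix=["my_schema"])},
)
def scrub_my_table(my_table: pd.DataFrame) -> pd.DataFrame:
    # The IO manager loads my_schema.my_table back as a DataFrame here
    return my_table.dropna()
```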
b
Sorry for not replying @James Hale, I got busy with other stuff and completely forgot - I'll get back to this tomorrow. Thanks for the input, you're awesome!
I'm essentially trying to achieve 1., but I guess I haven't figured out how to create and return the table from an asset using the Snowflake IO manager.
@James Hale sorry to be bothersome, but if you have an example for "Create an asset to represent the table and pass it downstream," it would be quite helpful!
j
@Bojan do you have logic that adds values to the table? For us, table creation is handled in the IO manager. E.g., we have an IO manager that implements `handle_output()` as:
1. Store the `obj` output yielded by the software-defined asset as a JSON temp file
2. Create a table on Snowflake if it doesn't exist
3. Create a stage on Snowflake if it doesn't exist
4. `PUT` the file to the stage
5. `MERGE` the contents of the file into the table

You can see that IO manager here - it's implemented with SQLAlchemy: https://gist.github.com/jayhale/c5f08dcd1656db1b82e3177425911091
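(Not the gist's actual code, but an illustrative skeleton of those five steps - the table columns and the final statement are simplified, and a real MERGE would match on a key column:)
```python
import json
import tempfile

from dagster import IOManager

class StagedMergeIOManager(IOManager):
    """Illustrative skeleton only. `conn` is assumed to be a DB-API
    connection to Snowflake (e.g. from snowflake-connector-python)."""

    def __init__(self, conn, schema: str):
        self._conn = conn
        self._schema = schema

    def handle_output(self, context, obj):
        table = context.asset_key.path[-1]
        stage = f"{table}_stage"
        # 1. Store the output as a JSON temp file
        with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
            json.dump(obj, f)
            path = f.name
        cur = self._conn.cursor()
        # 2. Create the table if it doesn't exist (columns simplified)
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {self._schema}.{table} (v VARIANT)"
        )
        # 3. Create the stage if it doesn't exist
        cur.execute(f"CREATE STAGE IF NOT EXISTS {self._schema}.{stage}")
        # 4. PUT the file to the stage
        cur.execute(f"PUT file://{path} @{self._schema}.{stage}")
        # 5. Load the staged contents into the table (shown as a plain COPY;
        #    a real MERGE would match on a key column)
        cur.execute(
            f"COPY INTO {self._schema}.{table} "
            f"FROM @{self._schema}.{stage} FILE_FORMAT = (TYPE = 'JSON')"
        )

    def load_input(self, context):
        # This sketch hands downstream assets the table name, not its contents
        table = context.upstream_output.asset_key.path[-1]
        return f"{self._schema}.{table}"
```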