# integration-snowflake
b
When using `dagster-snowflake-pandas`, when an `@asset` function returns a DataFrame, does Dagster download the data into a DataFrame and hold it in memory? If yes, what mechanism does it use to download and build the pandas DataFrame? Would it be possible to convert it into a Polars DataFrame instead?
j
Yes, Dagster downloads the data into a DataFrame in memory. We use the `pd.read_sql` method to do this (here’s the line in the source code). If you wanted to download it as a Polars DataFrame, you could write a new type handler for the Snowflake I/O manager. This process isn’t documented right now (it’s a pretty new internal feature, and writing docs for it is on my to-do list), but I’m happy to walk you through the process if that’s something you’re interested in.
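To illustrate the mechanism described here: `pd.read_sql` issues the query over a DB-API or SQLAlchemy connection and materializes the entire result set as an in-memory pandas DataFrame. A minimal, self-contained sketch, using an in-memory SQLite database as a stand-in for Snowflake (the table and data are made up for illustration):

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database standing in for Snowflake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Same style of call the Snowflake I/O manager makes against its own
# connection: the whole result set is pulled into memory as a DataFrame.
df = pd.read_sql("SELECT * FROM orders", conn)
print(df.shape)  # (2, 2)
```

Because the full table slice ends up in memory, very large tables may need partitioning or a different loading strategy.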
b
Thanks for the information @jamie. Yes, I am interested in creating such a type handler for Polars. I’m guessing the process involves creating a subclass of `DbTypeHandler`, like https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster[…]andas/dagster_snowflake_pandas/snowflake_pandas_type_handler.py, and “registering” it someplace to have it injected during load/unload?
j
Yep! That’s exactly it. Here are a couple of other examples of type handlers (for DuckDB): pandas and PySpark. Once you write the type handler, you “register” it with the Snowflake I/O manager using the `build_snowflake_io_manager` function, like this: https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster[…]andas/dagster_snowflake_pandas/snowflake_pandas_type_handler.py
Basically, the flow of control goes like this: an asset/op returns an output, then the Snowflake I/O manager figures out what part of the table needs to be deleted (i.e., if you are re-materializing a partition, we delete the old data so that it gets replaced with the new data). Then the `handle_output` method of the type handler is called, and that method is responsible for actually uploading the output to Snowflake.
b
And I see that Dagster first uses the `PUT` Snowflake command to upload the Parquet file, and then the `COPY` Snowflake command to create/upsert table entries. Is it also doing the reverse, `UNLOAD`/`GET`, to download the data?
j
Can you link to the lines where you’re seeing `PUT` commands? Just want to make sure we’re looking at the same thing.
j
Ah, so that’s a method on the Snowflake resource that you can call from the body of an op/asset if you want. It isn’t used by the I/O manager at all.
This is the I/O manager file. It does create a `SnowflakeConnection` object (that’s what’s in `resources.py`), but that’s only used to open a connection to Snowflake; all of the SQL commands are in the I/O manager file.
It’s admittedly a bit confusing that there’s a separate I/O manager and resource right now. We should probably consolidate them into one resource that can also be an I/O manager.