# integration-snowflake
a
Hi! My understanding is that all the new cool stuff like partitions goes through the IOManager, which (again, per my understanding) converts data back and forth between Snowflake and pandas. For large data, you probably want to run the processing directly in Snowflake instead. What's the best practice in that case? How can you still have partitioning etc.?
v
If you wanna keep the processing solely within Snowflake, you can write a custom IO Manager that doesn't load anything into memory, or return `None`. The asset could simply be a query that gets executed remotely (as is the case in dbt assets/models). Partitions are passed to the asset function through the `context` variable (check https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/execution/context/compute.py#L84) so you can refer to them in your queries.
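For illustration, a daily-partitioned asset could template the partition key from the context straight into the SQL it hands off; a minimal sketch, where `daily_events` and `raw_events` are just placeholder names:
```python
from dagster import DailyPartitionsDefinition, asset

@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def daily_events(context):
    # The partition being materialised is available on the context,
    # so it can be interpolated into the query that runs in Snowflake.
    return f"""
        SELECT * FROM raw_events
        WHERE event_date = '{context.partition_key}'
    """
```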
a
So a good abstraction could be to have asset functions that return the query itself? And everything else is handled by the IOManager?
```python
from dagster import asset

@asset(io_manager_key="snowflake_pushdown")
def some_snowflake_table():
    # Return only the SQL; the IO manager executes it in Snowflake.
    query = """SELECT * FROM some_table INNER JOIN some_other_table"""
    return query
```
The IOManager would take care of materialising the query into a table in Snowflake so the user does not have to do it. Potentially, it could also add some default metadata (or leave that configurable by the user).
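A minimal sketch of what such a pushdown IO manager could look like, assuming the snowflake-connector-python client and a naive table-per-asset (and table-per-partition) naming convention; the class name, config schema, and naming scheme are all illustrative, not an established pattern:
```python
import snowflake.connector
from dagster import IOManager, io_manager


class SnowflakePushdownIOManager(IOManager):
    """Runs the SQL returned by an asset inside Snowflake; nothing is loaded into memory."""

    def __init__(self, connection_kwargs):
        self._connection_kwargs = connection_kwargs

    def handle_output(self, context, obj):
        # `obj` is the SQL string returned by the asset function.
        table = context.asset_key.path[-1]
        if context.has_partition_key:
            # Illustrative convention: one table per partition.
            table = f"{table}_{context.partition_key.replace('-', '_')}"
        conn = snowflake.connector.connect(**self._connection_kwargs)
        try:
            conn.cursor().execute(f"CREATE OR REPLACE TABLE {table} AS {obj}")
        finally:
            conn.close()

    def load_input(self, context):
        # Downstream assets only need the table name, not the data itself.
        return context.upstream_output.asset_key.path[-1]


@io_manager(config_schema={"account": str, "user": str, "password": str,
                           "database": str, "schema": str, "warehouse": str})
def snowflake_pushdown(init_context):
    return SnowflakePushdownIOManager(init_context.resource_config)
```
Default metadata (e.g. the target table name) could then be attached from `handle_output` via `context.add_output_metadata`.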
v
That would be one way to do it, yes
a
Thanks! We usually use dbt for these kinds of pushdown ELT jobs, but there are a lot of cases where having full control of the query can be useful!