# integration-snowflake
a
Hi! My understanding is that all the new cool stuff like partitions goes through the IOManager, which (again, per my understanding) converts data back and forth between Snowflake and pandas. For large data, you probably want to run the processing directly in Snowflake instead. What's the best practice in that case? How can you still have partitioning etc.?
v
If you wanna keep the processing solely within Snowflake, you can write a custom IO Manager that doesn't load anything into memory, or return `None`. The asset could simply be a query that gets executed remotely (as is the case in dbt assets/models). Partitions are passed to the asset function through the `context` variable (check https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/execution/context/compute.py#L84) so you can refer to them in your queries.
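For illustration, a daily-partitioned asset could template the partition key from the context straight into the SQL it hands off; a minimal sketch, where `daily_events` and `raw_events` are just placeholder names:
```python
from dagster import DailyPartitionsDefinition, asset

@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def daily_events(context):
    # The partition being materialised is available on the context,
    # so it can be interpolated into the query that runs in Snowflake.
    return f"""
        SELECT * FROM raw_events
        WHERE event_date = '{context.partition_key}'
    """
```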
a
So a good abstraction could be to have asset functions that return the query itself? And everything else is handled by the IOManager?
```python
from dagster import asset

@asset(io_manager_key="snowflake_pushdown")
def some_snowflake_table():
    # Return only the SQL; the IO manager executes it in Snowflake.
    query = """SELECT * FROM some_table INNER JOIN some_other_table"""
    return query
```
The IOManager would take care of materialising the query into a table in Snowflake so the user does not have to do it. Potentially, it could also add some default metadata (or leave that configurable by the user).
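A minimal sketch of what such a pushdown IO manager could look like, assuming the snowflake-connector-python client and a naive table-per-asset (and table-per-partition) naming convention; the class name, config schema, and naming scheme are all illustrative, not an established pattern:
```python
import snowflake.connector
from dagster import IOManager, io_manager


class SnowflakePushdownIOManager(IOManager):
    """Runs the SQL returned by an asset inside Snowflake; nothing is loaded into memory."""

    def __init__(self, connection_kwargs):
        self._connection_kwargs = connection_kwargs

    def handle_output(self, context, obj):
        # `obj` is the SQL string returned by the asset function.
        table = context.asset_key.path[-1]
        if context.has_partition_key:
            # Illustrative convention: one table per partition.
            table = f"{table}_{context.partition_key.replace('-', '_')}"
        conn = snowflake.connector.connect(**self._connection_kwargs)
        try:
            conn.cursor().execute(f"CREATE OR REPLACE TABLE {table} AS {obj}")
        finally:
            conn.close()

    def load_input(self, context):
        # Downstream assets only need the table name, not the data itself.
        return context.upstream_output.asset_key.path[-1]


@io_manager(config_schema={"account": str, "user": str, "password": str,
                           "database": str, "schema": str, "warehouse": str})
def snowflake_pushdown(init_context):
    return SnowflakePushdownIOManager(init_context.resource_config)
```
Default metadata (e.g. the target table name) could then be attached from `handle_output` via `context.add_output_metadata`.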
v
That would be one way to do it, yes
a
Thanks! We usually use dbt for these kinds of pushdown ELT jobs, but there are a lot of cases where having full control of the query can be useful!