Dipesh Kumar (06/13/2023, 9:07 PM)

…`partition_expr`) to the IO Manager.
I am re-using the `snowflake_io_manager` from the Fully Featured Project as a base. This IO manager is working great and I am even able to use the partitioning. I just need guidance on how to pass the value for `partition_expr` to the IO manager, so that I can use it in `_time_window_where_clause`. The `context` has the partition keys or the time window, but I am not able to find the `partition_expr` value.

From reviewing other IO managers (DuckDB, Snowflake, etc.) I understand that `TableSlice` is used in the IO managers to pass `partition_expr` to the `_time_window_where_clause` function, but I don't fully understand how to use this correctly in my own IO manager.
Any help will be appreciated.

jamie (06/14/2023, 4:54 PM)

Do you mean how to set `partition_expr` on your asset, or how to retrieve the value in the IO manager itself? For the first one, you do it like this:
```python
from dagster import asset

@asset(
    metadata={"partition_expr": "MY_COLUMN"}
)
def my_asset():
    ...
```
Dipesh Kumar (06/14/2023, 11:48 PM)

How to retrieve the `partition_expr` value in the IO manager.

jamie (06/15/2023, 4:00 PM)

In the `handle_output` function, the `context` passed by Dagster is an `OutputContext`, but in the `load_input` function the `context` is an `InputContext`. In that case you will need to get the output context using `context.upstream_output`.
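That lookup can be sketched with plain stand-in classes in place of dagster's real `OutputContext`/`InputContext` (only the `metadata` and `upstream_output` attribute names match what jamie describes; everything else here is illustrative):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-ins for dagster's OutputContext / InputContext, just to show the lookup.
@dataclass
class FakeOutputContext:
    metadata: Optional[dict] = None

@dataclass
class FakeInputContext:
    upstream_output: Optional[FakeOutputContext] = None

def get_partition_expr(context) -> Optional[str]:
    """Resolve partition_expr from either kind of context.

    In load_input we only have an InputContext, so hop to the upstream
    output context first; in handle_output the context itself has metadata.
    """
    output_context = getattr(context, "upstream_output", None) or context
    metadata = output_context.metadata or {}
    return metadata.get("partition_expr")

out_ctx = FakeOutputContext(metadata={"partition_expr": "MY_COLUMN"})
in_ctx = FakeInputContext(upstream_output=out_ctx)

print(get_partition_expr(out_ctx))  # handle_output path -> MY_COLUMN
print(get_partition_expr(in_ctx))   # load_input path -> MY_COLUMN
```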
Dipesh Kumar (06/15/2023, 4:14 PM)

In `load_input` I can get the metadata value from the output context:
```python
output_context_metadata = output_context.metadata or {}
partition_expr = output_context_metadata.get("partition_expr")
```
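With `partition_expr` in hand, the remaining piece of the original question is the time-window filter. A hand-rolled sketch of the kind of clause `_time_window_where_clause` produces (the function name comes from the thread; the exact SQL the real Snowflake IO manager emits is an assumption, and a real implementation should use bound parameters rather than string interpolation):

```python
from datetime import datetime

def time_window_where_clause(partition_expr: str, start: datetime, end: datetime) -> str:
    """Select rows whose partition column falls in [start, end).

    Illustrative sketch of what dagster's database IO managers do with a
    TableSlice's partition info; not the actual implementation.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"WHERE {partition_expr} >= '{start.strftime(fmt)}'"
        f" AND {partition_expr} < '{end.strftime(fmt)}'"
    )

clause = time_window_where_clause(
    "MY_COLUMN", datetime(2023, 5, 6), datetime(2023, 5, 7)
)
print(clause)
# WHERE MY_COLUMN >= '2023-05-06 00:00:00' AND MY_COLUMN < '2023-05-07 00:00:00'
```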
I can now make my IO manager a bit more generic.

jamie (06/15/2023, 4:15 PM)

martin o leary (08/16/2023, 9:23 AM)

…`asset_key` (i.e. it is an op output), there is no real logic there for resolving partitions. Could you explain why that is, I wonder? For context, I'm building a single IO manager to handle op IO as well as asset IO.

jamie (08/16/2023, 1:47 PM)

…the first run would store at `1/my_op/result` and the second run would store at `2/my_op/result`. For assets, the same location is overwritten on each run, and that has implications for partitions. For an op you might end up with all of these files:

- `1/my_op/result_2023_05_06`
- `1/my_op/result_2023_05_07`
- `2/my_op/result_2023_05_06`

but for an asset you would only have one file per partition per asset.
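That op-vs-asset difference can be sketched as a path-resolution rule; the layout strings here are illustrative, not what dagster's filesystem IO manager literally produces:

```python
from typing import Optional

def storage_path(name: str, run_id: str, partition: Optional[str], is_asset: bool) -> str:
    """Op outputs are namespaced by run; asset outputs share one location."""
    suffix = f"_{partition}" if partition else ""
    if is_asset:
        # One location per asset (and per partition), overwritten across runs.
        return f"{name}/result{suffix}"
    # One location per run, so re-executions never collide.
    return f"{run_id}/{name}/result{suffix}"

print(storage_path("my_op", "1", "2023_05_06", is_asset=False))    # 1/my_op/result_2023_05_06
print(storage_path("my_op", "2", "2023_05_06", is_asset=False))    # 2/my_op/result_2023_05_06
print(storage_path("my_asset", "2", "2023_05_06", is_asset=True))  # my_asset/result_2023_05_06
```

The op paths accumulate one file per run per partition, while the asset path is the same regardless of run, which is exactly the mismatch jamie describes when mapping this onto database tables.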
Trying to then translate that to a database gets tricky: do we make a new schema for each run and then add partitions to the table, as we would in an asset-centric world? Do we make a new table for each partition? Do we adopt the asset approach entirely and just keep one copy of the data?

It's tough to decide which of these approaches is the "correct" one, and from what I can tell, most users of the database IO managers are using assets, so we haven't prioritized filling this gap.

martin o leary (08/17/2023, 8:26 AM)

jamie (08/17/2023, 1:46 PM)