< Scout> I have a partitioned asset which saves partitioned dagster #ask-ai

<@U0667DNC02Y> I have a partitioned asset which sa...

Gil Tinde

02/14/2024, 12:30 PM

@Scout I have a partitioned asset which saves partitioned parquet files using a UPathIOManager i/o manager. In a downstream asset I use this partitioned asset but I want to open all parquet files into the same dask dataframe. Normally, I would do this with dd.read_parquet(path + "*.parquet"). Is there a way to achieve something similar with the UPathIOManager, by maybe overriding a method in a custom UPathIOManager i/o manager?

Scout

02/14/2024, 12:30 PM

Based on the documentation, the

UPathIOManager

is designed to handle partitioned assets, and you can customize its behavior by implementing the

dump_to_path

and

load_from_path

methods for a given

universal_pathlib.UPath

. Here is an example provided in the documentation for a custom

UPathIOManager

that works with pandas and parquet files:

Copy code

python import pandas as pd from upath import UPath  from dagster import (     InputContext,     OutputContext,     UPathIOManager, )  class PandasParquetIOManager(UPathIOManager):     extension: str = ".parquet"      def dump_to_path(self, context: OutputContext, obj: pd.DataFrame, path: UPath):         with path.open("wb") as file:             obj.to_parquet(file)      def load_from_path(self, context: InputContext, path: UPath) -> pd.DataFrame:         with path.open("rb") as file:             return pd.read_parquet(file)

To achieve the behavior you're looking for with Dask, you would need to create a similar custom

UPathIOManager

that overrides the

load_from_path

method to use Dask's

read_parquet

function. This method should return a Dask DataFrame that represents all the parquet files for the partitions you want to load. Here's a conceptual example of how you might implement this:

Copy code

python import dask.dataframe as dd from upath import UPath  from dagster import (     InputContext,     OutputContext,     UPathIOManager, )  class DaskParquetIOManager(UPathIOManager):     extension: str = ".parquet"      def dump_to_path(self, context: OutputContext, obj: dd.DataFrame, path: UPath):         # Assuming obj is a Dask DataFrame         obj.to_parquet(path)      def load_from_path(self, context: InputContext, path: UPath) -> dd.DataFrame:         # Construct the path with a wildcard to match all partition files         partitioned_path = str(path) + "*.parquet"         # Use Dask to read all parquet files into a single Dask DataFrame         return dd.read_parquet(partitioned_path)

Please note that this is a conceptual example and may require adjustments based on your specific setup and requirements. The actual implementation may vary depending on how your partitions are structured and how you want to handle them in your downstream assets.

Gil Tinde

02/14/2024, 12:38 PM

@Scout Doing that yields this error: dagster._check.CheckError: Failure condition: Loading an input that corresponds to multiple partitions, but the type annotation on the op input is not a dict, Dict, Mapping, or Any: is 'class 'dask.dataframe.core.DataFrame''. This is because the load_input method of the base UPathIOManager class checks that context.dagster_type.typing_type is a dictionary, but in my case I don't want it to be a dictionary mapping from the partition to the dataframe, I want it to simply be the dataframe of all the partitions.

Scout

02/14/2024, 12:38 PM

There was an error processing your request. Please try again.

Scout

02/14/2024, 12:39 PM

There was an error processing your request. Please try again.

Scout

02/14/2024, 12:40 PM

There was an error processing your request. Please try again.

Scout

02/14/2024, 12:48 PM

There was an error processing your request. Please try again.

Scout

02/14/2024, 1:00 PM

The error you're encountering indicates that Dagster is expecting a dictionary type when loading an input that corresponds to multiple partitions, as per the default behavior of the

UPathIOManager

. This is a known issue that has been discussed in the Dagster community, and it reflects a limitation in the current implementation of the

UPathIOManager

when dealing with multiple partitions. From the documentation and the issues raised in the community, it's clear that the

UPathIOManager

has certain expectations about the types used when loading inputs corresponding to multiple partitions. Specifically, it expects a dictionary mapping partition keys to data when multiple partitions are involved. To work around this limitation, you might consider one of the following approaches: 1. Custom IOManager: Create a custom IOManager that overrides the

load_input

method to bypass the type check and directly return a Dask DataFrame. However, this approach might require a deeper understanding of the Dagster internals and careful handling to ensure compatibility with the rest of the Dagster system. 2. Load Partitions Individually: Instead of trying to load all partitions at once, you could load them individually and then concatenate them into a single Dask DataFrame within your asset computation logic. This would involve iterating over the partition keys and loading each partition separately. 3. Wait for a Fix or Feature Update: Given that this is a known issue, it's possible that a future update to Dagster might provide a more flexible way to handle this scenario. You could monitor the issue for updates or contribute to the discussion to advocate for a resolution. Unfortunately, without the ability to modify the core behavior of the

UPathIOManager

or without a feature update from Dagster that addresses this limitation, there isn't a straightforward solution that aligns with the current implementation. If this is a critical feature for your workflow, you might need to explore alternative data management strategies or consider contributing to Dagster to help introduce the desired functionality.

32 Views

Open in Slack

Previous Next