# community-showcase
t
Huuuge shoutout to @Daniel Gafni for open-sourcing and deploying dagster-polars! 🐻‍❄️ It exposes a PolarsParquetIOManager and a PolarsIOManager base class for further extension. He's been using it himself for a while now and graciously open-sourced it to the community. Thank you for your contribution to the Dagster community.
🙏 2
❤️ 2
keanu thanks 3
🐻‍❄️ 5
🌈 2
🎉 23
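For context, wiring the announced PolarsParquetIOManager into a Dagster asset might look roughly like the sketch below. The import path, the base_dir parameter name, and the resource wiring are assumptions about the 0.0.x API rather than confirmed details; check the dagster-polars README for the actual interface.
```python
import polars as pl
from dagster import Definitions, asset
from dagster_polars import PolarsParquetIOManager  # import path assumed from the package name


@asset
def users() -> pl.DataFrame:
    # Returning a Polars DataFrame lets the IO manager handle serialization to Parquet.
    return pl.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "c"]})


@asset
def user_count(users: pl.DataFrame) -> pl.DataFrame:
    # Downstream assets receive the upstream DataFrame loaded back from Parquet.
    return pl.DataFrame({"n_users": [users.height]})


defs = Definitions(
    assets=[users, user_count],
    resources={
        # base_dir is an assumed parameter name for where the Parquet files land.
        "io_manager": PolarsParquetIOManager(base_dir="/tmp/dagster-storage"),
    },
)
```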
j
Can make a Polars Delta IO manager out of this.
d
Thanks, that's my next plan.
j
Mind posting a showcase when you do? :) Noticed the Polars blocker; I'll follow that too.
m
Ooh, this is exactly what I needed! I've been rather miffed about the lack of Polars support in Dagster, and having to go back to using Pandas drives me crazy.
🙂 1
d
@Jordan Fox I found a workaround and added PolarsDeltaIOManager (with native deltalake partitioning support!) https://github.com/danielgafni/dagster-polars/releases/tag/v0.0.4
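Using the new PolarsDeltaIOManager from an asset might look roughly like this; the base_dir parameter, the resource key, and the import path are assumptions about the 0.0.4 API rather than details from the release, so treat this as a sketch.
```python
import polars as pl
from dagster import Definitions, asset
from dagster_polars import PolarsDeltaIOManager  # import path assumed from the package name


@asset(io_manager_key="polars_delta_io_manager")
def counters() -> pl.DataFrame:
    # The IO manager persists the returned DataFrame as a Delta table.
    return pl.DataFrame(
        {
            "date": ["2023-06-01", "2023-06-02"],
            "counter_name": ["clicks", "clicks"],
            "counter_value": [1.0, 2.0],
        }
    )


defs = Definitions(
    assets=[counters],
    resources={
        # base_dir is an assumed parameter name for the Delta table storage root.
        "polars_delta_io_manager": PolarsDeltaIOManager(base_dir="/tmp/dagster-storage"),
    },
)
```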
j
Woot! Testing today
👌 1
b
Also tested this today! Overall the delta IO manager is working great, but I'm hitting the following error when trying to run a compaction job on a basic schema. Here's the pyarrow schema:
```python
from deltalake import DeltaTable

dt = DeltaTable("/path/to/asset.delta")
dt.pyarrow_schema()
# date: string
# counter_name: string
# counter_value: double
```
And the following error:
```
_internal.DeltaError: Data does not match the schema or partitions of the table: Unexpected Arrow schema: got: Field { name: "counter_name", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "counter_value", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, expected: Field { name: "counter_name", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "counter_value", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
```
It looks like there's an implicit conversion from pl.Utf8() -> pyarrow.large_string() happening somewhere, but I'm not sure where. Ah, looks like polars always uses the LargeUtf8 data type, which isn't supported by delta-rs yet (ref).
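A quick local check makes the mismatch visible, and casting the Arrow table back down to regular string is one possible workaround before the Delta write; this is just an illustration, not something the IO manager does itself.
```python
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"counter_name": ["foo"], "counter_value": [1.0]})

# Polars converts Utf8 columns to Arrow large_string (LargeUtf8).
table = df.to_arrow()
print(table.schema.field("counter_name").type)  # large_string

# Possible workaround: cast large_string fields down to string
# so the schema matches what delta-rs expects.
target = pa.schema(
    [
        pa.field(f.name, pa.string()) if pa.types.is_large_string(f.type) else f
        for f in table.schema
    ]
)
table = table.cast(target)
print(table.schema.field("counter_name").type)  # string
```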
d
Yes, there are some issues around DataType conversion. For example, UInt is cast to Int. Some DataTypes are not supported at all. Apparently DeltaLake doesn't support the same types as Parquet.
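One way to sidestep the unsigned integer issue on the caller's side is to cast those columns before the frame reaches the writer; the snippet below is a sketch of that idea, not part of dagster-polars.
```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3]}, schema={"id": pl.UInt32})

# Cast any unsigned integer columns to signed ints before handing the frame
# to the Delta writer, so delta-rs sees types it supports.
unsigned = [pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64]
df = df.with_columns(
    [pl.col(name).cast(pl.Int64) for name, dtype in df.schema.items() if dtype in unsigned]
)
print(df.schema)  # id is now Int64
```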
j
Thanks for this. Any opinion on whether it's worth adding an S3ParquetIOManager that a) writes to S3 as Parquet and b) reads from S3 using s3fs to support predicate pushdown and column partitioning? This seems less useful given that the duckdb-polars-io-manager can be subclassed to just replace the table with a parquet path. I'm happy to contribute this to this package if either seems useful. Related post on the topic here.
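The read side of that idea could be sketched with s3fs plus a pyarrow dataset and a lazy Polars scan, which is what gives you predicate pushdown and column projection; the bucket and prefix below are hypothetical, and a real IO manager would derive the path from the asset key.
```python
import polars as pl
import pyarrow.dataset as ds
import s3fs

# Hypothetical bucket/prefix holding hive-partitioned Parquet files.
fs = s3fs.S3FileSystem()
dataset = ds.dataset(
    "my-bucket/warehouse/my_asset",
    filesystem=fs,
    format="parquet",
    partitioning="hive",
)

# Lazily scan the dataset; filters and column selections are pushed down
# to pyarrow instead of loading everything into memory first.
df = (
    pl.scan_pyarrow_dataset(dataset)
    .filter(pl.col("date") == "2023-06-01")
    .select(["counter_name", "counter_value"])
    .collect()
)
```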
d
Hey, it should be possible. I already have this feature in the deltalake IOManager.