# random
c
It's been interesting seeing how things play out for Apache Arrow with Python + Rust. Polars is a new one (new to me at least) that I'm keeping an eye on, probably for use side by side with DuckDB and for offloading the heavy lifting from pandas. It uses ConnectorX (Rust) for DB I/O, and Polars itself is written in Rust too. I never quite got the hang of Flight services, a bit over my head, but adding Arrow/Parquet as supported content types in existing HTTP services was pretty straightforward. It reminds me of Spark in some ways, but vectorized - Sandy would like this one I think. Generally, when it comes to decoupling business logic, unit tests, etc., I'm thinking of Arrow as the interop format to bring things into and work from. https://www.pola.rs/ Better Parquet reading vs pandas: https://pola-rs.github.io/polars-book/user-guide/howcani/io/parquet.html Podcast + video: https://talkpython.fm/episodes/show/402/polars-a-lightning-fast-dataframe-for-python
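For anyone curious what "adding Arrow/Parquet as supported content types" might look like, here is a rough, stdlib-only sketch of the content negotiation step. It is not the actual service code: the names `negotiate`, `parse_accept`, and `SUPPORTED` are made up for illustration, the media type strings are assumptions to verify against what your clients send, and the real serialization (e.g. with pyarrow) is left out. It also ignores partial wildcards such as `application/*`.

```python
from __future__ import annotations

from typing import Optional

# Media type strings are assumptions for this sketch; verify before relying on them.
ARROW_STREAM = "application/vnd.apache.arrow.stream"
PARQUET = "application/vnd.apache.parquet"
JSON = "application/json"

# Hypothetical server-side preference order.
SUPPORTED = (ARROW_STREAM, PARQUET, JSON)


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    items = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1.0 when the client omits it
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        items.append((media_type, q))
    # sorted() is stable, so ties keep the client's original order
    return sorted(items, key=lambda item: item[1], reverse=True)


def negotiate(header: str, supported=SUPPORTED, default: str = JSON) -> Optional[str]:
    """Pick a response content type, or None if nothing matches (i.e. HTTP 406)."""
    if not header.strip():
        return default  # no Accept header: keep the service's existing behavior
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return supported[0]
        if media_type in supported:
            return media_type
    return None
```

The handler would then branch on the returned type and write the same table out as an Arrow IPC stream, a Parquet file, or JSON, so existing JSON clients keep working unchanged.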
a
Cool thing, actually. I thought about switching to it from PySpark (luckily, we don't have too much code in our Dagster pipelines yet), but put it off again because of the fresh Spark 3.4.0 release with its promising new Spark Connect feature for PySpark, which could make using PySpark smoother, particularly with Dagster. But for sure, Polars should replace pandas (though pandas has also become more performance- and memory-friendly since its 2.0 release).
Also, I'm not sure that polars + pyarrow.parquet + s3fs would be a good choice if someone (like us) stores the data in S3. But Spark's support for S3 isn't very good either, so I would wait for some sort of remote storage implementation on the side of the .rs libraries (or maybe I'll write a custom IOManager that writes local Parquet files and uploads them to S3 via rclone).
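A very rough sketch of the "write locally, then upload via rclone" idea, just to show the shape of it. Everything here is hypothetical: the function names, the remote name, and the bucket/key layout are made up, writing the Parquet file itself and wiring this into a Dagster IOManager are left out, and it assumes an rclone remote has already been configured for S3.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_rclone_cmd(local_path: Path, remote: str, bucket: str, key: str) -> list[str]:
    """Build an `rclone copyto` command that uploads one file to remote:bucket/key."""
    return ["rclone", "copyto", str(local_path), f"{remote}:{bucket}/{key}"]


def upload_parquet(
    local_path: Path,
    remote: str,
    bucket: str,
    key: str,
    run=subprocess.run,  # injectable so the upload can be faked in unit tests
) -> list[str]:
    """Upload an already-written local Parquet file and return the command used."""
    cmd = build_rclone_cmd(local_path, remote, bucket, key)
    # check=True makes a failed upload raise instead of passing silently
    run(cmd, check=True)
    return cmd
```

A custom IOManager would call something like `upload_parquet` at the end of its output handling, after the DataFrame has been written to a local temp path.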
z
Still too early to use most likely, but check out https://github.com/awslabs/mountpoint-s3
There are some other S3 FUSE file systems as well, but I haven't had great experiences getting them installed and working :S
a
I meant the other s3fs, not the FUSE implementation, but this one - https://s3fs.readthedocs.io/en/latest/ which implements a Pythonic filesystem interface 🙂. I'm not sure writing through FUSE could be as effective as working with S3 directly. S3 is too different from a POSIX FS... The Pythonic fsspec is also a layer of abstraction, though.
That's why it would be a good idea to wait for a storage abstraction and S3 implementation on the Rust side of Polars, or in another framework such as DataFusion, or something like pyarrow.fs.S3FileSystem (whose status I don't understand; I could be wrong, but AFAIR they started a couple of different implementations of such interfaces and declared each of them deprecated at some point).
z
Interesting point. IMO, I’d expect anything that spends too much time in Python land to fall far behind because of the GIL. Arrow + C++/Rust bindings could of course change that.