# community-showcase
I've been leisurely working on an integration between Dagster and Trino: https://github.com/andreapiso/dagster-trino
I think Trino and Dagster are a match made in heaven: thanks to its large number of native connectors, Trino can turn all of your data assets into Dagster assets with a single integration, no matter whether these assets live on Snowflake, BigQuery, object storage/HDFS, MongoDB, etc.
This is what the library supports at the moment:
• Basic functionality is essentially 1:1 with dagster-snowflake. The initial cut of this library was basically made by taking dagster-snowflake and rewiring it to use Trino instead.
• A type handler that lets the user push down storage and compute of assets to Trino without taking the data out, passing a reference to the Trino table instead. (The Dagster system is really amazing and the library uses it heavily, super thumbs up for the person who had the idea!)
• A set of type handlers that allow "side-loading" Trino data: when using a Trino catalog with a Hive metastore, Dagster can automatically find the underlying Parquet data and load it directly from S3/GCS/HDFS/etc., which is a lot faster than using the Trino client, especially for larger data (I have a benchmark in the example folder showing a 10x faster read even for a small-ish 300MB dataframe).
• The type handlers are composable, so a Parquet file type handler is used to build an Arrow Table handler, which is used to build a Pandas handler... This makes it very simple to build custom type handlers. In the example folder I have an example showing a custom Polars type handler in just a couple of lines of code (just converting from/to Arrow instead of going all the way to Trino).
• I adapted the Dagster dbt jaffle shop example to work with dagster-trino instead of DuckDB. Next I plan to add an example showing the use of Ibis (a Python dataframe library that can be used with Trino) and distributed systems such as Spark/Dask/Ray with distributed reads on Trino data using the Parquet type handler.
The library is still rough around the edges, but if anyone here is interested in having a look and giving me some feedback, it would make my day!
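To make the "composable type handlers" idea a bit more concrete, here is a minimal pure-Python sketch of handlers layered on top of each other. Every name in it (`RowsHandler`, `ConvertingHandler`, `Frame`, the helper functions) is hypothetical and is not the dagster-trino API; a dict stands in for storage, and plain Python structures stand in for Parquet files, Arrow tables and dataframes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]
Columns = Dict[str, List[Any]]


class RowsHandler:
    """Bottom layer (hypothetical): stores rows per table in a dict."""

    def __init__(self) -> None:
        self.storage: Dict[str, Rows] = {}

    def store(self, table: str, obj: Rows) -> None:
        self.storage[table] = list(obj)

    def load(self, table: str) -> Rows:
        return list(self.storage[table])


class ConvertingHandler:
    """Wraps another handler and converts values on the way in and out."""

    def __init__(
        self,
        inner: Any,
        to_inner: Callable[[Any], Any],
        from_inner: Callable[[Any], Any],
    ) -> None:
        self.inner = inner
        self.to_inner = to_inner
        self.from_inner = from_inner

    def store(self, table: str, obj: Any) -> None:
        self.inner.store(table, self.to_inner(obj))

    def load(self, table: str) -> Any:
        return self.from_inner(self.inner.load(table))


def rows_to_columns(rows: Rows) -> Columns:
    # Column names are taken from the first row; an empty input gives {}.
    return {k: [r[k] for r in rows] for k in rows[0]} if rows else {}


def columns_to_rows(cols: Columns) -> Rows:
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]


@dataclass
class Frame:
    """Stand-in for a dataframe type built on top of the columnar layer."""

    columns: Columns

    @property
    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()), []))


# Each layer is only a pair of conversions on top of the layer below it.
base = RowsHandler()
columnar = ConvertingHandler(base, columns_to_rows, rows_to_columns)
frames = ConvertingHandler(columnar, lambda f: f.columns, Frame)
```

With this shape, supporting a new in-memory type only means supplying two conversion functions against whichever existing layer is closest, which is the property the Polars example relies on.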
I know that @Tim Castillo already reached out to you separately, but this is really awesome @Andrea
Thanks @Dagster Jarred! There's still a lot of refactoring and work to do. For example, Dagster partitions still don't work properly when accessing Parquet files directly (the IOManager should enforce that the Parquet files are partitioned on the same key as the Dagster partition), and the automatic Trino<->Arrow type mapping still does not support complex types such as arrays, maps, etc. I thought it would still be good to share so that people know this exists!
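As a rough illustration of the partition check mentioned above, the sketch below selects the Parquet files for one partition and refuses files that are not partitioned on the expected key. It assumes Hive-style `key=value` directory names in the file paths; the function names and the example bucket are hypothetical, and this is not how dagster-trino currently behaves.

```python
from typing import List, Optional, Sequence
from urllib.parse import unquote


def partition_value(path: str, key: str) -> Optional[str]:
    """Return the value of a Hive-style `key=value` path segment, if present."""
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == key:
            return unquote(value)
    return None


def paths_for_partition(paths: Sequence[str], key: str, value: str) -> List[str]:
    """Keep only the files that belong to the requested partition.

    Raises ValueError if any file is not partitioned on `key`, since the
    file layout then cannot be matched against the requested partition.
    """
    missing = [p for p in paths if partition_value(p, key) is None]
    if missing:
        raise ValueError(f"files not partitioned on {key!r}: {missing}")
    return [p for p in paths if partition_value(p, key) == value]
```

For instance, with files under `ds=2023-01-01/` and `ds=2023-01-02/`, asking for key `ds` and value `2023-01-02` returns only the second day's files, while asking for a key that does not appear in the paths raises an error.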
Amazing stuff! A small feature request: consider loading polars LazyFrames, this may be very valuable
@Daniel Gafni thanks! The polars example shows how to implement a custom connector. Thanks to the way those type_handlers build on top of each other, it should be simple to have a custom type handler for lazy polars frames. The existing example uses the arrow type handler and converts arrow tables into polars. It could be adapted to instead use the parquet files type handler and load lazy frames as:
df = pl.scan_parquet(parquet_paths_from_type_handler)
Is Trino the one that was Presto? I haven't had the time to work out which is the "better" product, or whether they're so distinct that they're not really comparable.