Andrea
04/12/2023, 12:49 PM
TrinoQuery type handler that allows the user to push down storage and compute of assets to Trino without taking the data out, by passing a reference to the Trino table. (The Dagster DbTypeHandler system is really amazing, the library heavily uses it, super thumbs up for the person who had the idea!)
• A set of type handlers that allow you to "side-load" Trino data, i.e., when using a Trino catalog with a Hive metastore, Dagster can automatically find the underlying Parquet data and load it directly from S3/GCS/HDFS/etc., which is a lot faster than using the Trino client, especially for larger data (I have a benchmark in the example folder showing a 10x read speedup even for a small-ish 300MB dataframe).
• The type handlers are composable, so a Parquet file type handler is used to build an Arrow Table handler, which is used to build a Pandas handler... This makes it very simple to build custom type handlers. In the example folder I have an example showing a custom Polars type handler in just a couple of lines of code (just converting from/to Arrow instead of going all the way to Trino).
• I adapted the Dagster dbt jaffle shop example to work with dagster-trino instead of DuckDB. Next I plan to add an example showing the use of Ibis (a Python dataframe library that can be used with Trino) and distributed systems such as Spark/Dask/Ray, with distributed reads on Trino data using the Parquet type handler.
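To illustrate the composability point above, here is a minimal plain-Python sketch of the layering idea. It is not the dagster-trino API and does not use Dagster's DbTypeHandler interface: every class and method name below is hypothetical, a dict stands in for the catalog/metastore lookup, JSON files stand in for Parquet files, and lists of dicts stand in for Arrow tables. The only thing it tries to show is that each layer delegates to the one below it and adds a single conversion.

```python
import json
import tempfile
from pathlib import Path


class FilePathsHandler:
    """Lowest layer (hypothetical): maps a table name to the files backing it.

    The dict is a stand-in for a catalog/metastore lookup.
    """

    def __init__(self):
        self.catalog = {}

    def handle_output(self, table, paths):
        self.catalog[table] = list(paths)

    def load_input(self, table):
        return list(self.catalog[table])


class RowsHandler:
    """Middle layer (hypothetical): list-of-dict rows built on the file layer.

    JSON files stand in for Parquet, rows stand in for an Arrow table.
    """

    def __init__(self, file_handler, root):
        self.file_handler = file_handler
        self.root = Path(root)

    def handle_output(self, table, rows):
        path = self.root / f"{table}.json"
        path.write_text(json.dumps(rows))
        # Delegate registration of the written file to the layer below.
        self.file_handler.handle_output(table, [str(path)])

    def load_input(self, table):
        rows = []
        for path in self.file_handler.load_input(table):
            rows.extend(json.loads(Path(path).read_text()))
        return rows


class ColumnsHandler:
    """Top layer (hypothetical): dict-of-lists columns; only converts to/from rows."""

    def __init__(self, row_handler):
        self.row_handler = row_handler

    def handle_output(self, table, columns):
        names = list(columns)
        rows = [dict(zip(names, values)) for values in zip(*columns.values())]
        self.row_handler.handle_output(table, rows)

    def load_input(self, table):
        rows = self.row_handler.load_input(table)
        if not rows:
            return {}
        return {name: [row[name] for row in rows] for name in rows[0]}


with tempfile.TemporaryDirectory() as tmp:
    file_handler = FilePathsHandler()
    row_handler = RowsHandler(file_handler, tmp)
    column_handler = ColumnsHandler(row_handler)

    # Write through the top layer, then read the same table back at every layer.
    column_handler.handle_output("orders", {"id": [1, 2], "amount": [9.5, 3.0]})
    stored_paths = file_handler.load_input("orders")
    as_rows = row_handler.load_input("orders")
    as_columns = column_handler.load_input("orders")
```

In this sketch, adding another in-memory format only requires one more small class that wraps an existing layer and converts in both directions, which is the property the Polars example mentioned above relies on.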
The library is still rough around the edges, but if anyone here is interested in having a look and giving me some feedback, it would make my day!
Andrea
04/12/2023, 12:51 PM
Dagster Jarred
04/13/2023, 5:48 AM
Andrea
04/13/2023, 5:53 AM
Daniel Gafni
04/15/2023, 10:30 PM
Andrea
04/16/2023, 1:51 AM
df = pl.scan_parquet(parquet_paths_from_type_handler)
Harrison Conlin
04/19/2023, 1:23 AM