# ask-community
j
Hello, is there a reason s3 parquet io managers are not part of the "official" dagster-maintained io managers? e.g. do you wish to steer users towards s3 duckdb instead? I'm open to being "nudged" in the right direction ;) Aside: I believe the s3 parquet io manager example above can be improved using s3fs so it supports predicate pushdown and column pruning.
j
hey @John Smith there’s no particular reason. I believe we haven’t written an s3 parquet IO manager since there haven’t been user requests for it. We also don’t have a specific s3 duckdb io manager written either, fwiw (we do have a plain duckdb io manager). I think there was a custom one written for an example. In general, you should use the set of tools that you determine are best for your project! Based on the example you linked, you should be able to write your own s3 IO manager that meets your requirements!
j
Thanks, https://github.com/dagster-io/dagster/blob/master/examples/project_fully_featured/project_fully_featured/resources/parquet_io_manager.py as per above is written and owned by dagster, right? It's just not been given the same "official" status as the duckdb ones. Furthermore, I appreciate the duckdb io manager isn't specifically for S3, but duckdb supports S3 (almost) out of the box, hence there isn't a need for an S3-specific one.
j
yes, the s3 parquet io manager in the example is owned and maintained by dagster, but since it’s written as part of an example project, it isn’t part of the released `dagster` or `dagster-aws` libraries. it’s intended more as an example of how to write your own custom IO manager. re duckdb - that’s right. I think we had a blog post a couple months ago that talked about duckdb with s3, so i just wasn’t sure if that’s what you were referring to
j
Thanks @jamie, could you forward me the link to the post about duckdb on s3? Upon closer inspection, the duckdb io managers can only write duckdb tables instead of parquet. From what I've seen, e.g. MotherDuck, the convention is to store files on S3 as parquet instead of the duckdb database format, presumably because one would have to load the entire duckdb file back into at least local storage before being able to query it. Does that sound right?
j
as i suspected, both solutions eventually write the data as parquet files in S3 as opposed to duckdb files. is there a reason dagster doesn't officially maintain a `duckdb-s3-parquet-io-manager` along with the other `duckdb-io-managers`, like the ones @sandy & Pete shared in the articles above?
s
the only reason is that we haven't gotten around to prioritizing it, not that we think it's a bad pattern