# ask-community
j
Hello, is there a reason s3 parquet io managers are not part of the "official" dagster-maintained io managers? e.g. do you wish to steer users towards s3 duckdb instead? I'm open to being "nudged" in the right direction ;) Aside: I believe the s3 parquet io manager example above can be improved using s3fs so it supports predicate pushdown and column pruning.
j
hey @John Smith there’s no particular reason. I believe we haven’t written an s3 parquet IO manager since there haven’t been user requests for it. We also don’t have a specific s3 duckdb io manager written either, fwiw (we do have a plain duckdb io manager). I think there was a custom one written for an example. In general, you should use the set of tools that you determine are best for your project! Based on the example you linked, you should be able to write your own s3 IO manager that meets your requirements!
j
Thanks, https://github.com/dagster-io/dagster/blob/master/examples/project_fully_featured/project_fully_featured/resources/parquet_io_manager.py as per above is written and owned by dagster, right? It's just not been given the same "official" status as the duckdb ones. Furthermore, I appreciate the duckdb io manager isn't specifically for S3, but duckdb supports S3 (almost) out of the box, hence there isn't a need for an S3-specific one.
j
yes, the s3 parquet io manager in the example is owned and maintained by dagster, but since it’s written as part of an example project, it isn’t part of the released `dagster` or `dagster-aws` libraries. it’s intended more as an example of how to write your own custom IO manager. re duckdb - that’s right. I think we had a blog post a couple months ago that talked about duckdb with s3, so i just wasn’t sure if that’s what you were referring to
j
Thanks @jamie, could you forward me the link to the post about duckdb on s3? Upon closer inspection, the duckdb io managers can only write duckdb tables instead of parquet. From what I've seen, e.g. MotherDuck, the convention is to store files on S3 as parquet instead of the duckdb database format, presumably because one would have to load the entire duckdb file back into at least local storage before being able to query it. Does that sound right?
j
as i suspected, both solutions eventually write the data as parquet files in S3 as opposed to duckdb files. is there a reason dagster doesn't officially maintain a `duckdb-s3-parquet-io-manager` along with the other `duckdb-io-managers`, like the ones @sandy & Pete shared in the articles above?
s
the only reason is that we haven't gotten around to prioritizing it, not that we think it's a bad pattern