#announcements

Andy H

11/05/2020, 12:17 AM
Is there a comprehensive list of available intermediate storage definitions? I am building a distributed pipeline, and the s3 intermediate from dagster_aws won't be a valid option in production.

yuhan

11/05/2020, 12:36 AM
Hi Andy, what is your use case?

Andy H

11/05/2020, 12:37 AM
We are pushing pipelines through dask as an intermediary for a slurm cluster. In order to push using distributed, I have to supply a ModeDefinition that provides an intermediate storage definition, and it complains and fails if I use the filesystem intermediate. I am OK with using s3 in development, but that won't be an option when we hit production.

yuhan

11/05/2020, 12:42 AM
What would production be using?

Andy H

11/05/2020, 12:42 AM
That's what we're not sure of yet. I was hoping to find documentation listing the available intermediates so that we could decide how to proceed there.

matas

11/05/2020, 6:09 AM
Hey Andy! Just curious: why doesn't s3 fit your production needs? And how about a self-hosted s3?

Andy H

11/05/2020, 3:44 PM
Thanks @yuhan, much appreciated.
@matas I wasn't aware there was such a thing as self-hosted s3. If we can build an on-prem s3 service, that might work.

matas

11/05/2020, 3:50 PM
We used minio (https://github.com/minio/minio) and zenko (https://github.com/scality/cloudserver). Both are self-deployed, containerized solutions compatible with dagster_aws. Minio is nicer thanks to its GUI, but it is 50x slower with dagster_aws compute_logs (https://github.com/dagster-io/dagster/issues/2438) due to some strange boto3 behaviour, so we've switched to zenko for now.
You can look for deployment inspiration in our boilerplate repo https://github.com/bestplace/cube. It is quite outdated by now, but still valid as far as the s3 connections go.
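For reference, a minimal sketch of the run config that points dagster_aws at a self-hosted S3-compatible endpoint such as minio or zenko. It only builds a plain dict. The key names (endpoint_url, s3_bucket, s3_prefix) are assumptions based on the dagster_aws config schema as I recall it, so check them against your installed version. The helper name, host and bucket are hypothetical.

```python
import json


def self_hosted_s3_run_config(endpoint_url, bucket, prefix="dagster"):
    """Build a run config dict for an S3-compatible endpoint.

    Assumed schema (verify against your dagster_aws version):
    - resources.s3.config.endpoint_url overrides the S3 endpoint
    - intermediate_storage.s3.config takes s3_bucket and s3_prefix
    """
    return {
        "resources": {
            "s3": {"config": {"endpoint_url": endpoint_url}},
        },
        "intermediate_storage": {
            "s3": {"config": {"s3_bucket": bucket, "s3_prefix": prefix}},
        },
    }


# Hypothetical on-prem endpoint and bucket, for illustration only.
config = self_hosted_s3_run_config(
    "http://minio.internal:9000", "dagster-intermediates"
)
print(json.dumps(config, indent=2))
```

The mode still needs the s3 resource and the s3 intermediate storage from dagster_aws. Credentials for the self-hosted service can be supplied through the usual boto3 environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).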

Andy H

11/05/2020, 4:00 PM
Awesome, thanks @matas -- I'll check this out