
Daniel Michaelis

01/14/2022, 12:31 PM
Hi, I'm experiencing a bug when trying to write an `ALS` model from `pyspark.ml.recommendation` to S3 and reading it back in if this takes place within a dynamically executed graph (i.e. via dynamic mapping). I wrote a custom IO manager using the pattern `f's3a://{self.s3_bucket}/{key}'` as `_uri_for_key`, similar to the one currently implemented in the `PickledObjectS3IOManager`. As the step identifiers for the dynamically generated steps contain square brackets `[` and `]`, these are included in the S3 URI when an object is written. Even though I can clearly see the model was saved to this path in S3, I'm getting an error when the downstream op tries to load the model, something like:
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3a://....fit_model[...]/model/metadata matches 0 files
When I replace/remove the square brackets from `_uri_for_key` this works fine:
f's3a://{self.s3_bucket}/{key}'.replace('[', '_').replace(']', '')
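A minimal sketch of the workaround above as a standalone helper, for illustration only. The function name `uri_for_key`, the bucket name, and the example key are hypothetical; only the `replace` chain comes from the message.

```python
def uri_for_key(s3_bucket: str, key: str) -> str:
    """Build an s3a URI with the square brackets from dynamic step keys removed.

    Hypothetical helper: mirrors the replace chain from the message above,
    turning '[' into '_' and dropping ']'.
    """
    safe_key = key.replace("[", "_").replace("]", "")
    return f"s3a://{s3_bucket}/{safe_key}"


# Example with a made-up bucket and step key (not taken from the original run).
print(uri_for_key("my-bucket", "storage/run_1/fit_model[user_a]/result"))
```

Note that two different step keys could map to the same sanitized key (for example `a[b]` and `a_b`), so a real IO manager might want a replacement scheme that cannot collide.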
It seems that technically `_uri_for_key` is only used for debug logs in the `PickledObjectS3IOManager` and the writing/reading occurs via `upload_fileobj` and `pickle.loads` without actually using this key. I can imagine that the same error could occur with the `PickledObjectS3IOManager` and thought I'd point it out here, in case this hasn't been tested yet. Moreover, square brackets are among the characters to avoid in S3 object keys according to https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
On the other hand, maybe the error is specific to my particular use case if:
1. the resulting uri from `upload_fileobj` is actually different from the one generated when using `upload_fileobj`, which would mean that the debug log message should be corrected
2. maybe the error only occurs when trying to read a folder, which is the case for the `ALS` model (see error message) but not for pickled objects
3. maybe the error only occurs within Py4J
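One possible reading of the "Input Pattern ... matches 0 files" message, offered as an assumption rather than a confirmed diagnosis: the load path may be interpreted as a glob pattern, in which `[...]` denotes a character class instead of literal brackets. Python's `fnmatch` uses similar bracket semantics, so it can illustrate the effect without Spark. The key below is made up, since the real path in the error is elided.

```python
from fnmatch import fnmatchcase
from glob import escape

# Hypothetical object key containing a dynamic step identifier.
key = "storage/run_1/fit_model[user_a]/model/metadata"

# Used as a glob pattern, "[user_a]" matches a single character from the set
# {u, s, e, r, _, a}, so the pattern does not match the literal key itself.
matches_itself = fnmatchcase(key, key)

# Escaping the glob metacharacters makes the brackets literal again.
matches_escaped = fnmatchcase(key, escape(key))

# Sanitizing the key, as in the workaround above, avoids the problem entirely.
safe_key = key.replace("[", "_").replace("]", "")
safe_matches_itself = fnmatchcase(safe_key, safe_key)

print(matches_itself, matches_escaped, safe_matches_itself)
```

If this reading is right, it would also fit point 2: a plain object download by exact key would not go through any pattern matching, while a directory-style load through Hadoop would.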

sandy

01/18/2022, 4:10 PM
Thanks for reporting this @Daniel Michaelis. Based on what you described here, I don't think this error is specific to your particular use case. I filed a github issue to track it: https://github.com/dagster-io/dagster/issues/6238.

Daniel Michaelis

01/18/2022, 4:16 PM
Yes, I also thought that it's probably not specific to my case. Thanks for filing the issue 👍
Of course in point 1 I wanted to write that the resulting uri from `upload_fileobj` could be different from `_uri_for_key`, which I also assume not to be the case.