Daniel Michaelis
01/14/2022, 12:31 PM
I'm having an issue with writing an `ALS` model from `pyspark.ml.recommendation` to S3 and reading it back in if this takes place within a dynamically executed graph (i.e. via dynamic mapping). I wrote a custom IO manager using the pattern `f's3a://{self.s3_bucket}/{key}'` as `_uri_for_key`, similar to the one currently implemented in the `PickledObjectS3IOManager`. As the step identifiers for the dynamically generated steps contain square brackets `[` and `]`, these are included in the S3 URI when an object is written. Even though I can clearly see the model was saved to this path in S3, I'm getting an error when the downstream op tries to load the model, something like:
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3a://....fit_model[...]/model/metadata matches 0 files
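A possible explanation for the `matches 0 files` message, offered as an assumption rather than something confirmed in this thread: Hadoop treats input paths as glob patterns, in which `[...]` denotes a character class rather than literal brackets. The sketch below uses Python's `fnmatch` purely as an analogy for that glob behavior; the path `fit_model[a]/model/metadata` is a made-up example, not the real one from the error.

```python
from fnmatch import fnmatchcase
import glob

# Hypothetical object path containing the brackets of a dynamic step key.
stored_path = "fit_model[a]/model/metadata"

# Used as a glob pattern, "[a]" is a character class matching the single
# character "a", so the pattern does not match its own literal path.
matches_itself = fnmatchcase(stored_path, stored_path)
matches_without_brackets = fnmatchcase("fit_modela/model/metadata", stored_path)

# Escaping the glob metacharacters makes the pattern match the literal path.
escaped_pattern = glob.escape(stored_path)
matches_when_escaped = fnmatchcase(stored_path, escaped_pattern)

print(matches_itself, matches_without_brackets, matches_when_escaped)
```

If this is what happens on the Spark side, the object exists in S3 but the bracketed path, read as a pattern, never matches it.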
When I replace/remove the square brackets from `_uri_for_key`, this works fine:
f's3a://{self.s3_bucket}/{key}'.replace('[', '_').replace(']', '')
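The workaround above can be pulled into a small helper so the same sanitizing is applied on both write and read. This is a minimal sketch: `sanitize_key` and `uri_for_key` are hypothetical standalone functions, and the bucket and key values are made up for illustration.

```python
def sanitize_key(key: str) -> str:
    """Replace '[' with '_' and drop ']', mirroring the workaround above."""
    return key.replace("[", "_").replace("]", "")


def uri_for_key(bucket: str, key: str) -> str:
    """Build an s3a URI from a bucket and a sanitized key."""
    return f"s3a://{bucket}/{sanitize_key(key)}"


# Made-up bucket and key; the key mimics a dynamically mapped step identifier.
example_uri = uri_for_key("my-bucket", "run_id/fit_model[a]/result")
print(example_uri)
```

One caveat with this mapping: it is not reversible, so a step named `fit_model[a]` and a step literally named `fit_model_a` would end up at the same path.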
It seems that technically `_uri_for_key` is only used for debug logs in the `PickledObjectS3IOManager`, and the writing/reading occurs via `upload_fileobj` and `pickle.loads` without actually using this URI. I can imagine that the same error could occur with the `PickledObjectS3IOManager` and thought I'd point it out here, in case this hasn't been tested yet. Moreover, square brackets are among the characters to avoid in S3 object keys according to https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
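To illustrate the distinction described above, here is a rough sketch of a pickled write/read path in which the bucket and key go straight to the client and the URI only appears in a log message. It is an assumption about the general shape, not a copy of the `PickledObjectS3IOManager` source. `InMemoryS3Client` is a hypothetical stand-in that mimics only the `upload_fileobj` and `get_object` calls of a boto3 S3 client so the example runs without AWS access.

```python
import io
import pickle


class InMemoryS3Client:
    """Hypothetical stand-in exposing the two boto3 S3 client calls used below."""

    def __init__(self):
        self._objects = {}

    def upload_fileobj(self, Fileobj, Bucket, Key):
        self._objects[(Bucket, Key)] = Fileobj.read()

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


def uri_for_key(bucket, key):
    # Only used to build a log message here; never passed to the client.
    return f"s3://{bucket}/{key}"


def write_pickled(client, bucket, key, obj):
    # The raw key is handed to the client, so no glob interpretation is involved.
    client.upload_fileobj(io.BytesIO(pickle.dumps(obj)), bucket, key)
    return f"Wrote object to {uri_for_key(bucket, key)}"


def read_pickled(client, bucket, key):
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pickle.loads(body)


client = InMemoryS3Client()
key = "run_id/fit_model[a]/result"  # made-up key containing brackets
log_line = write_pickled(client, "my-bucket", key, {"rank": 10})
restored = read_pickled(client, "my-bucket", key)
```

In this shape the brackets travel through as an opaque key, which would be consistent with the pickled path working while a path-based Spark load fails.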
On the other hand, maybe the error is specific to my particular use case if:
1. the resulting URI from `upload_fileobj` is actually different from the one generated by `_uri_for_key`, which would mean that the debug log message should be corrected
2. maybe the error only occurs when trying to read a folder, which is the case for the `ALS` model (see error message) but not for pickled objects
3. maybe the error only occurs within Py4J
sandy
01/18/2022, 4:10 PM
Daniel Michaelis
01/18/2022, 4:16 PM
[...] `upload_fileobj` could be different from `_uri_for_key`, which I also assume not to be the case.