# ask-community
w
Dagster team, happy Friday! I’ve run into interesting behavior: in our K8s (with K8sRunLauncher) deployment, when I execute a pipeline using any executor (in process, multiprocess, k8s step), our code’s boto client fails to retrieve S3 credentials from the EC2 metadata server. When running the script locally in the very same run container that Dagster spins up, however, it just works! For context, we are authenticating with S3 via kiam (the pods are annotated with the proper IAM role). Is there something going on with the way Dagster executes runs that is causing this? Hope this is interesting-- thanks.
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from iam-role: Credential refresh failed, response did not contain: access_key, secret_key, token, expiry_time
Strangely, it does work when we instantiate our boto client via a boto session, rather than directly.
I’ve learned too that the boto client is not thread-safe unless it’s instantiated from a boto session.
m
are you seeing an error from botocore?
w
Yea exactly, see above.
m
(i don't have a theory for why the handling would be different in a boto session but i can imagine it might be)
w
Yes I have seen that thank you @max. We already have the expiration set to 60m.
Re-iterating: boto sessions are thread-safe, directly instantiated clients are not. Maybe something to do with that…
m
@johann wonder if this rings any bells
@William Reed is your solid code multi-threaded?
w
It is multi-processed.
m
are you attempting to share the boto client or session across the processes, or are you instantiating it in each process?
w
Yes, definitely shares the client. When the client comes from a session (session.client()), it works. Otherwise, not.
I am 85% sure of this, my co-worker wrote the code.. confirming..
m
ok. the boto3 docs suggest that this is unlikely to work in general
feels like if you can instantiate the clients per-process, you probably should
w
Wait a second, this says the opposite of what I was under the impression of.
It’s saying clients are thread-safe, not sessions! 🤔
Strange…
m
yeah, it may be that there is a race here and that you're changing the timing by creating the clients from sessions
and that would explain why you don't encounter it when just instantiating the client from within the pod
w
Hmmm, interesting. And why it works when executed directly from CLI vs. via Dagster?
m
exactly
w
Shoot @max sorry, we DO NOT share client objects. 😂
j
Coming in to this thread, just want to check which variables we’ve isolated:
• multiprocess works outside of dagster k8s
• multiprocess doesn’t work with dagster k8s
• is it possible to run a single process version?
w
Yes this is correct @johann. It is possible to run a single process, just need to edit a lot of code. 😂
This script was already written before we deployed Dagster, so hopefully in the future we can write the pipeline better so it’s easily parallelized with Dagster “native” abstractions.
m
have we confirmed that it's multiprocessing that causes the issue
w
We will have to do more investigation, probably next week. I’ll keep you updated @max!
@max, @johann an update! I double-checked how we use boto clients. I confirmed we only instantiate one per process, and do not share them between processes. We do not do any multithreading, just multiprocessing. I increased the values of boto’s retry attempts to 60 (!) and timeout to 360s (6m) for fetching credentials from the EC2 metadata service (using AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT, respectively). I also switched the production run configuration to use the multiprocess executor instead of the in_process one. These things all combined contributed to a working flow that we are happy with. In the end, my hunch (yes, hunch) is that there’s something happening at the Dagster layer (read: the orchestrator layer) that crippled boto. Not sure what, but maybe it’s worth looking at the source code differences between the executors. I expected that running a script that multiprocesses from either executor would work the same, since (to my knowledge) the multiprocess executor is only parallelizing inter-solid execution, not intra-solid computations. Anyways, just following up here in case you guys are curious. I will try to take a look at the code eventually in my spare time, but I hope that helps in some way. It very well may be that I don’t understand Dagster well enough, or our underlying script isn’t well architected! Fin.
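For anyone landing here later, a rough sketch of the metadata-service settings mentioned above, assuming they are applied from Python rather than in the pod spec. `configure_metadata_retries` is a hypothetical helper name; the two environment variable names and the values 60 and 360 come from the message above. To be safe, this should run before the first boto session or client is created in the process, and child processes started afterwards inherit the values.

```python
import os


def configure_metadata_retries(attempts=60, timeout_seconds=360, env=os.environ):
    """Set the instance-metadata retry settings that botocore reads.

    `env` defaults to the real process environment; passing a plain
    dict instead makes the helper easy to check in isolation.
    """
    # Number of attempts when fetching credentials from the metadata service.
    env["AWS_METADATA_SERVICE_NUM_ATTEMPTS"] = str(attempts)
    # Seconds before a connection to the metadata service times out.
    env["AWS_METADATA_SERVICE_TIMEOUT"] = str(timeout_seconds)
    return env
```

In a K8s deployment the same two variables would more commonly be set in the container environment, which has the same effect without any code change.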