# ask-community
a
Hi team, we are currently using a custom repository implementation and periodically loading the job definitions from an external API. We also call reload_repository() every time we load the definitions, and we are running multiple Dagit instances. However, we often see
Job not found: Could not find pipeline ...
errors in the Dagit UI while opening the Launchpad for that job. Any thoughts on why this could happen?
🆙 3
h
Related question: when we make a reloadRepositoryLocation() GraphQL request, is it only sent to one Dagit instance, so it would only reload that one node rather than all Dagit instances, right?
a
We believe it's due to the caching of the repository definitions in the Dagit servers. Could someone please confirm? Is there any way we can reduce the TTL, or invalidate the cache for all the Dagit instances at once?
y
Hi! I’m checking in with the team to look into this.
🙏 1
a
Yes, this is likely the result of having Dagit replicas that each keep their own in-memory copy of the workspace. What in the system is sending the reloadRepositoryLocation request? If it's done from within the cluster, you could maybe use kubectl to get the IPs of all replicas and broadcast the request to each of them (or something similar).
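A minimal sketch of that broadcast idea, assuming the Dagit pods carry a hypothetical app=dagit label, serve GraphQL on port 3000, and host a repository location named my_location (the namespace, label, port, and location name are placeholders, not values from this thread):

```python
# Hypothetical sketch: broadcast reloadRepositoryLocation to every Dagit replica
# by resolving pod IPs directly instead of going through the Service.
# Assumes the kubernetes and dagster-graphql packages are installed and the
# code runs in-cluster; the constants below are placeholders for your deploy.
from kubernetes import client as k8s_client, config as k8s_config
from dagster_graphql import DagsterGraphQLClient

NAMESPACE = "dagster"          # placeholder
DAGIT_LABEL = "app=dagit"      # placeholder label selector for the Dagit pods
DAGIT_PORT = 3000              # placeholder GraphQL port
REPO_LOCATION = "my_location"  # placeholder repository location name


def broadcast_reload():
    k8s_config.load_incluster_config()
    core = k8s_client.CoreV1Api()
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=DAGIT_LABEL)
    for pod in pods.items:
        ip = pod.status.pod_ip
        if not ip:
            continue
        # Hit each replica directly so every in-memory workspace copy reloads,
        # rather than whichever single pod the Service would route to.
        gql = DagsterGraphQLClient(ip, port_number=DAGIT_PORT)
        gql.reload_repository_location(REPO_LOCATION)


if __name__ == "__main__":
    broadcast_reload()
```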
> Is there any way we can reduce the TTL, or invalidate the cache for all the Dagit instances at once?
There's nothing currently built into the process to facilitate this. The two mechanisms for reloading are restarting the process or the GraphQL call you are already making. You could emulate a TTL by ensuring the pods restart at some frequency using something like pod-reaper, but I think figuring out how to broadcast the request is probably preferable.
> However we are often seeing Job not found: Could not find pipeline ... error in the Dagit UI while opening the Launchpad for that job.
Just to confirm, do these failures come when adding new jobs, or for existing jobs?
h
Thanks, Alex! These failures come when adding new jobs.
For my code server, we have a sensor job that calls reloadRepositoryLocation using DagsterGraphQLClient every 30 minutes. We probably won't be able to get the IPs of the replicas from within the job.
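For context, a scheduled reload job like the one described might look roughly like this (sketched with a schedule rather than a sensor for brevity; the host name, port, location name, and 30-minute cron are placeholder assumptions). Note that it only reaches whichever single replica the service routes the request to:

```python
# Rough sketch of a periodic reload job like the one described above.
# Host, port, and location name are placeholders for the real deployment values.
from dagster import ScheduleDefinition, job, op
from dagster_graphql import DagsterGraphQLClient


@op
def reload_repository_location_op():
    # Sends the reloadRepositoryLocation mutation, but only to the single
    # Dagit replica that the service happens to route this request to.
    gql = DagsterGraphQLClient("dagit", port_number=3000)  # placeholder host/port
    gql.reload_repository_location("my_location")          # placeholder location name


@job
def reload_repository_location_job():
    reload_repository_location_op()


# Runs every 30 minutes, matching the cadence mentioned above.
reload_schedule = ScheduleDefinition(
    job=reload_repository_location_job,
    cron_schedule="*/30 * * * *",
)
```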
a
@alex Curious how others are handling this issue? Any plans to improve this in upcoming releases, so that we can consider upgrading instead of building the fix ourselves? We autogenerate jobs in most of our repositories, and this is creating multiple issues that prevent our users from triggering jobs through Dagit.
a
Unfortunately I am not aware of any other users in this specific situation or any work currently planned that would address it.