# ask-community
a
Hi team, we are currently using a custom repository implementation and periodically loading the job definitions from an external API. We also call reload_repository() every time we load the definitions, and we are running multiple Dagit instances. However, we often see
Job not found: Could not find pipeline ...
errors in the Dagit UI while opening the Launchpad for that job. Any thoughts on why this could happen?
🆙 3
h
Related question: when we make a reloadRepositoryLocation() GraphQL request, is it only sent to one Dagit instance, so it would only reload that one node rather than all Dagit instances, right?
a
We believe it's due to the caching of the repository definitions in the Dagit servers. Could someone please confirm? Is there any way we can reduce the TTL, or invalidate the cache for all the Dagit instances at once?
y
Hi! I’m checking in with the team to look into this.
🙏 1
a
Yes, this is likely the result of having Dagit replicas that each keep their own in-memory copy of the workspace. What in the system is sending the reloadRepositoryLocation request? If it's done from within the cluster, you could maybe use kubectl to get the IPs of all replicas and broadcast the request to each of them (or something similar).
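A minimal sketch of that broadcast idea, assuming the Dagit pods carry a hypothetical app=dagit label, serve GraphQL on port 3000, and host a repository location named my_location (the namespace, label, port, and location name are placeholders, not values from this thread):

```python
# Hypothetical sketch: broadcast reloadRepositoryLocation to every Dagit replica
# by resolving pod IPs directly instead of going through the Service.
# Assumes the kubernetes and dagster-graphql packages are installed and the
# code runs in-cluster; the constants below are placeholders for your deploy.
from kubernetes import client as k8s_client, config as k8s_config
from dagster_graphql import DagsterGraphQLClient

NAMESPACE = "dagster"          # placeholder
DAGIT_LABEL = "app=dagit"      # placeholder label selector for the Dagit pods
DAGIT_PORT = 3000              # placeholder GraphQL port
REPO_LOCATION = "my_location"  # placeholder repository location name


def broadcast_reload():
    k8s_config.load_incluster_config()
    core = k8s_client.CoreV1Api()
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=DAGIT_LABEL)
    for pod in pods.items:
        ip = pod.status.pod_ip
        if not ip:
            continue
        # Hit each replica directly so every in-memory workspace copy reloads,
        # rather than whichever single pod the Service would route to.
        gql = DagsterGraphQLClient(ip, port_number=DAGIT_PORT)
        gql.reload_repository_location(REPO_LOCATION)


if __name__ == "__main__":
    broadcast_reload()
```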
> Is there any way we can reduce the TTL, or invalidate the cache for all the Dagit instances at once?
There's nothing currently built into the process to facilitate this. The two mechanisms for reloading are restarting the process or the GraphQL call you are already making. You could emulate a TTL by ensuring the pods restart at some frequency using something like pod-reaper, but I think figuring out how to broadcast the request is probably preferable.
> However we are often seeing Job not found: Could not find pipeline ... error in the Dagit UI while opening the Launchpad for that job.
Just to confirm, do these failures come when adding new jobs, or for existing jobs?
h
Thanks, Alex! These failures come when adding new jobs.
For my code server, we have a sensor job that calls reloadRepositoryLocation using DagsterGraphQLClient every 30 minutes. We probably won't be able to get the IPs of the replicas from within the job.
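For context, a scheduled reload job like the one described might look roughly like this (sketched with a schedule rather than a sensor for brevity; the host name, port, location name, and 30-minute cron are placeholder assumptions). Note that it only reaches whichever single replica the service routes the request to:

```python
# Rough sketch of a periodic reload job like the one described above.
# Host, port, and location name are placeholders for the real deployment values.
from dagster import ScheduleDefinition, job, op
from dagster_graphql import DagsterGraphQLClient


@op
def reload_repository_location_op():
    # Sends the reloadRepositoryLocation mutation, but only to the single
    # Dagit replica that the service happens to route this request to.
    gql = DagsterGraphQLClient("dagit", port_number=3000)  # placeholder host/port
    gql.reload_repository_location("my_location")          # placeholder location name


@job
def reload_repository_location_job():
    reload_repository_location_op()


# Runs every 30 minutes, matching the cadence mentioned above.
reload_schedule = ScheduleDefinition(
    job=reload_repository_location_job,
    cron_schedule="*/30 * * * *",
)
```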
a
@alex Curious how others are handling this issue? Any plans to improve this in upcoming releases, so that we can consider upgrading instead of building the fix ourselves? We autogenerate jobs in most of our repositories, and this is creating multiple issues that prevent our users from triggering jobs through Dagit.
a
Unfortunately I am not aware of any other users in this specific situation or any work currently planned that would address it.