Oh I think I see, there's a grpc subprocess starte...
# announcements
Oh I think I see, there's a grpc subprocess started automatically
yeah, they communicate back and forth over gRPC - dagit interacts with an ExternalRepository that's derived from the RepositoryDefinition
👍 So there's a gRPC repo data service running no matter what 🙂
Makes sense
yeah - the button in the UI should reboot that service and pick up any changes without needing to restart dagit (although it's not great about picking up brand new pipelines or repositories right now - I'd expect it to pick up code changes for sure though?)
Yeah, so the only problem would be putting a TTL on the in-process cache using inside Dagit and my launcher daemons
I think I'm not totally clear what the problem is that you're running into
I have a
dagster api grpc
daemon serving my workspace data. I update that with new pipeline code/definitions and restart that daemon. I then need to get that change visible and used everywhere
The two places that seem to have long lived caches are Dagit and my new launcher daemons.
I think I can fix the launcher daemons by fully seralizing the pipeline definition and sending that over to the launcher daemons rather than just a reference. So that would just leave Dagit.
Basically I want dagit and the launchers (and the solids) to be independent long lived daemons with the pipeline definitions managed somehow externally.
got it - so the button I mentioned earlier is not sufficient, you would like it to be more automatic
The easiest way to do that seems to be to pack each workspace into a gRPC daemon with a CI/CD workflow on it so when the pipeline code changes, CI will push a new container image for the gRPC daemon for that workspace
Right, this should be autonomous after someone makes a commit to the foo-pipelines git repo 🙂
That will build a foo-pipelines image that runs the gRPC workspace daemon, with dagit and everything else aimed at all of those
so I think this block of code in our k8s launcher does something similar to what you want: https://dagster.phacility.com/source/dagster/browse/master/python_modules/libraries/dagster-k8s/dagster_k8s/launcher.py$242-271
No, that does a weird hack that assumes that pipeline code is available to the k8s pod in the same place as it was to the grpc daemon.
Line 266 you can see it just lies and says "pretend it was a python origin"
That's the code I started from, but I think the better approach there is to serialize the entire pipeline definition and have that be the input to the launcher worker 🙂
That way the launcher doesn't need to think about the origin handles at all
well, the launcher worker, the actual launcher plugin still maybe does?
hmm, generally our run launchers don't have access to the actual pipeline definition (or run any user code - we intentionally keep user code out of the dagit process)
This is the equiv of
dagster api execute_run_with_structured_logs
but running as a celery task
That does appear to load the repo off disk
yeah, calling "dagster api execute_run_with_structured_logs" in a different process is totally fair game for a run launcher - maybe I misinterpreted what you meant by "serialize the entire pipeline definition"
sorry not sure I'm helping much at the moment
I'm trying to avoid launching extraneous subprocesses because they are hell on Kubernetes
So I wrote a custom launcher that uses Celery instead
running something in celery sounds fine too - you're good as long as its not running in the dagit host process
But it means its long running so the repo data gets cached when using
I think at least
Right now I'm using the same JSON as dagster-k8s, with an origin reference to the pipeline
I think it would be better to instead serialize the whole pipeline 🙂
That avoids my launcher workers ever having to load the repo data
when you say 'serialize the whole pipeline' what's the exact object you're referring to being serialized?
the PipelineDefinition?
the ExternalPipeline passed in to launch_run
Or at least that's the only handle to the pipeline I think I have at the launcher phase?
got it - I don't think ExternalPipeline by itself gives you enough information to do any actual execution unfortunately. It definitely doesn't give you access to the underlying code that makes up the solids in the pipeline, for example.
My solids will live off in their own universes anyway
So the launcher doesn't need those 🙂
(other than the definitions)
Those will be individual celery workers as well
"(other than the definitions)" - I think this may be the confusion - ExternalPipeline doesn't really have the definitions, it has a bunch of handles that reference the definitions somewhere else (in user code in a different process). It does have enough information to display lots of information about the pipeline in dagit, but I think you'd end up rewriting large chunks of our execution stack if you tried to just use an ExternalPipeline to execute a pipeline
Is there any way to load things for execution over gRPC or does that 100% expect a local PythonOrigin in all cases?
the default run launcher does execution over gRPC, that can be running in a server with code that dagit doesn't have access to
The GrpcRunLauncher that is
Sure but the gRPC daemon still needs local access itself
Trying to figure out a workflow for versioning and managing pipelines that will scale beyond one person, basically
And external_pipeline.external_pipeline_data.pipeline_snapshot seemed like it had enough data to evaluate a pipeline 🙂
on the versioning thing, you may want to talk to sandy on our team who's thinking through that class of problems right now (pipeline versioning) for our next release
👍 Roger
If I do have to have local code for both my launcher workers and the grpc daemons for dagit, that's also survivable 🙂
Just means building two images and running separate launcher workers for each repo, but that's totally doable
Probably not even two images, just two Deployments
That might be the path of least resistance for now, along with figuring out where exactly dagit caches its external handles 🙂
yeah, I think all of our launchers that use gRPC are currently assuming that the code pointers / origins that it spits out will also work in the environment that does execution - it's not out of the question for them to be able to spit out some kind of other origin instead though, if there's a use case for it
Fair, just with the pipeline being entirely declarative data I was hoping there was a good way to just turn it into JSON 🙂
Sounds like that might actually be the case but not from the data given in launch_run?
Or rather, not from the data exposed via gRPC 🙂
yeah we're getting a little out of my area of expertise there. certainly the pipeline is declarative and could boil down to a DAG, I just haven't seen anybody try to do execution without loading the PipelineDefinition from code
So it that is automatically reloads the location every X seconds
Still cached for moment to moment query use but set it to like 300 seconds and tell people to be very slightly patient after they push a code update 😄
(alternatively skip that and put an initContainer on the gRPC daemons that call the reload location mutation whenever the pod restarts
Thanks for back-and-forthing this with me ❤️ I think I see a path forward 🙂
ladydaga 1
@sashank is currently working on this dagit caching / not-knowing-the-grpc-server-changed-out problem
👍 1
pipeline_snapshot seemed like it had enough data to evaluate a pipeline
yea this is the direction we are heading in - but for now the pipeline execution paths all still operate on the definitions directly so need the local code