# ask-community
a
Hey! I wonder if I can combine/nest jobs running on multiple pieces of hardware, in multiple parts? What I’ve got working right now is:
1. Job `A` starts on our Dagster cluster.
2. Job `A` creates an external environment.
3. Job `A` launches execution of job `B` in the external environment.
My question is whether there’s a recommended way to tie `A` and `B` together, since functionally job `A` is only complete when job `B` is done, because in an ideal world `A` would destroy the external environment upon completion.
m
Hi Arturs -- I think there are a bunch of ways you could accomplish this -- does job `A` really need to be a separate job, or could you write a job factory that embedded the graph for job `B` into the graph for job `A`? So that the setup ops from `A` ran ahead of the business logic for `B` and then the teardown ops ran after it was complete?
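A minimal sketch of the job-factory pattern described above, assuming `B`'s graph can accept the cluster id as an input and returns a single output; all op/graph names here are illustrative placeholders, not anything from this thread:

```python
from dagster import graph, job, op


@op
def provision_cluster() -> str:
    # Hypothetical: survey the environment, size the Spark cluster, and create it.
    return "cluster-123"  # placeholder for the real provisioning call


@op
def teardown_cluster(result) -> None:
    # Hypothetical: destroy the cluster once the business logic has finished.
    pass


@op
def spark_workload(cluster_id: str) -> str:
    # Stand-in for job B's real ops.
    return f"ran on {cluster_id}"


@graph
def business_logic(cluster_id: str):
    # Placeholder graph standing in for job B's graph.
    return spark_workload(cluster_id)


def make_combined_job(business_graph):
    """Job factory: embed a business-logic graph between A's setup and teardown ops."""

    @job(name=f"{business_graph.name}_with_cluster")
    def combined():
        teardown_cluster(business_graph(provision_cluster()))

    return combined


combined_job = make_combined_job(business_logic)
```

With this shape, the setup, the embedded graph, and the teardown all appear as steps of one run in Dagit.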
a
@max `B` is a reasonably complex Dagster job (or graph, I guess) that dynamically assembles itself into anywhere between hundreds and thousands of ops, depending on execution-time circumstances. For that reason, I would prefer to be able to actually review individual steps in Dagit if necessary. I’m not sure what a job factory would be in this case, but basically what I want to do is tie together the construction/deconstruction job `A` with the business-logic job `B` so that the entire process can be visually reviewed in a single flamegraph.
m
is the reason that they aren't a single job that job `A` is re-used elsewhere?
a
If that’s impossible, it’s not a blocking problem - it would just make for a really annoying Dagit user experience, where a reviewer will open failed `A` job `a1b2c3` and will then need to figure out that it is connected to the failed `B` run `x4y5z6`.
m
i guess i'm just asking why the jobs are separate? if you want elements of job `A` to run after job `B` is complete
a
They’re not a single job because the environment for `B` is ephemeral, and `B` must be executed inside a specific environment that is at all times external to both Dagit and the Dagster Daemon. `B` happens on a Spark cluster.
m
so the ops in job `B` instigate compute on the Spark cluster?
a
The job of `A` is to survey the environment, determine the spec of the Spark cluster necessary, provision the cluster, submit the job to the cluster, and close the cluster once the job has completed. `B` is a PySpark project, yes.
m
it seems to me like logically this is a single job (unless the right way to think of `A` is as a context manager/environment provisioner that gets reused for jobs `B`, `C`, ...)
if that's the case, i'm having a hard time understanding what technical limitations require it to be split into two jobs -- there's no reason that ops can't launch Spark jobs on clusters created by upstream ops (and then cleaned up by downstream ops)
a
That’s technologically impossible - in this situation there’s a hard requirement that I run the equivalent of `spark-submit b.py` locally, from the Spark cluster’s perspective.
My question is basically whether I can connect a graph running in a 3rd-party environment to an ongoing job and insert it between the op that caused it to be executed and the op that is scheduled next.
m
you can certainly have an op which kicks off the dependent job
that op can emit structured metadata which can provide links to the dependent job
depending on exactly what you're doing, you may also be able to use functionality like asset sensors to introduce loose coupling between your various jobs
you might also be able to use the resource system to provision your ephemeral cluster (there are models in the codebase of how to do this with EMR)
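A rough sketch of that last suggestion - provisioning the ephemeral cluster via a yield-style resource so setup happens before the ops that need it and teardown happens when the run finishes; `create_cluster`/`destroy_cluster` and the resource key are placeholder names, not a real Dagster or EMR API:

```python
from dagster import job, op, resource


def create_cluster():
    # Placeholder for the real provisioning call (EMR, Dataproc, etc.).
    return {"id": "cluster-123"}


def destroy_cluster(cluster):
    # Placeholder for the real teardown call.
    pass


@resource
def ephemeral_spark_cluster(init_context):
    # Yield-style resources behave like context managers: setup runs before the
    # ops that require the resource, teardown runs once the run is finished.
    cluster = create_cluster()
    try:
        yield cluster
    finally:
        destroy_cluster(cluster)


@op(required_resource_keys={"spark_cluster"})
def submit_spark_job(context):
    context.log.info(f"submitting to {context.resources.spark_cluster['id']}")


@job(resource_defs={"spark_cluster": ephemeral_spark_cluster})
def job_a():
    submit_spark_job()
```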
maybe i should have asked this earlier -- is `B` a Dagster job or a Spark job?
a
It’s both. It is a Dagster job that dynamically generates a set of Spark operations at runtime. It’s several nested layers of dynamic mapping ops.
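For context, the dynamic-mapping pattern described above looks roughly like this in Dagster (illustrative names; the real job fans out into hundreds or thousands of mapped ops):

```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def discover_workloads():
    # Hypothetical: inspect runtime inputs and fan out one output per workload.
    for idx in range(3):
        yield DynamicOutput(idx, mapping_key=f"workload_{idx}")


@op
def run_spark_workload(workload):
    # Hypothetical Spark step for a single workload.
    return workload


@op
def summarize(results):
    return len(results)


@job
def job_b():
    results = discover_workloads().map(run_spark_workload)
    summarize(results.collect())
```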
Is there a documentation page describing dependent jobs functionality? Changing our provisioning logic for the sake of this is not a feasible option, I think.
m
all i mean is that an op can call the python or graphql apis to start another job
but again i'm having trouble understanding why that's necessary in your case -- an op that generates a set of Spark operations feels like it should be able to run within any Dagster graph -- so hard to see why the jobs need to be separate
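A hedged sketch of what such an op might look like, using the GraphQL client to kick off the dependent job and recording a link to it as structured metadata; the host, port, repository coordinates, and run URL format are assumptions about the deployment, not values from this thread:

```python
from dagster import MetadataValue, op
from dagster_graphql import DagsterGraphQLClient


@op
def launch_job_b(context):
    # Assumed Dagit host/port and repository coordinates -- adjust to the deployment.
    client = DagsterGraphQLClient("dagit.internal", port_number=3000)
    run_id = client.submit_job_execution(
        "job_b",
        repository_location_name="my_location",
        repository_name="my_repo",
        run_config={},
    )
    # Surface a link to the dependent run as structured metadata on this op's output
    # (the URL format below is illustrative).
    context.add_output_metadata(
        {"job_b_run": MetadataValue.url(f"http://dagit.internal:3000/runs/{run_id}")}
    )
    return run_id
```

The metadata entry then shows up in the op's structured event log in Dagit, giving a reviewer a direct link from `A`'s run to `B`'s run.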
a
Because it cannot execute on the same hardware - that’s technologically impossible in our hardware configuration.
m
gotcha, so you can't remotely execute your spark-submit call
a
I can and I do, that’s the problem.
When you perform a spark-submit call, your Spark cluster creates a new set of Python processes, marshalled by it and local to its workers.
So in that process I have a Dagster job executing that feeds workloads to Spark.
m
so you are running your Dagster job (`B`) within a Spark job?
a
Due to the way our infrastructure is set up, it is not possible for me to feed individual workloads to our Spark cluster.
Yes, correct, I should’ve expressed myself more clearly about that when you asked before.
m
i see
and is the primary concern to find a way to link the two jobs in the UI, or is it more around control flow -- like ensuring that job B has completed before running the cleanup step from job A
a
Just the user experience. I can handle the cleanup step via environment configuration, by setting up self-destruction triggered by time spent idle - so long as the job was actually visible to Dagit while executing, and Dagit has the logs from the job if it failed.
Although I guess logs also aren’t that critical - I don’t log locally to the cluster anyway. So yeah, the primary concern is to have them linked in the UI.
m
You can attach arbitrary tags to job runs -- so for instance if you wanted to tag job `A`'s run id on the run of job `B`
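A minimal sketch of how `B`'s run could pick up such a tag, assuming `B` is executed by the entry point that `A` spark-submits, that the Spark driver shares a `DagsterInstance` with Dagit (same `DAGSTER_HOME` storage), and that `A` passes its run id via an environment variable -- all assumptions, not details from this thread:

```python
# b.py -- entry point executed on the Spark driver via spark-submit
import os

from dagster import DagsterInstance, execute_job, reconstructable

from my_project.jobs import job_b  # hypothetical module containing job B


if __name__ == "__main__":
    # Run id of job A, passed along by the op that performed the spark-submit.
    parent_run_id = os.environ.get("PARENT_DAGSTER_RUN_ID", "unknown")

    execute_job(
        reconstructable(job_b),
        instance=DagsterInstance.get(),
        # Arbitrary tag tying B's run back to A's run, searchable in Dagit.
        tags={"parent_run_id": parent_run_id},
    )
```

The tag then appears on `B`'s run in Dagit and can be used to filter the runs page for everything triggered by a given `A` run.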
a
That will do, cheers. Any documentation pointers if you have them handy?
m
@sandy