There is something that's not entirely clear to me...
# announcements
Fran:
There is something that's not entirely clear to me: is there any situation where `dagster-graphql` will run a pipeline directly, or does it only communicate that to a `dagit` instance? Also, is it possible to run `dagster` instead of `dagit` to keep a server running, or is it only `dagit` that can do it? I find the documentation great, but the architecture isn't completely clear to me just from reading it.
Phil:
Hi Fran! `dagit` is the webapp that serves the UI and responds to GraphQL requests. It uses the GraphQL schema defined by the `dagster-graphql` package to accept and respond to these requests. It in turn uses APIs from the `dagster` package to service the requests (e.g. query storage, execute runs, etc.). We also have the CLI tool `dagster` to execute dagster commands in the current process. Hope this helps!
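
For readers piecing the architecture together, here is a minimal sketch of the same pipeline driven through the `dagster` Python API, the `dagster` CLI, and `dagit`. It uses the solid-era decorator API (`@solid`, `@pipeline`, `execute_pipeline`) from the time of this thread; the CLI flags in the comments are illustrative and vary by version.

```python
# repo.py -- a minimal solid-era Dagster pipeline (API of roughly the 0.7-0.9 era).
from dagster import execute_pipeline, pipeline, solid


@solid
def say_hello(context):
    context.log.info("hello from a solid")
    return "hello"


@pipeline
def hello_pipeline():
    say_hello()


if __name__ == "__main__":
    # The dagster Python API: executes the pipeline in the current process.
    result = execute_pipeline(hello_pipeline)
    assert result.success

# Roughly equivalent entry points (flags are illustrative and version-dependent):
#   dagster pipeline execute -f repo.py    # dagster CLI, same-process execution
#   dagit -f repo.py                       # dagit, serves the UI and the GraphQL endpoint
```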
Fran:
Hi Phil, thank you for your answer. I have never used GraphQL before, so that's maybe why I don't completely understand what exactly `dagster-graphql` really is. A few more questions:
• Can I use `dagster-graphql` as a CLI tool to trigger a pipeline in a `dagit` server?
• When `dagit` runs a pipeline in a separate process, does it run another `dagit` process or a `dagster` process?
Thank you!
Phil:
Great questions…
• Both the `dagster` and the `dagster-graphql` CLI tools will allow you to trigger pipeline execution. `dagster` will use the Python API to trigger execution, and `dagster-graphql` will take GraphQL queries to trigger execution (a sketch of such a query follows below). GraphQL is a different API, but it wraps the same behavior; the only difference is in the way you interact with them. Both of these are independent of a running dagit instance, which is a web interface.
• When you run a `dagster` CLI command, it runs the corresponding execution in the same process.
• For pipelines initiated through both the `dagit` UI and the `dagster-graphql` CLI command, the execution will occur in a separate process.
• Re: "can I use dagster-graphql as a CLI tool to trigger a pipeline in a dagit server", can you say more about what you're looking for?
👍 1
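
To make the GraphQL path concrete, here is a sketch of the kind of mutation that can be handed to `dagster-graphql` or POSTed to a running dagit's `/graphql` endpoint. The mutation and field names (`launchPipelineExecution`, `executionParams`, the selector shape) changed across Dagster versions, and the CLI flags in the trailing comment are assumptions, so treat all of it as illustrative rather than exact.

```python
# graphql_trigger.py -- illustrative only; mutation/field names vary across Dagster versions.
import json

import requests  # assumes the requests package is available

# A launch mutation of the general shape used by solid-era Dagster GraphQL schemas.
LAUNCH_MUTATION = """
mutation LaunchPipeline($executionParams: ExecutionParams!) {
  launchPipelineExecution(executionParams: $executionParams) {
    __typename
  }
}
"""

variables = {
    "executionParams": {
        "selector": {"pipelineName": "hello_pipeline"},  # selector fields differ by version
        "mode": "default",
        "runConfigData": {},  # older schemas call this environmentConfigData
    }
}

# One way to use it: POST to a running dagit server's GraphQL endpoint.
resp = requests.post(
    "http://localhost:3000/graphql",
    json={"query": LAUNCH_MUTATION, "variables": variables},
)
print(json.dumps(resp.json(), indent=2))

# Alternatively (no dagit required), hand the same query to the dagster-graphql CLI,
# e.g. something like: dagster-graphql -f repo.py -t "$MUTATION" -v "$VARIABLES_JSON"
# (the -t / -v flag names are assumptions about the CLI of that era).
```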
Fran:
I'm mostly trying to understand the architecture and the options, and to figure out how my use case fits with what's available, but let me describe the use case. We have an in-house-built ETL framework. Pretty much every single concept in dagster exists in our framework (i.e. types, solids, hydration, materializations, data dependencies, config, etc.). The issue is that our framework allows users to define tasks in a slightly different way:
• We allow the number of inputs to be defined dynamically at pipeline-building time. For that we pass some `Metadata` to the task so it can decide what it wants as inputs.
• With a task definition, users define both inputs and data dependencies by reference (I want to consume this column from the file that's being analysed, or I want to consume this output from this other task; our framework is opinionated with regard to the type of data being analysed).
So far I have created a solid factory that takes one of our `Transform` tasks and returns a valid `solid` with all its inputs/outputs defined (I provide the `Metadata` for the current job), and I can also dynamically generate the dagster pipeline from the data dependencies defined by our `Transform`, using the `SolidInvocation`, `DependencyDefinition` and `PipelineDefinition` classes. I also have other code to generate `DagsterTypes` out of our `ArtifactTypeHandler` or to create `Resource` objects out of our `Extract` plugins. All of this is done at pipeline-building time, so when I run `dagit repo.py` I can see my pipeline and execute it from the UI.
Now, the way we run pipelines is by means of an external trigger (currently an HTTP request with some `Metadata`), so we need to have a service running or instruct our trigger system to run a `dagster` process when needed. Each trigger will also build a new and different pipeline, since the `Metadata` changes it. What I would like to have is a `dagit` server running so users can inspect the status of running pipelines or pipelines that have already run. I would also like to run each pipeline in a different kubernetes job/pod. Can I deploy dagit configured to use postgresql and then use dagster, with the same postgresql configuration, every time I want to run a pipeline in kubernetes? Will `dagit` be able to show the information? Maybe I should use `dagster-graphql` in my trigger system and have something different on the dagit side that builds and launches a new pipeline every time?
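
As context for the factory approach described above, here is a minimal sketch of building solids and a pipeline programmatically with `SolidDefinition`, `SolidInvocation`, `DependencyDefinition` and `PipelineDefinition`. The `Transform`/`Metadata` objects and their methods are hypothetical stand-ins for the in-house framework, and the argument names follow the older solid-era API, so they may differ in other Dagster versions.

```python
# pipeline_factory.py -- sketch of programmatic pipeline construction (solid-era API;
# argument names such as input_defs/solid_defs may differ in other Dagster versions).
from dagster import (
    DependencyDefinition,
    InputDefinition,
    Output,
    OutputDefinition,
    PipelineDefinition,
    SolidDefinition,
    SolidInvocation,
)


def solid_from_transform(transform, metadata):
    """Build a SolidDefinition from an in-house Transform; Metadata decides the
    inputs/outputs (Transform and Metadata are hypothetical stand-ins)."""

    def _compute(context, inputs):
        # compute_fn receives the resolved inputs as a dict and yields Output events.
        for name, value in transform.run(inputs, metadata).items():
            yield Output(value, output_name=name)

    return SolidDefinition(
        name=transform.name,
        input_defs=[InputDefinition(n) for n in transform.input_names(metadata)],
        output_defs=[OutputDefinition(name=n) for n in transform.output_names(metadata)],
        compute_fn=_compute,
    )


def pipeline_from_transforms(transforms, metadata):
    """Wire the solids together from the Transforms' declared data dependencies."""
    dependencies = {
        SolidInvocation(t.name): {
            input_name: DependencyDefinition(upstream_solid, upstream_output)
            for input_name, (upstream_solid, upstream_output) in t.dependencies(metadata).items()
        }
        for t in transforms
    }
    return PipelineDefinition(
        name="dynamic_pipeline",
        solid_defs=[solid_from_transform(t, metadata) for t in transforms],
        dependencies=dependencies,
    )
```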
Phil:
You are exactly right. Each dagit instance runs off an instance configuration that configures things like `run_storage` and `event_log_storage`. You can use the same instance configuration for running `dagster` or `dagster-graphql` CLI commands, and the runs should be visible in `dagit`.
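
As an illustration of such an instance configuration, here is a sketch of a `dagster.yaml` (placed in `$DAGSTER_HOME`) that points run storage and event-log storage at Postgres via the `dagster-postgres` library. The module, class and config keys shown are from that era and should be checked against the installed version.

```yaml
# $DAGSTER_HOME/dagster.yaml -- shared by dagit and the dagster / dagster-graphql CLIs.
# Keys are illustrative; confirm them against your dagster-postgres version.
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_url: "postgresql://user:password@postgres-host:5432/dagster"

event_log_storage:
  module: dagster_postgres.event_log_storage
  class: PostgresEventLogStorage
  config:
    postgres_url: "postgresql://user:password@postgres-host:5432/dagster"
```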
You might also want to check out our `dagster-k8s` library integration. There's some tooling there to launch runs as kubernetes jobs.
More information on setting that up is here: https://docs.dagster.io/docs/deploying/k8s
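
For the one-kubernetes-job-per-run part of the use case, the same instance file can also configure a run launcher from `dagster-k8s`. The field names below are illustrative; the linked docs have the authoritative set for each version.

```yaml
# Added to the same dagster.yaml -- launches each run as a Kubernetes Job.
# Field names are illustrative; see the dagster-k8s docs linked above.
run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    job_image: "my-registry/my-dagster-image:latest"   # image containing the pipeline code
    job_namespace: "dagster"
    service_account_name: "dagster"
    instance_config_map: "dagster-instance"            # ConfigMap that holds this dagster.yaml
```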