Hi Guys, I have just started my 1st pipeline on K8...
# announcements
i
Hi guys, I have just started my first pipeline on K8s with the Celery executor. My question is: how much CPU does Dagit need? I gave it 1000 millicores, but K8s throttling kicks in immediately. I started 5 Celery workers, and my guess was that Dagit is only responsible for accepting pipeline executions and starting them, i.e. putting them into the Celery broker queue (Redis), while the calculations themselves are distributed among the Celery workers. But it seems my Celery workers are bored 🙂 and Dagit cannot get enough resources. Do you have a CPU suggestion for Dagit? (This is in a stream processing environment, so we are hitting the GraphQL API.)
Dagit CPU is under high pressure, but I don't understand why
MEM seems OK
and Celery workers are bored
why? 😉
if anyone can give me a tip on what I should check or tune, it would be appreciated! Thanks in advance
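One thing that can be checked from inside the Dagit container is how often the kernel is actually throttling it. The following is a minimal sketch, assuming the container exposes the standard cgroup `cpu.stat` file at one of the usual paths (the paths and the helper names `parse_cpu_stat`, `throttled_fraction`, and `read_cpu_stat` are illustrative, not part of Dagster):

```python
from pathlib import Path

# Candidate locations of the CPU stats inside a container (assumed, may differ per node).
# cgroup v2 usually exposes /sys/fs/cgroup/cpu.stat; cgroup v1 uses a cpu controller dir.
CPU_STAT_PATHS = (
    Path("/sys/fs/cgroup/cpu.stat"),
    Path("/sys/fs/cgroup/cpu/cpu.stat"),
    Path("/sys/fs/cgroup/cpu,cpuacct/cpu.stat"),
)


def parse_cpu_stat(text):
    """Parse 'key value' lines from a cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


def read_cpu_stat():
    """Return parsed stats from the first cpu.stat file found, or {} if none exists."""
    for path in CPU_STAT_PATHS:
        if path.is_file():
            return parse_cpu_stat(path.read_text())
    return {}
```

Calling `throttled_fraction(read_cpu_stat())` a few times while the pipeline is running gives a rough idea of whether the 1000 millicore limit is the bottleneck; a value close to 1.0 means the container is throttled in almost every period.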
a
do you have a browser window open connected to dagit? Especially if you are watching an execution, we are subscribed to the events db to pick up what's happening
i
no
I am watching the logs with k8s logs
maybe the LOG backend?
it seems it is in DEBUG mode
a
what run launcher are you using? if it's the default, then the dagit server is also driving the pipeline execution in subprocesses
the steps are happening in celery workers, but there is a process that is submitting steps to celery and consuming the results
i
it seems to be the default
# ==================================================================================================
    # Run Launcher
    # ==================================================================================================
    # Component that determines where runs are executed.
    run_launcher:
      module: dagster.core.launcher
      class: DefaultRunLauncher

    # ==================================================================================================
    # Run Storage
    # ==================================================================================================
    # Controls how the history of runs is persisted. Can be set to SqliteRunStorage (default) or
    # PostgresRunStorage.
    run_storage:
      module: dagster_postgres.run_storage
      class: PostgresRunStorage
      config:
        postgres_db:
          username:
            env: DAGSTER_PG_USERNAME
          password:
            env: DAGSTER_PG_PASSWORD
          hostname:
            env: DAGSTER_PG_HOST
          db_name:
            env: DAGSTER_PG_DB
          port: 5432
as far as I can remember, you already told me this: I should use another run launcher... hm... @alex, can you please point me to the documentation or example code?
tomorrow, I am going to check that....
a
you could also just turn up the cpu requirements for dagit - whatever load it's under shouldn't really increase as your computations get more complex
i
thanks, tomorrow I am going to reread it and try to figure something out, but I hope this will not start a container, because that would be a very big overhead.
a
it will - so might not be the path you are looking for
i
so the default run launcher is good then?
these are very small computations, approximately 1-2 seconds each
is dagit stateless? can I scale it up and down?
a
not in this default configuration - the pipeline run processes are happening on that machine, so if you scaled down, those pipeline runs would fail
i
maybe it's best if I drop celery, use only Dagit with in-process execution, and scale dagit up/down
ok, I see
a
scaling up should be fine
i
do you have some teardown signal?
if not, are you planning to implement something to help achieve graceful shutdowns?
with health checks
a
we handle graceful shutdown on sigint, and have flags on some commands to remap the sigterm k8s sends for teardown to sigint
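For illustration, remapping SIGTERM to SIGINT in a Python process can be sketched roughly as below. This is not Dagster's actual implementation; `sigterm_as_sigint`, `install_sigterm_remap`, and `run_until_interrupted` are hypothetical names, and the loop is only a stand-in for a long-running server.

```python
import signal


def sigterm_as_sigint(signum, frame):
    """Treat SIGTERM like Ctrl-C by raising the exception SIGINT normally produces."""
    raise KeyboardInterrupt


def install_sigterm_remap():
    """Register the remapping handler for SIGTERM.

    Must be called from the main thread; Python only allows signal.signal there.
    """
    signal.signal(signal.SIGTERM, sigterm_as_sigint)


def run_until_interrupted(work):
    """Hypothetical server loop: call work() until interrupted, then shut down cleanly."""
    try:
        while True:
            work()
    except KeyboardInterrupt:
        # Cleanup (finishing in-flight work, closing connections) would go here.
        return "graceful shutdown"
```

With `install_sigterm_remap()` called at startup, the SIGTERM that K8s sends on pod teardown takes the same code path as Ctrl-C, so any existing SIGINT cleanup logic is reused.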
i
k8s can remove the dagit pod from network scheduling, and during the grace period dagit can finish its queued jobs
cool
tomorrow I am going to test this a little bit..thanks for the tips
@alex SIGINT works, but it actually only stops accepting connections; dagit does not terminate as you said. I think Dagit should handle SIGTERM, and I think you should also introduce a proper health check endpoint with liveness (less interesting 😉) and readiness (this is the interesting part). So if Dagit gets a signal like SIGTERM, or is just not feeling well 🙂, it should switch the readiness endpoint state 🙂 so that network scheduling can be disabled; in a K8s environment this is automatic.
for now, I configured SIGINT and a grace period
a
cool, would you be willing to file an issue?
i
@alex ofc, I am still testing... afterwards I am going to do it
a
thanks, much appreciated, very useful to have actual user context in issues when possible
i
ok ok, I am going to upload everything... I figured out a health check script until you can introduce a proper endpoint
like this one; ofc, the grep is tailored for me 😉
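The script itself is not preserved in this thread. As a stand-in, here is a minimal sketch of a readiness check built on the observation above that Dagit stops accepting connections after SIGINT. It only tests whether a TCP connection can be opened; the function names and the default port 3000 are assumptions to adjust for the actual deployment.

```python
import socket


def is_accepting(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def main(argv):
    """Return 0 (ready) or 1 (not ready), suitable as a probe exit code."""
    # Port 3000 is an assumption; pass the port Dagit listens on as the first argument.
    port = int(argv[1]) if len(argv) > 1 else 3000
    return 0 if is_accepting("127.0.0.1", port) else 1
```

Saved as a script that ends with `raise SystemExit(main(sys.argv))` (plus `import sys`), this could back a K8s exec readiness probe: exit code 0 marks the pod ready, anything else takes it out of service endpoints.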