# deployment-kubernetes
c
Had a job fail overnight, and Dagster says this:
```
datagun_etl_gsheets (fac0512e-a510-4ab9-87f7-528dd3b81fa8) started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
```
Pretty sure I traced the event that led to this in the GKE logs:
```
Scale-down: removing node gk3-dagster-cloud-nap-1jnp3gdh-282a8794-g2v8, utilization: {0.7951653944020356 0.2775837641283469 0 cpu 0.7951653944020356}, pods to reschedule: dagster-cloud/dagster-run-fac0512e-a510-4ab9-87f7-528dd3b81fa8-x27xz
```
I'm running GKE Autopilot with the multiprocess executor. Not sure why it decided to scale down the node in the middle of a run. Any ideas? Is there a good way to handle this? I have retries enabled, but the retry errored for another reason that I'll post in the thread.
The retry failed with
```
Execution of run for "datagun_etl_gsheets" failed. Pipeline failure during initialization for pipeline "datagun_etl_gsheets". This may be due to a failure in initializing the executor or one of the loggers.
dagster._core.errors.DagsterInvariantViolationError: Cannot perform reexecution with in-memory io managers.
To enable reexecution, you can set a persistent io manager, such as the fs_io_manager, in the resource_defs argument on your job: resource_defs={"io_manager": fs_io_manager}

Stack Trace:
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/context_creation_pipeline.py", line 342, in orchestration_context_event_generator
    _validate_plan_with_context(execution_context, execution_plan)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/context_creation_pipeline.py", line 411, in _validate_plan_with_context
    validate_reexecution_memoization(pipeline_context, execution_plan)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/memoization.py", line 24, in validate_reexecution_memoization
    raise DagsterInvariantViolationError(
```
however, I'm already using the `gcs_io_manager` for that job, so not sure why it thinks I'm using an in-memory one
ok, found a little more info. the run pod was successfully transferred to a new node, but Dagster didn't like it restarting:
```
Successfully assigned dagster-cloud/dagster-run-fac0512e-a510-4ab9-87f7-528dd3b81fa8-nf9wj to gk3-dagster-cloud-nap-ydkcnud3-76ec00cf-pu8c
```
this happened 9 seconds before the initial log I posted (tagged with the same run ID):
```
Started container dagster
```
a
cc @chris on the `validate_reexecution_memoization` issue
> Not sure why it decided to scale down the node in the middle of a run. Any ideas?
there must be autoscaling enabled. i believe there are tags/labels you can set to tell the autoscaler that certain pods are undesirable to reschedule, to reduce how often this happens
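For the autoscaler hint mentioned above, the usual mechanism is the `cluster-autoscaler.kubernetes.io/safe-to-evict` pod annotation, which dagster-k8s can attach to run pods through the `dagster-k8s/config` job tag. A minimal sketch, assuming `pod_template_spec_metadata` is the correct config key for your Dagster version (verify against the dagster-k8s docs, and note GKE Autopilot places its own limits on this annotation):

```python
# Sketch: job tags asking dagster-k8s to annotate the run pod so the
# cluster autoscaler avoids evicting it during scale-down.
# Assumption: "pod_template_spec_metadata" is the dagster-k8s/config key
# for pod metadata in your Dagster version -- double-check the docs.
SAFE_TO_EVICT_TAGS = {
    "dagster-k8s/config": {
        "pod_template_spec_metadata": {
            "annotations": {
                # "false" tells cluster-autoscaler not to evict this pod
                # when it considers removing the node.
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
            }
        }
    }
}

# Attach to the job definition, e.g.:
# @job(tags=SAFE_TO_EVICT_TAGS)
# def datagun_etl_gsheets():
#     ...
```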
> Is there a good way to handle this?
Retries are the current best way to handle this. If all of your computations are safe to retry, you can configure automatic run retries https://docs.dagster.io/deployment/run-retries#run-retries
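Per the run-retries docs linked above, a per-job retry cap can be set with the `dagster/max_retries` tag. A hedged sketch (this assumes run retries are also enabled at the instance/deployment level, as described in the docs):

```python
# Sketch: opt a job into automatic run retries via tags.
# Assumption: run retries are enabled in the instance/deployment settings
# (see the docs link above); this tag only sets the per-run retry cap.
RUN_RETRY_TAGS = {
    "dagster/max_retries": "3",  # retry a failed run up to 3 times
}

# @job(tags=RUN_RETRY_TAGS)
# def datagun_etl_gsheets():
#     ...
```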
c
thanks @alex, I'll look into those tags. is the retry failure related to a known issue?
a
that's new to me - i am guessing it's a relatively easily solved bug
c
ok cool, let me know if you need more information