# deployment-kubernetes
c
Had a job fail overnight, and Dagster says this:
```
datagun_etl_gsheets (fac0512e-a510-4ab9-87f7-528dd3b81fa8) started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
```
Pretty sure I traced the event that led to this in the GKE logs:
```
Scale-down: removing node gk3-dagster-cloud-nap-1jnp3gdh-282a8794-g2v8, utilization: {0.7951653944020356 0.2775837641283469 0 cpu 0.7951653944020356}, pods to reschedule: dagster-cloud/dagster-run-fac0512e-a510-4ab9-87f7-528dd3b81fa8-x27xz
```
I'm running GKE Autopilot with the multiprocess executor. Not sure why it decided to scale down the node in the middle of a run. Any ideas? Is there a good way to handle this? I have retries enabled, but the retry errored for another reason that I'll post in the thread.
The retry failed with
```
Execution of run for "datagun_etl_gsheets" failed. Pipeline failure during initialization for pipeline "datagun_etl_gsheets". This may be due to a failure in initializing the executor or one of the loggers.
dagster._core.errors.DagsterInvariantViolationError: Cannot perform reexecution with in-memory io managers.
To enable reexecution, you can set a persistent io manager, such as the fs_io_manager, in the resource_defs argument on your job: resource_defs={"io_manager": fs_io_manager}

Stack Trace:
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/context_creation_pipeline.py", line 342, in orchestration_context_event_generator
    _validate_plan_with_context(execution_context, execution_plan)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/context_creation_pipeline.py", line 411, in _validate_plan_with_context
    validate_reexecution_memoization(pipeline_context, execution_plan)
  File "/root/app/__pypackages__/3.10/lib/dagster/_core/execution/memoization.py", line 24, in validate_reexecution_memoization
    raise DagsterInvariantViolationError(
```
however, I'm already using the `gcs_io_manager` for that job, so not sure why it thinks I'm using an in-memory one
ok, found a little more info. the run pod was successfully transferred to a new node, but Dagster didn't like it restarting:
```
Successfully assigned dagster-cloud/dagster-run-fac0512e-a510-4ab9-87f7-528dd3b81fa8-nf9wj to gk3-dagster-cloud-nap-ydkcnud3-76ec00cf-pu8c
```
this happened 9 seconds before the initial log I posted (tagged with the same run ID):
```
Started container dagster
```
a
cc @chris on the `validate_reexecution_memoization` issue
> Not sure why it decided to scale down the node in the middle of a run. Any ideas?
there must be autoscaling enabled. i believe there are tags/labels you can set to tell the autoscaler that certain pods are undesirable to reschedule, to reduce how often this happens
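For the autoscaler hint mentioned above, the usual mechanism is the `cluster-autoscaler.kubernetes.io/safe-to-evict` pod annotation, which dagster-k8s can attach to run pods through the `dagster-k8s/config` job tag. A minimal sketch, assuming `pod_template_spec_metadata` is the correct config key for your Dagster version (verify against the dagster-k8s docs, and note GKE Autopilot places its own limits on this annotation):

```python
# Sketch: job tags asking dagster-k8s to annotate the run pod so the
# cluster autoscaler avoids evicting it during scale-down.
# Assumption: "pod_template_spec_metadata" is the dagster-k8s/config key
# for pod metadata in your Dagster version -- double-check the docs.
SAFE_TO_EVICT_TAGS = {
    "dagster-k8s/config": {
        "pod_template_spec_metadata": {
            "annotations": {
                # "false" tells cluster-autoscaler not to evict this pod
                # when it considers removing the node.
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
            }
        }
    }
}

# Attach to the job definition, e.g.:
# @job(tags=SAFE_TO_EVICT_TAGS)
# def datagun_etl_gsheets():
#     ...
```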
> Is there a good way to handle this?
Retries are the current best way to handle this. If all of your computations are safe to retry, you can configure automatic run retries https://docs.dagster.io/deployment/run-retries#run-retries
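Per the run-retries docs linked above, a per-job retry cap can be set with the `dagster/max_retries` tag. A hedged sketch (this assumes run retries are also enabled at the instance/deployment level, as described in the docs):

```python
# Sketch: opt a job into automatic run retries via tags.
# Assumption: run retries are enabled in the instance/deployment settings
# (see the docs link above); this tag only sets the per-run retry cap.
RUN_RETRY_TAGS = {
    "dagster/max_retries": "3",  # retry a failed run up to 3 times
}

# @job(tags=RUN_RETRY_TAGS)
# def datagun_etl_gsheets():
#     ...
```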
c
thanks @alex, I'll look into those tags. is the retry failure related to a known issue?
a
that's new to me - i am guessing it's a relatively easily solved bug
c
ok cool, let me know if you need more information