Is there a way to re-execute a run from failure via the `run_failure_sensor`? dagster #ask-community

Charlie Bini

05/10/2023, 9:11 PM

Is there a way to re-execute a run from failure via the `run_failure_sensor`? I can identify the runs that I want to resume, but I'm not sure how to convert that into a `RunRequest`. Here's the code I have so far:

@run_failure_sensor
def run_execution_interrupted_sensor(context: RunFailureSensorContext):
    run_requests = []
    for event in context.get_step_failure_events():
        if event.event_specific_data.error_source == ErrorSource.INTERRUPT:
            ...
            run_requests.append(RunRequest(...))

    return SensorResult(run_requests=run_requests)

Charlie Bini

05/10/2023, 9:12 PM

trying to build a smarter way to combat k8s killing run pods. Usually, this happens when GKE changes the nodes under the hood

Jordan

05/10/2023, 9:34 PM

Hi! I personally use GraphQL to launch runs from `run_failure_sensor`.

sean

05/11/2023, 12:19 PM

Hi Charlie, I don’t think it’s possible to “directly” reexecute from a run failure sensor, but I’ve reached out to the team for confirmation. To elaborate on what Jordan is saying, you could use the GQL python client as a workaround: https://docs.dagster.io/concepts/dagit/graphql-client#graphql-python-client

Charlie Bini

05/11/2023, 2:46 PM

nice yeah I hadn't considered graphql

sean

05/11/2023, 3:04 PM

Hey Charlie, I’ve confirmed you can’t reexecute from failure via the sensor. However, you might be able to avoid the need for this with Run Retries
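Editor's note: a minimal sketch of what opting a job into Run Retries could look like. It assumes run retries are already enabled for the instance or deployment (not shown), and it relies on the `dagster/max_retries` and `dagster/retry_strategy` run tags. The retry count and the job name `my_job` are hypothetical, chosen only for illustration.

```python
# Tags that ask Dagster to retry failed runs of a job automatically.
# Assumption: run retries are enabled for the instance or deployment.
RUN_RETRY_TAGS = {
    "dagster/max_retries": 3,  # hypothetical limit, pick what suits the job
    "dagster/retry_strategy": "FROM_FAILURE",  # retry only the failed steps
}

# Usage, assuming a Dagster job is defined alongside these tags:
#
#     from dagster import job
#
#     @job(tags=RUN_RETRY_TAGS)
#     def my_job():  # hypothetical job name
#         ...
```

Whether a pod killed by Kubernetes is picked up by this mechanism depends on the run being marked as failed in the first place, so it is worth testing against a deliberately interrupted run.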

Charlie Bini

05/11/2023, 7:14 PM

@sean should the graphql client work for dagster cloud?

Charlie Bini

05/11/2023, 7:15 PM

I'm trying a simple `get_run_status` call and it's just hanging

Charlie Bini

05/11/2023, 7:43 PM

changed some stuff and I'm seeing

gql.transport.exceptions.TransportServerError: 401 Client Error: Unauthorized for url: https://kipptaf.dagster.cloud/graphql

Charlie Bini

05/11/2023, 7:44 PM

here's my code

@run_failure_sensor
def run_execution_interrupted_sensor(context: RunFailureSensorContext):
    client = DagsterGraphQLClient(hostname="kipptaf.dagster.cloud")

    for event in context.get_step_failure_events():
        run_status = client.get_run_status(event.logging_tags["run_id"])

        context.log.info(run_status)

Charlie Bini

05/11/2023, 7:46 PM

@Jordan if you have any snippets you can share, that'd be appreciated

Charlie Bini

05/11/2023, 7:58 PM

difference between the two is the port:
• no port = 401
• port 3000 = hanging

Jordan

05/11/2023, 8:08 PM

I don't use the dagster cloud version so I can't really help you with that. My code looks like this:

url = "http://127.0.0.1:3000/graphql"
payload = {"query": graphql_query, "variables": variables}
headers = {"Content-Type": "application/json"}
response = requests.request("POST", url=url, json=payload, headers=headers)
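Editor's note: Jordan's snippet leaves `graphql_query` and `variables` undefined. The sketch below shows one way they might be filled in to re-execute a run from failure. The mutation shape (`launchRunReexecution` with `reexecutionParams` and the `FROM_FAILURE` strategy) is an assumption that should be checked against the GraphQL schema of the Dagster version in use. It uses the standard library's `urllib` instead of `requests` so it runs without extra dependencies, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed mutation shape; verify it against your Dagster GraphQL schema.
REEXECUTE_FROM_FAILURE_MUTATION = """
mutation ReexecuteFromFailure($parentRunId: String!) {
  launchRunReexecution(
    reexecutionParams: {parentRunId: $parentRunId, strategy: FROM_FAILURE}
  ) {
    __typename
    ... on LaunchRunSuccess { run { runId } }
    ... on PythonError { message }
  }
}
"""


def build_reexecution_request(url, parent_run_id, headers=None):
    """Build (but do not send) the POST request for the re-execution mutation."""
    payload = {
        "query": REEXECUTE_FROM_FAILURE_MUTATION,
        "variables": {"parentRunId": parent_run_id},
    }
    all_headers = {"Content-Type": "application/json", **(headers or {})}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


def launch_reexecution(url, parent_run_id, headers=None, timeout=30):
    """Send the mutation and return the decoded JSON response."""
    request = build_reexecution_request(url, parent_run_id, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Inside a run failure sensor, `parent_run_id` would be the ID of the failed run, and checking `__typename` in the response tells you whether the launch succeeded.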

Daniel Gafni

06/05/2023, 8:06 PM

Hey @sean, so how do you authorize the GraphQL client for Dagster Cloud? The docs don't seem to cover that.

sean

06/09/2023, 10:20 PM

Does this help? https://github.com/dagster-io/dagster/discussions/7772

Daniel Gafni

06/09/2023, 10:25 PM

Yes, thank you
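Editor's note: the thread does not spell out the fix, so the following is a sketch under stated assumptions rather than a summary of the linked discussion. It assumes Dagster Cloud expects a user token in a `Dagster-Cloud-Api-Token` header and serves GraphQL under a deployment-specific path. The deployment name `prod`, the environment variable name, and the helper name are hypothetical. This would also be consistent with the 401 seen earlier when no token was sent.

```python
import os


def dagster_cloud_graphql_config(organization, deployment, token):
    """Return the GraphQL URL and auth headers for a Dagster Cloud deployment.

    Assumptions: the endpoint lives at /<deployment>/graphql and the token is
    passed in the Dagster-Cloud-Api-Token header.
    """
    url = f"https://{organization}.dagster.cloud/{deployment}/graphql"
    headers = {"Dagster-Cloud-Api-Token": token}
    return url, headers


# Hypothetical usage: read the token from an environment variable so it is
# never hard-coded in the sensor definition.
url, headers = dagster_cloud_graphql_config(
    organization="kipptaf",
    deployment="prod",  # assumed deployment name
    token=os.environ.get("DAGSTER_CLOUD_API_TOKEN", ""),
)
```

Recent versions of `DagsterGraphQLClient` also appear to accept a `headers` argument, in which case the same dictionary could be passed there with a hostname of the form `<organization>.dagster.cloud/<deployment>`.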
