It would be nice to have a way to allow some steps to fail without failing the whole job. Within our application code we can do this with regular Python exception handling, but occasionally we hit Kubernetes / AWS API errors, and in those cases we usually want the job to process everything else it can. That's especially true for ops in our DAG that have no outputs, since the downstream steps can still attempt to run. A current example: k8s pod OOMKilled errors on a couple of steps in a fan-out / dynamic-output stage with a couple thousand steps. We'd like to be notified about those failures, but retrying doesn't make sense, and we still want all the other steps in the fan-out to run.
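To make the desired behavior concrete, here's a minimal plain-Python sketch (not orchestrator API, and `process_item` / `run_fan_out` are hypothetical names): each item in a fan-out is processed independently, a per-step failure is recorded instead of aborting the run, and the failures are surfaced at the end.

```python
def process_item(item):
    # Stand-in for one fan-out step that can fail with an
    # infrastructure-level error (e.g. its pod is OOMKilled).
    if item % 1000 == 0:
        raise MemoryError(f"step for item {item} was OOMKilled")
    return item * 2

def run_fan_out(items):
    """Run every step; collect failures rather than failing the whole job."""
    results, failures = {}, {}
    for item in items:
        try:
            results[item] = process_item(item)
        except Exception as exc:  # tolerate this step's failure
            failures[item] = exc  # keep it so we can still report it
    return results, failures

results, failures = run_fan_out(range(1, 2001))
# All other steps complete; the two simulated OOMKilled steps (items
# 1000 and 2000) end up in `failures` for reporting, with no retry.
```

This is exactly what we can already do inside application code; the ask is for the same "record the failure, keep going" semantics when the failure happens outside our code, at the pod / API level.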