It would be nice to have a way to allow some steps to fail without failing the whole job. Within our application code we can do this with regular Python exception handling, but occasionally we hit Kubernetes / AWS API errors, and in those cases we usually want the job to process everything else it can. That's especially true for ops in our DAG that have no outputs, since the downstream steps can still attempt to run. A current example: k8s pod OOMKilled errors on a couple of steps in a fan-out / dynamic-output stage with a couple thousand steps. We'd like to be notified about those failures, but retrying doesn't make sense, and we still want all the other steps in the fan-out to run.
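To make the desired behavior concrete, here's a minimal plain-Python sketch (not orchestrator API, and `process_item` / `run_fan_out` are hypothetical names): each item in a fan-out is processed independently, a per-step failure is recorded instead of aborting the run, and the failures are surfaced at the end.

```python
def process_item(item):
    # Stand-in for one fan-out step that can fail with an
    # infrastructure-level error (e.g. its pod is OOMKilled).
    if item % 1000 == 0:
        raise MemoryError(f"step for item {item} was OOMKilled")
    return item * 2

def run_fan_out(items):
    """Run every step; collect failures rather than failing the whole job."""
    results, failures = {}, {}
    for item in items:
        try:
            results[item] = process_item(item)
        except Exception as exc:  # tolerate this step's failure
            failures[item] = exc  # keep it so we can still report it
    return results, failures

results, failures = run_fan_out(range(1, 2001))
# All other steps complete; the two simulated OOMKilled steps (items
# 1000 and 2000) end up in `failures` for reporting, with no retry.
```

This is exactly what we can already do inside application code; the ask is for the same "record the failure, keep going" semantics when the failure happens outside our code, at the pod / API level.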