# ask-community
Hi everyone, I was wondering if anyone has any ideas on why I'm experiencing the following strange behaviour. I have a k8s one-pod-per-op job that runs a graph of seven ops in series. In the second op I generate dynamic outputs, one per data partition. My output code looks like this:
```python
for partition in data_partitions:
    partition_key = ....
    yield DynamicOutput(value={'partition': partition}, mapping_key=partition_key)
```
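For completeness, the op that yields these outputs is declared with a `DynamicOut`. A minimal sketch of its shape (the op name, input, and key derivation below are simplified placeholders, not my real code):

```python
from dagster import DynamicOut, DynamicOutput, op

@op(out=DynamicOut())
def generate_partitions(data_partitions):
    # One DynamicOutput per partition; mapping_key must be a unique
    # string of alphanumerics/underscores.
    for partition in data_partitions:
        partition_key = str(partition["id"])  # placeholder key derivation
        yield DynamicOutput(value={"partition": partition}, mapping_key=partition_key)
```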
These partitions are run through ops three to six (via mapping) before being collected and passed to the final op. I am using the GCS pickle IO manager and Dagster 1.6.7, and the pods run on spot instances in a GKE Autopilot cluster.

The job runs as expected on test datasets with a handful of partitions. On larger ones (50+ partitions), however, roughly 80% of partitions complete the partitioned phase, some fail due to pre-emption/node autoscaling, and the rest complete the first or second op of the partitioned phase and then don't continue; the UI doesn't display the downstream op. You can see this in the top partition in the screenshot provided. When I then restart from failure, only the failed partitions complete the rest of their ops. The op that follows partition collection can still run despite not all the partitions completing, but I get this log for each one that was missed: _No previously stored outputs found for source StepOutputHandle(step_key='partition_name...', output_name='result', mapping_key=None). This is either because you are using an IO Manager that does not depend on run ID, or because all the previous runs have skipped the output in conditional execution._

Big thanks in advance for any help/suggestions!
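For reference, the graph is wired roughly like this; the op names below are placeholders for my real ops, and the relevant part is just the `.map()`/`.collect()` pattern:

```python
from dagster import graph

@graph
def my_partitioned_graph():
    # Op one loads the data, op two fans out one dynamic output per partition.
    data = load_data()
    partitions = generate_partitions(data)

    # Ops three to six run once per dynamic output via .map().
    processed = partitions.map(lambda p: op_six(op_five(op_four(op_three(p)))))

    # The final op receives the collected per-partition results.
    finalize(processed.collect())

# The job is built from this graph with the k8s_job_executor (one pod per op)
# and the GCS pickle IO manager; executor/resource config omitted here.
```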