In case it’s helpful to someone, here’s a quick-and-dirty bash script I created to let me selectively clean the outputs from successful jobs. Let me know if there’s a better approach!
When to use:
• You’re working with large, uncompressed data in a dev or local environment and your disk space just isn’t big enough
• You want to keep the logs from the runs, just not the IO from each op
• You accept that a re-run of the job won’t be able to resume from an intermediate step
When not to use:
• You’re uncomfortable with a random guy’s bash script running a recursive find and remove command 🙂
• You’re in production or otherwise doing anything critically important
• Anything other than run directories lives in $DAGSTER_HOME/storage. Please make sure nothing else is in there! If you’re unsure, don’t run it
How to use it:
• Add the function to your shell startup file, or other appropriate file
• Reload your terminal or source the file above
• Run dagprune name_of_your_op
How it works: For each run directory under $DAGSTER_HOME/storage, the script checks whether a completion marker file for the op exists in the run’s logs and whether a corresponding IO directory for that op also exists. If both do, it removes every subdirectory of the run’s directory except for the compute_logs directory. Technically, you could specify any op, but the intended use is to name the job’s final op, so that only successfully completed runs are cleaned (we leave failures alone, since we might actually want to re-run them from their point of failure).
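dagprune() {
	# Usage: dagprune name_of_your_op
	if [ -z "$DAGSTER_HOME" ]; then
		echo "Make sure to set env var DAGSTER_HOME before running this command"
		return 1  # return, not exit: exit would close the shell that sourced this function
	fi
	for rundir in "$DAGSTER_HOME"/storage/*/; do
		# ASSUMPTION: these two definitions were missing from the post; adjust them to your layout
		FILE="${rundir}compute_logs/$1.complete"
		FINAL_OP_DIR="${rundir}$1"
		if [ -f "$FILE" ] && [ -d "$FINAL_OP_DIR" ]; then
			echo "Completed run found. Cleaning op outputs in $rundir"
			pushd "$rundir" > /dev/null || continue
			# Remove every top-level directory in the run except compute_logs
			find . -mindepth 1 -maxdepth 1 -type d -not -name 'compute_logs' -exec rm -rf {} +
			popd > /dev/null
		fi
	done
}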
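
Since the script deletes things recursively, it may be worth previewing what it would touch first. Below is a minimal dry-run sketch, not part of the original script: dagprune_dryrun is a hypothetical name, and the marker path (compute_logs/<op>.complete) is an assumption about how the logs are laid out, so check it against your own storage directory. It relies on the -mindepth and -maxdepth options, which GNU and BSD find both support.

```shell
# Dry-run sketch: list what a prune for the given op would delete, without deleting anything.
dagprune_dryrun() {
	local op="$1" rundir marker
	if [ -z "$DAGSTER_HOME" ] || [ -z "$op" ]; then
		echo "Set DAGSTER_HOME and call as: dagprune_dryrun name_of_your_op" >&2
		return 1
	fi
	for rundir in "$DAGSTER_HOME"/storage/*/; do
		rundir="${rundir%/}"  # drop the trailing slash so printed paths stay clean
		# ASSUMPTION: a finished op leaves compute_logs/<op>.complete in the run directory
		marker="$rundir/compute_logs/$op.complete"
		if [ -f "$marker" ] && [ -d "$rundir/$op" ]; then
			# Print (rather than remove) each top-level directory except compute_logs
			find "$rundir" -mindepth 1 -maxdepth 1 -type d -not -name 'compute_logs' -print
		fi
	done
}
```

Run dagprune_dryrun name_of_your_op and read through the list. If anything in it looks like something you want to keep, don’t run the real thing.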
