# community-showcase
In case it’s helpful to someone, here’s a quick-and-dirty bash script I created to let me selectively clean the outputs from successful jobs. Let me know if there’s a better approach!

When to use:
• If you’re working with large, uncompressed data in a dev or local environment and your disk space just isn’t big enough
• You want to keep the logs from the runs, just not the IO from each op
• You accept that a re-run of the job won’t be able to resume from an intermediate step

When not to use:
• If you’re uncomfortable with a random guy’s bash script running a recursive find and remove command 🙂
• If you’re in production or otherwise doing anything critically important
• Please make sure nothing else lives in `$DAGSTER_HOME`! If you’re unsure, don’t run it

How to use it:
• Add the function to your `.bashrc`, `.bash_profile`, or other appropriate file
• Reload your terminal or `source` the file above
• Run `dagprune name_of_your_op`

How it works:
For each run, the script checks whether a `name_of_your_op.complete` file exists in the logs and whether a corresponding IO directory also exists. If both do, it removes every directory in that run’s directory except `compute_logs`. Technically, you could specify any op, but the intended use is cleaning up outputs from successfully completed runs (we leave failures alone, since we might actually want to re-run them from their point of failure).
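```bash
dagprune() {
	# Use return, not exit: this function is sourced into your interactive
	# shell, and exit would close the terminal.
	if [ -z "$DAGSTER_HOME" ]; then
		echo "Make sure to set env var DAGSTER_HOME before running this command" >&2
		return 1
	fi
	if [ -z "$1" ]; then
		echo "Usage: dagprune name_of_your_op" >&2
		return 1
	fi
	local FINAL_OP="$1" rundir FILE FINAL_OP_DIR
	for rundir in "$DAGSTER_HOME"/storage/*/; do
		FILE="$rundir/compute_logs/$FINAL_OP.complete"
		FINAL_OP_DIR="$rundir/$FINAL_OP"
		if [ -f "$FILE" ] && [ -d "$FINAL_OP_DIR" ]; then
			echo "Completed run found. Cleaning op outputs in $rundir"
			# Remove only the run's top-level directories, keeping compute_logs.
			# -mindepth/-maxdepth stop find from descending into directories it
			# has just deleted and from touching anything inside compute_logs.
			find "$rundir" -mindepth 1 -maxdepth 1 -type d \
				! -name 'compute_logs' -exec rm -rf {} +
		fi
	done
}
```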
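
Given the warning above about running a recursive remove, a dry run may be worth doing first. The sketch below is not part of the original script: `dagprune_dryrun` is a hypothetical name, and it assumes the same layout the script reads (a `compute_logs/<op>.complete` marker and an `<op>` directory inside each run directory under `$DAGSTER_HOME/storage`). It only prints the top-level directories of matching runs, other than `compute_logs`, and deletes nothing.

```bash
# Dry-run sketch: prints what would be cleaned and removes nothing.
# dagprune_dryrun is a hypothetical helper, not part of the original script.
dagprune_dryrun() {
	if [ -z "$DAGSTER_HOME" ]; then
		echo "Make sure to set env var DAGSTER_HOME before running this command" >&2
		return 1
	fi
	if [ -z "$1" ]; then
		echo "Usage: dagprune_dryrun name_of_your_op" >&2
		return 1
	fi
	for rundir in "$DAGSTER_HOME"/storage/*/; do
		# Same completion check as dagprune: marker file plus op output directory.
		if [ -f "$rundir/compute_logs/$1.complete" ] && [ -d "$rundir/$1" ]; then
			# List the run's top-level directories other than compute_logs.
			find "$rundir" -mindepth 1 -maxdepth 1 -type d ! -name 'compute_logs' -print
		fi
	done
}
```

Run `dagprune_dryrun name_of_your_op` and check the list. If it only shows directories you are happy to lose, run `dagprune` with the same argument.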