# ask-community
f
hello, i just submitted https://github.com/dagster-io/dagster/issues/8314 : the dagster k8s executor passes "known_state" to ops on the command line. this fails for large DAGs, because the known_state object is too big to fit on the command line. if there is a workaround for this, i'd be interested, since i'm a bit blocked right now
d
Hey frank - do you have the exact character count for the args handy?
(maybe by describing the pod)
f
i can get it from the yaml yes
d
was just looking around for workarounds - this post suggests that increasing the stack size will also increase the character limit but i'm not sure if that's possible on k8s https://unix.stackexchange.com/questions/45583/argument-list-too-long-how-do-i-deal-with-it-without-changing-my-command
(while we figure out a better real fix)
f
171451
d
maybe if you could DM us the raw value of the args that'd be helpful (to see if fixes like gzipping might help)
f
here is the yaml
i did this to get the char count: cat /tmp/o | yq '.spec.containers[0].args' | wc -c
seems increasing stack size is not possible on k8s 😞 https://github.com/kubernetes/kubernetes/issues/3595
fixing this won't be very trivial i guess? if you don't pass the state via cmdline args, then you'd have to create a configmap or something to pass it and mount that configmap in the job?
which i guess is a non-trivial change
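something like this, maybe (just a rough sketch with the kubernetes python client; the namespace, the configmap name, and the mount path are all made up):

```python
import json

from kubernetes import client, config

# rough sketch of the configmap idea above; namespace, names and paths are made up
config.load_incluster_config()  # or config.load_kube_config() outside the cluster
api = client.CoreV1Api()

known_state_json = json.dumps({"...": "..."})  # the payload that no longer fits in argv

api.create_namespaced_config_map(
    namespace="dagster",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="known-state-for-run"),
        data={"known_state.json": known_state_json},  # configmaps allow up to ~1MiB
    ),
)

# the step job would then mount the configmap and read the file
# instead of receiving known_state as a cli arg
volume = client.V1Volume(
    name="known-state",
    config_map=client.V1ConfigMapVolumeSource(name="known-state-for-run"),
)
mount = client.V1VolumeMount(name="known-state", mount_path="/mnt/known-state")
```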
d
yeah the two possibilities that come to mind are sending it as an env var, or gzipping it, or both. not trivial but also not a huge project
trying to think if there's anything that can unblock you today though
f
hm, but env vars have a 32kb limit too
d
true. gzipping that file you sent brings it down to about 8k
so maybe both
there's also likely ways to include less data there or maybe even persist it somewhere instead of passing it around
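the gzip + env var combo would look roughly like this (just a sketch of the idea, not the actual fix in dagster-k8s):

```python
import base64
import gzip
import json

# sketch: compress known_state before handing it to the step pod,
# and decode it inside the pod instead of reading it from argv
def encode_known_state(known_state: dict) -> str:
    raw = json.dumps(known_state).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def decode_known_state(encoded: str) -> dict:
    return json.loads(gzip.decompress(base64.b64decode(encoded)))
```

~171k of json came down to about 8k gzipped in your case, so even after base64 it should sit well under a 32kb env var limit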
f
the thing is my job size will increase with increased dataset size, i expect it to grow another order of magnitude
the job is building an octree and there is an op for every node in the octree
i wonder if the celery executor would suffer the same problem
(disadvantage of celery executor is that i then have to have a fixed amount of workers, now everything scales dynamically)
d
i definitely can't promise that celery executor wouldn't run into a similar issue
f
hm, then maybe i'm "abusing" dagster here? the reason i went for this design is that dagster manages retries for a job while keeping the results of the ops that already succeeded, so only the failed ops need to rerun
so basically:
the job takes a list of las tiles, which is passed to an op that creates octree nodes; an octree node is a bounding box plus the las tiles that intersect with that bounding box. then (this is where the "meat" of it happens) every octree node is materialized (a new las file is written for that octree node). and in the end, a set of summary jsons is written with information about the density of every node in the octree.
materializing an octree node is very intensive on CPU and IO, and takes a long time, so you can expect a number of pods to fail every time you run the job.
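roughly, the shape of the job is something like this (a sketch with made-up names, using dynamic outputs; the real code differs in the details):

```python
from dagster import DynamicOut, DynamicOutput, job, op


def build_octree(las_tiles):
    # placeholder for the real octree construction
    return [{"bbox": i, "tiles": las_tiles} for i in range(8)]


@op
def list_las_tiles():
    # made up: gather the input las tiles
    return ["tile_0.las", "tile_1.las"]


@op(out=DynamicOut())
def split_into_octree_nodes(las_tiles):
    # each node = bounding box + the las tiles intersecting it
    for i, node in enumerate(build_octree(las_tiles)):
        yield DynamicOutput(node, mapping_key=f"node_{i}")


@op
def materialize_node(node):
    # cpu/io heavy: writes a new las file for this octree node, returns its density
    return 0.0


@op
def write_summary_jsons(densities):
    # summary jsons with the density of every node in the octree
    ...


@job
def build_octree_job():
    densities = split_into_octree_nodes(list_las_tiles()).map(materialize_node)
    write_summary_jsons(densities.collect())
```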
so we first tried a single op --> no good, too slow. then without dagster --> no good, no incremental retry in case of failure.
so i wonder if there is another primitive that is more suited to what i'm trying to do
i could try to write all nodes to a postgres database, then create a sensor on that table that launches a job per node (so basically turning an op into a job)
then another sensor that watches for all jobs of one "umbrella job" to complete and creates the summary jsons
(i'm going to do that)
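the first sensor would look roughly like this (a sketch; the job, the op config, and the postgres helper are all made up):

```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"node_id": int})
def materialize_node(context):
    # cpu/io heavy work for a single octree node
    ...


@job
def materialize_node_job():
    materialize_node()


def fetch_pending_nodes():
    # placeholder: query the postgres table for octree nodes not yet materialized
    return [1, 2, 3]


@sensor(job=materialize_node_job)
def pending_octree_nodes(context):
    for node_id in fetch_pending_nodes():
        yield RunRequest(
            run_key=f"node-{node_id}",  # dedupe: each node launches at most one run
            run_config={"ops": {"materialize_node": {"config": {"node_id": node_id}}}},
        )
```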