Could use some help with a temporary pickled file ...
# ask-community
d
Could use some help with a temporary pickled file manager. We have an op that reads in data from an MSSQL server, which we need to pass to a downstream op, and then we want to discard that data. We have the job running with the `multiprocess_executor`, so we can't just use the `mem_io_manager`; I figure the best way to solve this is to write a temp IO manager. How would I go about deleting the file after the downstream op has finished processing it?
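A minimal sketch of what that temp IO manager could look like, assuming each output has exactly one downstream consumer (the class name and path scheme below are illustrative, not from the thread). It writes each output to a pickle under the system temp dir and deletes it as soon as the downstream op loads it:
```python
import os
import pickle
import tempfile

from dagster import IOManager, io_manager


class TempPickleIOManager(IOManager):
    """Pickles each output to a temp file and deletes it once it is read."""

    def _path(self, output_context):
        # Deterministic path derived from the step identity, so the
        # downstream process can find the file without shared state
        # (each op runs in its own process under the multiprocess executor).
        return os.path.join(
            tempfile.gettempdir(),
            f"{output_context.run_id}_{output_context.step_key}_{output_context.name}.pkl",
        )

    def handle_output(self, context, obj):
        with open(self._path(context), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        path = self._path(context.upstream_output)
        with open(path, "rb") as f:
            obj = pickle.load(f)
        # The object now lives in the downstream process's memory, so the
        # file can go. Assumes a single consumer: a second op reading the
        # same upstream output would find the file already deleted.
        os.remove(path)
        return obj


@io_manager
def temp_pickle_io_manager(_):
    return TempPickleIOManager()
```
Wiring it in would just be `resource_defs={"io_manager": temp_pickle_io_manager}` on the job; the deterministic path matters because the multiprocess executor gives each process its own IO manager instance, so there is no shared in-memory state to consult.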
s
There isn't currently a great way to do this that I'm aware of (other than relying on the OS to clean up /tmp eventually). If you're able to file an issue on GitHub, we might be able to get to adding this.
d
s
@Danny Steffy Not with the multiprocess executor, since the setup and teardown happen per-process.
d
hm... so the only solution we currently have is to combine them into a single op then?
s
d
thank you!
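For reference, the combine-into-one-op workaround could look like this sketch (the helper names are placeholders for the real read and scoring logic); the raw rows then never cross an op boundary, so no IO manager ever pickles them to disk:
```python
from dagster import job, op


def fetch_rows_from_mssql():
    # Placeholder for the real MSSQL read.
    return [{"id": 1, "value": 0.5}]


def score(rows):
    # Placeholder for the real scoring logic.
    return [{**row, "score": row["value"] * 2} for row in rows]


@op
def read_and_score():
    # Read and score in a single op: the raw rows only ever exist
    # in this process's memory.
    return score(fetch_rows_from_mssql())


@op
def load_to_database(scored):
    # Placeholder for the real database load.
    pass


@job
def single_op_job():
    load_to_database(read_and_score())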
z
if the intermediate data you're trying not to persist is just being persisted inside the docker container for the run, wouldn't it automatically be cleaned up when the docker container's process completes?
d
we have 30k keys pulling in close to 1B rows total; we don't want them to persist after the downstream op completes
z
ah I see. is the size of the rows causing the container to crash somewhere later on?
d
We haven't tested that, but our gut instinct is that it will be too much data. I think the default IO manager also writes it out to a location that we have mounted on the docker container, so it would still persist to the VM
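If the mount is the concern, one hedge is that the default `fs_io_manager` takes a `base_dir`, so it could be pointed at a path inside the container that isn't a mounted volume; the pickles then disappear with the container. The path and names below are illustrative:
```python
from dagster import fs_io_manager, job, op

# "/tmp/dagster-io" is an illustrative in-container path, not a mount;
# outputs written here vanish when the container is removed.
ephemeral_io = fs_io_manager.configured({"base_dir": "/tmp/dagster-io"})


@op
def example_op():
    return 1


@job(resource_defs={"io_manager": ephemeral_io})
def ephemeral_storage_job():
    example_op()
```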
z
but all billion rows fit in memory?
d
No, we have 30k+ keys; we fan out each of those keys, pull in the data to score (on average 15k-20k rows per key), score the data, then load it to the database
So we're only ever keeping ~20k rows in memory per process at a time
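A sketch of that fan-out shape with Dagster's dynamic outputs (the key list, helpers, and op names are placeholders, not from the thread); under the multiprocess executor each mapped step runs in its own process, so only that key's ~15k-20k rows are held in memory there:
```python
from dagster import DynamicOut, DynamicOutput, job, op


def fetch_rows(key):
    # Placeholder for the per-key MSSQL read (~15k-20k rows in reality).
    return [{"key": key, "value": i} for i in range(3)]


def score(rows):
    # Placeholder for the real scoring logic.
    return [{**row, "score": row["value"] * 2} for row in rows]


@op(out=DynamicOut())
def emit_keys():
    # Placeholder: the real job would emit the 30k+ keys here.
    for key in ["key_a", "key_b"]:
        yield DynamicOutput(key, mapping_key=key)


@op
def score_key(key: str):
    return score(fetch_rows(key))


@op
def load_scores(scored):
    # Placeholder for the real database load.
    pass


@job
def fan_out_job():
    emit_keys().map(lambda key: load_scores(score_key(key)))
```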