# ask-community

Danny Steffy

03/03/2023, 9:44 PM
Could use some help with a temporary pickled-file IO manager. We have an op that reads in data from an MSSQL server, which we need to pass to a downstream op, and then we want to scrap that data. We have the job running with the multiprocess_executor, so we can't just use the `mem_io_manager`, so I figure the best way to solve this is to write a temp IO manager. How would I go about deleting the file after the downstream op has finished processing it?
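
For reference, a minimal sketch of one way such an IO manager could look, under the assumption that each output is consumed by exactly one downstream op: pickle the output to a temp directory in `handle_output`, then delete the file as soon as `load_input` has read it back. The names `TempPickleIOManager` and `temp_pickle_io_manager` are made up for the example, not an existing Dagster API.

```python
import os
import pickle
import tempfile

from dagster import IOManager, io_manager


class TempPickleIOManager(IOManager):
    """Pickle each op output to a temp dir and delete the file once it is loaded."""

    def __init__(self, base_dir=None):
        self._base_dir = base_dir or tempfile.gettempdir()

    def _path(self, output_context):
        # get_identifier() yields run id / step key / output name parts,
        # which keeps the filename unique per output.
        return os.path.join(
            self._base_dir, "_".join(output_context.get_identifier()) + ".pkl"
        )

    def handle_output(self, context, obj):
        with open(self._path(context), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        path = self._path(context.upstream_output)
        with open(path, "rb") as f:
            obj = pickle.load(f)
        # Only safe if exactly one downstream op reads this output.
        os.remove(path)
        return obj


@io_manager
def temp_pickle_io_manager(_init_context):
    return TempPickleIOManager()
```

Deleting inside `load_input` sidesteps the lack of a post-run cleanup hook, but it breaks down if more than one op reads the same output.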

sandy

03/03/2023, 10:23 PM
There isn't currently a great way to do this that I'm aware of (other than relying on the OS to eventually clean up /tmp). If you're able to file an issue on GitHub, we might be able to get to adding this.

Danny Steffy

03/03/2023, 10:23 PM

Spencer Nelson

03/03/2023, 10:24 PM
@Danny Steffy Not with the multiprocess executor, since the setup and teardown happen per-process
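
To spell out why: a Dagster resource defined as a generator does get setup and teardown, but under the multiprocess_executor each op runs in its own process with its own copy of the resource, so the teardown fires when that step's process exits rather than when the whole job finishes. The `scratch_dir` resource below is purely illustrative.

```python
import shutil
import tempfile

from dagster import resource


@resource
def scratch_dir(_init_context):
    # Setup: create a scratch directory for this process.
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        # Teardown: with the multiprocess_executor this runs when the step's
        # process finishes, so a file written here is already gone before a
        # downstream op in another process could read it.
        shutil.rmtree(path, ignore_errors=True)
```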

Danny Steffy

03/03/2023, 10:31 PM
hm... so the only solution we currently have is to combine them into a single op then?

sandy

03/03/2023, 10:56 PM

Danny Steffy

03/03/2023, 10:56 PM
thank you!

Zach

03/03/2023, 10:57 PM
If the intermediate data you're trying not to persist is just being persisted inside the Docker container for the run, wouldn't it automatically be cleaned up when the Docker container's process completes?

Danny Steffy

03/03/2023, 10:58 PM
We have 30k keys pulling in close to 1B rows total; we don't want them to persist after the downstream op completes.

Zach

03/03/2023, 11:02 PM
Ah, I see. Is the size of the rows causing the container to crash somewhere later on?

Danny Steffy

03/03/2023, 11:07 PM
We haven't tested that, but our gut instinct is that it will be too much data. I think the default IO manager also writes it out to a location that we have mounted on the Docker container, so it would still persist to the VM.

Zach

03/03/2023, 11:09 PM
but all billion rows fit in memory?

Danny Steffy

03/03/2023, 11:10 PM
No, we have 30k+ keys. We fan out each of those keys, pull in the data to score (on average 15k-20k rows per key), score the data, then load the results to the database.
So we're only ever keeping ~20k rows in memory per process at a time.
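
A sketch of that fan-out shape using Dagster's dynamic outputs; the op names and bodies (`emit_keys`, `pull_rows`, `score_rows`, `load_scores`) are placeholders for the real query, scoring, and load logic, and the io_manager binding is omitted for brevity.

```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def emit_keys():
    # Placeholder: the real op would query the 30k+ keys from MSSQL.
    for key in ["key_1", "key_2"]:
        yield DynamicOutput(key, mapping_key=key)


@op
def pull_rows(key):
    # Placeholder: fetch the ~15k-20k rows for this key.
    return [{"key": key, "value": 1}]


@op
def score_rows(rows):
    # Placeholder: run the model over the rows.
    return [{**row, "score": 0.5} for row in rows]


@op
def load_scores(scored_rows):
    # Placeholder: write the scored rows back to the database.
    pass


@job
def scoring_job():
    # Under the multiprocess_executor each mapped step runs in its own process,
    # so only one key's rows are held in memory per process at a time. Binding
    # the temp pickle IO manager sketched above would keep the intermediate
    # pickles from outliving the downstream step that reads them.
    emit_keys().map(lambda key: load_scores(score_rows(pull_rows(key))))
```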