# ask-community
d
I'm currently moving my "research" PyTorch code to Dagster. The training is failing with the following error:
```
AttributeError: Can't pickle local object 'SeqDataset.make_collate_fn.<locals>.collate_fn'
```
This is happening inside the `torch.DataLoader`; I'm using multiprocessing with `num_workers > 0`, but it happens even with `num_workers=0`. This did not happen outside of Dagster, so I assume it has something to do with Dagster's Multiprocess Executor. Sadly, my understanding of multiprocessing is not the best, so I'm stuck here. I was unable to google anything relevant. Would appreciate any help... The last lines of the error:
```
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)

AttributeError: Can't pickle local object 'SeqDataset.make_collate_fn.<locals>.collate_fn'
```
@owen @sandy maybe one of you guys can help?
y
do you need multiprocess? if not, you can turn it off by configuring the run:
```yaml
execution:
  config:
    in_process: null
```
d
I probably don’t need it locally, and in production K8s pods it won’t matter, right?
s
> in production K8s pods it won’t matter, right?
Right. Btw, I suspect what's going on here is that the asset function is accessing something defined at a scope outside the asset function. When the Python multiprocessing library starts a new process, it needs to bring that object into the new process, so it tries to pickle it, but that thing isn't pickle-able
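A minimal sketch of what Sandy describes, using hypothetical names that echo the `SeqDataset.make_collate_fn` in the traceback: pickle serializes plain functions by their qualified import path, so a function defined inside a method is a "local object" it cannot serialize for the new process.

```python
import pickle

class SeqDataset:
    # Hypothetical stand-in for the dataset in the traceback.
    def make_collate_fn(self, pad_value=0):
        # collate_fn is a *local* object closing over pad_value;
        # pickle can only serialize functions importable by qualified name.
        def collate_fn(batch):
            return [x if x is not None else pad_value for x in batch]
        return collate_fn

fn = SeqDataset().make_collate_fn()
try:
    pickle.dumps(fn)
except (AttributeError, pickle.PicklingError) as exc:
    # On Python 3.10 this prints the same message as the traceback:
    # Can't pickle local object 'SeqDataset.make_collate_fn.<locals>.collate_fn'
    print(exc)
```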
d
Thanks Sandy. Basically that's true, I have a `make_collate_fn` method that has a `def collate_fn` inside. It's a closure over some variables. I wonder why this isn't an issue for torch's multiprocessing? I refactored the function and it's working now
s
it's also possible that the forkserver start_method would help with this: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
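The `popen_spawn_posix` frames in the traceback show why pickling happens at all: under `fork` the child inherits the parent's memory and nothing needs to be pickled, while `spawn` (and `forkserver`, which forks workers from a clean server process) must pickle whatever crosses the process boundary. A stdlib-only illustration, assuming a Unix platform where `forkserver` is available:

```python
import multiprocessing as mp

def square(x):  # module-level, so spawn/forkserver can pickle a reference to it
    return x * x

if __name__ == "__main__":
    # Typically ['fork', 'spawn', 'forkserver'] on Linux.
    print(mp.get_all_start_methods())
    ctx = mp.get_context("forkserver")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

The exact run-config shape for setting the executor's start method is on the linked Dagster docs page; the snippet above only shows the underlying stdlib behavior.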