
prefetcher shutdown hang when running with multiprocess + distributed reading service #1176

zhengwy888 opened this issue Jun 1, 2023 · 1 comment

@zhengwy888

🐛 Describe the bug

Prefetcher hangs indefinitely on shutdown(). The faulthandler stack traces indicate that the main thread is blocked at https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/prefetcher.py#L113 while the child thread is blocked at https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/prefetcher.py#L81, but I don't know why time.sleep would block on exit.
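
For reference, a minimal sketch (not from the original report, and assuming a Unix platform) of how per-thread stack traces of a stuck shutdown() can be dumped with the stdlib faulthandler module:

import faulthandler
import signal

# Dump every thread's traceback when the process receives SIGUSR1,
# e.g. `kill -USR1 <pid>` while dl.shutdown() is hanging.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump all thread tracebacks automatically if the process is still
# alive after 60 seconds, without terminating it.
faulthandler.dump_traceback_later(60, exit=False)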

Repro:

import random

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)

@functional_datapipe("frame_slicer")
class FrameSlicer(IterDataPipe):
    def __init__(self, source_datapipe) -> None:
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for fields in self.source_datapipe:
            video_id, seg_start, seg_end = fields
            for i in range(int(seg_start), int(seg_end)+1):
                yield (video_id, i)

def generate_entries():
    lines = []
    # start with a prime number to make sure we have uneven dataloaders
    random.seed(10)
    for i in range(37):
        frame_count = random.randint(5, 10)
        lines.append([f'video-{i}', 10, 10 + frame_count])
    return lines

def build_one_datapipe():
    entries = generate_entries()
    total_frames = sum([x[2] - x[1] + 1 for x in entries])
    dp = IterableWrapper(entries)
    dp = dp.shuffle()
    dp = dp.sharding_filter()
    dp = dp.frame_slicer()
    return dp, total_frames

def build_dataloader2():
    dp, total_frames = build_one_datapipe()

    # Distributed sharding is applied first, then multiprocess workers.
    mp_rs = MultiProcessingReadingService(num_workers=2)
    dist_rs = DistributedReadingService()
    rs = SequentialReadingService(dist_rs, mp_rs)

    dl = DataLoader2(dp, reading_service=rs)
    dl.seed(2)
    counter = 0
    video_ids = set()
    for data in dl:
        video_ids.add(data[0])
        counter += 1

    dl.shutdown()  # hang here
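
The repro as posted has no entry point; below is a minimal launch sketch, assuming DistributedReadingService requires an initialized default process group and that the script is started via torchrun (the gloo backend and the torchrun invocation are assumptions, not part of the original report):

import torch.distributed as dist

if __name__ == "__main__":
    # Assumes the env:// rendezvous variables are set, e.g. by torchrun.
    dist.init_process_group(backend="gloo")
    build_dataloader2()
    dist.destroy_process_group()

# Launched e.g. with: torchrun --nproc_per_node=1 repro.py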

Versions

PyTorch version: 2.0.0a0+gite9ebda2
Is debug build: False
CUDA used to build PyTorch: 12.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 12.0.1 (https://github.com/conda-forge/clangdev-feedstock d44358f44aef33e9fa7c5f93e2481ee8f1a04ab6)
CMake version: version 3.19.1
Libc version: glibc-2.31

Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-64-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: 12.0.140
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.3.0
[pip3] numpy==1.23.5
[pip3] pytorch3d==0.6.2
[pip3] torch==2.0.1+1684801906.cuda120.cudnn891.nccl218.ap
[pip3] torch-mlir==1684442443
[pip3] torch-scatter==2.1.0
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.7.0.dev20230601
[pip3] torchfile==0.1.0
[pip3] torchvision==0.15.1a0+42759b1
[conda] magma-cuda121             2.6.1                         1    pytorch
[conda] mkl                       2020.4             h726a3e6_304    conda-forge
[conda] mkl-include               2023.1.0         h84fe81f_48680    conda-forge
[conda] numpy                     1.23.5           py38h7042d01_0    conda-forge
[conda] pytorch3d                 0.6.2                    pypi_0    pypi
[conda] torch                     2.0.1+1684801906.cuda120.cudnn891.nccl218.ap          pypi_0    pypi
[conda] torch-mlir                1684442443               pypi_0    pypi
[conda] torch-scatter             2.1.0                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchvision               0.15.1a0+42759b1          pypi_0    pypi
@pableeto

I've run into the same problem. Are there any fixes or workarounds?
