Bug in CutPairsSampler with CutSet.from_files(): Fails to raise StopIteration at the end of dataset iteration, raises AttributeError: 'tuple' object has no attribute 'subset' #1396

Open
Aijohc opened this issue Sep 20, 2024 · 3 comments

Aijohc commented Sep 20, 2024

Hello, and thank you for the excellent work with Lhotse's data management features!

I encountered a bug when using CutPairsSampler. When I load my source_cuts and target_cuts using CutSet.from_files() (with a list of .jsonl.gz files), the expected StopIteration exception is not raised correctly at the end of the dataset iteration. Instead, I encounter a different error:

W0920 21:49:06.145000 140064889530176 torch/multiprocessing/spawn.py:146] Terminating process [PID] via signal SIGTERM
W0920 21:49:06.146000 140064889530176 torch/multiprocessing/spawn.py:146] Terminating process [PID] via signal SIGTERM
W0920 21:49:06.148000 140064889530176 torch/multiprocessing/spawn.py:146] Terminating process [PID] via signal SIGTERM
Traceback (most recent call last):
  File "[PATH]/trainer.py", line 1186, in <module>
    main()
  File "[PATH]/trainer.py", line 1177, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
    while not context.join():
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "[PATH]/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "[PATH]/trainer.py", line 1067, in run
    train_one_epoch(
  File "[PATH]/trainer.py", line 822, in train_one_epoch
    valid_info = compute_validation_loss(
  File "[PATH]/trainer.py", line 558, in compute_validation_loss
    for batch_idx, batch in enumerate(valid_dl):
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 1368, in _process_data
    self._try_put_index()
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 1350, in _try_put_index
    index = self._next_index()
  File "[PATH]/site-packages/torch/utils/data/dataloader.py", line 620, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "[PATH]/site-packages/lhotse/dataset/sampling/base.py", line 323, in __next__
    combined = combined + combined.subset(first=diff).modify_ids(
AttributeError: 'tuple' object has no attribute 'subset'
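The last frame of the traceback calls `.subset` on `combined`, which at that point is a tuple rather than a single cut collection. The sketch below reproduces that failure shape in plain Python. `MiniCutSet`, `pad_single`, and `pad_any` are hypothetical stand-ins written for illustration; they are not lhotse code, the `modify_ids` call from the traceback is left out, and the tuple-aware variant is only an assumed workaround, not the maintainers' fix.

```python
from typing import List, Tuple, Union


class MiniCutSet:
    """Hypothetical stand-in for a cut collection; NOT lhotse's CutSet."""

    def __init__(self, ids: List[str]) -> None:
        self.ids = list(ids)

    def __add__(self, other: "MiniCutSet") -> "MiniCutSet":
        return MiniCutSet(self.ids + other.ids)

    def subset(self, first: int) -> "MiniCutSet":
        # Keep only the first `first` items.
        return MiniCutSet(self.ids[:first])


Batch = Union[MiniCutSet, Tuple[MiniCutSet, ...]]


def pad_single(batch: MiniCutSet, diff: int) -> MiniCutSet:
    # Same shape as the expression in the traceback; assumes one collection.
    return batch + batch.subset(first=diff)


def pad_any(batch: Batch, diff: int) -> Batch:
    # Assumed tuple-aware variant: pad each member of a paired batch separately.
    if isinstance(batch, tuple):
        return tuple(pad_single(b, diff) for b in batch)
    return pad_single(batch, diff)


prompts = MiniCutSet(["p0", "p1"])
targets = MiniCutSet(["t0", "t1"])

try:
    pad_single((prompts, targets), diff=1)  # a paired batch is a tuple
except AttributeError as err:
    print(f"AttributeError: {err}")

padded = pad_any((prompts, targets), diff=1)
print([part.ids for part in padded])
```

The point of the sketch is only that any code path written for a single collection will fail this way as soon as a sampler hands it a (source, target) pair.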

CutPairsSampler

train_sampler = CutPairsSampler(
    cuts_train[0],
    cuts_train[1],
    max_target_duration=40,
    shuffle=True,
    drop_last=False,
    seed=self.args.seed,
)

Cuts

def train_cuts(self) -> Tuple[CutSet, CutSet]:
    logging.info("About to get train cuts")
    prompt_files = list(
        sorted(self.args.manifest_dir.glob("train/*.cuts.prompts.jsonl.gz"))
    )
    target_files = list(
        sorted(self.args.manifest_dir.glob("train/*.cuts.targets.jsonl.gz"))
    )
    prompts = CutSet.from_files(prompt_files + target_files, shuffle_iters=False)
    targets = CutSet.from_files(target_files + prompt_files, shuffle_iters=False)
    return prompts, targets

lhotse version: 1.26.0

I believe this could be an issue with how the end of the dataset is handled when iterating over CutPairsSampler. Could you please investigate this?

Thanks again for your hard work!

Additional Question:

I also have a question regarding the CutPairsSampler. Is it possible to specify parameters like buffer_size and quadratic_duration similar to the DynamicBucketingSampler? These parameters are very important when working with the DynamicBucketingSampler, and I noticed they are not directly available in CutPairsSampler. Could you consider supporting such parameters?

Thank you!

Aijohc changed the title from "Bug in CutPairsSampler with CutSet.from_files(): Fails to raise StopIteration at the end of dataset iteration" to "Bug in CutPairsSampler with CutSet.from_files(): Fails to raise StopIteration at the end of dataset iteration, raises AttributeError: 'tuple' object has no attribute 'subset'" on Sep 20, 2024
@pzelasko
Collaborator

Regarding the first issue, it looks like I haven't updated CutPairsSampler properly with the latest changes. I'll take a look.

Regarding the other question, you might want to use DynamicCutSampler or DynamicBucketingSampler instead; if you give them more than one CutSet, they act as CutPairsSampler (and support triples, quadruples, and so on as well). In fact, CutPairsSampler should be deprecated at this point.

@Aijohc
Author

Aijohc commented Sep 29, 2024

Thank you for your answer! So, does that mean DynamicCutSampler can completely replace CutPairsSampler? I will give it a try. Thanks!

@pzelasko
Collaborator

pzelasko commented Oct 1, 2024

Yes, it can.
