Speaker Diarization pipeline.get_segmentations produces integer ascending start/ends instead of something useful #1685

bschreck · 2024-04-05T16:29:02Z

Tested versions

3.1

System information

macOS 13.6 - pyannote 3.1 - M2 Air

Issue description

I'm running:
```
self.pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=os.environ["HF_API_KEY"]
)
segmentations = self.pipeline.get_segmentations({'waveform': torch.from_numpy(waveform), 'sample_rate': sample_rate})
splits = [(segment, data) for segment, data in segmentations]
```

Each segment's start and end times ascend by one, e.g. (0, 10), (1, 11), ... (5, 15).
These roughly match the length of the waveform (14.7 seconds), but they clearly don't represent speech turns, even though the waveform is real speech. When I run the full diarization pipeline instead, it diarizes correctly. The results are:
```
[(<Segment(1.16159, 2.41034)>, 'SPEAKER_00'), (<Segment(4.21597, 5.43097)>, 'SPEAKER_01'), (<Segment(5.76847, 6.39284)>, 'SPEAKER_00'), (<Segment(8.18159, 10.2741)>, 'SPEAKER_01'), (<Segment(11.3372, 12.9741)>, 'SPEAKER_00'), (<Segment(13.2947, 14.4591)>, 'SPEAKER_00')]
```
In both cases there are 6 segments.
Where do these latter segments get constructed?
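
One reading that fits these numbers, sketched under assumptions: the (start, end) pairs look like sliding analysis windows rather than speech turns. Assuming a 10 s window moved in 1 s steps, with one final window allowed to run past the end of the audio (both values are guesses read off the output above, not taken from the pyannote source), a 14.7 s waveform would give exactly six windows. `chunk_windows` is a hypothetical helper, not a pyannote function.

```python
def chunk_windows(total_duration, window=10.0, step=1.0):
    """Return (start, end) for each sliding window over the audio.

    Assumption: windows that fit entirely are emitted first, then one extra
    window is added if the tail of the audio is still uncovered.
    """
    if total_duration <= window:
        return [(0.0, window)]

    # Number of windows that fit completely inside the audio.
    n_full = int((total_duration - window) // step) + 1
    windows = [(i * step, i * step + window) for i in range(n_full)]

    # Add one trailing window if the last full window stops short of the end.
    if windows[-1][1] < total_duration:
        start = n_full * step
        windows.append((start, start + window))
    return windows


windows = chunk_windows(14.7)  # six windows, from (0.0, 10.0) to (5.0, 15.0)
```

If that reading is right, matching the count of six diarized turns would be a coincidence of this particular file length.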

My use case is:
1. run diarization on a concatenation of many different audio files. Save speaker to centroid mapping
2. user submits a new audio file (audio_new)
3. get embedding for each segment of audio_new
4. find closest speaker centroid by cosine distance for each segment
5. save diarization of each segment of audio_new

There doesn't appear to be a great documented workflow for this.
It's odd to me that get_embeddings returns arrays with num_local_speakers as a dimension, which doesn't even correspond exactly to the existing number of speakers from the original diarization. What does this actually mean? Relative confidence of the mapping to some threshold-gated speakers?
To reduce this dimension and find the closest centroid, I'm doing:

        embeddings = self.pipeline.get_embeddings(audio, segmentations)
        for (segment, _), segment_embedding in zip(splits, embeddings):
            # For each centroid, take the smallest cosine distance over the
            # rows of segment_embedding, then pick the closest centroid.
            min_distance_idx = np.argmin(
                [
                    np.min(
                        cdist(
                            segment_embedding,
                            center[np.newaxis, :],
                            metric="cosine",
                        )
                    )
                    for center in self.speaker_to_centroids.values()
                ]
            )
            speaker = list(self.speaker_to_centroids.keys())[min_distance_idx]

I'm not sure this works as intended, especially since the segmentations aren't yet showing useful start/end times.
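
One thing to watch in the snippet above: taking `np.min` over the rows of `segment_embedding` collapses each item to a single label, and `np.min` propagates NaN if any row contains one. A possible alternative, sketched in plain Python, assigns each row to its nearest centroid separately and skips rows containing NaN. This assumes (not confirmed from the pyannote source) that each row is the embedding of one local speaker slot and that unused slots may be NaN. `cosine_distance` and `assign_local_speakers` are hypothetical helpers.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # treat a zero vector as maximally uninformative
    return 1.0 - dot / norm


def assign_local_speakers(chunk_embeddings, speaker_to_centroids):
    """Map each usable row of one chunk's embeddings to its nearest centroid.

    Returns {row_index: (speaker_label, distance)}; rows with NaN are skipped.
    """
    assignments = {}
    for row_idx, embedding in enumerate(chunk_embeddings):
        if any(math.isnan(x) for x in embedding):
            continue
        label, centroid = min(
            speaker_to_centroids.items(),
            key=lambda item: cosine_distance(embedding, item[1]),
        )
        assignments[row_idx] = (label, cosine_distance(embedding, centroid))
    return assignments
```

Keeping the distance next to each label also makes it possible to reject matches above some threshold instead of always forcing the nearest known speaker.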

Minimal reproduction example (MRE)

see above


bschreck · 2024-04-05T17:49:23Z

Okay, I dug through the code and see that the actual start/end times are created later, in to_diarization or to_annotation.
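
To illustrate what such a later step has to do, here is a minimal sketch (not pyannote's implementation) of turning frame-level activity for one speaker into timed segments. `frames_to_segments`, the 0/1 activity list, and the `frame_step` value are all hypothetical.

```python
def frames_to_segments(active, frame_step, offset=0.0):
    """Turn a per-frame 0/1 activity list into (start, end) times in seconds.

    active:     sequence of truthy/falsy values, one per frame
    frame_step: duration of one frame in seconds
    offset:     start time of the first frame in seconds
    """
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((offset + start * frame_step, offset + i * frame_step))
            start = None
    # Close a segment that is still open when the frames run out.
    if start is not None:
        segments.append((offset + start * frame_step, offset + len(active) * frame_step))
    return segments
```

For example, `frames_to_segments([0, 1, 1, 0, 0, 1], 0.5)` gives two segments, one covering frames 1 to 2 and one covering the final frame.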

However, diarizing the new audio file this way using existing clusters (with the same speaker: me) gives totally different (and very bad) annotations compared to just running the pretrained pipeline on the file directly. Running the pipeline by itself produces this set of segments:

        DiarizationSegment(
            speaker="SPEAKER_04", start=1.1370997453310672, end=2.461378183361628
        ),
        DiarizationSegment(
            speaker="SPEAKER_00", start=4.193126910016975, end=5.466471561969438
        ),
        DiarizationSegment(
            speaker="SPEAKER_04", start=5.755096349745333, end=6.4172355687606135
        ),
        DiarizationSegment(
            speaker="SPEAKER_00", start=8.182940152801354, end=10.271225382003397
        ),
        DiarizationSegment(
            speaker="SPEAKER_04", start=11.35781281833616, end=12.953738115449912
        ),
        DiarizationSegment(
            speaker="SPEAKER_04", start=13.344230475382002, end=14.51570755517827
        ),

Whereas the method I described, using the existing clusters, gives me:

        DiarizationSegment(speaker='SPEAKER_00', start=5.00909375, end=5.75159375),
        DiarizationSegment(speaker='SPEAKER_01', start=5.75159375, end=6.443468750000001),
        DiarizationSegment(speaker='SPEAKER_00', start=6.443468750000001, end=6.59534375),

This is totally different.

stale · 2024-10-05T07:38:48Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Oct 5, 2024
