Speaker Diarization pipeline.get_segmentations produces integer ascending start/ends instead of something useful #1685

Open
bschreck opened this issue Apr 5, 2024 · 1 comment

bschreck commented Apr 5, 2024

Tested versions

3.1

System information

macOS 13.6 - pyannote 3.1 - M2 Air

Issue description

I'm running:
```
import os
import torch
from pyannote.audio import Pipeline

self.pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=os.environ["HF_API_KEY"]
)
segmentations = self.pipeline.get_segmentations({'waveform': torch.from_numpy(waveform), 'sample_rate': sample_rate})
splits = [(segment, data) for segment, data in segmentations]
```

Each segment has start/end times that ascend by one, e.g. (0, 10), (1, 11), ... (5, 15).
These roughly match the length of the waveform (14.7 seconds) but clearly don't represent anything useful, even though the waveform is real speech. When I just run the full diarization pipeline, it does diarize correctly. The results are:
```
[(<Segment(1.16159, 2.41034)>, 'SPEAKER_00'),
 (<Segment(4.21597, 5.43097)>, 'SPEAKER_01'),
 (<Segment(5.76847, 6.39284)>, 'SPEAKER_00'),
 (<Segment(8.18159, 10.2741)>, 'SPEAKER_01'),
 (<Segment(11.3372, 12.9741)>, 'SPEAKER_00'),
 (<Segment(13.2947, 14.4591)>, 'SPEAKER_00')]
```
And in both cases there are 6 segments.
Where do these latter segments get constructed?

My use case is:
1. Run diarization on a concatenation of many different audio files and save the speaker-to-centroid mapping.
2. A user submits a new audio file (audio_new).
3. Get an embedding for each segment of audio_new.
4. Find the closest speaker centroid by cosine distance for each segment.
5. Save the diarization of each segment of audio_new.
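
The steps above can be sketched on toy data. This is a minimal illustration rather than pyannote's own API: `build_centroids`, `cosine_distance`, and `closest_speaker` are hypothetical helpers, and the 2-dimensional vectors stand in for real speaker embeddings. It assumes every enrolled embedding already carries a speaker label from the initial diarization.

```python
import numpy as np


def build_centroids(embeddings, labels):
    """Average the embeddings of each speaker into one centroid per speaker."""
    centroids = {}
    for speaker in sorted(set(labels)):
        rows = [e for e, label in zip(embeddings, labels) if label == speaker]
        centroids[speaker] = np.mean(rows, axis=0)
    return centroids


def cosine_distance(a, b):
    """1 - cosine similarity, the same quantity as cdist(..., metric="cosine")."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_speaker(embedding, centroids):
    """Return the label of the centroid with the smallest cosine distance."""
    return min(centroids, key=lambda s: cosine_distance(embedding, centroids[s]))


# Toy "enrolled" embeddings (step 1); real ones would come from the pipeline.
enrolled = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["SPEAKER_00", "SPEAKER_00", "SPEAKER_01", "SPEAKER_01"]
centroids = build_centroids(enrolled, labels)

# Toy embedding for one segment of the new file (steps 3 and 4).
new_embedding = np.array([0.2, 0.8])
assigned = closest_speaker(new_embedding, centroids)
print(assigned)
```

Keeping the centroids in a dict keyed by speaker label mirrors the `speaker_to_centroids` mapping used further down.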

There doesn't appear to be a great documented workflow for this.
It's odd to me that get_embeddings returns arrays with num_local_speakers as a dimension, which doesn't even correspond exactly to the number of speakers found by the original diarization. What does this dimension actually mean? Is it the relative confidence of the mapping to some threshold-gated speakers?
To reduce this dimension and find the closest centroid, I'm doing:
        embeddings = self.pipeline.get_embeddings(audio, segmentations)
        for (segment, _), segment_embedding in zip(splits, embeddings):
            min_distance_idx = np.argmin(
                [
                    np.min(
                        cdist(
                            segment_embedding,
                            center[np.newaxis, :],
                            metric="cosine",
                        )
                    )
                    for center in self.speaker_to_centroids.values()
                ]
            )
            speaker = list(self.speaker_to_centroids.keys())[min_distance_idx]
I'm not sure this works as intended, especially since the segmentations aren't yet showing useful start/end times.
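
One possible way to handle the num_local_speakers dimension, sketched under assumptions that are not confirmed in this thread: that get_embeddings returns an array shaped (num_chunks, num_local_speakers, dimension), that each row is a per-chunk speaker slot rather than a global speaker, and that slots without usable speech come back as NaN rows. `assign_slots_to_centroids` is a hypothetical helper, and the toy arrays stand in for real embeddings.

```python
import numpy as np


def assign_slots_to_centroids(embeddings, centroids):
    """Map every (chunk, local speaker slot) to its nearest centroid.

    embeddings: (num_chunks, num_local_speakers, dim); NaN rows are ASSUMED
                to mark slots with no usable embedding.
    centroids:  (num_speakers, dim).
    Returns an int array (num_chunks, num_local_speakers) of centroid indices,
    with -1 for slots that have no embedding.
    """
    valid = ~np.isnan(embeddings).any(axis=-1)
    # Replace NaN rows with zeros so the arithmetic below stays finite.
    safe = np.where(valid[..., None], embeddings, 0.0)
    norms = np.linalg.norm(safe, axis=-1, keepdims=True)
    unit = np.divide(safe, norms, out=np.zeros_like(safe), where=norms > 0)
    centroid_unit = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    # Cosine distance between every slot and every centroid.
    distances = 1.0 - unit @ centroid_unit.T
    assignment = distances.argmin(axis=-1)
    assignment[~valid] = -1
    return assignment


toy_embeddings = np.array(
    [
        [[1.0, 0.1], [np.nan, np.nan]],  # chunk 0: one active slot, one empty
        [[0.1, 1.0], [0.9, 0.0]],        # chunk 1: two active slots
    ]
)
toy_centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
assignment = assign_slots_to_centroids(toy_embeddings, toy_centroids)
print(assignment)
```

Unlike taking `np.min` across the slots of a chunk, this keeps one decision per slot, so two people speaking within the same chunk can map to different centroids, and empty slots are skipped explicitly instead of propagating NaN into `argmin`.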

Minimal reproduction example (MRE)

See above.

bschreck commented Apr 5, 2024

Okay, I dug through the code and see that the actual start/end times are created later, in to_diarization or to_annotation.
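
The (0, 10), (1, 11), ... pairs are consistent with sliding-window chunk boundaries rather than speech turns. A minimal sketch, assuming a 10-second window and a 1-second step (values inferred from the output reported above, not read from the pipeline configuration); `chunk_windows` is a hypothetical helper and its chunk-count formula is an assumption:

```python
import math


def chunk_windows(total_duration, window_duration=10.0, step=1.0):
    """Return (start, end) for each sliding-window chunk covering the audio."""
    if total_duration <= window_duration:
        return [(0.0, window_duration)]
    # Enough chunks so that the last one reaches the end of the audio.
    num_chunks = math.ceil((total_duration - window_duration) / step) + 1
    return [(i * step, i * step + window_duration) for i in range(num_chunks)]


windows = chunk_windows(14.7)
print(windows)
```

Under these assumptions a 14.7-second file yields six windows running from (0, 10) to (5, 15), which would explain why the count matched the six speech turns only by coincidence.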

However, trying to diarize the new audio file this way using existing clusters (with the same speaker, me) results in totally different (and very bad) annotations compared to just running the pretrained pipeline on the file directly. Running the pipeline by itself produces this set of segments:

        DiarizationSegment(
            speaker="SPEAKER_04", start=1.1370997453310672, end=2.461378183361628
        ),
        DiarizationSegment(
            speaker="SPEAKER_00", start=4.193126910016975, end=5.466471561969438
        ),
        DiarizationSegment(
            speaker="SPEAKER_04", start=5.755096349745333, end=6.4172355687606135
        ),
        DiarizationSegment(
            speaker="SPEAKER_00", start=8.182940152801354, end=10.271225382003397
        ),
        DiarizationSegment(
            speaker="SPEAKER_04", start=11.35781281833616, end=12.953738115449912
        ),
        DiarizationSegment(
            speaker="SPEAKER_04", start=13.344230475382002, end=14.51570755517827
        ),

The method I described, using existing clusters, instead gives me:

        [
            DiarizationSegment(speaker='SPEAKER_00', start=5.00909375, end=5.75159375),
            DiarizationSegment(speaker='SPEAKER_01', start=5.75159375, end=6.443468750000001),
            DiarizationSegment(speaker='SPEAKER_00', start=6.443468750000001, end=6.59534375),
        ]

This is totally different.
