### Tested versions

pyannote 3.1

### System information

macOS 13.6 - pyannote 3.1 - M2 Air

### Issue description
I'm running:
```
import os
import torch
from pyannote.audio import Pipeline

self.pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=os.environ["HF_API_KEY"]
)
# waveform (np.ndarray) and sample_rate are provided by the caller
segmentations = self.pipeline.get_segmentations({"waveform": torch.from_numpy(waveform), "sample_rate": sample_rate})
splits = [(segment, data) for segment, data in segmentations]
```
Each segment has start/end times that ascend by one second, e.g. (0, 10), (1, 11), ..., (5, 15).
These roughly match the length of the waveform (14.7 seconds), but clearly don't represent anything useful: the waveform is real speech. When I just run the full diarization pipeline, it does diarize correctly; the results are:
```
[(<Segment(1.16159, 2.41034)>, 'SPEAKER_00'), (<Segment(4.21597, 5.43097)>, 'SPEAKER_01'), (<Segment(5.76847, 6.39284)>, 'SPEAKER_00'), (<Segment(8.18159, 10.2741)>, 'SPEAKER_01'), (<Segment(11.3372, 12.9741)>, 'SPEAKER_00'), (<Segment(13.2947, 14.4591)>, 'SPEAKER_00')]
```
And in both cases there are 6 segments.
Where do these latter segments get constructed?
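For reference, the chunk geometry can be inspected directly; a minimal sketch, relying on pyannote.core's SlidingWindowFeature API (the 10 s window / 1 s step is what I observe, not something I've confirmed from the pipeline config):
```
# segmentations is a SlidingWindowFeature holding raw per-chunk scores,
# not final speech turns
chunks = segmentations.sliding_window
print(chunks.duration, chunks.step)  # e.g. 10.0 and 1.0 -> windows (0, 10), (1, 11), ...
print(segmentations.data.shape)      # (num_chunks, num_frames, num_local_speakers)
```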
My use case is:
1. Run diarization on a concatenation of many different audio files; save the speaker-to-centroid mapping.
2. A user submits a new audio file (audio_new).
3. Get an embedding for each segment of audio_new.
4. Find the closest speaker centroid by cosine distance for each segment.
5. Save the diarization of each segment of audio_new.
There doesn't appear to be a great documented workflow for this.
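For step 1, the closest thing I've found is asking the pipeline for its own per-speaker centroids. A sketch; it assumes 3.1's return_embeddings=True option, which appears to return one embedding per global speaker, ordered like diarization.labels():
```
# build the speaker -> centroid mapping from a full run on the concatenated audio
diarization, centroids = self.pipeline("concatenated.wav", return_embeddings=True)
# centroids: (num_speakers, dimension), one row per global speaker label
self.speaker_to_centroids = {
    speaker: centroids[i] for i, speaker in enumerate(diarization.labels())
}
```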
It's odd to me that get_embeddings returns arrays with num_local_speakers as a dimension, which doesn't correspond exactly to the number of speakers found by the original diarization. What does this actually mean? The relative confidence of the mapping to some threshold-gated speakers?
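From shape-poking, it looks like one embedding per (chunk, local speaker) pair rather than per global speaker (a sketch; the shape below is what I observe locally, before clustering assigns global labels):
```
embeddings = self.pipeline.get_embeddings(audio, segmentations)
# one embedding per local speaker slot of each chunk
print(embeddings.shape)  # (num_chunks, num_local_speakers, dimension)
```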
To reduce this dimension and find the closest centroid, I'm doing:
```
import numpy as np
from scipy.spatial.distance import cdist

embeddings = self.pipeline.get_embeddings(audio, segmentations)
for (segment, _), segment_embedding in zip(splits, embeddings):
    # segment_embedding: (num_local_speakers, dimension);
    # for each saved centroid, take the distance to its closest local speaker,
    # then pick the centroid with the smallest such distance
    min_distance_idx = np.argmin(
        [
            np.min(
                cdist(
                    segment_embedding,
                    center[np.newaxis, :],
                    metric="cosine",
                )
            )
            for center in self.speaker_to_centroids.values()
        ]
    )
    speaker = list(self.speaker_to_centroids.keys())[min_distance_idx]
```
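Equivalently, stacking the centroids collapses the inner loop into a single cdist call per chunk (same logic, just vectorized; assumes every centroid shares the embedding dimension):
```
centroids = np.stack(list(self.speaker_to_centroids.values()))  # (num_speakers, dim)
speakers = list(self.speaker_to_centroids.keys())
for (segment, _), segment_embedding in zip(splits, embeddings):
    # (num_local_speakers, num_speakers) cosine distances for this chunk
    distances = cdist(segment_embedding, centroids, metric="cosine")
    speaker = speakers[np.argmin(distances.min(axis=0))]
```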
I'm not sure this works as intended, especially since the segmentations aren't yet showing useful start/end times.
### Minimal reproduction example (MRE)
see above
Okay, I dug through the code and see that the actual start/end times are created later, in to_diarization or to_annotation.
However, trying to diarize the new audio file this way, using the existing clusters (with the same speaker: me), results in totally different (and very bad) annotations compared to just running the pretrained pipeline on the file directly. Running it by itself produces this set of segments:
While doing the method I described with the existing clusters gives me:
```
[DiarizationSegment(speaker='SPEAKER_00', start=5.00909375, end=5.75159375), DiarizationSegment(speaker='SPEAKER_01', start=5.75159375, end=6.443468750000001), DiarizationSegment(speaker='SPEAKER_00', start=6.443468750000001, end=6.59534375)]
```
This is totally different.