Releases: pyannote/pyannote-audio
Version 3.3.1
Breaking changes
- setup: drop support for Python 3.8
Fixes
- fix: fix support for `numpy==2.x` (@ibevers)
- fix: fix support for `speechbrain==1.x` (@Adel-Moumen)
Version 3.3.0
TL;DR
`pyannote.audio` does speech separation: multi-speaker audio in, one audio channel per speaker out!
pip install pyannote.audio[separation]==3.3.0
New features
- feat(task): add `PixIT` joint speaker diarization and speech separation task (with @joonaskalda)
- feat(model): add `ToTaToNet` joint speaker diarization and speech separation model (with @joonaskalda)
- feat(pipeline): add `SpeechSeparation` pipeline (with @joonaskalda)
- feat(io): add option to select torchaudio `backend`
Fixes
- fix(task): fix wrong train/development split when training with (some) meta-protocols (#1709)
- fix(task): fix metadata preparation with missing validation subset (@clement-pages)
Improvements
- improve(io): when available, default to using `soundfile` backend
- improve(pipeline): do not extract embeddings when `max_speakers` is set to 1
- improve(pipeline): optimize memory usage of most pipelines (#1713 by @benniekiss)
Version 3.2.0
New features
- feat(task): add option to cache task training metadata to speed up training (with @clement-pages)
- feat(model): add `receptive_field`, `num_frames` and `dimension` to models (with @Bilal-Rahou)
- feat(model): add `fbank_only` property to `WeSpeaker` models
- feat(util): add `Powerset.permutation_mapping` to help with permutation in powerset space (with @FrenchKrab)
- feat(sample): add sample file at `pyannote.audio.sample.SAMPLE_FILE`
- feat(metric): add `reduce` option to `diarization_error_rate` metric (with @Bilal-Rahou)
- feat(pipeline): add `Waveform` and `SampleRate` preprocessors
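To make "permutation in powerset space" concrete, here is a minimal pure-Python sketch. It is not the library's `Powerset` implementation: the helper names `powerset_classes` and `permute_powerset` are hypothetical, and the class ordering (subsets enumerated by increasing size) is an assumption made for illustration.

```python
from itertools import combinations


def powerset_classes(num_speakers, max_set_size):
    """Enumerate speaker subsets by increasing size: (), (0,), (1,), ..., (0, 1), ...

    Assumption: this ordering is chosen for illustration only.
    """
    classes = []
    for size in range(max_set_size + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def permute_powerset(classes, permutation):
    """For each powerset class, return the index of the class obtained
    after relabeling every speaker s as permutation[s]."""
    index = {cls: i for i, cls in enumerate(classes)}
    return [
        index[tuple(sorted(permutation[s] for s in cls))]
        for cls in classes
    ]


classes = powerset_classes(num_speakers=3, max_set_size=2)
print(classes)

# Swapping speakers 0 and 1 also swaps the classes {0} <-> {1} and {0, 2} <-> {1, 2}.
mapping = permute_powerset(classes, permutation=(1, 0, 2))
print(mapping)
```

With three speakers and at most two active at once, there are seven powerset classes, and a permutation of speakers induces a permutation of those seven class indices.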
Fixes
- fix(task): fix random generators and their reproducibility (with @FrenchKrab)
- fix(task): fix estimation of training set size (with @FrenchKrab)
- fix(hook): fix `torch.Tensor` support in `ArtifactHook`
- fix(doc): fix typo in `Powerset` docstring (with @lukasstorck)
Improvements
- improve(metric): add support for number of speakers mismatch in `diarization_error_rate` metric
- improve(pipeline): track both `Model` and `nn.Module` attributes in `Pipeline.to(device)`
- improve(io): switch to `torchaudio >= 2.2.0`
- improve(doc): update tutorials (with @clement-pages)
Breaking changes
- BREAKING(model): get rid of `Model.example_output` in favor of `num_frames` method, `receptive_field` property, and `dimension` property
- BREAKING(task): custom tasks need to be updated (see "Add your own task" tutorial)
Community contributions
- community: add tutorial for offline use of `pyannote/speaker-diarization-3.1` (by @simonottenhauskenbun)
Version 3.1.1
TL;DR
Providing `num_speakers` to `pyannote/speaker-diarization-3.1` now works as expected.
Full changelog
Fixes
- fix(pipeline): fix support for setting `num_speakers` in `pyannote/speaker-diarization-3.1` pipeline
Version 3.1.0
TL;DR
`pyannote/speaker-diarization-3.1` no longer requires unpopular ONNX runtime
Full changelog
New features
- feat(model): add WeSpeaker embedding wrapper based on PyTorch
- feat(model): add support for multi-speaker statistics pooling
- feat(pipeline): add `TimingHook` for profiling processing time
- feat(pipeline): add `ArtifactHook` for saving internal steps
- feat(pipeline): add support for list of hooks with `Hooks`
- feat(utils): add `"soft"` option to `Powerset.to_multilabel`
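The hook-related features above follow a simple callback pattern, sketched below in plain Python. These are not the library's `TimingHook`, `ArtifactHook` or `Hooks` classes: the names `TimingSketch`, `ArtifactSketch` and `HookList` are hypothetical, and the call signature `(step_name, step_artifact, file, total, completed)` is an assumption about how a pipeline reports its intermediate steps.

```python
import time


class TimingSketch:
    """Accumulates the wall-clock time elapsed before each reported step."""

    def __init__(self):
        self.timings = {}
        self._last = time.perf_counter()

    def __call__(self, step_name, step_artifact, file=None, total=None, completed=None):
        now = time.perf_counter()
        self.timings[step_name] = self.timings.get(step_name, 0.0) + (now - self._last)
        self._last = now


class ArtifactSketch:
    """Keeps the latest artifact reported for each step."""

    def __init__(self):
        self.artifacts = {}

    def __call__(self, step_name, step_artifact, file=None, total=None, completed=None):
        if step_artifact is not None:
            self.artifacts[step_name] = step_artifact


class HookList:
    """Forwards every call to each hook in turn."""

    def __init__(self, *hooks):
        self.hooks = hooks

    def __call__(self, step_name, step_artifact, **kwargs):
        for hook in self.hooks:
            hook(step_name, step_artifact, **kwargs)


timing, artifacts = TimingSketch(), ArtifactSketch()
hooks = HookList(timing, artifacts)

# Stand-in for a pipeline reporting two intermediate steps (made-up values).
hooks("segmentation", [[0.9, 0.1]], file=None)
hooks("embeddings", [[0.2, 0.7, 0.1]], file=None)

print(sorted(artifacts.artifacts))
print(sorted(timing.timings))
```

Composing hooks through a single callable keeps the pipeline side unchanged: it only ever calls one hook, whether that is a single profiler or a list of several.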
Fixes
- fix(pipeline): add missing "embedding" hook call in `SpeakerDiarization`
- fix(pipeline): fix `AgglomerativeClustering` to honor `num_clusters` when provided
- fix(pipeline): fix frame-wise speaker count exceeding `max_speakers` or detected `num_speakers` in `SpeakerDiarization` pipeline
Improvements
- improve(pipeline): compute `fbank` on GPU when requested
Breaking changes
- BREAKING(pipeline): rename `WeSpeakerPretrainedSpeakerEmbedding` to `ONNXWeSpeakerPretrainedSpeakerEmbedding`
- BREAKING(setup): remove `onnxruntime` dependency.
  You can still use ONNX `hbredin/wespeaker-voxceleb-resnet34-LM` but you will have to install `onnxruntime` yourself.
- BREAKING(pipeline): remove `logging_hook` (use `ArtifactHook` instead)
- BREAKING(pipeline): remove `onset` and `offset` parameters in `SpeakerDiarizationMixin.speaker_count`
  You should now binarize segmentations before passing them to `speaker_count`
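The last item above means thresholding now happens before counting. The following plain-Python sketch shows the idea on toy data. It is not the library's implementation: the helper names, the speaker-major list layout and the score values are assumptions for illustration, and the hysteresis rule (switch on above `onset`, switch off below `offset`) is one common way to binarize.

```python
def binarize(scores, onset=0.5, offset=0.5):
    """Hysteresis thresholding of one speaker's frame-level scores.

    A speaker becomes active when the score rises above `onset`
    and inactive again when it falls below `offset`.
    """
    active = False
    decisions = []
    for score in scores:
        if active:
            active = score >= offset
        else:
            active = score > onset
        decisions.append(int(active))
    return decisions


def count_speakers(binarized):
    """Number of simultaneously active speakers in each frame."""
    return [sum(frame) for frame in zip(*binarized)]


# Toy segmentation scores: one list of frame-level scores per speaker.
scores = [
    [0.1, 0.8, 0.9, 0.4, 0.2],
    [0.7, 0.6, 0.2, 0.1, 0.9],
]

binarized = [binarize(s, onset=0.6, offset=0.3) for s in scores]
counts = count_speakers(binarized)
print(binarized)
print(counts)
```

Separating the two steps lets callers choose their own thresholds (or any other binarization rule) without the counting function needing to know about them.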
Version 3.0.1
TL;DR
`pyannote/speaker-diarization-3.0` is now much faster when sent to GPU.
import torch
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0")
pipeline.to(torch.device("cuda"))
Full changelog
Fixes and improvements
- fix: fix WeSpeakerPretrainedSpeakerEmbedding GPU support
Dependencies update
- setup: switch from `onnxruntime` to `onnxruntime-gpu`
Version 3.0.0
TL;DR
Better pretrained pipeline and model
- Much better overlapping speech detection with powerset pyannote/segmentation-3.0
- Much better speaker diarization performance with pyannote/speaker-diarization-3.0
Benchmark (DER %) | v2.1 | v3.0 |
---|---|---|
AISHELL-4 | 14.1 | 12.3 |
AliMeeting (channel 1) | 27.4 | 24.3 |
AMI (IHM) | 18.9 | 19.0 |
AMI (SDM) | 27.1 | 22.2 |
AVA-AVD | - | 49.1 |
DIHARD 3 (full) | 26.9 | 21.7 |
MSDWild | - | 24.6 |
REPERE (phase2) | 8.2 | 7.8 |
VoxConverse (v0.3) | 11.2 | 11.3 |
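For readers comparing the two columns, the sketch below recalls how a diarization error rate is conventionally computed (false alarm, missed detection and speaker confusion, divided by total reference speech) and how to turn two table entries into a relative reduction. The durations in the first call are made up for illustration, and the function names are hypothetical.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total):
    """DER in percent, from error durations and total reference speech (same unit)."""
    return 100.0 * (false_alarm + missed_detection + confusion) / total


def relative_reduction(before, after):
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Made-up durations, in seconds.
der = diarization_error_rate(
    false_alarm=12.0, missed_detection=30.0, confusion=18.0, total=600.0
)
print(f"DER = {der:.1f}%")

# DIHARD 3 (full) row of the table: 26.9 (v2.1) vs 21.7 (v3.0).
dihard = relative_reduction(26.9, 21.7)
print(f"DIHARD 3 (full): {dihard:.1f}% relative DER reduction")
```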
Major breaking changes
- BREAKING: pipelines now run on CPU by default
  Use `pipeline.to(torch.device('cuda'))` to use GPU
- BREAKING: removed `SpeakerSegmentation` pipeline
  Use `SpeakerDiarization` pipeline instead
- BREAKING: removed support for `prodi.gy` recipes
Full changelog
Features and improvements
- feat(pipeline): send pipeline to device with `pipeline.to(device)`
- feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
- feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
- feat(pipeline): add progress hook to pipelines
- feat(task): add powerset support to `SpeakerDiarization` task
- feat(task): add support for multi-task models
- feat(task): add support for label scope in speaker diarization task
- feat(task): add support for missing classes in multi-label segmentation task
- feat(model): add segmentation model based on torchaudio self-supervised representation
- feat(pipeline): check version compatibility at load time
- improve(task): load metadata as tensors rather than pyannote.core instances
- improve(task): improve error message on missing specifications
Breaking changes
- BREAKING(task): rename `Segmentation` task to `SpeakerDiarization`
- BREAKING(pipeline): pipeline defaults to CPU (use `pipeline.to(device)`)
- BREAKING(pipeline): remove `SpeakerSegmentation` pipeline (use `SpeakerDiarization` pipeline)
- BREAKING(pipeline): remove `segmentation_duration` parameter from `SpeakerDiarization` pipeline (defaults to `duration` of segmentation model)
- BREAKING(task): remove support for variable chunk duration for segmentation tasks
- BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
- BREAKING(setup): drop support for Python 3.7
- BREAKING(io): channels are now 0-indexed (used to be 1-indexed)
- BREAKING(io): multi-channel audio is no longer downmixed to mono by default.
  You should update how `pyannote.audio.core.io.Audio` is instantiated:
  - replace `Audio()` by `Audio(mono="downmix")`;
  - replace `Audio(mono=True)` by `Audio(mono="downmix")`;
  - replace `Audio(mono=False)` by `Audio()`.
- BREAKING(model): get rid of (flaky) `Model.introspection`
  If, for some weird reason, you wrote some custom code based on that, you should instead rely on `Model.example_output`.
- BREAKING(interactive): remove support for Prodigy recipes
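The two `io` changes in the list above (0-indexed channels and the new `mono` default) lend themselves to a mechanical migration. The sketch below encodes the replacement rules as two small helpers. Both `migrate_audio_kwargs` and `migrate_channel` are hypothetical names that are not part of `pyannote.audio`; they only restate the rules from the changelog.

```python
_UNSET = object()  # sentinel meaning "the old code did not pass `mono` at all"


def migrate_audio_kwargs(mono=_UNSET):
    """Translate a pre-3.0 `mono` argument into the equivalent 3.0 keyword arguments."""
    if mono is _UNSET or mono is True:
        # Audio() and Audio(mono=True) both become Audio(mono="downmix")
        return {"mono": "downmix"}
    if mono is False:
        # Audio(mono=False) becomes Audio()
        return {}
    raise ValueError(f"unexpected pre-3.0 value for mono: {mono!r}")


def migrate_channel(channel):
    """Convert a 1-indexed (pre-3.0) channel number to its 0-indexed equivalent."""
    if channel < 1:
        raise ValueError("pre-3.0 channel numbers start at 1")
    return channel - 1


print(migrate_audio_kwargs())            # old: Audio()
print(migrate_audio_kwargs(mono=True))   # old: Audio(mono=True)
print(migrate_audio_kwargs(mono=False))  # old: Audio(mono=False)
print(migrate_channel(1))                # old first channel
```

The returned dictionary would then be splatted into the constructor, for instance `Audio(**migrate_audio_kwargs(mono=False))`, assuming `Audio` is imported from `pyannote.audio.core.io`.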
Fixes and improvements
- fix(pipeline): fix reproducibility issue with Ampere CUDA devices
- fix(pipeline): fix support for IOBase audio
- fix(pipeline): fix corner case with no speaker
- fix(train): prevent metadata preparation from happening twice
- fix(task): fix support for "balance" option
- improve(task): shorten and improve structure of Tensorboard tags
Dependencies update
- setup: switch to torch 2.0+, torchaudio 2.0+, soundfile 0.12+, lightning 2.0+, torchmetrics 0.11+
- setup: switch to pyannote.core 5.0+, pyannote.database 5.0+, and pyannote.pipeline 3.0+
- setup: switch to speechbrain 0.5.14+
Version 2.1.1
Version 2.1.x introduces a major overhaul of `pyannote.audio` default speaker diarization pipeline, made of three main stages:
- neural speaker segmentation applied to a short sliding window;
- neural speaker embedding of each (local) speaker;
- (global) agglomerative clustering.
More details in the attached technical report.
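The first stage relies on a short sliding window. The sketch below only illustrates how a file can be cut into overlapping chunks; it does not cover the embedding and clustering stages. The function name, the window and step values, and the handling of the last chunk are assumptions, not the pipeline's actual settings.

```python
def sliding_chunks(duration, window=5.0, step=0.5):
    """Return (start, end) chunks of length `window` that cover [0, duration].

    Chunks start every `step` seconds; a final chunk aligned with the end of
    the file is added when the regular grid leaves a remainder uncovered.
    """
    if duration <= window:
        return [(0.0, duration)]

    num_chunks = int((duration - window) // step) + 1
    chunks = [(i * step, i * step + window) for i in range(num_chunks)]

    if chunks[-1][1] < duration:
        chunks.append((duration - window, duration))
    return chunks


# 12 seconds of audio, 5-second window, 2.5-second step (illustrative values).
chunks = sliding_chunks(12.0, window=5.0, step=2.5)
for start, end in chunks:
    print(f"{start:.1f}s to {end:.1f}s")
```

Because consecutive chunks overlap, each instant of the file is seen several times, which is what later makes it necessary to reconcile the local speakers of each chunk through global clustering.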
Version 1.1.1
chore: do not update to pyannote.pipeline >= 2.0