This week I wrote audio.vadwebrtc because I needed a quick way to remove segments without voice from audio files. I transcribe audio files with audio.whisper, and that model hallucinates on segments that contain only silence.
I looked at this chunk of code in package av:
Lines 652 to 670 in 58d7026
SEXP R_convert_audio(SEXP audio, SEXP out_file, SEXP out_format, SEXP out_channels,
                     SEXP sample_rate, SEXP start_pos, SEXP max_len){
  output_container *output = av_mallocz(sizeof(output_container));
  if(Rf_length(out_channels))
    output->channels = Rf_asInteger(out_channels);
  if(Rf_length(sample_rate))
    output->sample_rate = Rf_asInteger(sample_rate);
  if(Rf_length(out_format))
    output->format_name = CHAR(STRING_ELT(out_format, 0));
  output->audio_input = open_audio_input(CHAR(STRING_ELT(audio, 0)));
  double start_pts = Rf_length(start_pos) ? Rf_asReal(start_pos) : 0;
  if(start_pts > 0)
    av_seek_frame(output->audio_input->demuxer, -1, start_pts * AV_TIME_BASE, AVSEEK_FLAG_ANY);
  if(Rf_length(max_len))
    output->max_pts = (Rf_asReal(max_len) + start_pts) * AV_TIME_BASE;
  output->output_file = CHAR(STRING_ELT(out_file, 0));
  R_UnwindProtect(encode_audio_input, output, close_output_file, output, NULL);
  return out_file;
}
and it only handles one start_time / total_time.
My current code for extracting only the voiced parts of an audio file looks like this:
> library(av)
> library(audio.vadwebrtc)
> file <- system.file(package = "audio.vadwebrtc", "extdata", "test_wav.wav")
> vad <- VAD(file, mode = "normal")
> vad$vad_segments
  vad_segment start  end has_voice
1           1  0.00 0.08     FALSE
2           2  0.09 3.30      TRUE
3           3  3.31 3.71     FALSE
4           4  3.72 6.78      TRUE
5           5  6.79 6.99     FALSE
>
> voiced <- subset(vad$vad_segments, vad$vad_segments$has_voice == TRUE)
> voiced$file <- sprintf("%s.wav", voiced$vad_segment)
> voiced
  vad_segment start  end has_voice  file
2           2  0.09 3.30      TRUE 2.wav
4           4  3.72 6.78      TRUE 4.wav
> for(i in seq_len(nrow(voiced))){
+ av_audio_convert(file, output = voiced$file[i],
+ start_time = voiced$start[i],
+ total_time = voiced$end[i] - voiced$start[i])
+ }
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\2.wav':
Metadata:
ISFT : Lavf58.29.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 28 at timestamp 3.42sec - audio stream completed!
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\4.wav':
Metadata:
ISFT : Lavf58.29.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 26 at timestamp 6.79sec - audio stream completed!
>
Would it be technically possible to allow multiple start_time / total_time values, so that all segments are combined into one file? Then I could write something like this:
av_audio_convert(file, output = "test.wav", start_time = voiced$start, total_time = voiced$end - voiced$start)
to generate a single output file.
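One way the conversion loop could support several segments is to keep a sorted list of [start, end) ranges and, for each decoded frame, either drop it or shift its timestamp so that the kept ranges sit back to back in the output. The sketch below is not code from av: `segment`, `make_segment`, `map_pts`, and `SKETCH_TIME_BASE` (a stand-in for FFmpeg's `AV_TIME_BASE`) are hypothetical names, and it assumes the segments are sorted by start time and do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FFmpeg's AV_TIME_BASE (microseconds per second). */
#define SKETCH_TIME_BASE 1000000

/* One requested segment in time-base units, covering [start, end). */
typedef struct {
  int64_t start;
  int64_t end;
} segment;

/* Convert a start time and a duration (both in seconds) to a segment.
 * Adding 0.5 before truncation rounds to the nearest time-base unit. */
static segment make_segment(double start_sec, double total_sec) {
  segment s;
  s.start = (int64_t)(start_sec * SKETCH_TIME_BASE + 0.5);
  s.end = (int64_t)((start_sec + total_sec) * SKETCH_TIME_BASE + 0.5);
  return s;
}

/* Map an input timestamp to its position in the concatenated output.
 * Assumes segs is sorted by start and the segments do not overlap.
 * Returns -1 when pts lies outside every segment (the frame would be dropped). */
static int64_t map_pts(const segment *segs, size_t n, int64_t pts) {
  int64_t offset = 0; /* total length of the segments already passed */
  for (size_t i = 0; i < n; i++) {
    if (pts < segs[i].start)
      return -1; /* in a gap before segment i */
    if (pts < segs[i].end)
      return offset + (pts - segs[i].start);
    offset += segs[i].end - segs[i].start;
  }
  return -1; /* past the last segment */
}
```

With the two voiced segments above (0.09 to 3.30 s and 3.72 to 6.78 s), a frame at 3.72 s in the input would land at 3.21 s in the output, and frames in the silent gaps would be skipped.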