av_audio_convert multiple start_time/total_time #52

Open
@jwijffels

I wrote audio.vadwebrtc this week because I needed a quick way to remove segments without voice from audio files. I need to transcribe audio files with audio.whisper, and that model hallucinates on audio segments containing only silence.

I looked at this chunk of code in package av:

av/src/video.c

Lines 652 to 670 in 58d7026

SEXP R_convert_audio(SEXP audio, SEXP out_file, SEXP out_format, SEXP out_channels,
                     SEXP sample_rate, SEXP start_pos, SEXP max_len){
  output_container *output = av_mallocz(sizeof(output_container));
  if(Rf_length(out_channels))
    output->channels = Rf_asInteger(out_channels);
  if(Rf_length(sample_rate))
    output->sample_rate = Rf_asInteger(sample_rate);
  if(Rf_length(out_format))
    output->format_name = CHAR(STRING_ELT(out_format, 0));
  output->audio_input = open_audio_input(CHAR(STRING_ELT(audio, 0)));
  double start_pts = Rf_length(start_pos) ? Rf_asReal(start_pos) : 0;
  if(start_pts > 0)
    av_seek_frame(output->audio_input->demuxer, -1, start_pts * AV_TIME_BASE, AVSEEK_FLAG_ANY);
  if(Rf_length(max_len))
    output->max_pts = (Rf_asReal(max_len) + start_pts) * AV_TIME_BASE;
  output->output_file = CHAR(STRING_ELT(out_file, 0));
  R_UnwindProtect(encode_audio_input, output, close_output_file, output, NULL);
  return out_file;
}

It only handles a single start_time / total_time pair.
My code to extract only the voiced parts of an audio file currently looks like this:

> library(av)
> library(audio.vadwebrtc)
> file <- system.file(package = "audio.vadwebrtc", "extdata", "test_wav.wav")
> vad <- VAD(file, mode = "normal")
> vad$vad_segments
  vad_segment start  end has_voice
1           1  0.00 0.08     FALSE
2           2  0.09 3.30      TRUE
3           3  3.31 3.71     FALSE
4           4  3.72 6.78      TRUE
5           5  6.79 6.99     FALSE
> 
> voiced <- subset(vad$vad_segments, vad$vad_segments$has_voice == TRUE)
> voiced$file <- sprintf("%s.wav", voiced$vad_segment)
> voiced
  vad_segment start  end has_voice  file
2           2  0.09 3.30      TRUE 2.wav
4           4  3.72 6.78      TRUE 4.wav
> for(i in seq_len(nrow(voiced))){
+     av_audio_convert(file, output = voiced$file[i], 
+                      start_time = voiced$start[i], 
+                      total_time = voiced$end[i] - voiced$start[i])
+ }
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\2.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 28 at timestamp 3.42sec - audio stream completed!
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\4.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 26 at timestamp 6.79sec - audio stream completed!
>

Would it be technically possible to allow multiple start_time / total_time values, so that all the segments are combined into 1 file? Then I could write something like av_audio_convert(file, output = "test.wav", start_time = voiced$start, total_time = voiced$end - voiced$start) and get a single output file.
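
To make the request concrete, here is a minimal sketch in plain C of the timestamp bookkeeping that combining several segments would need. It is independent of FFmpeg and is not how av works today. The keep_segment type and both function names are hypothetical, and the sketch assumes the segments are sorted by start time and do not overlap. How this would be wired into encode_audio_input is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type (not part of av): one input segment to keep, in seconds. */
typedef struct {
  double start; /* where the segment begins in the input */
  double len;   /* how long the segment lasts */
} keep_segment;

/* Map an input timestamp t (seconds) to its position in the combined output.
 * Returns -1 if t lies outside every kept segment, i.e. the sample is dropped.
 * Assumption: segs is sorted by start and the segments do not overlap. */
double map_to_output_time(const keep_segment *segs, size_t n, double t) {
  assert(segs != NULL || n == 0);
  double out_offset = 0; /* total length of the segments already passed */
  for (size_t i = 0; i < n; i++) {
    if (t < segs[i].start)
      return -1; /* t falls in the gap before segment i */
    if (t < segs[i].start + segs[i].len)
      return out_offset + (t - segs[i].start);
    out_offset += segs[i].len;
  }
  return -1; /* t is past the end of the last segment */
}

/* Total duration of the combined output, in seconds. */
double total_output_time(const keep_segment *segs, size_t n) {
  double total = 0;
  for (size_t i = 0; i < n; i++)
    total += segs[i].len;
  return total;
}
```

With the two voiced segments from the example above (start 0.09 with length 3.21, and start 3.72 with length 3.06), an input timestamp of 3.72 would land at 3.21 seconds in the output, and anything between 3.30 and 3.72 would be dropped.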
