av_audio_convert multiple start_time/total_time #52

Open
jwijffels opened this issue Jan 24, 2024 · 0 comments


jwijffels commented Jan 24, 2024

This week I wrote audio.vadwebrtc because I needed a quick way to remove segments without voice from audio files. I need to transcribe audio files with audio.whisper, and that model hallucinates on audio segments containing only silence.

I looked at this chunk of code in package av:

av/src/video.c

Lines 652 to 670 in 58d7026

SEXP R_convert_audio(SEXP audio, SEXP out_file, SEXP out_format, SEXP out_channels,
                     SEXP sample_rate, SEXP start_pos, SEXP max_len){
  output_container *output = av_mallocz(sizeof(output_container));
  if(Rf_length(out_channels))
    output->channels = Rf_asInteger(out_channels);
  if(Rf_length(sample_rate))
    output->sample_rate = Rf_asInteger(sample_rate);
  if(Rf_length(out_format))
    output->format_name = CHAR(STRING_ELT(out_format, 0));
  output->audio_input = open_audio_input(CHAR(STRING_ELT(audio, 0)));
  double start_pts = Rf_length(start_pos) ? Rf_asReal(start_pos) : 0;
  if(start_pts > 0)
    av_seek_frame(output->audio_input->demuxer, -1, start_pts * AV_TIME_BASE, AVSEEK_FLAG_ANY);
  if(Rf_length(max_len))
    output->max_pts = (Rf_asReal(max_len) + start_pts) * AV_TIME_BASE;
  output->output_file = CHAR(STRING_ELT(out_file, 0));
  R_UnwindProtect(encode_audio_input, output, close_output_file, output, NULL);
  return out_file;
}

It only handles a single start_time / total_time pair.
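
To make the single-window limitation concrete, here is a rough R trace of the timestamp arithmetic in the C function above. It is only a sketch: `window_pts` is a hypothetical helper, not part of av, and it assumes `AV_TIME_BASE` is FFmpeg's usual 1,000,000 (microseconds per second). The C code seeks only when the start is greater than 0 and sets `max_pts` only when a length is supplied; both conditions are ignored here.

```r
# FFmpeg's internal time base: microseconds per second (assumed value)
AV_TIME_BASE <- 1e6

# Hypothetical helper: mirrors the start_pts / max_pts arithmetic of
# R_convert_audio, vectorised over several windows for illustration.
window_pts <- function(start_pos, max_len) {
  data.frame(
    start_pos = start_pos,
    max_len   = max_len,
    seek_pts  = start_pos * AV_TIME_BASE,              # target of av_seek_frame
    max_pts   = (max_len + start_pos) * AV_TIME_BASE   # where encoding stops
  )
}

# The two voiced segments from the VAD output: 0.09-3.30 and 3.72-6.78 seconds
window_pts(start_pos = c(0.09, 3.72), max_len = c(3.21, 3.06))
```

The current function keeps one seek target and one `max_pts` per call, so supporting several windows would presumably mean carrying one such pair per segment.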
My code to extract only the voiced parts of an audio file looks like this:

> library(av)
> library(audio.vadwebrtc)
> file <- system.file(package = "audio.vadwebrtc", "extdata", "test_wav.wav")
> vad <- VAD(file, mode = "normal")
> vad$vad_segments
  vad_segment start  end has_voice
1           1  0.00 0.08     FALSE
2           2  0.09 3.30      TRUE
3           3  3.31 3.71     FALSE
4           4  3.72 6.78      TRUE
5           5  6.79 6.99     FALSE
> 
> voiced <- subset(vad$vad_segments, vad$vad_segments$has_voice == TRUE)
> voiced$file <- sprintf("%s.wav", voiced$vad_segment)
> voiced
  vad_segment start  end has_voice  file
2           2  0.09 3.30      TRUE 2.wav
4           4  3.72 6.78      TRUE 4.wav
> for(i in seq_len(nrow(voiced))){
+     av_audio_convert(file, output = voiced$file[i], 
+                      start_time = voiced$start[i], 
+                      total_time = voiced$end[i] - voiced$start[i])
+ }
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\2.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 28 at timestamp 3.42sec - audio stream completed!
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\4.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 26 at timestamp 6.79sec - audio stream completed!
>

Would it be technically possible to allow multiple start_time / total_time values, so that all segments are combined into one file? Then I could write something like this: av_audio_convert(file, output = "test.wav", start_time = voiced$start, total_time = voiced$end - voiced$start), generating a single output file.
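
As an illustration of the semantics I have in mind, here is a minimal base R sketch that works on already-decoded samples rather than on files. `keep_segments` is a hypothetical helper, not an av function, and it assumes mono PCM samples held in a plain vector. The sample data is synthetic; in practice the samples might come from something like `av::read_audio_bin()`, which I have not tried here.

```r
# Hypothetical helper, not part of av: keep only the samples that fall inside
# the given time windows and concatenate them in order.
# Assumes mono PCM samples in a plain vector.
keep_segments <- function(samples, sample_rate, start_time, total_time) {
  stopifnot(length(start_time) == length(total_time))
  pieces <- Map(function(start, len) {
    first <- round(start * sample_rate) + 1               # first sample in window
    last  <- min(length(samples), round((start + len) * sample_rate))
    if (last < first) samples[0] else samples[first:last]
  }, start_time, total_time)
  unlist(pieces, use.names = FALSE)
}

# Synthetic stand-in for decoded audio: 7 seconds at 10 samples per second
samples <- seq_len(70)
voiced_only <- keep_segments(samples, sample_rate = 10,
                             start_time = c(1, 4), total_time = c(2, 1.5))
length(voiced_only)
```

With vector-valued start_time / total_time, av_audio_convert would do the equivalent on the encoder side and write the concatenated result to one output file.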
