Faster Whisper

Faster Whisper template for Datastone Spirit Serverless.

Usage

Config

Minimal request body:

{
    "input": {
        "audio": "http://your-audio.wav"
    },
    "webhook": "http://your-backend-to-receive" 
}

or

{
    "input": {
        "audio_base64": "xxxrfsfsfsfs"
    },
    "webhook": "http://your-backend-to-receive" 
}

The "webhook" field is used for async requests only.
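The two minimal bodies above can be assembled programmatically. The following is a minimal sketch, assuming exactly one audio source is sent per request; `build_request` and its parameter names are hypothetical helpers, not part of this worker.

```python
import base64
import json


def build_request(audio_url=None, audio_bytes=None, webhook=None):
    """Build a minimal request body with exactly one audio source."""
    # Assumption: "audio" and "audio_base64" are alternatives, not combined.
    if (audio_url is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of audio_url or audio_bytes")

    if audio_url is not None:
        payload = {"input": {"audio": audio_url}}
    else:
        encoded = base64.b64encode(audio_bytes).decode("ascii")
        payload = {"input": {"audio_base64": encoded}}

    # "webhook" only matters for async requests, so it is omitted otherwise.
    if webhook is not None:
        payload["webhook"] = webhook
    return payload


body = build_request(
    audio_url="http://your-audio.wav",
    webhook="http://your-backend-to-receive",
)
print(json.dumps(body, indent=4))
```

For a local file, read it in binary mode and pass the bytes as `audio_bytes`; the helper then fills `audio_base64` instead of `audio`.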

Full request body:

{
    "input": {
        "audio": "http://your-audio.wav",
        "model": "base",
        "transcription": "plain_text",
        "translate": false,
        "language": null,
        "beam_size": 5,
        "best_of": 5,
        "patience": 1.0,
        "length_penalty": 1.0,
        "temperature": 0.0,
        "temperature_increment_on_fallback": 0.2,
        "initial_prompt": null,
        "condition_on_previous_text": true,
        "compression_ratio_threshold": 2.4,
        "log_prob_threshold": -1.0,
        "no_speech_threshold": 0.6,
        "enable_vad": false,
        "word_timestamps": false
    },
    "webhook": "http://your-backend-to-receive" 
}

When a webhook is used in async mode, the result is sent to your webhook with the query string requestID=xxx-xxx&statusCode=200. The requestID is returned in the response to your async request.
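On the receiving side, the two query parameters can be read with the standard library. This is a sketch only: the `/callback` path and the `parse_webhook_query` helper are illustrative assumptions, and how the result body itself is delivered is not covered here.

```python
from urllib.parse import parse_qs, urlsplit


def parse_webhook_query(request_target):
    """Extract requestID and statusCode from the callback's query string."""
    query = parse_qs(urlsplit(request_target).query)
    request_id = query.get("requestID", [None])[0]
    status_raw = query.get("statusCode", [None])[0]
    # statusCode arrives as text; convert it when present.
    status_code = int(status_raw) if status_raw is not None else None
    return request_id, status_code


# "/callback" is a hypothetical path on your backend.
request_id, status_code = parse_webhook_query(
    "/callback?requestID=xxx-xxx&statusCode=200"
)
print(request_id, status_code)
```

Matching the returned `request_id` against the one from the async response lets the backend pair each callback with the request that triggered it.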

Output format:

{
  "model": "base",
  "detected_language": "en",
  "device": "cpu",
  "segments": [
    {
      "id": 1,
      "seek": 1000,
      "start": 0,
      "end": 9.8,
      "text": " Four score and seven years ago, ...",
      "tokens": [
        50364,
        7451,
        ...
      ],
      "temperature": 0,
      "avg_logprob": -0.2194819552557809,
      "compression_ratio": 1.380952380952381,
      "no_speech_prob": 0.012501929886639118
    }
  ],
  "transcription": "Four score and seven years ago, ...",
  "translation": null,
  "word_timestamps": [
    {
      "word": " Four",
      "start": 0,
      "end": 0.6
    },
    {
      "word": " score",
      "start": 0.6,
      "end": 0.96
    },
    ...
  ]
}
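A client can walk the `segments` list to produce a timestamped listing. The sketch below assumes the response has already been decoded into a Python dict with the fields shown above; `format_timestamp` and `summarize_segments` are hypothetical helper names.

```python
def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def summarize_segments(result):
    """Return one '[start -> end] text' line per segment."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        # Segment text carries a leading space in the sample output.
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


# Trimmed copy of the sample output above.
result = {
    "detected_language": "en",
    "segments": [
        {"id": 1, "start": 0, "end": 9.8,
         "text": " Four score and seven years ago, ..."},
    ],
}
for line in summarize_segments(result):
    print(line)
```

The same loop works over `word_timestamps`, using each entry's `word` field in place of `text`.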

| argument | type | description |
| --- | --- | --- |
| `audio` | str | URL of the audio file. |
| `audio_base64` | str | Base64-encoded audio. |
| `model` | str | Whisper model to use. Available models: "tiny", "base", "small", "medium", "large-v1", "large-v2", "large-v3". Default: "base" |
| `transcription` | str | Output type. Available transcriptions: "plain_text", "formatted_text", "srt", "vtt". Default: "plain_text" |
| `translate` | bool | Whether to translate to English. Faster Whisper currently supports translation to English only. Default: False |
| `language` | str | The language spoken in the audio, given as a language code such as "en" or "fr". If not set, the language is detected from the first 30 seconds of audio. Default: None |
| `beam_size` | int | Beam size to use for decoding. Default: 5 |
| `best_of` | int | Number of candidates when sampling with non-zero temperature. Default: 5 |
| `patience` | float | Beam search patience factor. Default: 1.0 |
| `length_penalty` | float | Exponential length penalty constant. Default: 1.0 |
| `temperature` | float | Temperature for sampling. Default: 0.0 |
| `temperature_increment_on_fallback` | float | Step by which `temperature` is raised toward 1.0 on each fallback. Default: 0.2, which gives the schedule [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] |
| `initial_prompt` | str | Optional text string or iterable of token ids to provide as a prompt for the first window. Default: None |
| `condition_on_previous_text` | bool | If True, the previous output of the model is provided as a prompt for the next window. Disabling it may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop, such as repetition looping or timestamps going out of sync. Default: True |
| `compression_ratio_threshold` | float | If the gzip compression ratio is above this value, the decoding is treated as failed. Default: 2.4 |
| `log_prob_threshold` | float | If the average log probability over sampled tokens is below this value, the decoding is treated as failed. Default: -1.0 |
| `no_speech_threshold` | float | If the no_speech probability is higher than this value AND the average log probability over sampled tokens is below `log_prob_threshold`, the segment is considered silent. Default: 0.6 |
| `enable_vad` | bool | Enable voice activity detection (VAD) to filter out parts of the audio without speech. This step uses the Silero VAD model https://github.com/snakers4/silero-vad. Default: False |
| `word_timestamps` | bool | If True, include word timestamps in the output. Default: False |
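The fallback temperatures and the three thresholds interact as described in the table. The sketch below restates those rules in Python purely for illustration; it follows the table's wording, is not taken from the Faster Whisper source, and all function names are hypothetical.

```python
def temperature_schedule(temperature=0.0, increment=0.2):
    """List the temperatures tried in order, from `temperature` up to 1.0."""
    if increment <= 0 or temperature >= 1.0:
        return [temperature]
    schedule = []
    step = 0
    while True:
        # Round to avoid float drift such as 0.6000000000000001.
        value = round(temperature + step * increment, 10)
        if value > 1.0:
            break
        schedule.append(value)
        step += 1
    return schedule


def needs_fallback(compression_ratio, avg_logprob,
                   compression_ratio_threshold=2.4, log_prob_threshold=-1.0):
    """A decoding counts as failed if it is too repetitive or too unlikely."""
    too_repetitive = compression_ratio > compression_ratio_threshold
    too_unlikely = avg_logprob < log_prob_threshold
    return too_repetitive or too_unlikely


def is_silent(no_speech_prob, avg_logprob,
              no_speech_threshold=0.6, log_prob_threshold=-1.0):
    """Silence requires BOTH a high no_speech probability and a low logprob."""
    return (no_speech_prob > no_speech_threshold
            and avg_logprob < log_prob_threshold)


print(temperature_schedule())
# Values from the sample segment in the output format section.
print(needs_fallback(compression_ratio=1.380952380952381,
                     avg_logprob=-0.2194819552557809))
print(is_silent(no_speech_prob=0.012501929886639118,
                avg_logprob=-0.2194819552557809))
```

With the defaults, the schedule matches the list given for `temperature_increment_on_fallback`, and the sample segment neither triggers a fallback nor counts as silent.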

datastone-spirit/worker-faster-whisper