Vosk vs Whipscribe in 2026 — tiny offline Kaldi STT vs hosted Whisper Large-v3

May 8, 2026 · Neugence · 12 min read

Vosk is a 50 MB Kaldi-based recognizer that fits on a Raspberry Pi and streams in real time on a CPU. Whipscribe is hosted Whisper Large-v3 plus diarization that takes a URL or a file. They look similar — "speech to text" — but they live on opposite ends of the speech-recognition map. Different model families, different deployment shapes, different accuracy tiers, different jobs to be done. This is the honest decision frame for 2026.

The one-sentence summary

If your problem is "my Raspberry Pi needs to recognize 'turn on the lights' without the internet", Vosk is the answer. If your problem is "I have a 90-minute podcast and I want a clean transcript with speaker labels", Whipscribe is the answer. Anyone telling you the same tool fits both shapes is selling you something.

The two tools come from two completely different worlds

It is worth slowing down for one paragraph on this, because the rest of the decision falls out of it. Vosk is built on Kaldi, the open-source speech-recognition toolkit that came out of Daniel Povey's lab at Johns Hopkins around 2011. Kaldi uses HMM-GMM and HMM-DNN acoustic models stitched to n-gram language models — the architecture that ran most production ASR from roughly 2012 to 2020. It is small, fast, deterministic, and trained on relatively narrow data per language.

Whisper, released by OpenAI in September 2022, is a Transformer encoder-decoder trained on 680,000 hours of multilingual web audio. It is bigger, slower, much more accurate on hard inputs, and dramatically more robust to accents, codecs, and noise — at the cost of a model file that ranges from 75 MB (Tiny) to 3 GB (Large-v3) and a runtime that wants real GPU compute to feel responsive.

Both projects are excellent at what they're built for. They're just not built for the same thing. Vosk's model is roughly 50 MB; Whisper Large-v3 is 3 GB. That ratio — sixty-to-one — is most of the story.

Side-by-side, in the dimensions that actually matter

| Dimension | Vosk (small en model) | Whipscribe (Large-v3 + whisperX) |
|---|---|---|
| Model family (what's under the hood) | Kaldi · HMM-DNN + n-gram LM | OpenAI Whisper · Transformer encoder-decoder |
| Model size (what you ship) | ~50 MB (small) · ~1.8 GB (large) | 3 GB Large-v3 (server-side, you don't ship it) |
| Runs on (target hardware) | Raspberry Pi · Android · iOS · embedded x86 · browsers via WASM | Server GPUs · accessed via URL or file upload from any client |
| WER · clean English (read-speech benchmarks) | ~10–15% (small) · ~6–9% (large) | ~2.7% |
| Accent / noise robustness (real-world audio) | Trails Whisper noticeably | Strong — 680k hours of pretraining buys this |
| Streaming (real-time partials) | Yes — sub-second on Pi-class hardware | No — file/URL based, batch transcription |
| Diarization ("who said what") | Speaker x-vectors via separate model · basic | whisperX · production-grade speaker labels |
| URL ingestion (paste a YouTube link) | No — bring your own audio bytes | Yes — paste a YouTube / podcast / Drive URL |
| Languages (pre-trained models) | ~20 with shipped acoustic models | 99 (Whisper's training set) |
| License (how you can use it) | Apache 2.0 · commercial use fine | Hosted SaaS · subscription |
| Cost (to you) | $0 software · your edge device | $0 / $2 hr / $12 mo / $29 mo |
| Cloud round-trip (privacy / offline story) | None — fully on-device | Required — audio uploaded to Whipscribe servers |

WER ranges drawn from Vosk's own README test results, the Whisper paper's English-clean numbers, and the LibriSpeech / Common Voice community benchmarks tracked by Hugging Face's Open ASR Leaderboard (checked May 2026). Real-world WER varies with accent, codec, and domain — these are clean-audio averages.
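Since the table leans on WER, it's worth making the metric concrete. Here is a minimal sketch of word error rate — plain word-level Levenshtein distance over the reference — with no text normalization; real benchmarks normalize casing, punctuation, and number formats first, which is why published figures aren't directly comparable across papers:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = word-level edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("turn on the kitchen lights", "turn on kitchen light"))  # 0.4 (one deletion + one substitution)
```

A 10% WER means one word in ten is wrong, which on conversational audio often lands on the word you needed.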

The 50 MB number is the whole pitch for Vosk

The reason Vosk still has an enthusiastic following in 2026, half a decade after Whisper changed the field, is that one number: a working English recognizer in roughly 50 megabytes. Whisper Large-v3 is sixty times that. Whisper Tiny is one and a half times that and produces transcripts that are 10–15% WER — about the same as Vosk small — while being slower on a Pi because the Transformer decoder doesn't love CPU.

If you are shipping a voice-controlled smart speaker, a hospital bedside terminal, an in-car assistant, an accessibility tool that has to keep working in airplane mode, a kiosk in a place with intermittent connectivity, or an Android app where the model has to fit inside a sensible download, Vosk is the rational choice. Whisper does not fit that shape and never will. The architecture is wrong for the constraints.

The accuracy gap is the whole pitch for Whipscribe

The reverse argument is also clean. If you are not size-constrained — if you have an audio file or a URL, and you want the transcript to be correct — Whisper Large-v3 simply produces better text than Vosk does. The difference is biggest on accented speakers, conversational rather than read speech, lossy codecs like Zoom and phone audio, and noisy rooms.

For file-based, podcast-shaped, journalist-shaped, research-shaped audio, the gap between Vosk's ~10% WER and Whisper's ~2.7% WER is the difference between "draft I have to edit every paragraph of" and "transcript I can paste into a doc."

Worked example one — the smart-home microphone

Vosk wins this one

You're shipping a $79 smart-home device that listens for "lights on / lights off / play jazz"

The microphone has to wake on a CPU smaller than a phone's, the device must keep working when the wifi drops, and the latency target is under 300 ms from end-of-utterance to action. The vocabulary is roughly 200 commands. There is no internet round-trip in the budget — both for cost and for the privacy story you want on the box.

Whipscribe is the wrong tool here. It can't run on the device, the round-trip blows the latency budget, and you are paying for a 99-language Transformer to recognize twelve verbs. Vosk's small English model, with a custom n-gram language model trained on your 200 commands, will hit >95% accuracy at 100 ms streaming latency and add almost nothing to your BOM. This is the niche Vosk was designed for, and nothing in the Whisper family competes with it.
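A sketch of how that constrained recognizer gets wired up with Vosk's Python bindings: the grammar is a JSON array of allowed phrases passed as the third `KaldiRecognizer` argument. The command list here is an illustrative subset, and the model path assumes you have downloaded a small English model:

```python
import json

# Illustrative subset of the ~200-command vocabulary.
# "[unk]" absorbs out-of-grammar speech instead of forcing a bad match.
COMMANDS = ["lights on", "lights off", "play jazz", "stop", "[unk]"]

def make_grammar(phrases: list[str]) -> str:
    # Vosk expects the grammar as a JSON-encoded array of strings.
    return json.dumps(phrases)

grammar = make_grammar(COMMANDS)

# Wiring it into the streaming recognizer (needs `pip install vosk` and a
# downloaded model such as vosk-model-small-en-us-0.15):
#
#   from vosk import Model, KaldiRecognizer
#   rec = KaldiRecognizer(Model("vosk-model-small-en-us-0.15"), 16000, grammar)
#   while chunk := mic.read(4000):               # 16 kHz, 16-bit mono PCM
#       if rec.AcceptWaveform(chunk):
#           command = json.loads(rec.Result())["text"]
```

Constraining the grammar this way is what buys the accuracy on a tiny model: the recognizer only has to choose among your phrases, not all of English.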

Worked example two — the podcast backlog

Whipscribe wins this one

You have 60 hours of recorded interviews and you need clean transcripts with speaker labels

The audio is in the cloud or on your laptop. Speakers have a mix of American, British, and Indian accents. Some episodes were recorded over Zoom and have compression artifacts. You want SRT for captions, DOCX for editorial, and JSON for downstream tooling. You don't have a GPU box, you don't want to maintain a Kaldi runtime, and the transcripts will be quoted in articles where errors are embarrassing.

Vosk is the wrong tool here. The accuracy is not good enough on accented and compressed conversational audio, the diarization is rougher than whisperX, and you'd be writing the file pipeline, the long-audio chunker, the speaker-merge logic, and the export formatters yourself. Whipscribe does this end-to-end on Whisper Large-v3 plus whisperX — paste a URL or drop a file, get back transcripts with speaker labels at 2.7% WER. At 60 hours on the $29/month Team plan, the marginal cost of all 60 transcripts is roughly $3.50 of plan budget, with 440 hours still left in the month.
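The arithmetic behind that "$3.50" figure, for the skeptical, using the Team plan numbers quoted above:

```python
team_price_usd, team_hours = 29.0, 500   # Team plan: $29 / month, 500 hours included
backlog_hours = 60

marginal_cost = team_price_usd * backlog_hours / team_hours
hours_left = team_hours - backlog_hours

print(f"${marginal_cost:.2f} of plan budget, {hours_left} hours left")  # $3.48 of plan budget, 440 hours left
```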

Worked example three — the offline field journalist

Neither tool is a clean fit — but Vosk is closer

You're a reporter on a remote assignment with hours of interview audio and unreliable internet

You want rough drafts on the laptop right now and the polished, diarized transcript when you get back to wifi. The honest answer here is both: run Vosk locally for the rough draft you can read on the plane home, then re-run the same audio through Whipscribe when you have bandwidth for the publication-grade version. Vosk's small footprint means your laptop does the work without spinning fans for an hour the way local Whisper Large would; Whipscribe's accuracy means the version you actually quote from is correct.

This is the only segment where the two tools genuinely overlap, and it isn't really overlap — it's a relay.

Where Vosk's age starts to show

Three honest caveats on Vosk in 2026, from someone who likes the project:

  1. Community gravity has shifted. The most active speech-recognition work is now around Whisper, faster-whisper, whisperX, WhisperKit, and the Hugging Face ASR stack. Vosk's GitHub is still updated, but the pull-request volume, the third-party tooling, and the new-research-paper coverage have moved.
  2. Multilingual coverage is narrower. Vosk has solid models for ~20 languages. Whisper has 99 in one checkpoint. If you need Tamil or Swahili or Vietnamese with reasonable quality, Whisper is now the default starting point.
  3. Custom vocabulary still works, but it's more work than the modern alternatives. Vosk lets you constrain the recognizer with a custom JSGF grammar or a custom LM — this is one of its real strengths for command-and-control. Whisper's prompt mechanism is looser, but newer tools like whisper-prompted-decoding and external LM rescoring have closed enough of the gap that "I want a fixed vocabulary" is no longer a clear Vosk win for everything.

None of this makes Vosk a bad tool. It makes it a focused tool. The right way to think about it in 2026: Vosk is the Kaldi-era recognizer that survived because its niche — small, offline, streaming, embedded — is a niche Whisper structurally cannot serve.

Where Whipscribe is also not the right answer

To be fair in the other direction: Whipscribe is a hosted service with a cloud round-trip. Three places where that's wrong for you:

  1. Audio that cannot leave the device. Medical, legal, and compliance-bound recordings rule out uploading to anyone's servers, including ours.
  2. No reliable internet. Field work, embedded hardware, and airplane-mode tools need recognition that runs locally.
  3. Real-time streaming. Whipscribe is file- and URL-based batch transcription; it has no sub-second partials for live voice control.

The pricing comparison, on the same axis

The two tools are priced for different shapes, but it's worth seeing them next to each other.

| Tool / plan | What you get | What it costs |
|---|---|---|
| Vosk | Apache-2.0 software · pre-trained models for 20+ languages · runs on Pi-class hardware | $0 software + your hardware + your engineering |
| Whipscribe Free | 30 minutes / day, every day. No sign-up, no credit card. | $0 |
| Whipscribe Pay-as-you-go | Per-hour billing for spiky usage. Diarization included. | $2 / hour of audio |
| Whipscribe Pro | 100 hours / month. Right for one person clearing meetings, interviews, or a podcast backlog. | $12 / month |
| Whipscribe Team · 500 hr | 500 hours / month. Right for a podcast network, a research team, or a multi-hour-per-day inbound stream. | $29 / month |

The honest framing: if you're solving the embedded problem, Vosk's $0 is correct and Whipscribe's $29 is irrelevant — Whipscribe doesn't solve your problem. If you're solving the file-transcription problem, Vosk's $0 is a mirage because the integration time, the lower accuracy, and the missing pipeline (URL ingestion, diarization, exports, sharing) cost far more than $29 a month. The two tools aren't really competing on price; they're competing on shape.
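To make "competing on shape, not price" concrete, here is a small sketch that picks the cheapest Whipscribe option for a given monthly volume, using the prices in the table. It covers paid tiers only; the free tier's 30-minute daily cap makes it a different shape, and the break-even points are assumptions derived from the list prices, not an official calculator:

```python
PAYG_RATE = 2.0                # $ per hour of audio
PRO = ("Pro", 12.0, 100)       # name, $ / month, included hours
TEAM = ("Team", 29.0, 500)

def cheapest_plan(hours_per_month: float) -> tuple[str, float]:
    """Return (plan name, monthly cost) for the cheapest paid option."""
    options = {"pay-as-you-go": PAYG_RATE * hours_per_month}
    for name, price, included in (PRO, TEAM):
        if hours_per_month <= included:   # flat plans only fit within their cap
            options[name] = price
    best = min(options, key=options.get)
    return best, options[best]

print(cheapest_plan(4))    # ('pay-as-you-go', 8.0): Pro only wins above 6 h / month
print(cheapest_plan(60))   # ('Pro', 12.0): the podcast-backlog volume fits Pro too
print(cheapest_plan(300))  # ('Team', 29.0)
```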

Audio file or URL? Skip the Kaldi runtime
500 hours / month for $29 — Team plan

Whisper Large-v3 plus diarization on server GPUs. Paste a URL or drop a file. SRT, DOCX, VTT, JSON exports included. Speaker labels by default.

See pricing →

A short decision tree

If you're trying to figure out which one fits your project in under a minute:

  1. Is the device a Raspberry Pi, phone, or embedded board with no reliable internet? → Vosk.
  2. Do you need real-time streaming partials with sub-second latency? → Vosk.
  3. Is the audio recorded — a podcast, interview, meeting, lecture, YouTube video? → Whipscribe.
  4. Do you need diarization that you don't want to engineer? → Whipscribe.
  5. Do you have a URL — YouTube, podcast feed, Drive link — and want a transcript fast? → Whipscribe.
  6. Are you size-constrained to under a few hundred MB on the target device? → Vosk.
  7. Are you accuracy-constrained because the transcript will be quoted publicly? → Whipscribe.

If multiple answers point in opposite directions, you have two different problems and probably want both tools — Vosk for the on-device piece, Whipscribe for the file-based piece.

The honest summary

Vosk and Whipscribe are not competitors. Vosk is the right answer when your problem is shaped like a microphone on a constrained device. Whipscribe is the right answer when your problem is shaped like an audio file or a URL and you want the transcript to be correct. The mistake is using either tool for the other's job. We've never tried to compete with Vosk for embedded voice control, and you shouldn't try to use Vosk to transcribe a podcast in 2026.

Frequently asked

What is Vosk?

Vosk is an offline speech-recognition toolkit built on Kaldi, by Alpha Cephei. Apache-2.0 licensed, ships pre-trained acoustic models for 20+ languages, and is best known for its tiny footprint — the small English model is ~50 MB and runs in real time on a Raspberry Pi. It pre-dates Whisper and uses a different model family entirely.

How accurate is Vosk compared to Whisper Large-v3?

On clean English read speech, the small Vosk model is around 10–15% WER; the large Vosk model is around 6–9%. Whisper Large-v3 reports ~2.7% on the same kind of material. The gap widens on accented, conversational, and noisy audio. Vosk wins on size and latency; Whisper wins on accuracy.

Does Vosk run on a Raspberry Pi?

Yes. The 50 MB small models run in real time on a Pi 4, on flagship Android phones, on iOS, on cheap x86 boards, and inside browsers via WebAssembly. No GPU, no Python runtime, no cloud round-trip. This is the niche where Vosk is the rational answer in 2026.

Does Vosk support speaker diarization?

The Vosk API exposes speaker identification via x-vector embeddings if you load a separate speaker model, but it is not the polished diarization pipeline you get from pyannote-audio or whisperX. For podcasts, interviews, and meetings, Whisper plus whisperX produces noticeably better speaker labels.

Is Whipscribe built on Vosk?

No. Whipscribe runs OpenAI's Whisper Large-v3 on server GPUs via faster-whisper plus whisperX. Vosk and Whisper are different model families — Kaldi-based HMM-DNN versus Transformer encoder-decoder — and they serve different deployment shapes.

When should I pick Vosk over Whipscribe?

When audio cannot leave the device, when the device is a Raspberry Pi or phone or embedded board with no reliable internet, when you need real-time streaming with sub-second latency, or when you only need single-speaker voice control on a constrained vocabulary.

When should I pick Whipscribe over Vosk?

When you have audio files or URLs you want transcribed accurately with speaker labels, when accent and noise robustness matter, when you want diarization, SRT, DOCX, and JSON without engineering them yourself, or when you don't want to run a Kaldi runtime.

Can I use both?

Yes — the relay pattern is a real one. Run Vosk on the device for instant rough drafts in the field, then re-run the same audio through Whipscribe once you have bandwidth for the publication-grade transcript. The two tools cover different parts of the same workflow.

Audio file or URL? Skip the Kaldi runtime, get a Whisper Large-v3 transcript with speaker labels in minutes.

See pricing →