distil-whisper vs Whipscribe (2026): a 6× faster English engine vs a hosted multilingual pipeline
distil-whisper is Hugging Face's distilled version of Whisper Large-v3 — about 6× faster on CPU short-form, 49% fewer parameters, within roughly one point of WER on out-of-distribution English. It is, very specifically, a faster engine. Whipscribe is the rest of the car: a hosted pipeline that takes a URL or a file, runs Whisper Large-v3 plus WhisperX diarization on a server GPU, and hands back transcripts with speaker labels, timestamps, and exports. This is a piece-vs-product decision. Below is the honest read on which one is right for which job.
The two things at a glance
Numbers from the Hugging Face distil-whisper repo, the project's arXiv paper ("Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling"), and the model cards for distil-large-v3, distil-medium.en, and distil-small.en. CPU figures are short-form (<30s) on a single thread; GPU and long-form gaps are smaller. Real numbers depend on hardware, batch size, and chunking.
What distil-whisper actually gives you
Three numbers tell the story.
- 6× faster on CPU short-form. The benchmarks published in the distil-whisper repo and the distil-large-v3 model card put distil-large-v3 at about 6.3× the throughput of Whisper Large-v3 on short-form audio on a CPU. The distillation cuts the decoder from 32 transformer layers to 2; the encoder stays full-fat. Because the decoder is the autoregressive bottleneck, shrinking it is where most of the speedup comes from.
- 49% fewer parameters. distil-large-v3 is roughly 756M parameters vs Whisper Large-v3's 1,550M. Smaller weights mean less RAM, less VRAM, smaller container images, and a dramatically better fit on edge devices. The full model loads in well under 2 GB at FP16.
- About 1 point of WER, on out-of-distribution English. The paper reports word error rates within 1% of Whisper Large-v3 on OOD evaluation sets, and within tenths of a point on in-distribution clean speech (LibriSpeech). For most English content — meetings, podcasts, customer-support calls, broadcast — the accuracy gap is invisible to a reader.
What you get on top of those numbers: an MIT license, native support in Hugging Face transformers, easy quantisation (bitsandbytes, ONNX, GGUF via whisper.cpp ports), and the option to run it via faster-whisper / CTranslate2 for further speedups. That's a serious open-source engine.
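To make the integration story concrete, here is a minimal sketch of running distil-large-v3 through the transformers ASR pipeline. It assumes a recent transformers release; the file name is illustrative, and the chunk length and batch size are starting values to tune rather than a benchmark configuration.

```python
import torch
from transformers import pipeline

# Minimal sketch: distil-large-v3 via the transformers ASR pipeline.
# "meeting.wav" is an illustrative path; tune chunk_length_s / batch_size
# for your hardware rather than treating these values as optimal.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

result = asr(
    "meeting.wav",
    chunk_length_s=25,        # chunked long-form decoding window
    batch_size=8,             # chunks processed per forward pass
    return_timestamps=True,
)
print(result["text"])
```

The faster-whisper / CTranslate2 and whisper.cpp routes skip transformers entirely; the pipeline above is simply the lowest-friction way to sanity-check the model.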
What distil-whisper is not
The model is the engine. Everything that turns an engine into a usable transcription product is your problem to build.
- No diarization. distil-whisper transcribes; it does not tell you who said what. To add speaker labels you bolt on pyannote.audio or use the WhisperX project's diarization pipeline, then align timestamps. That's another model, another set of weights, and another integration test surface.
- No URL ingestion. You hand the model a Mel-spectrogram. Pulling audio from a YouTube link, a podcast feed, a Vimeo video, or a Loom share is your job: yt-dlp, podcast RSS feeds, a job queue, a download worker. With YouTube specifically, blocks on datacentre IPs are a real operational headache for production servers.
- No exports. The model returns text and timestamps. SRT, VTT, DOCX, JSON, SBV, TTML, LRC — every format your users will ask for is a serializer you write (a minimal SRT sketch follows this list).
- No long-form chunking. Whisper architectures look at 30-second windows. For an hour-long file you need a chunking strategy with overlap, hallucination guards, and sentence-boundary stitching. Hugging Face ships a long-form pipeline, but it has its own quirks and isn't free of edge cases.
- No UI, no auth, no storage. Whatever surface a non-engineer will paste a URL into, you build. Whatever lets a journalist re-open last week's transcript, you build. Whatever shows a $-meter to your finance team, you build.
- Mostly English-only checkpoints today. distil-large-v3 is English-trained; a multilingual distillation has been signalled but is not the default shipping artifact. For Spanish, Hindi, Mandarin, Tagalog, Arabic, German — the safer answer in 2026 is still the teacher (Whisper Large-v3) or a hosted service that runs it.
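To make the exports point concrete, here is a minimal SRT serializer over chunk-level timestamps of the kind the transformers pipeline returns with return_timestamps=True. It is a sketch under simple assumptions: no line-length limits, no cue merging, no handling of missing or overlapping timestamps; every other format in that list needs its own equivalent.

```python
def to_srt(chunks):
    """Serialize [(start_seconds, end_seconds, text), ...] into an SRT string.

    Minimal sketch: real exporters also enforce line-length limits,
    merge or split cues, and deal with missing or overlapping timestamps.
    """
    def ts(seconds):
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, (start, end, text) in enumerate(chunks, start=1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n")
    return "\n".join(blocks)


# The transformers pipeline returns result["chunks"] as
# [{"timestamp": (start, end), "text": ...}, ...], so:
# srt = to_srt([(c["timestamp"][0], c["timestamp"][1], c["text"])
#               for c in result["chunks"]])
```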
Side-by-side, with no varnish
| Dimension | distil-whisper | Whipscribe |
|---|---|---|
| Shape of the thing | A model (3 checkpoints) on Hugging Face Hub | A hosted product — web, API, MCP, Chrome extension |
| Underlying ASR | Distilled Whisper Large-v3 (2-layer decoder) | Whisper Large-v3 (32-layer decoder) + WhisperX |
| Speed vs Whisper Large-v3 | ~6× on CPU short-form, smaller speedup on GPU | Server-GPU latency — minutes for an hour of audio |
| Multilingual coverage | English-focused; multilingual distillation in progress | 99 languages via Large-v3 (full multilingual) |
| Out-of-distribution accuracy | ~1 pt WER below Large-v3 on OOD English | Full Large-v3 accuracy; preferred for noisy / non-English |
| Speaker diarization | Not included — bolt on pyannote / WhisperX yourself | Included on every paid tier |
| URL ingestion (YouTube etc.) | Build it yourself with yt-dlp + queue | Paste a URL — handled server-side |
| Exports (SRT, VTT, DOCX, JSON) | Write the serializers | Built-in |
| Long-form chunking + hallucination guards | Roll your own (HF pipeline gets you started) | Production-tuned chunking + reconciliation |
| Hardware to run it | Your CPU / GPU; 8 GB VRAM comfortable for production | Ours |
| Pricing | Free model + your hardware + your dev time | $2/hr PAYG · $12/mo Pro (100 hr) · $29/mo Team (500 hr) |
| License | MIT | Commercial SaaS |
| Audience | ML engineers, infra teams, edge developers | Anyone with audio (podcaster, journalist, researcher, agent) |
The honest summary of the table: distil-whisper is the right answer if "transcription" is something your engineering org owns end-to-end. Whipscribe is the right answer if "transcription" is something you want to consume.
When distil-whisper is the right call
Four shapes of work where distil-whisper genuinely wins:
- High-throughput English-only batch. A call-centre with millions of minutes of recorded English support calls. A broadcast captioning pipeline running through archival footage. A media monitoring service consuming hundreds of US-English podcasts a day. The combination of "English" and "throughput per dollar" is exactly the surface distil-whisper was distilled for.
- Edge inference where the parameter count gates feasibility. A Raspberry Pi 5 doing live transcription. A Jetson Nano in a kiosk. An on-device feature in a desktop app where you don't want to ship a 3 GB model. A whisper.cpp port of distil-large-v3 cuts disk and memory in half — sometimes the difference between "ships" and "doesn't ship."
- You already have the pipeline. If you've built a transcription product around Whisper or faster-whisper and the bottleneck is now compute cost, distil-whisper is a near-drop-in model swap that gives you a 4–6× CPU speedup with a barely-perceptible quality regression on English. That's a high-leverage migration with a small surface area (see the sketch after this list).
- Privacy or air-gap requirements. Anything that legitimately can't leave the customer's network. distil-whisper runs locally; the inference surface is yours to lock down. A hosted pipeline can't satisfy "audio never leaves this VPC" — distil-whisper can.
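For the near-drop-in swap described above, the change can be as small as one model name. A sketch using faster-whisper, assuming a release recent enough to include the distilled checkpoints (1.0 or later); the file path and decoding options are illustrative, not recommendations.

```python
from faster_whisper import WhisperModel

# Before: WhisperModel("large-v3", ...).  After: the distilled checkpoint,
# with the rest of the pipeline unchanged.  Assumes faster-whisper >= 1.0,
# which added the CTranslate2 conversions of the distil-whisper models.
model = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "support_call.wav",   # illustrative path
    language="en",        # the distil checkpoints are English-focused
    beam_size=1,
)
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```

VAD filtering, diarization, and exports all sit outside the model call, which is why the migration surface stays small.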
When Whipscribe is the right call instead
The shape of the work is different. You're not running an inference pipeline; you have audio and you need transcripts.
- Multilingual content. A journalist interviewing sources in three languages. A research team studying global media. A founder reading transcripts of customer calls in Mexico, Brazil, and the Philippines. distil-whisper's English specialisation is a feature for English-only teams — for everyone else, it's a regression. Whipscribe runs full Whisper Large-v3 across 99 languages with no accuracy compromise.
- You want the product, not the engine. A podcaster who records a weekly episode. A grad student transcribing fieldwork interviews. A YouTube creator generating captions and chapter summaries. A founder turning sales calls into searchable notes. None of these people benefit from owning a model. They benefit from pasting a URL and getting a transcript with speaker labels.
- You want diarization, exports, and URL ingestion without writing them. Speaker labels, SRT, VTT, DOCX, and JSON exports, YouTube / Vimeo / Loom URL ingestion — all of these are line items in a distil-whisper integration plan and built into Whipscribe.
- You want an MCP-callable transcription endpoint. If your workflow lives in Claude Desktop or Cursor, Whipscribe's MCP server lets your AI agent transcribe URLs and files directly — no server, no glue code. There is no MCP layer over distil-whisper today; you'd be writing it.
- The audio quality is variable. Out-of-distribution English — noisy phone calls, heavy regional accents, domain jargon, court recordings — is exactly where distil-whisper's ~1-point WER gap shows up most. For high-stakes transcription (legal, medical, journalism) the teacher model's accuracy is worth paying for.
Worked example: a 200-hour-per-month US podcast network
Let's make the choice concrete. You run a small podcast network: 15 shows, mostly US English, totalling about 200 hours of audio per month. You need transcripts for show notes, SEO pages, and a search-the-archive feature. You're choosing between rolling distil-whisper on your own infrastructure and using Whipscribe Team.
distil-whisper, self-hosted on a single GPU
| Item | Estimate |
|---|---|
| L4 GPU instance (24 GB VRAM, on-demand cloud) | ~$0.80 / hr × 730 hr/mo |
| Cloud GPU monthly cost | ~$584 / mo |
| Storage + egress + spot-failover overhead | ~$40 / mo |
| Engineer time (build + maintenance, amortised — pipeline, queue, diarization bolt-on, retries, monitoring) | 10 hr/mo × your loaded rate |
| Total (hardware only, before engineer time) | ~$624 / mo |
Assumes you keep one L4 warm 24/7; spot pricing or auto-scaling can cut the hardware bill by roughly 60%, but adds engineering complexity. Diarization via pyannote on the same GPU is feasible but adds latency.
Whipscribe Team — 500 hours included
| Item | Cost |
|---|---|
| Monthly subscription | $29 / mo |
| Diarization, exports, URL ingest | Included |
| Engineer time | 0 — paste URL or call API |
| Total | $29 / mo |
200 hr of audio fits comfortably within the 500-hr Team allowance. Per hour of audio: $0.145.
The hardware-only cost gap is roughly 20×. Once you add engineer time — building the YouTube ingest, the SRT serializer, the diarization alignment, and the on-call rotation when the GPU goes bad at 2 a.m. — the gap is much wider.
This is not an argument that distil-whisper is wrong. It's an argument that distil-whisper is right when 200 hours/month is the floor, not the ceiling — when you're absorbing 5,000 or 50,000 hours and the per-hour math flips. For a 200-hour podcast network, the math says buy the hours, finish the backlog, and put the engineering capacity into your show.
Same Whisper Large-v3 family. Server GPUs. Diarization, SRT / VTT / DOCX / JSON exports, URL ingestion, MCP endpoint — all included. The pipeline already exists.
See pricing →
The honest place where distil-whisper wins outright
To stay fair to a genuinely good open-source release: there is a class of work where distil-whisper is the correct answer and Whipscribe is the wrong one.
- Air-gapped or on-prem deployments. Hospitals, government, defence, finance compliance — anywhere audio must not leave the network. A hosted product structurally cannot serve this. distil-whisper does, on hardware you already own.
- Embedded / edge use cases at scale. An app that ships local English transcription on a million devices. Whipscribe is hosted; distil-whisper is the right tech.
- Massive-scale English ingestion where per-call cost dominates. Once you're processing tens of thousands of hours per month and you have an ML team that already runs models in production, the per-hour math tilts toward owning the engine. distil-whisper is the cheapest production-quality way to do that on English audio in 2026.
- You're building a transcription product yourself. If transcription is the product you're selling, you want to own the model. distil-whisper is a defensible base because most of your competitors are still on Whisper Large-v3 and pay for the speed difference every month.
The hybrid pattern that's quietly common
Some teams run both. distil-whisper handles the high-throughput English-only batch — call-centre archives, podcast back-catalogues, broadcast captioning — and Whipscribe takes everything else: non-English, ad-hoc requests from non-engineers, MCP calls from internal AI agents, anything where the marginal hour isn't worth a pipeline maintainer's attention.
The reason that pattern works is that the two tools answer different questions. distil-whisper is "we own the inference and we know what we're doing." Whipscribe is "we want a transcript and we have other work." Most companies have audio of both kinds.
Frequently asked
What exactly is distilled in distil-whisper?
The decoder. Whisper Large-v3's decoder has 32 transformer layers; distil-large-v3's has 2. The encoder is kept full-fat because that's where the acoustic understanding lives — shrinking it costs accuracy fast. Hugging Face trained the smaller decoder using teacher–student distillation on roughly 22,000 hours of pseudo-labelled audio. The result is roughly 6× faster on CPU with a ~1-point WER gap on out-of-distribution English.
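If you want to check the shape claim yourself, the configuration files on the Hub expose it directly. A quick sketch using the published model IDs; the printed values are what the configs report at the time of writing.

```python
from transformers import AutoConfig

teacher = AutoConfig.from_pretrained("openai/whisper-large-v3")
student = AutoConfig.from_pretrained("distil-whisper/distil-large-v3")

# Encoder depth is preserved; only the decoder is distilled down.
print(teacher.encoder_layers, teacher.decoder_layers)  # 32 32
print(student.encoder_layers, student.decoder_layers)  # 32 2
```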
How much faster is distil-whisper on a GPU?
The CPU speedup of ~6× is the headline number, but it's the friendliest case. On a modern GPU, Whisper Large-v3 already runs efficiently — the 32-layer decoder isn't the bottleneck because the GPU's parallelism hides a lot of the cost. Real-world GPU speedups for distil-large-v3 range from roughly 1.5× to 3× depending on batch size and chunking. The bigger GPU win is memory: half the parameters means more concurrent streams per card.
Can I run distil-whisper in the browser?
Yes — Hugging Face's Transformers.js runs quantised ONNX builds of distil-medium.en and distil-small.en in modern browsers via WASM. It's not as fast as native, but it's the only credible "Whisper-quality" option that runs entirely client-side. For production traffic, server-side distil-whisper or a hosted API is still more reliable.
Is distil-whisper better than faster-whisper?
They're complementary. faster-whisper is a CTranslate2 reimplementation of OpenAI's Whisper that's faster than the reference implementation at the same accuracy. distil-whisper is a smaller, distilled model. You can run distil-whisper through faster-whisper and stack the speedups. For pure throughput on English the combination is one of the strongest open-source recipes available in 2026.
Does Whipscribe also support distil-whisper under the hood?
Whipscribe's production pipeline is Whisper Large-v3 plus WhisperX for word-level alignment and diarization. We've benchmarked distil-large-v3 internally; its English accuracy is excellent and we may use it for specific routes (e.g., an explicit "fast English-only" tier), but the default pipeline runs the teacher model so multilingual users and high-stakes transcripts get full Large-v3 quality.
How do I add diarization to distil-whisper?
The standard recipe is to run pyannote.audio's speaker-diarization pipeline alongside transcription, then align the speaker timeline to the word-level timestamps. The WhisperX project automates this for Whisper-family models, including distilled variants. Expect to add a second model (~1 GB), a second inference pass, and an alignment step. Whipscribe ships diarization built in.
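A rough sketch of that recipe, assuming you already have ASR output as (start, end, text) tuples and access to the gated pyannote/speaker-diarization-3.1 pipeline; the overlap heuristic below (assign each segment to the speaker with the most overlapping speech) is one common choice, not the only one.

```python
from pyannote.audio import Pipeline

# Requires accepting the model's gated-access terms and a Hugging Face token.
diar_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_your_token_here",  # placeholder
)
diarization = diar_pipeline("interview.wav")  # illustrative path


def label_speakers(asr_segments, diarization):
    """Assign each ASR segment the speaker whose turns overlap it most."""
    labelled = []
    for start, end, text in asr_segments:
        overlap = {}
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            dur = min(end, turn.end) - max(start, turn.start)
            if dur > 0:
                overlap[speaker] = overlap.get(speaker, 0.0) + dur
        best = max(overlap, key=overlap.get) if overlap else "UNKNOWN"
        labelled.append((best, start, end, text))
    return labelled
```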
Can I fine-tune distil-whisper on my domain?
Yes. The model is on Hugging Face Hub under MIT, and the Hugging Face team has published a fine-tuning recipe for Whisper-family models that works for the distilled variants too. For domain-specific English audio (medical, legal, technical jargon) a fine-tune on a few hundred hours of in-domain data tends to close most of the accuracy gap to Large-v3 and sometimes exceeds it.
Where can I read the original distil-whisper paper?
"Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling" by Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush (Hugging Face, 2023). Published on arXiv. The repo at github.com/huggingface/distil-whisper has the README, training scripts, and links to all three model checkpoints on Hugging Face Hub.
Run distil-whisper for English throughput. Use Whipscribe for everything else — multilingual, diarization, URL ingestion, MCP, exports — without owning a GPU.
See pricing →