insanely-fast-whisper vs Whipscribe (2026): peak GPU throughput vs hosted product
insanely-fast-whisper, the open-source CLI by Vaibhav Srivastav, transcribes 150 minutes of audio in under 100 seconds on an RTX 4090. That is roughly 90 times real-time, and as of May 2026 it is the throughput ceiling for Whisper Large-v3 on a single consumer GPU. Whipscribe is the hosted product where the GPU, the queue, the URL ingest, the diarization, and the export pipeline are someone else’s problem. Same model family underneath, two completely different jobs. Below is the install reality, the break-even math, and the honest verdict on which side of the line you are on.
The decision in one paragraph
If you have an NVIDIA GPU box, more than 1000 hours of audio per month, and engineering time to build the pipeline above raw inference, insanely-fast-whisper wins on per-call cost. If you do not own GPU infrastructure, your volume sits below that line, or you want a URL-ingest-to-DOCX product instead of a CLI to wire into your own queue, Whipscribe is the right tool. There is no third option that is both cheaper than self-hosting at scale and easier than a hosted endpoint at low volume — those two lines cross around the 1000 hours-per-month mark, and that crossover is what this post is about.
What insanely-fast-whisper actually gives you
The CLI is small on purpose. Underneath it stitches together three pieces that, until 2024, took experienced ML engineers a week to wire up properly:
- Hugging Face Transformers as the model loader and tokeniser, pinned to a Whisper Large-v3 (or distil-large-v3) checkpoint on the Hub.
- Flash-Attention-2 for the attention kernel — the IO-aware exact-attention implementation from Tri Dao’s 2023 paper. On Hopper and Ada-Lovelace GPUs (H100, RTX 4090, RTX 6000 Ada) this is the reason throughput jumps roughly 2x over a vanilla PyTorch attention pass.
- BetterTransformer / SDPA as the fallback attention path when Flash-Attention-2 is not available — Turing and older cards it does not support, boxes where no prebuilt flash-attn wheel matches the PyTorch / CUDA combination, and any GPU where the user has skipped the flash-attn install.
The CLI also does the practical things: --batch-size for tuning the encoder pass to your VRAM budget, --device-id for picking a GPU on a multi-GPU box, --diarization_model for plugging in pyannote (the diarization step is shelled out, not first-class), and a JSON output mode with word-level timestamps. License is Apache-2.0, distribution is PyPI under the name insanely-fast-whisper, and the canonical install path is pipx install insanely-fast-whisper.
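As a point of reference, a typical invocation looks roughly like the sketch below. The flags are the ones named above plus the project’s documented defaults; flag spellings occasionally shift between releases, so trust insanely-fast-whisper --help on your installed version over this example.

```bash
# Sketch of a typical run: transcribe one file on GPU 0 with word-level
# timestamps and pyannote diarization. File names are placeholders; the
# diarization step needs a Hugging Face token with access to the pyannote weights.
insanely-fast-whisper \
  --file-name interview.wav \
  --device-id 0 \
  --batch-size 24 \
  --timestamp word \
  --diarization_model pyannote/speaker-diarization-3.1 \
  --hf-token "$HF_TOKEN" \
  --transcript-path interview.json
```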
The 90x real-time number, with the asterisks
The headline benchmark in the project README is 150 minutes of audio in roughly 98 seconds on an RTX 4090 — about 92 times real-time. That number is real, and it is reproducible if you set up the box exactly the way the README does. The asterisks matter:
| GPU | Realised real-time factor | What changes |
|---|---|---|
| RTX 4090 · 24 GB · Ada Lovelace | ~90x | Flash-Attention-2 native; published headline number; the consumer ceiling. |
| H100 · 80 GB · Hopper | 100x+ | Flash-Attention-2 with FP8 path; faster but not 5x faster — the encoder is the bottleneck. |
| A100 · 40 / 80 GB · Ampere | ~70–80x | Flash-Attention-2 supported; older tensor cores; common rented-cloud baseline. |
| RTX 3090 · 24 GB · Ampere | ~40–55x | Flash-Attention-2 works but slower kernels; the practical “own a 4090” alternative. |
| 12 GB consumer card (RTX 4070, RTX 3080 Ti) | ~25–40x | Lower batch size; you trade VRAM for throughput. Still very fast on Large-v3. |
| CPU only | not supported | The project is explicit: GPU required. Use whisper.cpp for CPU paths. |
Numbers are community-reported medians across the project’s GitHub issues and the README; clean batched English audio with the encoder fully amortised. Multilingual, very short, or very noisy clips realise less.
Two practical points. First, the encoder pass is the fixed cost. For very short clips (under ~30 seconds) the encoder warm-up dominates and the realised throughput collapses; this is a batch tool, not a streaming one. Second, batch size matters more than people expect. The default batch size of 24 is sized for a 24 GB card; cut it to 4 or 8 on a 12 GB card and you reclaim VRAM but lose roughly half the throughput. Always benchmark your actual workload on your actual card before sizing capacity.
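A quick way to run that benchmark is to time one representative file at a few batch sizes while watching VRAM. A minimal sketch, with the file name and batch-size list as placeholders:

```bash
# Sketch: time the same representative file at several batch sizes.
# Keep nvidia-smi open in a second terminal to watch peak VRAM per setting.
for bs in 4 8 16 24; do
  echo "--- batch size $bs ---"
  time insanely-fast-whisper \
    --file-name representative-episode.wav \
    --batch-size "$bs" \
    --transcript-path "/tmp/out-bs${bs}.json"
done
```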
The install reality
One pipx command, then a fight with CUDA. That is the honest summary.
pipx install insanely-fast-whisper

The CLI itself installs cleanly. The interesting half is what happens when it tries to import flash_attn at first run. flash-attn is published as a binary wheel, but only for specific PyTorch builds, specific Python versions, and specific CUDA toolkits. If your environment does not match the precompiled grid, pip falls back to compiling from source — which needs nvcc on PATH, a matching CUDA 12.x toolchain, and roughly 5 to 30 minutes of compile time on a fresh box. Operators who have done this before pin all three versions in a Dockerfile and never look at it again. Operators who have not are usually surprised the first time.
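For reference, the pinned approach looks roughly like this: the commands you would bake into a Dockerfile layer. The version numbers below are illustrative placeholders, not a tested matrix; pick a combination the flash-attn release actually publishes wheels for, then freeze it.

```bash
# Sketch: pin all three pieces so the flash_attn import never surprises you.
# Version numbers are placeholders -- match them to flash-attn's published wheel grid.
pip install "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1 build
pip install flash-attn --no-build-isolation   # falls back to a 5-30 min nvcc compile if no wheel matches
pip install insanely-fast-whisper
```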
What you build on top of the CLI
This is where the “free OSS” framing breaks down. insanely-fast-whisper hands you a transcript per file. A production transcription product needs roughly the following layers above that:
- URL ingest — yt-dlp for YouTube, podcast feed parsers, Drive / Dropbox / S3 clients, format conversion to 16 kHz mono WAV. None of this is in the CLI; a minimal sketch of this step follows below.
- Chunking and stitching — Whisper officially handles up to 30 seconds of audio per forward pass; the CLI does the chunking but you own the seam-fixing logic for sentence boundaries that fall on chunk edges.
- Diarization — pyannote runs as a separate model with separate weights, separate license, and separate failure modes. Wiring it to Whisper word timestamps so the speaker labels actually align is its own project.
- Exports — SRT, VTT, DOCX, JSON each have their own quirks. SRT line-wrapping for broadcast use is a half-day of code by itself.
- Queue and retries — Celery, Dramatiq, RQ, Temporal, your choice; you also own the retry-on-CUDA-OOM logic, the backoff on flaky network, and the dead-letter queue.
- Multi-tenancy — auth, usage metering, rate limits, per-tenant storage isolation, audit logs. If anyone other than you is going to see the output, this is mandatory.
- Monitoring — Prometheus exporters, GPU utilisation alerts, queue-depth dashboards, model-load-time histograms.
None of this is hard. All of it is real engineering time. A single experienced engineer can build a credible v0 in 2 to 4 weeks; a production-grade version with the rough edges sanded usually takes 2 to 3 months and a permanent ~10% slice of someone’s time after that to keep it alive.
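To make the first of those layers concrete, here is a rough sketch of the URL-ingest step on its own: yt-dlp plus ffmpeg, with the URL and file names as placeholders. Everything after it (seam-fixing, diarization alignment, exports, retries, metering) is still yours to build.

```bash
# Sketch: pull audio from a URL and normalise it to 16 kHz mono WAV before
# handing it to the CLI. The URL and paths are placeholders.
yt-dlp -x --audio-format wav -o "raw.%(ext)s" "https://example.com/podcast/episode-42"
ffmpeg -y -i raw.wav -ar 16000 -ac 1 prepared.wav

insanely-fast-whisper --file-name prepared.wav --transcript-path episode-42.json
```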
The decision matrix
| Dimension | insanely-fast-whisper | Whipscribe |
|---|---|---|
| What it is | A CLI you run on your own GPU box | A hosted product (web, API, MCP, Chrome extension) |
| Peak throughput | ~90x real-time on RTX 4090 | Bounded by your network upload, not the GPU |
| Hardware required | NVIDIA GPU, 12 GB+ VRAM, CUDA 12.x | A browser. Or our REST API. Or our MCP server. |
| Install | pipx + flash-attn compile (5–30 min on fresh box) | None. Open the page, drop a file. |
| License / cost | Apache-2.0, free + your GPU bill | $0 free tier, $2/hr PAYG, $12 / $29 monthly |
| URL ingest (YouTube, podcasts) | Build it yourself | Built in (paste a URL) |
| Diarization | Shells out to pyannote, you wire it up | WhisperX-based, on by default on every paid tier |
| Exports (SRT/VTT/DOCX/JSON) | JSON only out of the box | All formats included |
| Queue / retries / multi-tenant | You own the platform code | Operated for you |
| Streaming / real-time | Not the design point (batch tool) | Not yet (batch product) |
| MCP server for LLM agents | Build your own | Live at /claude |
| Languages | 99 (whatever Whisper supports) | 99 (same model family) |
| Sweet spot | 1000+ hours/month batch jobs on owned GPU | Below that volume, or anyone without a GPU |
The break-even math, with numbers
Here is the worked example that operators usually want first.
Scenario: a podcast network with 1000 hours of audio per month
The cost to run insanely-fast-whisper on a dedicated rented RTX 4090 box (typical 2026 pricing from cloud providers like RunPod, Lambda, or Vast.ai is in the $0.30 to $0.50 per hour range for community-tier 4090s, $200 to $400 per month for a reserved instance):
- Compute: 1000 hours of audio at 90x real-time = 11.1 hours of GPU time. At $0.40 per GPU-hour rented, that is roughly $4.50 per month in raw inference. Even budgeting for 5x overhead from cold starts, retries, and idle time, the GPU bill is under $25.
- Reserved capacity: if you reserve a $250-per-month 4090 instance for predictability, that is your floor. Per hour of audio, $250 / 1000 = $0.25 per hour.
- Engineering: the pipeline above (URL ingest, queue, diarization, exports, monitoring) costs an engineer roughly 4 weeks to build credibly and roughly 4 hours per week to keep alive after that. At $100 to $200 per engineer-hour fully-loaded, that is a $25k to $50k one-time cost plus $1.5k to $3k per month ongoing.
Run the same workload on Whipscribe Team:
- 500 hours / month for $29 (the Team tier), times two seats to cover 1000 hours = $58 per month, all in. URL ingest, diarization, exports, retries, MCP server included. Zero engineering.
- If the workload is exactly 1000 hours, PAYG at $2 per hour would be $2000 per month — Team is the right tier here, not PAYG.
Scenario: a journalist with 10 hours of interviews per month
insanely-fast-whisper would cost more in a single afternoon of CUDA debugging than a year of Whipscribe Pro at $12 per month. This is not the audience the project is for, and the project does not pretend otherwise. Use Whipscribe.
Scenario: a research lab with 50 hours per month and an existing GPU cluster
If the GPUs are already there and idle, insanely-fast-whisper is essentially free compute. If the lab does not need URL ingest, diarization, or polished exports, the CLI is the right answer. If they want any of those, Whipscribe Pro at $12 per month is cheaper than the engineering time to bolt them on.
Free 30 min/day. $2/hr PAYG. $12/mo Pro for 100 hours. $29/mo Team for 500 hours. No flash-attn compile.
See pricing →
When insanely-fast-whisper is the right call
You operate the GPU yourself
- Call-center analytics platforms processing 5,000+ hours/month
- Broadcast captioning vendors with their own DC racks
- ML labs preparing training datasets from audio archives
- SaaS products where transcription is a feature, not the product
- Anyone who already has a CUDA platform team and an idle 4090 / A100 / H100
When Whipscribe is the right call
You want a transcript, not a pipeline
- Podcasters, journalists, researchers, founders, students
- Anyone below ~1000 hours of audio per month
- Anyone who does not already own a GPU box
- LLM-agent integrations that need an MCP server today
- Teams that want URL ingest, diarization, and DOCX export without writing them
Pricing — what each side actually costs
| Plan | What you get | What it costs |
|---|---|---|
| insanely-fast-whisper | The CLI itself, Apache-2.0. You provide the GPU. | $0 + your GPU bill ($200–$400/mo for a rented 4090, plus engineering) |
| Whipscribe Free | 30 minutes / day, every day. No sign-up, no credit card. | $0 |
| Whipscribe PAYG | Per-hour billing for spiky usage. Diarization included. | $2 / hour of audio |
| Whipscribe Pro | 100 hours / month. Right for one person clearing a backlog. | $12 / month |
| Whipscribe Team · 500 hr | 500 hours / month. Right for a podcast network or research team. | $29 / month |
Can I use both?
Yes — and a non-trivial number of operators do. The hybrid pattern looks like this: insanely-fast-whisper on an in-house batch box for the historical archive (overnight processing, no SLA, cheap), Whipscribe as the customer-facing endpoint, the MCP server an LLM agent calls, and the burst-capacity fallback when the in-house GPU is queued or down. The two are not substitutes; one is a max-throughput inference engine, the other is the product wrapped around inference. Most teams that take transcription seriously end up running both in different places.
Frequently asked
What is insanely-fast-whisper?
An open-source CLI by Vaibhav Srivastav that wraps Hugging Face Transformers, Flash-Attention-2, and BetterTransformer SDPA into a one-command Whisper-Large-v3 runner. It hits roughly 90 times real-time on an RTX 4090 — 150 minutes of audio in ~98 seconds. Apache-2.0, published on PyPI, GPU-only.
How fast is it compared to faster-whisper?
On a top-end GPU, insanely-fast-whisper’s Flash-Attention-2 path is roughly 3 to 5 times faster than faster-whisper’s CTranslate2 path on Large-v3. faster-whisper is the more common production choice because it runs on smaller cards, installs without a flash-attn compile, and has a friendlier Python API. insanely-fast-whisper wins on a beefy GPU; faster-whisper wins on a typical one.
Why is the install painful?
flash-attn ships precompiled wheels only for specific PyTorch and CUDA combinations. Step off that grid and pip falls back to a 5–30 minute source compile that needs nvcc on PATH and a matching CUDA 12.x toolchain. Pin your versions in a Dockerfile and the problem disappears; do it ad-hoc and it surprises you every time.
Does insanely-fast-whisper run on a Mac or on CPU?
No. The project is NVIDIA-GPU-only by design; CPU paths are not the goal. For Apple Silicon, the right tool is whisper.cpp or WhisperKit. For CPU, whisper.cpp again.
When is it cheaper than Whipscribe?
Once your monthly volume crosses roughly 1000 hours of audio and you already employ a platform engineer comfortable with CUDA. Below that volume, Whipscribe Pro or Team is cheaper than the engineering time to build the pipeline above raw inference. Above ~5000 hours per month, self-hosting is almost always cheaper.
Can I use insanely-fast-whisper for streaming or real-time captioning?
Not really. The CLI is a batch tool — the encoder warm-up dominates short-clip throughput, and there is no streaming API. For live captioning, the appropriate stack is a streaming-first model (Deepgram, Speechmatics, or AssemblyAI’s real-time tier) rather than a batched Whisper pass.
Does Whipscribe support diarization, SRT, DOCX, JSON, and URL ingestion?
Yes — all by default on every paid tier and on the daily 30-minute free allowance. Paste a YouTube URL or upload a file, get back TXT / SRT / VTT / DOCX / JSON with speaker labels and word-level timestamps.
Can I call Whipscribe from a Python or Node app the same way I call insanely-fast-whisper?
Yes — Whipscribe exposes a REST API and an MCP server. Same audio in, same JSON out, no GPU on your side. If you are migrating an existing insanely-fast-whisper pipeline because the GPU bill or operator load is no longer worth it, the swap is one HTTP call.
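As an illustration of what that swap can look like, here is a hedged sketch. The endpoint path, parameter names, and field names below are placeholders, not Whipscribe’s documented API; check the actual API reference before wiring anything up.

```bash
# Illustrative only: URL and field names are placeholders, not the documented API.
curl -X POST "https://api.whipscribe.example/v1/transcriptions" \
  -H "Authorization: Bearer $WHIPSCRIBE_API_KEY" \
  -F "file=@interview.wav" \
  -F "diarization=true" \
  -F "format=json"
```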
If your monthly volume sits below the break-even, skip the CUDA fight. Same Whisper model family on server GPUs, plus URL ingest, diarization, MCP, and an extension — for $12 to $29 a month.
See pricing →