insanely-fast-whisper vs Whipscribe (2026): peak GPU throughput vs hosted product

May 8, 2026 · Neugence · 13 min read

insanely-fast-whisper, the open-source CLI by Vaibhav Srivastav, transcribes 150 minutes of audio in under 100 seconds on an RTX 4090. That is roughly 90 times real-time, and as of May 2026 it is the throughput ceiling for Whisper Large-v3 on a single consumer GPU. Whipscribe is the hosted product where the GPU, the queue, the URL ingest, the diarization, and the export pipeline are someone else’s problem. Same model family underneath, two completely different jobs. Below is the install reality, the break-even math, and the honest verdict on which side of the line you are on.

The decision in one paragraph

If you have an NVIDIA GPU box, more than 1000 hours of audio per month, and engineering time to build the pipeline above raw inference, insanely-fast-whisper wins on per-call cost. If you do not own GPU infrastructure, your volume sits below that line, or you want a URL-ingest-to-DOCX product instead of a CLI to wire into your own queue, Whipscribe is the right tool. There is no third option that is both cheaper than self-hosting at scale and easier than a hosted endpoint at low volume — those two lines cross around the 1000-hour-per-month mark, and that crossover is what this post is about.

What insanely-fast-whisper actually gives you

The CLI is small on purpose. Underneath it stitches together three pieces that, until 2024, took experienced ML engineers a week to wire up properly:

  • the Hugging Face Transformers Whisper Large-v3 pipeline as the model runner
  • Flash-Attention-2 kernels for the attention pass
  • BetterTransformer SDPA with chunked batching to keep the GPU saturated between clips

The CLI also does the practical things: --batch-size for tuning the encoder pass to your VRAM budget, --device-id for picking a GPU on a multi-GPU box, --diarization_model for plugging in pyannote (the diarization step is shelled out, not first-class), and a JSON output mode with word-level timestamps. License is Apache-2.0, distribution is PyPI under the name insanely-fast-whisper, and the canonical install path is pipx install insanely-fast-whisper.
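A typical batch invocation putting those flags together might look like the sketch below. The file paths are illustrative and the flag spellings follow the project README at time of writing, so verify against `insanely-fast-whisper --help` on your installed version:

```shell
# Transcribe one file on GPU 0 with the default 24-clip batch,
# writing word-level JSON next to the audio (paths illustrative)
insanely-fast-whisper \
  --file-name interview.mp3 \
  --device-id 0 \
  --batch-size 24 \
  --transcript-path interview.json
```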

What it is in one sentence. insanely-fast-whisper is the “how fast can a single NVIDIA GPU go on Whisper Large-v3” reference implementation, packaged as a CLI a single engineer can run from a fresh box.

The 90x real-time number, with the asterisks

The headline benchmark in the project README is 150 minutes of audio in roughly 98 seconds on an RTX 4090 — about 92 times real-time. That number is real, and it is reproducible if you set up the box exactly the way the README does. The asterisks matter:

GPU | Realised real-time factor | What changes
RTX 4090 (24 GB · Ada Lovelace) | ~90x | Flash-Attention-2 native; published headline number; the consumer ceiling.
H100 (80 GB · Hopper) | 100x+ | Flash-Attention-2 with FP8 path; faster but not 5x faster — the encoder is the bottleneck.
A100 (40 / 80 GB · Ampere) | ~70–80x | Flash-Attention-2 supported; older tensor cores; common rented-cloud baseline.
RTX 3090 (24 GB · Ampere) | ~40–55x | Flash-Attention-2 works but slower kernels; the practical “own a 4090” alternative.
12 GB consumer card (RTX 4070, RTX 3080 Ti) | ~25–40x | Lower batch size; you trade VRAM for throughput. Still very fast on Large-v3.
CPU only | not supported | The project is explicit: GPU required. Use whisper.cpp for CPU paths.

Numbers are community-reported medians across the project’s GitHub issues and the README; clean batched English audio with the encoder fully amortised. Multilingual, very short, or very noisy clips realise less.

Two practical points. First, the encoder pass is the fixed cost. For very short clips (under ~30 seconds) the encoder warm-up dominates and the realised throughput collapses; this is a batch tool, not a streaming one. Second, batch size matters more than people expect. The default batch size of 24 is sized for a 24 GB card; cut it to 4 or 8 on a 12 GB card and you reclaim VRAM but lose roughly half the throughput. Always benchmark your actual workload on your actual card before sizing capacity.
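The warm-up effect is easy to model on the back of an envelope. A sketch, assuming an illustrative fixed 5-second encoder warm-up per job and the 90x peak figure (both are assumptions for illustration, not measured constants):

```shell
# Realised throughput = audio_seconds / (warmup + audio_seconds / peak_rtf).
# warmup=5s and peak=90x are illustrative assumptions, not measurements.
for secs in 30 300 9000; do
  awk -v s="$secs" -v warmup=5 -v peak=90 \
    'BEGIN { printf "%5ds audio -> realised %.0fx real-time\n", s, s / (warmup + s / peak) }'
done
```

A 30-second clip realises a small fraction of peak while a 2.5-hour file gets close to it, which is the batch-tool point in numbers.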

The install reality

One pipx command, then a fight with CUDA. That is the honest summary.

pipx install insanely-fast-whisper

The CLI itself installs cleanly. The interesting half is what happens when it tries to import flash_attn at first run. flash-attn is published as a binary wheel, but only for specific PyTorch builds, specific Python versions, and specific CUDA toolkits. If your environment does not match the precompiled grid, pip falls back to compiling from source — which needs nvcc on PATH, a matching CUDA 12.x toolchain, and roughly 5 to 30 minutes of compile time on a fresh box. Operators who have done this before pin all three versions in a Dockerfile and never look at it again. Operators who have not are usually surprised the first time.
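The pin-it-in-a-Dockerfile approach looks roughly like the sketch below. This is not a supported recipe: the base-image tag and version pins are illustrative, and you should pick a combination that actually appears in flash-attn’s published wheel matrix.

```dockerfile
# Pin all three coordinates flash-attn's wheel grid cares about:
# CUDA toolkit (the base image, nvcc included), PyTorch build, Python.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install torch first: with --no-build-isolation, flash-attn's build imports it.
RUN pip3 install torch==2.2.0 --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install packaging ninja
RUN pip3 install flash-attn --no-build-isolation

RUN pip3 install insanely-fast-whisper
```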

The Windows footnote. Native Windows is not an officially supported platform for flash-attn; the maintained path is WSL2. If your batch box happens to be a Windows workstation, plan on either a WSL2 install or a Linux dual-boot. This is a recurring topic in the project’s GitHub issues.

What you build on top of the CLI

This is where the “free OSS” framing breaks down. insanely-fast-whisper hands you a transcript per file. A production transcription product needs roughly the following layers above that:

  • Ingest: URL and podcast-feed download, format normalisation, chunking
  • Diarization wiring: pyannote models, Hugging Face tokens, speaker-label merging
  • Exports: SRT / VTT / DOCX on top of the CLI’s JSON
  • Platform: a job queue with retries, multi-tenant auth and storage, and an API surface

None of this is hard. All of it is real engineering time. A single experienced engineer can build a credible v0 in 2 to 4 weeks; a production-grade version with the rough edges sanded usually takes 2 to 3 months and a permanent ~10% slice of someone’s time after that to keep alive.

The decision matrix

Dimension | insanely-fast-whisper | Whipscribe
What it is | A CLI you run on your own GPU box | A hosted product (web, API, MCP, Chrome extension)
Peak throughput | ~90x real-time on RTX 4090 | Bounded by your network upload, not the GPU
Hardware required | NVIDIA GPU, 12 GB+ VRAM, CUDA 12.x | A browser. Or our REST API. Or our MCP server.
Install | pipx + flash-attn compile (5–30 min on fresh box) | None. Open the page, drop a file.
License / cost | Apache-2.0, free + your GPU bill | $0 free tier, $2/hr PAYG, $12 / $29 monthly
URL ingest (YouTube, podcasts) | Build it yourself | Built in (paste a URL)
Diarization | Shells out to pyannote, you wire it up | WhisperX-based, on by default on every paid tier
Exports (SRT/VTT/DOCX/JSON) | JSON only out of the box | All formats included
Queue / retries / multi-tenant | You own the platform code | Operated for you
Streaming / real-time | Not the design point (batch tool) | Not yet (batch product)
MCP server for LLM agents | Build your own | Live at /claude
Languages | 99 (whatever Whisper supports) | 99 (same model family)
Sweet spot | 1000+ hours/month batch jobs on owned GPU | Below that volume, or anyone without a GPU

The break-even math, with numbers

Here is the worked example that operators usually want first.

Scenario: a podcast network with 1000 hours of audio per month

The cost to run insanely-fast-whisper on a dedicated rented RTX 4090 box is dominated by the GPU rental: typical 2026 pricing from cloud providers like RunPod, Lambda, or Vast.ai is in the $0.30 to $0.50 per hour range for community-tier 4090s, or $200 to $400 per month for a reserved instance. On top of that sits the engineering time to build and operate the pipeline above raw inference.

Run the same workload on Whipscribe Team and the cost is the flat plan price ($29 per month covers 500 hours), with no GPU rental and no pipeline engineering on top.

The honest break-even. At 1000 hours per month, insanely-fast-whisper’s steady-state per-hour cost beats Whipscribe’s once the engineering investment is amortised — but only over a 12 to 24 month horizon, and only if you actually have someone to operate the pipeline. If your team does not already have a platform engineer with NVIDIA experience, Whipscribe is the cheaper answer at any volume below roughly 5000 hours per month, because your time is not free.
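To make the amortisation claim concrete, here is a sketch of the steady-state arithmetic. The $300/month GPU figure is the midpoint of the rental range above; the $25,000 build cost and 18-month amortisation window are illustrative assumptions, not figures from either project:

```shell
# Steady-state $/audio-hour for the 1000 hr/mo scenario.
# gpu=$300/mo rental midpoint; build=$25k and months=18 are assumptions.
awk -v hours=1000 -v gpu=300 -v build=25000 -v months=18 'BEGIN {
  selfhost = (gpu + build / months) / hours   # rental plus amortised build
  printf "self-host: $%.2f per audio-hour (vs $2.00 PAYG)\n", selfhost
}'
```

Push the volume up or stretch the amortisation window and the self-host number falls; shrink either and hosted wins. That sensitivity is the crossover this post keeps pointing at.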

Scenario: a journalist with 10 hours of interviews per month

insanely-fast-whisper would cost more in a single afternoon of CUDA debugging than a year of Whipscribe Pro at $12 per month. This is not the audience the project is for, and the project does not pretend otherwise. Use Whipscribe.

Scenario: a research lab with 50 hours per month and an existing GPU cluster

If the GPUs are already there and idle, insanely-fast-whisper is essentially free compute. If the lab does not need URL ingest, diarization, or polished exports, the CLI is the right answer. If they want any of those, Whipscribe Pro at $12 per month is cheaper than the engineering time to bolt them on.

Skip the CUDA fight
Same Whisper Large-v3 family. Server GPUs. URL ingest, diarization, MCP — all included.

Free 30 min/day. $2/hr PAYG. $12/mo Pro for 100 hours. $29/mo Team for 500 hours. No flash-attn compile.

See pricing →

When each tool is the right call

insanely-fast-whisper is the right tool when…

You operate the GPU yourself

  • Call-center analytics platforms processing 5,000+ hours/month
  • Broadcast captioning vendors with their own DC racks
  • ML labs preparing training datasets from audio archives
  • SaaS products where transcription is a feature, not the product
  • Anyone who already has a CUDA platform team and an idle 4090 / A100 / H100
Whipscribe is the right tool when…

You want a transcript, not a pipeline

  • Podcasters, journalists, researchers, founders, students
  • Anyone below ~1000 hours of audio per month
  • Anyone who does not already own a GPU box
  • LLM-agent integrations that need an MCP server today
  • Teams that want URL ingest, diarization, and DOCX export without writing them

Pricing — what each side actually costs

Plan | What you get | What it costs
insanely-fast-whisper | The CLI itself, Apache-2.0. You provide the GPU. | $0 + your GPU bill ($200–$400/mo for a rented 4090, plus engineering)
Whipscribe Free | 30 minutes / day, every day. No sign-up, no credit card. | $0
Whipscribe PAYG | Per-hour billing for spiky usage. Diarization included. | $2 / hour of audio
Whipscribe Pro | 100 hours / month. Right for one person clearing a backlog. | $12 / month
Whipscribe Team (500 hr) | 500 hours / month. Right for a podcast network or research team. | $29 / month

Can I use both?

Yes — and a non-trivial number of operators do. The hybrid pattern looks like this: insanely-fast-whisper on an in-house batch box for the historical archive (overnight processing, no SLA, cheap), Whipscribe as the customer-facing endpoint, the MCP server an LLM agent calls, and the burst-capacity fallback when the in-house GPU is queued or down. The two are not substitutes; one is a max-throughput inference engine, the other is the product wrapped around inference. Most teams that take transcription seriously end up running both in different places.
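A minimal sketch of the routing decision in that hybrid setup, assuming a hypothetical queue-depth cutoff (the function name and the threshold of 10 are invented for illustration):

```shell
# Route a job to the in-house batch box unless its queue is deep,
# then overflow to the hosted endpoint. Threshold of 10 is arbitrary.
route_job() {
  queue_depth=$1
  max_local=10
  if [ "$queue_depth" -le "$max_local" ]; then
    echo "local"    # in-house insanely-fast-whisper box
  else
    echo "hosted"   # Whipscribe API as burst capacity
  fi
}

route_job 3     # prints "local"
route_job 42    # prints "hosted"
```

In practice the signal can be queue depth, GPU health, or an SLA clock; the shape of the fallback stays the same.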

Frequently asked

What is insanely-fast-whisper?

An open-source CLI by Vaibhav Srivastav that wraps Hugging Face Transformers, Flash-Attention-2, and BetterTransformer SDPA into a one-command Whisper-Large-v3 runner. It hits roughly 90 times real-time on an RTX 4090 — 150 minutes of audio in ~98 seconds. Apache-2.0, published on PyPI, GPU-only.

How fast is it compared to faster-whisper?

On a top-end GPU, insanely-fast-whisper’s Flash-Attention-2 path is roughly 3 to 5 times faster than faster-whisper’s CTranslate2 path on Large-v3. faster-whisper is the more common production choice because it runs on smaller cards, installs without a flash-attn compile, and has a friendlier Python API. insanely-fast-whisper wins on a beefy GPU; faster-whisper wins on a typical one.

Why is the install painful?

flash-attn ships precompiled wheels only for specific PyTorch and CUDA combinations. Step off that grid and pip falls back to a 5–30 minute source compile that needs nvcc on PATH and a matching CUDA 12.x toolchain. Pin your versions in a Dockerfile and the problem disappears; do it ad-hoc and it surprises you every time.

Does insanely-fast-whisper run on a Mac or on CPU?

No. The project is NVIDIA-GPU-only by design; CPU paths are not the goal. For Apple Silicon, the right tool is whisper.cpp or WhisperKit. For CPU, whisper.cpp again.

When is it cheaper than Whipscribe?

Once your monthly volume crosses roughly 1000 hours of audio and you already employ a platform engineer comfortable with CUDA. Below that volume, Whipscribe Pro or Team is cheaper than the engineering time to build the pipeline above raw inference. Above ~5000 hours per month, self-hosting is almost always cheaper.

Can I use insanely-fast-whisper for streaming or real-time captioning?

Not really. The CLI is a batch tool — the encoder warm-up dominates short-clip throughput, and there is no streaming API. For live captioning, the appropriate stack is a streaming-first model (Deepgram, Speechmatics, or AssemblyAI’s real-time tier) rather than a batched Whisper pass.

Does Whipscribe support diarization, SRT, DOCX, JSON, and URL ingestion?

Yes — all by default on every paid tier and on the daily 30-minute free allowance. Paste a YouTube URL or upload a file, get back TXT / SRT / VTT / DOCX / JSON with speaker labels and word-level timestamps.

Can I call Whipscribe from a Python or Node app the same way I call insanely-fast-whisper?

Yes — Whipscribe exposes a REST API and an MCP server. Same audio in, same JSON out, no GPU on your side. If you are migrating an existing insanely-fast-whisper pipeline because the GPU bill or operator load is no longer worth it, the swap is one HTTP call.

If your monthly volume sits below the break-even, skip the CUDA fight. Same Whisper model family on server GPUs, plus URL ingest, diarization, MCP, and an extension — for $12 to $29 a month.

See pricing →