insanely-fast-whisper vs Whipscribe (2026): peak GPU throughput vs hosted product
insanely-fast-whisper, the open-source CLI by Vaibhav Srivastav, transcribes 150 minutes of audio in under 100 seconds on an RTX 4090. That is roughly 90 times real-time, and as of May 2026 it is the throughput ceiling for Whisper Large-v3 on a single consumer GPU. Whipscribe is the hosted product where the GPU, the queue, the URL ingest, the diarization, and the export pipeline are someone else’s problem. Same model family underneath, two completely different jobs. Below is the install reality, the break-even math, and the honest verdict on which side of the line you are on.
The decision in one paragraph
If you have an NVIDIA GPU box, more than 1000 hours of audio per month, and engineering time to build the pipeline above raw inference, insanely-fast-whisper wins on per-call cost. If you do not own GPU infrastructure, your volume sits below that line, or you want a URL-ingest-to-DOCX product instead of a CLI to wire into your own queue, Whipscribe is the right tool. There is no third option that is both cheaper than self-hosting at scale and easier than a hosted endpoint at low volume — those two lines cross around the 1000 hours-per-month mark, and that crossover is what this post is about.
What insanely-fast-whisper actually gives you
The CLI is small on purpose. Underneath it stitches together three pieces that, until 2024, took experienced ML engineers a week to wire up properly:
- Hugging Face Transformers as the model loader and tokeniser, pinned to a Whisper Large-v3 (or distil-large-v3) checkpoint on the Hub.
- Flash-Attention-2 for the attention kernel — the IO-aware exact-attention implementation from Tri Dao’s 2023 paper. On Hopper and Ada-Lovelace GPUs (H100, RTX 4090, RTX 6000 Ada) this is the reason throughput jumps roughly 2x over a vanilla PyTorch attention pass.
- BetterTransformer / SDPA as the fallback attention path when Flash-Attention-2 is not available — Turing and older cards it does not support, boxes where no prebuilt flash-attn wheel matches the PyTorch / CUDA combination, and any GPU where the user has skipped the flash-attn install.
The CLI also does the practical things: --batch-size for tuning the encoder pass to your VRAM budget, --device-id for picking a GPU on a multi-GPU box, --diarization_model for plugging in pyannote (the diarization step is shelled out, not first-class), and a JSON output mode with word-level timestamps. License is Apache-2.0, distribution is PyPI under the name insanely-fast-whisper, and the canonical install path is pipx install insanely-fast-whisper.
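As a point of reference, a typical invocation looks roughly like the sketch below. The flags are the ones named above plus the project’s documented defaults; flag spellings occasionally shift between releases, so trust insanely-fast-whisper --help on your installed version over this example.

```bash
# Sketch of a typical run: transcribe one file on GPU 0 with word-level
# timestamps and pyannote diarization. File names are placeholders; the
# diarization step needs a Hugging Face token with access to the pyannote weights.
insanely-fast-whisper \
  --file-name interview.wav \
  --device-id 0 \
  --batch-size 24 \
  --timestamp word \
  --diarization_model pyannote/speaker-diarization-3.1 \
  --hf-token "$HF_TOKEN" \
  --transcript-path interview.json
```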
The 90x real-time number, with the asterisks
The headline benchmark in the project README is 150 minutes of audio in roughly 98 seconds on an RTX 4090 — about 92 times real-time. That number is real, and it is reproducible if you set up the box exactly the way the README does. The asterisks matter:
| GPU | Realised real-time factor | What changes |
|---|---|---|
| RTX 4090 · 24 GB · Ada Lovelace | ~90x | Flash-Attention-2 native; published headline number; the consumer ceiling. |
| H100 · 80 GB · Hopper | 100x+ | Flash-Attention-2 with FP8 path; faster but not 5x faster — the encoder is the bottleneck. |
| A100 · 40 / 80 GB · Ampere | ~70–80x | Flash-Attention-2 supported; older tensor cores; common rented-cloud baseline. |
| RTX 3090 · 24 GB · Ampere | ~40–55x | Flash-Attention-2 works but slower kernels; the practical “own a 4090” alternative. |
| 12 GB consumer card (RTX 4070, RTX 3080 Ti) | ~25–40x | Lower batch size; you trade VRAM for throughput. Still very fast on Large-v3. |
| CPU only | not supported | The project is explicit: GPU required. Use whisper.cpp for CPU paths. |
Numbers are community-reported medians across the project’s GitHub issues and the README; clean batched English audio with the encoder fully amortised. Multilingual, very short, or very noisy clips realise less.
Two practical points. First, the encoder pass is the fixed cost. For very short clips (under ~30 seconds) the encoder warm-up dominates and the realised throughput collapses; this is a batch tool, not a streaming one. Second, batch size matters more than people expect. The default batch size of 24 is sized for a 24 GB card; cut it to 4 or 8 on a 12 GB card and you reclaim VRAM but lose roughly half the throughput. Always benchmark your actual workload on your actual card before sizing capacity.
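A quick way to run that benchmark is to time one representative file at a few batch sizes while watching VRAM. A minimal sketch, with the file name and batch-size list as placeholders:

```bash
# Sketch: time the same representative file at several batch sizes.
# Keep nvidia-smi open in a second terminal to watch peak VRAM per setting.
for bs in 4 8 16 24; do
  echo "--- batch size $bs ---"
  time insanely-fast-whisper \
    --file-name representative-episode.wav \
    --batch-size "$bs" \
    --transcript-path "/tmp/out-bs${bs}.json"
done
```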
The install reality
One pipx command, then a fight with CUDA. That is the honest summary.
pipx install insanely-fast-whisper

The CLI itself installs cleanly. The interesting half is what happens when it tries to import flash_attn at first run. flash-attn is published as a binary wheel, but only for specific PyTorch builds, specific Python versions, and specific CUDA toolkits. If your environment does not match the precompiled grid, pip falls back to compiling from source — which needs nvcc on PATH, a matching CUDA 12.x toolchain, and roughly 5 to 30 minutes of compile time on a fresh box. Operators who have done this before pin all three versions in a Dockerfile and never look at it again. Operators who have not are usually surprised the first time.
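For reference, the pinned approach looks roughly like this: the commands you would bake into a Dockerfile layer. The version numbers below are illustrative placeholders, not a tested matrix; pick a combination the flash-attn release actually publishes wheels for, then freeze it.

```bash
# Sketch: pin all three pieces so the flash_attn import never surprises you.
# Version numbers are placeholders -- match them to flash-attn's published wheel grid.
pip install "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1 build
pip install flash-attn --no-build-isolation   # falls back to a 5-30 min nvcc compile if no wheel matches
pip install insanely-fast-whisper
```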
What you build on top of the CLI
This is where the “free OSS” framing breaks down. insanely-fast-whisper hands you a transcript per file. A production transcription product needs roughly the following layers above that:
- URL ingest — yt-dlp for YouTube, podcast feed parsers, Drive / Dropbox / S3 clients, format conversion to 16 kHz mono WAV. None of this is in the CLI; a minimal sketch of this step follows below.
- Chunking and stitching — Whisper officially handles up to 30 seconds of audio per forward pass; the CLI does the chunking but you own the seam-fixing logic for sentence boundaries that fall on chunk edges.
- Diarization — pyannote runs as a separate model with separate weights, separate license, and separate failure modes. Wiring it to Whisper word timestamps so the speaker labels actually align is its own project.
- Exports — SRT, VTT, DOCX, JSON each have their own quirks. SRT line-wrapping for broadcast use is a half-day of code by itself.
- Queue and retries — Celery, Dramatiq, RQ, Temporal, your choice; you also own the retry-on-CUDA-OOM logic, the backoff on flaky network, and the dead-letter queue.
- Multi-tenancy — auth, usage metering, rate limits, per-tenant storage isolation, audit logs. If anyone other than you is going to see the output, this is mandatory.
- Monitoring — Prometheus exporters, GPU utilisation alerts, queue-depth dashboards, model-load-time histograms.
None of this is hard. All of it is real engineering time. A single experienced engineer can build a credible v0 in 2 to 4 weeks; a production-grade version with the rough edges sanded usually takes 2 to 3 months and a permanent ~10% slice of someone’s time after that to keep it alive.
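To make the first of those layers concrete, here is a rough sketch of the URL-ingest step on its own: yt-dlp plus ffmpeg, with the URL and file names as placeholders. Everything after it (seam-fixing, diarization alignment, exports, retries, metering) is still yours to build.

```bash
# Sketch: pull audio from a URL and normalise it to 16 kHz mono WAV before
# handing it to the CLI. The URL and paths are placeholders.
yt-dlp -x --audio-format wav -o "raw.%(ext)s" "https://example.com/podcast/episode-42"
ffmpeg -y -i raw.wav -ar 16000 -ac 1 prepared.wav

insanely-fast-whisper --file-name prepared.wav --transcript-path episode-42.json
```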
The decision matrix
| Dimension | insanely-fast-whisper | Whipscribe |
|---|---|---|
| What it is | A CLI you run on your own GPU box | A hosted product (web, API, MCP, Chrome extension) |
| Peak throughput | ~90x real-time on RTX 4090 | Bounded by your network upload, not the GPU |
| Hardware required | NVIDIA GPU, 12 GB+ VRAM, CUDA 12.x | A browser. Or our REST API. Or our MCP server. |
| Install | pipx + flash-attn compile (5–30 min on fresh box) | None. Open the page, drop a file. |
| License / cost | Apache-2.0, free + your GPU bill | $0 free tier, $2/hr PAYG, $12 / $29 monthly |
| URL ingest (YouTube, podcasts) | Build it yourself | Built in (paste a URL) |
| Diarization | Shells out to pyannote, you wire it up | WhisperX-based, on by default on every paid tier |
| Exports (SRT/VTT/DOCX/JSON) | JSON only out of the box | All formats included |
| Queue / retries / multi-tenant | You own the platform code | Operated for you |
| Streaming / real-time | Not the design point (batch tool) | Not yet (batch product) |
| MCP server for LLM agents | Build your own | Live at /claude |
| Languages | 99 (whatever Whisper supports) | 99 (same model family) |
| Sweet spot | 1000+ hours/month batch jobs on owned GPU | Below that volume, or anyone without a GPU |
The break-even math, with numbers
Here is the worked example that operators usually want first.
Scenario: a podcast network with 1000 hours of audio per month
The cost to run insanely-fast-whisper on a dedicated rented RTX 4090 box (typical 2026 pricing from cloud providers like RunPod, Lambda, or Vast.ai is in the $0.30 to $0.50 per hour range for community-tier 4090s, $200 to $400 per month for a reserved instance):
- Compute: 1000 hours of audio at 90x real-time = 11.1 hours of GPU time. At $0.40 per GPU-hour rented, that is roughly $4.50 per month in raw inference. Even budgeting for 5x overhead from cold starts, retries, and idle time, the GPU bill is under $25.
- Reserved capacity: if you reserve a $250-per-month 4090 instance for predictability, that is your floor. Per hour of audio, $250 / 1000 = $0.25 per hour.
- Engineering: the pipeline above (URL ingest, queue, diarization, exports, monitoring) costs an engineer roughly 4 weeks to build credibly and roughly 4 hours per week to keep alive after that. At $100 to $200 per engineer-hour fully-loaded, that is a $25k to $50k one-time cost plus $1.5k to $3k per month ongoing.
Run the same workload on Whipscribe Team:
- 500 hours / month for $29 (the Team tier), times two seats to cover 1000 hours = $58 per month, all in. URL ingest, diarization, exports, retries, MCP server included. Zero engineering.
- If the workload is exactly 1000 hours, PAYG at $2 per hour would be $2000 per month — Team is the right tier here, not PAYG.
Scenario: a journalist with 10 hours of interviews per month
insanely-fast-whisper would cost more in a single afternoon of CUDA debugging than a year of Whipscribe Pro at $12 per month. This is not the audience the project is for, and the project does not pretend otherwise. Use Whipscribe.
Scenario: a research lab with 50 hours per month and an existing GPU cluster
If the GPUs are already there and idle, insanely-fast-whisper is essentially free compute. If the lab does not need URL ingest, diarization, or polished exports, the CLI is the right answer. If they want any of those, Whipscribe Pro at $12 per month is cheaper than the engineering time to bolt them on.
Free 30 min/day. $2/hr PAYG. $12/mo Pro for 100 hours. $29/mo Team for 500 hours. No flash-attn compile.
See pricing →
When insanely-fast-whisper is the right call
You operate the GPU yourself
- Call-center analytics platforms processing 5,000+ hours/month
- Broadcast captioning vendors with their own DC racks
- ML labs preparing training datasets from audio archives
- SaaS products where transcription is a feature, not the product
- Anyone who already has a CUDA platform team and an idle 4090 / A100 / H100
When Whipscribe is the right call
You want a transcript, not a pipeline
- Podcasters, journalists, researchers, founders, students
- Anyone below ~1000 hours of audio per month
- Anyone who does not already own a GPU box
- LLM-agent integrations that need an MCP server today
- Teams that want URL ingest, diarization, and DOCX export without writing them
Pricing — what each side actually costs
| Plan | What you get | What it costs |
|---|---|---|
| insanely-fast-whisper | The CLI itself, Apache-2.0. You provide the GPU. | $0 + your GPU bill ($200–$400/mo for a rented 4090, plus engineering) |
| Whipscribe Free | 30 minutes / day, every day. No sign-up, no credit card. | $0 |
| Whipscribe PAYG | Per-hour billing for spiky usage. Diarization included. | $2 / hour of audio |
| Whipscribe Pro | 100 hours / month. Right for one person clearing a backlog. | $12 / month |
| Whipscribe Team · 500 hr | 500 hours / month. Right for a podcast network or research team. | $29 / month |
Can I use both?
Yes — and a non-trivial number of operators do. The hybrid pattern looks like this: insanely-fast-whisper on an in-house batch box for the historical archive (overnight processing, no SLA, cheap), Whipscribe as the customer-facing endpoint, the MCP server an LLM agent calls, and the burst-capacity fallback when the in-house GPU is queued or down. The two are not substitutes; one is a max-throughput inference engine, the other is the product wrapped around inference. Most teams that take transcription seriously end up running both in different places.
Frequently asked
What is insanely-fast-whisper?
An open-source CLI by Vaibhav Srivastav that wraps Hugging Face Transformers, Flash-Attention-2, and BetterTransformer SDPA into a one-command Whisper-Large-v3 runner. It hits roughly 90 times real-time on an RTX 4090 — 150 minutes of audio in ~98 seconds. Apache-2.0, published on PyPI, GPU-only.
How fast is it compared to faster-whisper?
On a top-end GPU, insanely-fast-whisper’s Flash-Attention-2 path is roughly 3 to 5 times faster than faster-whisper’s CTranslate2 path on Large-v3. faster-whisper is the more common production choice because it runs on smaller cards, installs without a flash-attn compile, and has a friendlier Python API. insanely-fast-whisper wins on a beefy GPU; faster-whisper wins on a typical one.
Why is the install painful?
flash-attn ships precompiled wheels only for specific PyTorch and CUDA combinations. Step off that grid and pip falls back to a 5–30 minute source compile that needs nvcc on PATH and a matching CUDA 12.x toolchain. Pin your versions in a Dockerfile and the problem disappears; do it ad-hoc and it surprises you every time.
Does insanely-fast-whisper run on a Mac or on CPU?
No. The project is NVIDIA-GPU-only by design; CPU paths are not the goal. For Apple Silicon, the right tool is whisper.cpp or WhisperKit. For CPU, whisper.cpp again.
When is it cheaper than Whipscribe?
Once your monthly volume crosses roughly 1000 hours of audio and you already employ a platform engineer comfortable with CUDA. Below that volume, Whipscribe Pro or Team is cheaper than the engineering time to build the pipeline above raw inference. Above ~5000 hours per month, self-hosting is almost always cheaper.
Can I use insanely-fast-whisper for streaming or real-time captioning?
Not really. The CLI is a batch tool — the encoder warm-up dominates short-clip throughput, and there is no streaming API. For live captioning, the appropriate stack is a streaming-first model (Deepgram, Speechmatics, or AssemblyAI’s real-time tier) rather than a batched Whisper pass.
Does Whipscribe support diarization, SRT, DOCX, JSON, and URL ingestion?
Yes — all by default on every paid tier and on the daily 30-minute free allowance. Paste a YouTube URL or upload a file, get back TXT / SRT / VTT / DOCX / JSON with speaker labels and word-level timestamps.
Can I call Whipscribe from a Python or Node app the same way I call insanely-fast-whisper?
Yes — Whipscribe exposes a REST API and an MCP server. Same audio in, same JSON out, no GPU on your side. If you are migrating an existing insanely-fast-whisper pipeline because the GPU bill or operator load is no longer worth it, the swap is one HTTP call.
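As an illustration of what that swap can look like, here is a hedged sketch. The endpoint path, parameter names, and field names below are placeholders, not Whipscribe’s documented API; check the actual API reference before wiring anything up.

```bash
# Illustrative only: URL and field names are placeholders, not the documented API.
curl -X POST "https://api.whipscribe.example/v1/transcriptions" \
  -H "Authorization: Bearer $WHIPSCRIBE_API_KEY" \
  -F "file=@interview.wav" \
  -F "diarization=true" \
  -F "format=json"
```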
If your monthly volume sits below the break-even, skip the CUDA fight. Same Whisper model family on server GPUs, plus URL ingest, diarization, MCP, and an extension — for $12 to $29 a month.
See pricing →