stable-ts vs Whipscribe (2026): the timestamp fix-up library, honestly compared
stable-ts is the open-source library that fixes Whisper's weakest seam — word-level timestamp drift — by replacing the model's native timestamp predictions with a dynamic-programming alignment over its own cross-attention weights, plus a monotonicity-regularization pass. It is the right tool when you are publishing caption-grade subtitles or building a karaoke pipeline. Whipscribe is a hosted product that ships diarized SRT/VTT/DOCX/JSON out of the box, takes a URL or a file, and never asks you to operate a GPU. This post is the honest version of when each one is the right pick.
What stable-ts actually is
stable-ts is a Python library that wraps an OpenAI-Whisper-compatible inference call (vanilla openai-whisper, faster-whisper, or whisper.cpp via the appropriate adapter) and replaces the timestamp pathway with a more careful one. The headline ideas, drawn from the README and the issues tracker on GitHub:
- Dynamic programming over cross-attention weights. Whisper's decoder produces cross-attention maps as a side effect of generating each token. Those maps already contain information about where in the audio each token was attending. stable-ts runs a DTW-style (dynamic-time-warping) alignment over those weights to recover monotonic, well-bounded word timestamps — instead of relying on Whisper's special `<|t|>` timestamp tokens, which are produced by a separate decoding head and drift on long segments.
- Regularization for monotonicity and silence. The DP pass enforces that word boundaries advance forward in time (no overlapping words, no zero-length words), and it consults voice-activity detection to shrink boundaries that bleed into silence. The output is a per-word timeline that looks like what a human captioner would mark.
- Regrouping options. stable-ts ships `refine` and `regroup` primitives — declarative rules that let you split, merge, or re-balance segments by length, character count, gap, or punctuation. This is the part of subtitle authoring that turns an ASR transcript into broadcast-grade captions: max 32 characters per line, max 2 lines per cue, no orphaned punctuation, breaks at clause boundaries.
- Denoising integration. Recent versions wire in optional denoising (`denoiser="demucs"` and friends) so the audio fed into Whisper is cleaner before the alignment pass runs. Useful for music-bed podcasts, lavalier-mic field recordings, and anything with consistent room noise.
The library is MIT-licensed, ~2.2k GitHub stars, single maintainer, last push 2025-10-29 as of this writing. It is a sharp tool, well-scoped, and the audience is exactly the people who are going to read the source.
What stable-ts is not
The library does one thing. It does not do most of the other things a transcription pipeline needs:
- No speaker diarization. stable-ts only operates on the alignment between transcript and audio. If you need "Speaker 1 / Speaker 2" labels you bolt on pyannote-audio (sketched just after this list) or use whisperX as your inference layer instead.
- No URL ingestion. stable-ts takes a local audio path. YouTube, Spotify share links, Zoom-cloud recordings — that's your fetch step (yt-dlp + ffmpeg + retry logic).
- No exports beyond what's in the box. SRT, VTT, and JSON are supported out of the library; DOCX, plain text reflowed for reading, structured XML for broadcast captioning systems — those are yours to write.
- No GPU box. stable-ts inherits whatever inference engine you point it at. CPU works for tiny models; serious throughput needs a GPU you operate yourself.
- No queue, no auth, no UI. It's a library, not a product.
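To make the first gap concrete: the diarization bolt-on usually means pyannote-audio. A minimal sketch following pyannote's published 3.x quickstart; the join between speaker turns and stable-ts word timings is logic you write yourself:

```python
from pyannote.audio import Pipeline

# Gated model: you must accept the terms on Hugging Face
# and pass an access token (placeholder below).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)

diarization = pipeline("audio.wav")

# Speaker turns come back as (segment, track, label) triples; you then
# intersect these spans with stable-ts word timestamps yourself.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}-{turn.end:.2f}: {speaker}")
```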
None of these gaps is stable-ts's fault — it is a focused library doing exactly what it advertises. But the distance between "I have `pip install stable-ts` running in a notebook" and "I have a captioning pipeline in production" is real, and that distance is most of the engineering work in any transcription product.
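The URL-ingestion piece alone is a fair sample of that work. A sketch of the fetch step using yt-dlp's Python API, with a hypothetical `fetch_audio` helper; production versions grow retries, caching, and auth handling:

```python
from yt_dlp import YoutubeDL

def fetch_audio(url: str) -> None:
    """Download best-quality audio and convert it to WAV via ffmpeg."""
    opts = {
        "format": "bestaudio/best",
        "outtmpl": "%(id)s.%(ext)s",
        # Standard yt-dlp postprocessor config; requires ffmpeg on PATH.
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
        ],
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])

# fetch_audio("https://www.youtube.com/watch?v=<VIDEO_ID>")
```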
The thing it actually fixes (and why captioners care)
If you have ever generated SRT subtitles from raw Whisper output and watched them play back next to the video, you have probably noticed the drift. A word lights up half a second after the speaker says it. Two words run into a single cue with one timestamp covering both. The last word of a segment hangs around for 800 milliseconds of silence after the speaker stops. For a transcript meant to be read, none of this matters. For karaoke, sing-along apps, broadcast captions, or short-form vertical clips where every cue is on screen for two seconds, all of it matters.
The drift is real because Whisper's timestamp tokens are predicted by a separate head with its own loss, and that head is allowed to be slightly wrong if the text token loss goes down. stable-ts ignores those tokens and reads the cross-attention directly — and the cross-attention is what the model was actually doing when it produced each word. The result is timestamps that match what a human would mark, with sub-100ms agreement on clean audio in the community benchmarks reported on the project's issues tracker and in subtitle-pipeline write-ups on Reddit's r/MachineLearning and r/whisper.
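The fix shows up directly in the export call. A short sketch, reusing the `result` object from the earlier example and the `to_srt_vtt` flags documented in the README (treat the flag names as version-dependent):

```python
# Reading captions: one cue per segment, word-level timing suppressed.
result.to_srt_vtt("captions.srt", word_level=False)

# Karaoke-style cues: per-word boundaries, where the DTW-tightened
# timestamps are actually visible on screen.
result.to_srt_vtt("karaoke.srt", segment_level=False)
```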
The honest side-by-side
Different surfaces, different jobs. The model layer underneath can be the same Whisper checkpoint in both cases.
| Dimension | stable-ts (self-hosted library) | Whipscribe (hosted product) |
|---|---|---|
| What you operate | A Python library on a GPU box you own or rent | Nothing — paste a URL or upload a file |
| Word-level timestamps | DP alignment over Whisper cross-attention (sub-100ms on clean audio) | wav2vec2 forced alignment via whisperX |
| Speaker diarization | Not included — bolt on pyannote yourself | Included on every plan (pyannote-3.1) |
| License | MIT (jianfch/stable-ts) | SaaS — users don't inherit model-card terms |
| Setup time | 2–4 hours of devops (engine + GPU + fetch + queue) | ~30 seconds (paste URL → transcript) |
| URL ingestion (YouTube, podcast, share links) | You build it (yt-dlp + ffmpeg + retry) | Built in, paste a link |
| Subtitle regrouping (max chars/line, clause breaks) | Excellent — `regroup` + `refine` primitives | SRT/VTT exports with sensible defaults; finer cue rules not exposed |
| Exports (SRT, VTT, DOCX, TXT, JSON) | SRT, VTT, JSON in-library; DOCX is your problem | All five formats included on every plan |
| Denoising | Optional integration with demucs / noisereduce | Server-side audio normalization on every upload |
| API / MCP for AI agents | Build your own HTTP wrapper | REST API + native MCP server (Claude / ChatGPT) |
| Cost | Free code + your GPU + your dev time + ongoing maintenance | $0 free / $2 PAYG / $12 Pro / $29 Team |
| Best for | Subtitle pipelines, karaoke apps, broadcast captioning, on-prem | Creators, podcasters, journalists, researchers, AI agents |
Worked example: a YouTube creator publishing 10 videos a month
This is the most common shape we see when someone Googles "Whisper SRT subtitles." A creator with a regular publishing cadence, a need for accurate captions on every upload, and a backlog they'd like to not think about. Let's run it both ways.
Path A: self-host stable-ts
- Day 1 morning. Spin up a cloud GPU (RTX 4090, ~$0.50/hr). `pip install stable-ts faster-whisper`. Pick the engine; pick the model size (large-v3 for English). Test on a 5-minute clip — looks great, timestamps are tight.
- Day 1 afternoon. Write the YouTube fetch step (yt-dlp + ffmpeg audio extract). Write the cue-formatting rules — max 32 chars, max 2 lines, break at clause boundaries — using stable-ts `regroup` chains; a sketch follows this list. Iterate on three real videos until the cues look right.
- Day 2. Wrap it in a small CLI you can drop a YouTube URL into. Add a sanity check that compares cue density against your baseline. Spend the rest of the day debugging an edge case where punctuation in song lyrics breaks the regrouper.
- Ongoing. Each upload: copy the URL, run the CLI, get an SRT, upload it manually to YouTube Studio. About 5 minutes of GPU time per 20-minute video, plus a minute of your attention.
- Cost. ~$5/month of GPU rental + ~10 hours of upfront engineering + ~1 minute per video forever. The captioning quality is excellent and you own the pipeline.
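The Day 1 afternoon cue rules might look like the chain below. The method names come from stable-ts's regrouping docs; the thresholds (32 characters, 0.5 s gaps, the clause punctuation set) are this example's assumptions, and you will iterate on them:

```python
# Declarative regrouping: each call reshapes the segment list and
# returns the result, so the methods chain.
(
    result
    .split_by_punctuation([(".", " "), "。", "?", "？", ",", "，"])  # clause breaks
    .split_by_gap(0.5)                # new cue after >0.5 s of silence
    .merge_by_gap(0.15, max_words=3)  # re-join stray one-word fragments
    .split_by_length(max_chars=32)    # hard cap: 32 characters per line
)
result.to_srt_vtt("captions.srt", word_level=False)
```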
Path B: Whipscribe Free or Pro
- Hour 1. Create an account. Paste the YouTube URL of your latest video. The transcript comes back diarized in single-digit minutes. Click "Download SRT." Upload to YouTube Studio.
- Each subsequent video. Same flow — paste, wait, download, upload. About 2 minutes of your attention per video.
- Cost. 10 videos × ~20 minutes = ~3.3 hours of audio per month. Whipscribe Free gives 30 minutes of audio per day, so for a creator publishing 2–3 videos per week the free tier is enough. If your videos are longer or you batch a backlog: Pro at $12/month covers 100 hours.
The point isn't that stable-ts is overkill — the captioning quality you get from a well-tuned stable-ts pipeline is the best you can get from open-source Whisper, full stop. The point is that for the YouTube-creator use case, the marginal precision over Whipscribe's whisperX-based timing is invisible to viewers, and the engineering cost is real. If you would notice a 50ms timestamp error in your captions, build the stable-ts pipeline. If you wouldn't, use the hosted product.
When stable-ts is the right call
- You're building a captioning or subtitle pipeline as a product. Karaoke apps, sing-along language-learning tools, broadcast caption authoring software, music-video lyric sync. The precision matters because it's the product. stable-ts is the right tool — embed it, tune the regrouping rules to your domain, and own the pipeline.
- You have an in-house ASR pipeline already and want a drop-in timing fix. You run faster-whisper on your own GPUs, you ship transcripts in production, and your customers have started complaining about karaoke-style cue alignment. `pip install stable-ts`, swap your `transcribe()` call, ship the improvement; a sketch follows this list.
- You need offline operation with no audio leaving the device. Field journalists, hospital settings, classified work. stable-ts runs locally; the hosted product does not.
- You want maximum control over cue formatting. Broadcast TV captions have specific rules (CEA-608/708, EBU-TT, BBC subtitle guidelines) that need exact character counts, line breaks, and reading speeds. stable-ts's `regroup` chain is the most expressive caption-formatting API in the open-source Whisper ecosystem.
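That second swap is nearly literal. A sketch, assuming a pinned recent stable-ts and using `load_faster_whisper`, the documented wrapper for that backend (the wrapper's transcribe method has been renamed across versions, so check the README for the one you install):

```python
import stable_whisper

# Before: model = faster_whisper.WhisperModel("large-v3")
# After: the same checkpoint, loaded through stable-ts's wrapper.
model = stable_whisper.load_faster_whisper("large-v3")

# Same call shape as before; older releases named this transcribe_stable().
result = model.transcribe("episode.wav")
result.to_srt_vtt("episode.srt")
```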
When Whipscribe is the right call
- You publish content and need transcripts. YouTube creators, podcasters, journalists, course creators. The transcript and the SRT are a means to an end (show notes, search, accessibility). You don't need broadcast-grade timing; you need a transcript by tomorrow.
- You need diarization out of the box. Multi-speaker interviews, panel discussions, focus groups. stable-ts doesn't do this; Whipscribe does, on every plan.
- You're calling transcription from an AI agent. A Claude or ChatGPT agent that needs to transcribe a URL as part of a workflow doesn't have a GPU, doesn't have a Python pipeline, and shouldn't have one. Whipscribe ships an MCP server (`mcp.whipscribe.com`) for exactly this case.
- You value an evening more than $29. The honest version. The setup time on stable-ts (and any self-hosted ASR pipeline) is real, and most people who can be served by a hosted product should use one.
Diarized transcripts. SRT, VTT, DOCX, TXT, JSON exports. URL ingest from YouTube, Spotify, Zoom. MCP server for Claude and ChatGPT agents. Your laptop and your GPU stay free.
See pricing →
Credit where credit is due
stable-ts exists because jianfch wrote and maintains it under MIT, mostly as a single-maintainer project, and has done the careful work of making Whisper's timestamps trustworthy for caption pipelines that read them. The technique — DP alignment over cross-attention — predates the library in the speech-research literature, but jianfch's contribution is the well-engineered Python wrapper that the rest of us can pip install. If you ship a captioning product on top of it, sponsoring the project on GitHub is a reasonable thing to do; the work being done there is real.
Frequently asked
What does stable-ts actually fix in Whisper?
Whisper's native timestamps are produced by a separate prediction head and are notoriously imprecise — segment boundaries drift and individual word boundaries can be off by 200–500 milliseconds. stable-ts replaces that path with a dynamic-programming alignment over the model's own cross-attention weights, plus a regularization pass that enforces monotonicity and trims silence. The transcript text is the same as Whisper produces; the timestamps are tightened to the actual word boundaries in the audio.
Is stable-ts more accurate than Whisper's built-in word timestamps?
For word-boundary precision, yes — measurably. The tradeoff is speed: stable-ts does extra work per segment, so it runs slower than vanilla Whisper or faster-whisper. For broadcast-grade subtitles, karaoke effects, or short-form clip extraction where word-level offsets matter, the precision is worth the wall-clock cost. For a transcript that humans will read and search, vanilla word timestamps are usually good enough.
Does stable-ts include speaker diarization?
No. stable-ts only operates on the alignment between transcript and audio. For diarization you need pyannote-audio (the open-source default) or a pipeline that bundles them, like whisperX. Whipscribe runs whisperX internally so every transcript comes back diarized by default.
What license is stable-ts?
stable-ts is MIT-licensed (jianfch/stable-ts on GitHub). You can audit, fork, or embed it in commercial products without licensing fees. The underlying Whisper model is also MIT (OpenAI). Dependencies on faster-whisper or whisper.cpp follow each project's MIT-equivalent terms.
Does Whipscribe use stable-ts?
No — not by default. Whipscribe's diarization pipeline is whisperX, which uses wav2vec2 forced alignment to tighten word-level timestamps. The two libraries solve overlapping problems with different strategies: stable-ts uses Whisper's own cross-attention; whisperX runs a separate phoneme model. For most transcript use cases the two are interchangeable.
When is stable-ts the right choice over Whipscribe?
Three honest cases. (1) You're building a captioning or subtitle pipeline where millisecond-precise word boundaries matter — karaoke videos, broadcast TV captions, foreign-language sing-along apps. (2) You're embedding ASR in a product where you already operate the GPU and the pipeline is part of your value. (3) You need offline operation with no audio leaving the device.
When is Whipscribe the right choice over stable-ts?
When the job is "get a transcript" rather than "build a captioning library." Whipscribe ships SRT, VTT, DOCX, TXT, and JSON exports out of the box, with diarization included on every plan. Pricing is $0 free / $2 PAYG / $12 Pro 100hr / $29 Team 500hr. You don't operate a GPU, you don't write the URL fetcher, and you don't maintain a Python pipeline you didn't sign up for.
Can I use stable-ts with faster-whisper or whisper.cpp?
Yes. stable-ts originally wrapped openai-whisper, but recent versions support faster-whisper and whisper.cpp as backends — you keep the inference engine you've already standardized on and only add the timestamp-stabilization pass on top. "faster-whisper for speed, stable-ts for caption-grade timing" is the common production pattern.
If your job is "publish the captions," skip the library and use the product. SRT, VTT, DOCX, JSON. Diarization included. URL ingest. MCP server. Free for 30 min/day.
See pricing →