stable-ts vs Whipscribe (2026): the timestamp fix-up library, honestly compared
stable-ts is the open-source library that fixes Whisper's weakest seam — word-level timestamp drift — by replacing the model's native timestamp predictions with a dynamic-programming alignment over its own cross-attention weights, plus a monotonicity-regularization pass. It is the right tool when you are publishing caption-grade subtitles or building a karaoke pipeline. Whipscribe is a hosted product that ships diarized SRT/VTT/DOCX/JSON out of the box, takes a URL or a file, and never asks you to operate a GPU. This post is the honest version of when each one is the right pick.
What stable-ts actually is
stable-ts is a Python library that wraps an OpenAI-Whisper-compatible inference call (vanilla openai-whisper, faster-whisper, or whisper.cpp via the appropriate adapter) and replaces the timestamp pathway with a more careful one. The headline ideas, drawn from the README and the issues tracker on GitHub:
- Dynamic programming over cross-attention weights. Whisper's decoder produces cross-attention maps as a side effect of generating each token. Those maps already contain information about where in the audio each token was attending. stable-ts runs a DTW-style (dynamic-time-warping) alignment over those weights to recover monotonic, well-bounded word timestamps — instead of relying on Whisper's special `<|t|>` timestamp tokens, which are produced by a separate decoding head and drift on long segments.
- Regularization for monotonicity and silence. The DP pass enforces that word boundaries advance forward in time (no overlapping words, no zero-length words), and it consults voice-activity detection to shrink boundaries that bleed into silence. The output is a per-word timeline that looks like what a human captioner would mark.
- Regrouping options. stable-ts ships `refine` and `regroup` primitives — declarative rules that let you split, merge, or re-balance segments by length, character count, gap, or punctuation. This is the part of subtitle authoring that turns an ASR transcript into broadcast-grade captions: max 32 characters per line, max 2 lines per cue, no orphaned punctuation, breaks at clause boundaries.
- Denoising integration. Recent versions wire in optional denoising (`denoiser="demucs"` and friends) so the audio fed into Whisper is cleaner before the alignment pass runs. Useful for music-bed podcasts, lavalier-mic field recordings, and anything with consistent room noise.
The library is MIT-licensed, ~2.2k GitHub stars, single maintainer, last push 2025-10-29 as of this writing. It is a sharp tool, well-scoped, and the audience is exactly the people who are going to read the source.
What stable-ts is not
The library does one thing. It does not do most of the other things a transcription pipeline needs:
- No speaker diarization. stable-ts only operates on the alignment between transcript and audio. If you need "Speaker 1 / Speaker 2" labels you bolt on pyannote-audio (sketched just after this list) or use whisperX as your inference layer instead.
- No URL ingestion. stable-ts takes a local audio path. YouTube, Spotify share links, Zoom-cloud recordings — that's your fetch step (yt-dlp + ffmpeg + retry logic).
- No exports beyond what's in the box. SRT, VTT, and JSON are supported out of the library; DOCX, plain text reflowed for reading, structured XML for broadcast captioning systems — those are yours to write.
- No GPU box. stable-ts inherits whatever inference engine you point it at. CPU works for tiny models; serious throughput needs a GPU you operate yourself.
- No queue, no auth, no UI. It's a library, not a product.
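To make the first gap concrete: the diarization bolt-on usually means pyannote-audio. A minimal sketch following pyannote's published 3.x quickstart; the join between speaker turns and stable-ts word timings is logic you write yourself:

```python
from pyannote.audio import Pipeline

# Gated model: you must accept the terms on Hugging Face
# and pass an access token (placeholder below).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)

diarization = pipeline("audio.wav")

# Speaker turns come back as (segment, track, label) triples; you then
# intersect these spans with stable-ts word timestamps yourself.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}-{turn.end:.2f}: {speaker}")
```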
None of these gaps is stable-ts's fault — it is a focused library doing exactly what it advertises. But the distance between "I have `pip install stable-ts` running in a notebook" and "I have a captioning pipeline in production" is real, and that distance is most of the engineering work in any transcription product.
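The URL-ingestion piece alone is a fair sample of that work. A sketch of the fetch step using yt-dlp's Python API, with a hypothetical `fetch_audio` helper; production versions grow retries, caching, and auth handling:

```python
from yt_dlp import YoutubeDL

def fetch_audio(url: str) -> None:
    """Download best-quality audio and convert it to WAV via ffmpeg."""
    opts = {
        "format": "bestaudio/best",
        "outtmpl": "%(id)s.%(ext)s",
        # Standard yt-dlp postprocessor config; requires ffmpeg on PATH.
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
        ],
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])

# fetch_audio("https://www.youtube.com/watch?v=<VIDEO_ID>")
```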
The thing it actually fixes (and why captioners care)
If you have ever generated SRT subtitles from raw Whisper output and watched them play back next to the video, you have probably noticed the drift. A word lights up half a second after the speaker says it. Two words run into a single cue with one timestamp covering both. The last word of a segment hangs around for 800 milliseconds of silence after the speaker stops. For a transcript meant to be read, none of this matters. For karaoke, sing-along apps, broadcast captions, or short-form vertical clips where every cue is on screen for two seconds, all of it matters.
The drift is real because Whisper's timestamp tokens are predicted by a separate head with its own loss, and that head is allowed to be slightly wrong if the text token loss goes down. stable-ts ignores those tokens and reads the cross-attention directly — and the cross-attention is what the model was actually doing when it produced each word. The result is timestamps that match what a human would mark, with sub-100ms agreement on clean audio in the community benchmarks reported on the project's issues tracker and in subtitle-pipeline write-ups on Reddit's r/MachineLearning and r/whisper.
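The fix shows up directly in the export call. A short sketch, reusing the `result` object from the earlier example and the `to_srt_vtt` flags documented in the README (treat the flag names as version-dependent):

```python
# Reading captions: one cue per segment, word-level timing suppressed.
result.to_srt_vtt("captions.srt", word_level=False)

# Karaoke-style cues: per-word boundaries, where the DTW-tightened
# timestamps are actually visible on screen.
result.to_srt_vtt("karaoke.srt", segment_level=False)
```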
The honest side-by-side
Different surfaces, different jobs. The model layer underneath can be the same Whisper checkpoint in both cases.
| Dimension | stable-ts (self-hosted library) | Whipscribe (hosted product) |
|---|---|---|
| What you operate | A Python library on a GPU box you own or rent | Nothing — paste a URL or upload a file |
| Word-level timestamps | DP alignment over Whisper cross-attention (sub-100ms on clean audio) | wav2vec2 forced alignment via whisperX |
| Speaker diarization | Not included — bolt on pyannote yourself | Included on every plan (pyannote-3.1) |
| License | MIT (jianfch/stable-ts) | SaaS — users don't inherit model-card terms |
| Setup time | 2–4 hours of devops (engine + GPU + fetch + queue) | ~30 seconds (paste URL → transcript) |
| URL ingestion (YouTube, podcast, share links) | You build it (yt-dlp + ffmpeg + retry) | Built in, paste a link |
| Subtitle regrouping (max chars/line, clause breaks) | Excellent — `regroup` + `refine` primitives | SRT/VTT exports with sensible defaults; finer cue rules not exposed |
| Exports (SRT, VTT, DOCX, TXT, JSON) | SRT, VTT, JSON in-library; DOCX is your problem | All five formats included on every plan |
| Denoising | Optional integration with demucs / noisereduce | Server-side audio normalization on every upload |
| API / MCP for AI agents | Build your own HTTP wrapper | REST API + native MCP server (Claude / ChatGPT) |
| Cost | Free code + your GPU + your dev time + ongoing maintenance | $0 free / $2 PAYG / $12 Pro / $29 Team |
| Best for | Subtitle pipelines, karaoke apps, broadcast captioning, on-prem | Creators, podcasters, journalists, researchers, AI agents |
Worked example: a YouTube creator publishing 10 videos a month
This is the most common shape we see when someone Googles "Whisper SRT subtitles." A creator with a regular publishing cadence, a need for accurate captions on every upload, and a backlog they'd like to not think about. Let's run it both ways.
Path A: self-host stable-ts
- Day 1 morning. Spin up a cloud GPU (RTX 4090, ~$0.50/hr). `pip install stable-ts faster-whisper`. Pick the engine; pick the model size (large-v3 for English). Test on a 5-minute clip — looks great, timestamps are tight.
- Day 1 afternoon. Write the YouTube fetch step (yt-dlp + ffmpeg audio extract). Write the cue-formatting rules — max 32 chars, max 2 lines, break at clause boundaries — using stable-ts `regroup` chains; a sketch follows this list. Iterate on three real videos until the cues look right.
- Day 2. Wrap it in a small CLI you can drop a YouTube URL into. Add a sanity check that compares cue density against your baseline. Spend the rest of the day debugging an edge case where punctuation in song lyrics breaks the regrouper.
- Ongoing. Each upload: copy the URL, run the CLI, get an SRT, upload it manually to YouTube Studio. About 5 minutes of GPU time per 20-minute video, plus a minute of your attention.
- Cost. ~$5/month of GPU rental + ~10 hours of upfront engineering + ~1 minute per video forever. The captioning quality is excellent and you own the pipeline.
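The Day 1 afternoon cue rules might look like the chain below. The method names come from stable-ts's regrouping docs; the thresholds (32 characters, 0.5 s gaps, the clause punctuation set) are this example's assumptions, and you will iterate on them:

```python
# Declarative regrouping: each call reshapes the segment list and
# returns the result, so the methods chain.
(
    result
    .split_by_punctuation([(".", " "), "。", "?", "？", ",", "，"])  # clause breaks
    .split_by_gap(0.5)                # new cue after >0.5 s of silence
    .merge_by_gap(0.15, max_words=3)  # re-join stray one-word fragments
    .split_by_length(max_chars=32)    # hard cap: 32 characters per line
)
result.to_srt_vtt("captions.srt", word_level=False)
```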
Path B: Whipscribe Free or Pro
- Hour 1. Create an account. Paste the YouTube URL of your latest video. The transcript comes back diarized in single-digit minutes. Click "Download SRT." Upload to YouTube Studio.
- Each subsequent video. Same flow — paste, wait, download, upload. About 2 minutes of your attention per video.
- Cost. 10 videos × ~20 minutes = ~3.3 hours of audio per month. Whipscribe Free gives 30 minutes of audio per day, so for a creator publishing 2–3 videos per week the free tier is enough. If your videos are longer or you batch a backlog: Pro at $12/month covers 100 hours.
The point isn't that stable-ts is overkill — the captioning quality you get from a well-tuned stable-ts pipeline is the best you can get from open-source Whisper, full stop. The point is that for the YouTube-creator use case, the marginal precision over Whipscribe's whisperX-based timing is invisible to viewers, and the engineering cost is real. If you would notice a 50ms timestamp error in your captions, build the stable-ts pipeline. If you wouldn't, use the hosted product.
When stable-ts is the right call
- You're building a captioning or subtitle pipeline as a product. Karaoke apps, sing-along language-learning tools, broadcast caption authoring software, music-video lyric sync. The precision matters because it's the product. stable-ts is the right tool — embed it, tune the regrouping rules to your domain, and own the pipeline.
- You have an in-house ASR pipeline already and want a drop-in timing fix. You run faster-whisper on your own GPUs, you ship transcripts in production, and your customers have started complaining about karaoke-style cue alignment. `pip install stable-ts`, swap your `transcribe()` call, ship the improvement; a sketch follows this list.
- You need offline operation with no audio leaving the device. Field journalists, hospital settings, classified work. stable-ts runs locally; the hosted product does not.
- You want maximum control over cue formatting. Broadcast TV captions have specific rules (CEA-608/708, EBU-TT, BBC subtitle guidelines) that need exact character counts, line breaks, and reading speeds. stable-ts's `regroup` chain is the most expressive caption-formatting API in the open-source Whisper ecosystem.
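That second swap is nearly literal. A sketch, assuming a pinned recent stable-ts and using `load_faster_whisper`, the documented wrapper for that backend (the wrapper's transcribe method has been renamed across versions, so check the README for the one you install):

```python
import stable_whisper

# Before: model = faster_whisper.WhisperModel("large-v3")
# After: the same checkpoint, loaded through stable-ts's wrapper.
model = stable_whisper.load_faster_whisper("large-v3")

# Same call shape as before; older releases named this transcribe_stable().
result = model.transcribe("episode.wav")
result.to_srt_vtt("episode.srt")
```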
When Whipscribe is the right call
- You publish content and need transcripts. YouTube creators, podcasters, journalists, course creators. The transcript and the SRT are a means to an end (show notes, search, accessibility). You don't need broadcast-grade timing; you need a transcript by tomorrow.
- You need diarization out of the box. Multi-speaker interviews, panel discussions, focus groups. stable-ts doesn't do this; Whipscribe does, on every plan.
- You're calling transcription from an AI agent. A Claude or ChatGPT agent that needs to transcribe a URL as part of a workflow doesn't have a GPU, doesn't have a Python pipeline, and shouldn't have one. Whipscribe ships an MCP server (`mcp.whipscribe.com`) for exactly this case.
- You value an evening more than $29. The honest version. The setup time on stable-ts (and any self-hosted ASR pipeline) is real, and most people who can be served by a hosted product should use one.
Diarized transcripts. SRT, VTT, DOCX, TXT, JSON exports. URL ingest from YouTube, Spotify, Zoom. MCP server for Claude and ChatGPT agents. Your laptop and your GPU stay free.
See pricing →
Credit where credit is due
stable-ts exists because jianfch wrote and maintains it under MIT, mostly as a single-maintainer project, and has done the careful work of making Whisper's timestamps trustworthy for caption pipelines that read them. The technique — DP alignment over cross-attention — predates the library in the speech-research literature, but jianfch's contribution is the well-engineered Python wrapper that the rest of us can pip install. If you ship a captioning product on top of it, sponsoring the project on GitHub is a reasonable thing to do; the work being done there is real.
Frequently asked
What does stable-ts actually fix in Whisper?
Whisper's native timestamps are produced by a separate prediction head and are notoriously imprecise — segment boundaries drift and individual word boundaries can be off by 200–500 milliseconds. stable-ts replaces that path with a dynamic-programming alignment over the model's own cross-attention weights, plus a regularization pass that enforces monotonicity and trims silence. The transcript text is the same as Whisper produces; the timestamps are tightened to the actual word boundaries in the audio.
Is stable-ts more accurate than Whisper's built-in word timestamps?
For word-boundary precision, yes — measurably. The tradeoff is speed: stable-ts does extra work per segment, so it runs slower than vanilla Whisper or faster-whisper. For broadcast-grade subtitles, karaoke effects, or short-form clip extraction where word-level offsets matter, the precision is worth the wall-clock cost. For a transcript that humans will read and search, vanilla word timestamps are usually good enough.
Does stable-ts include speaker diarization?
No. stable-ts only operates on the alignment between transcript and audio. For diarization you need pyannote-audio (the open-source default) or a pipeline that bundles them, like whisperX. Whipscribe runs whisperX internally so every transcript comes back diarized by default.
What license is stable-ts?
stable-ts is MIT-licensed (jianfch/stable-ts on GitHub). You can audit, fork, or embed it in commercial products without licensing fees. The underlying Whisper model is also MIT (OpenAI). Dependencies on faster-whisper or whisper.cpp follow each project's MIT-equivalent terms.
Does Whipscribe use stable-ts?
No — not by default. Whipscribe's diarization pipeline is whisperX, which uses wav2vec2 forced alignment to tighten word-level timestamps. The two libraries solve overlapping problems with different strategies: stable-ts uses Whisper's own cross-attention; whisperX runs a separate phoneme model. For most transcript use cases the two are interchangeable.
When is stable-ts the right choice over Whipscribe?
Three honest cases. (1) You're building a captioning or subtitle pipeline where millisecond-precise word boundaries matter — karaoke videos, broadcast TV captions, foreign-language sing-along apps. (2) You're embedding ASR in a product where you already operate the GPU and the pipeline is part of your value. (3) You need offline operation with no audio leaving the device.
When is Whipscribe the right choice over stable-ts?
When the job is "get a transcript" rather than "build a captioning library." Whipscribe ships SRT, VTT, DOCX, TXT, and JSON exports out of the box, with diarization included on every plan. Pricing is $0 free / $2 PAYG / $12 Pro 100hr / $29 Team 500hr. You don't operate a GPU, you don't write the URL fetcher, and you don't maintain a Python pipeline you didn't sign up for.
Can I use stable-ts with faster-whisper or whisper.cpp?
Yes. stable-ts originally wrapped openai-whisper, but recent versions support faster-whisper and whisper.cpp as backends — you keep the inference engine you've already standardized on and only add the timestamp-stabilization pass on top. "faster-whisper for speed, stable-ts for caption-grade timing" is the common production pattern.
If your job is "publish the captions," skip the library and use the product. SRT, VTT, DOCX, JSON. Diarization included. URL ingest. MCP server. Free for 30 min/day.
See pricing →