The recognizer doesn't need to hear your keyboard. Or the air conditioner. Or the three seconds of silence while you think.
We use RNNoise for voice activity detection. It's a recurrent neural network trained on noisy speech, and it does two things well: it figures out when you're actually talking, and it suppresses background noise without destroying consonants.
TL;DR
- Dictation accuracy is often limited by input segmentation, not the recognizer.
- RNNoise helps separate speech from noise and trims dead air before transcription.
- Cleaner segments improve punctuation and reduce “hallucinated” filler words.
Why this matters for dictation
Most accuracy problems aren't model problems. They're input problems. Feed the recognizer clean speech segments, and it does better work. Feed it room noise and silence, and it hallucinates words or drops them.
RNNoise runs continuously while you hold the hotkey. It tracks speech vs non-speech in real time and gently pushes down ambient noise. The result: cleaner segments reach the Whisper-style model.
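The speech/non-speech tracking can be sketched as a gate with hangover: once a frame's VAD probability crosses a threshold, the gate stays open for a few extra frames so trailing consonants aren't clipped. This is a minimal illustration of the idea, not RNNoise internals; the threshold and hangover values are assumptions for the example.

```python
def segment_speech(vad_probs, threshold=0.5, hangover=20):
    """Turn per-frame VAD probabilities into speech/non-speech flags.

    `hangover` keeps the gate open for a few frames after the last
    speech frame, so word endings survive. Values are illustrative.
    """
    flags = []
    frames_since_speech = hangover  # start with the gate closed
    for p in vad_probs:
        if p >= threshold:
            frames_since_speech = 0
        else:
            frames_since_speech += 1
        flags.append(frames_since_speech < hangover)
    return flags

# A short burst of speech followed by silence: the gate stays open
# for `hangover` frames after the last speech frame.
probs = [0.1, 0.9, 0.95, 0.2, 0.1, 0.1]
print(segment_speech(probs, threshold=0.5, hangover=2))
# → [False, True, True, True, False, False]
```

The hangover is what separates "gently trims dead air" from "chops off the ends of words."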
What we tried before
Simple energy-based VAD (volume threshold) fails in inconsistent environments. Loud typing triggers it. Quiet speech doesn't. We needed something that understood speech patterns, not just amplitude.
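To make the failure mode concrete, here's a sketch of the energy-threshold approach we abandoned (frame size and threshold are illustrative): a loud keystroke transient clears the threshold while quiet speech doesn't, which is exactly backwards.

```python
import math

def energy_vad(samples, frame_size=480, threshold=0.02):
    """Naive amplitude-threshold VAD: a frame counts as 'speech' if its
    RMS energy exceeds a fixed threshold. It knows nothing about speech
    patterns, only loudness."""
    flags = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        flags.append(rms > threshold)
    return flags

click = [0.5] * 480          # loud keystroke transient
quiet_speech = [0.01] * 480  # soft voice
print(energy_vad(click + quiet_speech))  # → [True, False]
```

The keystroke is flagged as speech and the soft voice as silence; no amount of threshold tuning fixes both at once across environments.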
Heavy noise gates over-process the signal. They clip consonants and sibilants, making "s" sounds disappear and "t" sounds muddy. The transcript looks wrong even when the model heard correctly.
RNNoise hits a middle ground: aggressive enough to matter, gentle enough to preserve detail.
The pipeline
1. Audio comes in at 48 kHz
2. RNNoise detects speech regions and suppresses background noise
3. We segment into phrases with padding to preserve context
4. We condition the audio (LUFS normalization, high-pass filter)
5. We resample to 16 kHz for the model
6. We transcribe
Steps 2 and 3 are where most of the "why does this sound cleaner" happens.
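Two of the simpler steps can be sketched in a few lines. This is an illustrative sketch, not our production code: the function names are made up, and the resampler uses a crude 3-tap average where a real pipeline would use a proper anti-aliasing filter. It does show why 48 kHz → 16 kHz is a friendly case: it's an exact 3:1 decimation.

```python
def pad_segments(segments, pad, total_len):
    """Expand each (start, end) speech segment by `pad` samples on both
    sides, clamped to the clip bounds, so phrase edges keep context."""
    return [(max(0, s - pad), min(total_len, e + pad)) for s, e in segments]

def resample_48k_to_16k(samples):
    """Decimate by 3 after a crude 3-sample average standing in for a
    low-pass filter. Illustrates the exact 3:1 ratio only; a real
    resampler needs a proper anti-aliasing filter."""
    out = []
    for i in range(0, len(samples) - 2, 3):
        out.append((samples[i] + samples[i + 1] + samples[i + 2]) / 3)
    return out

print(pad_segments([(100, 200)], pad=50, total_len=220))  # → [(50, 220)]
print(len(resample_48k_to_16k([0.0] * 48000)))            # → 16000
```

The padding is the part that matters for accuracy: a segment cut exactly at the VAD boundary loses the acoustic context the model uses for the first and last words.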
If you've used dictation in a noisy room and gotten garbage, the problem was probably upstream of the model. We fix it there.
