At a glance
Fast because it is local. Accurate because the pipeline is not sloppy.
Most dictation marketing hand-waves the pipeline. That is dumb. The pipeline is the product. If the signal is bad, batching is naive, or the runtime is inconsistent, people feel it immediately.
What matters
- Runs the full dictation pipeline on-device.
- Uses signal cleanup before recognition, not just transcript cleanup after.
- Streams in batches so long dictation sessions do not stall at the end.
- Keeps network calls minimal and optional.
Pipeline
The on-device pipeline, step by step.
Voice activity detection
RNNoise-based VAD continuously tracks speech and suppresses background noise before recognition starts doing expensive work.
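As a rough illustration of how per-frame speech detection can gate recognition, the sketch below assumes the VAD stage emits one speech probability per audio frame. The function name, thresholds, and hangover length are hypothetical, not Voice Type's actual values.

```python
def speech_mask(probs, on_threshold=0.6, off_threshold=0.4, hangover_frames=3):
    """Turn per-frame speech probabilities into a stable speech/non-speech mask.

    All names and default values here are illustrative assumptions.
    Two thresholds (hysteresis) plus a hangover stop the mask from
    flickering on brief dips in probability.
    """
    mask = []
    active = False
    quiet_run = 0  # consecutive low-probability frames while active
    for p in probs:
        if not active:
            if p >= on_threshold:
                active = True
                quiet_run = 0
        elif p < off_threshold:
            quiet_run += 1
            if quiet_run > hangover_frames:
                active = False
                quiet_run = 0
        else:
            quiet_run = 0
        mask.append(active)
    return mask
```

Frames marked `False` can be skipped entirely, so the recogniser only sees audio that is likely to contain speech.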
Segmentation
Phrase boundaries use padding and merging so short pauses do not shatter context between utterances.
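A minimal sketch of padding and merging, assuming speech segments arrive as (start, end) pairs in milliseconds. The function name and the default pad and gap values are hypothetical and chosen only for illustration.

```python
def pad_and_merge(segments_ms, pad_ms=200, max_gap_ms=300):
    """Pad each speech segment, then merge neighbours separated by a short pause.

    segments_ms: iterable of (start_ms, end_ms) pairs.
    pad_ms and max_gap_ms are illustrative assumptions, not product settings.
    """
    padded = [(max(0, start - pad_ms), end + pad_ms)
              for start, end in sorted(segments_ms)]
    merged = []
    for start, end in padded:
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Short pause: extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, two phrases half a second apart collapse into one segment, while a phrase several seconds later stays separate.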
Audio conditioning
We normalise to −14 LUFS with K-weighting, then apply a 50 Hz second-order Butterworth high-pass filter.
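The two conditioning steps can be sketched as follows. This assumes the filter runs at 48 kHz, that the integrated loudness has already been measured elsewhere (the K-weighted measurement itself is not shown), and that the standard bilinear-transform biquad with Q = 1/√2 is an acceptable stand-in for the filter described. All function names are hypothetical.

```python
import math

def butterworth_highpass(cutoff_hz=50.0, sample_rate=48000.0):
    """Coefficients for a second-order Butterworth high-pass biquad.

    Uses the common bilinear-transform biquad form with Q = 1/sqrt(2).
    Returns (b, a) with a0 already normalised to 1, so a holds [a1, a2].
    """
    q = 1.0 / math.sqrt(2.0)
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    half = (1.0 + cos_w0) / 2.0 / a0
    b = [half, -(1.0 + cos_w0) / a0, half]
    a = [-2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a

def apply_biquad(samples, b, a):
    """Run samples through the biquad (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def loudness_gain(measured_lufs, target_lufs=-14.0):
    """Linear gain that moves a measured loudness to the target loudness."""
    return 10.0 ** ((target_lufs - measured_lufs) / 20.0)
```

A recording measured at −20 LUFS would need about 6 dB of gain, a linear factor of roughly 2, to reach −14 LUFS. The high-pass stage removes DC offset and low-frequency rumble below the speech band.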
Resampling
48 kHz audio is downsampled to 16 kHz, the sample rate Whisper-derived models expect, so every recording reaches the recogniser in the same format.
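Going from 48 kHz to 16 kHz is a 3:1 decimation: low-pass below the new Nyquist frequency (8 kHz), then keep every third sample. The sketch below uses a Hamming-windowed sinc filter as one plausible approach. The tap count, cutoff, and function names are assumptions, and a production resampler would likely use an optimised polyphase implementation instead.

```python
import math

def lowpass_taps(num_taps=63, cutoff_hz=7000.0, sample_rate=48000.0):
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2.0 * fc
        else:
            sinc = math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def downsample_by_3(samples, taps):
    """Filter and keep every third sample (48 kHz -> 16 kHz)."""
    out = []
    for i in range(0, len(samples), 3):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i - k
            if 0 <= j < len(samples):
                acc += tap * samples[j]
        out.append(acc)
    return out
```

Filtering before discarding samples matters: without it, content above 8 kHz would alias back into the speech band.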
Context batching
Dictation streams in ~30-second blocks so longer thoughts stay coherent instead of resetting every few seconds.
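One way to picture block-based streaming is a buffer that hands off each full block as soon as it fills, leaving only a short tail when dictation stops. The class name and its interface are hypothetical; the 16 kHz rate and 30-second block length follow the figures above.

```python
class BatchBuffer:
    """Accumulate audio samples and release them in fixed-length blocks."""

    def __init__(self, sample_rate=16000, batch_seconds=30.0):
        self.batch_size = int(sample_rate * batch_seconds)
        self.pending = []

    def push(self, samples):
        """Add samples and return any blocks that are now complete."""
        self.pending.extend(samples)
        ready = []
        while len(self.pending) >= self.batch_size:
            ready.append(self.pending[:self.batch_size])
            self.pending = self.pending[self.batch_size:]
        return ready

    def flush(self):
        """Return the unfinished tail, for example when dictation stops."""
        tail, self.pending = self.pending, []
        return tail
```

Because completed blocks are recognised while you are still speaking, only the tail returned by `flush()` is left to process at the end of a session.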
Core ML + Metal
Optimised operators keep inference responsive while staying energy-efficient on actual Macs, not just benchmark charts.
Speed
Why it feels faster in practice.
No uploads
Audio never leaves your Mac, so you avoid network latency, compression artefacts, and cloud queueing.
Quick finalisation
When you release the hotkey, only the current batch needs to finalise, which takes roughly 2–3 seconds on an M-series Mac, instead of reprocessing the full recording.
Energy-aware scheduling
On-device inference avoids long-running cloud calls, which helps keep laptops cooler and quieter during real work.
Accuracy
Why it stays accurate under normal human mess.
Signal-first improvements
We improve the audio before recognition instead of leaning on heavy prompt hacks to rescue a bad transcript afterwards.
Noise resilience
RNNoise reduces ambient noise without flattening consonants and sibilants into mush.
Domain vocabulary
Custom word lists feed into the recogniser directly, which helps technical jargon and product names survive first contact.
Trust
Privacy and reviewability are part of the engineering story.
Mac App Store distribution
Voice Type ships through the Mac App Store, so it is reviewed and signed by Apple and runs with the App Sandbox enabled.
Authentic transparency
App Store reviews are shown without filtering, including critical feedback.
Minimal network calls
The only network calls go to Apple for receipt checks and, if you turn on rewriting, to the rewrite provider you choose.
Optional rewrite
Dictation stays local. Rewrite is opt-in.
If you enable bring-your-own-key rewriting, transcripts can be handed to a fast LLM for formatting, summarising, or drafting. The important part is that dictation itself does not need that path to feel immediate.
That separation matters. Speech capture should be dependable first. Language polish can be optional second.
