Engineering

How Voice Type stays fast and accurate

Voice Type runs the full dictation pipeline on your Mac. Here is how the audio processing, batching, and Core ML acceleration work together.

On-device pipeline · Core ML + Metal · No default cloud upload

Voice Type keeps the pipeline simple where it matters: clean the signal, batch intelligently, run locally, and keep dictation fast enough to use throughout the day.

At a glance

Fast because it is local. Accurate because the pipeline is not sloppy.

Most dictation marketing hand-waves the pipeline. That is backwards: the pipeline is the product. If the signal is bad, the batching is naive, or the runtime is inconsistent, people feel it immediately.

What matters

  • Runs the full dictation pipeline on-device.
  • Uses signal cleanup before recognition, not just transcript cleanup after.
  • Streams in batches so long dictation sessions do not stall at the end.
  • Keeps network calls minimal and optional.

Pipeline

The on-device pipeline, step by step.

Voice activity detection

RNNoise-based VAD continuously tracks speech and suppresses background noise before recognition starts doing expensive work.
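
RNNoise itself is a C library, but the gating idea is easy to sketch. Here is a minimal, hypothetical energy-based VAD with a hangover window; the real RNNoise detector is a trained neural model, so treat this only as an illustration of why gating happens before the recogniser runs:

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (floats in [-1, 1])."""
    return sum(s * s for s in frame) / len(frame)

def vad(frames, threshold=1e-3, hangover=3):
    """Flag each frame as speech or silence.

    A frame counts as speech when its energy exceeds `threshold`;
    `hangover` extra frames are kept after speech ends so trailing
    consonants are not clipped. Only flagged frames reach recognition.
    """
    flags, tail = [], 0
    for frame in frames:
        if frame_energy(frame) > threshold:
            flags.append(True)
            tail = hangover
        elif tail > 0:
            flags.append(True)
            tail -= 1
        else:
            flags.append(False)
    return flags
```

The hangover is what keeps word endings intact: energy drops before the phoneme is actually finished.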

Segmentation

Phrase boundaries use padding and merging so short pauses do not shatter context between utterances.
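
Padding and merging can be sketched in a few lines. This is a simplified stand-in, assuming segments arrive as `(start, end)` times in seconds; the pad and gap values here are illustrative, not Voice Type's actual tuning:

```python
def merge_segments(segments, pad=0.2, max_gap=0.5):
    """Pad (start, end) times and fuse neighbours separated by short gaps.

    `pad` widens each segment slightly; any two padded segments closer
    than `max_gap` are merged, so a brief pause mid-sentence does not
    split one utterance into two context-free fragments.
    """
    padded = [(max(0.0, s - pad), e + pad) for s, e in segments]
    merged = [padded[0]]
    for start, end in padded[1:]:
        last_start, last_end = merged[-1]
        if start - last_end <= max_gap:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged
```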

Audio conditioning

We normalise to −14 LUFS with K-weighting, then apply a 50 Hz second-order Butterworth high-pass filter.
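
A sketch of the conditioning stage, under two simplifications: the high-pass uses the standard RBJ cookbook biquad (Q ≈ 0.7071 gives the Butterworth response), and plain RMS stands in for the K-weighted LUFS measurement, which in full requires the ITU-R BS.1770 pre-filter chain:

```python
import math

def highpass_biquad(fc, fs, q=0.7071):
    """RBJ cookbook coefficients for a 2nd-order Butterworth high-pass."""
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    b0 = (1 + cos_w0) / 2
    b1 = -(1 + cos_w0)
    b2 = (1 + cos_w0) / 2
    a0 = 1 + alpha
    return [c / a0 for c in (b0, b1, b2)], [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]

def filter_samples(b, a, x):
    """Direct-form I biquad applied to a list of samples."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for s in x:
        out = b[0]*s + b[1]*x1 + b[2]*x2 - a[1]*y1 - a[2]*y2
        x2, x1 = x1, s
        y2, y1 = y1, out
        y.append(out)
    return y

def normalise(x, target_rms=0.1):
    """Scale toward a target RMS — a crude stand-in for LUFS levelling."""
    rms = math.sqrt(sum(s*s for s in x) / len(x)) or 1.0
    return [s * target_rms / rms for s in x]
```

A second-order high-pass at 50 Hz removes rumble and DC offset without touching the speech band, and because its DC gain is exactly zero, any constant bias decays away.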

Resampling

48 kHz capture is downsampled to 16 kHz, the sample rate Whisper-derived models expect, so every batch reaches the recogniser in a consistent format.
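
Because 48 kHz is an exact multiple of 16 kHz, the conversion can be a clean decimation by 3: low-pass below the new Nyquist frequency, then keep every third sample. A minimal windowed-sinc sketch (filter length and cutoff margin are illustrative choices, not Voice Type's actual parameters):

```python
import math

def lowpass_fir(cutoff, fs, taps=63):
    """Hamming-windowed sinc low-pass kernel, normalised to unity DC gain."""
    fc = cutoff / fs
    mid = (taps - 1) / 2
    kernel = []
    for n in range(taps):
        x = n - mid
        h = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))
        kernel.append(h * w)
    total = sum(kernel)
    return [k / total for k in kernel]

def decimate(x, factor=3, fs=48000, taps=63):
    """Anti-alias then keep every `factor`-th sample (3 maps 48 kHz -> 16 kHz)."""
    kernel = lowpass_fir(fs / (2 * factor) * 0.9, fs, taps)
    half = len(kernel) // 2
    out = []
    for i in range(0, len(x), factor):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += k * x[idx]
        out.append(acc)
    return out
```

The anti-aliasing filter is the part that matters: dropping samples without it folds everything above 8 kHz back into the speech band as noise.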

Context batching

Dictation streams in ~30 second blocks so longer thoughts stay coherent instead of resetting every few seconds.
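
At its simplest, batching is just slicing the running sample stream into fixed-duration blocks. A minimal sketch, assuming samples accumulate in one list (the real pipeline would align block edges with the segmenter's phrase boundaries rather than cutting mid-word):

```python
def batch_samples(samples, rate=16000, block_seconds=30):
    """Split a running sample stream into ~30 s blocks for recognition.

    Each block is decoded as one unit, so the model sees a full
    thought's worth of context instead of resetting every phrase.
    """
    block = rate * block_seconds
    return [samples[i:i + block] for i in range(0, len(samples), block)]
```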

Core ML + Metal

Optimised operators keep inference responsive while staying energy-efficient on actual Macs, not just benchmark charts.

Speed

Why it feels faster in practice.

No uploads

Audio never leaves your Mac, so you avoid network latency, compression artefacts, and cloud queueing.

Quick finalisation

When you release the hotkey, the current batch finalises in roughly 2–3 seconds on an M-series Mac instead of replaying the full recording.
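
The reason finalisation is quick is bookkeeping: blocks are decoded as they fill, so releasing the hotkey only leaves the short tail. A hypothetical sketch of that invariant, where `decode` is a placeholder for the on-device recogniser:

```python
class BatchFinaliser:
    """Track how much audio is already transcribed, so finishing a
    session only decodes the tail rather than replaying the recording."""

    def __init__(self, decode, rate=16000, block_seconds=30):
        self.decode = decode
        self.block = rate * block_seconds
        self.done = 0          # samples already finalised
        self.text = []

    def feed(self, samples):
        # Finalise every full block as soon as it is available.
        while len(samples) - self.done >= self.block:
            chunk = samples[self.done:self.done + self.block]
            self.text.append(self.decode(chunk))
            self.done += self.block

    def finish(self, samples):
        # On hotkey release, only the short remainder needs decoding.
        if len(samples) > self.done:
            self.text.append(self.decode(samples[self.done:]))
            self.done = len(samples)
        return " ".join(self.text)
```

The work remaining at release time is bounded by one block, no matter how long the session ran.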

Energy-aware scheduling

On-device inference avoids long-running cloud calls, which helps keep laptops cooler and quieter during real work.

Accuracy

Why it stays accurate under normal human mess.

Signal-first improvements

We improve the audio before recognition instead of leaning on heavy prompt hacks to rescue a bad transcript afterwards.

Noise resilience

RNNoise lowers ambient distractions without flattening consonants and sibilants into mush.

Domain vocabulary

Custom word lists feed into the recogniser directly, which helps technical jargon and product names survive first contact.
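
One common way to feed vocabulary into a Whisper-style recogniser is through its initial prompt, whose tokens bias decoding toward the words they contain. A hypothetical helper that folds a word list into such a prompt (the truncation limit and formatting are illustrative assumptions):

```python
def vocab_prompt(words, max_chars=200):
    """Fold a custom word list into a short priming string.

    Seeding the recogniser's initial prompt with jargon and product
    names raises the odds those terms survive decoding. Deduplicates
    case-insensitively and truncates to keep the prompt cheap.
    """
    seen, kept = set(), []
    for w in words:
        key = w.lower()
        if key not in seen:
            seen.add(key)
            kept.append(w)
    return ", ".join(kept)[:max_chars]
```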

Trust

Privacy and reviewability are part of the engineering story.

Mac App Store distribution

Voice Type ships with Apple notarisation and sandboxing enabled.

Authentic transparency

App Store reviews are shown without filtering, including critical feedback.

Minimal network calls

We only ping Apple for receipt checks and, if you enable it, your optional rewrite provider.

Optional rewrite

Dictation stays local. Rewrite is opt-in.

If you enable bring-your-own-key rewriting, transcripts can be handed to a fast LLM for formatting, summarising, or drafting. The important part is that dictation itself does not need that path to feel immediate.
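
The separation is structural: transcription never waits on the network path. A minimal sketch of that shape, where `transcribe` and `rewrite` are hypothetical callables standing in for the on-device recogniser and the user's opt-in provider:

```python
def dictate(transcribe, audio, rewrite=None):
    """Local transcription first; rewriting only if the user opted in.

    `rewrite` stays None unless a bring-your-own-key provider is
    configured, so the default path never touches the network, and a
    failed rewrite degrades to the raw transcript instead of blocking.
    """
    text = transcribe(audio)
    if rewrite is None:
        return text
    try:
        return rewrite(text)
    except Exception:
        return text
```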

That separation matters. Speech capture should be dependable first. Language polish can be optional second.