Interactive demo

Where dictation latency actually comes from

This page is a model, not a fake benchmark chart. It breaks the delay into understandable parts so you can see why cloud dictation often feels fine for a one-liner and lousy once it becomes a habit.

The useful question is not whether cloud speech recognition is “fast.” It is where the time goes after you stop speaking. A local workflow can remove upload delay and server round-trips. A cloud workflow cannot.

Demo settings: LLM rewrite on (BYOK cleanup) · 60 ms RTT · 5 Mbps up

Cloud dictation (upload audio → cloud STT): ~14 s
  • Handshake: 120 ms
  • Audio upload: 7.7 s
  • Transcribe: 6.0 s
  • Overhead: 300 ms

Voice Type (local STT → LLM rewrite): ~2.8 s
  • Final transcript: 1.2 s
  • LLM handshake: 120 ms
  • LLM rewrite: 1.5 s

Audio stays on your Mac. Only text goes to your LLM.

Result: Voice Type is ~11 s faster (about 80% less wait).

Calculation details

  • Cloud audio: 128 kbps; cloud STT at ~50× real-time
  • On-device: ~1.2 s to finalize the last chunk on an M1 Pro
  • LLM rewrite: Cerebras at ~1000 tokens/sec
  • Speech rate: ~225 wpm (a fast talker)
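The displayed bars follow directly from these bullets. A quick back-of-envelope check in Python (the ~5-minute dictation length is an inference that makes the numbers consistent; it is not stated on the page, and the ~1.33 tokens-per-word ratio is a common rule of thumb, not a figure from the demo):

```python
# Back-of-envelope check of the demo's headline numbers.
# Assumed scenario: ~5 minutes of continuous dictation.

speech_seconds = 300     # ~5 min of speech (inferred)
audio_kbps = 128         # cloud audio bitrate
upload_mbps = 5          # upload bandwidth from the demo settings
stt_speedup = 50         # cloud STT at ~50x real-time
wpm = 225                # fast talker
llm_tps = 1000           # Cerebras-class token throughput

# Cloud path: handshake + upload + transcription + service overhead
upload_s = speech_seconds * audio_kbps / (upload_mbps * 1000)
transcribe_s = speech_seconds / stt_speedup
cloud_total = 0.120 + upload_s + transcribe_s + 0.300

# Local path: on-device finalize + LLM handshake + rewrite
tokens = (wpm / 60) * speech_seconds * (4 / 3)  # ~1.33 tokens/word
local_total = 1.2 + 0.120 + tokens / llm_tps

print(f"cloud: {cloud_total:.1f}s, local: {local_total:.1f}s")
```

With these inputs the cloud path works out to ≈14.1 s and the local path to ≈2.8 s, matching the bars above.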

How to read the demo

  • The cloud path includes network handshake, upload time, recognition time, and service overhead.
  • The local path keeps speech recognition on-device and only adds optional extra time if you ask for an external text rewrite.
  • The point is not to produce one magic number. It is to show which pieces of the path disappear when audio does not have to leave the machine.

What the model assumes

The interactive uses simple, explicit assumptions: network round-trip time, upload bandwidth, a rough cloud transcription speed, and a local finalization step for the last chunk of speech. That makes it honest. Real providers differ, but the structure of the delay stays the same.

This mirrors how streaming speech systems are typically documented: audio is chunked, sent over a live stream, processed incrementally, and finalized after speech ends. Larger network delays and slower uploads still have to be paid for somewhere.
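That delay structure can be sketched as one small parameterized function. Everything here is illustrative: the function name, the defaults, and the "handshake ≈ 2×RTT" estimate are assumptions chosen to mirror the demo's settings, not a real provider's behavior:

```python
def cloud_wait(speech_s, rtt_ms, upload_mbps,
               audio_kbps=128, stt_speedup=50, overhead_s=0.3):
    """Cloud-path wait after speech ends, using the demo's structure."""
    handshake = 2 * rtt_ms / 1000                    # rough: two round-trips
    upload = speech_s * audio_kbps / (upload_mbps * 1000)
    transcribe = speech_s / stt_speedup
    return handshake + upload + transcribe + overhead_s

# Same 30-second note on fast fiber vs. bad hotel Wi-Fi:
print(cloud_wait(30, rtt_ms=20, upload_mbps=100))    # fiber: about 1 s
print(cloud_wait(30, rtt_ms=300, upload_mbps=0.5))   # hotel: about 9 s
```

Varying only the network terms shows the point of the model: transcription speed is identical in both calls, yet the wait grows roughly ninefold once upload and round-trip costs dominate.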

Why this matters in actual use

  • For a one-line note on fast fiber, cloud latency can feel acceptable.
  • For repeated short dictation bursts during normal work, every extra second compounds into friction.
  • On bad hotel Wi-Fi, on trains, or fully offline, the difference stops being about speed and becomes about whether the workflow works at all.

Voice Type keeps the audio path local. Optional rewrites can still use your own LLM, but the speech recognition step does not have to wait on the network.

Try the free 7-day trial