Three years ago, running a Whisper-class model on a laptop meant choosing between accuracy and usability. The CPU couldn't keep up. The GPU was power-hungry. Cloud was the only practical option.
Apple Silicon changed this.
What Core ML does
Core ML is Apple's framework for running ML models on-device. It handles the boring parts: memory management, operator fusion, scheduling across CPU/GPU/Neural Engine. You give it a model, it figures out how to run it efficiently.
For Whisper-style models, Core ML can split work across the Neural Engine (for matrix operations) and GPU (for attention layers). The result: fast inference without thermal throttling.
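If you're curious what that looks like in code, here's a minimal sketch of loading a compiled Core ML model and letting the framework place work itself. The file name and function are placeholders, not our actual model:

```swift
import CoreML
import Foundation

// Minimal sketch: load a compiled Core ML model and let the framework decide,
// per operation, whether it runs on the CPU, GPU, or Neural Engine.
// "Transcriber.mlmodelc" is a placeholder name, not our actual model file.
func loadTranscriber() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // CPU + GPU + Neural Engine; Core ML picks per layer

    let url = URL(fileURLWithPath: "Transcriber.mlmodelc")
    return try MLModel(contentsOf: url, configuration: config)
}
```

(`.all` is already the default; the point is that layer placement isn't something you manage by hand.)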
What Metal adds
Metal is Apple's low-level GPU API. Core ML uses it under the hood, but we also use Metal directly for custom audio processing kernels.
The pre-processing pipeline (noise suppression, normalization, resampling) runs on Metal. This keeps the CPU free for UI work and avoids memory copies between CPU and GPU.
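Here's the shape of that in code, pared down to a single kernel that just applies a gain. It's a sketch of the dispatch pattern, not our actual pipeline; the kernel, names, and sizes are invented for the example:

```swift
import Metal

// Stripped-down example of the pattern: run a tiny audio kernel on the GPU
// so the CPU never touches the samples. The kernel only applies a gain.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void apply_gain(device float *samples [[buffer(0)]],
                       constant float &gain  [[buffer(1)]],
                       uint id               [[thread_position_in_grid]]) {
    samples[id] *= gain;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "apply_gain")!)

// One second of fake audio at 48 kHz.
let samples: [Float] = (0..<48_000).map { _ in Float.random(in: -1...1) }
var gain: Float = 0.5

let sampleBuffer = device.makeBuffer(bytes: samples,
                                     length: samples.count * MemoryLayout<Float>.stride)!

let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(sampleBuffer, offset: 0, index: 0)
encoder.setBytes(&gain, length: MemoryLayout<Float>.stride, index: 1)
encoder.dispatchThreads(MTLSize(width: samples.count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// On Apple Silicon the buffer lives in the same unified memory the CPU sees,
// so reading the result back doesn't require a copy.
let processed = sampleBuffer.contents().bindMemory(to: Float.self, capacity: samples.count)
print("first sample after gain:", processed[0])
```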
Real numbers
On an M1 Mac, finalizing 30 seconds of audio takes about 2-3 seconds. That's the model inference time after you release the hotkey.
On an M3, it's closer to 1.5-2 seconds.
Intel Macs still work, but more slowly. There's no Neural Engine, so the model falls back to the CPU, which means longer finalization and more power draw.
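Core ML handles that fallback on its own, but you can also be explicit about it. A hedged sketch, not our exact logic:

```swift
import CoreML

// Illustrative only: on Intel there is no Neural Engine, so restrict Core ML
// to CPU and GPU and expect longer finalization times.
let config = MLModelConfiguration()
#if arch(arm64)
config.computeUnits = .all          // Apple Silicon: CPU + GPU + Neural Engine
#else
config.computeUnits = .cpuAndGPU    // Intel Mac: CPU + GPU only
#endif
```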
Why this matters for battery
Cloud dictation keeps the radio on. Uploading audio, waiting for a response, downloading text. Each step draws power.
On-device inference is a burst: work hard for 2 seconds, then idle. The Neural Engine is remarkably efficient for this pattern. We've had users dictate for hours on battery.
Trade-offs
Model size vs accuracy. Smaller models run faster but miss more words. We offer multiple model sizes (27 MB to 550 MB) so you can pick the trade-off that fits your hardware.
Larger models make sense on M2/M3 with plenty of unified memory. Smaller models work better on 8 GB machines or older Intel Macs.
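As a rough illustration of how that choice could be automated, here's a heuristic keyed on installed RAM. The file names and thresholds are made up for the example, not the app's actual defaults:

```swift
import Foundation

// Illustrative heuristic, not the app's actual defaults: pick a model size
// from installed memory, since larger models want more unified memory.
let ramGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824

let modelFile: String
switch ramGB {
case ..<12: modelFile = "transcriber-small.mlmodelc"   // ~27 MB: fastest, least accurate
case ..<32: modelFile = "transcriber-medium.mlmodelc"
default:    modelFile = "transcriber-large.mlmodelc"   // ~550 MB: most accurate
}
print("Selected model:", modelFile)
```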
On-device used to mean compromise. Now it means fast, private, and battery-friendly. The hardware caught up.
