Three years ago, running a Whisper-class model on a laptop meant choosing between accuracy and usability. The CPU couldn't keep up. The GPU was power-hungry. Cloud was the only practical option.
Apple Silicon changed this.
What Core ML does
Core ML is Apple's framework for running ML models on-device. It handles the boring parts: memory management, operator fusion, scheduling across CPU/GPU/Neural Engine. You give it a model, it figures out how to run it efficiently.
For Whisper-style models, Core ML can split work across the Neural Engine (for matrix operations) and GPU (for attention layers). The result: fast inference without thermal throttling.
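If you're curious what that looks like in code, here's a minimal sketch of loading a compiled Core ML model and letting the framework place work itself. The file name and function are placeholders, not our actual model:

```swift
import CoreML
import Foundation

// Minimal sketch: load a compiled Core ML model and let the framework decide,
// per operation, whether it runs on the CPU, GPU, or Neural Engine.
// "Transcriber.mlmodelc" is a placeholder name, not our actual model file.
func loadTranscriber() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // CPU + GPU + Neural Engine; Core ML picks per layer

    let url = URL(fileURLWithPath: "Transcriber.mlmodelc")
    return try MLModel(contentsOf: url, configuration: config)
}
```

(`.all` is already the default; the point is that layer placement isn't something you manage by hand.)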
What Metal adds
Metal is Apple's low-level GPU API. Core ML uses it under the hood, but we also use Metal directly for custom audio processing kernels.
The pre-processing pipeline (noise suppression, normalization, resampling) runs on Metal. This keeps the CPU free for UI work and avoids memory copies between CPU and GPU.
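Here's the shape of that in code, pared down to a single kernel that just applies a gain. It's a sketch of the dispatch pattern, not our actual pipeline; the kernel, names, and sizes are invented for the example:

```swift
import Metal

// Stripped-down example of the pattern: run a tiny audio kernel on the GPU
// so the CPU never touches the samples. The kernel only applies a gain.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void apply_gain(device float *samples [[buffer(0)]],
                       constant float &gain  [[buffer(1)]],
                       uint id               [[thread_position_in_grid]]) {
    samples[id] *= gain;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "apply_gain")!)

// One second of fake audio at 48 kHz.
let samples: [Float] = (0..<48_000).map { _ in Float.random(in: -1...1) }
var gain: Float = 0.5

let sampleBuffer = device.makeBuffer(bytes: samples,
                                     length: samples.count * MemoryLayout<Float>.stride)!

let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(sampleBuffer, offset: 0, index: 0)
encoder.setBytes(&gain, length: MemoryLayout<Float>.stride, index: 1)
encoder.dispatchThreads(MTLSize(width: samples.count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// On Apple Silicon the buffer lives in the same unified memory the CPU sees,
// so reading the result back doesn't require a copy.
let processed = sampleBuffer.contents().bindMemory(to: Float.self, capacity: samples.count)
print("first sample after gain:", processed[0])
```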
Real numbers
On an M1 Mac, finalizing 30 seconds of audio takes about 2-3 seconds. That's the model inference time after you release the hotkey.
On an M3, it's closer to 1.5-2 seconds.
Intel Macs still work, but more slowly. There's no Neural Engine, so the model falls back to the CPU, which means longer finalization and more power draw.
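Core ML handles that fallback on its own, but you can also be explicit about it. A hedged sketch, not our exact logic:

```swift
import CoreML

// Illustrative only: on Intel there is no Neural Engine, so restrict Core ML
// to CPU and GPU and expect longer finalization times.
let config = MLModelConfiguration()
#if arch(arm64)
config.computeUnits = .all          // Apple Silicon: CPU + GPU + Neural Engine
#else
config.computeUnits = .cpuAndGPU    // Intel Mac: CPU + GPU only
#endif
```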
Why this matters for battery
Cloud dictation keeps the radio on. Uploading audio, waiting for a response, downloading text. Each step draws power.
On-device inference is a burst: work hard for 2 seconds, then idle. The Neural Engine is remarkably efficient for this pattern. We've had users dictate for hours on battery.
Trade-offs
Model size vs accuracy. Smaller models run faster but miss more words. We offer multiple model sizes (27 MB to 550 MB) so you can pick the trade-off that fits your hardware.
Larger models make sense on M2/M3 with plenty of unified memory. Smaller models work better on 8 GB machines or older Intel Macs.
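As a rough illustration of how that choice could be automated, here's a heuristic keyed on installed RAM. The file names and thresholds are made up for the example, not the app's actual defaults:

```swift
import Foundation

// Illustrative heuristic, not the app's actual defaults: pick a model size
// from installed memory, since larger models want more unified memory.
let ramGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824

let modelFile: String
switch ramGB {
case ..<12: modelFile = "transcriber-small.mlmodelc"   // ~27 MB: fastest, least accurate
case ..<32: modelFile = "transcriber-medium.mlmodelc"
default:    modelFile = "transcriber-large.mlmodelc"   // ~550 MB: most accurate
}
print("Selected model:", modelFile)
```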
On-device used to mean compromise. Now it means fast, private, and battery-friendly. The hardware caught up.
