48 kHz in, 16 kHz out: why resampling matters

Matching your mic's output to what Whisper expects.

28 Nov 2025

Your Mac's microphone captures audio at 48 kHz. Whisper-style models expect 16 kHz. The conversion matters more than you'd think.

The naive approach breaks things

Simple decimation (just dropping samples) creates aliasing artifacts. High frequencies fold back into the audible range as noise. The model hears phantom sounds that weren't in your voice.
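
A quick way to see the fold-back (a NumPy sketch for illustration, not code from the pipeline): decimate a 13 kHz tone by 3 and its energy reappears at 3 kHz.

```python
import numpy as np

fs_in, fs_out = 48_000, 16_000
t = np.arange(fs_in) / fs_in          # one second of audio at 48 kHz

# A 13 kHz tone: representable at 48 kHz, but above the 8 kHz
# Nyquist limit of the 16 kHz target rate.
tone = np.sin(2 * np.pi * 13_000 * t)

# Naive decimation: keep every 3rd sample, no filtering.
naive = tone[::3]

# The dominant frequency after decimation is no longer 13 kHz.
spectrum = np.abs(np.fft.rfft(naive))
freqs = np.fft.rfftfreq(naive.size, d=1 / fs_out)
print(freqs[spectrum.argmax()])       # ~3000.0 Hz: 13 kHz folded to 16k - 13k
```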

Cheap resampling libraries optimize for speed, not quality. Fine for ringtones. Bad for speech recognition where subtle differences between consonants matter.

Band-limited resampling

We use band-limited resampling: apply a low-pass filter at the target rate's Nyquist frequency (8 kHz, half of 16 kHz), then decimate. This removes the frequencies that would alias before they can cause problems.
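
The post doesn't name the implementation the pipeline uses; here is a minimal SciPy sketch of the idea, where the 7.6 kHz cutoff and 241-tap filter length are illustrative choices, not the product's actual parameters.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs_in, fs_out = 48_000, 16_000
factor = fs_in // fs_out              # 3

def resample_48k_to_16k(x: np.ndarray) -> np.ndarray:
    # Low-pass just below the 8 kHz output Nyquist, then drop samples.
    taps = firwin(numtaps=241, cutoff=7_600, fs=fs_in)
    return lfilter(taps, 1.0, x)[::factor]
```

SciPy's `resample_poly(x, up=1, down=3)` collapses the same filter-then-decimate idea into a single polyphase call, with the anti-aliasing filter built in.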

The filter matters. Too aggressive and you lose the high-frequency content that distinguishes "s" from "f" from "th". Too gentle and aliasing sneaks through.
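
To make that trade-off concrete, one way to design the filter is to pin down the stopband rejection and transition width explicitly (the 80 dB and 800 Hz figures below are assumptions for illustration, not the pipeline's settings):

```python
from scipy.signal import firwin, kaiserord

fs = 48_000
# Assumed design targets: 80 dB of stopband rejection, with an 800 Hz
# transition band ending exactly at the 8 kHz output Nyquist.
stop_db, transition_hz = 80.0, 800
numtaps, beta = kaiserord(stop_db, transition_hz / (fs / 2))
taps = firwin(numtaps, cutoff=8_000 - transition_hz / 2,
              window=("kaiser", beta), fs=fs)
print(numtaps)  # roughly 300: more rejection or a narrower band costs more taps
```

Placing the cutoff half a transition band below 8 kHz puts the stopband edge right at the new Nyquist, so nothing aliases, while the passband still reaches about 7.2 kHz, where the sibilant energy that separates those consonants lives.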

Why not just record at 16 kHz?

macOS audio APIs default to 48 kHz. Fighting the system adds latency and edge cases. Better to accept 48 kHz and resample correctly.

Plus, we process at 48 kHz for the earlier pipeline stages (VAD, noise suppression). Higher sample rate means more information to work with when detecting speech boundaries.

The full chain

  1. Capture at 48 kHz (Mac default)
  2. RNNoise VAD + noise suppression (48 kHz)
  3. Segmentation into phrases (48 kHz)
  4. LUFS normalization + high-pass filter (48 kHz)
  5. Band-limited resample to 16 kHz
  6. Feed to Whisper-style model
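
As a rough sketch of stages 4 and 5 only (the RNNoise and Whisper stages need their own bindings; the 80 Hz high-pass cutoff and the peak normalization standing in for true LUFS matching are both assumptions):

```python
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

FS_CAPTURE, FS_MODEL = 48_000, 16_000

def prepare_for_model(phrase: np.ndarray) -> np.ndarray:
    """Stages 4-5: normalize, high-pass, band-limited resample.

    `phrase` is one segmented speech phrase at 48 kHz that has
    already been through RNNoise denoising and segmentation.
    """
    # High-pass to drop rumble below the voice band (80 Hz cutoff assumed).
    sos = butter(4, 80, btype="highpass", fs=FS_CAPTURE, output="sos")
    phrase = sosfilt(sos, phrase)

    # Peak normalization as a simple stand-in for LUFS loudness matching.
    phrase = phrase / (np.max(np.abs(phrase)) + 1e-9)

    # Polyphase resampling: anti-aliasing FIR and 3:1 decimation in one step.
    return resample_poly(phrase, up=1, down=FS_CAPTURE // FS_MODEL)
```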

Each step operates at the sample rate that makes sense for it. The model gets clean 16 kHz audio that matches its training distribution.

Small details compound. A 1% improvement at each of the six stages multiplies out to roughly 6% overall (1.01⁶ ≈ 1.06), enough for noticeably better transcripts.
