Free Offline Transcription App for Mac: How I Built Local Whisper with whisper.cpp + CoreML
TL;DR: I built a fully offline, free transcription app for Mac called Local Whisper that captures both microphone input and system audio (including your meeting partner's voice). It runs whisper.cpp with CoreML acceleration on Apple Silicon, achieving near-real-time transcription without sending any data to the cloud. This post covers the architecture, the real-time optimization tricks, and the system audio capture method that took the most effort to crack.
Are you looking for a transcription app on Mac that works completely offline and costs nothing?
- You want to transcribe both your voice and the other person's voice in web meetings
- You don't want to pay a monthly subscription just for occasional use
- You don't want your audio data uploaded to any cloud server
- You need something fully local, secure, and auditable
This article walks through Local Whisper — an open-source, fully offline transcription app I built for Mac — covering its technical design, the real-time optimization strategies, and the system audio capture approach that makes it possible.
GitHub: https://github.com/y-dai20/local-whisper
What Is Local Whisper?
Local Whisper is a Mac application that runs OpenAI's Whisper speech-to-text model entirely on your local machine — no internet connection required.
The name comes directly from what it does: it runs Whisper locally. While whisper.cpp can technically run on Windows as well, the app targets macOS, where Apple Silicon Macs are significantly faster thanks to CoreML support and the Neural Engine.
Feature Summary
| Feature | Status |
|---|---|
| Fully offline (no external data transmission) | ✅ |
| No subscription / completely free | ✅ |
| Microphone audio capture | ✅ |
| System audio capture (meeting partner's voice) | ✅ |
| Web meeting transcription without bots | ✅ |
| Open source and auditable | ✅ |
Why I Built This
The transcription app market is crowded, but almost every option has at least one of these problems:
- Monthly subscription fees — even for light, occasional use
- Mandatory cloud upload — your audio leaves your device
- Bot-based meeting capture — requires joining meetings as a separate participant
I didn't want to pay $10–20/month for something I'd use a handful of times per week. More importantly, I wasn't comfortable sending audio data — which often contains sensitive conversations — to an external server.
The goal was simple:
A transcription tool that is fully self-contained, based on open-source components, and where I can audit every part of the pipeline.
Core Engine: whisper.cpp + CoreML
For fully offline, high-accuracy transcription, I chose whisper.cpp — a C/C++ implementation of the Whisper architecture, highly optimized for local inference. It supports a wide range of quantization formats and hardware acceleration backends, including CoreML on Apple Silicon and CUDA on NVIDIA GPUs.
Model Configuration
Model: Whisper large-v3-turbo
Quantization: Q5_0
Acceleration: CoreML (Apple Neural Engine)
Why large-v3-turbo?
It hits the sweet spot between accuracy and speed. The turbo variant uses a smaller decoder while keeping the full encoder from large-v3, making it significantly faster than large-v3 while retaining most of its accuracy.
Why Q5_0 quantization?
Q5_0 is a 5-bit integer quantization format. Compared to full fp16, it:
- Reduces model size by roughly 65%
- Has minimal accuracy degradation for transcription tasks
- Fits comfortably in memory on M-series Macs
Quantization is a technique to reduce model size and speed up inference by representing weights with lower-precision numbers (e.g., 5-bit integers instead of 16-bit floats). It trades a small amount of accuracy for significant performance gains.
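The "roughly 65%" figure falls out of the Q5_0 block layout: ggml packs weights in blocks of 32, each storing a 2-byte fp16 scale, 4 bytes of high bits, and 16 bytes of packed 4-bit low nibbles — 22 bytes per 32 weights. A quick sketch of the arithmetic:

```rust
// ggml's Q5_0 format packs weights in blocks of 32:
// a 2-byte fp16 scale + 4 bytes of high bits + 16 bytes of 4-bit low nibbles.
fn q5_0_bits_per_weight() -> f64 {
    let block_bytes = 2.0 + 4.0 + 16.0; // 22 bytes per block
    block_bytes * 8.0 / 32.0 // = 5.5 bits per weight
}

fn main() {
    let q5 = q5_0_bits_per_weight();
    let reduction = 1.0 - q5 / 16.0; // versus 16-bit fp16 weights
    println!("Q5_0: {q5} bits/weight, {:.1}% smaller than fp16", reduction * 100.0);
}
```

5.5 bits per weight versus 16 bits for fp16 gives a ~65.6% size reduction, matching the figure above.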
Benchmark Result
With CoreML enabled on an Apple Silicon Mac:
~7 seconds of audio transcribed in approximately 1 second
That's a real-time factor (RTF) of roughly 0.14x — meaning the model runs about 7× faster than real-time. This makes real-time transcription genuinely practical on M-series hardware.
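For readers new to the metric, RTF is simply processing time divided by audio duration, so values below 1.0 mean faster than real time:

```rust
// Real-time factor: processing time divided by audio duration.
// RTF < 1.0 means the model runs faster than real time.
fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

fn main() {
    let rtf = real_time_factor(1.0, 7.0); // ~1 s to transcribe ~7 s of audio
    println!("RTF = {rtf:.2}, speedup = {:.1}x real time", 1.0 / rtf);
}
```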
Real-Time Optimization: Two Key Techniques
Feeding long audio directly to Whisper causes latency to accumulate over time. Even at 7× real-time speed, if you buffer 30 seconds before running inference, your output is always 30+ seconds behind. Two optimizations brought this down to near-real-time behavior.
Technique 1: Silent Segment Removal with Silero VAD
VAD (Voice Activity Detection) is the process of detecting which portions of an audio stream contain speech, and which contain silence or background noise. Silero VAD is a lightweight neural VAD model from Silero AI that runs efficiently on CPU.
Without VAD, Whisper processes silence just like speech. This causes:
- Unnecessary encoder computation on empty audio frames
- Cumulative latency growth over long sessions
- Degraded real-time performance
The pipeline in Local Whisper:
Audio Stream
    ↓
Split into small chunks (e.g., 512 samples)
    ↓
Silero VAD → Is this chunk speech?
    ├─ YES → Pass to Whisper → Transcribed text
    └─ NO  → Discard
Additionally, if silence continues for more than 1 second, the session is ended and the next utterance starts a fresh session. This prevents the context buffer from growing indefinitely.
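The gating and session logic above can be sketched as follows. This is a minimal illustration, not the app's actual code: a simple RMS-energy threshold stands in for Silero VAD (which is a neural classifier), and the constants follow the description above (512-sample chunks at 16 kHz, 1-second silence timeout).

```rust
const SAMPLE_RATE: usize = 16_000;
const CHUNK: usize = 512;            // samples per VAD decision
const SILENCE_LIMIT_SECS: f64 = 1.0; // end the session after 1 s of silence

// Stand-in detector: a real deployment would call Silero VAD here.
fn is_speech(chunk: &[f32]) -> bool {
    let rms = (chunk.iter().map(|s| s * s).sum::<f32>() / chunk.len() as f32).sqrt();
    rms > 0.01
}

fn process_stream(samples: &[f32], mut transcribe: impl FnMut(&[f32])) {
    let mut session: Vec<f32> = Vec::new();
    let mut silent_chunks = 0usize;
    let limit = (SILENCE_LIMIT_SECS * SAMPLE_RATE as f64 / CHUNK as f64) as usize;

    for chunk in samples.chunks(CHUNK) {
        if is_speech(chunk) {
            session.extend_from_slice(chunk); // speech → buffer for Whisper
            silent_chunks = 0;
        } else {
            silent_chunks += 1; // silence → discard, but count toward timeout
            if silent_chunks >= limit && !session.is_empty() {
                transcribe(&session); // 1 s of silence ends the session
                session.clear();
            }
        }
    }
    if !session.is_empty() {
        transcribe(&session); // flush whatever remains at end of stream
    }
}
```

Because silence is never appended to the session buffer, Whisper only ever sees speech, and the buffer is bounded by the length of a single utterance rather than the whole recording.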
Practical impact: Latency reduction is perceptible even in casual use — speech responses appear noticeably faster than without VAD.
Technique 2: Reducing audio_ctx to Cut Encoder Time in Half
Whisper's inference time is dominated by the encoder — the part that converts raw audio into a latent representation before decoding text.
Whisper uses a Transformer encoder-decoder architecture. The encoder processes the full audio spectrogram and is the computational bottleneck. The decoder generates tokens autoregressively but is typically much faster.
The default audio_ctx parameter is 1500, which corresponds to ~30 seconds of audio context. For real-time use where you're processing short chunks, this is wasteful.
Default: audio_ctx = 1500 (~30 seconds) → Encoder: ~1000ms
Optimized: audio_ctx = 750 (~15 seconds) → Encoder: ~500ms
Result: Encoder latency cut by approximately 50%.
For real-time transcription, the tradeoff is clear:
You don't need 30 seconds of audio context per chunk. You need low latency.
Reducing audio_ctx to 750 sacrifices some ability to handle very long unbroken utterances, but in practice — where sentences and thoughts are measured in seconds — this is rarely a constraint.
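The 1500 ↔ 30 s mapping comes from Whisper's frontend: the mel spectrogram uses a 10 ms hop, and the encoder's convolutional stack downsamples by 2×, so each encoder position covers 20 ms (50 positions per second). A small helper for picking audio_ctx from a target chunk length (in whisper.cpp this value would be assigned to the audio_ctx field of whisper_full_params):

```rust
// Each encoder position covers 20 ms of audio (10 ms mel hop,
// then 2x convolutional downsampling) → 50 positions per second.
fn audio_ctx_for_secs(secs: f64) -> u32 {
    const POSITIONS_PER_SEC: f64 = 50.0;
    (secs * POSITIONS_PER_SEC).ceil() as u32
}

fn main() {
    println!("30 s → audio_ctx {}", audio_ctx_for_secs(30.0)); // 1500 (full context)
    println!("15 s → audio_ctx {}", audio_ctx_for_secs(15.0)); // 750 (halved setting)
}
```

Since encoder cost scales with context length, halving audio_ctx roughly halves encoder time, which matches the ~1000 ms → ~500 ms measurement above.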
Combined Effect
| Optimization | Latency Reduction |
|---|---|
| Silero VAD (silence removal) | Significant (session-dependent) |
| audio_ctx 1500 → 750 | ~50% encoder speedup |
| CoreML acceleration | Baseline ~7× faster than real-time |
Together, these make real-time transcription on Apple Silicon Macs a genuinely viable workflow.
System Audio Capture on Mac: ScreenCaptureKit
This was the hardest part of the project. Capturing system audio on Mac — meaning the audio output from apps like Zoom, Google Meet, or a browser — is not straightforward.
The common approach is to use a virtual audio device like BlackHole, which routes system audio through a virtual loopback device. This works well for personal use but is awkward to bundle in a distributable app — it requires users to install a separate system-level audio driver.
Instead, I used Apple's official ScreenCaptureKit framework (introduced in macOS 12.3):
Architecture:
┌─────────────────────────────────┐
│ ScreenCaptureKit (Swift) │ ← Captures system audio + screen
│ ↓ │
│ Bridge layer │ ← Swift ↔ Rust FFI
│ ↓ │
│ Tauri (Rust) │ ← App logic
│ ↓ │
│ whisper.cpp │ ← Transcription engine
└─────────────────────────────────┘
ScreenCaptureKit is Apple's modern screen and audio capture API, available from macOS 12.3 onwards. It provides granular control over what to capture (specific apps, windows, or the whole screen) and supports audio capture without requiring virtual devices.
ScreenCaptureKit requires macOS 12.3 or later. Users on older macOS versions will not be able to capture system audio with this approach.
What This Enables
- Microphone input
- System audio output (meeting partner's voice, browser audio, etc.)
- Screen capture (for future features)
All three simultaneously, in a single app, with no third-party audio drivers needed.
The end result effectively behaves like a transcription-enabled meeting tool — both sides of a conversation are captured and transcribed locally, in real time.
Full Technical Stack
Speech Recognition: whisper.cpp
Model: Whisper large-v3-turbo (Q5_0 quantization)
Hardware Accel: CoreML (Apple Neural Engine)
VAD: Silero VAD
UI Framework: Tauri (Rust + Web frontend)
System Audio: ScreenCaptureKit (Swift, bridged to Rust)
Why Tauri instead of Electron?
| | Tauri | Electron |
|---|---|---|
| Binary size | Small (~5–20 MB) | Large (100+ MB) |
| Runtime | Native WebView | Bundled Chromium |
| Language | Rust | Node.js/JS |
| Build speed | Fast | Slower |
| Native API access | Via Rust + plugins | Via Node.js |
For a local desktop app that needs to call Swift APIs, bridge to C++ (whisper.cpp), and ship a lean binary, Tauri was the natural choice.
Development Timeline
Total development time: approximately 3 days for a working proof-of-concept.
AI-assisted coding (Claude + Codex) was used throughout — primarily for the Rust/Swift bridge boilerplate, the Silero VAD integration, and the Tauri plugin scaffolding. The core architecture decisions were manual, but implementation speed was significantly accelerated by AI assistance.
This is the kind of project that would have taken 2–3 weeks to prototype solo just a few years ago.
Local Whisper vs. Cloud Transcription Services
| Criterion | Local Whisper | Cloud-based services |
|---|---|---|
| Offline operation | ✅ Always | ❌ Requires internet |
| Subscription cost | Free | $10–30/month typical |
| Data privacy | Data never leaves device | Audio sent to cloud |
| Meeting partner audio | ✅ System audio capture | △ Bot-based only |
| Open source | ✅ Fully auditable | ❌ Closed |
| Setup complexity | Moderate | Low |
| Accuracy | High (large-v3-turbo) | High |
Cloud transcription services are convenient but involve uploading your audio — which may contain confidential information — to third-party servers. For sensitive meetings, local transcription is significantly safer from a data privacy standpoint.
Planned Features
- Local LLM integration — automatic meeting summaries generated on-device
- Real-time translation — live transcription + translation overlay
- Simultaneous display — show original and translated text side by side
The long-term goal is a fully local "transcribe and translate in real time during meetings" workflow, with no cloud dependency at any stage.
How to Get Started
GitHub: https://github.com/y-dai20/local-whisper
Prebuilt binaries and a demo video are available in the repository. The model weights (whisper large-v3-turbo Q5_0) need to be downloaded separately on first run due to file size.
Requirements:
- macOS 12.3 or later (for ScreenCaptureKit system audio)
- Apple Silicon Mac recommended (M1/M2/M3/M4) for CoreML acceleration
- Intel Mac supported but will be significantly slower
FAQ
Q: Does this work on Windows or Linux?
whisper.cpp itself runs on Windows and Linux, but the system audio capture component (ScreenCaptureKit) is macOS-only. You can use Local Whisper on Windows/Linux with microphone input only, but system audio capture will not be available. CoreML acceleration is also macOS/Apple Silicon exclusive.
Q: How accurate is Whisper large-v3-turbo compared to large-v3?
In most transcription benchmarks, large-v3-turbo achieves accuracy within 1–3% WER (word error rate) of large-v3 while running approximately 3–4× faster. For conversational speech — the primary use case here — the difference is barely perceptible in practice.
Q: What does "Q5_0 quantization" mean, and does it hurt accuracy?
Q5_0 means the model weights are stored as 5-bit integers instead of 16-bit floats. For Whisper, 5-bit quantization has been extensively benchmarked and shows minimal degradation — generally less than 1–2% WER increase compared to fp16, with roughly 65% reduction in model size and meaningful speedup on CPU/Neural Engine paths.
Q: Can I use this with Zoom, Google Meet, Teams, etc.?
Yes. ScreenCaptureKit captures all system audio output, which includes audio from any app — Zoom, Google Meet, Teams, Slack huddles, browser-based meetings, etc. No bot, no API key, no integration required.
Q: Is this legal to use for meeting transcription?
That depends on your jurisdiction and your organization's policies. In many places, recording meetings requires consent from all participants. Local Whisper is a technical tool — ensuring appropriate consent and compliance with applicable laws is the user's responsibility.
Q: What if I don't have Apple Silicon? Can I still use it on an Intel Mac?
Yes, but CoreML acceleration requires Apple Silicon. On Intel Macs, whisper.cpp falls back to CPU inference, which will be significantly slower. Real-time transcription may not be practical on Intel hardware with the large-v3-turbo model. You might consider using a smaller model (e.g., medium or small) if you're on Intel.
Q: Does Silero VAD work well for non-English speech?
Silero VAD is language-agnostic — it detects voice activity regardless of language. It does not perform any speech recognition, only silence/speech classification. So yes, it works equally well for Japanese, French, German, or any other language.
Q: How is this different from running Whisper through the OpenAI API?
The OpenAI Whisper API sends your audio to OpenAI's servers. Local Whisper runs entirely on your machine — nothing leaves your device at any point. It also supports real-time transcription (the API is batch-only), and costs nothing beyond the initial setup.
Summary
If you're on an Apple Silicon Mac and want offline, private, free transcription:
- whisper.cpp + CoreML gives you near-real-time inference without a GPU
- Silero VAD removes silence before it reaches the encoder, cutting accumulated latency
- Reducing audio_ctx from 1500 to 750 halves encoder time with negligible quality loss
- ScreenCaptureKit solves the hardest Mac problem: capturing system audio without virtual drivers
The result is a transcription tool that handles both sides of a conversation, runs entirely offline, and is built entirely on open-source components you can inspect and modify.
For anyone who values privacy, wants to avoid recurring costs, or simply needs reliable meeting transcription without cloud dependency — this stack works.