Free Offline Transcription App for Mac: How I Built Local Whisper with whisper.cpp + CoreML
TL;DR: I built a fully offline, free transcription app for Mac called Local Whisper that captures both microphone input and system audio (including your meeting partner's voice). It runs whisper.cpp with CoreML acceleration on Apple Silicon, achieving near-real-time transcription without sending any data to the cloud. This post covers the architecture, the real-time optimization tricks, and the system audio capture method that took the most effort to crack.
Are you looking for a transcription app on Mac that works completely offline and costs nothing?
- You want to transcribe both your voice and the other person's voice in web meetings
- You don't want to pay a monthly subscription just for occasional use
- You don't want your audio data uploaded to any cloud server
- You need something fully local, secure, and auditable
This article walks through Local Whisper — an open-source, fully offline transcription app I built for Mac — covering its technical design, the real-time optimization strategies, and the system audio capture approach that makes it possible.
GitHub: https://github.com/y-dai20/local-whisper
What Is Local Whisper?
Local Whisper is a Mac application that runs OpenAI's Whisper speech-to-text model entirely on your local machine — no internet connection required.
The name comes directly from what it does: it runs Whisper locally. While whisper.cpp can technically run on Windows as well, the app targets macOS, where Apple Silicon Macs are significantly faster thanks to CoreML support and the Neural Engine.
Feature Summary
| Feature | Status |
|---|---|
| Fully offline (no external data transmission) | ✅ |
| No subscription / completely free | ✅ |
| Microphone audio capture | ✅ |
| System audio capture (meeting partner's voice) | ✅ |
| Web meeting transcription without bots | ✅ |
| Open source and auditable | ✅ |
Why I Built This
The transcription app market is crowded, but almost every option has at least one of these problems:
- Monthly subscription fees — even for light, occasional use
- Mandatory cloud upload — your audio leaves your device
- Bot-based meeting capture — requires joining meetings as a separate participant
I didn't want to pay $10–20/month for something I'd use a handful of times per week. More importantly, I wasn't comfortable sending audio data — which often contains sensitive conversations — to an external server.
The goal was simple:
A transcription tool that is fully self-contained, based on open-source components, and where I can audit every part of the pipeline.
Core Engine: whisper.cpp + CoreML
For fully offline, high-accuracy transcription, I chose whisper.cpp — a C/C++ implementation of the Whisper architecture, highly optimized for local inference. It supports a wide range of quantization formats and hardware acceleration backends, including CoreML on Apple Silicon and CUDA on NVIDIA GPUs.
Model Configuration
Model: Whisper large-v3-turbo
Quantization: Q5_0
Acceleration: CoreML (Apple Neural Engine)
Why large-v3-turbo?
It hits the sweet spot between accuracy and speed. The turbo variant uses a smaller decoder while keeping the full encoder from large-v3, making it significantly faster than large-v3 while retaining most of its accuracy.
Why Q5_0 quantization?
Q5_0 is a 5-bit integer quantization format. Compared to full fp16, it:
- Reduces model size by roughly 65%
- Has minimal accuracy degradation for transcription tasks
- Fits comfortably in memory on M-series Macs
Quantization is a technique to reduce model size and speed up inference by representing weights with lower-precision numbers (e.g., 5-bit integers instead of 16-bit floats). It trades a small amount of accuracy for significant performance gains.
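The "roughly 65%" figure falls out of the Q5_0 block layout: ggml packs weights in blocks of 32, each storing a 2-byte fp16 scale, 4 bytes of high bits, and 16 bytes of packed 4-bit low nibbles — 22 bytes per 32 weights. A quick sketch of the arithmetic:

```rust
// ggml's Q5_0 format packs weights in blocks of 32:
// a 2-byte fp16 scale + 4 bytes of high bits + 16 bytes of 4-bit low nibbles.
fn q5_0_bits_per_weight() -> f64 {
    let block_bytes = 2.0 + 4.0 + 16.0; // 22 bytes per block
    block_bytes * 8.0 / 32.0 // = 5.5 bits per weight
}

fn main() {
    let q5 = q5_0_bits_per_weight();
    let reduction = 1.0 - q5 / 16.0; // versus 16-bit fp16 weights
    println!("Q5_0: {q5} bits/weight, {:.1}% smaller than fp16", reduction * 100.0);
}
```

5.5 bits per weight versus 16 bits for fp16 gives a ~65.6% size reduction, matching the figure above.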
Benchmark Result
With CoreML enabled on an Apple Silicon Mac:
~7 seconds of audio transcribed in approximately 1 second
That's a real-time factor (RTF) of roughly 0.14x — meaning the model runs about 7× faster than real-time. This makes real-time transcription genuinely practical on M-series hardware.
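For readers new to the metric, RTF is simply processing time divided by audio duration, so values below 1.0 mean faster than real time:

```rust
// Real-time factor: processing time divided by audio duration.
// RTF < 1.0 means the model runs faster than real time.
fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

fn main() {
    let rtf = real_time_factor(1.0, 7.0); // ~1 s to transcribe ~7 s of audio
    println!("RTF = {rtf:.2}, speedup = {:.1}x real time", 1.0 / rtf);
}
```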
Real-Time Optimization: Two Key Techniques
Feeding long audio directly to Whisper causes latency to accumulate over time. Even at 7× real-time speed, if you buffer 30 seconds before running inference, your output is always 30+ seconds behind. Two optimizations brought this down to near-real-time behavior.
Technique 1: Silent Segment Removal with Silero VAD
VAD (Voice Activity Detection) is the process of detecting which portions of an audio stream contain speech, and which contain silence or background noise. Silero VAD is a lightweight neural VAD model from Silero AI that runs efficiently on CPU.
Without VAD, Whisper processes silence just like speech. This causes:
- Unnecessary encoder computation on empty audio frames
- Cumulative latency growth over long sessions
- Degraded real-time performance
The pipeline in Local Whisper:
Audio Stream
    ↓
Split into small chunks (e.g., 512 samples)
    ↓
Silero VAD → Is this chunk speech?
    ├─ YES → Pass to Whisper → Transcribed text
    └─ NO  → Discard
Additionally, if silence continues for more than 1 second, the session is ended and the next utterance starts a fresh session. This prevents the context buffer from growing indefinitely.
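The gating and session logic above can be sketched as follows. This is a minimal illustration, not the app's actual code: a simple RMS-energy threshold stands in for Silero VAD (which is a neural classifier), and the constants follow the description above (512-sample chunks at 16 kHz, 1-second silence timeout).

```rust
const SAMPLE_RATE: usize = 16_000;
const CHUNK: usize = 512;            // samples per VAD decision
const SILENCE_LIMIT_SECS: f64 = 1.0; // end the session after 1 s of silence

// Stand-in detector: a real deployment would call Silero VAD here.
fn is_speech(chunk: &[f32]) -> bool {
    let rms = (chunk.iter().map(|s| s * s).sum::<f32>() / chunk.len() as f32).sqrt();
    rms > 0.01
}

fn process_stream(samples: &[f32], mut transcribe: impl FnMut(&[f32])) {
    let mut session: Vec<f32> = Vec::new();
    let mut silent_chunks = 0usize;
    let limit = (SILENCE_LIMIT_SECS * SAMPLE_RATE as f64 / CHUNK as f64) as usize;

    for chunk in samples.chunks(CHUNK) {
        if is_speech(chunk) {
            session.extend_from_slice(chunk); // speech → buffer for Whisper
            silent_chunks = 0;
        } else {
            silent_chunks += 1; // silence → discard, but count toward timeout
            if silent_chunks >= limit && !session.is_empty() {
                transcribe(&session); // 1 s of silence ends the session
                session.clear();
            }
        }
    }
    if !session.is_empty() {
        transcribe(&session); // flush whatever remains at end of stream
    }
}
```

Because silence is never appended to the session buffer, Whisper only ever sees speech, and the buffer is bounded by the length of a single utterance rather than the whole recording.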
Practical impact: Latency reduction is perceptible even in casual use — speech responses appear noticeably faster than without VAD.
Technique 2: Reducing audio_ctx to Cut Encoder Time in Half
Whisper's inference time is dominated by the encoder — the part that converts raw audio into a latent representation before decoding text.
Whisper uses a Transformer encoder-decoder architecture. The encoder processes the full audio spectrogram and is the computational bottleneck. The decoder generates tokens autoregressively but is typically much faster.
The default audio_ctx parameter is 1500, which corresponds to ~30 seconds of audio context. For real-time use where you're processing short chunks, this is wasteful.
Default: audio_ctx = 1500 (~30 seconds) → Encoder: ~1000ms
Optimized: audio_ctx = 750 (~15 seconds) → Encoder: ~500ms
Result: Encoder latency cut by approximately 50%.
For real-time transcription, the tradeoff is clear:
You don't need 30 seconds of audio context per chunk. You need low latency.
Reducing audio_ctx to 750 sacrifices some ability to handle very long unbroken utterances, but in practice — where sentences and thoughts are measured in seconds — this is rarely a constraint.
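The 1500 ↔ 30 s mapping comes from Whisper's frontend: the mel spectrogram uses a 10 ms hop, and the encoder's convolutional stack downsamples by 2×, so each encoder position covers 20 ms (50 positions per second). A small helper for picking audio_ctx from a target chunk length (in whisper.cpp this value would be assigned to the audio_ctx field of whisper_full_params):

```rust
// Each encoder position covers 20 ms of audio (10 ms mel hop,
// then 2x convolutional downsampling) → 50 positions per second.
fn audio_ctx_for_secs(secs: f64) -> u32 {
    const POSITIONS_PER_SEC: f64 = 50.0;
    (secs * POSITIONS_PER_SEC).ceil() as u32
}

fn main() {
    println!("30 s → audio_ctx {}", audio_ctx_for_secs(30.0)); // 1500 (full context)
    println!("15 s → audio_ctx {}", audio_ctx_for_secs(15.0)); // 750 (halved setting)
}
```

Since encoder cost scales with context length, halving audio_ctx roughly halves encoder time, which matches the ~1000 ms → ~500 ms measurement above.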
Combined Effect
| Optimization | Latency Reduction |
|---|---|
| Silero VAD (silence removal) | Significant (session-dependent) |
| audio_ctx 1500 → 750 | ~50% encoder speedup |
| CoreML acceleration | Baseline ~7× faster than real-time |
Together, these make real-time transcription on Apple Silicon Macs a genuinely viable workflow.
System Audio Capture on Mac: ScreenCaptureKit
This was the hardest part of the project. Capturing system audio on Mac — meaning the audio output from apps like Zoom, Google Meet, or a browser — is not straightforward.
The common approach is to use a virtual audio device like BlackHole, which routes system audio through a virtual loopback device. This works well for personal use but is awkward to bundle in a distributable app — it requires users to install a separate system-level audio driver.
Instead, I used Apple's official ScreenCaptureKit framework (introduced in macOS 12.3):
Architecture:
┌─────────────────────────────────┐
│ ScreenCaptureKit (Swift) │ ← Captures system audio + screen
│ ↓ │
│ Bridge layer │ ← Swift ↔ Rust FFI
│ ↓ │
│ Tauri (Rust) │ ← App logic
│ ↓ │
│ whisper.cpp │ ← Transcription engine
└─────────────────────────────────┘
ScreenCaptureKit is Apple's modern screen and audio capture API, available from macOS 12.3 onwards. It provides granular control over what to capture (specific apps, windows, or the whole screen) and supports audio capture without requiring virtual devices.
ScreenCaptureKit requires macOS 12.3 or later. Users on older macOS versions will not be able to capture system audio with this approach.
What This Enables
- Microphone input
- System audio output (meeting partner's voice, browser audio, etc.)
- Screen capture (for future features)
All three simultaneously, in a single app, with no third-party audio drivers needed.
The end result effectively behaves like a transcription-enabled meeting tool — both sides of a conversation are captured and transcribed locally, in real time.
Full Technical Stack
Speech Recognition: whisper.cpp
Model: Whisper large-v3-turbo (Q5_0 quantization)
Hardware Accel: CoreML (Apple Neural Engine)
VAD: Silero VAD
UI Framework: Tauri (Rust + Web frontend)
System Audio: ScreenCaptureKit (Swift, bridged to Rust)
Why Tauri instead of Electron?
| | Tauri | Electron |
|---|---|---|
| Binary size | Small (~5–20 MB) | Large (100+ MB) |
| Runtime | Native WebView | Bundled Chromium |
| Language | Rust | Node.js/JS |
| Build speed | Fast | Slower |
| Native API access | Via Rust + plugins | Via Node.js |
For a local desktop app that needs to call Swift APIs, bridge to C++ (whisper.cpp), and ship a lean binary, Tauri was the natural choice.
Development Timeline
Total development time: approximately 3 days for a working proof-of-concept.
AI-assisted coding (Claude + Codex) was used throughout — primarily for the Rust/Swift bridge boilerplate, the Silero VAD integration, and the Tauri plugin scaffolding. The core architecture decisions were manual, but implementation speed was significantly accelerated by AI assistance.
This is the kind of project that would have taken 2–3 weeks to prototype solo just a few years ago.
Local Whisper vs. Cloud Transcription Services
| Criterion | Local Whisper | Cloud-based services |
|---|---|---|
| Offline operation | ✅ Always | ❌ Requires internet |
| Subscription cost | Free | $10–30/month typical |
| Data privacy | Data never leaves device | Audio sent to cloud |
| Meeting partner audio | ✅ System audio capture | △ Bot-based only |
| Open source | ✅ Fully auditable | ❌ Closed |
| Setup complexity | Moderate | Low |
| Accuracy | High (large-v3-turbo) | High |
Cloud transcription services are convenient but involve uploading your audio — which may contain confidential information — to third-party servers. For sensitive meetings, local transcription is significantly safer from a data privacy standpoint.
Planned Features
- Local LLM integration — automatic meeting summaries generated on-device
- Real-time translation — live transcription + translation overlay
- Simultaneous display — show original and translated text side by side
The long-term goal is a fully local "transcribe and translate in real time during meetings" workflow, with no cloud dependency at any stage.
How to Get Started
GitHub: https://github.com/y-dai20/local-whisper
Prebuilt binaries and a demo video are available in the repository. The model weights (whisper large-v3-turbo Q5_0) need to be downloaded separately on first run due to file size.
Requirements:
- macOS 12.3 or later (for ScreenCaptureKit system audio)
- Apple Silicon Mac recommended (M1/M2/M3/M4) for CoreML acceleration
- Intel Mac supported but will be significantly slower
FAQ
Q: Does this work on Windows or Linux?
whisper.cpp itself runs on Windows and Linux, but the system audio capture component (ScreenCaptureKit) is macOS-only. You can use Local Whisper on Windows/Linux with microphone input only, but system audio capture will not be available. CoreML acceleration is also macOS/Apple Silicon exclusive.
Q: How accurate is Whisper large-v3-turbo compared to large-v3?
In most transcription benchmarks, large-v3-turbo achieves accuracy within 1–3% WER (word error rate) of large-v3 while running approximately 3–4× faster. For conversational speech — the primary use case here — the difference is barely perceptible in practice.
Q: What does "Q5_0 quantization" mean, and does it hurt accuracy?
Q5_0 means the model weights are stored as 5-bit integers instead of 16-bit floats. For Whisper, 5-bit quantization has been extensively benchmarked and shows minimal degradation — generally less than 1–2% WER increase compared to fp16, with roughly 65% reduction in model size and meaningful speedup on CPU/Neural Engine paths.
Q: Can I use this with Zoom, Google Meet, Teams, etc.?
Yes. ScreenCaptureKit captures all system audio output, which includes audio from any app — Zoom, Google Meet, Teams, Slack huddles, browser-based meetings, etc. No bot, no API key, no integration required.
Q: Is this legal to use for meeting transcription?
That depends on your jurisdiction and your organization's policies. In many places, recording meetings requires consent from all participants. Local Whisper is a technical tool — ensuring appropriate consent and compliance with applicable laws is the user's responsibility.
Q: What if I don't have Apple Silicon? Can I still use it on an Intel Mac?
Yes, but CoreML acceleration requires Apple Silicon. On Intel Macs, whisper.cpp falls back to CPU inference, which will be significantly slower. Real-time transcription may not be practical on Intel hardware with the large-v3-turbo model. You might consider using a smaller model (e.g., medium or small) if you're on Intel.
Q: Does Silero VAD work well for non-English speech?
Silero VAD is language-agnostic — it detects voice activity regardless of language. It does not perform any speech recognition, only silence/speech classification. So yes, it works equally well for Japanese, French, German, or any other language.
Q: How is this different from running Whisper through the OpenAI API?
The OpenAI Whisper API sends your audio to OpenAI's servers. Local Whisper runs entirely on your machine — nothing leaves your device at any point. It also supports real-time transcription (the API is batch-only), and costs nothing beyond the initial setup.
Summary
If you're on an Apple Silicon Mac and want offline, private, free transcription:
- whisper.cpp + CoreML gives you near-real-time inference without a GPU
- Silero VAD removes silence before it reaches the encoder, cutting accumulated latency
- Reducing audio_ctx from 1500 to 750 halves encoder time with negligible quality loss
- ScreenCaptureKit solves the hardest Mac problem: capturing system audio without virtual drivers
The result is a transcription tool that handles both sides of a conversation, runs entirely offline, and is built entirely on open-source components you can inspect and modify.
For anyone who values privacy, wants to avoid recurring costs, or simply needs reliable meeting transcription without cloud dependency — this stack works.