Free Offline Transcription App for Mac: How I Built Local Whisper with whisper.cpp + CoreML

TL;DR: I built a fully offline, free transcription app for Mac called Local Whisper that captures both microphone input and system audio (including your meeting partner's voice). It runs whisper.cpp with CoreML acceleration on Apple Silicon, achieving near-real-time transcription without sending any data to the cloud. This post covers the architecture, the real-time optimization tricks, and the system audio capture method that took the most effort to crack.


Are you looking for a transcription app on Mac that works completely offline and costs nothing?

  • You want to transcribe both your voice and the other person's voice in web meetings
  • You don't want to pay a monthly subscription just for occasional use
  • You don't want your audio data uploaded to any cloud server
  • You need something fully local, secure, and auditable

This article walks through Local Whisper — an open-source, fully offline transcription app I built for Mac — covering its technical design, the real-time optimization strategies, and the system audio capture approach that makes it possible.

GitHub: https://github.com/y-dai20/local-whisper


What Is Local Whisper?

Local Whisper is a Mac application that runs OpenAI's Whisper speech-to-text model entirely on your local machine — no internet connection required.

The name comes directly from what it does: it runs Whisper locally. The engine can technically run on Windows as well, but macOS, especially on Apple Silicon, is significantly faster thanks to CoreML support and the Neural Engine.

Feature Summary

Feature                                          Status
Fully offline (no external data transmission)    ✅
No subscription / completely free                ✅
Microphone audio capture                         ✅
System audio capture (meeting partner's voice)   ✅
Web meeting transcription without bots           ✅
Open source and auditable                        ✅

Why I Built This

The transcription app market is crowded, but almost every option has at least one of these problems:

  • Monthly subscription fees — even for light, occasional use
  • Mandatory cloud upload — your audio leaves your device
  • Bot-based meeting capture — requires joining meetings as a separate participant

I didn't want to pay $10–20/month for something I'd use a handful of times per week. More importantly, I wasn't comfortable sending audio data — which often contains sensitive conversations — to an external server.

The goal was simple:

A transcription tool that is fully self-contained, based on open-source components, and where I can audit every part of the pipeline.


Core Engine: whisper.cpp + CoreML

For fully offline, high-accuracy transcription, I chose whisper.cpp — a C/C++ port of OpenAI's Whisper model that is highly optimized for local inference.

The project supports a wide range of quantization formats and hardware acceleration backends, including CoreML on Apple Silicon and CUDA on NVIDIA GPUs.

Model Configuration

```text
Model:         Whisper large-v3-turbo
Quantization:  Q5_0
Acceleration:  CoreML (Apple Neural Engine)
```

Why large-v3-turbo?
It hits the sweet spot between accuracy and speed. The turbo variant uses a smaller decoder while keeping the full encoder from large-v3, making it significantly faster than large-v3 while retaining most of its accuracy.

Why Q5_0 quantization?
Q5_0 is a 5-bit integer quantization format. Compared to full fp16, it:

  • Reduces model size by roughly 65%
  • Has minimal accuracy degradation for transcription tasks
  • Fits comfortably in memory on M-series Macs

Quantization is a technique to reduce model size and speed up inference by representing weights with lower-precision numbers (e.g., 5-bit integers instead of 16-bit floats). It trades a small amount of accuracy for significant performance gains.
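The "roughly 65%" figure can be sanity-checked from the storage layout. A back-of-the-envelope sketch in Python, assuming ggml's Q5_0 block format (32 weights per block, 5 bits per weight, plus one shared fp16 scale per block):

```python
# Back-of-the-envelope size math for Q5_0 versus fp16 weights,
# assuming ggml's Q5_0 block layout: 32 weights per block, 5 bits
# per weight, plus one shared fp16 scale per block.
BLOCK = 32
bits_per_block = BLOCK * 5 + 16                    # weights + fp16 scale
q5_0_bits_per_weight = bits_per_block / BLOCK      # 5.5 bits/weight
fp16_bits_per_weight = 16

reduction = 1 - q5_0_bits_per_weight / fp16_bits_per_weight
print(f"Q5_0 uses {q5_0_bits_per_weight} bits/weight, "
      f"~{reduction:.0%} smaller than fp16")       # ~66%, i.e. "roughly 65%"
```

The scale overhead is why Q5_0 lands at 5.5 effective bits per weight rather than a flat 5.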

Benchmark Result

With CoreML enabled on an Apple Silicon Mac:

~7 seconds of audio transcribed in approximately 1 second

That's a real-time factor (RTF) of roughly 0.14x — meaning the model runs about 7× faster than real-time. This makes real-time transcription genuinely practical on M-series hardware.
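For reference, the real-time factor is simply processing time divided by audio duration; anything below 1.0 keeps pace with live audio. A quick check of the numbers above:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF below 1.0 means transcription keeps up with live audio."""
    return processing_s / audio_s

rtf = real_time_factor(1.0, 7.0)   # ~1 s to transcribe ~7 s of audio
print(f"RTF = {rtf:.2f}, about {1 / rtf:.0f}x faster than real-time")
```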


Real-Time Optimization: Two Key Techniques

Feeding long audio directly to Whisper causes latency to accumulate over time. Even at 7× real-time speed, if you buffer 30 seconds before running inference, your output is always 30+ seconds behind. Two optimizations brought this down to near-real-time behavior.


Technique 1: Silent Segment Removal with Silero VAD

VAD (Voice Activity Detection) is the process of detecting which portions of an audio stream contain speech, and which contain silence or background noise. Silero VAD is a lightweight neural VAD model from Silero AI that runs efficiently on CPU.

Without VAD, Whisper processes silence just like speech. This causes:

  • Unnecessary encoder computation on empty audio frames
  • Cumulative latency growth over long sessions
  • Degraded real-time performance

The pipeline in Local Whisper:

```text
Audio Stream
    ↓
Split into small chunks (e.g., 512 samples)
    ↓
Silero VAD → Is this chunk speech?
    ↓ (YES)         ↓ (NO)
Pass to Whisper    Discard
    ↓
Transcribed text
```

Additionally, if silence continues for more than 1 second, the session is ended and the next utterance starts a fresh session. This prevents the context buffer from growing indefinitely.

Practical impact: the latency reduction is perceptible even in casual use; transcribed text appears noticeably sooner than it does without VAD.
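The gating loop above can be sketched as follows. This is an illustrative Python sketch, not the app's actual Rust code; `is_speech` stands in for the Silero VAD model, and the chunk size and 1-second timeout come from the pipeline description:

```python
from typing import Callable, List

SAMPLE_RATE = 16_000          # Whisper expects 16 kHz input
CHUNK = 512                   # samples per VAD decision (as in the pipeline)
SILENCE_LIMIT_S = 1.0         # end the session after 1 s of continuous silence

def gate_chunks(chunks: List[bytes],
                is_speech: Callable[[bytes], bool]) -> List[List[bytes]]:
    """Group speech chunks into sessions, dropping silent chunks.

    A session is closed once SILENCE_LIMIT_S of continuous silence has
    passed, which keeps the transcription context from growing forever.
    """
    sessions: List[List[bytes]] = []
    current: List[bytes] = []
    silent_s = 0.0
    for chunk in chunks:
        if is_speech(chunk):
            silent_s = 0.0
            current.append(chunk)              # forward to whisper.cpp
        else:
            silent_s += CHUNK / SAMPLE_RATE    # discard, but track the gap
            if silent_s >= SILENCE_LIMIT_S and current:
                sessions.append(current)       # close the current session
                current = []
    if current:
        sessions.append(current)
    return sessions
```

Each returned session would be fed to the transcription engine independently, so silence never reaches the encoder and no session's context buffer grows without bound.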


Technique 2: Reducing audio_ctx to Cut Encoder Time in Half

Whisper's inference time is dominated by the encoder — the part that converts raw audio into a latent representation before decoding text.

Whisper uses a Transformer encoder-decoder architecture. The encoder processes the full audio spectrogram and is the computational bottleneck. The decoder generates tokens autoregressively but is typically much faster.

The default audio_ctx parameter is 1500, which corresponds to ~30 seconds of audio context. For real-time use where you're processing short chunks, this is wasteful.

```text
Default:   audio_ctx = 1500  (~30 seconds)  → Encoder: ~1000ms
Optimized: audio_ctx = 750   (~15 seconds)  → Encoder: ~500ms
```

Result: Encoder latency cut by approximately 50%.

For real-time transcription, the tradeoff is clear:

You don't need 30 seconds of audio context per chunk. You need low latency.

Reducing audio_ctx to 750 sacrifices some ability to handle very long unbroken utterances, but in practice — where sentences and thoughts are measured in seconds — this is rarely a constraint.
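The seconds figures above follow from Whisper's encoder geometry: 1500 encoder positions cover a 30-second window, i.e., 20 ms of audio per position. A small Python check under that assumption:

```python
# Whisper's encoder maps a 30-second window onto 1500 positions,
# i.e., 20 ms of audio per encoder position.
FULL_WINDOW_S = 30.0
FULL_AUDIO_CTX = 1500

def audio_ctx_window(audio_ctx: int) -> float:
    """Seconds of audio covered by a given encoder context length."""
    return audio_ctx * FULL_WINDOW_S / FULL_AUDIO_CTX

print(audio_ctx_window(1500))  # 30.0 (the default)
print(audio_ctx_window(750))   # 15.0 (the setting used here)
```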


Combined Effect

Optimization                    Latency Reduction
Silero VAD (silence removal)    Significant (session-dependent)
audio_ctx 1500 → 750            ~50% encoder speedup
CoreML acceleration             Baseline ~7× real-time speed

Together, these make real-time transcription on Apple Silicon Macs a genuinely viable workflow.


System Audio Capture on Mac: ScreenCaptureKit

This was the hardest part of the project. Capturing system audio on Mac — meaning the audio output from apps like Zoom, Google Meet, or a browser — is not straightforward.

The common approach is to use a virtual audio device like BlackHole, which routes system audio through a virtual loopback device. This works well for personal use but is awkward to bundle in a distributable app, since it requires users to install a separate audio driver.

Instead, I used Apple's official ScreenCaptureKit framework (introduced in macOS 12.3):

```text
Architecture:
┌─────────────────────────────────┐
│  ScreenCaptureKit (Swift)       │  ← Captures system audio + screen
│    ↓                            │
│  Bridge layer                   │  ← Swift ↔ Rust FFI
│    ↓                            │
│  Tauri (Rust)                   │  ← App logic
│    ↓                            │
│  whisper.cpp                    │  ← Transcription engine
└─────────────────────────────────┘
```

ScreenCaptureKit is Apple's modern screen and audio capture API, available from macOS 12.3 onwards. It provides granular control over what to capture (specific apps, windows, or the whole screen) and supports audio capture without requiring virtual devices.

On macOS versions older than 12.3, system audio capture with this approach is simply not available.

What This Enables

  • Microphone input
  • System audio output (meeting partner's voice, browser audio, etc.)
  • Screen capture (for future features)

All three simultaneously, in a single app, with no third-party audio drivers needed.

The end result effectively behaves like a transcription-enabled meeting tool — both sides of a conversation are captured and transcribed locally, in real time.


Full Technical Stack

```text
Speech Recognition:  whisper.cpp
Model:               Whisper large-v3-turbo (Q5_0 quantization)
Hardware Accel:      CoreML (Apple Neural Engine)
VAD:                 Silero VAD
UI Framework:        Tauri (Rust + Web frontend)
System Audio:        ScreenCaptureKit (Swift, bridged to Rust)
```

Why Tauri instead of Electron?

                     Tauri                 Electron
Binary size          Small (~5–20 MB)      Large (100+ MB)
Runtime              Native WebView        Bundled Chromium
Language             Rust                  Node.js/JS
Build speed          Fast                  Slower
Native API access    Via Rust + plugins    Via Node.js

For a local desktop app that needs to call Swift APIs, bridge to C++ (whisper.cpp), and ship a lean binary, Tauri was the natural choice.


Development Timeline

Total development time: approximately 3 days for a working proof-of-concept.

AI-assisted coding (Claude + Codex) was used throughout — primarily for the Rust/Swift bridge boilerplate, the Silero VAD integration, and the Tauri plugin scaffolding. The core architecture decisions were manual, but implementation speed was significantly accelerated by AI assistance.

This is the kind of project that would have taken 2–3 weeks to prototype solo just a few years ago.


Local Whisper vs. Cloud Transcription Services

Criterion               Local Whisper               Cloud-based services
Offline operation       ✅ Always                   ❌ Requires internet
Subscription cost       Free                        $10–30/month typical
Data privacy            Data never leaves device    Audio sent to cloud
Meeting partner audio   ✅ System audio capture     △ Bot-based only
Open source             ✅ Fully auditable          ❌ Closed
Setup complexity        Moderate                    Low
Accuracy                High (large-v3-turbo)       High

Cloud transcription services are convenient but involve uploading your audio — which may contain confidential information — to third-party servers. For sensitive meetings, local transcription is significantly safer from a data privacy standpoint.


Planned Features

  • Local LLM integration — automatic meeting summaries generated on-device
  • Real-time translation — live transcription + translation overlay
  • Simultaneous display — show original and translated text side by side

The long-term goal is a fully local "transcribe and translate in real time during meetings" workflow, with no cloud dependency at any stage.


How to Get Started

GitHub: https://github.com/y-dai20/local-whisper

Prebuilt binaries and a demo video are available in the repository. The model weights (whisper large-v3-turbo Q5_0) need to be downloaded separately on first run due to file size.

Requirements:

  • macOS 12.3 or later (for ScreenCaptureKit system audio)
  • Apple Silicon Mac recommended (M1/M2/M3/M4) for CoreML acceleration
  • Intel Mac supported but will be significantly slower

FAQ

Q: Does this work on Windows or Linux?

whisper.cpp itself runs on Windows and Linux, but the system audio capture component (ScreenCaptureKit) is macOS-only. You can use Local Whisper on Windows/Linux with microphone input only, but system audio capture will not be available. CoreML acceleration is also macOS/Apple Silicon exclusive.

Q: How accurate is Whisper large-v3-turbo compared to large-v3?

In most transcription benchmarks, large-v3-turbo achieves accuracy within 1–3% WER (word error rate) of large-v3 while running approximately 3–4× faster. For conversational speech — the primary use case here — the difference is barely perceptible in practice.

Q: What does "Q5_0 quantization" mean, and does it hurt accuracy?

Q5_0 means the model weights are stored as 5-bit integers instead of 16-bit floats. For Whisper, 5-bit quantization has been extensively benchmarked and shows minimal degradation — generally less than 1–2% WER increase compared to fp16, with roughly 65% reduction in model size and meaningful speedup on CPU/Neural Engine paths.

Q: Can I use this with Zoom, Google Meet, Teams, etc.?

Yes. ScreenCaptureKit captures all system audio output, which includes audio from any app — Zoom, Google Meet, Teams, Slack huddles, browser-based meetings, etc. No bot, no API key, no integration required.

Q: Is it legal to record and transcribe meetings?

That depends on your jurisdiction and your organization's policies. In many places, recording meetings requires consent from all participants. Local Whisper is a technical tool — ensuring appropriate consent and compliance with applicable laws is the user's responsibility.

Q: What if I don't have Apple Silicon? Can I still use it on an Intel Mac?

Yes, but CoreML acceleration requires Apple Silicon. On Intel Macs, whisper.cpp falls back to CPU inference, which will be significantly slower. Real-time transcription may not be practical on Intel hardware with the large-v3-turbo model. You might consider using a smaller model (e.g., medium or small) if you're on Intel.

Q: Does Silero VAD work well for non-English speech?

Silero VAD is language-agnostic — it detects voice activity regardless of language. It does not perform any speech recognition, only silence/speech classification. So yes, it works equally well for Japanese, French, German, or any other language.

Q: How is this different from running Whisper through the OpenAI API?

The OpenAI Whisper API sends your audio to OpenAI's servers. Local Whisper runs entirely on your machine — nothing leaves your device at any point. It also supports real-time transcription (the API is batch-only), and costs nothing beyond the initial setup.


Summary

If you're on an Apple Silicon Mac and want offline, private, free transcription:

  1. whisper.cpp + CoreML gives you near-real-time inference without a GPU
  2. Silero VAD removes silence before it reaches the encoder, cutting accumulated latency
  3. Reducing audio_ctx from 1500 to 750 halves encoder time with negligible quality loss
  4. ScreenCaptureKit solves the hardest Mac problem: capturing system audio without virtual drivers

The result is a transcription tool that handles both sides of a conversation, runs entirely offline, and is built entirely on open-source components you can inspect and modify.

For anyone who values privacy, wants to avoid recurring costs, or simply needs reliable meeting transcription without cloud dependency — this stack works.

GitHub: https://github.com/y-dai20/local-whisper
