I Tested Zero-Shot Voice Cloning with Emotion Control in OpenVoice — 8 Styles from a 14-Second Reference
TL;DR
OpenVoice V1 is a zero-shot voice cloning library that extracts a speaker's tone color from as little as 14 seconds of reference audio, then synthesizes speech in 8 emotional styles: whispering, shouting, excited, cheerful, terrified, angry, sad, and friendly. This post covers the code I actually ran, what I heard, and the gotchas I hit along the way.
What Is OpenVoice?
OpenVoice is an open-source voice cloning framework developed by MyShell. Its key design decision is separating tone color (voice identity) and style control into independent modules.
| Component | Role |
|---|---|
| BaseSpeakerTTS | Generates speech with a chosen emotional style |
| ToneColorConverter | Extracts and transfers voice tone color from the reference audio |
| se_extractor | Derives a speaker embedding vector from the reference audio |
The mental model is: the base model owns the emotion; the reference audio owns the voice. This means you can freely swap emotions regardless of how the reference speaker originally sounded.
What does "zero-shot" mean?
Adapting a pre-trained model to an unseen speaker without any additional fine-tuning. Because only a short reference clip is needed, data collection cost is essentially zero.
Experiment Setup
- Reference audio: 14-second English clip, single speaker, neutral tone
- Synthesis text: "This audio is generated by OpenVoice." (default)
- Styles tested: whispering / shouting / excited / cheerful / terrified / angry / sad / friendly — all 8
- Environment: Python 3.10, PyTorch 2.x, CUDA-capable GPU (CPU also works)
Generated Samples
All clips below use the same text and the same speaker's voice — only the style changes.
- whispering
- friendly
- terrified
- shouting
- angry
- cheerful
- excited
- sad
Listening impressions: cheerful, friendly, whispering, and terrified struck a natural balance between preserving the reference voice and conveying the target emotion — all felt production-ready. shouting was the one outlier; it sounded slightly artificial compared to the others, with an exaggerated pressure that didn't quite land.
The Code
A CLI wrapper around OpenVoice V1. Pass --reference for the source audio and --style for the emotion preset.
```python
#!/usr/bin/env python3
import argparse
import os
import sys

import torch

# Make the OpenVoice repo importable when this script lives in scripts/.
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
    sys.path.insert(0, ROOT)

from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

ENGLISH_STYLES = [
    "default",
    "whispering",
    "shouting",
    "excited",
    "cheerful",
    "terrified",
    "angry",
    "sad",
    "friendly",
]

DEFAULT_TEXTS = {
    "English": "This audio is generated by OpenVoice.",
    "Chinese": "今天天气真好,我们一起出去吃饭吧。",  # "The weather is great today, let's go out to eat."
}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run OpenVoice V1 style-controlled voice cloning."
    )
    parser.add_argument(
        "--reference",
        required=True,
        help="Reference audio path used for tone color extraction.",
    )
    parser.add_argument(
        "--text", default=None, help="Text to synthesize before tone color conversion."
    )
    parser.add_argument(
        "--language",
        default="English",
        choices=["English", "Chinese"],
        help="Base speaker language used by OpenVoice V1.",
    )
    parser.add_argument(
        "--style",
        default="default",
        help="Style preset. English supports angry/sad/friendly/etc. Chinese supports default only.",
    )
    parser.add_argument(
        "--output", default="outputs/output_v1.wav", help="Output wav path."
    )
    parser.add_argument("--speed", type=float, default=1.0, help="TTS speed.")
    parser.add_argument(
        "--device", default=None, help="Torch device, e.g. cuda:0 or cpu."
    )
    parser.add_argument(
        "--tau",
        type=float,
        default=0.3,
        help="Tone color conversion strength used by the converter.",
    )
    return parser.parse_args()


def validate_args(args: argparse.Namespace) -> None:
    if args.language == "English":
        if args.style not in ENGLISH_STYLES:
            raise SystemExit(
                f"Unsupported English style '{args.style}'. Available: {', '.join(ENGLISH_STYLES)}"
            )
    elif args.style != "default":
        raise SystemExit("Chinese in OpenVoice V1 supports only --style default.")


def main() -> None:
    args = parse_args()
    validate_args(args)

    device = args.device or ("cuda:0" if torch.cuda.is_available() else "cpu")
    text = args.text or DEFAULT_TEXTS[args.language]
    ckpt_base = (
        "checkpoints/base_speakers/EN"
        if args.language == "English"
        else "checkpoints/base_speakers/ZH"
    )
    ckpt_converter = "checkpoints/converter"

    base_speaker_tts = BaseSpeakerTTS(f"{ckpt_base}/config.json", device=device)
    base_speaker_tts.load_ckpt(f"{ckpt_base}/checkpoint.pth")
    tone_color_converter = ToneColorConverter(
        f"{ckpt_converter}/config.json", device=device
    )
    tone_color_converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

    # Select the source speaker embedding based on style.
    if args.language == "English":
        se_name = "en_style_se.pth" if args.style != "default" else "en_default_se.pth"
    else:
        se_name = "zh_default_se.pth"
    source_se = torch.load(f"{ckpt_base}/{se_name}", map_location=device).to(device)

    # Extract the target speaker embedding from the reference audio.
    target_se, _ = se_extractor.get_se(
        args.reference, tone_color_converter, target_dir="processed_v1", vad=True
    )

    out_dir = os.path.dirname(args.output) or "."
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, "tmp_openvoice_v1.wav")

    # Step 1: synthesize styled speech in the base speaker's voice.
    base_speaker_tts.tts(
        text, src_path, speaker=args.style, language=args.language, speed=args.speed
    )

    # Step 2: transfer the reference speaker's tone color onto the styled audio.
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=args.output,
        tau=args.tau,
        message="@MyShell",
    )
    print(args.output)


if __name__ == "__main__":
    main()
```
Setup
```bash
# 1. Clone the repo
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice

# 2. Install dependencies
pip install -e .

# 3. Download checkpoints (see official README)
#    Place them under checkpoints/base_speakers/EN/ and checkpoints/converter/

# 4. Run a single style
python scripts/run_v1.py \
  --reference path/to/reference.wav \
  --style angry \
  --output outputs/angry_output.wav
```
Batch-generate all styles
```bash
for style in whispering shouting excited cheerful terrified angry sad friendly; do
  python scripts/run_v1.py \
    --reference reference.wav \
    --style "$style" \
    --output "outputs/${style}.wav"
done
```
How It Works Under the Hood
OpenVoice operates in three distinct steps:
```
[1] BaseSpeakerTTS
    Text → styled audio (in the base model's voice / source_se)
        ↓
[2] se_extractor
    Reference audio → target_se (speaker embedding vector)
        ↓
[3] ToneColorConverter
    source_se × target_se → final audio (reference voice × target emotion)
```
The key insight is that ToneColorConverter performs voice conversion in the spectral domain rather than at the phoneme level. This makes the pipeline largely independent of reference audio length or content — even a 14-second clip yields a stable embedding.
Enabling vad=True in se_extractor.get_se() strips silence before computing the embedding, which meaningfully improves quality when the reference audio has gaps or background noise.
Intermediate files are written to target_dir="processed_v1". If you run multiple reference audios without changing this path, cached embeddings can collide. Use a separate directory per reference clip.
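One way to avoid those collisions is to derive the cache directory from the reference clip itself. A minimal sketch — `derive_target_dir` is my own helper, not part of OpenVoice:

```python
import hashlib
import os


def derive_target_dir(reference_path: str, root: str = "processed_v1") -> str:
    """Build a per-reference cache directory: <root>/<clip stem>_<path hash>.

    Hashing the absolute path keeps two clips with the same filename
    (e.g. two different reference.wav files) from sharing a cache dir.
    """
    digest = hashlib.sha1(os.path.abspath(reference_path).encode("utf-8")).hexdigest()[:10]
    stem = os.path.splitext(os.path.basename(reference_path))[0]
    return os.path.join(root, f"{stem}_{digest}")
```

Pass the result as `target_dir` to `se_extractor.get_se(...)` instead of the fixed `"processed_v1"` string.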
BaseSpeakerTTS in OpenVoice V1 supports English and Chinese only. Passing Japanese text will not produce correct pronunciation. For Japanese, use OpenVoice V2 or a different model entirely.
FAQ
Q. How long does the reference audio need to be?
There is no official minimum, but in practice 10–20 seconds gives a stable embedding. Below 5 seconds, voice identity reproduction noticeably degrades.
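Since a too-short clip fails quietly (you just get a worse clone), it can be worth checking the duration up front. A small sketch using Python's standard `wave` module — the 10-second floor is my own heuristic from the numbers above, not an OpenVoice requirement:

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM wav file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def check_reference(path: str, min_seconds: float = 10.0) -> float:
    """Raise if the reference clip is shorter than the chosen floor."""
    dur = wav_duration_seconds(path)
    if dur < min_seconds:
        raise ValueError(
            f"Reference clip is {dur:.1f}s; below ~{min_seconds:.0f}s, "
            "voice identity reproduction tends to degrade."
        )
    return dur
```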
Q. What if my reference audio has background music?
vad=True removes silence but does not separate music from speech. Run the audio through a source separation tool like Demucs first to isolate the vocal track.
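A sketch of that preprocessing step, assuming the `demucs` CLI is installed (`pip install demucs`) and uses its default `htdemucs` model — adjust the model folder name if yours differs:

```python
import os


def demucs_command(input_path: str, out_dir: str = "separated") -> list[str]:
    """Build the Demucs CLI invocation for two-stem (vocals/other) separation."""
    return ["demucs", "--two-stems=vocals", "-o", out_dir, input_path]


def expected_vocals_path(input_path: str, out_dir: str = "separated") -> str:
    """Where Demucs (default htdemucs model) writes the isolated vocal track."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(out_dir, "htdemucs", stem, "vocals.wav")


# Usage (requires demucs installed):
#   subprocess.run(demucs_command("reference.wav"), check=True)
# then pass expected_vocals_path("reference.wav") to the script as --reference.
```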
Q. Can I use this commercially?
OpenVoice is MIT-licensed. That said, if you publish or distribute generated audio, you are separately responsible for ensuring you have the right to clone the source speaker's voice — MIT covers the code, not the ethical and legal questions around the audio.
Q. When should I use V1 vs V2?
V1 is the better choice when you need clear, named emotion styles in English or Chinese. V2 adds multilingual support and cross-lingual cloning, but the emotional expressiveness per style is less distinct. If Japanese or other languages are a requirement, V2 is the practical path.
Q. How do I tune the tau parameter?
tau controls the strength of tone color conversion. Lower values stay closer to the reference voice; higher values apply a stronger transfer. Start at the default 0.3 and explore 0.1–0.5 to find the balance between voice fidelity and style intensity that works for your use case.
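Since tau is best picked by ear, rendering one output per value and comparing them is the fastest route. A minimal sketch — `sweep_tau` is my own helper, and `convert_fn` stands in for a call to `tone_color_converter.convert` with everything fixed except `tau` and `output_path`:

```python
import os
from typing import Callable


def sweep_tau(
    convert_fn: Callable[[float, str], None],
    out_dir: str = "outputs/tau_sweep",
    taus: tuple[float, ...] = (0.1, 0.2, 0.3, 0.4, 0.5),
) -> list[str]:
    """Render one converted clip per tau value so they can be compared by ear."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for tau in taus:
        path = os.path.join(out_dir, f"tau_{tau:.1f}.wav")
        convert_fn(tau, path)
        paths.append(path)
    return paths


# Usage with the objects from the script above:
# sweep_tau(lambda tau, path: tone_color_converter.convert(
#     audio_src_path=src_path, src_se=source_se, tgt_se=target_se,
#     output_path=path, tau=tau, message="@MyShell"))
```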
Q. Is real-time generation possible?
Not with V1 as-is — the pipeline is batch-oriented and does not support streaming output. For latency-sensitive applications, look at Fish Speech or CosyVoice, both of which offer streaming modes.
Wrap-Up
OpenVoice V1 delivers surprisingly capable zero-shot voice cloning with emotion control from a short reference clip. Across 8 styles tested with a 14-second source, most outputs — especially cheerful, friendly, and whispering — were natural enough for real-world use. Only shouting fell slightly short.
A few things worth remembering when you set it up: manage checkpoint paths carefully relative to your working directory; pass reference audio of at least 10 seconds with VAD enabled; tune --tau to balance voice identity against emotional intensity; and reach for V2 or another model if you need Japanese support.
If you're evaluating options for voice cloning or expressive TTS, OpenVoice V1 is a solid first stop.