I Tested Zero-Shot Voice Cloning with Emotion Control in OpenVoice — 8 Styles from a 14-Second Reference
TL;DR
OpenVoice V1 is a zero-shot voice cloning library that extracts a speaker's tone color from as little as 14 seconds of reference audio, then synthesizes speech in 8 emotional styles: whispering, shouting, excited, cheerful, terrified, angry, sad, and friendly. This post covers the code I actually ran, what I heard, and the gotchas I hit along the way.
What Is OpenVoice?
OpenVoice is an open-source voice cloning framework developed by MyShell. Its key design decision is separating tone color (voice identity) and style control into independent modules.
| Component | Role |
|---|---|
| BaseSpeakerTTS | Generates speech with a chosen emotional style |
| ToneColorConverter | Extracts and transfers voice tone color from a reference audio |
| se_extractor | Derives a speaker embedding vector from the reference audio |
The mental model is: the base model owns the emotion; the reference audio owns the voice. This means you can freely swap emotions regardless of how the reference speaker originally sounded.
What does "zero-shot" mean?
Adapting a pre-trained model to an unseen speaker without any additional fine-tuning. Because only a short reference clip is needed, data collection cost is essentially zero.
Experiment Setup
- Reference audio: 14-second English clip, single speaker, neutral tone
- Synthesis text: "This audio is generated by OpenVoice." (the default)
- Styles tested: whispering, shouting, excited, cheerful, terrified, angry, sad, friendly (all 8)
- Environment: Python 3.10, PyTorch 2.x, CUDA-capable GPU (CPU also works)
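Before running the pipeline, it can be worth sanity-checking the reference clip's duration and format. A minimal helper using only the standard-library `wave` module (so it handles plain PCM WAV only; the file path is a placeholder):

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of a PCM WAV file: duration, sample rate, channels."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "duration_sec": frames / rate,
            "sample_rate": rate,
            "channels": wf.getnchannels(),
        }


# Usage (hypothetical path):
#   info = describe_wav("reference.wav")
#   if info["duration_sec"] < 10:
#       print("Reference may be too short for a stable embedding.")
```

Ten to twenty seconds of clean, single-speaker speech is the sweet spot in my testing; anything much shorter is worth flagging before you spend GPU time on it.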
Generated Samples
All clips below use the same text and the same speaker's voice — only the style changes.
- whispering
- friendly
- terrified
- shouting
- angry
- cheerful
- excited
- sad
Listening impressions: cheerful, friendly, whispering, and terrified struck a natural balance between preserving the reference voice and conveying the target emotion — all felt production-ready. shouting was the one outlier; it sounded slightly artificial compared to the others, with an exaggerated pressure that didn't quite land.
The Code
A CLI wrapper around OpenVoice V1. Pass --reference for the source audio and --style for the emotion preset.
```python
#!/usr/bin/env python3
import argparse
import os
import sys

import torch

# Make the OpenVoice repo importable when this script lives in scripts/
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
    sys.path.insert(0, ROOT)

from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

ENGLISH_STYLES = [
    "default",
    "whispering",
    "shouting",
    "excited",
    "cheerful",
    "terrified",
    "angry",
    "sad",
    "friendly",
]

DEFAULT_TEXTS = {
    "English": "This audio is generated by OpenVoice.",
    "Chinese": "今天天气真好,我们一起出去吃饭吧。",
}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run OpenVoice V1 style-controlled voice cloning."
    )
    parser.add_argument(
        "--reference",
        required=True,
        help="Reference audio path used for tone color extraction.",
    )
    parser.add_argument(
        "--text", default=None, help="Text to synthesize before tone color conversion."
    )
    parser.add_argument(
        "--language",
        default="English",
        choices=["English", "Chinese"],
        help="Base speaker language used by OpenVoice V1.",
    )
    parser.add_argument(
        "--style",
        default="default",
        help="Style preset. English supports angry/sad/friendly/etc. Chinese supports default only.",
    )
    parser.add_argument(
        "--output", default="outputs/output_v1.wav", help="Output wav path."
    )
    parser.add_argument("--speed", type=float, default=1.0, help="TTS speed.")
    parser.add_argument(
        "--device", default=None, help="Torch device, e.g. cuda:0 or cpu."
    )
    parser.add_argument(
        "--tau",
        type=float,
        default=0.3,
        help="Tone color conversion strength used by the converter.",
    )
    return parser.parse_args()


def validate_args(args: argparse.Namespace) -> None:
    if args.language == "English":
        if args.style not in ENGLISH_STYLES:
            raise SystemExit(
                f"Unsupported English style '{args.style}'. "
                f"Available: {', '.join(ENGLISH_STYLES)}"
            )
    elif args.style != "default":
        raise SystemExit("Chinese in OpenVoice V1 supports only --style default.")


def main() -> None:
    args = parse_args()
    validate_args(args)

    device = args.device or ("cuda:0" if torch.cuda.is_available() else "cpu")
    text = args.text or DEFAULT_TEXTS[args.language]
    ckpt_base = (
        "checkpoints/base_speakers/EN"
        if args.language == "English"
        else "checkpoints/base_speakers/ZH"
    )
    ckpt_converter = "checkpoints/converter"

    base_speaker_tts = BaseSpeakerTTS(f"{ckpt_base}/config.json", device=device)
    base_speaker_tts.load_ckpt(f"{ckpt_base}/checkpoint.pth")
    tone_color_converter = ToneColorConverter(
        f"{ckpt_converter}/config.json", device=device
    )
    tone_color_converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

    # Select the source speaker embedding based on style
    if args.language == "English":
        se_name = "en_style_se.pth" if args.style != "default" else "en_default_se.pth"
    else:
        se_name = "zh_default_se.pth"
    source_se = torch.load(f"{ckpt_base}/{se_name}", map_location=device).to(device)

    # Extract the target speaker embedding from the reference audio
    target_se, _ = se_extractor.get_se(
        args.reference, tone_color_converter, target_dir="processed_v1", vad=True
    )

    out_dir = os.path.dirname(args.output) or "."
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, "tmp_openvoice_v1.wav")

    # Step 1: synthesize styled speech in the base speaker's voice
    base_speaker_tts.tts(
        text, src_path, speaker=args.style, language=args.language, speed=args.speed
    )
    # Step 2: transfer the reference speaker's tone color onto it
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=args.output,
        tau=args.tau,
        message="@MyShell",
    )
    print(args.output)


if __name__ == "__main__":
    main()
```
Setup

```bash
# 1. Clone the repo
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice

# 2. Install dependencies
pip install -e .

# 3. Download checkpoints (see official README)
#    Place them under checkpoints/base_speakers/EN/ and checkpoints/converter/

# 4. Run a single style
python scripts/run_v1.py \
  --reference path/to/reference.wav \
  --style angry \
  --output outputs/angry_output.wav
```
Batch-generate all styles

```bash
for style in whispering shouting excited cheerful terrified angry sad friendly; do
  python scripts/run_v1.py \
    --reference reference.wav \
    --style "$style" \
    --output "outputs/${style}.wav"
done
```
How It Works Under the Hood
OpenVoice operates in three distinct steps:
```text
[1] BaseSpeakerTTS
    Text → styled audio (in the base model's voice / source_se)
        ↓
[2] se_extractor
    Reference audio → target_se (speaker embedding vector)
        ↓
[3] ToneColorConverter
    Styled audio + (source_se → target_se) → final audio (reference voice × target emotion)
```
The key insight is that ToneColorConverter performs voice conversion in the spectral domain rather than at the phoneme level. This makes the pipeline largely independent of reference audio length or content — even a 14-second clip yields a stable embedding.
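To build intuition for what the speaker embedding is, it helps to remember it is just a vector, and two clips of the same speaker should land close together. The sketch below uses toy lists in place of real `se_extractor.get_se()` output, with plain cosine similarity as the distance measure:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for speaker embeddings; real ones come from se_extractor.get_se()
same_speaker_a = [0.9, 0.1, 0.4]
same_speaker_b = [0.85, 0.15, 0.38]
other_speaker = [-0.2, 0.9, 0.1]

print(cosine_similarity(same_speaker_a, same_speaker_b))  # close to 1.0
print(cosine_similarity(same_speaker_a, other_speaker))   # noticeably lower
```

The same comparison on real `target_se` tensors is a quick sanity check that your reference clips are yielding a stable identity.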
Enabling vad=True in se_extractor.get_se() strips silence before computing the embedding, which meaningfully improves quality when the reference audio has gaps or background noise.
Intermediate files are written to target_dir="processed_v1". If you run multiple reference audios without changing this path, cached embeddings can collide. Use a separate directory per reference clip.
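One way to avoid those collisions is to derive the processed directory from the reference file's content hash. A small sketch (the `processed_dir_for` helper and the commented call site are my additions, mirroring the `get_se` call in the script above):

```python
import hashlib
import os


def processed_dir_for(reference_path: str, root: str = "processed_v1") -> str:
    """Derive a per-reference cache directory from the audio file's content hash."""
    with open(reference_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    return os.path.join(root, digest)


# Hypothetical call site in the pipeline:
#   target_se, _ = se_extractor.get_se(
#       args.reference, tone_color_converter,
#       target_dir=processed_dir_for(args.reference), vad=True,
#   )
```

Hashing the content (rather than the filename) also means re-running the same clip reuses its cache, while a different recording under the same name gets a fresh directory.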
BaseSpeakerTTS in OpenVoice V1 supports English and Chinese only. Passing Japanese text will not produce correct pronunciation. For Japanese, use OpenVoice V2 or a different model entirely.
FAQ
Q. How long does the reference audio need to be?
There is no official minimum, but in practice 10–20 seconds gives a stable embedding. Below 5 seconds, voice identity reproduction noticeably degrades.
Q. What if my reference audio has background music?
vad=True removes silence but does not separate music from speech. Run the audio through a source separation tool like Demucs first to isolate the vocal track.
Q. Can I use this commercially?
OpenVoice is MIT-licensed. That said, if you publish or distribute generated audio, you are separately responsible for ensuring you have the right to clone the source speaker's voice — MIT covers the code, not the ethical and legal questions around the audio.
Q. When should I use V1 vs V2?
V1 is the better choice when you need clear, named emotion styles in English or Chinese. V2 adds multilingual support and cross-lingual cloning, but the emotional expressiveness per style is less distinct. If Japanese or other languages are a requirement, V2 is the practical path.
Q. How do I tune the tau parameter?
tau controls the strength of tone color conversion. Lower values stay closer to the reference voice; higher values apply a stronger transfer. Start at the default 0.3 and explore 0.1–0.5 to find the balance between voice fidelity and style intensity that works for your use case.
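To explore that range systematically, it is easy to script a sweep. The sketch below only builds the command lines (assuming the `scripts/run_v1.py` wrapper from this post) so you can review them before running anything:

```python
def tau_sweep_commands(reference: str, style: str, taus: list[float]) -> list[str]:
    """Build one run_v1.py invocation per tau value for side-by-side listening."""
    return [
        f"python scripts/run_v1.py --reference {reference} "
        f"--style {style} --tau {tau} --output outputs/{style}_tau{tau}.wav"
        for tau in taus
    ]


for cmd in tau_sweep_commands("reference.wav", "cheerful", [0.1, 0.2, 0.3, 0.4, 0.5]):
    print(cmd)
```

Listening to the five outputs back to back makes the fidelity-versus-intensity trade-off much easier to judge than tweaking one value at a time.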
Q. Is real-time generation possible?
Not with V1 as-is — the pipeline is batch-oriented and does not support streaming output. For latency-sensitive applications, look at Fish Speech or CosyVoice, both of which offer streaming modes.
Wrap-Up
OpenVoice V1 delivers surprisingly capable zero-shot voice cloning with emotion control from a short reference clip. Across 8 styles tested with a 14-second source, most outputs — especially cheerful, friendly, and whispering — were natural enough for real-world use. Only shouting fell slightly short.
A few things worth remembering when you set it up: manage checkpoint paths carefully relative to your working directory; pass reference audio of at least 10 seconds with VAD enabled; tune --tau to balance voice identity against emotional intensity; and reach for V2 or another model if you need Japanese support.
If you're evaluating options for voice cloning or expressive TTS, OpenVoice V1 is a solid first stop.
Contact
If you are considering a new project, product development, or other collaboration, please get in touch.
Related Articles
Explore more articles connected to this topic.
Open-Source TTS Model Showdown 2026: Hands-On Testing of Qwen3-TTS, Chatterbox, Fish Speech, CosyVoice & IndexTTS2
A hands-on comparison of five open-source TTS models, with a focus on commercial license pitfalls in Fish Speech and IndexTTS2 — covering features, licenses, and working code all in one place.
Read article →
Are WER and CER Not Enough? Re-evaluating Japanese ASR with Noun and Proper Noun F1 Scores
Japanese speech recognition is more challenging than many other languages due to the wide variety of kanji and proper noun representations. This primary benchmark article implements a morphological-analysis-based F1 score evaluation using Sudachi to quantify "kanji and proper noun recognition accuracy" that WER and CER struggle to capture, and re-evaluates 9 open-source ASR models.
Read article →