I Tested Zero-Shot Voice Cloning with Emotion Control in OpenVoice — 8 Styles from a 14-Second Reference

TL;DR
OpenVoice V1 is a zero-shot voice cloning library that extracts a speaker's tone color from as little as 14 seconds of reference audio, then synthesizes speech in 8 emotional styles: whispering, shouting, excited, cheerful, terrified, angry, sad, and friendly. This post covers the code I actually ran, what I heard, and the gotchas I hit along the way.


What Is OpenVoice?

OpenVoice is an open-source voice cloning framework developed by MyShell. Its key design decision is separating tone color (voice identity) and style control into independent modules.

Component          | Role
-------------------|--------------------------------------------------------------
BaseSpeakerTTS     | Generates speech with a chosen emotional style
ToneColorConverter | Extracts and transfers voice tone color from a reference audio
se_extractor       | Derives a speaker embedding vector from the reference audio

The mental model is: the base model owns the emotion; the reference audio owns the voice. This means you can freely swap emotions regardless of how the reference speaker originally sounded.

What does "zero-shot" mean?
Adapting a pre-trained model to an unseen speaker without any additional fine-tuning. Because only a short reference clip is needed, data collection cost is essentially zero.


Experiment Setup

  • Reference audio: 14-second English clip, single speaker, neutral tone
  • Synthesis text: "This audio is generated by OpenVoice." (default)
  • Styles tested: whispering / shouting / excited / cheerful / terrified / angry / sad / friendly — all 8
  • Environment: Python 3.10, PyTorch 2.x, CUDA-capable GPU (CPU also works)

Reference audio: (14-second clip; audio player embedded in the original post)


Generated Samples

All clips below use the same text and the same speaker's voice — only the style changes.

  • whispering

  • friendly

  • terrified

  • shouting

  • angry

  • cheerful

  • excited

  • sad

Listening impressions: cheerful, friendly, whispering, and terrified struck a natural balance between preserving the reference voice and conveying the target emotion — all felt production-ready. shouting was the one outlier; it sounded slightly artificial compared to the others, with an exaggerated pressure that didn't quite land.


The Code

A CLI wrapper around OpenVoice V1. Pass --reference for the source audio and --style for the emotion preset.

python
#!/usr/bin/env python3
import argparse
import os
import sys

import torch

ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
    sys.path.insert(0, ROOT)

from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter


ENGLISH_STYLES = [
    "default",
    "whispering",
    "shouting",
    "excited",
    "cheerful",
    "terrified",
    "angry",
    "sad",
    "friendly",
]

DEFAULT_TEXTS = {
    "English": "This audio is generated by OpenVoice.",
    "Chinese": "今天天气真好,我们一起出去吃饭吧。",
}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run OpenVoice V1 style-controlled voice cloning."
    )
    parser.add_argument(
        "--reference",
        required=True,
        help="Reference audio path used for tone color extraction.",
    )
    parser.add_argument(
        "--text", default=None, help="Text to synthesize before tone color conversion."
    )
    parser.add_argument(
        "--language",
        default="English",
        choices=["English", "Chinese"],
        help="Base speaker language used by OpenVoice V1.",
    )
    parser.add_argument(
        "--style",
        default="default",
        help="Style preset. English supports angry/sad/friendly/etc. Chinese supports default only.",
    )
    parser.add_argument(
        "--output", default="outputs/output_v1.wav", help="Output wav path."
    )
    parser.add_argument("--speed", type=float, default=1.0, help="TTS speed.")
    parser.add_argument(
        "--device", default=None, help="Torch device, e.g. cuda:0 or cpu."
    )
    parser.add_argument(
        "--tau",
        type=float,
        default=0.3,
        help="Tone color conversion strength used by the converter.",
    )
    return parser.parse_args()


def validate_args(args: argparse.Namespace) -> None:
    if args.language == "English":
        if args.style not in ENGLISH_STYLES:
            raise SystemExit(
                f"Unsupported English style '{args.style}'. Available: {', '.join(ENGLISH_STYLES)}"
            )
    elif args.style != "default":
        raise SystemExit("Chinese in OpenVoice V1 supports only --style default.")


def main() -> None:
    args = parse_args()
    validate_args(args)

    device = args.device or ("cuda:0" if torch.cuda.is_available() else "cpu")
    text = args.text or DEFAULT_TEXTS[args.language]

    ckpt_base = (
        "checkpoints/base_speakers/EN"
        if args.language == "English"
        else "checkpoints/base_speakers/ZH"
    )
    ckpt_converter = "checkpoints/converter"

    base_speaker_tts = BaseSpeakerTTS(f"{ckpt_base}/config.json", device=device)
    base_speaker_tts.load_ckpt(f"{ckpt_base}/checkpoint.pth")

    tone_color_converter = ToneColorConverter(
        f"{ckpt_converter}/config.json", device=device
    )
    tone_color_converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

    # Select Speaker Embedding based on style
    if args.language == "English":
        se_name = "en_style_se.pth" if args.style != "default" else "en_default_se.pth"
    else:
        se_name = "zh_default_se.pth"

    source_se = torch.load(f"{ckpt_base}/{se_name}", map_location=device).to(device)
    target_se, _ = se_extractor.get_se(
        args.reference, tone_color_converter, target_dir="processed_v1", vad=True
    )

    out_dir = os.path.dirname(args.output) or "."
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, "tmp_openvoice_v1.wav")

    base_speaker_tts.tts(
        text, src_path, speaker=args.style, language=args.language, speed=args.speed
    )
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=args.output,
        tau=args.tau,
        message="@MyShell",
    )
    print(args.output)


if __name__ == "__main__":
    main()

Setup

bash
# 1. Clone the repo
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice

# 2. Install dependencies
pip install -e .

# 3. Download checkpoints (see official README)
# Place them under checkpoints/base_speakers/EN/ and checkpoints/converter/

# 4. Run a single style
python scripts/run_v1.py \
  --reference path/to/reference.wav \
  --style angry \
  --output outputs/angry_output.wav

Batch-generate all styles

bash
for style in whispering shouting excited cheerful terrified angry sad friendly; do
  python scripts/run_v1.py \
    --reference reference.wav \
    --style "$style" \
    --output "outputs/${style}.wav"
done

How It Works Under the Hood

OpenVoice operates in three distinct steps:

text
[1] BaseSpeakerTTS
    Text → styled audio (in the base model's voice / source_se)
         ↓
[2] se_extractor
    Reference audio → target_se (speaker embedding vector)
         ↓
[3] ToneColorConverter
    source_se × target_se → final audio (reference voice × target emotion)

The key insight is that ToneColorConverter performs voice conversion in the spectral domain rather than at the phoneme level. This makes the pipeline largely independent of reference audio length or content — even a 14-second clip yields a stable embedding.

Enabling vad=True in se_extractor.get_se() strips silence before computing the embedding, which meaningfully improves quality when the reference audio has gaps or background noise.

Intermediate files are written to target_dir="processed_v1". If you run multiple reference audios without changing this path, cached embeddings can collide. Use a separate directory per reference clip.
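One way to avoid those collisions is to derive the cache directory from the reference filename. Here is a minimal sketch (the `target_dir_for` helper is my own, not part of OpenVoice); pass its result as `target_dir=` to `se_extractor.get_se()`:

```python
from pathlib import Path


def target_dir_for(reference: str, root: str = "processed_v1") -> str:
    """Give each reference clip its own cache directory.

    e.g. 'clips/alice.wav' -> 'processed_v1/alice', so embeddings
    computed from different references never overwrite each other.
    """
    return str(Path(root) / Path(reference).stem)
```

In the script above, this would replace the fixed `target_dir="processed_v1"` argument with `target_dir=target_dir_for(args.reference)`.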

BaseSpeakerTTS in OpenVoice V1 supports English and Chinese only. Passing Japanese text will not produce correct pronunciation. For Japanese, use OpenVoice V2 or a different model entirely.


FAQ

Q. How long does the reference audio need to be?
There is no official minimum, but in practice 10–20 seconds gives a stable embedding. Below 5 seconds, voice identity reproduction noticeably degrades.
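A quick stdlib check (my own snippet, not part of OpenVoice; assumes a PCM WAV reference) makes it easy to confirm a clip clears the ~10-second mark before you spend time on extraction:

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def check_reference(path: str, minimum: float = 10.0) -> None:
    """Warn when a reference clip is shorter than the practical minimum."""
    duration = wav_duration_seconds(path)
    if duration < minimum:
        print(f"warning: {path} is {duration:.1f}s; aim for {minimum:.0f}s+")
```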

Q. What if my reference audio has background music?
vad=True removes silence but does not separate music from speech. Run the audio through a source separation tool like Demucs first to isolate the vocal track.
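As a sketch, Demucs's two-stem mode isolates the vocals; the output path below assumes Demucs's default htdemucs model, and `reference_with_music.wav` is a placeholder filename:

```shell
# Separate vocals from accompaniment, then clone from the vocal stem.
pip install demucs
demucs --two-stems=vocals reference_with_music.wav

# Demucs writes separated/htdemucs/reference_with_music/vocals.wav by default.
python scripts/run_v1.py \
  --reference separated/htdemucs/reference_with_music/vocals.wav \
  --style friendly \
  --output outputs/friendly.wav
```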

Q. Can I use this commercially?
OpenVoice is MIT-licensed. That said, if you publish or distribute generated audio, you are separately responsible for ensuring you have the right to clone the source speaker's voice — MIT covers the code, not the ethical and legal questions around the audio.

Q. When should I use V1 vs V2?
V1 is the better choice when you need clear, named emotion styles in English or Chinese. V2 adds multilingual support and cross-lingual cloning, but the emotional expressiveness per style is less distinct. If Japanese or other languages are a requirement, V2 is the practical path.

Q. How do I tune the tau parameter?
tau controls the strength of tone color conversion. Lower values stay closer to the reference voice; higher values apply a stronger transfer. Start at the default 0.3 and explore 0.1–0.5 to find the balance between voice fidelity and style intensity that works for your use case.
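A quick way to explore that range is a sweep in the same spirit as the batch script above, writing one output file per tau value for side-by-side listening:

```shell
for tau in 0.1 0.2 0.3 0.4 0.5; do
  python scripts/run_v1.py \
    --reference reference.wav \
    --style cheerful \
    --tau "$tau" \
    --output "outputs/cheerful_tau${tau}.wav"
done
```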

Q. Is real-time generation possible?
Not with V1 as-is — the pipeline is batch-oriented and does not support streaming output. For latency-sensitive applications, look at Fish Speech or CosyVoice, both of which offer streaming modes.


Wrap-Up

OpenVoice V1 delivers surprisingly capable zero-shot voice cloning with emotion control from a short reference clip. Across 8 styles tested with a 14-second source, most outputs — especially cheerful, friendly, and whispering — were natural enough for real-world use. Only shouting fell slightly short.

A few things worth remembering when you set it up: manage checkpoint paths carefully relative to your working directory; pass reference audio of at least 10 seconds with VAD enabled; tune --tau to balance voice identity against emotional intensity; and reach for V2 or another model if you need Japanese support.

If you're evaluating options for voice cloning or expressive TTS, OpenVoice V1 is a solid first stop.
