Open-Source TTS Model Showdown 2026: Hands-On Testing of Qwen3-TTS, Chatterbox, Fish Speech, CosyVoice & IndexTTS2

In this article, we compare five open-source TTS (text-to-speech) models gaining attention in 2026 across five dimensions — license, multilingual support, zero-shot voice cloning, emotion control, and latency — based on hands-on local testing. Working code samples are included so you can try them immediately.

TL;DR

  • Commercial projects + Japanese quality → Qwen3-TTS (Apache-2.0, Japanese support, emotion control)
  • Clean license + multilingual cloning → CosyVoice 3.0 (Apache-2.0, best-in-class cross-lingual cloning)
  • Video dubbing with duration matching → IndexTTS2 (duration control built-in, but read the license carefully)
  • English content + paralinguistic expressions (laughter, coughing, etc.) → Chatterbox Turbo
  • Best audio quality + widest language coverage → Fish Speech S2 Pro (complex license, beware)

What Are Open-Source TTS Models — and Why Do They Matter Now?

TTS (Text-to-Speech) is the technology that converts text into spoken audio. Cloud APIs like ElevenLabs and OpenAI TTS deliver high quality, but come with constraints around API costs, data privacy, and customizability.

In 2026, open-source TTS quality has reached near-parity with cloud APIs. The biggest shift is that the following capabilities are now achievable locally:

  • Zero-shot Voice Cloning: Mimic a speaker from just 3–10 seconds of reference audio — no fine-tuning required.
  • Emotion Control: Switch between anger, joy, sadness, and more using natural-language instructions.
  • Streaming Generation: Time-to-first-audio under 100ms. Real-time applications are now viable.
  • Duration Control: Automatically adjust speech speed to match video runtime.

The ability to run all of this locally — without depending on the cloud — is a genuine competitive advantage for use cases like video dubbing, virtual YouTubers, call center AI, and multilingual podcasting.


Detailed Model Comparisons

For voice cloning reference audio, we used a sample published by the Japanese Voice Actor Statistics Society.

Qwen3-TTS

GitHub: https://github.com/QwenLM/Qwen3-TTS

A TTS model released by Alibaba's Qwen team. The standout feature is applying the Qwen architecture — proven across LLMs — directly to TTS.

| Item | Details |
|---|---|
| Supported Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) + Chinese dialects |
| License | Apache-2.0 (both code and weights) |
| Model Size | 0.6B / 1.7B |
| Zero-shot Clone | Yes (approx. 3-second reference audio) |
| Emotion Control | Yes (natural language instructions) |
| Streaming | Yes (first audio in ~97ms) |

Three Modes to Know

Qwen3-TTS offers three modes for different use cases:

  • VoiceDesign: Describe the voice in text (e.g., "a calm, middle-aged male voice") to generate it.
  • CustomVoice: Choose from 9 preset voices. Delivers stable, consistent quality.
  • Base (Clone): Pass a reference audio for zero-shot voice cloning.

Voice cloning and emotion control cannot be used simultaneously. Emotion instructions are disabled in Base (clone) mode. Decide which to prioritize at design time.

Voice clone result:

Japanese Language Usability

Japanese is included among the 10 supported languages, and naturalness is high compared to other Chinese-origin models. That said, English and Chinese offer the most stable quality — for Japanese, using clone mode with a Japanese reference audio is recommended. Fine-tuning on Japanese speech data could improve results further.


Chatterbox (3 Variants)

GitHub: https://github.com/resemble-ai/chatterbox

An MIT-licensed TTS model released by Resemble AI. Primarily English-focused, though the Multilingual variant supports 23 languages.

| Variant | Size | Languages | Key Features |
|---|---|---|---|
| Original | 0.5B | English | Emotion exaggeration control |
| Multilingual | 0.5B | 23 languages | Emotion exaggeration + multilingual cloning |
| Turbo | 350M | English | Paralinguistic tags + ultra-low latency |

A Common Pitfall with Variant Selection

Paralinguistic tags like [laugh] and [cough] are exclusive to the Turbo variant. They do not exist in Multilingual. If you need both multilingual support and expressions like laughter or coughing, that's currently not possible in a single variant — you'll need to prioritize one or the other.

The MIT license is one of the most permissive options available, allowing commercial use and derivative model creation with minimal restrictions. It's worth considering for any project where license risk needs to be minimized.

Voice clone result:


Fish Speech

GitHub: https://github.com/fishaudio/fish-speech

A high-quality multilingual TTS developed by Fish Audio. S1 supports 13+ languages; the latest S2 Pro supports 80+ languages, giving it the widest language coverage of the five models compared here.

| Item | Details |
|---|---|
| Supported Languages | S1: 13+, S2 Pro: 80+ |
| License | Code: Fish Audio Research License; Weights: varies by model |
| Zero-shot Clone | Yes (short reference audio) |
| Emotion Control | Prosody / emotion / inline emotion tags |

The license is highly complex. Code and weights have different licenses, and some model versions use non-commercial licenses like CC-BY-NC-SA-4.0. You may see simplified summaries saying "Fish Speech is commercially OK," but always check the license of the specific model version you plan to use. Adopting it for a commercial project without verification carries legal risk.

In terms of audio quality, Fish Speech ranks among the top of the five models, and the inline emotion tags allow fine-grained emotion control. For non-commercial or research use, it's a top contender.

Note that running Fish Speech from the command line requires a two-step pipeline (semantic token generation → wav conversion), unlike the other models. See the code section below for details.

Voice clone result:


CosyVoice 3.0

GitHub: https://github.com/FunAudioLLM/CosyVoice

A TTS model from Alibaba's FunAudioLLM team. Cross-lingual cloning is its standout capability — perfect for use cases like "make a Chinese speaker's voice read Japanese."

| Item | Details |
|---|---|
| Supported Languages | 9 languages + 18+ Chinese dialects and accents |
| License | Apache-2.0 |
| Model Size | 0.5B |
| Zero-shot Clone | Yes (multilingual / cross-lingual) |
| Emotion Control | Emotion cloning evaluation set included |

Clear Apache-2.0 licensing and strong cross-lingual cloning performance make this a highly practical model for multilingual content production.

For emotion control, the primary mechanism is emotion cloning — inheriting the emotion from the reference audio — rather than explicit natural-language instructions. Think of it less as "read this angrily" and more as "match the tone of this angry reference clip." Setting the right expectation here will save you surprises during implementation.

Voice clone result:


IndexTTS2

GitHub: https://github.com/index-tts/index-tts

Released by a team at Bilibili. Its headline features are independent control of Emotion and Speaker Identity, and Duration Control for precise audio-to-video timing. This model is particularly well-suited for video dubbing workflows.

| Item | Details |
|---|---|
| Supported Languages | Primarily English and Chinese (cross-lingual claimed but details not disclosed) |
| License | Apache-2.0 (with additional restrictions) |
| Zero-shot Clone | Yes |
| Emotion Control | Independent emotion/speaker identity control via reference audio or natural language |
| Duration Control | Yes |

Caution regarding the license. Despite the Apache-2.0 label, the license text includes a restriction: "This model must not be used to improve other AI models." This is not part of standard Apache-2.0 and means it cannot be treated as a straightforward Apache-2.0 license. Pay particular attention if you plan to use it for fine-tuning data generation or model distillation.

Duration Control is an extremely valuable feature for video dubbing. By automatically adjusting speech speed to match the original clip's length, it minimizes lip-sync issues and subtitle drift.
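Independent of any specific API, duration matching boils down to computing a speed factor from the clip length. A minimal, model-agnostic sketch — the function name and the 15 characters-per-second baseline are illustrative assumptions, not part of IndexTTS2:

```python
def speed_factor_for_clip(text: str, clip_seconds: float, base_cps: float = 15.0) -> float:
    """Estimate the playback-speed multiplier needed to fit `text` into a clip.

    base_cps is an assumed average reading rate in characters per second;
    tune it per language and voice. A factor > 1.0 means "speak faster".
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    natural_seconds = len(text) / base_cps  # rough duration at normal speed
    return natural_seconds / clip_seconds

# 90 characters naturally take ~6s at 15 cps; to fit a 5s clip,
# the audio must be sped up by a factor of 1.2.
print(speed_factor_for_clip("x" * 90, 5.0))
```

A duration-controlled model does this internally; the point is that even a modest factor like 1.2 is audible, so leaving slack in the translated script helps as much as the model feature does.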

Voice clone result (note: output is in Chinese):


License Quick Reference

| Model | Code | Weights | Commercial Use |
|---|---|---|---|
| Qwen3-TTS | Apache-2.0 | Apache-2.0 | ✅ |
| Chatterbox | MIT | MIT | ✅ |
| Fish Speech | Fish Audio Research License | Varies by model (e.g., CC-BY-NC-SA-4.0) | ⚠️ Verify per model |
| CosyVoice 3.0 | Apache-2.0 | Apache-2.0 | ✅ |
| IndexTTS2 | Apache-2.0 (with restrictions) | Apache-2.0 (with restrictions) | ⚠️ Read carefully |

Feature Comparison Matrix

| Feature | Qwen3-TTS | Chatterbox | Fish Speech | CosyVoice 3.0 | IndexTTS2 |
|---|---|---|---|---|---|
| Languages | 10 | 23 (MTL) | 80+ (S2 Pro) | 9 | EN/ZH-centric |
| Zero-shot Clone | ✅ | ✅ | ✅ | ✅ | ✅ |
| Emotion Control (instruction-based) | ✅ | ✅ (exaggeration) | ✅ (inline tags) | △ (cloning-based) | ✅ (independent) |
| Streaming | ✅ (97ms) | — | — | — | — |
| Duration Control | — | — | — | — | ✅ |
| Paralinguistic Tags | — | ✅ (Turbo only) | ✅ (inline) | — | — |
| Model Size | 0.6B–1.7B | 0.35B–0.5B | 4B | 0.5B | ~0.5B |

(— = not available, or not verified in this comparison)

Verified Working Code

The following code samples have been tested and confirmed working. If you run into issues due to environment differences, refer to each repository's README.

Qwen3-TTS: Voice Clone

Install the qwen_tts package, then run the script below. Specifying --model Qwen/Qwen3-TTS-12Hz-0.6B-Base activates Clone mode.

python
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import importlib.util
from pathlib import Path

import soundfile as sf
import torch
from qwen_tts import Qwen3TTSModel


DEFAULT_REF_AUDIO = (
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
)
DEFAULT_REF_TEXT = (
    "Okay. Yeah. I resent you. I love you. I respect you. But you know what? "
    "You blew it! And thanks to you."
)


def resolve_device(device: str) -> str:
    if device == "cuda":
        return "cuda:0" if torch.cuda.is_available() else "cpu"
    if device.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return device


def resolve_dtype(device: str) -> torch.dtype:
    if not device.startswith("cuda"):
        return torch.float32
    bf16_supported = bool(
        hasattr(torch.cuda, "is_bf16_supported") and torch.cuda.is_bf16_supported()
    )
    return torch.bfloat16 if bf16_supported else torch.float16


def resolve_attention(device: str, requested: str) -> str:
    if requested != "auto":
        return requested
    if device.startswith("cuda") and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa" if device.startswith("cuda") else "eager"


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument(
        "--text",
        default="This is a Qwen3 voice clone test. Reading in a voice close to the reference audio.",
    )
    p.add_argument("--out", default="outputs/qwen3_voice_clone.wav")
    p.add_argument("--device", default="cuda")
    p.add_argument("--language", default="English")
    p.add_argument("--model", default="Qwen/Qwen3-TTS-12Hz-0.6B-Base")
    p.add_argument("--ref-audio", default=DEFAULT_REF_AUDIO)
    p.add_argument("--ref-text", default=DEFAULT_REF_TEXT)
    p.add_argument("--x-vector-only-mode", action="store_true")
    p.add_argument(
        "--attn-implementation",
        default="auto",
        choices=["auto", "flash_attention_2", "sdpa", "eager"],
    )
    args = p.parse_args()

    device = resolve_device(args.device)
    dtype = resolve_dtype(device)
    attn_implementation = resolve_attention(device, args.attn_implementation)

    if args.x_vector_only_mode:
        ref_text = None
    else:
        ref_text = args.ref_text.strip()
        if not ref_text:
            raise SystemExit("--ref-text is required unless --x-vector-only-mode is set")

    model = Qwen3TTSModel.from_pretrained(
        args.model,
        device_map=device,
        dtype=dtype,
        attn_implementation=attn_implementation,
    )
    wavs, sr = model.generate_voice_clone(
        text=args.text,
        language=args.language,
        ref_audio=args.ref_audio,
        ref_text=ref_text,
        x_vector_only_mode=args.x_vector_only_mode,
    )

    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    sf.write(out, wavs[0], sr)
    print(out)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Key Point: the --x-vector-only-mode flag

When this flag is set, the model extracts only the speaker style from the reference audio as an x-vector, without requiring a transcription (--ref-text). Useful when you don't have a transcript of the reference audio, but clone quality tends to be lower than standard mode.


Chatterbox: Japanese Clone with the Multilingual Variant

python
#!/usr/bin/env python3
from __future__ import annotations

import argparse
from pathlib import Path

import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from chatterbox.tts import ChatterboxTTS


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--text", default="Hello, this is a Chatterbox test.")
    p.add_argument("--out", default="outputs/chatterbox.wav")
    p.add_argument("--device", default="cuda")
    p.add_argument(
        "--variant", choices=["turbo", "multilingual"], default="multilingual"
    )
    p.add_argument("--language-id", default="en")
    p.add_argument("--audio-prompt-path", default=None)
    args = p.parse_args()

    device = (
        "cuda:0"
        if args.device.startswith("cuda") and torch.cuda.is_available()
        else "cpu"
    )

    if args.variant == "multilingual":
        model = ChatterboxMultilingualTTS.from_pretrained(device=device)
        wav = model.generate(
            args.text,
            language_id=args.language_id,
            audio_prompt_path=args.audio_prompt_path,
        )
    else:
        model = ChatterboxTTS.from_pretrained(device=device)
        wav = model.generate(args.text, audio_prompt_path=args.audio_prompt_path)

    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    ta.save(str(out), wav, model.sr)
    print(out)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Pass --language-id ja and Japanese text to --text for Japanese output. Pass a reference audio file path to --audio-prompt-path to enable clone mode.


Fish Speech: Two-Step Pipeline

Unlike other models, Fish Speech requires a two-step process: semantic token generation → wav conversion.

Setup

bash
sudo apt-get install -y portaudio19-dev libsox-dev ffmpeg build-essential
uv sync --python 3.12 --extra cu129
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro

Generate without reference audio

bash
# Step 1: Generate semantic tokens
uv run python fish_speech/models/text2semantic/inference.py \
  --text 'Hello. Testing Fish Speech command-line inference.' \
  --output-dir output/demo \
  --device cuda

# Step 2: Convert tokens to wav
uv run python fish_speech/models/dac/inference.py \
  -i output/demo/codes_0.npy \
  -o output/demo/generated.wav \
  --checkpoint-path checkpoints/s2-pro/codec.pth \
  -d cuda

Voice cloning with reference audio

Fish Speech cloning requires not just the reference audio, but also its transcript text.

bash
# 1. Transcribe reference audio with Whisper
python - <<'PY'
from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    device="cuda:0",
)
res = pipe("path/to/reference.wav")
print(res["text"])
PY

# 2. Generate clone semantic tokens
uv run python fish_speech/models/text2semantic/inference.py \
  --text 'Text you want to read aloud' \
  --prompt-text 'Transcript of the reference audio' \
  --prompt-audio path/to/reference.wav \
  --output-dir output/clone \
  --device cuda

# 3. Convert to wav
uv run python fish_speech/models/dac/inference.py \
  -i output/clone/codes_0.npy \
  -o output/clone/clone.wav \
  --checkpoint-path checkpoints/s2-pro/codec.pth \
  -d cuda

Auto-transcribing the reference audio with Whisper large-v3-turbo is the practical approach. The difference from a hand-written transcript is negligible.
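The two CLI steps above are easy to chain from Python. A sketch that assembles the exact argv lists shown in this section (paths and checkpoint locations mirror the commands above) so you can drive the pipeline with subprocess.run — the helper names are mine, not part of Fish Speech:

```python
import subprocess

CODEC = "checkpoints/s2-pro/codec.pth"

def t2s_cmd(text, out_dir, prompt_audio=None, prompt_text=None):
    """Step 1: semantic-token generation (optionally with a clone prompt)."""
    cmd = ["uv", "run", "python", "fish_speech/models/text2semantic/inference.py",
           "--text", text, "--output-dir", out_dir, "--device", "cuda"]
    if prompt_audio and prompt_text:
        cmd += ["--prompt-text", prompt_text, "--prompt-audio", prompt_audio]
    return cmd

def dac_cmd(out_dir):
    """Step 2: convert codes_0.npy in out_dir to a wav via the codec checkpoint."""
    return ["uv", "run", "python", "fish_speech/models/dac/inference.py",
            "-i", f"{out_dir}/codes_0.npy", "-o", f"{out_dir}/generated.wav",
            "--checkpoint-path", CODEC, "-d", "cuda"]

def run_pipeline(text, out_dir, prompt_audio=None, prompt_text=None):
    """Run both steps in order, aborting if step 1 fails."""
    subprocess.run(t2s_cmd(text, out_dir, prompt_audio, prompt_text), check=True)
    subprocess.run(dac_cmd(out_dir), check=True)
```

`check=True` makes step 2 run only if step 1 exited cleanly, which avoids converting a stale codes_0.npy from a previous run.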


CosyVoice 2: Zero-shot Cross-lingual Clone

python
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import sys
from pathlib import Path

import torchaudio


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--text", default="Hello. This is a CosyVoice test.")
    p.add_argument("--out", default="outputs/cosyvoice.wav")
    p.add_argument("--model", default="pretrained_models/CosyVoice2-0.5B")
    p.add_argument("--prompt-text", default="希望你以后能够做的比我还好呦。")
    p.add_argument("--prompt-wav", required=True)
    args = p.parse_args()

    vendor = Path(__file__).resolve().parents[1] / "vendors" / "cosyvoice"
    sys.path.insert(0, str(vendor))
    sys.path.insert(0, str(vendor / "third_party" / "Matcha-TTS"))
    from cosyvoice.cli.cosyvoice import AutoModel

    model = AutoModel(model_dir=args.model)
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    # stream=False yields the full utterance as a single chunk
    for chunk in model.inference_zero_shot(
        args.text, args.prompt_text, args.prompt_wav, stream=False
    ):
        torchaudio.save(str(out), chunk["tts_speech"], model.sample_rate)
        break
    print(out)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

IndexTTS2: Emotion Control + Duration Matching

python
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import sys
from pathlib import Path


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--text", default="Hello. This is an IndexTTS test.")
    p.add_argument("--out", default="outputs/index_tts.wav")
    p.add_argument("--model-dir", default="checkpoints")
    p.add_argument("--prompt-wav", required=True)
    p.add_argument("--device", default="cuda")
    p.add_argument("--emo", default="")   # comma-separated emotion vector (floats); empty = inherit from reference
    args = p.parse_args()

    vendor = Path(__file__).resolve().parents[1] / "vendors" / "index-tts"
    model_dir = Path(args.model_dir)
    if not model_dir.is_absolute():
        model_dir = (Path.cwd() / model_dir).resolve()
    cfg_path = model_dir / "config.yaml"

    sys.path.insert(0, str(vendor))
    sys.path.insert(0, str(vendor / "indextts"))
    from indextts.infer_v2 import IndexTTS2

    tts = IndexTTS2(
        model_dir=str(model_dir),
        cfg_path=str(cfg_path),
        use_fp16=args.device.startswith("cuda"),
        device=args.device,
    )
    out = Path(args.out)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Parse --emo into a list of floats; None lets the model inherit the
    # emotion from the reference audio instead.
    emo_vector = [float(x) for x in args.emo.split(",")] if args.emo else None
    tts.infer(
        spk_audio_prompt=args.prompt_wav,
        text=args.text,
        output_path=str(out),
        emo_vector=emo_vector,
    )
    print(out)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Specify the emotion via the --emo argument. Setting emo_vector=None automatically inherits the emotion from the reference audio.


Glossary

| Term | Explanation |
|---|---|
| Zero-shot Clone | Mimics a speaker from just a few seconds of reference audio, with no prior fine-tuning required |
| Emotion Exaggeration | A parameter that amplifies emotional intensity. 1.0 is neutral; 2.0 exaggerates joy, anger, sadness, etc. |
| Paralinguistic Tags | Tags like [laugh] and [cough] that instruct the model to produce non-verbal vocal expressions |
| Duration Control | Matches the generated audio length to a specified target duration. Valuable for video dubbing |
| Cross-lingual Cloning | Reproduces a speaker's voice in a language different from the reference audio |
| x-vector | A fixed-length feature vector representing a speaker's vocal characteristics, independent of speech content |
| Semantic Token | A compressed token sequence used in Fish Speech as an intermediate representation before wav conversion |

FAQ

Q. How much GPU VRAM is required?

Most models are 0.5B–1.7B, so roughly 4–8GB VRAM is sufficient to run them in fp16/bf16. Chatterbox Turbo (350M) is the lightest and runs comfortably on a 4GB GPU. Qwen3-TTS 1.7B in bf16 typically needs 8–12GB.
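The weights portion of that budget is simple arithmetic — parameters times bytes per parameter. A back-of-envelope sketch (the function name is mine):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone; fp16/bf16 = 2 bytes per parameter."""
    return round(params_billion * 1e9 * bytes_per_param / (1024 ** 3), 2)

print(weight_memory_gb(0.35))  # Chatterbox Turbo -> 0.65
print(weight_memory_gb(1.7))   # Qwen3-TTS 1.7B  -> 3.17
```

In practice, budget roughly 2-4x the weights figure for activations, the audio codec, and framework overhead — which is how a 0.35B-1.7B model ends up needing the 4-12GB quoted above.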

Q. Can these run on CPU only?

Technically yes, but not at a practical speed. Real-time generation (producing audio at least as fast as it plays back) generally requires a GPU. CPU is fine for quality evaluation and experimentation.
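"Real time" is usually quantified as the real-time factor (RTF): generation time divided by the duration of the audio produced, with values below 1.0 meaning faster than playback. A minimal sketch with hypothetical measurements:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 = faster than playback (viable for live applications)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds

# Hypothetical numbers: a GPU rendering 8s of speech in 2s runs at RTF 0.25;
# a CPU needing 40s for the same clip runs at RTF 5.0 (20x too slow for live use).
print(real_time_factor(2.0, 8.0), real_time_factor(40.0, 8.0))
```

To measure your own setup, wrap any of the generate() calls in this article with time.perf_counter() and divide by the output wav's duration.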

Q. Which model has the best Japanese quality?

Within the scope of this comparison, it comes down to Qwen3-TTS (official Japanese support) and Fish Speech S2 Pro (80-language coverage). That said, Japanese quality evaluation is highly dependent on the reference audio and input text — trying both on your actual use case will give you the most accurate answer.

Q. What's the ideal reference audio length?

It varies by model, but 3–10 seconds is the practical range. Too short and speaker characteristics are insufficient; too long yields diminishing returns. Clarity of the recording (no background noise, clean speech) matters more than length.
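Since overly long reference clips buy you nothing, trimming them before cloning is a cheap win. A sketch using only the stdlib wave module (the function name and file paths are placeholders; it assumes an uncompressed PCM wav):

```python
import wave

def trim_wav(src: str, dst: str, max_seconds: float = 10.0) -> int:
    """Copy at most max_seconds of audio from src to dst; returns frames kept.

    Picking the clearest 3-10s stretch of the recording matters more than
    feeding the model the whole file.
    """
    with wave.open(src, "rb") as r:
        keep = min(r.getnframes(), int(r.getframerate() * max_seconds))
        params = r.getparams()
        frames = r.readframes(keep)
    with wave.open(dst, "wb") as w:
        w.setparams(params._replace(nframes=keep))
        w.writeframes(frames)
    return keep

# Usage (paths are placeholders):
# trim_wav("reference_full.wav", "reference_10s.wav")
```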

Q. How do I use Qwen3-TTS VoiceDesign mode?

Use a CustomVoice or VoiceDesign mode model and pass a natural-language voice description to the text field — for example: "A calm, middle-aged male voice with a slight husky tone." This does not work with Clone mode models (the *-Base variants).

Q. Where can I verify the Fish Speech commercial license?

Check the LICENSE file directly on each model's Hugging Face card (e.g., fishaudio/fish-speech-1.5). CC-BY-NC-SA-4.0 (non-commercial) and Fish Audio Research License (commercially permissive with conditions) coexist across versions. The "version = license" relationship can change over time, so always refer to the latest license file on Hugging Face.
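The model card's declared license also lives in the YAML front matter at the top of the README. A minimal sketch that extracts it — the parser handles only the common single-line `license:` key, not full YAML, and the fetch shown in the comment requires network access plus the huggingface_hub package:

```python
def extract_license(readme_text: str):
    """Pull the `license:` key out of a model card's YAML front matter.

    Returns None if there is no front matter or no license key.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":          # end of front matter
            break
        if line.startswith("license:"):
            return line.split(":", 1)[1].strip()
    return None

# To check a real card (network required):
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download("fishaudio/fish-speech-1.5", "README.md")
#   print(extract_license(open(path, encoding="utf-8").read()))
print(extract_license("---\nlicense: cc-by-nc-sa-4.0\n---\n# Fish Speech"))
```

Note this is only the self-declared metadata field — the LICENSE file itself remains the authoritative text, so read it too.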


Summary

Open-source TTS reached a sufficient quality bar in 2026 to serve as a genuine alternative to cloud APIs. However, there are significant differences across models on three axes — quality, license, and features — so clarifying your use case and license requirements before selecting a model is essential.

In particular, Fish Speech and IndexTTS2 have licensing pitfalls that require careful reading before any commercial adoption. If you want high quality with a clean license, starting with Qwen3-TTS (Apache-2.0) or CosyVoice 3.0 (Apache-2.0) is the safest choice at this point in time.
