Open-Source TTS Model Showdown 2026: Hands-On Testing of Qwen3-TTS, Chatterbox, Fish Speech, CosyVoice & IndexTTS2
In this article, we compare five open-source TTS (text-to-speech) models gaining attention in 2026 across five dimensions — license, multilingual support, zero-shot voice cloning, emotion control, and latency — based on hands-on local testing. Working code samples are included so you can try them immediately.
TL;DR
- Commercial projects + Japanese quality → Qwen3-TTS (Apache-2.0, Japanese support, emotion control)
- Clean license + multilingual cloning → CosyVoice 3.0 (Apache-2.0, best-in-class cross-lingual cloning)
- Video dubbing with duration matching → IndexTTS2 (duration control built-in, but read the license carefully)
- English content + paralinguistic expressions (laughter, coughing, etc.) → Chatterbox Turbo
- Best audio quality + widest language coverage → Fish Speech S2 Pro (complex license, beware)
What Are Open-Source TTS Models — and Why Do They Matter Now?
TTS (Text-to-Speech) is the technology that converts text into spoken audio. Cloud APIs like ElevenLabs and OpenAI TTS deliver high quality, but come with constraints around API costs, data privacy, and customizability.
In 2026, open-source TTS quality has reached near-parity with cloud APIs. The biggest shift is that the following capabilities are now achievable locally:
- Zero-shot Voice Cloning: Mimic a speaker from just 3–10 seconds of reference audio — no fine-tuning required.
- Emotion Control: Switch between anger, joy, sadness, and more using natural-language instructions.
- Streaming Generation: Time-to-first-audio under 100ms. Real-time applications are now viable.
- Duration Control: Automatically adjust speech speed to match video runtime.
The ability to run all of this locally — without depending on the cloud — is a genuine competitive advantage for use cases like video dubbing, virtual YouTubers, call center AI, and multilingual podcasting.
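The streaming claim above (time-to-first-audio under 100 ms) is easy to measure for yourself. Below is a minimal sketch; `measure_ttfa` and `fake_stream` are illustrative helpers standing in for a real streaming backend, not part of any model's API:

```python
import time
from typing import Iterable, Iterator


def measure_ttfa(chunks: Iterable[bytes]) -> tuple[float, list[bytes]]:
    """Return (time-to-first-audio in seconds, all chunks) for a streaming TTS generator."""
    start = time.perf_counter()
    it = iter(chunks)
    first = next(it)  # generator body runs here, so model latency is captured
    ttfa = time.perf_counter() - start
    return ttfa, [first, *it]


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streaming TTS model: 50 ms of 'compute', then two chunks."""
    time.sleep(0.05)
    yield b"\x00" * 3200  # ~100 ms of 16 kHz 16-bit mono audio
    yield b"\x00" * 3200


ttfa, chunks = measure_ttfa(fake_stream())
print(f"TTFA: {ttfa * 1000:.0f} ms, chunks: {len(chunks)}")
```

Swap `fake_stream()` for the real model's streaming generator and the same wrapper gives you a comparable TTFA number across backends.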
Detailed Model Comparisons
For the voice-cloning reference audio, we used a sample provided by the Japanese Voice Actor Statistics Society.
Qwen3-TTS
GitHub: https://github.com/QwenLM/Qwen3-TTS
A TTS model released by Alibaba's Qwen team. The standout feature is applying the Qwen architecture — proven across LLMs — directly to TTS.
| Item | Details |
|---|---|
| Supported Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) + Chinese dialects |
| License | Apache-2.0 (both code and weights) |
| Model Size | 0.6B / 1.7B |
| Zero-shot Clone | Yes (approx. 3-second reference audio) |
| Emotion Control | Yes (natural language instructions) |
| Streaming | Yes (first-audio ~97ms) |
Three Modes to Know
Qwen3-TTS offers three modes for different use cases:
- VoiceDesign: Describe the voice in text (e.g., "a calm, middle-aged male voice") to generate it.
- CustomVoice: Choose from 9 preset voices. Delivers stable, consistent quality.
- Base (Clone): Pass a reference audio for zero-shot voice cloning.
Voice cloning and emotion control cannot be used simultaneously. Emotion instructions are disabled in Base (clone) mode. Decide which to prioritize at design time.
Voice clone result:
Japanese Language Usability
Japanese is included among the 10 supported languages, and naturalness is high compared to other Chinese-origin models. That said, English and Chinese offer the most stable quality — for Japanese, using clone mode with a Japanese reference audio is recommended. Fine-tuning on Japanese speech data could improve results further.
Chatterbox (3 Variants)
GitHub: https://github.com/resemble-ai/chatterbox
An MIT-licensed TTS model released by Resemble AI. Primarily English-focused, though the Multilingual variant supports 23 languages.
| Variant | Size | Languages | Key Features |
|---|---|---|---|
| Original | 0.5B | English | Emotion exaggeration control |
| Multilingual | 0.5B | 23 languages | Emotion exaggeration + multilingual cloning |
| Turbo | 350M | English | Paralinguistic tags + ultra-low latency |
A Common Pitfall with Variant Selection
Paralinguistic tags like [laugh] and [cough] are exclusive to the Turbo variant. They do not exist in Multilingual. If you need both multilingual support and expressions like laughter or coughing, that's currently not possible in a single variant — you'll need to prioritize one or the other.
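If you want to feed the same script to both variants, one workaround is stripping the Turbo-only tags before sending the text to Multilingual. A minimal sketch; the tag list here is illustrative, so check the Chatterbox README for the tags your version actually supports:

```python
import re

# Illustrative subset of Turbo's paralinguistic tags; Multilingual would read
# these aloud literally, so strip them (plus trailing whitespace) beforehand.
PARALINGUISTIC_TAGS = re.compile(r"\[(?:laugh|cough|sigh|gasp)\]\s*")


def strip_paralinguistic_tags(text: str) -> str:
    """Remove Turbo-only tags so one script can also be sent to Multilingual."""
    return PARALINGUISTIC_TAGS.sub("", text).strip()


print(strip_paralinguistic_tags("That was funny [laugh] but I caught a cold. [cough]"))
# -> That was funny but I caught a cold.
```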
The MIT license is one of the most permissive options available, allowing commercial use and derivative model creation with minimal restrictions. It's worth considering for any project where license risk needs to be minimized.
Voice clone result:
Fish Speech
GitHub: https://github.com/fishaudio/fish-speech
A high-quality multilingual TTS developed by Fish Audio. S1 supports 13+ languages; the latest S2 Pro supports 80+ languages, giving it the widest language coverage of the five models compared here.
| Item | Details |
|---|---|
| Supported Languages | S1: 13+, S2 Pro: 80+ |
| License | Code: Fish Audio Research License; Weights: varies by model |
| Zero-shot Clone | Yes (short reference audio) |
| Emotion Control | Prosody / emotion / inline emotion tags |
The license is highly complex. Code and weights have different licenses, and some model versions use non-commercial licenses like CC-BY-NC-SA-4.0. You may see simplified summaries saying "Fish Speech is commercially OK," but always check the license of the specific model version you plan to use. Adopting it for a commercial project without verification carries legal risk.
In terms of audio quality, Fish Speech ranks among the top of the five models, and the inline emotion tags allow fine-grained emotion control. For non-commercial or research use, it's a top contender.
Note that running Fish Speech from the command line requires a two-step pipeline (semantic token generation → wav conversion), unlike the other models. See the code section below for details.
Voice clone result:
CosyVoice 3.0
GitHub: https://github.com/FunAudioLLM/CosyVoice
A TTS model from Alibaba's FunAudioLLM team. Cross-lingual cloning is its standout capability — perfect for use cases like "make a Chinese speaker's voice read Japanese."
| Item | Details |
|---|---|
| Supported Languages | 9 languages + 18+ Chinese dialects and accents |
| License | Apache-2.0 |
| Model Size | 0.5B |
| Zero-shot Clone | Yes (multilingual / cross-lingual) |
| Emotion Control | Emotion cloning evaluation set included |
Clear Apache-2.0 licensing and strong cross-lingual cloning performance make this a highly practical model for multilingual content production.
For emotion control, the primary mechanism is emotion cloning — inheriting the emotion from the reference audio — rather than explicit natural-language instructions. Think of it less as "read this angrily" and more as "match the tone of this angry reference clip." Setting the right expectation here will save you surprises during implementation.
Voice clone result:
IndexTTS2
GitHub: https://github.com/index-tts/index-tts
Released by a team under ByteDance. Its headline features are independent control of Emotion and Speaker Identity, and Duration Control for precise audio-to-video timing. This model is particularly well-suited for video dubbing workflows.
| Item | Details |
|---|---|
| Supported Languages | Primarily English and Chinese (cross-lingual claimed but details not disclosed) |
| License | Apache-2.0 (with additional restrictions) |
| Zero-shot Clone | Yes |
| Emotion Control | Independent emotion/speaker identity control via reference audio or natural language |
| Duration Control | Yes |
Caution regarding the license. Despite the Apache-2.0 label, the license text includes a restriction: "This model must not be used to improve other AI models." This is not part of standard Apache-2.0 and means it cannot be treated as a straightforward Apache-2.0 license. Pay particular attention if you plan to use it for fine-tuning data generation or model distillation.
Duration Control is an extremely valuable feature for video dubbing. By automatically adjusting speech speed to match the original clip's length, it minimizes lip-sync issues and subtitle drift.
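The arithmetic behind duration matching is simple: compare the natural duration of the generated speech to the target slot and derive a playback-rate multiplier. This is a back-of-envelope sketch you could apply to any TTS output, not a description of IndexTTS2's internal mechanism:

```python
def speed_factor(natural_duration_s: float, target_duration_s: float,
                 min_rate: float = 0.8, max_rate: float = 1.25) -> float:
    """Playback-rate multiplier that fits naturally paced speech into a target slot.

    Clamped to a range that usually stays intelligible; the 0.8-1.25 bounds
    are illustrative defaults, not values from the IndexTTS2 paper.
    """
    rate = natural_duration_s / target_duration_s
    return max(min_rate, min(max_rate, rate))


# A 6.0 s natural read must fit a 5.0 s clip: speak 1.2x faster.
print(speed_factor(6.0, 5.0))  # 1.2
```

When the required rate falls outside the clamp, a dubbing pipeline typically rewrites the translation to be shorter or longer instead of distorting the audio.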
Voice clone result (note: output is in Chinese):
License Quick Reference
| Model | Code | Weights | Commercial Use |
|---|---|---|---|
| Qwen3-TTS | Apache-2.0 | Apache-2.0 | ✅ |
| Chatterbox | MIT | MIT | ✅ |
| Fish Speech | Fish Audio Research License | Varies by model (e.g., CC-BY-NC-SA-4.0) | ⚠️ Verify per model |
| CosyVoice 3.0 | Apache-2.0 | Apache-2.0 | ✅ |
| IndexTTS2 | Apache-2.0 (with restrictions) | Apache-2.0 (with restrictions) | ⚠️ Read carefully |
Feature Comparison Matrix
| Feature | Qwen3-TTS | Chatterbox | Fish Speech | CosyVoice 3.0 | IndexTTS2 |
|---|---|---|---|---|---|
| Languages | 10 | 23 (MTL) | 80+ (S2 Pro) | 9 | EN/ZH-centric |
| Zero-shot Clone | ✅ | ✅ | ✅ | ✅ | ✅ |
| Emotion Control (instruction-based) | ✅ | ✅ (exaggeration) | ✅ | △ (cloning-based) | ✅ (independent) |
| Streaming | ✅ (97ms) | — | — | ✅ | — |
| Duration Control | ❌ | ❌ | ❌ | ❌ | ✅ |
| Paralinguistic Tags | ❌ | ✅ (Turbo only) | ✅ (inline) | ❌ | ❌ |
| Model Size | 0.6B–1.7B | 0.35B–0.5B | 4B | 0.5B | ~0.5B |
Verified Working Code
The following code samples have been tested and confirmed working. If you run into issues due to environment differences, refer to each repository's README.
Qwen3-TTS: Voice Clone
Install the qwen_tts package, then run the script below. Specifying --model Qwen/Qwen3-TTS-12Hz-0.6B-Base activates Clone mode.
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import importlib.util
from pathlib import Path
import soundfile as sf
import torch
from qwen_tts import Qwen3TTSModel
DEFAULT_REF_AUDIO = (
"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
)
DEFAULT_REF_TEXT = (
"Okay. Yeah. I resent you. I love you. I respect you. But you know what? "
"You blew it! And thanks to you."
)
def resolve_device(device: str) -> str:
if device == "cuda":
return "cuda:0" if torch.cuda.is_available() else "cpu"
if device.startswith("cuda") and not torch.cuda.is_available():
return "cpu"
return device
def resolve_dtype(device: str) -> torch.dtype:
if not device.startswith("cuda"):
return torch.float32
bf16_supported = bool(
hasattr(torch.cuda, "is_bf16_supported") and torch.cuda.is_bf16_supported()
)
return torch.bfloat16 if bf16_supported else torch.float16
def resolve_attention(device: str, requested: str) -> str:
if requested != "auto":
return requested
if device.startswith("cuda") and importlib.util.find_spec("flash_attn") is not None:
return "flash_attention_2"
return "sdpa" if device.startswith("cuda") else "eager"
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument(
"--text",
default="This is a Qwen3 voice clone test. Reading in a voice close to the reference audio.",
)
p.add_argument("--out", default="outputs/qwen3_voice_clone.wav")
p.add_argument("--device", default="cuda")
p.add_argument("--language", default="English")
p.add_argument("--model", default="Qwen/Qwen3-TTS-12Hz-0.6B-Base")
p.add_argument("--ref-audio", default=DEFAULT_REF_AUDIO)
p.add_argument("--ref-text", default=DEFAULT_REF_TEXT)
p.add_argument("--x-vector-only-mode", action="store_true")
p.add_argument(
"--attn-implementation",
default="auto",
choices=["auto", "flash_attention_2", "sdpa", "eager"],
)
args = p.parse_args()
device = resolve_device(args.device)
dtype = resolve_dtype(device)
attn_implementation = resolve_attention(device, args.attn_implementation)
if args.x_vector_only_mode:
ref_text = None
else:
ref_text = args.ref_text.strip()
if not ref_text:
raise SystemExit("--ref-text is required unless --x-vector-only-mode is set")
model = Qwen3TTSModel.from_pretrained(
args.model,
device_map=device,
dtype=dtype,
attn_implementation=attn_implementation,
)
wavs, sr = model.generate_voice_clone(
text=args.text,
language=args.language,
ref_audio=args.ref_audio,
ref_text=ref_text,
x_vector_only_mode=args.x_vector_only_mode,
)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
sf.write(out, wavs[0], sr)
print(out)
return 0
if __name__ == "__main__":
raise SystemExit(main())
Key Point: the --x-vector-only-mode flag
When this flag is set, the model extracts only the speaker style from the reference audio as an x-vector, without requiring a transcription (--ref-text). Useful when you don't have a transcript of the reference audio, but clone quality tends to be lower than standard mode.
Chatterbox: Japanese Clone with the Multilingual Variant
#!/usr/bin/env python3
from __future__ import annotations
import argparse
from pathlib import Path
import torch
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from chatterbox.tts import ChatterboxTTS
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--text", default="Hello, this is a Chatterbox test.")
p.add_argument("--out", default="outputs/chatterbox.wav")
p.add_argument("--device", default="cuda")
p.add_argument(
"--variant", choices=["turbo", "multilingual"], default="multilingual"
)
p.add_argument("--language-id", default="en")
p.add_argument("--audio-prompt-path", default=None)
args = p.parse_args()
device = (
"cuda:0"
if args.device.startswith("cuda") and torch.cuda.is_available()
else "cpu"
)
if args.variant == "multilingual":
model = ChatterboxMultilingualTTS.from_pretrained(device=device)
wav = model.generate(
args.text,
language_id=args.language_id,
audio_prompt_path=args.audio_prompt_path,
)
else:
model = ChatterboxTTS.from_pretrained(device=device)
wav = model.generate(args.text, audio_prompt_path=args.audio_prompt_path)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
ta.save(str(out), wav, model.sr)
print(out)
return 0
if __name__ == "__main__":
raise SystemExit(main())
Pass --language-id ja and Japanese text to --text for Japanese output. Pass a reference audio file path to --audio-prompt-path to enable clone mode.
Fish Speech: Two-Step Pipeline
Unlike other models, Fish Speech requires a two-step process: semantic token generation → wav conversion.
Setup
sudo apt-get install -y portaudio19-dev libsox-dev ffmpeg build-essential
uv sync --python 3.12 --extra cu129
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro
Generate without reference audio
# Step 1: Generate semantic tokens
uv run python fish_speech/models/text2semantic/inference.py \
--text 'Hello. Testing Fish Speech command-line inference.' \
--output-dir output/demo \
--device cuda
# Step 2: Convert tokens to wav
uv run python fish_speech/models/dac/inference.py \
-i output/demo/codes_0.npy \
-o output/demo/generated.wav \
--checkpoint-path checkpoints/s2-pro/codec.pth \
-d cuda
Voice cloning with reference audio
Fish Speech cloning requires not just the reference audio, but also its transcript text.
# 1. Transcribe reference audio with Whisper
python - <<'PY'
from transformers import pipeline
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v3-turbo",
device="cuda:0",
)
res = pipe("path/to/reference.wav")
print(res["text"])
PY
# 2. Generate clone semantic tokens
uv run python fish_speech/models/text2semantic/inference.py \
--text 'Text you want to read aloud' \
--prompt-text 'Transcript of the reference audio' \
--prompt-audio path/to/reference.wav \
--output-dir output/clone \
--device cuda
# 3. Convert to wav
uv run python fish_speech/models/dac/inference.py \
-i output/clone/codes_0.npy \
-o output/clone/clone.wav \
--checkpoint-path checkpoints/s2-pro/codec.pth \
-d cuda
Auto-transcribing the reference audio with Whisper large-v3-turbo is the practical approach. The difference from a hand-written transcript is negligible.
CosyVoice: Zero-shot Cross-lingual Clone (tested with the CosyVoice2-0.5B weights)
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import sys
from pathlib import Path
import torchaudio
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--text", default="Hello. This is a CosyVoice test.")
p.add_argument("--out", default="outputs/cosyvoice.wav")
p.add_argument("--model", default="pretrained_models/CosyVoice2-0.5B")
p.add_argument("--prompt-text", default="希望你以后能够做的比我还好呦。")
p.add_argument("--prompt-wav", required=True)
args = p.parse_args()
vendor = Path(__file__).resolve().parents[1] / "vendors" / "cosyvoice"
sys.path.insert(0, str(vendor))
sys.path.insert(0, str(vendor / "third_party" / "Matcha-TTS"))
from cosyvoice.cli.cosyvoice import AutoModel
model = AutoModel(model_dir=args.model)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
    import torch  # local import: used only to concatenate streamed chunks

    # inference_zero_shot yields one chunk per text segment. Concatenate all
    # chunks instead of saving only the first, so longer inputs are not cut off.
    chunks = [
        chunk["tts_speech"]
        for chunk in model.inference_zero_shot(
            args.text, args.prompt_text, args.prompt_wav, stream=False
        )
    ]
    torchaudio.save(str(out), torch.cat(chunks, dim=1), model.sample_rate)
print(out)
return 0
if __name__ == "__main__":
raise SystemExit(main())
IndexTTS2: Emotion Control + Duration Matching
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import sys
from pathlib import Path
def main() -> int:
p = argparse.ArgumentParser()
p.add_argument("--text", default="Hello. This is an IndexTTS test.")
p.add_argument("--out", default="outputs/index_tts.wav")
p.add_argument("--model-dir", default="checkpoints")
p.add_argument("--prompt-wav", required=True)
p.add_argument("--device", default="cuda")
p.add_argument("--emo", default="") # emotion vector or natural language text
args = p.parse_args()
vendor = Path(__file__).resolve().parents[1] / "vendors" / "index-tts"
model_dir = Path(args.model_dir)
if not model_dir.is_absolute():
model_dir = (Path.cwd() / model_dir).resolve()
cfg_path = model_dir / "config.yaml"
sys.path.insert(0, str(vendor))
sys.path.insert(0, str(vendor / "indextts"))
from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(
model_dir=str(model_dir),
cfg_path=str(cfg_path),
use_fp16=args.device.startswith("cuda"),
device=args.device,
)
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
    # --emo accepts either a comma-separated emotion vector (e.g. "0,0,0,0,0,0,0,1")
    # or a natural-language description. Passing the raw string as emo_vector
    # would fail, so parse it into the right argument first.
    emo_vector = None
    emo_text = None
    if args.emo:
        try:
            emo_vector = [float(x) for x in args.emo.split(",")]
        except ValueError:
            emo_text = args.emo
    tts.infer(
        spk_audio_prompt=args.prompt_wav,
        text=args.text,
        output_path=str(out),
        emo_vector=emo_vector,
        use_emo_text=emo_text is not None,
        emo_text=emo_text,
    )
print(out)
return 0
if __name__ == "__main__":
raise SystemExit(main())
Specify the emotion via the --emo argument, either as a comma-separated emotion vector or as a natural-language description. Leaving --emo empty inherits the emotion from the reference audio automatically.
Glossary
| Term | Explanation |
|---|---|
| Zero-shot Clone | Technology that mimics a speaker from just a few seconds of reference audio, with no prior fine-tuning required |
| Emotion Exaggeration | A parameter that amplifies emotional intensity. 1.0 is neutral; 2.0 exaggerates joy, anger, sadness, etc. |
| Paralinguistic Tags | Tags like [laugh] and [cough] that instruct the model to produce non-verbal vocal expressions |
| Duration Control | A feature that matches the generated audio length to a specified target duration. Valuable for video dubbing |
| Cross-lingual Cloning | Technology that reproduces a speaker's voice in a different language from the reference audio |
| x-vector | A fixed-length feature vector representing a speaker's vocal characteristics, independent of speech content |
| Semantic Token | A compressed token sequence encoding textual meaning. Used in Fish Speech as an intermediate representation before wav conversion |
FAQ
Q. How much GPU VRAM is required?
Most models are 0.5B–1.7B, so roughly 4–8GB VRAM is sufficient to run them in fp16/bf16. Chatterbox Turbo (350M) is the lightest and runs comfortably on a 4GB GPU. Qwen3-TTS 1.7B in bf16 typically needs 8–12GB.
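These figures can be sanity-checked with a back-of-envelope weight-memory calculation. Note this is a floor only: the audio codec/vocoder, activations, and framework overhead push real usage well above it, which is why the practical recommendations above are higher:

```python
def vram_floor_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Lower bound on VRAM: weight memory alone in fp16/bf16 (2 bytes/param).

    Real usage adds the codec/vocoder, activations, and framework overhead.
    """
    return params_billion * 1e9 * bytes_per_param / 1024**3


for name, size in [("Chatterbox Turbo", 0.35), ("CosyVoice 3.0", 0.5), ("Qwen3-TTS 1.7B", 1.7)]:
    print(f"{name}: >= {vram_floor_gb(size):.1f} GB")
```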
Q. Can these run on CPU only?
Technically yes, but not at a practical speed. Real-time generation (generating in equal or less time than it takes to read the text aloud) generally requires a GPU. CPU is fine for quality evaluation and experimentation.
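"Practical speed" here is usually expressed as the real-time factor (RTF): generation time divided by audio duration. A quick sketch with illustrative timings (the 2.5 s and 30 s figures are examples, not benchmark results from this article):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means generation is faster than playback (real-time capable)."""
    return generation_seconds / audio_seconds


print(real_time_factor(2.5, 10.0))   # GPU-class example: 0.25 (4x real time)
print(real_time_factor(30.0, 10.0))  # CPU-class example: 3.0 (too slow for live use)
```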
Q. Which model has the best Japanese quality?
Within the scope of this comparison, it comes down to Qwen3-TTS (official Japanese support) and Fish Speech S2 Pro (80-language coverage). That said, Japanese quality evaluation is highly dependent on the reference audio and input text — trying both on your actual use case will give you the most accurate answer.
Q. What's the ideal reference audio length?
It varies by model, but 3–10 seconds is the practical range. Too short and speaker characteristics are insufficient; too long yields diminishing returns. Clarity of the recording (no background noise, clean speech) matters more than length.
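A quick way to validate a candidate reference clip is to check its duration programmatically. This sketch uses only the standard library, with a synthetic in-memory WAV standing in for your actual clip:

```python
import io
import math
import struct
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Duration of a PCM WAV blob; handy for validating a cloning reference clip."""
    with wave.open(io.BytesIO(data)) as w:
        return w.getnframes() / w.getframerate()


# Synthesize a 5-second, 16 kHz mono sine tone in memory as a stand-in clip.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 220 * n / 16000)))
        for n in range(16000 * 5)
    ))

dur = wav_duration_seconds(buf.getvalue())
print(f"{dur:.1f} s -> {'OK for cloning' if 3.0 <= dur <= 10.0 else 'adjust length'}")
# -> 5.0 s -> OK for cloning
```

For a real clip, read the file bytes with `Path("reference.wav").read_bytes()` and pass them to the same function. Checking for background noise still requires listening (or an SNR estimate), which this sketch does not cover.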
Q. How do I use Qwen3-TTS VoiceDesign mode?
Use a CustomVoice or VoiceDesign mode model and pass a natural-language voice description to the text field — for example: "A calm, middle-aged male voice with a slight husky tone." This does not work with Clone mode models (the *-Base variants).
Q. Where can I verify the Fish Speech commercial license?
Check the LICENSE file directly on each model's Hugging Face card (e.g., fishaudio/fish-speech-1.5). CC-BY-NC-SA-4.0 (non-commercial) and Fish Audio Research License (commercially permissive with conditions) coexist across versions. The "version = license" relationship can change over time, so always refer to the latest license file on Hugging Face.
Summary
Open-source TTS reached a sufficient quality bar in 2026 to serve as a genuine alternative to cloud APIs. However, there are significant differences across models on three axes — quality, license, and features — so clarifying your use case and license requirements before selecting a model is essential.
In particular, Fish Speech and IndexTTS2 have licensing pitfalls that require careful reading before any commercial adoption. If you want high quality with a clean license, starting with Qwen3-TTS (Apache-2.0) or CosyVoice 3.0 (Apache-2.0) is the safest choice at this point in time.
Related Articles
Explore more articles connected to this topic.
I Tested Zero-Shot Voice Cloning with Emotion Control in OpenVoice — 8 Styles from a 14-Second Reference
OpenVoice V1 is a zero-shot voice cloning library that extracts a speaker's tone color from as little as 14 seconds of reference audio, then synthesizes speech in 8 emotional styles: `whispering`, `shouting`, `excited`, `cheerful`, `terrified`, `angry`, `sad`, and `friendly`. This post covers the code I actually ran, what I heard, and the gotchas I hit along the way.
Read article →
5 Books a Working AI Engineer Actually Read for Real-World Accuracy Improvement and Product Development
The reason you can call AI APIs but still can't improve accuracy is a lack of understanding of what's happening inside the model. These 5 books — all O'Reilly titles written by practitioners at the forefront of the field, including Hugging Face engineers and Chip Huyen — are the standard reading list for ML engineers worldwide.
Read article →