ChatTTS Hands-On: Is This 39k-Star Open-Source TTS Really That Good?
ChatTTS is an open-source TTS model purpose-built for conversational speech, with 39k stars on GitHub. After a week of using it for audio content, here's what works and what doesn't.
I’ve been creating content for over a year and constantly searching for a decent free TTS solution. Commercial options like Azure TTS and ElevenLabs are pricey, and traditional self-hosted TTS sounds like a 1980s synthesizer. Then ChatTTS landed, and for the first time I thought, “okay, this one is actually usable.”
Project Background
ChatTTS is open-sourced by the 2noise team, released in 2024, and now has 39k stars on GitHub. Its positioning is unique: it’s not a general-purpose TTS — it’s specifically optimized for conversational speech. That means it generates natural pauses, hesitations, and even laughs, sounding more like real chat than a news broadcast.
Built on PyTorch, the model is roughly 1GB in size. License is AGPL-3.0 (more on this below), and the latest update was April 2026.
Where It Genuinely Excels
1. Prosody Feels Almost Human
I compared ChatTTS, Bark, edge-tts, and a major commercial Chinese TTS API on the same podcast script. ChatTTS clearly had the best “breathing” — natural mid-sentence pauses, appropriate emphasis on specific words, and natural intonation at question endings.
It also handles filler words (“um,” “uh,” “ah”) gracefully — they don’t come out staccato like traditional TTS.
2. Code-Switching Doesn’t Break
I record technical content with a lot of mixed Chinese-English sentences (“write a RESTful API in Python”). Many TTS engines either pronounce English as Chinese pinyin or switch accents abruptly. ChatTTS handles this surprisingly smoothly: “Python” is pronounced in clean English, then it switches back to Chinese without a jarring transition.
3. Voice Embedding Is a Game-Changer
You can “sample” a voice and persist it via speaker embeddings. I’ve saved several good-sounding speaker embeddings as .pt files and reuse them across content for consistent voice branding. For series creators, this is huge.
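Here is a minimal sketch of how I keep those .pt files organized. The folder name `voices`, the helper names, and the voice name `host_a` are all my own placeholders, and I'm assuming the `sample_random_speaker()` method as I recall it from the ChatTTS README. Depending on the release, it may return a tensor or an encoded string; `torch.save` can persist either.

```python
from pathlib import Path

# Hypothetical folder for saved voices; pick any location you like.
VOICE_DIR = Path("voices")


def voice_path(name: str) -> Path:
    """Map a voice name such as 'host_a' to voices/host_a.pt."""
    return VOICE_DIR / f"{name}.pt"


def save_voice(name: str, spk_emb) -> Path:
    """Persist a speaker embedding with torch.save and return its path."""
    import torch  # imported lazily so voice_path() works without torch

    path = voice_path(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(spk_emb, path)
    return path


def load_voice(name: str):
    """Load a previously saved speaker embedding onto the CPU."""
    import torch

    return torch.load(voice_path(name), map_location="cpu")


def sample_and_save(chat, name: str) -> Path:
    """Draw a random speaker from an already-loaded ChatTTS.Chat and save it.

    Assumes chat.sample_random_speaker() exists, as in the project README.
    """
    return save_voice(name, chat.sample_random_speaker())
```

How the loaded embedding is passed back into `infer` differs between releases, so check the README for the version you have installed.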
4. Inference Speed Is Acceptable
On my RTX 3060 12GB, generating a 30-second clip takes 5-8 seconds. On an M1 Pro Mac via MPS backend, it’s usable but noticeably slower than CUDA.
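To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the real-time factor (generation time divided by audio length). The function names are mine, and the 5-minute projection assumes speed stays constant across chunks, which I haven't measured rigorously.

```python
def real_time_factor(gen_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    return gen_seconds / audio_seconds


def estimate_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Projected wall-clock time, assuming the RTF stays constant."""
    return audio_seconds * rtf


# Figures quoted above: a 30-second clip in 5 to 8 seconds on an RTX 3060.
best = real_time_factor(5, 30)
worst = real_time_factor(8, 30)
print(f"RTF range: {best:.2f} to {worst:.2f}")

# Projection for a 5-minute (300 s) episode under the constant-RTF assumption.
print(
    f"5-minute episode: {estimate_generation_seconds(300, best):.0f} "
    f"to {estimate_generation_seconds(300, worst):.0f} seconds"
)
```

So on the RTX 3060 a 5-minute episode should take roughly 50 to 80 seconds of pure generation time, not counting model loading.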
Quick Setup
The fastest path is the official web UI:
git clone https://github.com/2noise/ChatTTS
cd ChatTTS
pip install -r requirements.txt
python examples/web/webui.py
Open localhost:8080, paste your text, pick a random_speaker, hit generate.
For code integration, here’s the minimal snippet:
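import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # downloads weights on first run; newer releases may expose this as chat.load()

texts = ["Hello, today I'd like to share an open-source TTS project"]
wavs = chat.infer(texts)  # one NumPy waveform per input text

# Save the first clip at a 24 kHz sample rate
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)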
First run downloads ~1.2GB of model files from HuggingFace. If your connection is shaky, manually pre-download and drop them into ~/.cache/huggingface/hub.
But It Has Real Limitations
VRAM is hungrier than advertised. Docs say 4GB is enough; in practice, generating 30+ seconds of audio in one pass occasionally runs out of memory even on my 12GB card. If you only have 8GB, chunk your text aggressively.
Long text gets unstable. Throw in over 200 characters at once and you sometimes get weird pitch jumps or sudden speed-ups near the end. The right approach is splitting at sentence boundaries and stitching outputs together.
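The split-and-stitch approach can be sketched roughly like this. The helper names and the greedy packing strategy are my own, not part of ChatTTS. The 200-character limit comes from the observation above, and `synthesize_long` assumes `chat.infer` returns one NumPy waveform per input text, as in the minimal snippet.

```python
import re

_CJK_END = ("。", "！", "？")
# Split right after Chinese sentence enders, or after . ! ? followed by
# whitespace (so "v1.2" or "3.14" is not cut in half).
_SENTENCE_END = re.compile(r"(?<=[。！？])|(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text at sentence boundaries, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        # No space is needed after Chinese punctuation.
        sep = "" if current.endswith(_CJK_END) else " "
        candidate = (current + sep + sentence).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(chat, text: str, max_chars: int = 200):
    """Run each chunk through chat.infer and concatenate the waveforms."""
    import numpy as np  # imported lazily; only needed for stitching

    pieces = [chat.infer([chunk])[0] for chunk in chunk_text(text, max_chars)]
    # axis=-1 joins along time for both (N,) and (1, N) shaped arrays.
    return np.concatenate(pieces, axis=-1)
```

Smaller chunks also keep peak VRAM down, so the same helper doubles as a workaround for the out-of-memory issue. For a consistent voice across chunks, you would want to reuse one speaker embedding for every call.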
Emotion control is coarse. It supports inline tags like [laugh] and [uv_break], but you can’t precisely dial emotion intensity (e.g., “slightly happy” vs. “very happy”). ElevenLabs handles this better — but ElevenLabs is a paid product.
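For reference, the inline tags are just plain text inside the input string. Here is a tiny sketch of how I build tagged lines; `join_with_breaks` is a hypothetical helper of mine, not a ChatTTS function, and it only does string formatting.

```python
def join_with_breaks(clauses: list[str], tag: str = "[uv_break]") -> str:
    """Join clauses with an inline pause tag between them."""
    return f" {tag} ".join(c.strip() for c in clauses if c.strip())


# Two clauses separated by a pause, followed by a laugh at the end.
line = join_with_breaks(["So I ran the build", "and it just worked"]) + " [laugh]"
print(line)
```

As far as I recall from the README, passing `skip_refine_text=True` to `chat.infer` keeps hand-written tags from being rewritten by the text-refinement step, but treat that as an assumption and check it against your installed version.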
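License is AGPL-3.0. Important for commercial users: AGPL is a strong copyleft license, and its network-use clause covers software that users interact with over a network. Building a SaaS product on top of it likely means releasing your own service’s source code under the same license. For personal projects or internal use, you’re fine. Enterprise use should consult legal first.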
Chinese dialects don’t work. The model is trained predominantly on Mandarin; Cantonese, Sichuanese, etc. are essentially unsupported.
How It Compares
Bark (Suno): wider language coverage, but Chinese quality and prosody are clearly weaker than ChatTTS.
edge-tts: free wrapper around Microsoft Azure, excellent Chinese quality, but cloud-dependent and the API could be killed any day.
MeloTTS: fast training, smaller models, but voice variety is limited and lacks the “conversational feel” of ChatTTS.
XTTS (Coqui): strong multilingual support including zero-shot voice cloning, but Chinese quality is mediocre and commercial license has restrictions.
For pure Chinese conversational content, ChatTTS is currently the strongest open-source option, period.
Who It’s For
Podcasters, audiobook producers, video narrators, technical bloggers wanting an audio companion to articles — all great fits. Especially when you need to produce long-form content at volume, commercial TTS subscription costs add up fast. Deploy ChatTTS once, use it forever.
But for a few short TikTok voice-overs, free edge-tts or CapCut’s built-in TTS will be faster than setting up local inference. In the time the setup takes, you could finish ten short videos.
Bottom Line
ChatTTS’s 39k stars are earned. Its Chinese conversational performance has genuinely raised the bar for open-source TTS. In my workflow, anything over 5 minutes long now goes through ChatTTS, and I’ve cancelled my paid TTS subscription.
The one thing that gives me pause is the AGPL license. If you’re planning a commercial SaaS, think it through carefully. But for individual creators and internal use, it’s a delicious free lunch.
GitHub: https://github.com/2noise/ChatTTS
About the Author
Liudingyu is a full-stack developer and heavy GitHub user who has starred 900+ repos over the past 3 years. This site only covers tools I’ve actually used or deeply researched.