NetEase Open-Sources Confucius4-TTS: 3-Second Voice Cloning Across 14 Languages, Zero Accent

NetEase Youdao has thrown open the doors on one of the most ambitious voice AI projects to date, announcing the full open-source release of Confucius4-TTS — the “Ziyue 4.0” text-to-speech engine that can clone a human voice from just three seconds of audio and reproduce it across 14 languages with zero detectable accent. It is, the company claims, the first open-source model to achieve cross-lingual, accent-free voice cloning without requiring reference text.


The numbers are striking. With no speaker-specific training and no reference transcript, Confucius4-TTS achieves over 85% voice similarity on a three-second sample and hits 97% accuracy on cloning tasks, by NetEase's figures. That three-second clip is all it takes: feed it a short utterance in Mandarin, and the model will speak fluent Japanese, English, Spanish, French, German, Korean, Thai, Vietnamese, and more, all in the same voice and without the telltale foreign accent that has plagued cross-lingual synthesis for years.
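The article does not say how the "voice similarity" figure is computed. A common way to score voice similarity in speech research is the cosine similarity between speaker embeddings of the reference clip and the cloned output. The sketch below illustrates that general idea only; the embedding values are made up, and treating this as NetEase's metric is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional speaker embeddings (real ones have
# hundreds of dimensions and come from a speaker encoder).
reference_embedding = [0.2, 0.7, -0.1, 0.4]
cloned_embedding = [0.25, 0.65, -0.05, 0.35]

similarity = cosine_similarity(reference_embedding, cloned_embedding)
print(f"speaker similarity: {similarity:.3f}")
```

A score near 1.0 means the two embeddings point in nearly the same direction, which is read as "same speaker"; how such a score would map onto the reported 85% is not stated in the source.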

Under the hood, Confucius4-TTS represents a complete architectural rethink from NetEase’s earlier EmotiVoice system. Gone are the HiFi-GAN vocoder and the speaker-ID lookup tables. In their place sits a GPT-style semantic language model as the backbone, paired with an SSL-pretrained, ECAPA-TDNN-based learnable speaker encoder. Generation runs through a flow matching framework, a departure from traditional vocoder pipelines that gives the model finer-grained control over both timbre and prosody.
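For readers unfamiliar with flow matching, the core idea is that a model learns a velocity field that carries a noise sample to a data sample, and generation integrates that field from t = 0 to t = 1. The toy sketch below is not Confucius4-TTS code: it swaps the learned network for a closed-form velocity that points at a single fixed target, and every name in it (toy_velocity, sample_with_euler) is hypothetical.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    On the straight-line path x_t = (1 - t) * x0 + t * target, the
    velocity that reaches a single fixed target is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_with_euler(x0, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # always < 1, so (1 - t) never hits zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]  # starting sample
target = [0.5, -1.0, 2.0]  # made-up "data" point, e.g. acoustic features

result = sample_with_euler(noise, target)
print(result)
```

In a real system the velocity would come from a trained neural network conditioned on things like text tokens and a speaker embedding, and the output would be acoustic features rather than three numbers.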

The emotion handling is where things get particularly interesting. Most TTS systems rely on crude text-label tags to steer emotional expression — think “happy” or “sad” directives that produce stilted, one-note delivery. Confucius4-TTS instead uses audio-prompt emotion cloning: it extracts the emotional fingerprint directly from the reference clip — the intonation, the rhythm, the micro-pauses — and transplants it across languages without loss. A whispered Mandarin sample yields a whispered English output. An urgent, rapid-fire delivery in Korean carries the same urgency into French.

NetEase has released the full 54 GB model package under the Apache license, which permits commercial use. Developers can deploy it locally, including in air-gapped environments where data security and customization are paramount. The company positions Confucius4-TTS as a low-barrier, homegrown technology foundation for multilingual content production, digital human dubbing, cross-language education, and the rapidly growing market for localizing short-form video content for global audiences.

The code and model weights are available now on GitHub at github.com/netease-youdao/Confucius4-TTS. NetEase has framed the release in almost philosophical terms: a bet that open-sourcing voice cloning at this level will lower the threshold enough that “every voice can cross the boundaries of language.”