According to Beating, Xiaomi's AI Lab Kaldi team has open-sourced OmniVoice, a zero-shot voice-cloning TTS model that supports 646 languages. The model clones a speaker's voice characteristics from just a few seconds of reference audio and works across languages: a single cloned voice can synthesize speech in Mandarin, Japanese, Korean, and other languages. All code, weights, and training data are released under the Apache-2.0 license.
OmniVoice uses a simplified architecture in which a single bidirectional Transformer maps text directly to discrete acoustic tokens, achieving inference 40x faster than real time in PyTorch. The model was trained on 580,000 hours of audio drawn from 50 open-source datasets. In evaluation, it outperformed commercial systems on voice similarity and intelligibility in the 24 languages tested, and matched or exceeded human recordings in 102 languages.
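To make the "40x faster than real time" figure concrete, the short sketch below converts it into a real-time factor (RTF) and a wall-clock estimate. The function names are hypothetical and not part of OmniVoice. The only number taken from the report is the 40x speedup, and the report does not state the hardware or batch settings behind it, so actual timings may differ.

```python
def real_time_factor(speedup: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def synthesis_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Estimated wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * real_time_factor(speedup)


# Reported figure: 40x faster than real time (hardware not specified in the report).
REPORTED_SPEEDUP = 40.0

rtf = real_time_factor(REPORTED_SPEEDUP)
one_minute = synthesis_time_seconds(60.0, REPORTED_SPEEDUP)
print(f"RTF: {rtf:.3f}")
print(f"Time to synthesize 60 s of speech: {one_minute:.2f} s")
```

Under that assumption, one minute of speech would take roughly 1.5 seconds to generate.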