GlobalPhone Language Models

by Ngoc Thang Vu, Tanja Schultz, 2012


GlobalPhone is an ongoing database collection that provides transcribed speech data for the development and evaluation of large speech processing systems in the most widespread languages of the world. GlobalPhone is designed to be uniform across languages with respect to the amount of text and audio data per language, the audio data quality (microphone, noise, channel), the collection scenario (task, setup, speaking style etc.), and the transcription conventions. The GlobalPhone corpus provides an excellent basis for research in the areas of (1) multilingual speech recognition, (2) rapid deployment of speech processing systems to new languages, (3) language and speaker identification tasks, (4) multilingual speech synthesis, (5) monolingual speech recognition in a large variety of languages, as well as (6) comparisons across major languages based on text and speech data.

Download 3-gram Language Models

Language       Perplexity (PPL)   OOV [%]   Vocabulary size   Download
Bulgarian      454                1.0       274k              BG.lm
Czech          1421               4.0       267k              CZ.lm
French         324                2.4       65k               FR.lm
German         672                0.3       38k               GE.lm
Hausa          97                 0.5       41k               HAU.lm
Croatian       721                3.6       362k              HR.lm
Japanese       89                 1.0       67k               JP.lm
Korean (char)  25                 0         1.3k              KO.lm
Mandarin       262                0.8       13k               MAN.lm
Portuguese     58                 9.8       62k               PT.lm
Polish         951                0.8       243k              PL.lm
Russian        1310               3.9       293k              RU.lm
Spanish        154                0.1       19k               SP.lm
Swedish        423                5.3       73k               SWE.lm
Tamil          730                1.0       288k              TA.lm
Thai           70                 0.1       22k               TH.lm
Turkish        XXX                13.2      29k               TU.lm
Vietnamese     218                0         30k               VN.lm
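
For readers unfamiliar with the two quality measures in the table, the following is a minimal sketch of their standard textbook definitions. It is not the evaluation script used to produce the numbers above; this page does not state how out-of-vocabulary words or sentence-boundary tokens were handled. The function names and the assumption that per-token log probabilities are given in base 10 are illustrative choices.

```python
import math


def oov_rate(tokens, vocabulary):
    """Return the percentage of running tokens that are not in the vocabulary."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * unknown / len(tokens)


def perplexity(log10_probs):
    """Return the perplexity given one base-10 log probability per scored token.

    Perplexity is 10 raised to the negative mean log10 probability.
    """
    if not log10_probs:
        raise ValueError("log10_probs must be non-empty")
    mean_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-mean_log10)


# Toy example: one of four test tokens is missing from the vocabulary.
test_tokens = ["a", "b", "c", "d"]
vocab = {"a", "b", "c"}
toy_oov = oov_rate(test_tokens, vocab)

# Toy example: a model that assigns probability 0.25 to every token.
toy_ppl = perplexity([math.log10(0.25)] * 4)
```

Under these definitions, a lower perplexity means the model is less surprised by the test text, and a lower OOV rate means the vocabulary covers more of it. The two interact: a small vocabulary tends to lower perplexity while raising the OOV rate, so the columns are best read together.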


Tanja Schultz - tanja dot schultz at uni-bremen dot de


We would like to thank all who have helped us collect the data corpus.


1. GlobalPhone: A Multilingual Text and Speech Database in 20 Languages. Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe. In Proc. of ICASSP, Canada, 2013.
2. Language Independent and Language Adaptive Acoustic Modeling for Speech Recognition. Tanja Schultz and Alex Waibel. Speech Communication, Volume 35, Issue 1-2, pp. 31-51.
3. GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University. Tanja Schultz. In Proc. of the International Conference on Spoken Language Processing, ICSLP, Denver, CO, 2002.
4. Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit. Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, and Tanja Schultz. In Proc. of Interspeech, Japan, 2010.
5. Cross-language bootstrapping based on completely unsupervised training using multilingual A-stabil. Ngoc Thang Vu, Franziska Kraus, and Tanja Schultz. In Proc. of ICASSP, Czech Republic, 2011.

For further publications on language-specific issues, please refer to the CSL publication server at