ECAI Conference 2025 Conference Paper
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
- Mehdi Ali
- Michael Fromm
- Klaudia Thellmann
- Jan Ebert
- Alexander Arno Weber
- Richard Rutmann
- Charvi Jain
- Max Lübbering
We present two multilingual LLMs, Teuken-7B-Base and Teuken-7B-Instruct, designed to embrace Europe’s linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and using a custom multilingual tokenizer, our models address the limitations of existing Large Language Models (LLMs) that focus predominantly on English or a few high-resource languages. We detail the models’ development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance across multilingual benchmarks, as evidenced by their results on European versions of ARC, HellaSwag, and TruthfulQA.