ResearchTrend.AI

arXiv: 2410.03730

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

30 September 2024
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Jan Ebert
Alexander Arno Weber
Richard Rutmann
Charvi Jain
Max Lübbering
Daniel Steinigen
Johannes Leveling
Katrin Klug
Jasper Schulze Buschhoff
Lena Jurkschat
Hammam Abdelwahab
Benny Jörg Stein
Karl-Heinz Sylla
Pavel Denisov
Nicolò Brandizzi
Qasid Saleem
Anirban Bhowmick
Lennard Helmer
Chelsea John
Pedro Ortiz Suarez
Malte Ostendorff
Alex Jude
Lalith Manjunath
Samuel Weinbach
Carolin Penke
Oleg Filatov
Shima Asaadi
Fabio Barth
R. Sifa
Fabian Küch
A. Herten
René Jäkel
Georg Rehm
Stefan Kesselheim
Joachim Köhler
Nicolas Flores-Herr
Abstract

We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
