On the Impact of Cross-Domain Data on German Language Models

11 October 2023
Amin Dada, Aokun Chen, Cheng Peng, Kaleb E. Smith, Ahmad Idrissi-Yaghir, Constantin Seibold, Jianning Li, Lars Heiliger, Xi Yang, Christoph M. Friedrich, Daniel Truhn, Jan Egger, Jiang Bian, Jens Kleesiek, Yonghui Wu
Abstract

Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, with improvements of up to 4.45% over the previous state of the art. The models are available at https://huggingface.co/ikim-uk-essen.
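The released checkpoints can be loaded with the Hugging Face transformers library. Below is a minimal sketch of querying one of the models for masked-token prediction; the repository id "ikim-uk-essen/geberta-base" is an assumption for illustration, so check the organization page for the actual checkpoint names.

    # Minimal sketch: masked-token prediction with one of the released
    # German models via Hugging Face transformers.
    # NOTE: the repo id "ikim-uk-essen/geberta-base" is an assumption;
    # see https://huggingface.co/ikim-uk-essen for the real checkpoints.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="ikim-uk-essen/geberta-base")

    # The mask token ("[MASK]" vs. "<mask>") depends on the model's tokenizer;
    # fill_mask.tokenizer.mask_token reveals the correct one.
    for prediction in fill_mask("Die Hauptstadt von Deutschland ist [MASK]."):
        print(prediction["token_str"], prediction["score"])

Since the paper benchmarks the models on downstream tasks, fine-tuning them would follow the standard transformers workflow (e.g. AutoModelForSequenceClassification with a task-specific head).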

View on arXiv: https://arxiv.org/abs/2310.07321