
Danoliteracy of Generative Large Language Models

Main: 8 Pages
14 Figures
Bibliography: 2 Pages
9 Tables
Appendix: 6 Pages
Abstract

The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate \emph{Danoliteracy}, a measure of Danish language and cultural competency across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates to human feedback at $\rho \sim 0.8$ with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
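
The two quantitative claims in the abstract can be illustrated with a minimal sketch. It assumes that $\rho$ is a rank correlation such as Spearman's, and it substitutes the first principal component of the scenario correlation matrix for the paper's factor analysis, which the abstract does not specify. The function names and the toy numbers are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def first_component_share(scores, iterations=500):
    """Share of variance captured by the first principal component.

    scores: one row per model, one column per scenario.
    Uses power iteration on the scenario correlation matrix (a PCA
    stand-in, assumed here; not necessarily the paper's factor model).
    """
    columns = list(zip(*scores))
    n = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    return eigenvalue / n  # the trace of a correlation matrix equals n


# Hypothetical toy data: 4 models x 3 scenarios, plus made-up human scores.
toy_scores = [
    [0.90, 0.80, 0.85],
    [0.70, 0.60, 0.65],
    [0.50, 0.45, 0.40],
    [0.30, 0.20, 0.25],
]
toy_human = [0.8, 0.9, 0.4, 0.1]
toy_benchmark = [mean(row) for row in toy_scores]

toy_rho = spearman(toy_benchmark, toy_human)
toy_share = first_component_share(toy_scores)
```

In this sketch, a high `toy_share` means the scenario scores move together across models, which is the pattern the abstract summarizes as a single underlying factor.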
