The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian

Abstract

While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Large Language Models (LLMs) in this language. This work presents two new benchmarks: Invalsi MATE, which evaluates model performance on mathematical understanding in Italian, and Invalsi ITA, which evaluates language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students aged 6 to 18 within the Italian school system and have been validated by several experts in teaching and pedagogy. We use these benchmarks to evaluate 9 powerful language models, showing that current language models are capped at 70% accuracy on mathematical understanding, achieved by Llama 3 70b, and at 85% on language understanding. We also compare LLMs with the average performance of Italian students, showing that Llama 3 is the only model that outperforms students on Invalsi MATE, while most models outperform students on Invalsi ITA. We will make the data and evaluation code openly available to pave the way for the future development of larger and harder benchmarks to evaluate LLMs' mathematical and linguistic understanding in Italian.
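
As a rough illustration of how an accuracy evaluation on a multiple-choice benchmark of this kind might be run (the paper's released evaluation code is not shown here; the record fields, the Italian prompt template, and the `generate` callable below are all hypothetical placeholders), a minimal sketch could look like:

```python
# Minimal sketch of a multiple-choice evaluation loop.
# All names here (the record layout, the prompt wording, and the
# generate callable) are hypothetical, not the paper's released interface.
from typing import Callable, Iterable


def evaluate_accuracy(
    records: Iterable[dict],
    generate: Callable[[str], str],
) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    total = 0
    for rec in records:
        # Hypothetical record layout: an Italian question, lettered
        # options, and the gold answer letter (e.g. "B").
        options = "\n".join(
            f"{letter}. {text}" for letter, text in rec["options"].items()
        )
        prompt = (
            f"Domanda: {rec['question']}\n{options}\n"
            "Rispondi solo con la lettera dell'opzione corretta."
        )
        # Keep only the first character so scoring stays deterministic
        # even if the model replies with extra words.
        prediction = generate(prompt).strip().upper()[:1]
        correct += prediction == rec["answer"]
        total += 1
    return correct / total if total else 0.0
```

Truncating the model's reply to its first letter is one simple way to keep scoring deterministic when models answer verbosely; the actual released code may normalize answers differently.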
