Spanish Language Models

15 July 2021

Asier Gutiérrez-Fandiño

Jordi Armengol-Estapé

Marc Pàmies

Joan Llop-Palao

Joaquín Silveira-Ocampo

C. Carrino

Aitor Gonzalez-Agirre

Carme Armentano-Oller

Carlos Rodríguez-Penagos

Marta Villegas

Abstract

This paper presents the Spanish RoBERTa-base and RoBERTa-large models, together with their performance evaluations. Both models were pre-trained on the largest Spanish corpus known to date: 570 GB of clean, deduplicated text processed for this work, compiled from the web crawls performed by the National Library of Spain from 2009 to 2019.
