Automatic register identification for the open web using multilingual deep learning

28 June 2024

Erik Henriksson

A. Myntti

Anni Eskelinen

Selcen Erten-Johansson

Saara Hellström

Veronika Laippala

Main: 27 pages; Bibliography: 6 pages; Appendix: 8 pages; 12 figures; 20 tables

Abstract

This article presents multilingual deep learning models for identifying web registers (text varieties such as news reports and discussion forums) across 16 languages. We introduce the Multilingual CORE corpora, which contain over 72,000 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Using multi-label classification, our best model achieves 79% F1 averaged across languages, matching or exceeding previous studies that used simpler classification schemes. This demonstrates that models can perform well even with a complex register scheme at multilingual scale. However, we observe a consistent performance ceiling across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid texts (those combining multiple registers) reveals that the main challenge lies not in classifying hybrids themselves, but in distinguishing hybrid from non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly for languages with limited training data. Zero-shot performance on unseen languages drops by an average of 7%, though this varies by language (3–8%), indicating that while registers share features across languages, they also retain language-specific characteristics.
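
To make the multi-label setup concrete, the following is a minimal sketch of how per-register scores might be turned into label sets, scored with F1, and checked for hybrids. It is an illustration under stated assumptions, not the authors' pipeline: the register names, the 0.5 threshold, the choice of micro-averaging, and the helper names `predict_labels`, `micro_f1`, and `is_hybrid` are all hypothetical, and a hybrid is simplified here to any document carrying more than one label.

```python
from typing import Dict, List, Set


def predict_labels(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Keep every register whose score reaches the (assumed) threshold."""
    return {label for label, score in scores.items() if score >= threshold}


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over all (document, label) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def is_hybrid(labels: Set[str]) -> bool:
    """Simplifying assumption: a hybrid is a document with more than one register."""
    return len(labels) > 1


# Illustrative gold annotations and model scores (made-up values).
gold = [{"news"}, {"forum"}, {"news", "opinion"}]
scores = [
    {"news": 0.9, "forum": 0.1, "opinion": 0.2},
    {"news": 0.2, "forum": 0.8, "opinion": 0.6},
    {"news": 0.7, "forum": 0.1, "opinion": 0.4},
]
pred = [predict_labels(s) for s in scores]

f1 = micro_f1(gold, pred)
gold_hybrid = [is_hybrid(g) for g in gold]
pred_hybrid = [is_hybrid(p) for p in pred]
```

In this toy data every document receives at least one correct label, yet the hybrid/non-hybrid decision is wrong for the second and third documents, which mirrors the kind of error the abstract identifies as the main difficulty with hybrids.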
