Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements

4 November 2025

Patrick Haller

Jonas Golde

Alan Akbik


Main:8 Pages

3 Figures

Bibliography:2 Pages

18 Tables

Appendix:7 Pages

Abstract

We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both STRICT and STRICT-SMALL tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.
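For readers unfamiliar with the term, the following is a minimal sketch of generic causal linear attention. It is not the BLaLM mLSTM token mixer, which the abstract does not specify in detail. The feature map `elu(x) + 1` is a common stand-in chosen here as an assumption and is not the Hedgehog feature map. All function names are hypothetical. The sketch shows why the cost is linear in sequence length: each step updates a fixed-size running state instead of revisiting every past position.

```python
import math


def feature_map(x):
    """elu(x) + 1, elementwise. Assumed stand-in; keeps features positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def causal_linear_attention(queries, keys, values):
    """Recurrent form: one fixed-size state update per token."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append([
            sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
            for j in range(d_v)
        ])
    return outputs


def causal_quadratic_attention(queries, keys, values):
    """Reference form: the same weights, computed over all past positions."""
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(values[0]))
        ])
    return outputs
```

Both functions compute the same outputs. The recurrent form keeps only a `d_k × d_v` state and a `d_k` normalizer, so its per-token cost does not grow with context length.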

