From Outliers to Topics in Language Models: Anticipating Trends in News Corpora

Main: 7 pages | 6 figures | Bibliography: 2 pages | 8 tables | Appendix: 5 pages
Abstract
This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track how these outliers evolve over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: across both models and languages, outliers tend to evolve into coherent topics over time.
