Cost-Effective Text Clustering with Large Language Models

22 April 2025

Hongtao Wang

Taiyan Zhang

Renchi Yang

Jianliang Xu


Main: 10 pages; Bibliography: 3 pages; Appendix: 1 page; 6 figures; 8 tables

Abstract

Text clustering aims to automatically partition a collection of text documents into distinct clusters based on linguistic features. In the literature, this task is usually framed either as metric clustering over text embeddings from pre-trained encoders or as graph clustering over pairwise similarities obtained from an oracle, e.g., a large ML model. Recently, large language models (LLMs) have brought significant advances to this field by offering contextualized text embeddings and highly accurate similarity scores. At the same time, they pose a major challenge: the numerous API-based queries or inference calls they require incur substantial computational and/or financial overhead.
