Moving Past Single Metrics: Exploring Short-Text Clustering Across Multiple Resolutions

Cluster number is typically a parameter selected at the outset in clustering problems, and while impactful, the choice can often be difficult to justify. Inspired by bioinformatics, this study examines how the nature of clusters varies with cluster number, presenting a method for determining cluster robustness, and providing a systematic method for deciding on the cluster number. The study focuses specifically on short-text clustering, involving 30,000 political Twitter bios, where the sparse co-occurrence of words between texts makes finding meaningful clusters challenging. A metric of proportional stability is introduced to uncover the stability of specific clusters between cluster resolutions, and the results are visualised using Sankey diagrams to provide an interrogative tool for understanding the nature of the dataset. The visualisation provides an intuitive way to track cluster subdivision and reorganisation as cluster number increases, offering insights that static, single-resolution metrics cannot capture. The results show that instead of seeking a single 'optimal' solution, choosing a cluster number involves balancing informativeness and complexity.
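To make the idea of tracking clusters between resolutions concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not define proportional stability, so the definition used here (the largest share of a coarse cluster's members that remain together in a single finer cluster) is an assumption, and the function names `cluster_flows` and `proportional_stability` are hypothetical. The toy labels are invented for illustration only.

```python
from collections import Counter, defaultdict


def cluster_flows(labels_coarse, labels_fine):
    """Count how many items move from each coarse cluster to each fine cluster."""
    if len(labels_coarse) != len(labels_fine):
        raise ValueError("label lists must have the same length")
    # Keys are (coarse_label, fine_label) pairs; values are item counts.
    return Counter(zip(labels_coarse, labels_fine))


def proportional_stability(labels_coarse, labels_fine):
    """Assumed definition: for each coarse cluster, the largest fraction of
    its members that end up together in one fine cluster."""
    flows = cluster_flows(labels_coarse, labels_fine)
    sizes = Counter(labels_coarse)
    largest_flow = defaultdict(int)
    for (parent, _child), count in flows.items():
        largest_flow[parent] = max(largest_flow[parent], count)
    return {parent: largest_flow[parent] / sizes[parent] for parent in sizes}


# Toy example (invented data): six texts clustered at k=2 and again at k=3.
labels_k2 = [0, 0, 0, 0, 1, 1]
labels_k3 = [0, 0, 0, 2, 1, 1]

flows = cluster_flows(labels_k2, labels_k3)
stability = proportional_stability(labels_k2, labels_k3)
```

Under these assumptions, each entry of `flows` is a (source, target, value) link of the kind a Sankey diagram draws between two adjacent resolutions, and a coarse cluster with stability close to 1 passes through to the next resolution largely intact, while a low value indicates that it splits or reorganises.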
@article{miller2025_2502.17020,
  title={Moving Past Single Metrics: Exploring Short-Text Clustering Across Multiple Resolutions},
  author={Justin Miller and Tristram Alexander},
  journal={arXiv preprint arXiv:2502.17020},
  year={2025}
}