Hierarchical Clustering: New Bounds and Objective

12 November 2021
Mirmahdi Rahgoshay, M. Salavatipour
arXiv:2111.06863
Abstract

Hierarchical clustering has been studied and used extensively as a method for analysis of data. More recently, Dasgupta [2016] defined a precise objective function. Given a set of $n$ data points with a weight function $w_{i,j}$ for each pair of items $i$ and $j$ denoting their similarity/dissimilarity, the goal is to build a recursive (tree-like) partitioning of the data points (items) into successively smaller clusters. He defined the cost of a tree $T$ to be $Cost(T) = \sum_{i,j \in [n]} \big(w_{i,j} \times |T_{i,j}|\big)$, where $T_{i,j}$ is the subtree rooted at the least common ancestor of $i$ and $j$, and presented the first approximation algorithm for such clustering.

Moseley and Wang [2017] then considered the dual of Dasgupta's objective function for similarity-based weights and showed that both random partitioning and average linkage achieve approximation ratio $1/3$, which has been improved in a series of works to $0.585$ [Alon et al. 2020]. Later, Cohen-Addad et al. [2019] considered the same objective function as Dasgupta's but for dissimilarity-based metrics, called $Rev(T)$. It is shown that both random partitioning and average linkage achieve ratio $2/3$, which has been only slightly improved to $0.667078$ [Charikar et al. SODA 2020].

Our first main result is to consider $Rev(T)$ and present a more delicate algorithm and careful analysis that achieves a $0.71604$-approximation. We also introduce a new objective function for dissimilarity-based clustering. For any tree $T$, let $H_{i,j}$ be the number of common ancestors of $i$ and $j$. Intuitively, items that are similar are expected to remain within the same cluster as deep as possible. So, for dissimilarity-based metrics, we suggest the cost of each tree $T$, which we want to minimize, to be $Cost_H(T) = \sum_{i,j \in [n]} \big(w_{i,j} \times H_{i,j}\big)$. We present a $1.3977$-approximation for this objective.
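To make the two tree objectives concrete, here is a minimal Python sketch (not code from the paper; the `Node` class, the example tree, and the weights are illustrative assumptions) that computes $Cost(T)$ and $Cost_H(T)$ for a small four-point instance. For each leaf pair it finds $|T_{i,j}|$, the number of leaves under the least common ancestor of $i$ and $j$, and $H_{i,j}$, the number of common ancestors of $i$ and $j$ (equivalently, the depth of their LCA with the root counted as 1).

```python
# Minimal sketch (illustrative, not the paper's code): compute
# Cost(T)   = sum_{i<j} w_{i,j} * |T_{i,j}|  (Dasgupta's objective) and
# Cost_H(T) = sum_{i<j} w_{i,j} * H_{i,j}    (the paper's new objective)
# for a small hierarchical clustering tree.

class Node:
    """A tree node: internal nodes hold children, leaves hold a point index."""
    def __init__(self, children=None, leaf=None):
        self.children = children or []
        self.leaf = leaf

def pair_stats(root):
    """For every leaf pair (i, j) with i < j, return
    sizes[(i, j)] = |T_{i,j}|: leaf count of the subtree rooted at the
                    least common ancestor (LCA) of i and j, and
    ancs[(i, j)]  = H_{i,j}: number of common ancestors of i and j
                    (= depth of the LCA, counting the root as 1)."""
    sizes, ancs = {}, {}

    def walk(node, depth):
        if node.leaf is not None:
            return {node.leaf}
        child_sets = [walk(c, depth + 1) for c in node.children]
        merged = set().union(*child_sets)
        # A pair's LCA is this node exactly when i and j come from
        # different child subtrees.
        for a in range(len(child_sets)):
            for b in range(a + 1, len(child_sets)):
                for i in child_sets[a]:
                    for j in child_sets[b]:
                        key = (min(i, j), max(i, j))
                        sizes[key] = len(merged)
                        ancs[key] = depth
        return merged

    walk(root, 1)
    return sizes, ancs

def cost(w, stats):
    return sum(wij * stats[pair] for pair, wij in w.items())

# Example: four points clustered as ((0, 1), (2, 3)), with made-up
# dissimilarity weights w_{i,j}.
root = Node(children=[
    Node(children=[Node(leaf=0), Node(leaf=1)]),
    Node(children=[Node(leaf=2), Node(leaf=3)]),
])
w = {(0, 1): 5.0, (2, 3): 4.0,
     (0, 2): 1.0, (0, 3): 1.0, (1, 2): 1.0, (1, 3): 1.0}

sizes, ancs = pair_stats(root)
print("Cost(T)   =", cost(w, sizes))  # 5*2 + 4*2 + 4*(1*4) = 34
print("Cost_H(T) =", cost(w, ancs))   # 5*2 + 4*2 + 4*(1*1) = 22
```

Since $w$ here is a dissimilarity, a tree that instead splits the heavy pairs $(0, 1)$ and $(2, 3)$ at the root would leave them only one common ancestor each, lowering $Cost_H$ to $15$, which matches the intuition that dissimilar items should be separated as early as possible.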
