Private Two-Party Cluster Analysis Made Formal & Scalable

Machine Learning (ML) is widely used for predictive tasks in numerous important applications, most successfully in the context of collaborative learning, where multiple entities contribute their own datasets to jointly deduce global ML models. Despite its efficacy, this learning paradigm fails to encompass critical application domains, such as healthcare and security analytics, that involve learning over highly sensitive data, where privacy risks limit entities to deducing local models individually, using only their own datasets. In this work, we present the first comprehensive study of privacy-preserving collaborative hierarchical clustering, featuring scalable cryptographic protocols that allow two parties to safely perform cluster analysis over their combined sensitive datasets. For the problem at hand, we introduce a formal security notion that achieves the required balance between intended accuracy and privacy, and we present a class of two-party hierarchical clustering protocols that guarantee strong privacy protection, provable in our new security model. Crucially, our solution employs a modular design and judicious use of cryptography to achieve high degrees of efficiency and extensibility. Specifically, we extend our core protocol to obtain two secure variants that significantly improve performance: an optimized variant for single-linkage clustering and a scalable approximate variant. Finally, we provide a prototype implementation of our approach and experimentally evaluate its feasibility and efficiency on synthetic and real datasets, obtaining encouraging results. For example, end-to-end execution of our secure approximate protocol over 1M 10-dimensional records completes in 35 sec, transferring only 896KB and achieving 97.09% accuracy.
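For readers unfamiliar with the underlying computation, the following is a minimal plaintext sketch of single-linkage agglomerative clustering over the two parties' pooled records. It is purely illustrative of what the secure protocol computes; it is not the paper's cryptographic protocol, and the function name and parameters are hypothetical.

```python
import numpy as np

def single_linkage_clustering(points, num_clusters=1):
    """Plaintext single-linkage agglomerative clustering (illustrative only).

    Repeatedly merges the two clusters whose closest pair of members is
    nearest, recording each merge. In the paper's setting, this computation
    would run under a two-party cryptographic protocol over the combined
    sensitive datasets; here it is shown in the clear for clarity.
    """
    # Start with every record in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the minimum pairwise
                # Euclidean distance between members of the two clusters.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges, clusters

# Usage example: two parties' 10-dimensional records, pooled in the clear
# (synthetic data; in the paper's setting this pooling never happens openly).
rng = np.random.default_rng(0)
party_a = rng.normal(0.0, 1.0, size=(5, 10))
party_b = rng.normal(3.0, 1.0, size=(5, 10))
merges, clusters = single_linkage_clustering(np.vstack([party_a, party_b]), num_clusters=2)
print(len(merges), "merges;", [sorted(c) for c in clusters])
```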