Clustering Mixed Datasets Using Homogeneity Analysis with Applications to Big Data

17 August 2016

Abstract

Data analysts routinely need to cluster datasets that mix continuous and categorical attributes. This work presents a method for clustering such datasets using Homogeneity Analysis, which is used to obtain an optimal Euclidean representation of the mixed dataset; this representation is then clustered. The resulting clustering solutions are compared with those obtained from the Gower distance, a method popularly used with such datasets. The comparison is made on datasets that have been the subject of other research investigations. The Homogeneity Analysis solution is eigenvalue based, and the eigenvalues are used to produce the optimal Euclidean representation. Even with a single eigenvalue, the Homogeneity Analysis based solution performed better than the method based on the Gower distance. Extending the solution to use multiple eigenvalues is illustrated on real world datasets. The method can also be used in conjunction with the mini-batch K-Means algorithm to cluster large datasets, which is illustrated on a real world dataset. The relevant theory from Homogeneity Analysis is presented.
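For readers unfamiliar with the Gower baseline mentioned above, the following is a minimal sketch of the standard Gower dissimilarity for mixed records, not the paper's implementation. The function names (`column_ranges`, `gower_distance`), the toy records, and the equal weighting of attributes are assumptions made for illustration only.

```python
def column_ranges(records, is_categorical):
    """Range (max - min) of each continuous column; None for categorical ones."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)
        else:
            col = [rec[j] for rec in records]
            ranges.append(max(col) - min(col))
    return ranges


def gower_distance(a, b, is_categorical, ranges):
    """Gower dissimilarity between two records, with equal attribute weights.

    Continuous attributes contribute |x - y| / range; categorical
    attributes contribute 0 on a match and 1 on a mismatch. The result
    is the average contribution over all attributes.
    """
    total = 0.0
    for x, y, cat, rng in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0
        elif rng > 0:
            total += abs(x - y) / rng
        # A constant continuous column (range 0) contributes nothing.
    return total / len(a)


# Hypothetical toy data: (age, income, colour).
records = [
    (25, 40000.0, "red"),
    (35, 60000.0, "blue"),
    (45, 40000.0, "red"),
]
is_categorical = (False, False, True)
ranges = column_ranges(records, is_categorical)

d01 = gower_distance(records[0], records[1], is_categorical, ranges)
d02 = gower_distance(records[0], records[2], is_categorical, ranges)
```

A pairwise matrix of such dissimilarities is typically passed to a distance-based clustering algorithm. The approach in this work instead clusters a Euclidean representation of the data, which is what makes mini-batch K-Means applicable to large datasets.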
