Exact log-likelihood for clustering parameterised models and normally distributed data

The log-likelihood for clustering multivariate normal distributions is calculated for a partition with equal means in each cluster. The result has terms to penalise poor fits and model complexity, and determines both the number and composition of clusters. The procedure is equivalent to calculating the Bayesian Information Criterion (BIC) without approximation, and can produce results similar to, but less subjective than, those of the ad-hoc "elbow criterion". An intended application is the clustering of parametric models whose maximum likelihood estimates (MLEs) are normally distributed. Many parametric models are more familiar and interpretable than directly clustered data. For example, survival models can build in prior knowledge, adjust for known confounders, and use marginalisation to emphasise parameters of interest. The combined approach is equivalent to a multi-layer clustering algorithm that characterises features through the normally distributed MLE parameters of a fitted model, and then clusters the normal distributions. The results can alternatively be applied directly to measured data and their estimated covariances.
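
As a rough illustration of the "equal means in each cluster" setup, the following Python sketch evaluates a plug-in Gaussian log-likelihood for a given partition, assuming each observation comes with a known covariance matrix. The function names (`cluster_mean`, `partition_loglik`) and the toy data are hypothetical, and the sketch deliberately omits the model-complexity terms mentioned above; it is not the exact expression derived in the paper.

```python
import numpy as np

def cluster_mean(xs, covs):
    """Precision-weighted common mean: the MLE of a shared mean
    when each observation has its own known covariance."""
    precisions = [np.linalg.inv(c) for c in covs]
    total_precision = sum(precisions)
    weighted_sum = sum(p @ x for p, x in zip(precisions, xs))
    return np.linalg.solve(total_precision, weighted_sum)

def partition_loglik(xs, covs, labels):
    """Sum of Gaussian log densities with one shared mean per cluster.
    Assumption: no penalty for the number of clusters is included."""
    xs = np.asarray(xs, dtype=float)
    covs = np.asarray(covs, dtype=float)
    labels = np.asarray(labels)
    d = xs.shape[1]
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        mu = cluster_mean(xs[idx], covs[idx])
        for i in idx:
            r = xs[i] - mu
            _, logdet = np.linalg.slogdet(covs[i])
            mahalanobis = r @ np.linalg.solve(covs[i], r)
            total += -0.5 * (d * np.log(2 * np.pi) + logdet + mahalanobis)
    return total

# Toy data (made up for illustration): two tight groups of 2-D estimates.
xs = [[0.0, 0.0], [0.1, -0.1], [5.0, 5.0], [5.1, 4.9]]
covs = [0.01 * np.eye(2) for _ in xs]

ll_two = partition_loglik(xs, covs, [0, 0, 1, 1])    # matches the groups
ll_one = partition_loglik(xs, covs, [0, 0, 0, 0])    # everything merged
ll_each = partition_loglik(xs, covs, [0, 1, 2, 3])   # every point alone
```

On this toy data the two-cluster partition scores far higher than the merged one, but the all-singletons partition scores at least as high as either, because an unpenalised fit never gets worse when clusters are split. That is the motivation for the complexity terms in the exact log-likelihood described above.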