Bayesian variable selection for latent class analysis using a collapsed Gibbs sampler

Abstract

Latent class analysis is used to perform model-based clustering for multivariate categorical responses. Selecting the variables most relevant for clustering is an important task that can considerably affect the quality of the clustering. This work considers a Bayesian approach for selecting the number of clusters and the best clustering variables. The main idea is to reformulate the problem of group and variable selection as a probabilistically driven search over a large discrete space using Markov chain Monte Carlo (MCMC) methods. This approach yields estimates of the degree of relevance of each variable for clustering, along with posterior probabilities for the number of clusters. Bayes factors can then be easily calculated, and a suitable model chosen in a principled manner. Both selection tasks are carried out simultaneously using an MCMC approach based on a collapsed Gibbs sampling method, whereby several model parameters are integrated out of the model, substantially improving computational performance. Approaches for estimating posterior marginal probabilities of class membership, variable inclusion and number of groups are proposed, and post-hoc procedures for parameter and uncertainty estimation are outlined. The approach is tested on simulated and real data.
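To illustrate what integrating parameters out of the model means in practice, the following is a minimal sketch of one collapsed Gibbs sweep over class labels. It is not the paper's sampler: it assumes a fixed number of classes, treats every variable as a clustering variable, and assumes symmetric Dirichlet priors (`alpha` on class weights, `beta` on item probabilities). The function name `collapsed_gibbs_sweep` and all argument names are hypothetical. With the class weights and item probabilities integrated out, each label is drawn using only count statistics from the other observations.

```python
import random

def collapsed_gibbs_sweep(X, z, G, n_cats, alpha=1.0, beta=1.0, rng=random):
    """One sweep of collapsed Gibbs updates for the class labels z (in place).

    X      : list of N observations, each a list of M category indices
    z      : list of N current class labels in range(G)
    n_cats : number of categories of each of the M variables
    alpha, beta : assumed symmetric Dirichlet hyperparameters
    """
    N, M = len(X), len(n_cats)
    # Sufficient statistics: class sizes and per-class category counts.
    size = [0] * G
    counts = [[[0] * n_cats[m] for m in range(M)] for _ in range(G)]
    for i in range(N):
        size[z[i]] += 1
        for m in range(M):
            counts[z[i]][m][X[i][m]] += 1

    for i in range(N):
        # Remove observation i from its current class.
        g_old = z[i]
        size[g_old] -= 1
        for m in range(M):
            counts[g_old][m][X[i][m]] -= 1

        # Unnormalised conditional of z[i] given all other labels, with
        # class weights and item probabilities integrated out.
        weights = []
        for g in range(G):
            p = size[g] + alpha
            for m in range(M):
                p *= (counts[g][m][X[i][m]] + beta) / (size[g] + n_cats[m] * beta)
            weights.append(p)

        # Draw the new label and restore the counts.
        g_new = rng.choices(range(G), weights=weights)[0]
        z[i] = g_new
        size[g_new] += 1
        for m in range(M):
            counts[g_new][m][X[i][m]] += 1
    return z
```

A usage sketch would initialise `z` at random and call the function repeatedly, recording the labels after each sweep. Moves over variable inclusion and the number of groups, which the abstract describes as being carried out simultaneously, are deliberately left out here.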
