Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)

Topic modeling is an increasingly important component of Big Data analytics, enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM), while mathematically elegant, do not lend themselves well to direct parallelization because of dependencies from one time step to the next. Data decomposition approaches, which partition data across time segments and then combine the results into a global view of how topics change over time, enable topic models to run on much larger datasets than is possible without decomposition. However, these methods are difficult to analyze mathematically and remain relatively untested with respect to topic quality and performance on parallel systems. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. CLDA uses a data decomposition strategy to partition the data and takes advantage of parallelism, enabling fast execution even for very large datasets and large numbers of topics. A large corpus is split into local segments, each covering a different time step; Latent Dirichlet Allocation (LDA) is applied to infer topics within each segment; and the resulting local topics are merged by clustering into global topics. Results show that perplexity is comparable and that the topics generated by this algorithm are similar to those generated by DTM. In addition, CLDA is two orders of magnitude faster than existing approaches and allows more freedom in experimental design. CLDA is applied successfully to seventeen years of NIPS conference papers, seventeen years of computer science journal abstracts, and forty years of the PubMed corpus.
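The pipeline described above (independent LDA runs per time segment, followed by clustering of the local topics into global topics) can be sketched compactly. The snippet below is a minimal illustration using scikit-learn; the shared vocabulary, the topic counts, and the use of k-means as the clustering step are assumptions made for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a CLDA-style pipeline, assuming k-means clustering of
# local topic-word distributions. Parameters are illustrative, not the
# authors' configuration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

def clda(segments, n_local_topics=20, n_global_topics=10, seed=0):
    """segments: list of document lists, one list per time step."""
    # Shared vocabulary so topic-word vectors from different segments
    # live in the same space and can be clustered together.
    vectorizer = CountVectorizer(max_features=5000, stop_words="english")
    vectorizer.fit(doc for seg in segments for doc in seg)

    # Step 1: fit LDA independently on each time segment. These fits have
    # no cross-segment dependencies, so they could run in parallel.
    local_topics = []
    for seg in segments:
        X = vectorizer.transform(seg)
        lda = LatentDirichletAllocation(n_components=n_local_topics,
                                        random_state=seed)
        lda.fit(X)
        # Normalize rows into topic-word probability distributions.
        topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        local_topics.append(topics)

    # Step 2: cluster all local topics; each cluster is one global topic.
    all_topics = np.vstack(local_topics)
    km = KMeans(n_clusters=n_global_topics, n_init=10,
                random_state=seed).fit(all_topics)
    return km.cluster_centers_, km.labels_
```

In this sketch, `cluster_centers_` serve as the global topic-word distributions, while `labels_` record which global topic each local topic belongs to; tracing a cluster's members across segments gives the temporal evolution of that topic.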