Modeling Social Annotation: a Bayesian Approach

9 November 2008

Abstract

Collaborative tagging systems, such as del.icio.us, CiteULike, and others, allow users to annotate objects, e.g., Web pages or scientific papers, with descriptive labels called tags. The social annotations, contributed by thousands of users, can potentially be used to infer categorical knowledge, classify documents, or recommend new relevant information. Traditional text inference methods do not make the best use of socially generated data, since they do not take into account variations in individual users' perspectives and vocabulary. In previous work, we introduced a simple probabilistic model that takes the interests of individual annotators into account in order to find hidden topics of annotated objects. Unfortunately, that approach had a number of shortcomings, including overfitting, local maxima, and the requirement to specify values for some parameters. In this paper we address these shortcomings in two ways. First, we extend the model to a fully Bayesian framework. Second, we describe an infinite version of the model that enables it to automatically estimate values of key parameters. We evaluate the model in detail on synthetic data by comparing its performance to Latent Dirichlet Allocation, a popular text categorization algorithm, on the topic extraction task. Finally, we validate the proposed model on real-world data. Specifically, we apply it to infer topics of Web resources from the social annotations obtained from. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated metadata to solve real-world information processing tasks.
