Correcting Sociodemographic Selection Biases for Population Prediction from Social Media

International Conference on Web and Social Media (ICWSM), 2019
Abstract

Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population -- a "selection bias". Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet, restratification is rarely evaluated for improving prediction. Across four tasks of predicting U.S. county population health statistics from Twitter, we find standard restratification techniques provide no improvement and often degrade prediction accuracies. The core reasons for this seem to be both shrunken estimates (reduced variance of model predicted values) and sparse estimates of each population's socio-demographics. We thus develop and evaluate three methods to address these problems: estimator redistribution to account for shrinking, and adaptive binning and informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods significantly outperforms the standard restratification approaches. Combining approaches, we find substantial improvements over non-restratified models, yielding a 53.0% increase in predictive accuracy (R^2) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
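
As a rough illustration of the standard restratification idea the abstract describes (reweighting observations by how under- or over-sampled their group is), the following is a minimal sketch of generic post-stratification. It is not the authors' implementation; the function names, group labels, and numbers are hypothetical and chosen only for the example. Groups with no sampled users are skipped, which is the kind of sparsity the abstract says adaptive binning and informed smoothing are meant to address.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per group = population share / sample share.

    sample_counts: dict mapping group -> number of sampled users (hypothetical input)
    population_shares: dict mapping group -> share of the target population
    """
    total = sum(sample_counts.values())
    weights = {}
    for group, count in sample_counts.items():
        if count == 0:
            # Sparse bin: nobody to reweight, so this group is left out.
            continue
        sample_share = count / total
        weights[group] = population_shares[group] / sample_share
    return weights


def weighted_mean(values, groups, weights):
    """Average user-level values, weighting each user by their group's weight."""
    numerator = sum(weights[g] * v for v, g in zip(values, groups))
    denominator = sum(weights[g] for g in groups)
    return numerator / denominator


# Illustrative data: group "a" is over-sampled relative to a 50/50 population.
sample_counts = {"a": 3, "b": 1}
population_shares = {"a": 0.5, "b": 0.5}
weights = poststratification_weights(sample_counts, population_shares)

values = [1.0, 1.0, 1.0, 5.0]   # one value per sampled user
groups = ["a", "a", "a", "b"]   # each user's group
estimate = weighted_mean(values, groups, weights)
```

In this toy case the unweighted mean is 2.0, while the reweighted estimate moves toward the under-sampled group "b", since each "b" user stands in for more of the population.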
