Global Reactions to COVID-19 on Twitter: A Labelled Dataset with Latent Topic, Sentiment and Emotion Attributes
This paper presents a large, labelled dataset of people's responses and expressions related to the COVID-19 pandemic on Twitter. From 28 January 2020 to 1 January 2021, we retrieved over 132 million public Twitter posts (i.e., tweets) from more than 20 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging natural language processing techniques and pre-trained machine learning-based emotion analytic algorithms, we labelled each tweet with seventeen latent semantic attributes: a) ten binary attributes indicating the tweet's relevance or irrelevance to each of the top ten detected topics; b) five quantitative intensity attributes, one for valence or sentiment (from extremely negative to extremely positive) and one each for fear, anger, sadness and joy (from barely noticeable to extremely high intensity); and c) two qualitative attributes indicating the sentiment category and the dominant emotion category that the tweet mainly expresses. We report descriptive statistics for the topic, sentiment and emotion attributes and their temporal distributions, and discuss the dataset's possible uses in communication, psychology, public health, economics and epidemiology research.
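To make the labelling scheme concrete, the following is a minimal sketch of how one labelled record and the keyword filter might be represented. It is not the dataset's actual schema: the class name, the field names, the `tweet_id` field and the example values are all hypothetical, and the case-insensitive substring match is an assumption, since the abstract does not say how the keywords were matched or what value ranges the intensity scores take.

```python
from dataclasses import dataclass
from typing import Tuple

# The four search keywords named in the abstract.
KEYWORDS = ("corona", "wuhan", "nCov", "covid")


def matches_keywords(text: str) -> bool:
    """Assumed rule: case-insensitive substring match on any keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in KEYWORDS)


@dataclass(frozen=True)
class LabelledTweet:
    """Hypothetical record holding the seventeen attributes per tweet."""

    tweet_id: str  # hypothetical identifier, not one of the 17 attributes
    # a) ten binary topic-relevance attributes
    topic_flags: Tuple[bool, ...]
    # b) five quantitative intensity attributes (ranges not assumed here)
    valence_intensity: float
    fear_intensity: float
    anger_intensity: float
    sadness_intensity: float
    joy_intensity: float
    # c) two qualitative attributes
    sentiment_category: str
    dominant_emotion: str

    def __post_init__(self) -> None:
        if len(self.topic_flags) != 10:
            raise ValueError("expected exactly ten topic flags")

    def attribute_count(self) -> int:
        """Ten topic flags + five intensities + two categories."""
        return len(self.topic_flags) + 5 + 2


# Made-up values, for illustration only.
example = LabelledTweet(
    tweet_id="0",
    topic_flags=(True,) + (False,) * 9,
    valence_intensity=0.3,
    fear_intensity=0.6,
    anger_intensity=0.2,
    sadness_intensity=0.4,
    joy_intensity=0.1,
    sentiment_category="negative",
    dominant_emotion="fear",
)
```

Keeping the ten topic flags in one tuple rather than ten separate fields is only a convenience for this sketch; a flat table with one column per attribute would carry the same information.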