Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

2 May 2020

Muhammad Abdul-Mageed

AbdelRahim Elmadany

El Moatez Billah Nagoudi

Dinesh Pabbi

Kunal Verma

Abstract

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes as far back as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (169M tweets). We release tweet IDs from the dataset. We also develop and release a powerful model (accuracy = 94%) for separating messages related to the pandemic from those that are not. Our data and models can be useful for studying various phenomena related to the pandemic and for accelerating viable solutions to associated problems. Our data and models are publicly available for research.
