The repository contains an ongoing collection of tweets associated with the novel coronavirus COVID-19 since January 22nd, 2020.
As of 02/19/2021, a total of 1,431,742,565 tweets had been collected. The tweets are collected using Twitter’s trending topics and selected keywords. Moreover, the tweets from Chen et al. (2020) were used to supplement the dataset by hydrating non-duplicated tweets.
Citation
Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1
The dataset is organized by hour (UTC), month, and table. The description of all the features in all five tables is provided below. For example, the path “./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv” contains all the summary details of the tweets collected on January 22nd, 2020 at 00:00 UTC.
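The path layout above can be sketched as a small helper. This is a minimal sketch, not part of the repository: the function name `table_path` is hypothetical, and it assumes that the other four tables follow the same folder and file naming pattern as the `Summary_Details` example.

```python
from datetime import datetime, timezone


def table_path(table: str, when: datetime) -> str:
    """Build the relative path of one hourly CSV file for a given table.

    Assumption: every table uses the pattern shown for Summary_Details,
    i.e. ./<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv with the hour in UTC.
    """
    return f"./{table}/{when:%Y_%m}/{when:%Y_%m_%d_%H}_{table}.csv"


# First hour of the collection (January 22nd, 2020 at 00:00 UTC).
first_hour = datetime(2020, 1, 22, 0, tzinfo=timezone.utc)
print(table_path("Summary_Details", first_hour))
```

For the first hour of the collection, this reproduces the example path quoted above.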
Table | Feature Name | Description |
---|---|---|
Primary key | Tweet\_ID | Integer representation of the tweet's unique identifier |
1.Summary\_Details | Language | When present, indicates a BCP47 language identifier corresponding to the machine-detected language of the Tweet text |
1.Summary\_Details | Geolocation\_cordinate | Indicates whether or not the geographic location of the tweet was reported |
1.Summary\_Details | RT | Indicates if the tweet is a retweet (YES) or original tweet (NO) |
1.Summary\_Details | Likes | Number of likes for the tweet |
1.Summary\_Details | Retweets | Number of times the tweet was retweeted |
1.Summary\_Details | Country | When present, indicates a list of uppercase two-letter country codes from which the tweet comes |
1.Summary\_Details | Date\_Created | UTC date and time the tweet was created |
2.Summary\_Hastag | Hashtag | Hashtag (\#) present in the tweet |
3.Summary\_Mentions | Mentions | Mention (@) present in the tweet |
4.Summary\_Sentiment | Sentiment\_Label | Most probable tweet sentiment (neutral, positive, negative) |
4.Summary\_Sentiment | Logits\_Neutral | Non-normalized prediction for neutral sentiment |
4.Summary\_Sentiment | Logits\_Positive | Non-normalized prediction for positive sentiment |
4.Summary\_Sentiment | Logits\_Negative | Non-normalized prediction for negative sentiment |
5.Summary\_NER | NER\_text | Text stating a named entity recognized by the NER algorithm |
5.Summary\_NER | Start\_Pos | Initial character position within the tweet of the NER\_text |
5.Summary\_NER | End\_Pos | End character position within the tweet of the NER\_text |
5.Summary\_NER | NER\_Label Prob | Label and probability of the named entity recognized by the NER algorithm |
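
Because the three sentiment logits are non-normalized, they may be easier to interpret once converted to probabilities. The following is a minimal sketch using a softmax, which is the usual normalization for classifier logits. It assumes that the logits come from a softmax classifier and that Sentiment\_Label corresponds to the largest logit; neither is confirmed here. The numeric values in `row` are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical row from a Summary_Sentiment file (values are illustrative only).
row = {"Logits_Neutral": 0.5, "Logits_Positive": 2.0, "Logits_Negative": -1.0}

labels = ["neutral", "positive", "negative"]
probs = softmax([row["Logits_Neutral"], row["Logits_Positive"], row["Logits_Negative"]])

# The most probable label is the one with the highest probability.
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```
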
For more information visit: Twitter API and the Documentation for API Tweet-object
As of 02/19/2021:
Total Number of tweets: 1,431,742,565
Average daily number of tweets: 148,812
Year | Month | Daily Avg. Original | Daily Avg. Retweets | Daily Avg. Tweets | Total of Original | Total of Retweets | Total of Tweets | Total with Geolocation | Max No. Retweets | Max No. Likes |
---|---|---|---|---|---|---|---|---|---|---|
2020 | 1 | 5,947 | 30,576 | 35,501 | 1,958,346 | 7,852,504 | 9,810,850 | 1,773 | 674,151 | 334,802 |
2020 | 2 | 10,978 | 29,918 | 40,604 | 7,624,648 | 21,944,443 | 29,568,948 | 8,103 | 469,739 | 637,589 |
2020 | 3 | 13,095 | 44,714 | 56,283 | 12,610,824 | 46,659,589 | 59,270,412 | 19,952 | 1,064,693 | 1,255,858 |
2020 | 4 | 30,091 | 89,513 | 119,859 | 20,591,357 | 60,301,889 | 80,893,244 | 38,213 | 649,823 | 662,005 |
2020 | 5 | 35,163 | 99,928 | 135,709 | 26,258,213 | 73,618,083 | 99,876,289 | 47,684 | 1,007,616 | 929,811 |
2020 | 6 | 51,033 | 142,569 | 193,096 | 34,786,076 | 95,171,388 | 129,957,461 | 58,138 | 790,652 | 882,693 |
2020 | 7 | 53,720 | 155,042 | 209,738 | 39,611,015 | 111,876,344 | 151,487,359 | 56,808 | 615,768 | 1,287,117 |
2020 | 8 | 51,330 | 143,291 | 195,037 | 37,549,475 | 102,834,375 | 140,383,850 | 55,912 | 2,183,434 | 860,162 |
2020 | 9 | 50,068 | 132,040 | 182,947 | 35,861,979 | 92,957,247 | 128,819,226 | 32,381 | 1,925,489 | 839,689 |
2020 | 10 | 54,716 | 137,722 | 200,741 | 39,945,510 | 102,236,659 | 141,886,653 | 318,121 | 946,810 | 785,385 |
2020 | 11 | 64,125 | 111,686 | 177,062 | 45,096,171 | 77,885,575 | 122,981,746 | 26,488 | 1,187,438 | 619,643 |
2020 | 12 | 64,840 | 121,149 | 186,852 | 49,065,436 | 87,366,002 | 133,179,589 | 3,277,244 | 1,402,911 | 1,038,164 |
2021 | 1 | 58,225 | 134,387 | 192,272 | 40,878,618 | 92,341,359 | 133,219,977 | 24,293 | 1,437,164 | 867,275 |
2021 | 2 | 49,554 | 108,641 | 157,671 | 22,735,468 | 47,671,511 | 70,406,961 | 17,393 | 964,107 | 644,697 |
There is a total of 3,982,503 tweets with geolocation information, which are shown on a map below:
Languages | Total No. Tweets | Percentage of Tweets |
---|---|---|
English | 958,566,904 | 67.12 |
Spanish; Castilian | 180,950,034 | 12.67 |
Portuguese | 55,194,220 | 3.86 |
French | 40,294,548 | 2.82 |
Bahasa | 38,740,257 | 2.71 |
Others | 154,449,063 | 10.81 |
The sentiment of all the English tweets was estimated using a state-of-the-art Twitter sentiment algorithm, BB_twtr (see code here).
The Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 named entities, mentions (@), and hashtags (#), each followed by its count:
Mentions | Hashtags | NER Person | NER Location | NER Organization | NER Miscellaneous |
---|---|---|---|---|---|
@realDonaldTrump | \#covid19 | trump | us | cdc | covid-19 |
14,106,218 | 82,266,767 | 37,962,335 | 22,549,086 | 9,199,997 | 26,385,506 |
@realdonaldtrump | \#coronavirus | biden | china | trump | americans |
7,153,404 | 38,281,576 | 6,960,748 | 21,760,380 | 3,439,634 | 14,656,079 |
@joebiden | \#covid | covid | uk | senate | coronavirus |
3,380,681 | 8,834,076 | 6,147,950 | 8,309,697 | 2,265,922 | 11,423,419 |
@JoeBiden | \#covid\_19 | donald trump | america | covid | covid |
1,901,092 | 2,252,353 | 3,975,921 | 7,200,629 | 2,139,138 | 8,110,958 |
@mippcivzla | \#lockdown | fauci | india | congress | chinese |
1,508,475 | 1,506,861 | 3,153,327 | 5,222,054 | 1,444,597 | 4,347,302 |
Only tweets in English were collected from 22 January to 31 January 2020; after this period, the algorithm collected tweets in all languages.
There are also some known gaps in the data, shown below:
Date | Time |
---|---|
2020-08-06 | 07:00 UTC |
2020-08-08 | 07:00 UTC |
2020-08-09 | 07:00 UTC |
2020-08-14 | 07:00 UTC |
The notebook Automatically_Hydrate_TweetsIDs_COVID190_v2.ipynb allows you to automatically hydrate the tweet IDs from our COVID19_Tweets_dataset GitHub repository.
You can run this notebook directly in the cloud using Google Colab (see the how-to tutorials) and Google Drive.
To hydrate the tweet IDs using Twarc, you need to create a Twitter Developer Account.
The Twitter API’s rate limits make it difficult to fetch data from tweet IDs, so we recommend using Hydrator to convert the list of tweet IDs into a CSV file containing all data and metadata relating to the tweets. Hydrator also manages Twitter API rate limits for you.
For those who prefer a command-line interface over a GUI, we recommend using Twarc.
Follow the instructions on the Hydrator GitHub repository.
Follow the instructions on the Twarc GitHub repository.
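Both tools take a plain list of tweet IDs as input. As a minimal sketch, the helper below copies the IDs from one of the CSV files into a text file with one ID per line. The function name `write_tweet_ids` is hypothetical, and the code assumes each CSV file has a header row with a `Tweet_ID` column, as the feature table suggests. Duplicate IDs are skipped, since a tweet may appear in several rows of a table.

```python
import csv


def write_tweet_ids(csv_path: str, ids_path: str, id_column: str = "Tweet_ID") -> int:
    """Copy unique tweet IDs from a CSV file into a text file, one ID per line.

    Assumption: the CSV has a header row that includes `id_column`.
    Returns the number of unique IDs written.
    """
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(ids_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            tweet_id = row[id_column].strip()
            # Keep the first occurrence of each ID and preserve file order.
            if tweet_id and tweet_id not in seen:
                seen.add(tweet_id)
                dst.write(tweet_id + "\n")
    return len(seen)
```

The resulting text file could then be loaded into Hydrator or passed to Twarc, for example with `twarc hydrate ids.txt > tweets.jsonl` in Twarc v1. Check the Twarc repository for the command that matches your installed version.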
For questions about the dataset, please contact Dr. Christian Lopez at [email protected], Dr. Caleb Gallemore at [email protected], or Malolan Vasu at [email protected].
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:
Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020