Understanding Covid-19 using Twitter NLP

Rich Larrabee April 20, 2020


Last week, the United States surpassed Italy as the country with the most deaths from the coronavirus, now making the US the epicenter of the virus. The country has been in a lockdown for several weeks, and while it appears the outbreak is beginning to reach its peak, many unknowns remain ahead of us. What are people thinking and doing in response to the current situation, and what is to come next?

In past crises, we’ve had no means of gauging the mood of the American people, but with today’s technology, we can gain some insight into what people are thinking and doing during this unprecedented time.

One new source of information available to us during the 2020 Coronavirus Pandemic is the microblogging messages from Twitter.

According to recent data, there are 30 million daily Twitter users in the United States. Additionally, many of these messages contain geospatial information so we can pinpoint the location of the sender at the time the message was sent. So, what information might these messages carry, and how can we gain insight as to how people are coping as we enter our second month of social distancing and the death count increases daily?

This is an up to date chart of all of the COVID-19 cases around the world since December put together by @AviSchiffmann who is only 17 and paying more attention than most of our country. Big ups Avi! https://t.co/2UrqeRMvbr
— LondonBridge ?SPACE YACHT? (@LBHouseMusic) March 15, 2020

With modern Natural Language Processing (NLP) methods, we can analyze this message traffic to gain insight into what people are communicating. But what can we glean from thousands of tweets regarding the coronavirus? How can we summarize the message traffic and pull out general themes from the information?

One approach is the use of a “Topic Model,” a probabilistic model that communicates information about topics in a body of text (or corpus). Using this method, we can extract general themes from a large body of text, expressed as a probability distribution over topics.

Here is an example built from 42K messages sent from the United States on March 15, each confirmed as US-based from its geospatial location.

While several different algorithms perform topic modeling, I’ll focus on the Latent Dirichlet Allocation (LDA) algorithm, which is widely used for topic modeling, and visualize its output using pyLDAvis. I’ll also explore the Non-Negative Matrix Factorization (NMF) algorithm, which in my observations provided a cleaner set of topics for this data. Both algorithms are unsupervised learning methods that cluster documents for topic analysis; NMF has a reputation for learning more compact topics and producing more succinct labels (my goal). Our models use n-grams (sequences of adjacent words that provide context), since particular phrases such as “social distancing” and “toilet paper” are significant.
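To make the n-gram idea concrete, here is a minimal sketch of how adjacent words can be counted as phrases before any topic model is fit. The stop-word list, the helper names `tokenize` and `ngrams`, and the sample messages are all made up for illustration; the article does not describe its actual preprocessing.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "we", "all", "out", "at", "in", "for"}

def tokenize(text):
    """Lowercase a message and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def ngrams(tokens, n_min=1, n_max=3):
    """Return every run of n_min..n_max adjacent tokens, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Made-up messages, not taken from the article's data set.
messages = [
    "Please practice social distancing to slow the spread",
    "Social distancing is working",
    "The store is out of toilet paper",
]

# Count unigrams, bigrams, and trigrams across all messages.
counts = Counter(g for m in messages for g in ngrams(tokenize(m)))
print(counts["social distancing"], counts["toilet paper"])
```

Because stop words are dropped before the n-grams are formed, words that were not adjacent in the original message can end up in the same phrase, which may be why phrases such as “illness share message” read a little oddly.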

pyLDAvis is a Python library for interactive topic model visualization. It is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data.

The top right lists the overall term frequency of the “Top-30 Most Salient Terms.” Not surprisingly, “social distancing” is at the top of the list. Even at this early date, the previously unfamiliar term “social distancing” had quickly become a central focus of slowing the spread; the generated word cloud pictured below (Topic 2) further illustrates the public awareness of this phrase.
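For readers wondering what “salient” means here, the sketch below assumes the saliency definition from Chuang, Manning, and Heer (2012), which the pyLDAvis documentation cites: a term's overall probability multiplied by how unevenly it is spread across topics. The topic names and counts are invented, and the `saliency` function is my own illustration rather than the library's code.

```python
import math

# Hypothetical term counts per topic; made-up numbers for illustration.
topic_term_counts = {
    "topic_1": {"social distancing": 40, "case": 5, "today": 20},
    "topic_2": {"social distancing": 2, "case": 30, "today": 20},
}

def saliency(topic_term_counts):
    """saliency(w) = P(w) * sum_t P(t|w) * log(P(t|w) / P(t))."""
    total = sum(sum(terms.values()) for terms in topic_term_counts.values())
    # P(t): share of all term occurrences that belong to each topic.
    p_topic = {t: sum(terms.values()) / total for t, terms in topic_term_counts.items()}
    vocab = {w for terms in topic_term_counts.values() for w in terms}
    scores = {}
    for w in vocab:
        term_total = sum(terms.get(w, 0) for terms in topic_term_counts.values())
        p_w = term_total / total
        distinctiveness = 0.0
        for t, terms in topic_term_counts.items():
            p_t_given_w = terms.get(w, 0) / term_total
            if p_t_given_w > 0:
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_topic[t])
        scores[w] = p_w * distinctiveness
    return scores

scores = saliency(topic_term_counts)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data, “today” is frequent but split evenly between the topics, so it ranks last, while “social distancing” is both frequent and concentrated in one topic, so it ranks first.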

On the top left is the “Intertopic Distance Map,” which projects the multidimensional data onto two dimensions. I generated five topics, which can be seen on the graph. Because these are mathematical models, the topics are labeled with the numbers 1 through 5 (logical topic names are derived from the numbered topics and their related words), and their placement represents the distance between topics. The significance of a topic is represented by the area of its circle. As you can see, topics 3 and 5 overlap substantially (both relate to testing and the pending pandemic).
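The distance behind this map can be illustrated with a small sketch. By default, pyLDAvis measures how far apart two topics are with the Jensen-Shannon divergence between their term distributions and then scales those distances down to two dimensions. The three distributions below are invented for illustration and are not the article's fitted topics.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical term distributions over the same four-term vocabulary.
topic_3 = [0.40, 0.40, 0.10, 0.10]
topic_5 = [0.35, 0.45, 0.10, 0.10]
topic_1 = [0.05, 0.05, 0.50, 0.40]

near = js_divergence(topic_3, topic_5)
far = js_divergence(topic_3, topic_1)
print(near < far)
```

Topics that put their weight on the same terms get a small divergence and land close together on the map, which is the kind of overlap described for topics 3 and 5.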

The next step is to review some of the discovered topics.

These visuals are a powerful companion to the LDA algorithm, since you can easily see how the possible topics are grouped, how the related phrases are ranked, and what the related word frequencies are. For this topic, the NMF algorithm captures the following top 10 phrases: “social distancing, practice social distancing, practice social, slow spread, urge social, urge social distancing, stop spread, message urge, illness share message, illness share”.
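The article does not show how the NMF topics were fit, so the following is only a sketch of the underlying idea: factor a non-negative document-by-term matrix V into W (documents by topics) and H (topics by terms), then read each topic's top phrases from the largest weights in its row of H. It uses the classic multiplicative update rules in plain Python on a made-up four-document matrix; in practice a library implementation, such as the one in scikit-learn, would be the usual choice.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h

# Made-up phrase counts: rows are documents, columns follow `vocab`.
vocab = ["social distancing", "slow spread", "public school", "sign petition"]
v = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]

w, h = nmf(v, k=2)
for topic, weights in enumerate(h, start=1):
    top = sorted(zip(weights, vocab), reverse=True)[:2]
    print(topic, [term for _, term in top])
```

Because every entry of W and H stays non-negative, each topic is an additive mix of phrases, which is one reason NMF topics tend to read as compact phrase lists.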

New York now has 729 confirmed cases of #COVID19. Gov Cuomo says they’re trying to expand testing & need federal assistance to do that. One way to speed things up, go from manual lab testing to automated testing. Cuomo says they’d go from processing 60 samples a day to thousands.
— Kevin Rincon (@KevRincon) March 15, 2020

From here, we’ll drill into one section of the country that has become the epicenter of the virus: New York City.

Here we consider only the tweets that originated from the NYC area, using geospatial functions to separate them from the larger group. In mid-March, cases were just beginning to accelerate, and people were petitioning the city government to close the schools to slow the spread. Under immense pressure, Mayor de Blasio closed the nation’s largest public school system several days later. We can also see that Topic 1 is separated from the other topics in the Intertopic Distance Map, making it more distinct.
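One simple way to do this kind of geospatial separation is a bounding-box test on each tweet's coordinates. The sketch below is an assumption about how it could work, not the article's actual code: the `Tweet` class, the `in_box` helper, and the approximate New York City box are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    lat: float
    lon: float

# Approximate bounding box for New York City (an assumption for illustration),
# given as (south, north, west, east) in decimal degrees.
NYC_BOX = (40.48, 40.92, -74.26, -73.70)

def in_box(tweet, box):
    """Return True if the tweet's coordinates fall inside the bounding box."""
    south, north, west, east = box
    return south <= tweet.lat <= north and west <= tweet.lon <= east

# Made-up tweets with city-center coordinates.
tweets = [
    Tweet("Close the public schools", 40.7128, -74.0060),   # New York City
    Tweet("Add case, case death", 34.0522, -118.2437),      # Los Angeles
]

nyc_tweets = [t for t in tweets if in_box(t, NYC_BOX)]
print([t.text for t in nyc_tweets])
```

A rectangle is only a rough stand-in for a city boundary; a polygon test against the real boundary would be more precise, at the cost of needing a geometry library.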

Here, Topic 1, “Public Schools,” is the dominant topic: we see phrases such as “slow spread,” “sign petition,” and “close public,” indicating rising public pressure to close the public school system to slow the spread of the virus. The NYC public school system was shut down later in the week. In this case, the NMF algorithm closely matched the LDA algorithm, with the following top 10 phrases: “public school, close public, close public school, sign petition, slow spread, school slow, school slow spread, public school slow, spread sign petition, spread sign”.

Word cloud for the “Public Schools” topic

As a point of comparison, we next look at what people were tweeting in Los Angeles on the same date. Again, we use geospatial functions to consider only tweets from the LA area. At this point in time, Los Angeles had under 100 cases. In Topic 1, “Case” was the dominant word, but the topic also included phrases such as “add case,” “case death,” and “add case death,” indicating people were aware of the escalating cases and the death count from the virus both here and abroad. On this day, Gov. Newsom announced restrictions enacted within the state, as published in the Los Angeles Times.

Also, Topic 1, “Case,” is more distinct than the others in the Intertopic Distance Map.

Natural Language Processing Twitter for Los Angeles

For this topic, the NMF algorithm captured the following top 10 phrases: “case death, add case, add case death, trump add, trump add case, addition trump, addition trump add, play risk, death play risk, death play”. The context of the references to President Trump was that he had increased the number of cases by downplaying the risk of the virus.

I hope you’ve seen value in the NLP work demonstrated here, as I’m looking for people to collaborate with me and perhaps sponsor this effort to capture all the data to date on the virus and publish it to a website for all to view and study. Until then, connect with me here.

While it is too early to say, I do believe we can gain predictive insights into the spread of the virus and into how we will resume our normal lives once the lockdown is lifted. In my next article, I’ll look at why New York City became the epicenter as compared to cities such as Los Angeles. While New York is much larger, with a higher population density, Los Angeles has had a very different trajectory of cases. Why?

6 Comments

Chris Carter says:
April 20, 2020 at 3:36 pm
I really wish people would stop comparing countries by total cases and total deaths in discussions about which countries are managing this better or worse.
In every other country to country measure, we use per capita numbers.
100 deaths in China is nothing.
100 deaths in Monaco is worrying.
Rob Smithson says:
April 20, 2020 at 3:58 pm
There is a very stark contrast in the math of the first sentence “Last week, the United States surpassed Italy as the country with the most deaths from the coronavirus, now making the US the epicenter of the virus.” vs the mathematical sophistication of the rest.
Looking at deaths per capita paints a very different picture (US at 8th w/ 112 per million, Belgium at 1st with 470, Italy 3rd with 376) than what the first sentence states. Looking a bit deeper, the NYC area dominates all US statistics, treating its 12M people as a country, it is nearly double Belgium’s numbers!
Deanna Ramirez says:
April 20, 2020 at 4:35 pm
Are there any free programs available to get this kind of data from other custom data sets?
Rich Larrabee says:
April 21, 2020 at 9:41 am
Thanks for the feedback, Chris. My intention was to illustrate the spread of the virus and the implications for the United States. While I understand your point regarding the per capita cases and deaths, the United States currently has the highest death count (https://news.google.com/covid19/map?hl=en-US&gl=US&ceid=US:en) of any nation. By almost any measure, the impacts of the virus have been significant and will continue, with millions of people out of work and an economic recession, if not depression, on the horizon. I don’t believe I made any reference to how well or badly the situation is being handled; I only meant to convey what impact the virus is having on the nation and its people.
Be Well.
Rich
Rich Larrabee says:
April 21, 2020 at 10:06 am
Thanks for your comments Rob. I completely agree regarding your comment about the NYC metro area. New York City and the surrounding metro areas (including parts of New Jersey) have the highest per capita cases (https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html) in the nation.
Be Well.
Rich
Rich Larrabee says:
April 21, 2020 at 10:39 am
Hi Deanna, here are a few resources for gathering Twitter data (using the Twitter APIs):
1. https://github.com/twintproject/twint
2. https://github.com/taspinar/twitterscraper
3. https://github.com/twitterdev/search-tweets-python
4. https://github.com/ptwobrussell/Recipes-for-Mining-Twitter
Here is a good book on the topic (https://www.amazon.com/Mining-Social-Web-Analyzing-Facebook/dp/1449388345/ref=sr_1_17?dchild=1&keywords=twitter+api&qid=1587483246&s=books&sr=1-17)
With a google search you can find other resources.
Be Well,
Rich
