Last week, the United States surpassed Italy as the country with the most deaths from the coronavirus, now making the US the epicenter of the virus. The country has been in a lockdown for several weeks, and while it appears the outbreak is beginning to reach its peak, many unknowns remain ahead of us. What are people thinking and doing in response to the current situation, and what is to come next?
In past crises, we had few ways to gauge the mood of the American people in real time, but with today’s technology, we can gain some insight into what people are thinking and doing during this unprecedented time.
One new source of information available to us during the 2020 Coronavirus Pandemic is the microblogging messages from Twitter.
According to recent data, there are 30 million daily Twitter users in the United States. Additionally, many of these messages contain geospatial information so we can pinpoint the location of the sender at the time the message was sent. So, what information might these messages carry, and how can we gain insight as to how people are coping as we enter our second month of social distancing and the death count increases daily?
With modern Natural Language Processing (NLP) methods, we can analyze this message traffic to gain insight into what people are communicating. But what can we glean from thousands of tweets regarding the coronavirus? How can we summarize the message traffic and pull out general themes from the information?
One approach is the use of a “Topic Model,” a probabilistic model that describes the topics in a body of text (or corpus). Using this method, we can extract general themes from a large body of text, along with a probability distribution over those topics.
Here is an example of 42K messages sent from the United States on the 15th of March, with each message’s origin confirmed by its geospatial location.
While several different algorithms perform topic modeling, I’ll focus on the Latent Dirichlet Allocation (LDA) algorithm, which is widely used for topic modeling and which I visualize using pyLDAvis. I’ll also explore the Non-Negative Matrix Factorization (NMF) algorithm, which in my observations provided a cleaner set of topics for this data. Both algorithms are unsupervised learning methods that cluster documents for topic analysis; NMF has a reputation for learning more compact topics and producing more succinct labels (my goal). Our models use n-grams (sequences of adjacent words that provide context), since particular phrases such as “social distancing” and “toilet paper” are significant.
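To make the n-gram idea concrete, here is a minimal sketch of how phrases could be extracted from a single message. The stop-word list, the tokenizer, and the `ngrams` helper are all assumptions for illustration; the actual preprocessing behind the results in this article is not shown, although phrases such as “slow spread” suggest that common words were removed before the n-grams were built.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "to", "and", "of", "is", "in", "for", "we", "on"}

def ngrams(text, n_max=3):
    """Return all 1- to n_max-grams of a message after lowercasing and stop-word removal."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    grams = []
    for n in range(1, n_max + 1):
        # Slide a window of n adjacent tokens across the message.
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

for phrase in ngrams("Practice social distancing to slow the spread"):
    print(phrase)
```

Each resulting phrase, rather than each single word, then becomes a column in the document-term matrix that a topic model is fit to, which is how a two-word phrase like “social distancing” can surface as a single term.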
pyLDAvis, a Python library for interactive topic model visualization, is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data.
On the top right is the overall term frequency of the “Top-30 Most Salient Terms.” Not surprisingly, “social distancing” is at the top of the list. Even at this early date, the previously unfamiliar term “social distancing” had quickly become a central focus of slowing the spread; the generated word cloud pictured below (Topic 2) further illustrates public awareness of this phrase.
On the top left is the “Intertopic Distance Map,” which projects the multidimensional data into two dimensions. I generated 5 topics, which can be seen on the graph. Because these are mathematical models, the topics are labeled with the numbers 1 through 5 (logical topic names are derived from each numbered topic and its related words), and their placement represents the distance between topics. The significance of a topic is represented by the area of its circle. As you can see, topics 3 and 5 overlap substantially (both relate to testing and the pending pandemic).
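For readers curious about what “distance between topics” means here: the pyLDAvis documentation describes its default layout as Jensen-Shannon divergence between topics followed by a projection into two dimensions. The sketch below computes only the divergence step, in plain Python. The three topic-term distributions are made-up numbers for illustration and are not taken from the models in this article.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); zero-probability terms in p contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 only when the two distributions match."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topics, each a probability distribution over the same four terms.
topic_a = [0.50, 0.30, 0.10, 0.10]
topic_b = [0.40, 0.40, 0.10, 0.10]
topic_c = [0.05, 0.05, 0.40, 0.50]

print("a vs b:", js_divergence(topic_a, topic_b))
print("a vs c:", js_divergence(topic_a, topic_c))
```

Two topics that put their weight on the same phrases get a small divergence and land close together on the map, which is the situation the overlapping circles describe; topics built from different phrases get a larger divergence and sit farther apart.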
The next step is to review some of the discovered topics.
These visuals are a powerful companion to the LDA algorithm, as you can easily see how the candidate topics are grouped, how the related phrases are ranked, and what the related word frequencies are. For this topic, the NMF algorithm captures the following top 10 phrases: “social distancing, practice social distancing, practice social, slow spread, urge social, urge social distancing, stop spread, message urge, illness share message, illness share”.
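As a rough illustration of how NMF yields a ranked phrase list like the one above, the sketch below factors a small document-phrase count matrix V into W (documents by topics) and H (topics by phrases) using the classic multiplicative update rules, then reads the top phrases off each row of H. This is a from-scratch teaching version under stated assumptions, not the implementation used for this article: the counts are invented, the helper names are hypothetical, and in practice a library implementation would be the sensible choice.

```python
import random

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (docs x terms) into W (docs x k) and H (k x terms)
    with multiplicative updates that minimize squared reconstruction error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = [[sum(W[i][a] * V[i][j] for i in range(n)) for j in range(m)] for a in range(k)]
        WtW = [[sum(W[i][a] * W[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
        WtWH = [[sum(WtW[a][b] * H[b][j] for b in range(k)) for j in range(m)] for a in range(k)]
        H = [[H[a][j] * WtV[a][j] / (WtWH[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = [[sum(V[i][j] * H[a][j] for j in range(m)) for a in range(k)] for i in range(n)]
        HHt = [[sum(H[a][j] * H[b][j] for j in range(m)) for b in range(k)] for a in range(k)]
        WHHt = [[sum(W[i][b] * HHt[b][a] for b in range(k)) for a in range(k)] for i in range(n)]
        W = [[W[i][a] * VHt[i][a] / (WHHt[i][a] + eps) for a in range(k)] for i in range(n)]
    return W, H

def top_terms(H, vocab, n=3):
    """For each topic (row of H), list the n phrases with the largest weights."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]] for row in H]

def frobenius_error(V, W, H):
    """Squared error between V and its reconstruction W x H."""
    return sum((V[i][j] - sum(W[i][a] * H[a][j] for a in range(len(H)))) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

# Invented phrase counts for four messages; columns follow the order of vocab.
vocab = ["social distancing", "slow spread", "stop spread",
         "public school", "sign petition", "close public"]
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 1],
]

W, H = nmf(V, k=2)
for idx, terms in enumerate(top_terms(H, vocab), start=1):
    print(f"Topic {idx}: {', '.join(terms)}")
```

Because every entry of W and H stays non-negative, each topic is an additive combination of phrases, which is one reason NMF topics tend to read as compact phrase lists.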
From here, we’ll drill into one section of the country that has become the epicenter of the virus: New York City.
Here we consider only the tweets that originated from the NYC area, using geospatial functions to separate them from the larger group. In mid-March, cases were just beginning to accelerate, and people were petitioning the city government to close the schools to slow the spread. Under immense pressure, Mayor de Blasio closed the nation’s largest public school system several days later. We can also see that Topic 1 is separated from the other topics in the Intertopic Distance Map, making it more distinct.
Here, Topic 1 “Public Schools” is the dominant topic, as we see phrases such as “slow spread,” “sign petition,” and “close public,” indicating rising public pressure to close the public school system to slow the spread of the virus. The NYC public school system was shut down later in the week. In this case, the NMF algorithm closely matched the LDA algorithm with the following top 10 phrases: “public school, close public, close public school, sign petition, slow spread, school slow, school slow spread, public school slow, spread sign petition, spread sign”.
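One simple way to perform this kind of geographic separation is a bounding-box test on each message's coordinates. The sketch below is an assumption about how such a filter could look, not the geospatial functions actually used for this analysis: the box coordinates are approximate, and the `lat`, `lon`, and `text` field names are hypothetical.

```python
# Approximate bounding box around New York City (an assumption for illustration):
# (min_lat, min_lon, max_lat, max_lon)
NYC_BBOX = (40.4774, -74.2591, 40.9176, -73.7004)

def in_bbox(lat, lon, bbox):
    """Return True when the point falls inside the rectangular bounding box."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

# Hypothetical geotagged messages.
tweets = [
    {"text": "close public school", "lat": 40.7128, "lon": -74.0060},  # Manhattan
    {"text": "add case", "lat": 34.0522, "lon": -118.2437},            # Los Angeles
]

nyc_tweets = [t for t in tweets if in_bbox(t["lat"], t["lon"], NYC_BBOX)]
print(len(nyc_tweets), "of", len(tweets), "messages fall inside the NYC box")
```

A rectangle is only a coarse stand-in for a city boundary, so a more careful version would test each point against the actual city polygon; the same filter with a different box could be used to select another city.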
As a point of comparison, we next look at what people were tweeting in Los Angeles on the same date. Again, we use geospatial functions to consider only tweets from the LA area. At that point, Los Angeles had under 100 cases. In Topic 1, “Case” was the dominant word, but the topic also included phrases such as “add case,” “case death,” and “add case death,” indicating people were aware of the escalating cases and the death count from the virus both here and abroad. On this day, Gov. Newsom announced restrictions enacted within the state, as published in the Los Angeles Times.
Also, Topic 1 “Case” is more distinct than the others in the Intertopic Distance Map.
For this topic, the NMF algorithm captured the following top 10 phrases: “case death, add case, add case death, trump add, trump add case, addition trump, addition trump add, play risk, death play risk, death play”. The context of the references to President Trump was the claim that he had increased the number of cases by downplaying the risk of the virus.
I hope you’ve seen value in the NLP work demonstrated here. I’m looking for people to collaborate with me, and perhaps sponsor this effort, to capture all the data to date on the virus and publish it to a website for all to view and study. Until then, connect with me here.
While it is too early to say, I do believe we can gain predictive insights into the spread of the virus and into how we will resume our normal lives once the lockdown is lifted. In my next article, I’ll look at why New York City became the epicenter when cities such as Los Angeles did not. While New York is much larger and more densely populated, Los Angeles has had a very different trajectory of cases. Why?