2020 has been rough, there’s no denying that. With COVID and the election, there has never been a better time to really dig into data and see what trends may exist.
Earlier this year, I got quite dismayed with data reporting by many news sources. One particular article on CNN noted that (and I paraphrase) “five states now make up a third of all Covid-19 cases.”
The problem with this statement is that it was blatant clickbait: the five states listed also made up a third of the US population, so a third of the cases is about what you would expect.
That brings me to one of the biggest issues I have seen with data-based reporting in 2020: context.
Before we get started, I’ll point out that there are plenty of pre-existing sources available to see data on Covid. After the fact, for example, I stumbled upon a few (now defunct) great dashboards on Covid stats in Texas.
Additionally, for the purpose of this document, I’m not linking to my own Tableau Public account. My laptop at the time of writing is too old to run Tableau Desktop.
A Look at Texas and COVID-19
Let’s take a look at Texas, a state that ranks in the top 3 for total population within the United States. For scale, the Dallas-Fort Worth-Arlington area alone has approximately 7 million people, which is roughly the same as the entire population of my previous home state of Washington.
To begin, I decided to use Tableau, a data analytics tool and dashboard that’s being used in 40 or more states to report on Covid. I have spent the past several months learning the program and have already used it in several presentations at work to generate and analyze data. But until now, I haven’t used its mapping capabilities.
First, here are the data sets I pulled:
Existing dashboards, most relevant being that of the State of Texas, show totals by county. Total cases by county. Total fatalities by county. So on.
But while those numbers represent cumulative statewide infection and mortality totals by county, I wanted to dig a bit deeper and get a better idea of estimated infection and mortality rates.
Why does this matter?
Let’s take a look at two Texas counties. As of December 16, 2020, Dallas County has reported 147,591 confirmed cases of Covid. Childress County, for comparison, has reported just 1,087 cases.
No comparison, right?
But if we look at cases relative to population, in other words the infection rate, we can compare the two counties on equal footing.
Dallas has 147,591 confirmed cases in a population of 2.647 million, which accounts for 5.57% of the population. Childress has 1,087 cases in a population of 7,052, which accounts for a whopping 15.41% of the county’s population.*
*County population uses estimated population as of January 1, 2020 as reported on the state demographics website.
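The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration using the figures quoted above; the function name `infection_rate` is my own, and because the Dallas population here is the rounded 2.647 million, its result may differ in the last digit from the percentage quoted.

```python
def infection_rate(confirmed_cases: int, population: int) -> float:
    """Return confirmed cases as a percentage of the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100 * confirmed_cases / population


# (confirmed cases, estimated population) as quoted in the text.
# The Dallas population is the rounded figure, not the exact estimate.
counties = {
    "Dallas": (147_591, 2_647_000),
    "Childress": (1_087, 7_052),
}

rates = {name: infection_rate(cases, pop) for name, (cases, pop) in counties.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% of the population has a confirmed case")
```

Raw counts make Dallas look far worse. Dividing by population reverses the ranking.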
Population Density of Texas Counties
Tableau allows users to plot data on maps when certain geographic-based fields exist. It does this using Mapbox. The beauty of this is that it lets us see geographic trends at a glance.
So, the first step was to take a look at the different counties in Texas and visually see population counts by county. I called this Population Density and color coded it on a green to blue gradient, with darker green being less dense rural counties and darker blue being more populous, more urban counties.
I set the darkest threshold at 150,000 and the midpoint at 15,000. There are obviously more counties with populations greater than 150,000; the aforementioned Dallas being a prime example. The midpoint was set as a rough visual with approximately half of counties showing blue (populations greater than the midpoint) and roughly half showing green (below the midpoint).
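To make the thresholds concrete, here is a rough sketch of how a two-sided color scale like this one could place a county on the gradient. It is not how Tableau computes colors internally. The function name `color_position` and the low end of zero are my assumptions, since only the midpoint and the upper threshold are stated above.

```python
MIDPOINT = 15_000        # from the text: boundary between green and blue
DARKEST_BLUE = 150_000   # from the text: at or above this, the darkest blue
DARKEST_GREEN = 0        # assumption: the low end is not stated in the text


def color_position(population: int) -> float:
    """Map a population to [-1, 1]: -1 darkest green, 0 midpoint, +1 darkest blue."""
    if population >= MIDPOINT:
        span = DARKEST_BLUE - MIDPOINT
        # Clamp so every county above the threshold shares the darkest blue.
        return min((population - MIDPOINT) / span, 1.0)
    span = MIDPOINT - DARKEST_GREEN
    return max((population - MIDPOINT) / span, -1.0)


for name, population in [("Dallas", 2_647_000), ("Childress", 7_052)]:
    print(name, round(color_position(population), 2))
```

The clamp is why Dallas and any other county over 150,000 look identical on the map: the scale stops distinguishing them once they pass the threshold.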
Let’s look at the chart.
Based on the visual above, we can see that the more urban counties are on the eastern side of the state, with the darkest clustered along the border, along the coast, and in the central part of the state. West Texas is quite rural.
Great! Now we have an idea of where the Texas population resides.
Now let’s look at confirmed cases.
Cases & Fatalities by Texas County
The two charts within this section will demonstrate the folly behind much of the Covid reporting. We will look at maps of total cases and total fatalities by Texas county.
Here are the confirmed cases plotted by county:
Now here are the fatalities plotted by county:
Remember the population density chart? Look familiar? Obviously, more populous counties will report more cases simply because there are more people. Likewise, higher case counts will lead to higher fatality counts.
So, how do we even the playing field a little, and compare densely populated counties with sparsely populated ones?
I generated the Infection Rate as a calculated field, simply taking the last reported Confirmed Cases number per county from the Texas Covid website data export and dividing it by the estimated county population as of January 1, 2020 per the Texas Demographics website.
Next is the mortality rate, meaning reported fatalities as a share of confirmed cases. Johns Hopkins calls this the “Observed Case-Fatality Ratio” and reports it at 1.8% for the United States as of December 18, 2020. For this chart, I capped the color scale at 5%. Among the highest is Kenedy County in Southeast Texas at 15.38% as of December 16, 2020.
We can see that the highest rates are reported in north-central Texas and along the Louisiana border. Does this mean that individuals in these counties are more likely to die of Covid if they get it? Not necessarily, and here’s why:
If we jump back to the infection rate chart, there have been fewer reported cases in many of the counties along the Louisiana border. This could point to unreported spread, with the actual case count being a lot higher than reported in these counties.
Other factors that may come into play: testing capabilities and availability within these counties, perception of Covid as a threat within these counties, behavioral components like mask-wearing, etc.
Deviations & Outliers
When looking at any data, you must poke holes in it. Only by questioning the data do you glean better insight into why outliers and deviations exist. I’m going to highlight a few questions, and provide hypotheticals as to why.
Why do some counties with extremely high infection rates have extremely low mortality rates? And vice versa: why do some counties with extremely high mortality rates have extremely low infection rates?
Our analysis is only as good as the data we’re given.
Let’s look at Childress, which has the highest infection rate in Texas at 15.41% but an extremely low mortality rate of 0.28%. If the infections are recent, it could be that mortality data for the county does not exist yet, either because many of the cases are still active or because an overwhelmed mortician has not yet had a chance to pass along findings.
Now take a look at Kenedy in SE Texas near the Mexico border. The county has a population of just 378. Thirteen have been confirmed as having contracted Covid (3.44% of the county population) and two have died for a mortality rate of 15.38%. The high mortality rate could simply be due to the limited data set, or it could point to higher unconfirmed / unreported cases.
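A quick sketch shows how fragile a rate built on thirteen cases is. It uses only the Kenedy figures above, and the helper name `percent` is my own. The "one fewer" and "one more" lines are hypotheticals to show sensitivity, not reported data.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100 * part / whole


# Kenedy County figures as quoted in the text.
population = 378
confirmed_cases = 13
deaths = 2

infection = percent(confirmed_cases, population)
mortality = percent(deaths, confirmed_cases)

# Hypothetical: how far the rate swings if a single death is added or removed.
one_fewer = percent(deaths - 1, confirmed_cases)
one_more = percent(deaths + 1, confirmed_cases)

print(f"Infection rate: {infection:.2f}%")
print(f"Mortality rate: {mortality:.2f}%")
print(f"With one fewer death: {one_fewer:.2f}%")
print(f"With one more death: {one_more:.2f}%")
```

A single death moves the county's mortality rate by several percentage points, which is why I would treat small-county outliers as a prompt for questions rather than a conclusion.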
What else do we have to take into account?
There are a multitude of other factors we have to take into account. Here are a few:
- What are the perceptions of Covid as a threat between urban and rural locations given the politicized nature of the pandemic?
- How available is testing across all counties, and how willing is the population to get tested?
- What are differences in reporting between counties for confirmed cases and fatalities?
- How do the trends look over time, given we are merely accounting for cumulative county data at this point in the pandemic? Could current spikes in cases be skewing mortality rates down given an overwhelmed system?
What are your thoughts? How would you look at the data differently?