Warehousing & Analytics:
NYT Covid Data
My most robust and detailed project: a visual tool for comparing Covid case rates by county across the USA, built on three years of Covid data from the New York Times. The program ingests data from the NYT GitHub repository, cleans it, and uses a set of functions to compare cases over time and across regions.
Click here to view the dashboard in Tableau.
Skills/important points:
Star schema
Large dataset—3 million+ data points
Data cleaning in Pandas
Tableau dashboard building
Overview:
I had always been frustrated by the limits of COVID-19 dashboards during the pandemic; I wanted to see the data interpreted in different ways. During 2022, I started really digging in and learning Python, SQL, and shell scripting. At the end of my first course, I realized that I could probably interpret the data myself with the tools I had acquired.
Here were the main goals from the start:
to compare case trends between regions
to compare case trends from different years on the same axis
to calculate an informal ‘infection level’ for each region using its population
to compare regional outcomes for case numbers against their populations
Gathering the data
Throughout the pandemic, I had used the NYT Covid dashboard the most, since the New York Times is a fairly well-trusted source. The Times seemed to collect the most diverse data, both by drawing on the Johns Hopkins data and by reaching out to local health authorities (as mentioned in their methodology). They also include probable cases, which likely accounts for some of the irregularities in county-level data.
The New York Times has since stopped collecting this data, so the dataset has gone stale. Still, as historical data, it remains a comprehensive record of case movement over the three years.
The first task was to download the necessary CSV files from the NYT GitHub repository. The US total, US states, and county-level data are kept in separate files for simplicity. The county-level data also grew unwieldy, so the NYT data team split it by year. The US and state totals were easy enough to work with, but the county files needed to be concatenated to show trends over the three years.
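In pandas, that step looks roughly like this (a minimal sketch; the raw-file names are assumed from the repo layout, so adjust as needed):

```python
import pandas as pd

BASE = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/"

# US and state totals are single files
US = pd.read_csv(BASE + "us.csv")
States = pd.read_csv(BASE + "us-states.csv")

# County-level data is split by year; read each file and stack them
county_files = [f"us-counties-{year}.csv" for year in (2020, 2021, 2022, 2023)]
Counties = pd.concat(
    (pd.read_csv(BASE + name) for name in county_files),
    ignore_index=True,
)
```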
A quick .info() on each frame to check what came in:

```python
US.info()
States.info()
Counties.info()
```
At first, I used the states and US datasets to extract state-level data, just for the sake of simplicity. Eventually, I wanted to aggregate the county data and use that as the primary data source.
The columns are very simple: identifiers (date, state, county, and FIPS code where applicable) plus cumulative cases and cumulative deaths.
On import, make sure the date column is parsed as a datetime type.
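For example, with the frames loaded above:

```python
# Make sure the date columns are real datetimes
# (or pass parse_dates=["date"] to read_csv in the first place)
for df in (US, States, Counties):
    df["date"] = pd.to_datetime(df["date"])
```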
The next dataset to bring in was the US Census data, which was much more complex.
```python
census_data.info()
```
Originally, I had planned to create a function to take into account the population estimates over the years, but for the sake of brevity I will only use the 2020 census base population.
It’s here that my very informal, inexperienced exploratory analysis began.
The EDA:
I started by cherry-picking locations I was interested in: my hometown (Kanawha County, WV), where I live now (Cook County, IL), where my family members live, and so on. This is how I learned to build the visualizations I wanted, using pyplot from matplotlib. My code usually worked, but as I panned out further, I found many exceptions in the dataset. Not every state has counties: some have parishes, some have boroughs. Getting one naming convention to work across the board was a challenge. This would be solved later by creating a star schema and assigning unique IDs to each state, county, and region.
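Here is a rough sketch of the kind of plot I started with (the counties are hard-coded purely for illustration):

```python
import matplotlib.pyplot as plt

# Cumulative cases for a couple of hand-picked counties
picks = [("Kanawha", "West Virginia"), ("Cook", "Illinois")]

fig, ax = plt.subplots()
for county, state in picks:
    mask = (Counties["county"] == county) & (Counties["state"] == state)
    subset = Counties.loc[mask].sort_values("date")
    ax.plot(subset["date"], subset["cases"], label=f"{county}, {state}")

ax.set_ylabel("Cumulative cases")
ax.legend()
plt.show()
```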
At first glance trends looked incredibly similar, especially when set against a region’s population. This would need to be tested later against regions I had not selected for personal reasons.
Missing and unused data:
In the NYT dataset, there are case counts that are not tied to a region. Per their methodology, the NYT data team includes cases that county health departments report with an “unknown” location, meaning the patient’s county of residence is unknown. The methodology also notes that location mostly reflects where the case was treated, rather than where the patient is from.
I downloaded a table of county- and state-level FIPS codes from the Census Bureau to match against the dataset and patch the holes.
Also, the original Census data does not contain island territory populations, while the NYT data does include those regions. After debating whether to include the island territories, I decided I wanted a more comprehensive dataset. Finding island population data was difficult, but the Census Bureau released 2020 base populations by region/municipio/county.
So now I had six data sources: my original NYT data and five CSV files for population. As the files grew, I realized it would be best to create a database to organize the information.
Creating the Schema
The schema was necessary to connect the population data to the case data. To build it, I organized the data into four tables: dimState and dimCounty for population information, dimDate for date details, and factCases for the case data.
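Roughly, the four tables ended up shaped like this (the column names here are illustrative, not the exact final schema):

```python
# Rough shape of the star schema: three dimensions and one fact table
dimState = pd.DataFrame(columns=["state_id", "state", "state_fips", "population"])
dimCounty = pd.DataFrame(columns=["county_id", "county", "state", "county_fips", "population"])
dimDate = pd.DataFrame(columns=["date_id", "date", "year", "month"])
factCases = pd.DataFrame(columns=["date_id", "county_id", "cases", "deaths"])
```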
The first challenge was getting the list of regions in the NYT data to match the regions in the Census data. The Census data did not include any US island protectorates, so their population figures were sourced from other Census Bureau documents. It should be noted that Puerto Rico’s 2020 population figure is an estimate, as Puerto Rico has its own census methods.
Fixing the state data was easy enough: just add tuples with the population data to the state table. Counties presented a larger challenge. Since counties often share names with counties in other states, keeping the rows unique took some care. The FIPS data for counties was actually pretty accurate; the issues were the FIPS numbers for American Samoa and Guam, and the codes for cities like New York City, Kansas City, and Joplin, MO. I tried to disaggregate the city data by county, but it just never came out right. So I found their federal city codes and used those as their FIPS, along with their state code.
While I had originally thought to just use FIPS codes to index everything, I realized I would be storing many large integers if I went that route, so I opted for an incremental primary key. After much debate, I decided to leave the FIPS data in the tables, since it can be used to further identify each region.
With that group of NaNs gone, the next issue was deaths. For me the solution was simple: just zero them out. The only region that did not report deaths at the county level was Puerto Rico. If I do end up using Puerto Rico’s death data, maybe I can come up with some way to impute the county-level figures, but for now, fillna(0) will have to do.
There was also a whole category of cases with known states but “Unknown” counties. It was hard to decide how to treat these. Originally, I thought I could just drop them, but that would eliminate about 1 percent of the data. For now, I have added an “Unknown” county row for each state to the county table. It adds 50-odd rows and some redundant data, but right now I just need a solution that gives me quick access to the data.
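A sketch of both fixes, using the column names from earlier (the exact rows appended to dimCounty may differ):

```python
# Zero out missing county-level deaths (Puerto Rico is the main gap)
Counties["deaths"] = Counties["deaths"].fillna(0)

# One "Unknown" row per state; these then get appended to dimCounty
# alongside the real counties so the cases keep a join target
unknown_rows = pd.DataFrame({
    "state": sorted(Counties["state"].unique()),
    "county": "Unknown",
})
```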
For generating the ID columns themselves, I originally planned to make a dictionary with the unique values as keys and a range of numbers as values, then use apply() to match them to their values in the table. This worked for the most part, but when I created county_id there were duplicates wherever multiple counties shared the same name. One way to deal with this would have been a MultiIndex table; the other option was to generate a unique index number for each combination of state and county. I took the latter, since at this moment it *seems* like it would use less memory long term. Also, instead of using apply(), I opted for pd.merge(), mainly because it is vectorized and can validate the join.
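Sketched out, the approach looks something like this (the key and column names are mine):

```python
# Unique (state, county) pairs each get an incremental surrogate key
county_keys = (
    Counties[["state", "county"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
county_keys["county_id"] = county_keys.index + 1

# Merge the key back onto the case data; the validate argument catches
# any accidental duplicates in the lookup
Counties = Counties.merge(
    county_keys,
    on=["state", "county"],
    how="left",
    validate="many_to_one",
)
```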
Loading the Data to Tableau
Since this is not a small dataset (around 3.5 million rows), I was a bit concerned about performance in Tableau Public. With Tableau Public, your options are much slimmer: no Apache formats such as Feather or Parquet, and no outside database connections. Just good old CSV files. I was able to keep my four files under a gigabyte, which was promising.
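The export itself is just pandas’ to_csv, one file per table (file names here are assumed):

```python
# Tableau Public only takes flat files, so each table goes out as a CSV
tables = {
    "dimState": dimState,
    "dimCounty": dimCounty,
    "dimDate": dimDate,
    "factCases": factCases,
}
for name, table in tables.items():
    table.to_csv(f"{name}.csv", index=False)
```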
Organizing the data into a star schema ended up being perfect for Tableau. While I later discovered how many useful data-blending features Tableau offers, the star schema let everything communicate smoothly and clearly. It also allowed certain calculations to run on ID numbers instead of their categories, which saved a bit of compute time in a few places.
After lots of trial and error, I finally got everything cleaned up and ready to go in Tableau.
Creating the Visualizations
I should begin this section with a disclaimer: I am not an epidemiologist. I’m admittedly a novice in statistics, but I would like to think I can ask good questions. Questions like “How has the ratio of deaths to cases changed over the three years?” or “If you lay each year on top of the others, do you see yearly trends?”
The first chart I worked on was a map that displays the population data for each state and county. It was trickier than expected, and it is still a work in progress: state borders are visible, but selection only works at the county level. I am still learning how to drill down from state to county efficiently.
The raw data provides a cumulative count for each county by day. From that, I created a calculated field for new cases per day and their 7-day average. I mostly added the 7-day average to mimic the dashboards I saw from the New York Times during the pandemic.
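The dashboard does this with Tableau calculated fields; the pandas equivalent is roughly the following (clipping negative revisions to zero is my own choice here):

```python
# Daily new cases from the cumulative counter, then a trailing 7-day average
Counties = Counties.sort_values(["state", "county", "date"])
Counties["new_cases"] = (
    Counties.groupby(["state", "county"])["cases"]
    .diff()
    .clip(lower=0)  # guards against days where the cumulative count was revised down
)
Counties["new_cases_7day"] = (
    Counties.groupby(["state", "county"])["new_cases"]
    .transform(lambda s: s.rolling(7, min_periods=1).mean())
)
```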
From new cases, I calculated something similar to an epidemiology curve, with an assumed contagious period of two weeks. I realize that two weeks is a big assumption, but I chose it because it is an easy period to work with and sits close to the midpoint of the CDC’s recommended quarantine periods of 10 to 20 days. The purpose of using a windowed sum instead of an average was so that the figure might reflect the number of potentially contagious people in a region. This makes a number of assumptions: not every case is contagious for the same length of time, some people get tested at the end of their contagious period, and so on. With the data we have, there is no way to recover those case-level details; all we have is a climbing ticker.
Finally, I divided the two-week windowed sum by the population of the selected region(s) to get a figure similar to an incidence rate.
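Continuing the same sketch, assuming a population column has been joined in from dimCounty:

```python
# 14-day rolling sum of new cases as a rough "currently contagious" figure
Counties["active_14day"] = (
    Counties.groupby(["state", "county"])["new_cases"]
    .transform(lambda s: s.rolling(14, min_periods=1).sum())
)

# Scale by population (assumes a population column joined from dimCounty)
Counties["active_rate"] = Counties["active_14day"] / Counties["population"]
```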
Now I had all of the measures I was interested in; the trick was plotting them together, which is fairly difficult in Tableau. In Python, it’s very easy to plot more than one line on an axis.
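For example, in matplotlib it is just repeated ax.plot calls (reusing the sketch columns from above):

```python
import matplotlib.pyplot as plt

# Several measures on one axis for a single county
cook = Counties[(Counties["state"] == "Illinois") & (Counties["county"] == "Cook")]

fig, ax = plt.subplots()
ax.plot(cook["date"], cook["new_cases_7day"], label="7-day average of new cases")
ax.plot(cook["date"], cook["active_14day"], label="14-day windowed sum")
ax.legend()
plt.show()
```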