California air quality study with the R statistical computing package.



The quantity of data freely available on the internet still amazes me. Take air quality, for example. The EPA has been collecting huge amounts of data on various air pollutants all over the United States. The data is freely available from this link. Often the challenge is taking that data and extracting the relevant information and statistics. As part of a data science specialization course I took, I learned to use the R statistical computing language to perform this type of task, so I thought of using it to analyze some of the local (California) air quality metrics. The questions I set out to answer were: how has pollution in my region been changing over the last few years? And how does it compare with other regions in the state?

In this investigation I focused on "PM2.5 Particulate Matter" as a measure of air pollution. The EPA website defines it best:

"What is PM2.5? Particulate matter, or PM, is the term for particles found in the air, including dust, dirt, soot, smoke, and liquid droplets."

"Where does PM2.5 come from? Sources of fine particles include all types of combustion activities (motor vehicles, power plants, wood burning, etc.) and certain industrial processes."

Let's find out more about PM2.5 measurements in this region.

The code 

R is a very powerful statistical package and programming environment. It allows complex statistical analysis in just a few lines of code that could take hours to write in other programming languages. The code listing in Figure 1 is all it takes to process the data files from the website linked above. Here I'm focusing on the data contained in the "PM 2.5 - Local Conditions" row on that page. This data consists of fairly large datasets, one for each year. The code below assumes you saved these dataset files in the working directory. Each file contains data for the various states, counties and "Sites". The latter are measurement stations dispersed around the country. In the example below, I focused on one location in the Sacramento region and plotted the levels for each year. Data for this site goes back only a few years, since this pollution measurement policy is relatively new. I used the 'ggplot2' and 'dplyr' packages to help process these large datasets.



Figure 1 - R Code. Retrieving data for a particular county, state and location. Line Plot
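For readers who want something to copy and paste, here is a minimal sketch of this kind of script. It is not the exact listing from Figure 1. The column names (State.Code, County.Code, Site.Num, Date.Local, Arithmetic.Mean), the site number and the file name in the comment are assumptions, so check them against the header of the file you downloaded. A tiny made-up data frame stands in for the real file so that the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)

# Keep the readings of one monitoring site, sorted by date.
# Column names are assumptions; adjust them to match your file.
site_daily <- function(pm, state, county, site) {
  pm %>%
    filter(State.Code == state, County.Code == county, Site.Num == site) %>%
    mutate(Date.Local = as.Date(Date.Local)) %>%
    arrange(Date.Local)
}

# Line plot of the daily readings, with the median drawn as a dashed line.
plot_site <- function(daily) {
  med <- median(daily$Arithmetic.Mean, na.rm = TRUE)
  ggplot(daily, aes(x = Date.Local, y = Arithmetic.Mean)) +
    geom_line() +
    geom_hline(yintercept = med, linetype = "dashed") +
    labs(x = "Date", y = "PM2.5", title = paste("Median =", med))
}

# Made-up stand-in for one yearly file. With real data this would be
# something like: pm <- read.csv("pm25_2013.csv")   # hypothetical file name
# Codes are stored as numbers here; if your file reads them as text,
# compare against strings such as "06" instead.
pm <- data.frame(
  State.Code      = c(6, 6, 6, 6),        # 6 = California
  County.Code     = c(67, 67, 67, 61),    # 67 = Sacramento, 61 = Placer
  Site.Num        = c(6, 6, 6, 6),        # hypothetical site number
  Date.Local      = c("2013-01-03", "2013-01-01", "2013-01-02", "2013-01-01"),
  Arithmetic.Mean = c(9, 12, 7, 3)
)

daily <- site_daily(pm, state = 6, county = 67, site = 6)
p <- plot_site(daily)
print(p)
```

Running the same two functions on each yearly file gives one plot per year.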

The next three figures show measurements for the years 2007, 2010 and 2013, along with the calculated median for the year at this site.


 Figure 1 - 2007 Data. Median = 8



 Figure 2 - 2010 Data. Median = 6.15


  Figure 3 - 2013 Data. Median = 6.8

It's interesting to observe the seasonal behavior in these plots. In Sacramento County, I've seen a tendency for particulate matter to increase in the early and late months of the year (roughly October to February), which I found slightly surprising. There's also a small bump around the June-August time frame (the summer months). As for year-to-year changes, it's hard to draw a conclusion from this data. One can say the median improved from 2007 to 2013, but it's far from clear that this is an actual trend and not just random variation. Although not shown, I also plotted intervening years and calculated median pollution levels:

Year    Median PM2.5
2007    8
2008    8.8
2009    7.1
2010    6.15
2011    7.15
2013    6.8
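Yearly medians like these can be computed in one pass once the readings for a site are stacked into a single data frame. The snippet below is a rough sketch of that step, not the script I actually ran. The column names Date.Local and Arithmetic.Mean are assumptions, and the readings are made up for illustration.

```r
library(dplyr)

# Median of the daily readings for each calendar year.
yearly_median <- function(pm) {
  pm %>%
    mutate(Year = as.integer(format(as.Date(Date.Local), "%Y"))) %>%
    group_by(Year) %>%
    summarise(Median.PM25 = median(Arithmetic.Mean, na.rm = TRUE)) %>%
    arrange(Year)
}

# Made-up readings for one site across two years (not real EPA values).
site <- data.frame(
  Date.Local      = c("2007-01-01", "2007-06-01", "2007-12-01",
                      "2008-01-01", "2008-06-01"),
  Arithmetic.Mean = c(10, 4, 8, 9, 5)
)

medians <- yearly_median(site)
print(medians)
```

With the real files, each year would be read separately and stacked with bind_rows() before calling yearly_median().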

Comparing County Data

So, if the year-to-year data isn't conclusive, what can we say when we compare Sacramento County to other counties in the region? As is evident in Figure 4, California is a large and geographically diverse state with mountainous regions, coastal regions and even deserts. For the purpose of this informal study, I focused on a few counties I'm familiar with: Sacramento, Placer, San Francisco, and Los Angeles.


 Figure 4 - California County Map

Figures 5 and 6 show the code that I used to collect county data and produce composite plots. Unlike the previous example, I aggregated all the site monitors for each county and calculated the median for each day of the year (2013 in this case, the latest year for which data was available as of this writing). This gives us a more accurate sense of the pollution levels in each of these distinct regions.



Figure 5 - R Code. Retrieving data for a particular county and state
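The per-county aggregation can be sketched roughly as follows. This is an illustration of the idea, not the listing from Figure 5. As before, the column names are assumptions and the data frame is made up. The county codes are the standard FIPS codes: 67 for Sacramento, 61 for Placer, 75 for San Francisco and 37 for Los Angeles.

```r
library(dplyr)

# For one county, take the median across all of its sites for each day.
county_daily_median <- function(pm, state, county, label) {
  pm %>%
    filter(State.Code == state, County.Code == county) %>%
    mutate(Date.Local = as.Date(Date.Local)) %>%
    group_by(Date.Local) %>%
    summarise(Median.PM25 = median(Arithmetic.Mean, na.rm = TRUE)) %>%
    mutate(County = label)
}

# Made-up readings: three Sacramento sites and two Placer sites.
pm <- data.frame(
  State.Code      = 6,
  County.Code     = c(67, 67, 67, 67, 61, 61),
  Site.Num        = c(1, 2, 3, 1, 1, 2),
  Date.Local      = c("2013-01-01", "2013-01-01", "2013-01-01",
                      "2013-01-02", "2013-01-01", "2013-01-01"),
  Arithmetic.Mean = c(10, 14, 12, 8, 3, 5)
)

# Stack the counties into one long data frame, ready for plotting.
counties <- bind_rows(
  county_daily_median(pm, 6, 67, "Sacramento"),
  county_daily_median(pm, 6, 61, "Placer")
)
print(counties)
```

Keeping the result in long form, with one row per county per day, makes it easy to hand to ggplot2 afterwards.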



 Figure 6 - Combining Plots in R

Figures 7 and 8 contain the same data but with different vertical scales. Figure 7 shows us that there are some "outlier" days, even for counties with otherwise low pollution levels such as Placer County. This could be due to rare events such as forest fires or unusual atmospheric conditions. Figure 8 uses a 'tighter' vertical scale, allowing us to better resolve the differences between the counties.
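One way to get these two views out of a single plot object is sketched below. It assumes a long data frame with Date.Local, Median.PM25 and County columns (names chosen for illustration), filled here with made-up values. The 0 to 20 range is arbitrary. I use coord_cartesian() for the tighter view because it zooms in without dropping the outlier points from the data.

```r
library(ggplot2)

# Made-up daily medians for two counties, with one outlier day.
counties <- data.frame(
  Date.Local  = rep(as.Date("2013-01-01") + 0:2, times = 2),
  Median.PM25 = c(12, 8, 45, 4, 5, 3),
  County      = rep(c("Sacramento", "Placer"), each = 3)
)

# One line per county on shared axes, full vertical range.
full <- ggplot(counties,
               aes(x = Date.Local, y = Median.PM25, colour = County)) +
  geom_line() +
  labs(x = "Date", y = "Daily median PM2.5")

# Same plot, zoomed in on the lower part of the vertical axis.
tight <- full + coord_cartesian(ylim = c(0, 20))

print(full)
print(tight)
```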


 Figure 7 - Comparing Several Counties



Figure 8 - Comparing Several Counties (tighter scale)

There are not too many surprises here. As one would expect, LA pollution levels are consistently higher than in the other three counties. On the other end of the scale, Placer County, being mostly rural and mountainous, has the lowest pollution levels. Sacramento and San Francisco are harder to compare, and it surprised me that Sacramento actually had lower pollution levels for part of the year. Since San Francisco is near the coast, I was expecting a better showing...

Hopefully this study is of interest to anyone concerned with pollution levels. The scripts provided can be easily adapted to any other county in the United States by simply changing the county and state codes.


Comments, questions, suggestions? You can reach me at: contact (at sign) paulorenato (dot) com