What is the best time to fly if you want to be on time?
Local Airport Departure Delay Analysis with R
If you travel frequently by plane, or even if you travel only occasionally but would like to understand the common delay pattern as a function of the time of day, there's an excellent article by 'dan' on the r-bloggers.com blog.
This article inspired me to do a similar analysis for the two local airports I most commonly use: Sacramento International (SMF) and San Francisco International (SFO). Once you understand the patterns, you stand a better chance of booking a flight that will be on time and avoid the "problem spots".
Fortunately, the original author's R code is easily adapted by changing a few lines. Let's have a look.
Average Departure Delays
The 2013 data is available from the TSA as noted in the r-bloggers article. Fortunately, one can download cleaned-up data from that article through the link kindly provided by the original author:
With the data available, one simply has to do some additional processing to filter for SMF and SFO and plot the data for both. So here's what the average departure delays look like for each airport:
Figure 1 - Average Flight Delays per Hour of Day. SFO vs SMF
Figure 1 shows the resulting plot. Delay values (y axis) are in minutes. It is interesting to see that the SFO departure delays appear, on average, significantly higher than in Sacramento. However, both airports reveal a pattern (also mentioned in the original article) where early morning flights are mostly on time, whereas things tend to get bad around noon. You would have to wait until 10PM or so for things to improve again, though the SMF and SFO patterns are somewhat different here.

The plot in Figure 1 includes rough "error bars" that the original author intended as showing the mean +/- 1 standard error. These bars describe the uncertainty in each hourly average, not the spread of individual flights: for a normal distribution, about 68% of the flights would fall within one standard deviation of the mean, which is a much wider band than the standard error. Either way, the bars may be a bit misleading here. The distribution is skewed and doesn't look "normal" at all.

Let's look, for example, at the distribution of delays for Sacramento departures scheduled for 12 Noon. Figure 2 shows that most delays are below 20 minutes, though there is a "long tail" to the distribution (and I haven't even shown the most egregious delay cases in order to keep the scale reasonable).
Figure 2 - Distribution of Departure Delays - Flights scheduled for Noon Departure
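To get a feel for how little of a skewed distribution sits inside a mean +/- 1 standard error band, here is a minimal sketch on simulated data. The exponential "delays" below are purely hypothetical (they are not the 2013 flight data), and the variable names are my own; the point is only to compare the standard error band with the standard deviation band and with the median and quantiles.

```r
# Simulated, right-skewed "delays" in minutes (hypothetical data, not the 2013 flights).
set.seed(42)
delay <- rexp(5000, rate = 1 / 15)   # mean of roughly 15 minutes, long right tail

mu <- mean(delay)
s  <- sd(delay)
se <- s / sqrt(length(delay))        # standard error of the mean

# Share of individual flights inside each band around the mean
within_se <- mean(abs(delay - mu) <= se)
within_sd <- mean(abs(delay - mu) <= s)

# Summaries that cope better with the long tail
med <- median(delay)
q   <- quantile(delay, probs = c(0.25, 0.75, 0.95))

print(round(c(mean = mu, sd = s, se = se, median = med), 2))
print(round(c(within_se = within_se, within_sd = within_sd), 3))
print(round(q, 1))
```

With a sample this large the standard error band is very narrow, so only a small fraction of the simulated flights land inside it, and the median sits below the mean because of the long tail.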
Figure 3 - Sacramento (SMF) Departure Delays
Figure 4 - San Francisco (SFO) Departure Delays
Another way to look at the data is with a "jitter" plot that shows all the departure delays for a given hour of day in a single plot (rather than the average). This makes the spread of values quite apparent. See Figures 3 and 4 above. The higher average departure delays for SFO are obvious in these two plots as well, but they also give us a sense of how much variance there is in the actual departure times!
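For readers who want numbers alongside the jitter plots, the spread per hour can also be summarized with a median and a 90th percentile. This is only a sketch in base R: the `toy` data frame is simulated and hypothetical, and only the column names `departure_hour` and `delay` are borrowed from the analysis.

```r
# Hypothetical toy data: three departure hours with increasingly long-tailed delays.
set.seed(1)
toy <- data.frame(
  departure_hour = rep(c(6, 12, 18), each = 200),
  delay = c(rexp(200, 1 / 5), rexp(200, 1 / 15), rexp(200, 1 / 25))
)

# Median and 90th percentile of the delay for each hour
spread <- aggregate(
  delay ~ departure_hour, data = toy,
  FUN = function(x) quantile(x, probs = c(0.5, 0.9))
)

# aggregate() returns the two quantiles as a matrix column; flatten it into plain columns
spread <- do.call(data.frame, spread)
names(spread) <- c("departure_hour", "median", "p90")
print(spread)
```

The same idea should carry over to the real data by swapping `toy` for the filtered flights of a single airport.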
For completeness, I'm also including below the R code I used in this analysis.
library(plyr)   # load plyr before dplyr so dplyr's verbs are not masked
library(tidyr)
library(lubridate)
library(ggplot2)
library(dplyr)

# Change working directory to where the 2013flights.Rdata file is located:
setwd("~/Data Science Specialization/Exploratory Data Analysis/AirportDelay Data")
load(file = "2013flights.Rdata")

df = df %>%
  mutate(
    day_of_week = factor(day_of_week, levels = c(1:7, 9),
                         labels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday", "Unknown")),
    month = substr(date, 6, 7),
    datenum = substr(date, 9, 10),
    # raw value divided by 100 and rounded to a whole hour
    departure_hour = round(departure_hour / 100, 0),
    # early departures and arrivals (negative delays) count as zero delay
    delay = ifelse(delay < 0, 0, delay),
    arr_delay = ifelse(arr_delay < 0, 0, arr_delay)) %>%
  filter(departure_hour > 5 & departure_hour < 24)

### PLOT RESULTS JUST FOR "SMF","SFO"
plot_data = df %>%
  filter(airport %in% c("SMF", "SFO")) %>%
  group_by(departure_hour, airport) %>%
  dplyr::summarise(mu = mean(delay, na.rm = TRUE),
                   se = sqrt(var(delay, na.rm = TRUE) / length(na.omit(delay))),
                   obs = length(na.omit(delay)))

# Mean delay per hour with +/- 1 standard error bars, one line per airport
p = ggplot(plot_data, aes(x = departure_hour, y = mu, ymin = mu - se, ymax = mu + se,
                          group = airport, color = airport)) +
  geom_line(lwd = 1.5) +
  geom_point() +
  geom_errorbar(width = .33) +
  scale_x_continuous(breaks = seq(5, 23)) +
  labs(x = "Hour of Day", y = "Average Departure Delay",
       title = "Flight Delays by Departure Time and Airport") +
  theme(legend.position = "bottom") +
  scale_color_discrete(name = "Airport")
p

### LOOKING AT THE DISTRIBUTION. 12 Noon in SMF:
plot_data = df %>%
  filter(airport == "SMF" & departure_hour == 12)
plot_data = na.omit(plot_data)

# Histogram on a density scale with a smoothed density overlay; x axis capped at 60 minutes
p1 <- ggplot(plot_data, aes(x = delay)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = NA) +
  geom_density(color = "blue", fill = "blue", alpha = 0.3) +
  xlim(0, 60) +
  labs(x = "Departure Delay", y = "Density",
       title = "12 Noon Departure Delay Distribution - SMF")
p1

### ADDED JITTER PLOTS. FIRST SMF:
plot_data = df %>%
  filter(airport == "SMF") %>%
  group_by(departure_hour, airport)
plot_data = na.omit(plot_data)

b1 <- ggplot(plot_data, aes(departure_hour, delay)) +
  ylim(0, 600) +
  geom_jitter(alpha = I(1/4), col = "blue") +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(5, 23)) +
  labs(x = "Hour of Day", y = "Departure Delay",
       title = "Flight Delays by Departure Time - SMF")
b1

### ADDED JITTER PLOTS. SFO:
plot_data = df %>%
  filter(airport == "SFO") %>%
  group_by(departure_hour, airport)
plot_data = na.omit(plot_data)

b1 <- ggplot(plot_data, aes(departure_hour, delay)) +
  ylim(0, 600) +
  geom_jitter(alpha = I(1/4), col = "red") +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(5, 23)) +
  labs(x = "Hour of Day", y = "Departure Delay",
       title = "Flight Delays by Departure Time - SFO")
b1

Hopefully this study is of interest to anyone travelling a lot through these two local airports.
Comments, questions, suggestions? You can reach me at: contact (at sign) paulorenato (dot) com