When collisions happen

I downloaded the NYPD Collisions data set yesterday to play with this morning, and what a pleasant surprise. The portion I downloaded covers collisions from July 7, 2012 through October 30, 2015, the meaning of every field was obvious, and there didn’t seem to be too many missing fields. Reports are added daily, so going forward this seems like a good stream to play around with.

I don’t know anything about traffic and collision patterns, so I thought I’d start by getting to know when there are a lot of collisions. First, I wanted to know which days had the most accidents each year (remembering that 2012 and 2015 are incomplete). This was easy (and fast) to determine with R’s data.table.

library(data.table)
dt = fread("nypd_collisions.csv")
dt[ , DATE := as.Date(DATE, format = "%m/%d/%Y")]
# count collisions per day, then keep each year's busiest day
max_days = dt[, .N, by=DATE][, .SD[which.max(N)], by=year(DATE), .SDcols=1:2]

Incidentally I didn’t come up with that expression above on my own but got some help from one of data.table’s main contributors, Arun Srinivasan. I hadn’t known it was possible to subset .SD in the j column to select rows, but now I do. That handy line of code quickly yielded the following information:

Date          Reported Collisions
2015-01-18     958
2014-01-21    1161
2013-11-26     867
2012-12-21     751
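To see what the `.SD[which.max(N)]` idiom is doing, here is a minimal sketch on made-up daily counts (the numbers are invented for illustration, not taken from the NYPD data). Within each `by` group, `.SD` holds that group's rows, and `which.max(N)` picks the row with the largest count. Passing `.SDcols` keeps `DATE` in `.SD` even though it is used in the `by` expression.

```r
library(data.table)

# Toy daily counts (made-up numbers, not from the NYPD data)
toy = data.table(
  DATE = as.Date(c("2013-01-05", "2013-06-07", "2014-01-21", "2014-02-03")),
  N    = c(10L, 30L, 50L, 20L)
)

# For each year, .SD is that year's rows; keep the row with the largest N
busiest = toy[, .SD[which.max(N)], by = year(DATE), .SDcols = c("DATE", "N")]
print(busiest)
```

The result has one row per year, with the grouping column named `year`.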

January 21, 2014 was quite the outlier, and it seems like it was quite a snowy day. This was the cover of The New York Post that day, so clearly there was something major on (because it’s not like the Post ever exaggerates).


It was also the day after MLK weekend, so I’d think traffic could have been heavier than usual, with people driving home the morning after a long weekend. But the main factor behind the sheer size of the collision number that day would have to be the snowstorm, because the number of collisions is just so enormous.

Just how exceptional this day is can best be seen in the plot below, where the top 10 days in number of reported collisions are plotted for each year. Keep in mind that 2012 and 2015 are incomplete years, meaning the highest collision day we have for these years isn’t entirely ‘fair’ to compare with the full years of 2013 and 2014, but it still gives us some idea of how special January 21, 2014 was.

library(ggplot2)
# ten largest daily counts per year; index is the within-year rank used on the x axis
top_10 = dt[, .N, by=DATE][, .(index = 1:10, V1 = sort(N, decreasing=TRUE)[1:10]), by=year(DATE)]
p = ggplot(data=top_10, aes(x=index, y = V1, group = as.factor(year), color = as.factor(year))) + geom_line(size = 2)
p = p + scale_x_continuous(breaks = 1:10)
p = p + labs(title = "Top 10 days per year in # of reported collisions", x = "Within-year rank", y = "Number of reported accidents")
p = p + theme(legend.title = element_blank(), axis.text = element_text(size=18), axis.title = element_text(size=20), title = element_text(size=17), legend.text = element_text(size = 15))


Wow. If not for that day, 2014 would be almost identical to 2013, and if you plot out to more days (20 or 50 days) that trend continues, with 2013 and 2014 virtually identical while 2012 lags and 2015 surpasses these years in number of accidents on a given ranked day.

The trend of increasing reported collisions, if it really is a trend, is likely more related to the ‘reported’ part of that than to the number of collisions. Such a clear increase from year to year, just when NYC has been making open data a priority, makes me think this is the more likely explanation for the year-on-year increase, especially given that the past few years have been record-breaking years for mass transit use. I don’t know if NYC is following the nationwide trend of less driving, but if it is, that’s another reason to think the year-on-year rise seen above is a reporting artifact and not a real trend.

Going with my assumption that we’re in a relatively static period of years when driving and collisions are about the same from year to year, I’m most interested in figuring out when collisions are happening. I note that the ‘mega days’ – those with the highest number of collisions per year – don’t follow the received wisdom that the holidays are the most dangerous time for drivers (whatever that means without more qualification). Notably, these mega-collision days mostly don’t fall inside the holiday season. For example, the days with the second- and third-most collisions in each year are listed below, and none of them occur between Thanksgiving and New Year’s. Winterish months dominate this list, but not holiday-specific times.

Date          Reported Collisions
2015-03-06     935
2015-03-05     829
2014-02-03     960
2014-02-14     791
2013-03-08     851
2013-06-07     791
2012-10-19     738
2012-11-07     718

Still, it’s interesting that the mega days are mostly winter days, because traffic experts will tell you that summer is actually the most dangerous time on the road. If you plot collisions by season, you can see that the summer months dominate the yearly count, with 20% more reported collisions than the 4th quarter/holiday season.

# get.quarter() is a helper not defined in this post; data.table's quarter() is an alternative
dt[, QUARTER := get.quarter(DATE)]
by_quarter = dt[, .N, by = QUARTER][order(QUARTER)]
p = ggplot(data = by_quarter, aes(x = QUARTER, y = N, group = as.factor(QUARTER), fill = as.factor(QUARTER))) + geom_bar(stat = "identity")
p = p + labs(title = "Collision count by quarter", x = "Quarter of the year", y = "Number of reported accidents")
p = p + theme(legend.title = element_blank(), axis.text = element_text(size=18), axis.title = element_text(size=20), title = element_text(size=17), legend.text = element_text(size = 15))
# one continuous x scale carries breaks, labels and limits (a separate xlim() would replace the scale and drop the labels)
p = p + scale_x_continuous(breaks = 1:4, labels = c("1(Jan-Mar)", "2(Apr-Jun)", "3(Jul-Sep)", "4(Oct-Dec)"), limits = c(.5, 4.5))
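The `get.quarter` helper is not shown in the post. A minimal sketch of what such a helper could look like, assuming it simply maps a `Date` to its calendar quarter (1 to 4), is below; the body is my assumption, not the author's definition.

```r
# Hypothetical stand-in for get.quarter (the post does not show its definition).
# Maps a Date to its calendar quarter (1-4) using the month number.
get.quarter = function(d) {
  month_num = as.integer(format(d, "%m"))
  (month_num - 1L) %/% 3L + 1L
}

# January -> 1, June -> 2, October -> 4
quarters_out = get.quarter(as.Date(c("2014-01-21", "2013-06-07", "2012-10-19")))
print(quarters_out)
```

Returning a plain integer keeps `QUARTER` numeric, which is what a continuous x axis with limits of .5 to 4.5 expects.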


So now that we know what seasons generate the most collisions, we know which months to skip driving. We can also do more fine-grained thinking about timing – what hours of what days are the worst as far as collisions go? Let’s do a collision count for each day of the week for each hour of each day. Whew, say that five times fast. This is a bit wordier, even with lovely data.table, but it’s ok.

worst_time = dt[ , .N, by = .(DAY_OF_WEEK, HOUR_OF_DAY)][, DAY_ORDER:=sapply(DAY_OF_WEEK, get_day_num)]
worst_time = worst_time[order(DAY_ORDER, HOUR_OF_DAY)][ , DAY_OF_WEEK:=as.factor(DAY_OF_WEEK)]
worst_time$DAY_OF_WEEK = factor(worst_time$DAY_OF_WEEK, weekday_order)
p = ggplot(data = worst_time, aes(x=HOUR_OF_DAY, y = N, group = DAY_OF_WEEK, color = DAY_OF_WEEK))
p = p + geom_line(size=1.5) + ylim(0, 9000) + labs(title = "Accidents per time of day per day (Jun-Aug)", x = "Time of day", y = "Number reported accidents")
p = p + scale_x_continuous(breaks = 0:23) + theme(legend.title = element_blank(), axis.text =element_text(size=18), axis.title=element_text(size=20), title=element_text(size=17), legend.text = element_text(size = 15))
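The code above relies on `DAY_OF_WEEK`, `HOUR_OF_DAY`, `get_day_num` and `weekday_order`, none of which are defined in the post. One way they might be built is sketched below, assuming the data has a `DATE` column of class `Date` and a `TIME` column holding strings like "9:15". Both the column layout and the helper bodies are assumptions for illustration.

```r
library(data.table)

# Assumed definitions; the post does not show these helpers.
weekday_order = c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
get_day_num = function(day) match(day, weekday_order)

# Two made-up rows in the assumed shape of the data
toy = data.table(DATE = as.Date(c("2014-01-21", "2014-01-25")),
                 TIME = c("9:15", "17:40"))

# "%u" gives the ISO weekday number (Monday = 1), which avoids locale-dependent day names
toy[, DAY_OF_WEEK := weekday_order[as.integer(format(DATE, "%u"))]]
# Keep only the hour part of the time string, e.g. "9:15" -> 9
toy[, HOUR_OF_DAY := as.integer(sub(":.*$", "", TIME))]
toy[, DAY_ORDER := get_day_num(DAY_OF_WEEK)]
print(toy)
```

Using `match` against `weekday_order` means the sort order and the factor levels come from the same vector, so the lines in the plot legend run Monday through Sunday.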


This is the amalgamated number of accidents at a given time of day on a given day of the week. So all accidents since July 2012 recorded here on a Monday at 9 am are counted in the red Monday line at the 9 am slot, for example. When we sum the accidents this way, we can think of it as giving us a picture of what a typical day looks like (I haven’t averaged, but that would just be a denominator changing the magnitude but not the shape of the curves above).

We see what we’d expect. Weekday mornings have a morning rush hour peak for number of collisions and also an evening peak. Somewhat surprising to me is that there is also a consistent lunchtime peak and that Saturday and Sunday have the same lunchtime and evening peaks as the weekdays. Some might argue that the lunchtime peak is really just a continuing trajectory interrupted by a 2 pm siesta dip, but I’m less convinced of this, not knowing anyone in NYC over the age of five who has the luxury of taking siestas.

Average behaviors of this sort can always be more than a bit misleading. We are averaging over all neighborhoods of NYC and also averaging over all times of year. When you think about how large a portion of the work force changes its schedule in the summer (there are more than 70,000 public school teachers in NYC) and the possible prevalence of folks in NYC who own second homes, it makes sense to think trends could be different in the warmer months. And so they are, if we plot the day-by-day, hour-by-hour trends, but now only for the summer months.
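Restricting to summer is just a row filter applied before the counting step. A minimal sketch, assuming ‘summer’ means June through August (as the "(Jun-Aug)" plot title in the earlier code suggests) and using made-up dates:

```r
library(data.table)

# Made-up dates standing in for the collision records
toy = data.table(DATE = as.Date(c("2013-02-10", "2013-06-07", "2014-07-04",
                                  "2014-08-31", "2014-09-01")))

# Keep only rows whose month is June, July or August
summer = toy[month(DATE) %in% 6:8]
print(summer)
```

The same filter could go in the `i` slot of the full pipeline, so the per-day, per-hour counts are computed on the summer rows only.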


We see the overall pattern is the same, with three peaks on weekdays and two on weekends, but the relative magnitudes are different. Now there are two kinds of weekday mornings: Monday/Tuesday (more collisions, perhaps more lucky folks driving back to the city from their lovely second homes) vs later in the week. Also Wednesday evening rush hour has almost as many collisions as Friday – perhaps more lucky folks who can take short weeks to ‘work from home’ from their houses in the country!

It’s worth remembering that the data above is still averaged out by region even if we’ve narrowed our focus to one kind of time period (summer). Below, we’ll look at year-long trends but now divvy the data up by geography, plotting it according to borough. If we break down the number of collisions per day of the week per hour further by borough, more variation emerges in the average time series.

worst_time = dt[ nchar(BOROUGH)>1, .N, by = .(DAY_OF_WEEK, HOUR_OF_DAY, BOROUGH)][, # … same as before…
p + facet_grid(BOROUGH ~ .) + ylim(0, 1900)


Queens, Brooklyn, and the Bronx have almost as many morning collisions as evening collisions. Manhattan, by contrast, lacks a distinct morning peak and instead shows a strong, steady increase to a daily plateau of accidents, with no sharp peak at any point during daytime hours. Staten Island, ever its own beast, barely registers the rush hours as far as its collisions go, which surprises me because I remember many hours spent in traffic on the Staten Island Expressway during rush hour. Maybe all that traffic prevents collisions?

We could take this ‘turtles all the way down’ and divvy up by time of year and by borough. I’ll leave that as an exercise to the reader.

I’m planning to match census tract data to collision locations, but so far the only free API I’ve found to map latitude, longitude pairs to census tract is the FCC Census Block Conversion. I’m doing roughly 1 API call per second on my tiny, free EC2, so the data (~700,000 rows) should be ready to go in about a week.
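As a quick check on that estimate, here is the arithmetic using the round figures quoted above (about 700,000 rows at about one call per second):

```r
n_rows = 700000            # approximate row count quoted above
calls_per_second = 1       # approximate request rate quoted above

seconds_needed = n_rows / calls_per_second
days_needed = seconds_needed / (60 * 60 * 24)
print(round(days_needed, 1))
```

That works out to a little over eight days, so ‘about a week’ holds as long as the rate stays near one call per second.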

