Mapping NYC’s air quality monitors

The Data and My Discontent

Trawling NYC Open Data for complementary geographic data for the collisions data set, I noticed a small 2009 collection of CSVs with air quality measurements from about 60 points scattered through the city, called the Clean Air Survey. I thought it would be interesting to see whether any air quality measurements correlated with the number of collisions, perhaps as a proxy for overall traffic density/traffic flow.

Now this is a bad idea for a whole lot of reasons, two of which are provided in this 2013 report on NYC Trends in Air Pollution. First, the data is old, going all the way back to 2009, and the whole point of the report is that certain kinds of pollutants have dropped drastically since then, as illustrated in the figure below, taken from that report.

[Figure: maps of pollutant levels across NYC in 2009 (left) and 2013 (right), from the NYC Trends in Air Pollution report]

The available data is from 2009, the plot on the left. By 2013, however, the pattern of pollution had clearly changed, and for the better. This suggests that even if there had been a tenuous connection between air quality and collisions by way of traffic density, it’s not going to be a strong connection and might even be washed out in the data. The principal reason these pollutants decreased, as highlighted by the city’s report, is the Clean Heat program, which helps buildings reduce their emissions and improve their heating systems. So this data may not have been relevant to traffic in the first place, and even if it were, the 2009 data and the situation now are clearly different.

But we’ve downloaded the CSVs, so we might as well take a look at what’s on offer while we file a FOIA request to get the rest of the goodies. Inspired by NYC Open Data’s posting of taxi rides following someone else’s FOIA request, I’m making one of my own using this handy template. I’ll be sure to post an update if I hear anything back.

All this being said, I realized that I’m much more comfortable making graphs with R than with Python, so I decided it was time to give matplotlib a whirl and see what it could do with geographic coordinates…after I got those coordinates.

The first order of business was mapping the neighborhood names used to identify the air quality sensors to geographic coordinates. Google’s geocoding API, though a black box and imperfect, is wonderfully forgiving about what kinds of data you send it, so luckily requests like “Give me the coordinates for Chelsea, New York City” seemed to work out pretty well.

However, there were two tricky points to the Google geocoding calls, both of which surprised me given that I’d easily used it with JavaScript calls in the past without worrying about these things.

Google API (slight) tricks for the unwary

First, since there were spaces in the place names, it was necessary to call urllib.urlencode() on the place names before appending them to the API URL, something I’d skipped in the past without any problem.

Second, and this one was a bit irritating, the API kept returning a Bad Request error until I added a fake User-agent with the following modification to my urllib2.build_opener():

opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')]
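
Putting the two fixes together, a minimal sketch of the whole geocoding call looks something like this (the endpoint and the address parameter come from the Geocoding API docs; the helper name and the example neighborhood are just mine):

import json
import urllib
import urllib2

def geocode(place):
    # urlencode handles the spaces in the place names
    params = urllib.urlencode({'address': place, 'sensor': 'false'})
    opener = urllib2.build_opener()
    # The fake User-agent that keeps the Bad Request error away
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = json.load(opener.open('http://maps.googleapis.com/maps/api/geocode/json?' + params))
    location = response['results'][0]['geometry']['location']
    return (location['lat'], location['lng'])

print(geocode('Chelsea, New York City'))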

I get that Google’s in charge, not me, but I don’t see the logic in forcing developers to put in a User-agent string, not least because these strings are so irritatingly long, and also because I would have thought the API was built for scripts. How often does someone pull up the API in an actual browser window?

The script, as it finally worked, is on GitHub. You’ll notice it also comes with a warning not to use the code. That’s because the final trick for the unwary is that there is already an easy-to-use Google Maps API wrapper, geopy, as well as a Python Google Maps module authored by Google. So use those. Avoid my mistakes!
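
For comparison, with geopy the whole dance above collapses to a couple of lines. This is a sketch assuming geopy’s GoogleV3 geocoder; the exact constructor arguments (an API key, for instance) depend on your version:

from geopy.geocoders import GoogleV3

geolocator = GoogleV3()
location = geolocator.geocode('Chelsea, New York City')
print((location.latitude, location.longitude))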

Visualizing the sensors

Next up was getting an idea of where the sensors were and whether there seemed to be any rationale for the chosen sensor locations. It was a breeze to calculate all possible two-way combinations of location coordinates with itertools.combinations and then calculate the distance for each combination with geopy’s great_circle method, which even lets you easily pick your units for the distance, like so:

from itertools import combinations
from geopy.distance import great_circle

combos = combinations(coords, 2)
distance_in_miles = []
for c in combos:
    distance_in_miles.append(great_circle(c[0], c[1]).miles)
It doesn’t get much easier than that, does it?

From this we can generate a nice little histogram with 1-mile bins showing us how far apart the points are from one another:


import numpy as np
import matplotlib.pyplot as plt

binwidth = 1  # 1-mile bins
plt.hist(distance_in_miles, bins=np.arange(min(distance_in_miles), max(distance_in_miles) + binwidth, binwidth))
plt.show()

[Figure: histogram of pairwise distances between sensors, with a suspicious tail of values above 50 miles]

Hmmm, I can’t say I’m too happy about that. There’s a whole tail of nonsensical values out above 50 miles. I know the drive from Perth Amboy, NJ to Yonkers, NY, which spans essentially the longest north-south extent of the NYC area, is less than 46 miles, and presumably less as the crow flies, so there’s reason to think there are a bunch of spurious points.
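
As a quick sanity check on that claim (the coordinates here are rough ones looked up by hand, not values from the data):

from geopy.distance import great_circle

# Approximate coordinates for Perth Amboy, NJ and Yonkers, NY
perth_amboy = (40.507, -74.265)
yonkers = (40.931, -73.899)
print(great_circle(perth_amboy, yonkers).miles)  # roughly 35 miles as the crow flies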

Luckily, with easy joins and re-ordering of pandas DataFrames, it’s not too difficult to get to the bottom of this. First, we make a DataFrame with the pairs of coordinates and the distance between them:

from pandas import DataFrame, merge

# c1 and c2 hold the first and second coordinate of each pair
c = DataFrame({"c1": c1, "c2": c2, "distance": distance_in_miles})

Then we join this to the data with neighborhood names, first joining the neighborhood name to the coordinates in c1 and then joining the neighborhood name to the coordinates in c2:

result2 = merge(c, result.ix[:,['Name', 'Coords']], left_on = 'c1', right_on = 'Coords')
result3 = merge(result2, result.ix[:,['Name', 'Coords']], left_on = 'c2', right_on = 'Coords')
result4 = result3.sort(['distance'], ascending = [0])

I also saved the bad pairs in case we need them later to weed out data, and got a count of neighborhoods appearing in the bad pairs:

bad_pairs = result4[result4['distance'] > 50][['distance', 'Name_x', 'Name_y']]
bad_pairs.to_csv("bad_pairs.csv")
bad_pairs.groupby('Name_y').count()

Not surprisingly, it turned out there was a single culprit:

                                    distance  Name_x
Name_y
Astoria, Long Island City                  1       1
Bay Ridge, Dyker Heights                   1       1
Bayside, Douglastown, Little neck          1       1
Bedford Park, Norwood, Fordham             1       1
Bensonhurst, Bath Beach                    1       1
Borough Park, Ocean Parkway                1       1
Brownsville, Ocean Hill                    1       1
Canarsie, Flatlands                        1       1
Coney Island, Brighton Beach               1       1
Crown Heights South, Wingate              37      37
East Flatbush, Rugby, Farragut             1       1
Elmhurst, South Corona                     1       1
...etc...

When I plotted the coordinates I’d recorded for this troublesome Crown Heights South, Wingate neighborhood, I saw that Google had taken this as Wingate, NY, which is quite a ways north of New York City, hence the large distances.

[Figure: map showing the mis-geocoded Wingate point, far north of New York City]

Now if we repeat the histogram, suppressing these points, we see the reasonable values we expect, spread out in a seemingly Gaussian distribution that clearly shows the sensors aren’t laid out in anything like a grid (if they were, we’d see concentrated modes at multiples of the grid’s spacing parameters).
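
The re-plot is just a matter of filtering out every pair that involves the offending neighborhood, along these lines (a sketch reusing the DataFrame and bins from above):

# Drop any pair involving the mis-geocoded neighborhood before re-plotting
bad_name = 'Crown Heights South, Wingate'
good = result4[(result4['Name_x'] != bad_name) & (result4['Name_y'] != bad_name)]
plt.hist(good['distance'], bins=np.arange(good['distance'].min(), good['distance'].max() + binwidth, binwidth))
plt.show()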

[Figure: histogram of pairwise distances after removing the mis-geocoded point, roughly Gaussian]

That’s much better! We can move on now to see the geographic distribution of the points and confirm that they are not in a grid-like layout.

Mapping the sensors

In fact, thanks to some helpful instructions I found online (and shamelessly copied), we can even make an interactive graph, which makes it easy to check the accuracy of the geocoding: hover your mouse over a point, and it shows the neighborhood name. If a neighborhood is far from where it should be, that’s much more obvious in a graph like this than in a written list of coordinates. You can see the code for this bit here. It’s a bit long to paste into this blog post, but you can see the lovely results here.
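
Stripped down to its essentials, the hover-label trick in matplotlib looks something like the sketch below, with made-up names and coordinates standing in for the real data (this is not the actual script):

import matplotlib.pyplot as plt

# Made-up names and coordinates, just to illustrate the hover mechanics
names = ['Chelsea', 'Astoria, Long Island City']
lons, lats = [-74.00, -73.92], [40.75, 40.76]

fig, ax = plt.subplots()
points = ax.scatter(lons, lats, c='red')
label = ax.annotate('', xy=(0, 0), xytext=(10, 10), textcoords='offset points',
                    bbox=dict(boxstyle='round', fc='w'))
label.set_visible(False)

def on_move(event):
    # Show the neighborhood name while the mouse is over a point
    over, info = points.contains(event)
    if over:
        i = info['ind'][0]
        label.xy = (lons[i], lats[i])
        label.set_text(names[i])
    label.set_visible(over)
    fig.canvas.draw_idle()

fig.canvas.mpl_connect('motion_notify_event', on_move)
plt.show()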

[Figure: interactive scatter plot of the sensor locations; hovering over a point shows its neighborhood name]

What you can’t see until you try this yourself is that when you mouse over each of the red circles, the neighborhood name pops up. What we can see here is that most of the neighborhoods are in the right place (check the code for comments on the names of the ones that do seem misplaced). We can also see, as we could tell from the histogram above, that the points aren’t on a grid. Finally, we can see the sensors aren’t meant to be spread out evenly. Rather, the density of sensors roughly tracks population density, with the bulk of sensors in Manhattan, western Brooklyn and western Queens, and through much of the Bronx.

So that’s a wrap, unless that FOIA request comes through.
