A more organic approach using geotagged social media data.
What makes a community?
As a group of ethnographers and data scientists, we struggle with this philosophical question every day at Spatial. If you are unsure what ethnography is, Webster defines it as “the study and systematic recording of human cultures.” With 7.5 billion people on this planet, doing this accurately and consistently for the entire human race is a daunting task.
Let’s consider just one question then: “What makes a community?” We could go with a dozen different definitions, but let me propose one:
A community is a group of individuals who make a cohesive social structure within a geographic region.
Now, you might wonder, “At what scale are we talking here? We can satisfy this definition with two individuals, a neighborhood, or even the entire globe.” All of these are valid, and focusing on any one of them would yield interesting results. In this case, I wanted to take a look at neighborhood-level communities.
There already exist datasets that we can reference. For example, Zillow provides a data set for neighborhood boundaries in many of the cities that they support. Below are the boundaries for Detroit as provided in the Zillow neighborhood data set.
Where did these boundaries come from? More than likely, these particular boundaries are derived from the work that Arthur Mullen did for Cityscape Detroit in 2003 to identify major neighborhoods in Detroit. This work has since been spread all around the web and has become a sort of final authority on Detroit neighborhood boundaries. It is important to point out that while this is very impressive work, the boundaries are almost fifteen years old and miss many areas of the city. One can’t help but wonder if there are communities that these boundaries don’t currently account for.
It is not easy to tell right away whether these are accurate depictions of the real communities. At the very least, these boundaries respect large natural barriers and human-made barriers, like roads. Now, let’s take a look at this in the context of another data set: geotagged social media, which we call geosocial data. If you aren’t sure what geotagged social media is, imagine thousands of pins on a map, each one marking where a person was when he or she posted to social media. This kind of visualization can show us where people gather and spend their time.
Further reading: The Essential Guide to Geosocial Data
The blue dots on the map above represent a random sample of geotagged social media from the city of Detroit. From here on I will refer to this as the media data set. This sample has roughly 5000 data points, or 5000 pins on a map, and has gone through a series of filtering routines to remove things like spam. As you can see, these points can give us an idea of areas of congregation and high activity. Even though the Zillow neighborhood boundaries do a good job of encapsulating some of this activity, they miss many areas and even slice through some probable communities.
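One way to put a number on "they miss many areas" is to count the share of media points that fall inside any neighborhood polygon. The sketch below is only an illustration and is not the analysis behind the map: the function names `point_in_polygon` and `coverage` are made up for this example, polygons are assumed to be simple lists of (x, y) vertices, and the test uses the standard ray-casting rule.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def coverage(points, polygons):
    """Fraction of points that fall inside at least one polygon."""
    if not points:
        return 0.0
    covered = sum(
        any(point_in_polygon(p, poly) for poly in polygons) for p in points
    )
    return covered / len(points)


# Toy data: one square "neighborhood" and two pins, one inside and one outside.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
share = coverage([(1, 1), (3, 1)], [square])
```

A low coverage value for a real boundary file would suggest that a lot of social activity happens outside the published neighborhoods.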
These boundaries are a good start, but they are not effective at finding cohesive social structures within geographic regions. So, I looked for methods of creating boundaries based on social activity. One organization, called Livehoods, uses Foursquare check-ins to divide regions of a city by the restaurants and shops that have the same visitors. Unfortunately, I couldn’t find their data for more than a few cities. So instead, I wrote a program that uses our scrubbed media data set for Detroit to generate boundaries that reflect the proximity of the different points in the set. Below are the results for Detroit.
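The post does not say which algorithm the program uses, so the following is a minimal sketch of one proximity-based approach, not the actual implementation. It assumes points are (lat, lon) tuples, links any two points closer than a chosen gap (single linkage via union-find), and draws a convex hull around each group as a rough boundary. The names `proximity_clusters`, `convex_hull`, and the 200 m gap are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))


def proximity_clusters(points, max_gap_m):
    """Group points so that any chain of neighbors within max_gap_m shares a cluster."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pair scan: fine for a sample of a few thousand points.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_m(points[i], points[j]) <= max_gap_m:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, p in enumerate(points):
        groups[find(i)].append(p)
    return list(groups.values())


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in order around the cluster."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# Toy coordinates near Detroit, invented for the example.
pins = [(42.3314, -83.0458), (42.3320, -83.0450), (42.3317, -83.0465),
        (42.3600, -83.0700), (42.3605, -83.0705)]
clusters = proximity_clusters(pins, max_gap_m=200)
boundaries = [convex_hull(c) for c in clusters]
```

A convex hull is the simplest possible outline; it ignores rivers, freeways, and other barriers, which is one reason boundaries built purely from point proximity can cut across them.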
This method has some weaknesses. For example, there is a lack of natural barriers. On the other hand, we are starting to get somewhere when it comes to defining regions by areas of congregation and social activity. This could be a great starting point for us if we want to understand the different communities in Detroit. For instance, a clear next step is to find the topics being discussed in each region and how they change over time. Imagine that for the month of October we generate a word cloud for some community and compare it to a word cloud generated for September. How did the conversations change?
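As a rough sketch of that month-over-month comparison, the snippet below counts words in two batches of posts and ranks the words whose share of the conversation changed the most. Everything here is an assumption for illustration: the function names, the tiny stopword list, and the sample posts, which are invented placeholders rather than real data.

```python
import re
from collections import Counter

# A deliberately small stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "at", "in", "of", "to", "is", "for", "on"}


def word_counts(posts):
    """Count lowercase words across a list of post texts, skipping stopwords."""
    counts = Counter()
    for text in posts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def biggest_shifts(before, after, top=5):
    """Words whose share of all counted words changed the most between two periods."""
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    shift = {
        w: after[w] / total_after - before[w] / total_before
        for w in set(before) | set(after)
    }
    return sorted(shift.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


# Invented placeholder posts for one community in two consecutive months.
september = ["Tigers game tonight", "Tigers win"]
october = ["Halloween party", "Halloween costumes downtown"]

shifts = biggest_shifts(word_counts(september), word_counts(october), top=2)
```

Positive values mark words that gained share in the later month and negative values mark words that faded, which is the same information a pair of word clouds conveys visually.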
Another interesting aspect of this approach is that the algorithm we chose actually generates a nested topology. What I mean by that is that we can find sub-communities within larger communities, from the city as a whole down to areas encompassing a few blocks. Remember the question we had earlier about neighborhood scale versus global scale? With this approach we can generate boundaries at any level on that spectrum.
That said, I question the usefulness of this approach for anything larger than a city, and its accuracy for anything smaller than a few city blocks. You can see the process of dividing larger communities into smaller communities in the gif below.
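To make the idea of a nested topology concrete, here is a small sketch using single-linkage hierarchical clustering, which is a stand-in since the post does not name the algorithm actually used. The merge history is built once, then "cut" at different distances: a small cut gives block-sized groups, a large cut gives city-sized ones, and every small group sits entirely inside one larger group. The function names and the toy coordinates (assumed to be already projected to meters) are made up for this example.

```python
import math
from itertools import combinations


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def single_linkage_merges(points):
    """Return (distance, i, j) merges in the order single linkage applies them."""
    parent = list(range(len(points)))
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    merges = []
    for d, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # this edge joins two separate groups
            parent[ri] = rj
            merges.append((d, i, j))
    return merges


def cut(merges, n_points, height):
    """Cluster labels after applying only the merges no longer than `height`."""
    parent = list(range(n_points))
    for d, i, j in merges:
        if d > height:
            break  # merges are sorted by distance
        parent[_find(parent, i)] = _find(parent, j)
    relabel = {}
    return [relabel.setdefault(_find(parent, i), len(relabel))
            for i in range(n_points)]


# Toy points in meters along a line: three tight pairs, two of them near each other.
pts = [(0, 0), (50, 0), (400, 0), (450, 0), (5000, 0), (5050, 0)]
merges = single_linkage_merges(pts)

blocks = cut(merges, len(pts), height=100)       # finest level
districts = cut(merges, len(pts), height=1000)   # middle level
city = cut(merges, len(pts), height=10000)       # everything together
```

Because each coarser level only adds merges on top of the finer one, the levels nest by construction, which is the property the gif illustrates in reverse by splitting large communities into smaller ones.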
Last, but not least, this approach is easily scalable to any city in the United States. So wherever traditional neighborhood boundaries are used, this community discovery tool can augment your products in the same way, with the added benefit of being closer to real time. For example, if you are using traditional neighborhood boundaries to help predict property values, consider using our boundaries as well; you may be surprised at what a real-time dataset can do.