Paul Winkler edited this page Aug 6, 2012 · 3 revisions

Feed Formats

The best and most usable data for OpenBlock is:

  • available in a standard, machine readable, open format
  • accessible publicly over the internet
  • updated frequently
  • explicitly tagged with geographic metadata
  • free of restrictive license constraints

Common Requirements

ScraperScripts can be written to use any number of source data formats. There are three key pieces of information you especially want for each feed entry:

  1. A date. Any RSS or Atom feed will already have this.
  2. A geographic location (latitude & longitude).
  3. A name for the location (e.g. street address, intersection, block name, neighborhood name, etc.). This is used on various pages to tell the user where this news item is relevant, other than by showing it on a map.

Both 2 and 3 need to be in an easily parsed format such as GeoRSS.

What if I can't get both a location and a location name?

Either one alone can be sufficient; it just takes a little more work in the scraper script.

  • If you have a latitude & longitude, but no location name: your scraper can use OpenBlock's reverse geocoding code to find the nearest block for that point. For example, in our Boston demo, if we use latitude=42.357778 and longitude=-71.07, the reverse geocoder gives us the block "103-124 Mount Vernon St". As long as the database of blocks is fairly complete, this technique works well.
  • If you have a location name, but no latitude & longitude: your scraper can use OpenBlock's geocoding code (see http://en.wikipedia.org/wiki/Geocoding) to try to convert the place name into a point within your city. This technique is inherently fragile because a place name can take any number of formats, and there are many ways the geocoder can fail: there may be no known location by that name, the format may be different from anything the geocoder can handle, the name may have been mangled by human error, or the geocoder may find several conflicting results and not know which is best. But in many common cases, it works well. For example, in the current Boston demo, the place name '100 Massachusetts Ave' geocodes to latitude 42.348127, longitude -71.088191.
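
The two fallbacks above can be sketched as a single helper. This is only an illustration, not OpenBlock's actual API: `complete_location`, `fake_geocode`, and `fake_reverse_geocode` are hypothetical names, and the stand-in geocoders simply return the Boston demo values quoted above (rounded to six decimal places).

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def complete_location(
    point: Optional[Point],
    name: Optional[str],
    geocode: Callable[[str], Optional[Point]],
    reverse_geocode: Callable[[Point], Optional[str]],
) -> Tuple[Optional[Point], Optional[str]]:
    """Fill in whichever of (point, name) is missing, if possible."""
    if point is not None and not name:
        # Coordinates only: look up the nearest block name.
        name = reverse_geocode(point)
    elif name and point is None:
        # Name only: geocoding can fail, in which case point stays None.
        point = geocode(name)
    return point, name


# Hypothetical stand-ins for real geocoders (assumptions, not OpenBlock code).
def fake_geocode(name: str) -> Optional[Point]:
    known = {"100 massachusetts ave": (42.348127, -71.088191)}
    return known.get(name.strip().lower())


def fake_reverse_geocode(point: Point) -> Optional[str]:
    return "103-124 Mount Vernon St"


by_point = complete_location((42.357778, -71.07), None, fake_geocode, fake_reverse_geocode)
by_name = complete_location(None, "100 Massachusetts Ave", fake_geocode, fake_reverse_geocode)
```

Passing the geocoders in as arguments keeps the sketch independent of any particular geocoding library; a real scraper would supply OpenBlock's own functions instead.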

What if all I have is a feed with no place name or latitude/longitude?

As a last resort, OpenBlock contains code that can attempt to extract addresses from any text. These can then be fed to the geocoder; the first address that geocodes successfully supplies both the location name and the point.

The reason this is a last resort is that it introduces two possible kinds of failure: failure to find the location name in the raw text, and failure to geocode properly if the location name is found. There's a lot that can go wrong.
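
To make the two failure points concrete, here is a rough sketch of the extract-then-geocode loop. It is not OpenBlock's address extractor: the regular expression is deliberately naive, and `candidate_addresses`, `first_geocodable`, and `fake_geocode` are hypothetical names used only for illustration.

```python
import re
from typing import Callable, Iterator, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)

# Naive pattern (an assumption for this sketch): a house number, one to three
# capitalized words, then a short street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}(?:St|Ave|Rd|Blvd|Dr|Ln)\b"
)


def candidate_addresses(text: str) -> Iterator[str]:
    """Yield every substring that looks like a street address."""
    for match in ADDRESS_RE.finditer(text):
        yield match.group(0)


def first_geocodable(
    text: str, geocode: Callable[[str], Optional[Point]]
) -> Optional[Tuple[str, Point]]:
    """Return (address, point) for the first candidate that geocodes."""
    for address in candidate_addresses(text):
        point = geocode(address)
        if point is not None:
            return address, point
    return None  # Either nothing was extracted or nothing geocoded.


# Hypothetical geocoder that only knows one address.
def fake_geocode(address: str) -> Optional[Point]:
    known = {"100 Massachusetts Ave": (42.348127, -71.088191)}
    return known.get(address)


text = "Crash reported near 999 Nowhere Rd, later moved to 100 Massachusetts Ave."
result = first_geocodable(text, fake_geocode)
```

In this example the first candidate is extracted but fails to geocode, so the second one wins. Text such as "321 Main Street" would not be extracted at all by this pattern, which is the other kind of failure described above.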

Data Formats

RSS

For RSS, the GeoRSS extension (http://www.georss.org/) is ideal for providing both the location and the location name. The location name can go in 'georss:featurename' and the location can go in 'georss:point'. We might not need to use the 'title' or 'guid' fields; not sure yet about 'link'.

For example, here's a hypothetical police report feed entry:

<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
...
    <item>
      <title>Accident at 321 Main St</title>
      <description>ACCIDENT</description>
      <georss:featurename>321 Main St</georss:featurename>
      <georss:point>38.971684 -92.275641</georss:point>
      <pubDate>Sun, 12 Sep 2010 12:26:59 -0500</pubDate>
      <link>http://example-site.com/policereport/4567</link>
      <guid isPermaLink="true">http://example-site.com/policereport/4567</guid>
    </item>
...
</rss>
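
As a rough sketch of how a scraper might read such an entry, the following uses only Python's standard library. The `parse_items` function and the returned dictionary keys are assumptions made for this example, not part of OpenBlock, and the sample wraps the item in a `channel` element as RSS 2.0 requires.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# ElementTree addresses namespaced elements as "{namespace-uri}localname".
GEORSS = "{http://www.georss.org/georss}"

SAMPLE = """<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
  <channel>
    <item>
      <title>Accident at 321 Main St</title>
      <georss:featurename>321 Main St</georss:featurename>
      <georss:point>38.971684 -92.275641</georss:point>
      <pubDate>Sun, 12 Sep 2010 12:26:59 -0500</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Yield one dict per <item> with date, point, and location name."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        # georss:point is "latitude longitude", separated by whitespace.
        lat, lon = (float(v) for v in item.findtext(GEORSS + "point").split())
        yield {
            "title": item.findtext("title"),
            "location_name": item.findtext(GEORSS + "featurename"),
            "latitude": lat,
            "longitude": lon,
            # RSS dates use the RFC 822 style handled by email.utils.
            "date": parsedate_to_datetime(item.findtext("pubDate")),
        }


items = list(parse_items(SAMPLE))
```

This sketch assumes every item carries a 'georss:point'; a real scraper would need to handle items where it is missing.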

The 'geo' RDF namespace for WGS84 coordinates is also fine for coordinates, using the latitude & longitude or lat_long elements, although it doesn't provide a way to specify the location name. For example, here are two equivalent item excerpts:

<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos">
...
    <item>
      <geo:latitude>38.971684</geo:latitude>
      <geo:longitude>-92.275641</geo:longitude>
      ...
    </item>
    <item>
      <geo:lat_long>38.971684,-92.275641</geo:lat_long>
      ...
    </item>
...
</rss>
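
A scraper that wants to accept either form could normalize them as in this sketch. The `item_point` helper is hypothetical, and the namespace URI is copied verbatim from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URI as written in the feed example (an assumption of this sketch).
GEO = "{http://www.w3.org/2003/01/geo/wgs84_pos}"


def item_point(item):
    """Return (latitude, longitude) from either geo form, or None."""
    lat = item.findtext(GEO + "latitude")
    lon = item.findtext(GEO + "longitude")
    if lat is not None and lon is not None:
        return float(lat), float(lon)
    lat_long = item.findtext(GEO + "lat_long")
    if lat_long is not None:
        # lat_long is "latitude,longitude" in a single element.
        lat, lon = lat_long.split(",")
        return float(lat), float(lon)
    return None


SAMPLE = """<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos">
  <channel>
    <item>
      <geo:latitude>38.971684</geo:latitude>
      <geo:longitude>-92.275641</geo:longitude>
    </item>
    <item>
      <geo:lat_long>38.971684,-92.275641</geo:lat_long>
    </item>
    <item><title>No coordinates</title></item>
  </channel>
</rss>"""

points = [item_point(item) for item in ET.fromstring(SAMPLE).iter("item")]
```

The first two items yield the same point; the third yields None, which is the case where the scraper would have to fall back on geocoding.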

Atom

Atom feeds work well, and you can use GeoRSS or geo elements just like with RSS (see above). Here's our police report feed entry in Atom format with GeoRSS:

<feed xmlns="http://www.w3.org/2005/Atom"
  xmlns:georss="http://www.georss.org/georss">
...
    <entry>
      <title>Accident at 321 Main St</title>
      <summary>ACCIDENT</summary>
      <georss:featurename>321 Main St</georss:featurename>
      <georss:point>38.971684 -92.275641</georss:point>
      <updated>2010-09-12T12:26:59-05:00</updated>
      <link href="http://example-site.com/policereport/4567"/>
      <id>http://example-site.com/policereport/4567</id>
    </entry>
...
</feed>
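
Reading the Atom version differs mainly in that Atom elements live in a default namespace and the link is an attribute rather than element text. The sketch below uses only the standard library; `parse_entries` and its dictionary keys are assumptions for illustration. The timestamp offset is written with a colon, as RFC 3339 expects, though `%z` accepts it with or without one.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Prefix-to-URI map used by ElementTree's find/findall/findtext.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
  xmlns:georss="http://www.georss.org/georss">
  <entry>
    <title>Accident at 321 Main St</title>
    <georss:featurename>321 Main St</georss:featurename>
    <georss:point>38.971684 -92.275641</georss:point>
    <updated>2010-09-12T12:26:59-05:00</updated>
    <link href="http://example-site.com/policereport/4567"/>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Yield one dict per Atom <entry>."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("atom:entry", NS):
        lat, lon = entry.findtext("georss:point", namespaces=NS).split()
        link = entry.find("atom:link", NS)
        updated = entry.findtext("atom:updated", namespaces=NS)
        yield {
            "title": entry.findtext("atom:title", namespaces=NS),
            "location_name": entry.findtext("georss:featurename", namespaces=NS),
            "point": (float(lat), float(lon)),
            "url": link.get("href") if link is not None else None,
            "date": datetime.strptime(updated, "%Y-%m-%dT%H:%M:%S%z"),
        }


entries = list(parse_entries(SAMPLE))
```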

Or using geo instead of georss (coordinates only, no location name):

<feed xmlns="http://www.w3.org/2005/Atom"
  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos">
...
    <entry>
      <geo:latitude>38.971684</geo:latitude>
      <geo:longitude>-92.275641</geo:longitude>
      ...
    </entry>
...
</feed>

HTML

It's possible to scrape data from any HTML page, but this is a last resort. Because the data is structured for visual presentation rather than for machine-readability, your scraper script will break any time somebody redesigns the website. But sometimes HTML is all you can get.

KML

KML (http://code.google.com/apis/kml/documentation/) is not so great for our purposes, because all the interesting info other than the latitude/longitude is usually just unstructured HTML in a CDATA block. So it has the same problem we have with scraping any HTML page: the scraper can break any time the publisher changes the way things look.

GeoJSON

It may not be widespread yet, but GeoJSON (http://geojson.org) would work fine. Our police report example again:

{ "type": "FeatureCollection",
  "features": [
     { "type": "Feature",
       "geometry": {"type": "Point", "coordinates": [-92.275641, 38.971684]},
       "properties": {"type": "accident",
                      "updated": "2010-09-12T12:26:59-0500",
                      "location": "321 MAIN ST",
                      "apartment/lot": ""}
     },
     ...
  ]
}
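
One detail worth spelling out: GeoJSON coordinates are ordered longitude first, then latitude, which is the reverse of 'georss:point'. The sketch below reads the example with Python's standard `json` module. The `parse_features` function is hypothetical, and the property names ('location', 'updated') are simply those used in the example, since GeoJSON does not standardize them.

```python
import json

SAMPLE = """{ "type": "FeatureCollection",
  "features": [
     { "type": "Feature",
       "geometry": {"type": "Point", "coordinates": [-92.275641, 38.971684]},
       "properties": {"type": "accident",
                      "updated": "2010-09-12T12:26:59-0500",
                      "location": "321 MAIN ST",
                      "apartment/lot": ""}
     }
  ]
}"""


def parse_features(text):
    """Yield one dict per Point feature; other geometry types are skipped."""
    data = json.loads(text)
    for feature in data["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Point":
            continue
        # GeoJSON order is [longitude, latitude].
        lon, lat = geometry["coordinates"][:2]
        props = feature["properties"]
        yield {
            "latitude": lat,
            "longitude": lon,
            "location_name": props.get("location"),
            "updated": props.get("updated"),
        }


features = list(parse_features(SAMPLE))
```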

PDF

Extracting data from PDFs can be done, but it has the same problems as HTML.

CSV or Excel

This is actually better than HTML for our purposes. You'd just need to write a scraper script that knows what to do with each column.
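
A minimal sketch of such a script, using Python's standard `csv` module, might look like this. The column names (date, type, address, lat, lon) and the `parse_rows` function are invented for the example; a real source will have its own columns, and an Excel file would need a separate library to read.

```python
import csv
import io

# Hypothetical data: the column names are assumptions for this sketch.
SAMPLE = """date,type,address,lat,lon
2010-09-12,ACCIDENT,321 Main St,38.971684,-92.275641
2010-09-13,THEFT,400 Elm St,,
"""


def parse_rows(fileobj):
    """Yield one dict per CSV row, mapping columns to the three key fields."""
    for row in csv.DictReader(fileobj):
        # Coordinates may be blank; leave the point as None in that case.
        has_point = bool(row["lat"] and row["lon"])
        yield {
            "date": row["date"],
            "title": row["type"],
            "location_name": row["address"],
            "point": (float(row["lat"]), float(row["lon"])) if has_point else None,
        }


rows = list(parse_rows(io.StringIO(SAMPLE)))
```

The second row has an address but no coordinates, so it is the kind of entry that would then be passed to the geocoder.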