Crime Scraper

This is a very simple, single use-case scraper for DC crime data. Still, a few lessons can be learned from its implementation.

##Scraping

Because the source data required form navigation (dropdowns, checkboxes, etc.) to download a CSV, I used the selenium package, which automates browser functionality (in this case, Firefox). Running this script locally opens Firefox, navigates the form, and downloads the file (skipping the "save" dialog box).
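
As an illustration of how the "save" dialog can be skipped, here is a minimal sketch using Firefox download preferences. It is not the original script: it assumes the Selenium 4 `Options` API, the function names `download_prefs` and `build_driver` are hypothetical, and the MIME types are a guess at what the site sends for its export.

```python
def download_prefs(download_dir, mime_types=("text/csv", "text/plain")):
    """Firefox preferences that save matching downloads without a dialog."""
    return {
        # 2 tells Firefox to use the custom directory given below.
        "browser.download.folderList": 2,
        "browser.download.dir": download_dir,
        "browser.download.manager.showWhenStarting": False,
        # Assumed MIME types; adjust to whatever the server actually returns.
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }


def build_driver(download_dir):
    """Create a Firefox driver configured with the preferences above."""
    # Imported lazily so download_prefs() works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in download_prefs(download_dir).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Calling `build_driver("/home/ubuntu/data")` would then return a driver whose downloads land in that directory; the form navigation itself depends on the page and is left out.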

##Running remotely

Because this script runs on a remote Ubuntu server, I had to run Firefox headlessly (i.e. with no graphical user interface), which I did using the Xvfb virtual display server. On the server I ran:

####Installation:

$ apt-get install xvfb

####Start Xvfb on a display number that is unlikely to be in use (:98 in this case):

$ Xvfb :98 -ac

####Set the DISPLAY environment variable to :98:

$ export DISPLAY=:98

Now the selenium script can run remotely.
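
The same two steps could also be driven from Python instead of the shell. The sketch below is one possible approach, not part of the original setup; the helper names (`xvfb_command`, `display_env`, `run_with_xvfb`) are hypothetical, and it assumes `Xvfb` is on the `PATH`.

```python
import os
import subprocess


def xvfb_command(display_number=98):
    """Command line equivalent to `Xvfb :98 -ac`."""
    return ["Xvfb", f":{display_number}", "-ac"]


def display_env(display_number=98, base_env=None):
    """Copy of the environment with DISPLAY pointing at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = f":{display_number}"
    return env


def run_with_xvfb(script_args, display_number=98):
    """Start Xvfb, run a command against it, then stop Xvfb."""
    xvfb = subprocess.Popen(xvfb_command(display_number))
    try:
        # Note: this does not wait for Xvfb to finish starting up; a real
        # script may need a short delay or a readiness check here.
        result = subprocess.run(script_args, env=display_env(display_number))
        return result.returncode
    finally:
        xvfb.terminate()
        xvfb.wait()
```

For example, `run_with_xvfb(["/usr/bin/python", "/home/ubuntu/crimescrape.py"])` would run the scraper against display `:98` and shut Xvfb down afterwards.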

##Scheduling the script:

I wrote the following cron job to schedule the script (since the DC crime data updates daily):

0 5 * * * rm -f /home/ubuntu/data/SearchResults* && export DISPLAY=:98 && /usr/bin/python /home/ubuntu/crimescrape.py

Every day at 5:00 AM, the cron job deletes the old data file (SearchResults.txt), sets the DISPLAY environment variable, and runs the Python scraper.
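
The cleanup step (`rm -f /home/ubuntu/data/SearchResults*`) could alternatively live inside the scraper itself. This is only a sketch of that idea; `remove_old_downloads` is a hypothetical helper, not something in the original script.

```python
import glob
import os


def remove_old_downloads(data_dir, prefix="SearchResults"):
    """Delete files matching <data_dir>/<prefix>* and return the removed paths."""
    removed = []
    # glob.escape guards against special characters in the directory name.
    pattern = os.path.join(glob.escape(data_dir), prefix + "*")
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)
```

Like `rm -f`, it does nothing (and raises no error) when no matching file exists, so calling `remove_old_downloads("/home/ubuntu/data")` at the start of each run is safe on a fresh server.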

##Hosting the data

I installed nginx on the server, then modified the nginx config file:

vim /etc/nginx/nginx.conf

There I set the root for static files in the http/server block:

http {
    server {
        location / {
            root /path/to/data;
        }
    }
}

Then just navigate to http://your-server's-IP/ to download the most recent version of SearchResults.txt.
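
For fetching the file from a script rather than a browser, something like the following might work. It is a sketch under assumptions: the file is taken to be served as `/SearchResults.txt` directly under the configured root, and `data_url` and `fetch_latest` are hypothetical names.

```python
import urllib.request
from pathlib import Path


def data_url(host, filename="SearchResults.txt"):
    """URL of the data file, assuming it sits directly under the nginx root."""
    return f"http://{host}/{filename}"


def fetch_latest(host, dest, filename="SearchResults.txt", timeout=30):
    """Download the current data file from the server and save it to dest."""
    with urllib.request.urlopen(data_url(host, filename), timeout=timeout) as resp:
        Path(dest).write_bytes(resp.read())
    return Path(dest)
```

Here `host` is the server's IP address or hostname, for example `fetch_latest("203.0.113.10", "SearchResults.txt")` with a placeholder address.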