bus_catchers is an experimental web scraping suite that extracts schedule data from the websites of Bolt Bus, Greyhound, Peterpan, Megabus, Amtrak, and more. This is the code that powered http://www.buscatchers.com/ before the site was shut down for legal reasons. Although the source code is fairly messy, it is fully functional at the time of writing. At some point, I would like to develop a Python framework for extracting data from any website, including websites that rely heavily on JavaScript.
The code uses Selenium + Firefox to navigate through the websites and Scrapy to parse the HTML (see the sketch after the dependency list). It has the following dependencies:
- Selenium
A framework for controlling your browser. It can be downloaded at: https://pypi.python.org/pypi/selenium
- Scrapy
A framework for web scraping and HTML parsing. bus_catchers uses Scrapy's XML parsing engine extensively. You can download it at: https://pypi.python.org/pypi/Scrapy
- pyvirtualdisplay (with Xvfb) (optional, so that Firefox runs in the background)
A handy Python module that lets you run Selenium scripts without browser windows popping up everywhere. It can be downloaded at: https://pypi.python.org/pypi/PyVirtualDisplay
- MySQL-python (optional)
Allows you to execute MySQL queries directly from Python: https://pypi.python.org/pypi/MySQL-python
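As a minimal sketch of how these pieces fit together (the URL, XPath expressions, and display size below are placeholders, not what the actual scripts use):

    # Sketch only: drive Firefox with Selenium, optionally behind Xvfb,
    # and parse the rendered page with Scrapy's selector machinery.
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from scrapy.selector import Selector

    display = Display(visible=0, size=(1024, 768))  # optional: keep Firefox off-screen
    display.start()

    driver = webdriver.Firefox()
    try:
        driver.get("http://example.com/schedules")   # placeholder URL
        sel = Selector(text=driver.page_source)      # hand the rendered HTML to Scrapy
        for row in sel.xpath("//table[@id='results']//tr"):  # placeholder XPath
            print(row.xpath(".//td/text()").extract())
    finally:
        driver.quit()
        display.stop()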
Once the dependencies are installed, the code can be run from the terminal as:
python RUN_ME.py short
The "short" argument tells the program how many days in advance to scrape. Warning: running this command launches a web browser for each website to be scraped; the multiprocessing Python module is used to run the scripts in parallel.
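As a rough illustration of that dispatch (the runner function is hypothetical, and passing the "short"/"long" option through to the individual scripts is omitted; the actual RUN_ME.py may be structured differently):

    # Illustrative only: launch one per-site script per process.
    import subprocess
    from multiprocessing import Process

    SCRIPTS = ["get_greyhound.py", "get_amtrack.py"]  # one script per website

    def run_script(script):
        subprocess.call(["python", script])

    if __name__ == "__main__":
        procs = [Process(target=run_script, args=(s,)) for s in SCRIPTS]
        for p in procs:
            p.start()
        for p in procs:
            p.join()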
To scrape a single website at a time, you can run the scraping scripts individually, as in:
python get_greyhound.py
or:
python get_amtrack.py
etc.
The code supports the following features:
- Extensive logging, automatic emailing of log files (in the crawl_log directory)
- Direct conversion from parsed HTML to MySQL INSERT statements (in the sql_files directory). The outputs of the scripts are written to .sql files. A separate script (dump_sql.py) then imports the .sql files and removes the loaded queries from them; it runs periodically on a schedule set with the cron Unix utility (see the sketch after this list).
- Proxy support, so that if a site blocks you (this happens a lot with Peterpan) you can switch to a new proxy and keep scraping.
- Parallel execution. This aspect of the code is still in progress.
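A rough sketch of the dump_sql.py idea follows; the connection details, paths, and the naive statement splitting are hypothetical, not the actual implementation:

    # Sketch: load queued INSERT statements from the .sql files into MySQL,
    # then truncate the files so loaded queries are not imported twice.
    import glob
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="buscatchers")
    cur = conn.cursor()

    for path in glob.glob("sql_files/*.sql"):
        with open(path) as f:
            # naive split on ';' -- fine for simple INSERT statements
            statements = [s.strip() for s in f.read().split(";") if s.strip()]
        for stmt in statements:
            cur.execute(stmt)
        conn.commit()
        open(path, "w").close()  # remove the loaded queries from the file

    cur.close()
    conn.close()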
The crontab file in the python_code directory shows how the scheduling can be set up.
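For illustration, an entry along these lines would run the importer every 15 minutes (the interval and path are made up; the real schedule is in the provided crontab file):

    # Illustrative crontab entry, not the one shipped with the repo.
    */15 * * * * cd /path/to/python_code && python dump_sql.py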
The scripts need to be run from within the python_code directory; otherwise their outputs will end up in the wrong directories.
The outputs of the scripts are written to .sql files with a random integer from 1 to 5 in the file name. The base file name of the output .sql files is set by the :meth:`MyController.BusCatcher.setup_my_logger` method.
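In other words, the output file names look roughly like this (the base name and exact formatting here are hypothetical):

    import random
    base_name = "greyhound"  # hypothetical; the real base name comes from setup_my_logger
    sql_file = "%s_%d.sql" % (base_name, random.randint(1, 5))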
The settings for the scraping are found in python_code/MyDict.py. This file determines whether the scraping is done in the background using pyvirtualdisplay, how fast the scraping happens, etc.
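The key names below are hypothetical, but they illustrate the kind of options MyDict.py controls:

    # Hypothetical illustration; the real keys and values live in python_code/MyDict.py.
    SETTINGS = {
        "use_virtual_display": True,   # run Firefox in the background via pyvirtualdisplay
        "page_load_delay": 2.0,        # seconds to wait between page loads
        "days_ahead_short": 7,         # how far ahead a "short" run scrapes
    }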
Contact me at nicodjimenez [at] gmail.com if you have any questions / comments.