Quick Setup

To set up the collection program, you basically have to write a crawler class that implements the crawl method (inside the file crawler.py) in a way that suits your scenario, and adjust the settings inside the XML configuration file. Assuming the simplest possible configuration, the setup workflow is as follows:

  1. Implement the crawl method inside crawler.py (a sketch is given after this list)
  2. Create an XML configuration file, for example config.xml
  3. For the global settings, specify:
    • The hostname of the server's machine
    • The port where the server will listen for new connections
  4. For the server settings, indicate the persistence handler to be used (as well as the handler-specific configurations)
  5. For the client settings, specify the name of the crawler class
  6. On the server's machine, run the command python ./server.py config.xml
  7. On each client machine, start one or more clients by running the command python ./client.py config.xml
  8. Monitor and manage the collection process using the script manager.py
  9. Wait for the collection to finish
  10. Enjoy your new data!
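
As a reference for step 1, here is a minimal sketch of what the crawler class could look like. The class name MyCrawler matches the configuration shown below, but the crawl signature, its resourceID parameter, and the returned value are assumptions made for illustration only; check crawler.py in the repository for the exact interface the client expects.

# crawler.py
# A minimal sketch, assuming the client instantiates the class named in the
# configuration and calls its crawl method once per resource received from
# the server. The parameter and return conventions below are illustrative.

class MyCrawler:
    def crawl(self, resourceID):
        # Collect the data associated with resourceID in whatever way
        # suits your scenario (HTTP requests, API calls, file reads, ...)
        collectedData = {"resource": resourceID}
        # Hand the collected data back to the framework
        return collectedData

The <class> element in the client settings (step 5) is what ties this class name to the running clients.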

The final config.xml file would look something like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<config>
    <global>
        <connection>
            <address>myserver</address>
            <port>7777</port>
        </connection>
    </global>
    <server>
        <persistence>
            <!-- Handler-specific configurations -->
        </persistence>
    </server>
    <client>
        <crawler>
            <class>MyCrawler</class>
        </crawler>
    </client>
</config>