Quick Setup

To set up the collection program, you basically have to write a crawler class that implements the crawl method (inside the file crawler.py) in a way that suits your scenario, and adjust the settings inside the XML configuration file. Assuming the simplest possible configuration, the setup workflow is as follows:

  1. Implement the crawl method inside crawler.py (a sketch is given after this list)
  2. Create an XML configuration file, for example config.xml
  3. For the global settings, specify:
    • The hostname of the server's machine
    • The port where the server will listen for new connections
  4. For the server settings, indicate the persistence handler to be used (as well as the handler-specific configurations)
  5. For the client settings, specify the name of the crawler class
  6. On the server's machine, run the command python ./server.py config.xml
  7. On each client machine, start one or more clients by running the command python ./client.py config.xml
  8. Monitor and manage the collection process using the script manager.py
  9. Wait for the collection to finish
  10. Enjoy your new data!
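
As a reference for step 1, here is a minimal sketch of what the crawler class could look like. The class name MyCrawler matches the configuration shown below, but the crawl signature, its resourceID parameter, and the returned value are assumptions made for illustration only; check crawler.py in the repository for the exact interface the client expects.

# crawler.py
# A minimal sketch, assuming the client instantiates the class named in the
# configuration and calls its crawl method once per resource received from
# the server. The parameter and return conventions below are illustrative.

class MyCrawler:
    def crawl(self, resourceID):
        # Collect the data associated with resourceID in whatever way
        # suits your scenario (HTTP requests, API calls, file reads, ...)
        collectedData = {"resource": resourceID}
        # Hand the collected data back to the framework
        return collectedData

The <class> element in the client settings (step 5) is what ties this class name to the running clients.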

The final config.xml file would look something like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<config>
    <global>
        <connection>
            <address>myserver</address>
            <port>7777</port>
        </connection>
    </global>
    <server>
        <persistence>
            <!-- Handler-specific configurations -->
        </persistence>
    </server>
    <client>
        <crawler>
            <class>MyCrawler</class>
        </crawler>
    </client>
</config>