odbol edited this page Aug 13, 2011 · 9 revisions

max (default: 1)

The maximum number of threads (calls to run()) allowed to run concurrently, per process. When scraping, this can be used to limit the number of concurrent requests.

take (default: 1)

How many elements of input to send to each thread. If this is greater than 1, run() receives an array of elements instead of a single element.

Example when take: 2

input: [0,1,2,3,4],
run: function(input) {
    console.log(input);  //Outputs [0,1] \n [2,3] \n [4] \n
} 
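
The batching shown above can be modelled in plain JavaScript. This is only a sketch of the behaviour for take greater than 1, not node.io's own code; chunkInput is a hypothetical helper name.

```javascript
// Hypothetical helper, not part of node.io: split input into batches of
// `take` elements, the way run() would receive them when take > 1.
function chunkInput(input, take) {
    var batches = [];
    for (var i = 0; i < input.length; i += take) {
        // slice() stops at the end of the array, so the last batch may be shorter
        batches.push(input.slice(i, i + take));
    }
    return batches;
}

var batches = chunkInput([0, 1, 2, 3, 4], 2);
console.log(batches); // [ [ 0, 1 ], [ 2, 3 ], [ 4 ] ]
```

Note that the final batch holds whatever is left over, which is why run() sees [4] on its last call.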

retries (default: 2)

The maximum number of times an element (or batch of elements) of input can be retried using retry() before the thread fails and fail() is called.
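
As a rough illustration of the retry budget, the sketch below models one element being attempted repeatedly. It assumes the first attempt does not count as a retry, so retries: 2 allows up to three attempts in total; that reading is an assumption, and processWithRetries is a hypothetical name, not node.io's implementation.

```javascript
// Hypothetical model, not node.io's implementation.
// `attempt` returns true on success and false where a job would call retry().
function processWithRetries(attempt, retries) {
    var retried = 0;
    while (true) {
        if (attempt()) {
            return { status: 'ok', retried: retried };
        }
        if (retried >= retries) {
            // Retry budget exhausted: this is where fail() would be called
            return { status: 'failed', retried: retried };
        }
        retried++;
    }
}

var calls = 0;
var result = processWithRetries(function () {
    calls++;
    return false; // simulate a request that always fails
}, 2);
console.log(result.status, calls); // failed 3
```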

wait (default: undefined)

The amount of time (in seconds) to wait between threads. This is useful when the API or server you are scraping limits how many requests you can make in a given period. You can use this option along with the max option to limit the number of concurrent requests. Note that the wait also applies when you call skip().
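
One way to pick values for wait and max from a known rate limit is sketched below. It assumes each of the max concurrent threads starts at most one request per wait seconds; node.io's actual scheduling may differ, and waitForRateLimit is a hypothetical helper, so treat the result as a starting point rather than a guarantee.

```javascript
// Hypothetical helper, not part of node.io: derive a `wait` value (seconds)
// from a rate limit of `maxRequests` per `perSeconds`.
// Assumption: each of the `max` concurrent threads makes at most one
// request every `wait` seconds.
function waitForRateLimit(maxRequests, perSeconds, max) {
    return (perSeconds * max) / maxRequests;
}

var options = {
    max: 2,
    wait: waitForRateLimit(60, 60, 2) // limit of 60 requests per 60 seconds
};
console.log(options); // { max: 2, wait: 2 }
```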

auto_retry (default: false)

When this is set to true, failed requests and threads that throw an exception automatically call this.retry().

timeout (default: false)

The maximum amount of time (in seconds) each thread can run for before fail() is called.

global_timeout (default: false)

The maximum amount of time (in seconds) the entire job has to complete before it exits with an error. This option can also be set from the command line using the -t or --timeout switch.

flatten (default: true)

When emit() is called with an array argument, this option determines whether the array is flattened before being output.

Example when max: 3

run: function() {
    this.emit([1,2,3]);
}
output: function(output) {
    console.log(output);
    //When flatten is true (default) this outputs [1,2,3,1,2,3,1,2,3] 
    //When flatten is false this outputs [ [1,2,3],[1,2,3],[1,2,3] ]
}
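
The difference between the two settings can be reproduced with a small stand-alone model. This is a sketch only: collectOutput is a hypothetical function, and it assumes flattening goes one level deep, which matches the example above but is not confirmed for nested arrays.

```javascript
// Hypothetical model, not node.io's implementation: gather the values
// passed to emit() by several threads into the array given to output().
function collectOutput(emitted, flatten) {
    var output = [];
    emitted.forEach(function (value) {
        if (flatten && Array.isArray(value)) {
            // Merge the emitted array's elements into the output (one level)
            output = output.concat(value);
        } else {
            // Keep the emitted value as a single entry
            output.push(value);
        }
    });
    return output;
}

// Three threads, each calling this.emit([1,2,3])
var emitted = [[1, 2, 3], [1, 2, 3], [1, 2, 3]];
console.log(collectOutput(emitted, true));  // [1,2,3,1,2,3,1,2,3]
console.log(collectOutput(emitted, false)); // [ [1,2,3],[1,2,3],[1,2,3] ]
```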

benchmark (default: false)

If this is true, node.io outputs benchmark information when a job completes: 1) completion time, 2) bytes read + speed, 3) bytes written + speed. This can also be enabled from the command line using the -b or --benchmark switch.

fork (default: false)

EDIT: Currently broken - fix coming soon.

Whether to use child processes to distribute processing. Set this to the number of desired workers. This can also be enabled from the command line using the -f or --fork switch. Run node.io --help for details.

input (default: false)

This option sets a limit on how many lines / rows / elements are input before the job is forced to complete.

Example when input: 100 and var i = 0;

input: function () {
    return i++;
}
run: function(num) {
    console.log(num); //Outputs the numbers 0 to 99 (100 elements)
}
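
The effect of the limit can be sketched without node.io by calling an input function until the limit is reached. This assumes the limit counts elements, so a limit of 3 yields three values; readInput is a hypothetical name used only for illustration.

```javascript
// Hypothetical model, not node.io's implementation: call the input
// function until `limit` elements have been collected.
function readInput(inputFn, limit) {
    var values = [];
    while (values.length < limit) {
        values.push(inputFn());
    }
    return values;
}

var i = 0;
var values = readInput(function () {
    return i++;
}, 3);
console.log(values); // [ 0, 1, 2 ]
```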

recurse (default: false)

If input is a directory, this option is used to recurse through each subdirectory.

read_buffer (default: 8096)

The size of the read buffer to use when reading files.

newline (default: \n)

The character to use as the newline when outputting data. Note that input newlines are automatically detected as \n or \r\n.
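
The combination of detected input newlines and a configured output newline can be illustrated as follows. This is a sketch of the idea rather than node.io's code; normalizeNewlines is a hypothetical helper.

```javascript
// Hypothetical helper, not part of node.io: accept either \n or \r\n on
// input and write lines back out with the configured newline.
function normalizeNewlines(text, newline) {
    return text.split(/\r?\n/).join(newline);
}

// Mixed input newlines, Windows-style output newline
var out = normalizeNewlines('a\r\nb\nc', '\r\n');
console.log(JSON.stringify(out)); // "a\r\nb\r\nc"
```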

encoding (default: 'utf8')

The encoding to use when reading and writing data.

jsdom (default: false)

Whether to use JSDOM to parse HTML (the default is to use node-htmlparser). If JSDOM is used, jQuery is used as the default $ object.

external_resources (default: false)

If you set jsdom to true and want to fetch and process external JavaScript files, set external_resources to ['script']. Other values will not work.

proxy (default: false)

All requests will be made through this proxy. Alternatively, you can specify a function that returns a proxy (e.g. to cycle proxies).
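
A function that cycles through proxies might look like the sketch below. The proxy URLs and their format are placeholders, and makeProxyCycler is a hypothetical helper; only the idea of passing a function as the proxy option comes from the description above.

```javascript
// Hypothetical helper, not part of node.io: returns a function that hands
// out the proxies in round-robin order, one per call.
function makeProxyCycler(proxies) {
    var next = 0;
    return function () {
        var proxy = proxies[next];
        next = (next + 1) % proxies.length;
        return proxy;
    };
}

var options = {
    // Placeholder addresses; the expected proxy string format is an assumption
    proxy: makeProxyCycler([
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080'
    ])
};

console.log(options.proxy()); // http://proxy1.example.com:8080
console.log(options.proxy()); // http://proxy2.example.com:8080
console.log(options.proxy()); // http://proxy1.example.com:8080
```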

redirects (default: 3)

The maximum number of redirects to follow before calling fail().

args (default: [])

This option is automatically filled with any extra arguments passed to the command line.

Example

$ node.io myjob arg1 arg2
    => this.options.args = ['arg1','arg2']