Getting Started
Data scraping and processing code is organised into modular and extendable jobs written in JavaScript or CoffeeScript. A typical node.io job consists of taking some input, processing or reducing it in some way, and then outputting the emitted results, although no step is compulsory; some scraping jobs, for example, don't require input.
Jobs can be run from the command line or through a web interface. To run a job from the command line (extension can be omitted), run
$ node.io myjob
To run jobs through the web interface, copy your jobs to ~/.node_modules and run
$ node.io-web -p 8080
The web interface can be accessed at http://localhost:8080/
Sometimes a job may behave incorrectly. To find out why and see what's going on under the hood, use the -g or --debug switch
$ node.io --debug myjob
Each example includes a JavaScript and CoffeeScript version and omits the required var nodeio = require('node.io');
Example 1: Hello World!
hello.js
exports.job = new nodeio.Job({
    input: false,
    run: function () {
        this.emit('Hello World!');
    }
});
hello.coffee
class Hello extends nodeio.JobClass
    input: false
    run: (num) -> @emit 'Hello World!'
@class = Hello
@job = new Hello()
To run the example
$ node.io -s hello
=> Hello World!
Note: the -s switch omits status messages from output
Example 2: Double each element of input
double.js
exports.job = new nodeio.Job({
    input: [0, 1, 2],
    run: function (num) {
        this.emit(num * 2);
    }
});
double.coffee
class Double extends nodeio.JobClass
    input: [0, 1, 2]
    run: (num) -> @emit num * 2
@class = Double
@job = new Double()
Example 3: Inheritance
quad.js
var Double = require('./double').job;
exports.job = Double.extend({
    run: function (num) {
        Double.run.call(this, num * 2);
        // Same as: this.emit(num * 4)
    }
});
quad.coffee
# double.coffee exports its class as @class, so it is read back as .class
Double = require('./double').class
class Quad extends Double
    run: (num) -> super num * 2
@class = Quad
@job = new Quad()
Job options
Options allow you to easily incorporate common or complex behavior. A full list of options can be found in the API.
Options are specified as an object containing key/value pairs
var options = {
    timeout: 10, // Time out after 10 seconds
    max: 20,     // Run 20 threads concurrently (when run() is async)
    retries: 3   // Threads can retry 3 times before failing
};
// methods is the object containing input, run(), etc., as in the examples above
exports.job = new nodeio.Job(options, methods);
Determining when a job is complete
Because node.io is asynchronous, it needs to be able to determine when each thread (a call to run()) is complete, and when the entire job is complete.
A thread is complete after:
- emit(), fail(), retry() or skip() has been called - any subsequent calls in the same thread are ignored
- An option, such as timeout, causes the thread to automatically call one of the methods above
- run() returns something other than null - in this case, the return value is emitted
Important: if none of the above conditions is met, the thread will hang indefinitely
The job is complete when:
- All of the input has been consumed, or in the case of input: false, when one thread has completed
- exit() is called
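The thread-completion rules above can be sketched with a small stand-in runner. This is an illustrative model only, not node.io's implementation: runThread and the shape of its result are hypothetical, and treating undefined the same as null for return values is an assumption.

```javascript
// Illustrative stand-in for a single node.io thread (hypothetical helper,
// not part of node.io). It models only the completion rules described above.
function runThread(run, input) {
    var result = { complete: false, method: null, value: undefined };

    function finish(method, value) {
        if (result.complete) return; // subsequent calls in the same thread are ignored
        result.complete = true;
        result.method = method;
        result.value = value;
    }

    var thread = {
        emit: function (value) { finish('emit', value); },
        fail: function () { finish('fail'); },
        retry: function () { finish('retry'); },
        skip: function () { finish('skip'); }
    };

    var returned = run.call(thread, input);
    // Assumption: undefined is treated like null, so a run() that returns
    // nothing does not complete the thread by itself.
    if (returned !== null && returned !== undefined) finish('emit', returned);
    return result;
}

// emit() completes the thread; the later skip() is ignored
var first = runThread(function (num) { this.emit(num * 2); this.skip(); }, 2);

// A non-null return value is emitted
var second = runThread(function (num) { return num * 2; }, 3);

// Nothing is called or returned: in node.io this thread would hang
// unless an option such as timeout completed it
var third = runThread(function (num) {}, 1);

console.log(first.method, first.value);   // emit 4
console.log(second.method, second.value); // emit 6
console.log(third.complete);              // false
```

The third case is the one to watch for in asynchronous jobs: every code path in run(), including error callbacks, should end in one of the completion methods.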
Passing arguments to jobs
Jobs can also accept arguments from the command line, e.g.
$ node.io myjob arg1 arg2 arg3
Arguments can be accessed through this.options.args, e.g.
run: function () {
    console.log(this.options.args[0]); // "arg1"
}
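As a slightly fuller sketch, the hypothetical job below multiplies each input by a factor passed on the command line (e.g. `$ node.io multiply 3`). The job name, the fallback factor, and the assumption that arguments arrive as strings are all illustrative. The stub object at the end merely stands in for node.io so that run() can be exercised on its own.

```javascript
// multiply.js (hypothetical job): multiply each input by the first argument
var methods = {
    input: [0, 1, 2],
    run: function (num) {
        // Assumption: arguments arrive as strings, so parse explicitly
        var factor = parseInt(this.options.args[0], 10);
        if (isNaN(factor)) factor = 1; // fall back when no valid argument is given
        this.emit(num * factor);
    }
};
// In a real job file: exports.job = new nodeio.Job(methods);

// Stand-in for the thread context (not node.io itself), used to try run() in isolation
var emitted = [];
var stub = {
    options: { args: ['3'] },
    emit: function (value) { emitted.push(value); }
};
methods.input.forEach(function (num) { methods.run.call(stub, num); });
console.log(emitted.join(',')); // 0,3,6
```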
Retrying, skipping or failing a thread
To retry or skip a thread, use the retry() or skip() methods (no arguments required), e.g. to remove empty lines
remove_empty_lines.js
exports.job = new nodeio.Job({
    run: function (line) {
        if (line.trim() === '') {
            this.skip();
        } else {
            this.emit(line);
        }
    }
});
Some job options (timeout, retries, redirects) cause fail() to be called automatically once their limit is reached, e.g. when a thread runs longer than the timeout
exports.job = new nodeio.Job({timeout: 5}, {
    run: function (input) {
        // There are no conditions that would cause this thread to be marked
        // as complete, so it will time out after 5 seconds
    },
    fail: function (input, status) {
        // status = "timeout"
        // You still need to complete the thread with an emit or skip, etc.
        this.emit('Thread failed');
    }
});
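One possible pattern, sketched below under assumptions, is to have the fail() handler emit a marker line so that failed inputs stay visible in the job's output instead of disappearing silently. The marker format and the example URL are hypothetical, run() is omitted for brevity, and the stub object only stands in for node.io so the handler can be tried on its own.

```javascript
var methods = {
    // run() omitted for brevity
    fail: function (input, status) {
        // Complete the thread by emitting a marker that records what failed and why
        this.emit('FAILED (' + status + '): ' + input);
    }
};
// In a real job file: exports.job = new nodeio.Job({timeout: 5}, methods);

// Stand-in for the thread context (not node.io itself)
var emitted = [];
var stub = { emit: function (value) { emitted.push(value); } };
methods.fail.call(stub, 'http://example.com/', 'timeout');
console.log(emitted[0]); // FAILED (timeout): http://example.com/
```

Calling this.skip() in the handler instead would drop the failed input from the output entirely; which is preferable depends on whether downstream processing needs to know about failures.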
Go to part 2: Working with input / output