The Diffbot PHP interface is a class named diffbot. You can create one or more instances if necessary.
- Ensure that the JSON PECL extension is installed on your system. As of PHP 5.2.0, the extension is bundled and compiled into PHP by default. For older versions, see the JSON installation documentation for details.
- Place diffbot.class.php in your PHP library directory (e.g. /usr/share/php/).
- Include the file once to use its functions. E.g.:
require_once '/usr/share/php/diffbot.class.php';
If JSON is not supported by your system, the class throws an exception.
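If you would rather detect a missing JSON extension before loading the class, a check along these lines may help. This is a minimal sketch: json_is_available() is a hypothetical helper and is not part of diffbot.class.php.

```php
<?php
// Hypothetical helper (not part of diffbot.class.php): report whether the
// JSON functions are available in this PHP build.
function json_is_available()
{
    return function_exists('json_encode') && function_exists('json_decode');
}

if (json_is_available()) {
    echo "JSON support found, diffbot.class.php can be loaded\n";
} else {
    echo "JSON support is missing, install the extension first\n";
}
```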
First, create a diffbot object. The only mandatory parameter is your personal developer token. The second, optional parameter is the API version.
require_once 'diffbot.class.php';
$diffbot = new diffbot("DEVELOPER_TOKEN", 2);
The same diffbot object can then be used to call the Diffbot API any number of times.
diffbot {
/* variables */
var $logfile = "diffbot.log";
var $timeformat = "Y-m-d H:i:sP";
var $timezone = "PST";
var $tracefile = "diffbot.trc";
var $diffbot_base = "http://api.diffbot.com/v%d/%s?";
/* methods */
public __construct(string $Token [, int $Version=2] )
/* automatic APIs */
public object analyze(string $Url [, array $Fields] )
public object article(string $Url [, array $Fields] )
public object frontpage(string $Url [, array $Fields] )
public object product(string $Url [, array $Fields] )
public object image(string $Url [, array $Fields] )
/* crawlbot API */
public object crawlbot_start(string $name, mixed $seeds, mixed $apiQuery=false [, array $Options ] )
public object crawlbot_pause(string $name) // pause a running job
public object crawlbot_continue(string $name) // continue a paused job
public object crawlbot_restart(string $name) // restart a job, cleaning previous results
public object crawlbot_delete(string $name) // delete a job with all of its results
}
Each option is a public variable, so you can change its default value after the object is created.
- $logfile is the name of the file where API names and passed URLs are logged. If set to false, no logging is performed.
- $timeformat is the date() format string used for timestamps in the log file.
- $timezone is the timezone used for timestamps in the log file.
- $tracefile is the name of the trace file where raw request and response data is saved for debugging purposes. In a production environment, you should set it to false to disable tracing.
- $diffbot_base contains the URL pattern used when calling the Diffbot API. The first placeholder is replaced with the version number, the second with the API name. Usually, you do not need to change this.
E.g., to disable trace information:
$diffbot->tracefile = false;
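To make the $diffbot_base and $timeformat options more concrete, the sketch below fills the default URL pattern with sprintf() and formats a timestamp with the default time format. Both helpers are hypothetical and only illustrate the idea. The query parameter names "token" and "url" are assumptions, and the class itself may build its requests and log lines differently.

```php
<?php
// Hypothetical helper: fill the URL pattern with version and API name,
// then append the query string. Parameter names are assumptions.
function build_request_url($base, $version, $api, array $params)
{
    return sprintf($base, $version, $api) . http_build_query($params, '', '&');
}

// Hypothetical helper: format a Unix timestamp in the given timezone.
function format_log_time($timestamp, $format, $timezone)
{
    $dt = new DateTime('@' . $timestamp);
    $dt->setTimezone(new DateTimeZone($timezone));
    return $dt->format($format);
}

echo build_request_url(
    "http://api.diffbot.com/v%d/%s?",
    2,
    "article",
    array("token" => "DEVELOPER_TOKEN", "url" => "http://diffbot.com/products/")
), "\n";
// http://api.diffbot.com/v2/article?token=DEVELOPER_TOKEN&url=http%3A%2F%2Fdiffbot.com%2Fproducts%2F

echo format_log_time(0, "Y-m-d H:i:sP", "UTC"), "\n";
// 1970-01-01 00:00:00+00:00
```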
Each API has its own public function, named after the API. The first, mandatory parameter is the URL to be analyzed. The second, optional parameter is an array of the fields to be returned. Each function returns an object hierarchy, or false if an error occurs.
Code:
require_once 'diffbot.class.php';
$d = new diffbot("DEVELOPER_TOKEN");
$d->timezone = "CET"; // set the logging timezone to Central European Time
$c = $d->analyze("http://diffbot.com/products/");
var_dump($c);
Returns:
object(stdClass)#2 (4) {
["title"]=> string(17) "Diffbot: Products"
["type"]=> string(4) "serp"
["human_language"]=> string(2) "en"
["url"]=> string(28) "http://diffbot.com/products/"
}
Code:
require_once 'diffbot.class.php';
$d = new diffbot("DEVELOPER_TOKEN");
$fields = array("icon","text","title"); // fields to be returned
$c = $d->article("http://diffbot.com/products/", $fields);
var_dump($c);
Returns:
object(stdClass)#2 (6) {
["author"]=>
string(0) ""
["icon"]=>
string(34) "http://diffbot.com/favicon.ico?v=2"
["text"]=>
string(294) ""name": "Automatic APIs", "type": "computer vision", "author": "Diffy", "target": "common web pages"
"name": "Custom API Toolkit", "type": "custom extraction", "author": "Diffy", "target": "any kind of page"
"name": "Crawlbot", "type": "spidering", "author": "Diffy", "target": "entire domains""
["title"]=>
string(8) "Products"
["type"]=>
string(7) "article"
["url"]=>
string(28) "http://diffbot.com/products/"
}
To choose $Fields, see the official API documentation:
- http://diffbot.com/products/automatic/classifier/
- http://diffbot.com/products/automatic/article/
- http://diffbot.com/products/automatic/frontpage/
- http://diffbot.com/products/automatic/product/
- http://diffbot.com/products/automatic/image/
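Because each API function returns either an object hierarchy or false, it can be worth guarding every access to the result. The following is a minimal sketch of one way to do that. field_or_default() is a hypothetical helper, not part of the class, and the stdClass object below only stands in for a real API response.

```php
<?php
// Hypothetical helper: read one field from an API result, falling back to
// a default when the call failed (false) or the field is absent.
function field_or_default($result, $field, $default = null)
{
    if ($result === false) {
        return $default; // the API call reported an error
    }
    return isset($result->$field) ? $result->$field : $default;
}

// Stand-in for a successful response object.
$ok = new stdClass();
$ok->title = "Products";

echo field_or_default($ok, "title", "(none)"), "\n";    // Products
echo field_or_default($ok, "author", "(none)"), "\n";   // (none)
echo field_or_default(false, "title", "(none)"), "\n";  // (none)
```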
Synopsis:
public object crawlbot_start(string $name, mixed $seeds, mixed $apiQuery=false [, array $Options ] )
The parameters are:
- name - The name of your crawl job.
- seeds - The URL(s) to crawl. Pass a single URL as a string, or several URLs as an array.
- apiQuery - If you set this parameter to false or omit it, your crawl runs in automatic mode.
Otherwise, it defines which Diffbot API the crawlbot should use. It is an associative array with the following keys:
- api : the name of a Diffbot API, e.g. "article"
- fields (optional) : array of field names to be processed, e.g. array("meta","image")
- Options - An associative array of optional crawl arguments for refining your crawl. See the crawl documentation for details.
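The accepted shapes of $seeds and $apiQuery can be checked before starting a job. The sketch below is only an illustration of the rules listed above. normalize_seeds() and is_valid_api_query() are hypothetical helpers, and the class may validate its arguments differently or not at all.

```php
<?php
// Hypothetical helper: accept one URL as a string or several as an array,
// and always return a list of URLs.
function normalize_seeds($seeds)
{
    return is_array($seeds) ? array_values($seeds) : array($seeds);
}

// Hypothetical helper: check that $apiQuery is either false (automatic mode)
// or an array with an "api" name and an optional "fields" array.
function is_valid_api_query($apiQuery)
{
    if ($apiQuery === false) {
        return true; // automatic mode
    }
    if (!is_array($apiQuery) || empty($apiQuery["api"])) {
        return false;
    }
    if (isset($apiQuery["fields"]) && !is_array($apiQuery["fields"])) {
        return false;
    }
    return true;
}

print_r(normalize_seeds("http://diffbot.com/"));
var_dump(is_valid_api_query(array("api" => "article", "fields" => array("meta", "image"))));
var_dump(is_valid_api_query(array("fields" => array("meta"))));
```

With these inputs, the first var_dump() call reports true and the second reports false, because the "api" key is missing.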
require_once 'diffbot.class.php';
$d = new diffbot("DEVELOPER_TOKEN");
$ret = $d->crawlbot_start(
    "testJob",
    "http://diffbot.com/",
    false,                      // automatic mode
    array("maxToProcess" => 5)  // optional crawl arguments
);
print_r($ret->response);
Returns:
Successfully added urls for spidering.
require_once 'diffbot.class.php';
$d = new diffbot("DEVELOPER_TOKEN");
$ret = $d->crawlbot_start(
    "testJob",
    "http://diffbot.com/",
    array(
        "api" => "product",
        "fields" => array("querystring", "meta")
    ),
    array("maxToProcess" => 5)
);
print_r($ret->response);
require_once 'diffbot.class.php';
$d = new diffbot("DEVELOPER_TOKEN");
$ret = $d->crawlbot_pause("testJob");
require_once 'diffbot.class.php';
$d = new diffbot("DEVELOPER_TOKEN");
$ret = $d->crawlbot_delete("testJob");
print_r($ret->response);
Returns:
Successfully deleted job.