PHP web scraping made easy.
Please note: documentation is always a work in progress; please excuse any errors.
You can install the package via composer:
composer require scrapy/scrapy
Scrapy is essentially a reader which can modify the data it reads through a series of tasks. To simply read a URL you can do the following.
use Scrapy\Builders\ScrapyBuilder;
$html = ScrapyBuilder::make()
->url('https://www.some-url.com')
->build()
->scrape();
Just reading HTML from some source is not a lot of fun. Scrapy allows you to crawl HTML with a simple yet expressive API relying on Symfony's DOM Crawler.
You can think of parsers as actions meant to extract data valuable to you from HTML.
Parsers are meant to be self-contained scraping rules allowing you to extract data from an HTML string.
use Scrapy\Parsers\Parser;
use Scrapy\Crawlers\Crawly;
class ImageParser extends Parser
{
public function process(Crawly $crawly, array $output): array
{
$output['hello'] = $crawly->filter('h1')->string();
return $output;
}
}
Once you have your parsers defined, it's time to add them to Scrapy.
use Scrapy\Builders\ScrapyBuilder;
// Add by class reference
ScrapyBuilder::make()
->parser(ImageParser::class);
// Add concrete instance
ScrapyBuilder::make()
->parser(new ImageParser());
// Add multiple parsers
ScrapyBuilder::make()
->parsers([ImageParser::class, new ImageParser()]);
You don't have to write a class for each parser, you can also do inline parsing. Let's see how that would look.
use Scrapy\Crawlers\Crawly;
use Scrapy\Builders\ScrapyBuilder;
ScrapyBuilder::make()
->parser(function (Crawly $crawly, array $output) {
$output['count'] = $crawly->filter('li')->count();
return $output;
});
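Since each parser receives the output array produced by the parsers before it and returns it, multiple parsers can build up a result step by step. Here is a minimal sketch, assuming parsers run in the order they are registered and that scrape() returns the accumulated output once parsers are set:
use Scrapy\Crawlers\Crawly;
use Scrapy\Builders\ScrapyBuilder;
$result = ScrapyBuilder::make()
    ->url('https://www.some-url.com')
    ->parser(function (Crawly $crawly, array $output) {
        // First parser: collect all list items as trimmed strings.
        $output['items'] = $crawly->filter('li')->map(function (Crawly $item) {
            return $item->trim()->string();
        });
        return $output;
    })
    ->parser(function (Crawly $crawly, array $output) {
        // Second parser: build on what the first parser produced.
        $output['item_count'] = count($output['items']);
        return $output;
    })
    ->build()
    ->scrape(); // assumed to return the accumulated output array (see note above)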
Sometimes you want to pass some extra context to your parsers. With Scrapy, you can pass an associative array of parameters which becomes available to every parser.
use Scrapy\Crawlers\Crawly;
use Scrapy\Builders\ScrapyBuilder;
ScrapyBuilder::make()
->params(['foo' => 'bar'])
->parser(function (Crawly $crawly, array $output) {
$output['foo'] = $this->param('foo'); // 'bar'
$output['baz'] = $this->has('baz'); // false
$output['bar'] = $this->param('baz'); // null
return $output;
});
The same principle applies whether you define parsers as separate classes or inline them as functions.
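For example, a class-based parser could access the same parameters through the helpers shown above. A minimal sketch; the 'foo' and 'baz' parameters are only illustrative:
use Scrapy\Parsers\Parser;
use Scrapy\Crawlers\Crawly;
class FooParser extends Parser
{
    public function process(Crawly $crawly, array $output): array
    {
        // Same param helpers as in the inline example above.
        $output['foo'] = $this->param('foo');   // 'bar'
        $output['has_baz'] = $this->has('baz'); // false
        return $output;
    }
}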
You might have noticed that the first argument to a parser's process method is an instance of the Crawly class.
Crawly is an HTML crawling tool. It is based on Symfony's DOM Crawler.
An instance of Crawly can be made from any string.
use Scrapy\Crawlers\Crawly;
$crawly1 = new Crawly('<ul><li>Hello World!</li></ul>');
$crawly2 = new Crawly('Hello World!');
$crawly1->html(); // '<ul><li>Hello World!</li></ul>'
$crawly2->html(); // '<body>Hello World!</body>'
Crawly provides a few helper methods allowing you to extract the data you want from HTML more easily.
Allows you to filter elements with a CSS selector, similar to what document.querySelectorAll('...') does.
$crawly = new Crawly('<ul><li>Hello World!</li></ul>');
$crawly->filter('li')->html(); // <li>Hello World!</li>
Narrow your selection by taking the first element from it.
$crawly = new Crawly('<ul><li>Hello</li><li>World!</li></ul>');
$crawly->filter('li')->first()->html(); // <li>Hello</li>
Narrow your selection by taking the nth element from it. Note that indices are 0-based.
$crawly = new Crawly('<ul><li>Hello</li><li>World!</li></ul>');
$crawly->filter('li')->nth(1)->html(); // <li>World!</li>
Get access to Symfony's DOM Crawler.
Crawly does not aim to replace Symfony's DOM Crawler, rather just to make its usage more pleasant. That's why not all methods are exposed directly through Crawly. Using the raw method allows you to utilise the underlying Symfony crawler.
$crawly = new Crawly('<ul><li>Hello</li><li>World!</li></ul>');
$crawly->filter('li')->first()->raw()->html(); // Hello
Trims the output string.
$crawly = new Crawly('<div><span> Hello! </span></div>');
$crawly->filter('span')->trim()->string(); // 'Hello!'
Extracts attributes from the selection.
$crawly = new Crawly('<ul><li attr="1">1</li><li attr="2">2</li></ul>');
$crawly->filter('li')->pluck(['attr']); // ["1","2"]
$crawly = new Crawly('<img width="200" height="300"></img><img width="400" height="500"></img>');
$crawly->filter('img')->pluck(['width', 'height']); // [ ["200", "300"], ["400", "500"] ]
Returns the count of currently selected nodes.
$crawly = new Crawly('<ul><li>1</li><li>2</li></ul>');
$crawly->filter('li')->count(); // 2
Returns the integer value of the current selection.
$crawly = new Crawly('<span>123</span>');
$crawly->filter('span')->int(); // 123
// Use default if selection is not numeric
$crawly = new Crawly('');
$crawly->filter('span')->int(55); // 55
Returns the float value of the current selection.
$crawly = new Crawly('<span>18.5</span>');
$crawly->filter('span')->float(); // 18.5
// Use default if selection is not numeric
$crawly = new Crawly('');
$crawly->filter('span')->float(22.4); // 22.4
Returns the current selection's inner content as a string.
$crawly = new Crawly('<span>Hello World!</span>');
$crawly->filter('span')->string(); // 'Hello World!'
// Use default in case an exception arises
$crawly = new Crawly('');
$crawly->filter('non-existing-selection')->string('Hello'); // 'Hello'
Returns the HTML string representation of the current selection, including the selected element itself.
$crawly = new Crawly('<span>Hello World!</span>');
$crawly->filter('span')->html(); // <span>Hello World!</span>
// Use default in case an exception arises
$crawly = new Crawly('');
$crawly->filter('non-existing-selection')->html('<div>Hi</div>'); // <div>Hi</div>
Returns the HTML string representation of the current selection, excluding the selected element itself.
$crawly = new Crawly('<span>Hello World!</span>');
$crawly->filter('span')->innerHtml(); // 'Hello World!'
// Use default to handle exceptional cases
$crawly = new Crawly('');
$crawly->filter('non-existing-selection')->innerHtml('<div>Hi</div>'); // 'Hi'
Checks if the given selection exists. You can get a boolean response or raise an exception.
$crawly = new Crawly('<span>Hello World!</span>');
$crawly->filter('span')->exists(); // true
$crawly = new Crawly('');
$crawly->filter('non-existing-selection')->exists(); // false
$crawly->filter('non-existing-selection')->exists(true); // throws ScrapeException
Resets the crawler back to its original HTML.
$crawly = new Crawly('<ul><li>1</li></ul>');
$crawly->filter('li')->html(); // <li>1</li>
$crawly->reset()->html(); // <ul><li>1</li></ul>
This method creates a new array populated with the results of calling a provided function on every node in the selection.
For each node, the callback function is called with a Crawly instance created from that node. Additionally, the callback function takes a second argument which is the 0-based index of the node.
$crawly = new Crawly('<ul><li> Hello </li><li> World </li></ul>');
$crawly->filter('li')->map(function (Crawly $crawly, int $index) {
return $crawly->trim()->string() . ' - ' . $index;
}); // ['Hello - 0', 'World - 1']
// Limit the number of nodes passed to map
$crawly->filter('li')->map(function (Crawly $crawly, int $index) {
return $crawly->trim()->string() . ' - ' . $index;
}, 1); // ['Hello - 0']
Returns the first DOMNode of the selection.
$crawly = new Crawly('<ul><li>1</li></ul>');
$node = $crawly->filter('li')->node(); // DOMNode representing '<li>1</li>' is returned
Readers are data source classes used by Scrapy to fetch the HTML content.
Scrapy comes with some predefined readers, and you can also write your own if you need to.
Scrapy comes with two built-in readers: UrlReader and FileReader. Let's see how you may use them.
use Scrapy\Builders\ScrapyBuilder;
use Scrapy\Readers\UrlReader;
use Scrapy\Readers\FileReader;
ScrapyBuilder::make()
->reader(new UrlReader('https://www.some-url.com'));
ScrapyBuilder::make()
->reader(new FileReader('path-to-file.html'));
As you can see, the built-in readers allow you to use Scrapy by reading either from a URL or from a specific file.
You don't have to be limited to the built-in readers, though. Writing your own is a piece of cake.
use Scrapy\Readers\IReader;
class CustomReader implements IReader
{
public function read(): string
{
return '<h1>Hello World!</h1>';
}
}
And then use it during the build process.
ScrapyBuilder::make()
->reader(new CustomReader());
A user agent is a piece of software acting on behalf of a user, in this case a Scrapy instance. Scrapy provides several built-in user agents for simulating different crawlers.
User agents only make sense in the context of readers that fetch their data over HTTP, and more precisely in cases where you want to read a web page that creates its content dynamically using JavaScript.
By default, Scrapy cannot execute JavaScript. This is a problem all web crawlers face, and there are numerous techniques for overcoming it, usually by using external services like Prerender which redirect crawling bots to cached HTML pages.
Several user agents are provided to allow Scrapy to present itself as one of the common crawlers. Please note that if a web page implements more advanced crawling security checks (for example an IP check), the provided agents would fail, since they only modify the HTTP request headers.
If you want to find out more, there is a great article on pre-rendering over at Netlify.
Scrapy comes with a few built-in agents you can use.
ScrapyBuilder::make()
->agent(new GoogleAgent()); // Googlebot
ScrapyBuilder::make()
->agent(new GoogleChromeAgent(81, 0, 4043, 0)); // Google Chrome
ScrapyBuilder::make()
->agent(new BingUserAgent()); // Bing
ScrapyBuilder::make()
->agent(new YahooUserAgent()); // Yahoo
ScrapyBuilder::make()
->agent(new DuckUserAgent()); // Duck
Just like with readers, you can write your own custom user agents.
use Scrapy\Agents\IUserAgent;
use Scrapy\Readers\UrlReader;
class UserAgent implements IUserAgent
{
public function reader(string $url): UrlReader
{
$reader = new UrlReader($url);
$reader->setConfig(['headers' => ['...']]);
return $reader;
}
}
And then use it during the build process.
ScrapyBuilder::make()
->agent(new UserAgent());
One thing to note is the precedence of different parameters you may set during the build process.
Setting the URL is the same as setting the reader to a UrlReader with that URL. On the other hand, explicitly setting a reader takes precedence over explicitly setting the URL and/or user agent.
use Scrapy\Readers\UrlReader;
use Scrapy\Agents\GoogleAgent;
use Scrapy\Builders\ScrapyBuilder;
ScrapyBuilder::make()
->url('https://www.facebook.com')
->agent(new GoogleAgent())
->reader(new UrlReader('https://www.youtube.com')); // YouTube will be read without the GoogleAgent; Facebook will be ignored.
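Conversely, when no reader is set explicitly, the URL and the user agent are combined: the agent builds the UrlReader used to fetch the page. A short sketch following the precedence rules above; the URL is a placeholder:
use Scrapy\Agents\GoogleAgent;
use Scrapy\Builders\ScrapyBuilder;
// No explicit reader here, so the GoogleAgent wraps the UrlReader
// for the given URL and the page is fetched with its headers.
ScrapyBuilder::make()
    ->url('https://www.some-url.com')
    ->agent(new GoogleAgent());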
In general, Scrapy tries to handle all possible exceptions by wrapping them in its base exception class: ScrapeException.
What this means is that you can organize your app around a single exception for general error handling.
A more granular system, which would allow you to react to specific parser exceptions, is planned for a future release.
use Scrapy\Builders\ScrapyBuilder;
use Scrapy\Exceptions\ScrapeException;
try {
$html = ScrapyBuilder::make()
->url('https://www.invalid-url.com')
->build()
->scrape();
} catch (ScrapeException $e) {
//
}
To run the entire suite of unit tests, run:
composer test
Please see the CHANGELOG for more information on what has changed recently.
The MIT License (MIT). Please see License File for more information.