Fix reading input sitemap in HTTP crawl step
The `Http::crawl()` step now also works with sitemaps as input URL, where
the `<urlset>` tag contains attributes that would cause the Symfony
DomCrawler to not find any elements.
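
The commit message does not say why attributes on `<urlset>` break element lookup. One plausible cause is the default `xmlns` declaration, which puts every element into a namespace that an unprefixed query no longer matches. The following is a minimal sketch of that effect using PHP's built-in DOM extension instead of the Symfony DomCrawler. The helpers `countUrlElements()` and `stripUrlSetAttributes()` are hypothetical names for illustration and are not part of the library. The sketch also swaps the commit's `/<urlset.+>/` pattern for `[^>]*`, which stops at the first `>` and so should also cope with a sitemap delivered on a single line.

```php
<?php

// Hypothetical helper (not part of the library): counts <url> elements with plain XPath.
function countUrlElements(string $xml): int
{
    $document = new DOMDocument();
    $document->loadXML($xml);

    // An unprefixed XPath name only matches elements that are in no namespace.
    return (new DOMXPath($document))->query('//urlset/url')->length;
}

// Hypothetical helper: drops all attributes from the opening <urlset> tag.
// Assumption: [^>]* is used here instead of the commit's .+ so the match ends at the first ">".
function stripUrlSetAttributes(string $xml): string
{
    return preg_replace('/<urlset\b[^>]*>/', '<urlset>', $xml, 1);
}

$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/crawling/main</loc></url>
<url><loc>http://www.example.com/crawling/sub1</loc></url>
</urlset>
XML;

$before = countUrlElements($sitemap);
$after = countUrlElements(stripUrlSetAttributes($sitemap));

// prints "before: 0, after: 2"
echo "before: {$before}, after: {$after}", PHP_EOL;
```

Removing the attributes discards the namespace information, which is acceptable here because only the `<loc>` values are read afterwards.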
otsch committed Jul 13, 2023
1 parent a118b77 commit bca54fb
Showing 5 changed files with 52 additions and 6 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.1.4] - 2023-07-14
### Fixed
* The `Http::crawl()` step now also works with sitemaps as input URL, where the `<urlset>` tag contains attributes that would cause the Symfony DomCrawler to not find any elements.

## [1.1.3] - 2023-06-29
### Fixed
* Improved `Json` step: if the target of the "each" (like `Json::each('target', [...])`) does not exist in the input JSON data, the step yields nothing and logs a warning.
3 changes: 2 additions & 1 deletion src/Steps/Loading/HttpCrawl.php
@@ -5,6 +5,7 @@
use Closure;
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\Document;
use Crwlr\Crawler\Steps\Sitemap\GetUrlsFromSitemap;
use Crwlr\Url\Url;
use Exception;
use Generator;
@@ -249,7 +250,7 @@ protected function getUrlsFromInitialResponse(RespondedRequest $respondedRequest
*/
protected function getUrlsFromSitemap(RespondedRequest $respondedRequest): array
{
- $domCrawler = new Crawler(Http::getBodyString($respondedRequest));
+ $domCrawler = GetUrlsFromSitemap::fixUrlSetTag(new Crawler(Http::getBodyString($respondedRequest)));

$urls = [];

24 changes: 19 additions & 5 deletions src/Steps/Sitemap/GetUrlsFromSitemap.php
@@ -10,6 +10,24 @@ class GetUrlsFromSitemap extends Step
{
protected bool $withData = false;

/**
* Remove attributes from a sitemap's <urlset> tag
*
* Symfony's DomCrawler component has problems when a sitemap's <urlset> tag contains certain attributes.
* So, if the count of urls in the sitemap is zero, try to remove all attributes from the <urlset> tag.
*
* @param Crawler $dom
* @return Crawler
*/
public static function fixUrlSetTag(Crawler $dom): Crawler
{
if ($dom->filter('urlset url')->count() === 0) {
return new Crawler(preg_replace('/<urlset.+>/', '<urlset>', $dom->outerHtml()));
}

return $dom;
}

public function withData(): static
{
$this->withData = true;
@@ -22,11 +40,7 @@ public function withData(): static
*/
protected function invoke(mixed $input): Generator
{
- if ($input->filter('urlset url')->count() === 0) {
- $xml = preg_replace('/<urlset.+>/', '<urlset>', $input->outerHtml());
-
- $input = new Crawler($xml);
- }
+ $input = self::fixUrlSetTag($input);

foreach ($input->filter('urlset url') as $urlNode) {
$urlNode = new Crawler($urlNode);
13 changes: 13 additions & 0 deletions tests/_Integration/Http/CrawlingTest.php
@@ -188,6 +188,19 @@ public function getLoader(): TestLoader
expect($crawler->getLoader()->loadedUrls)->toHaveCount(1);
});

it(
'extracts URLs from a sitemap where the <urlset> tag contains attributes that cause symfony DomCrawler to fail',
function () {
$crawler = (new Crawler())
->input('http://www.example.com/crawling/sitemap2.xml')
->addStep(Http::crawl()->inputIsSitemap());

$crawler->runAndTraverse();

expect($crawler->getLoader()->loadedUrls)->toHaveCount(7);
}
);

it('loads only pages where the path starts with a certain string when method pathStartsWith() is called', function () {
$crawler = (new Crawler())
->input('http://www.example.com/crawling/sitemap.xml')
14 changes: 14 additions & 0 deletions tests/_Integration/_Server/Crawling.php
@@ -25,6 +25,20 @@
XML;
}

if ($route === '/crawling/sitemap2.xml') {
echo <<<XML
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/typo3/sysext/seo/Resources/Public/CSS/Sitemap.xsl"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 http://www.google.com/schemas/sitemap-image/1.1/sitemap-image.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/crawling/main</loc></url>
<url><loc>http://www.example.com/crawling/sub1</loc></url>
<url><loc>http://www.example.com/crawling/sub1/sub1</loc></url>
<url><loc>http://www.example.com/crawling/sub2</loc></url>
<url><loc>http://www.example.com/crawling/sub2/sub1</loc></url>
<url><loc>http://www.example.com/crawling/sub2/sub1/sub1</loc></url>
</urlset>
XML;
}

if ($route === '/crawling/main') {
echo <<<HTML
<!doctype html>