Fix issue in Sitemap::getUrlsFromSitemap()
Fixes the regex used to fix the <urlset> tag when the XML content contains
no line breaks.
otsch committed Feb 7, 2024
1 parent 9152d00 commit 9f04c17
Showing 3 changed files with 19 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

+## [1.5.2] - 2024-02-07
+### Fixed
+* Issue in `GetUrlsFromSitemap` (`Sitemap::getUrlsFromSitemap()`) step when XML content has no line breaks.
+
## [1.5.1] - 2024-02-06
### Fixed
* For being more flexible to build a separate headless browser loader (in an extension package) extract the most basic HTTP loader functionality to a new `HttpBaseLoader` and important functionality for the headless browser loader to a new `HeadlessBrowserLoaderHelper`. Further, also share functionality from the `Http` steps via a new abstract `HttpBase` step. It's considered a fix, because there's no new functionality, just refactoring existing code for better extendability.
2 changes: 1 addition & 1 deletion src/Steps/Sitemap/GetUrlsFromSitemap.php
@@ -19,7 +19,7 @@ class GetUrlsFromSitemap extends Step
     public static function fixUrlSetTag(Crawler $dom): Crawler
     {
         if ($dom->filter('urlset url')->count() === 0) {
-            return new Crawler(preg_replace('/<urlset.+>/', '<urlset>', $dom->outerHtml()));
+            return new Crawler(preg_replace('/<urlset.+?>/', '<urlset>', $dom->outerHtml()));
         }

         return $dom;
14 changes: 14 additions & 0 deletions tests/Steps/Sitemap/GetUrlsFromSitemapTest.php
@@ -96,3 +96,17 @@ function () {
         expect($outputs)->toHaveCount(3);
     }
 );
+
+it(
+    'doesn\'t fail when the urlset tag contains attributes, that would cause the symfony DomCrawler to not find the ' .
+    'elements, when the XML content has no line breaks',
+    function () {
+        $xml = <<<XML
+<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"><url><loc>https://www.crwlr.software/blog/whats-new-in-crwlr-crawler-v0-5</loc></url><url><loc>https://www.crwlr.software/blog/dealing-with-http-url-query-strings-in-php</loc></url><url><loc>https://www.crwlr.software/blog/whats-new-in-crwlr-crawler-v0-4</loc></url></urlset>
+XML;
+
+        $outputs = helper_invokeStepWithInput(Sitemap::getUrlsFromSitemap(), $xml);
+
+        expect($outputs)->toHaveCount(3);
+    }
+);
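
The effect of the one-character regex change in `fixUrlSetTag()` can be reproduced in isolation with plain `preg_replace()`. The sketch below is not part of the commit: the sitemap string is a shortened, hypothetical stand-in for the test fixture (the `example.com` URLs are placeholders), and the variable names are made up for illustration. Because `.` does not match a newline by default, the old greedy `.+` stopped at the end of the first line when the opening tag was on its own line; on single-line XML it runs to the last `>` in the document and swallows every `<url>` entry.

```php
<?php

// Hypothetical single-line sitemap, shortened from the shape used in the test above.
$xml = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>https://example.com/a</loc></url>'
    . '<url><loc>https://example.com/b</loc></url>'
    . '</urlset>';

// Old pattern: greedy ".+" extends to the LAST ">" on the line,
// so the whole document collapses into the replacement.
$greedy = preg_replace('/<urlset.+>/', '<urlset>', $xml);

// New pattern: lazy ".+?" stops at the FIRST ">",
// which is the end of the opening <urlset ...> tag.
$lazy = preg_replace('/<urlset.+?>/', '<urlset>', $xml);

echo $greedy, PHP_EOL; // <urlset>
echo $lazy, PHP_EOL;   // <urlset><url>...</url><url>...</url></urlset>
```

With the lazy quantifier only the attributes of the opening tag are dropped, and both `<url>` entries survive.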
