Releases · crwlrsoft/crawler

Cookies set during headless browser usage are now also added to the cookie jar, so they are sent as well after switching back to the (Guzzle) HTTP client.
Don't call Loader::afterLoad() when Loader::beforeLoad() was not called before. This can happen when an exception is thrown before the call to the beforeLoad hook, but it is caught and the afterLoad hook method is called anyway. As this most likely won't make sense to users, the afterLoad hook callback functions are simply not called in this case.
The Throttler class now has the protected methods _internalTrackStartFor(), _requestToUrlWasStarted() and _internalTrackEndFor(). When extending the Throttler class (which is not really recommended, so be careful), you can use them to check whether a request to a URL was actually started before.
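
The guard described for the afterLoad hooks can be pictured roughly as follows. This is a minimal sketch in plain PHP, not the library's actual code: the class HookGuardedLoader and its methods callBeforeLoad() and callAfterLoad() are hypothetical names chosen for illustration. It only shows the idea of remembering which URLs actually reached the beforeLoad stage and skipping the afterLoad callbacks for all others.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for illustration; this is NOT the library's Loader class.
final class HookGuardedLoader
{
    /** @var list<callable(string): void> registered afterLoad callbacks */
    private array $afterLoadHooks = [];

    /** @var array<string, bool> URLs for which the beforeLoad stage already ran */
    private array $started = [];

    public function afterLoad(callable $hook): void
    {
        $this->afterLoadHooks[] = $hook;
    }

    public function callBeforeLoad(string $url): void
    {
        // Remember that loading this URL was actually started.
        $this->started[$url] = true;
    }

    public function callAfterLoad(string $url): void
    {
        if (!isset($this->started[$url])) {
            return; // beforeLoad never ran for this URL, so skip the hooks
        }

        unset($this->started[$url]);

        foreach ($this->afterLoadHooks as $hook) {
            $hook($url);
        }
    }
}

$calls = [];
$loader = new HookGuardedLoader();
$loader->afterLoad(function (string $url) use (&$calls): void {
    $calls[] = $url;
});

$loader->callAfterLoad('https://example.com/a');  // ignored, never started
$loader->callBeforeLoad('https://example.com/b');
$loader->callAfterLoad('https://example.com/b');  // hooks run for this URL
```

After running the sketch, $calls contains only the second URL, because the first one never passed through the beforeLoad stage.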

18 Oct 22:36

otsch

v2.1.0

1b9b4d2

Added

The new postBrowserNavigateHook() method in the Http step classes, which lets you define callback functions that are triggered after the headless browser has navigated to the specified URL. They are called with the chrome-php Page object as argument, so you can interact with the page. Also, there is a new class BrowserAction providing some simple actions (like wait for element, click element, ...) as Closures via static methods. You can use it like this: Http::get()->postBrowserNavigateHook(BrowserAction::clickElement('#element')).

15 Oct 20:06

otsch

v2.0.1

52dd36b

Fixed

Issue with the afterLoad hook of the HttpLoader, introduced in v2. Calling the hook was commented out, which slipped through because the test case was faulty.

15 Oct 15:08

otsch

v2.0.0

464d42b

Changed

BREAKING: Removed methods BaseStep::addToResult(), BaseStep::addLaterToResult(), BaseStep::addsToOrCreatesResult(), BaseStep::createsResult(), and BaseStep::keepInputData(). These methods were deprecated in v1.8.0 and should be replaced with Step::keep(), Step::keepAs(), Step::keepFromInput(), and Step::keepInputAs().
BREAKING: Added the following keep methods to the StepInterface: StepInterface::keep(), StepInterface::keepAs(), StepInterface::keepFromInput(), StepInterface::keepInputAs(), as well as StepInterface::keepsAnything(), StepInterface::keepsAnythingFromInputData() and StepInterface::keepsAnythingFromOutputData(). If you have a class that implements this interface without extending Step (or BaseStep), you will need to implement these methods yourself. However, it is strongly recommended to extend Step instead.
BREAKING: With the removal of the addToResult() method, the library no longer uses toArrayForAddToResult() methods on output objects. Instead, please use toArrayForResult(). Consequently, RespondedRequest::toArrayForAddToResult() has been renamed to RespondedRequest::toArrayForResult().
BREAKING: Removed the result and addLaterToResult properties from Io objects (Input and Output). These properties were part of the addToResult feature and are now removed. Instead, use the keep property where kept data is added.
BREAKING: The signature of the Crawler::addStep() method has changed. You can no longer provide a result key as the first parameter. Previously, this key was passed to the Step::addToResult() method internally. Now, please handle this call yourself.
BREAKING: The return type of the Crawler::loader() method no longer allows array. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below. As part of this change, the UnknownLoaderKeyException was also removed as it is now obsolete. If you have any references to this class, please make sure to remove them.
BREAKING: Refactored the abstract LoadingStep class to a trait and removed the LoadingStepInterface. Loading steps should now extend the Step class and use the trait. As multiple loaders are no longer supported, the addLoader method was renamed to setLoader. Similarly, the methods useLoader() and usesLoader() for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new withLoader() method (e.g., Http::get()->withLoader($loader)). The trait now also uses PHPDoc template tags for a generic loader type. You can define the loader type by putting /** @use LoadingStep<MyLoader> */ above use LoadingStep; in your step class. Then your IDE and static analysis (if supported) will know what type of loader the trait methods return and accept.
BREAKING: Removed the PaginatorInterface to allow for better extensibility. The old Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator class has also been removed. Please use the newer, improved version Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator. This newer version has also changed: the first argument UriInterface $url is removed from the processLoaded() method, as the URL also is part of the request (Psr\Http\Message\RequestInterface) which is now the first argument. Additionally, the default implementation of the getNextRequest() method is removed. Child implementations must define this method themselves. If your custom paginator still has a getNextUrl() method, note that it is no longer needed by the library and will not be called. The getNextRequest() method now fulfills its original purpose.
BREAKING: Removed methods from HttpLoader:
- $loader->setHeadlessBrowserOptions() => use $loader->browser()->setOptions() instead
- $loader->addHeadlessBrowserOptions() => use $loader->browser()->addOptions() instead
- $loader->setChromeExecutable() => use $loader->browser()->setExecutable() instead
- $loader->browserHelper() => use $loader->browser() instead
BREAKING: Removed method RespondedRequest::cacheKeyFromRequest(). Use RequestKey::from() instead.
BREAKING: The HttpLoader::retryCachedErrorResponses() method now returns an instance of the new Crwlr\Crawler\Loader\Http\Cache\RetryManager class. This class provides the methods only() and except() to restrict retries to specific HTTP response status codes. Previously, this method returned the HttpLoader itself ($this), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
BREAKING: Removed the Microseconds class from this package. It has been moved to the crwlr/utils package, which you can use instead.
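
To illustrate the only() and except() restriction mentioned for the RetryManager above, here is a minimal sketch in plain PHP. It is not the library's implementation: the class RetryStatusFilter and its shouldRetry() method are hypothetical, and it assumes that only() acts as an allow-list and except() as a deny-list for error status codes (400 and above).

```php
<?php

declare(strict_types=1);

// Hypothetical illustration of allow-list / deny-list filtering.
// This is NOT the library's RetryManager class.
final class RetryStatusFilter
{
    /** @var list<int>|null allow-list, null means "not restricted" */
    private ?array $only = null;

    /** @var list<int> deny-list */
    private array $except = [];

    /** @param int[] $statusCodes */
    public function only(array $statusCodes): self
    {
        $this->only = array_values($statusCodes);

        return $this;
    }

    /** @param int[] $statusCodes */
    public function except(array $statusCodes): self
    {
        $this->except = array_values($statusCodes);

        return $this;
    }

    public function shouldRetry(int $statusCode): bool
    {
        if ($statusCode < 400) {
            return false; // only error responses are candidates for a retry
        }

        if ($this->only !== null) {
            return in_array($statusCode, $this->only, true);
        }

        return !in_array($statusCode, $this->except, true);
    }
}

$onlyServerErrors = (new RetryStatusFilter())->only([500, 503]);
$allButNotFound = (new RetryStatusFilter())->except([404]);

var_dump($onlyServerErrors->shouldRetry(503)); // matches the allow-list
var_dump($allButNotFound->shouldRetry(404));   // excluded by the deny-list
```

The sketch also hints at why the changed return type is breaking: once a method returns a separate object like this, any further calls in the same chain act on that object and no longer on the loader.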

Added

New methods FileCache::prolong() and FileCache::prolongAll() to allow prolonging the time to live for cached responses.

Fixed

The maxOutputs() method is now also available and working on Group steps.
Improved warning messages for step validations that are happening before running a crawler.
When the crawler finds a problem with the setup before actually running, the resulting PreRunValidationException is now not only logged as an error via the logger, but also rethrown to the user. This way the user won't get the impression that the crawler ran successfully without looking at the log messages.

A detailed upgrade guide is available at https://www.crwlr.software/packages/crawler/v2.0/upgrade-guide

26 Aug 10:32

otsch

v2.0.0-beta.2

576d1db

Pre-release

Added

New methods FileCache::prolong() and FileCache::prolongAll() to allow prolonging the time to live for cached responses.

09 Aug 09:55

otsch

v2.0.0-beta

b4fa1b0

Pre-release

Changed

BREAKING: Removed methods BaseStep::addToResult(), BaseStep::addLaterToResult(), BaseStep::addsToOrCreatesResult(), BaseStep::createsResult(), and BaseStep::keepInputData(). These methods were deprecated in v1.8.0 and should be replaced with Step::keep(), Step::keepAs(), Step::keepFromInput(), and Step::keepInputAs().
BREAKING: With the removal of the addToResult() method, the library no longer uses toArrayForAddToResult() methods on output objects. Instead, please use toArrayForResult(). Consequently, RespondedRequest::toArrayForAddToResult() has been renamed to RespondedRequest::toArrayForResult().
BREAKING: Removed the result and addLaterToResult properties from Io objects (Input and Output). These properties were part of the addToResult feature and are now removed. Instead, use the keep property where kept data is added.
BREAKING: The return type of the Crawler::loader() method no longer allows array. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below.
BREAKING: Refactored the abstract LoadingStep class to a trait and removed the LoadingStepInterface. Loading steps should now extend the Step class and use the trait. As multiple loaders are no longer supported, the addLoader method was renamed to setLoader. Similarly, the methods useLoader() and usesLoader() for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new withLoader() method (e.g., Http::get()->withLoader($loader)).
BREAKING: Removed the PaginatorInterface to allow for better extensibility. The old Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator class has also been removed. Please use the newer, improved version Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator. This newer version has also changed: the first argument UriInterface $url is removed from the processLoaded() method, as the URL also is part of the request (Psr\Http\Message\RequestInterface) which is now the first argument. Additionally, the default implementation of the getNextRequest() method is removed. Child implementations must define this method themselves. If your custom paginator still has a getNextUrl() method, note that it is no longer needed by the library and will not be called. The getNextRequest() method now fulfills its original purpose.
BREAKING: Removed methods from HttpLoader:
- $loader->setHeadlessBrowserOptions() => use $loader->browser()->setOptions() instead
- $loader->addHeadlessBrowserOptions() => use $loader->browser()->addOptions() instead
- $loader->setChromeExecutable() => use $loader->browser()->setExecutable() instead
- $loader->browserHelper() => use $loader->browser() instead
BREAKING: Removed method RespondedRequest::cacheKeyFromRequest(). Use RequestKey::from() instead.
BREAKING: The HttpLoader::retryCachedErrorResponses() method now returns an instance of the new Crwlr\Crawler\Loader\Http\Cache\RetryManager class. This class provides the methods only() and except() to restrict retries to specific HTTP response status codes. Previously, this method returned the HttpLoader itself ($this), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
BREAKING: Removed the Microseconds class from this package. It has been moved to the crwlr/utils package, which you can use instead.

05 Aug 17:33

otsch

v1.10.0

a73e6a4

Added

URL refiners: UrlRefiner::withScheme(), UrlRefiner::withHost(), UrlRefiner::withPort(), UrlRefiner::withoutPort(), UrlRefiner::withPath(), UrlRefiner::withQuery(), UrlRefiner::withoutQuery(), UrlRefiner::withFragment() and UrlRefiner::withoutFragment().
New paginator stop rules PaginatorStopRules::contains() and PaginatorStopRules::notContains().
Static method UserAgent::mozilla5CompatibleBrowser() to get a UserAgent instance with the user agent string Mozilla/5.0 (compatible), as well as the new method withMozilla5CompatibleUserAgent() in the AnonymousHttpCrawlerBuilder, which you can use like this: HttpCrawler::make()->withMozilla5CompatibleUserAgent().

24 Jul 22:30

otsch

v1.9.5

a064aeb

Fixed

Prevent PHP warnings when an HTTP response includes a Content-Type: application/x-gzip header, but the content is not actually compressed. This issue also occurred with cached responses, because compressed content is decoded during caching. Upon retrieval from the cache, the header indicated compression, but the content was already decoded.
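
One way to avoid this kind of warning is to check the body itself instead of trusting the header. The following is a minimal sketch in plain PHP, not the library's actual fix: the helper decodeIfGzipped() is a hypothetical name, and the approach (looking for the gzip magic bytes 0x1f 0x8b before decoding) is an assumption about how such a check could be done. It requires PHP's zlib extension.

```php
<?php

declare(strict_types=1);

/**
 * Decodes $body only when it really starts with the gzip magic bytes (0x1f 0x8b).
 * Hypothetical helper for illustration; it is not part of crwlr/crawler.
 */
function decodeIfGzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // the header may claim gzip, but the content is plain
    }

    // Suppress the warning for corrupt data and fall back to the raw body.
    $decoded = @gzdecode($body);

    return $decoded === false ? $body : $decoded;
}

$plain = 'Hello crawler';
$compressed = gzencode($plain);

$fromCompressed = decodeIfGzipped($compressed); // decoded back to the plain text
$fromPlain = decodeIfGzipped($plain);           // returned unchanged, no warning
```

With a check like this, a body that was already decoded before caching passes through unchanged, even if its Content-Type header still indicates gzip.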
