
Releases: crwlrsoft/crawler

v2.0.0-beta.2 (Pre-release)

26 Aug 10:32

Added

  • New methods FileCache::prolong() and FileCache::prolongAll() to allow prolonging the time to live for cached responses.
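
A minimal sketch of how prolonging could look. The exact signatures of prolong() and prolongAll() aren't shown in these notes, so the cache key and TTL arguments below are assumptions:

```php
use Crwlr\Crawler\Cache\FileCache;

$cache = new FileCache(__DIR__ . '/cachedir');

// Assumed signatures: extend the TTL of a single cached response by its
// cache key, or of all cached responses at once (TTL in seconds).
$cache->prolong('<cache-key>', 86400);

$cache->prolongAll(86400);
```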

v2.0.0-beta (Pre-release)

09 Aug 09:55

Changed

  • BREAKING: Removed methods BaseStep::addToResult(), BaseStep::addLaterToResult(), BaseStep::addsToOrCreatesResult(), BaseStep::createsResult(), and BaseStep::keepInputData(). These methods were deprecated in v1.8.0 and should be replaced with Step::keep(), Step::keepAs(), Step::keepFromInput(), and Step::keepInputAs() (see the migration sketch after this list).
  • BREAKING: With the removal of the addToResult() method, the library no longer uses toArrayForAddToResult() methods on output objects. Instead, please use toArrayForResult(). Consequently, RespondedRequest::toArrayForAddToResult() has been renamed to RespondedRequest::toArrayForResult().
  • BREAKING: Removed the result and addLaterToResult properties from Io objects (Input and Output). These properties were part of the addToResult feature and are now removed. Instead, use the keep property where kept data is added.
  • BREAKING: The return type of the Crawler::loader() method no longer allows array, so it's no longer possible to provide multiple loaders via the crawler. Instead, use the new functionality described below to provide a custom loader directly to a single step.
  • BREAKING: Refactored the abstract LoadingStep class to a trait and removed the LoadingStepInterface. Loading steps should now extend the Step class and use the trait. As multiple loaders are no longer supported, the addLoader() method was renamed to setLoader(). Similarly, the methods useLoader() and usesLoader() for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new withLoader() method, e.g. Http::get()->withLoader($loader) (see the loader sketch after this list).
  • BREAKING: Removed the PaginatorInterface to allow for better extensibility. The old Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator class has also been removed. Please use the newer, improved version Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator. This newer version has also changed: the first argument UriInterface $url was removed from the processLoaded() method, as the URL is also part of the request (Psr\Http\Message\RequestInterface), which is now the first argument. Additionally, the default implementation of the getNextRequest() method was removed; child implementations must define this method themselves. If your custom paginator still has a getNextUrl() method, note that it is no longer needed by the library and will not be called; the getNextRequest() method now fulfills its original purpose (a minimal example follows after this list).
  • BREAKING: Removed methods from HttpLoader:
    • $loader->setHeadlessBrowserOptions() => use $loader->browser()->setOptions() instead
    • $loader->addHeadlessBrowserOptions() => use $loader->browser()->addOptions() instead
    • $loader->setChromeExecutable() => use $loader->browser()->setExecutable() instead
    • $loader->browserHelper() => use $loader->browser() instead
  • BREAKING: Removed method RespondedRequest::cacheKeyFromRequest(). Use RequestKey::from() instead.
  • BREAKING: The HttpLoader::retryCachedErrorResponses() method now returns an instance of the new Crwlr\Crawler\Loader\Http\Cache\RetryManager class. This class provides the methods only() and except() to restrict retries to specific HTTP response status codes (see the loader sketch after this list). Previously, this method returned the HttpLoader itself ($this), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
  • BREAKING: Removed the Microseconds class from this package. It has been moved to the crwlr/utils package, which you can use instead.
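
To illustrate the migration from addToResult() to the keep methods, a minimal before/after sketch, assuming an existing $crawler instance (step and selector choices are illustrative):

```php
use Crwlr\Crawler\Steps\Html;

// v1.x (removed in v2):
// Html::root()->extract(['title' => 'h1', 'price' => '.price'])->addToResult(['title', 'price'])

// v2: keep() marks output properties that end up in the final result.
$crawler->addStep(
    Html::root()
        ->extract(['title' => 'h1', 'price' => '.price'])
        ->keep(['title', 'price']),
);
```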
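The loader-related changes combined in one sketch, assuming existing $crawler and $loader (HttpLoader) instances. The browser options shown and the array argument to except() are assumptions:

```php
use Crwlr\Crawler\Steps\Loading\Http;

// Headless browser configuration via the browser() helper:
$loader->browser()->setOptions(['windowSize' => [1920, 1080]]);

$loader->browser()->setExecutable('chromium');

// retryCachedErrorResponses() now returns a RetryManager, so restrict
// retries via only()/except() and don't chain other loader methods after it:
$loader->retryCachedErrorResponses()->except([404, 410]);

// Provide a custom loader directly to a single step:
$crawler->addStep(Http::get()->withLoader($loader));
```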
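And a minimal sketch of a custom paginator based on the new abstract class. The exact shape of processLoaded() beyond its first argument, and any further members the parent may require, are inferred from the notes above:

```php
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;

class SimpleQueryPaginator extends AbstractPaginator
{
    private ?RequestInterface $latestRequest = null;

    private int $loadedPages = 0;

    public function processLoaded(RequestInterface $request, ?RespondedRequest $respondedRequest): void
    {
        // The request replaces the removed UriInterface $url first argument.
        $this->latestRequest = $request;

        $this->loadedPages++;
    }

    // No default implementation in the parent anymore, so child classes
    // must build the next request themselves.
    public function getNextRequest(): ?RequestInterface
    {
        if (!$this->latestRequest || $this->loadedPages >= 3) {
            return null; // Illustrative: stop after three pages.
        }

        $nextUri = $this->latestRequest->getUri()->withQuery('page=' . ($this->loadedPages + 1));

        return $this->latestRequest->withUri($nextUri);
    }
}
```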

v1.10.0

05 Aug 17:33

Added

  • URL refiners: UrlRefiner::withScheme(), UrlRefiner::withHost(), UrlRefiner::withPort(), UrlRefiner::withoutPort(), UrlRefiner::withPath(), UrlRefiner::withQuery(), UrlRefiner::withoutQuery(), UrlRefiner::withFragment(), and UrlRefiner::withoutFragment() (see the usage sketch after this list).
  • New paginator stop rules PaginatorStopRules::contains() and PaginatorStopRules::notContains().
  • Static method UserAgent::mozilla5CompatibleBrowser() to get a UserAgent instance with the user agent string Mozilla/5.0 (compatible), plus the new method withMozilla5CompatibleUserAgent() in the AnonymousHttpCrawlerBuilder, which you can use like this: HttpCrawler::make()->withMozilla5CompatibleUserAgent().
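
A combined usage sketch, assuming an existing $crawler. Attaching refiners via refineOutput(), the Paginator::simpleWebsite() factory, the stopWhen() chaining, and the import paths follow the library's existing patterns but are assumptions here:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;
use Crwlr\Crawler\Steps\Refiners\UrlRefiner;

// Crawler with the Mozilla/5.0 (compatible) user agent string:
$crawler = HttpCrawler::make()->withMozilla5CompatibleUserAgent();

// Stop paginating as soon as a page contains a marker string:
$crawler->addStep(
    Http::get()->paginate(
        Paginator::simpleWebsite('.next-page')
            ->stopWhen(PaginatorStopRules::contains('No more results')),
    ),
);

// Normalize scraped link URLs: force https and drop fragments.
$crawler->addStep(
    Html::getLinks()
        ->refineOutput(UrlRefiner::withScheme('https'))
        ->refineOutput(UrlRefiner::withoutFragment()),
);
```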

v1.9.5

24 Jul 22:30

Fixed

  • Prevent PHP warnings when an HTTP response includes a Content-Type: application/x-gzip header, but the content is not actually compressed. This issue also occurred with cached responses, because compressed content is decoded during caching. Upon retrieval from the cache, the header indicated compression, but the content was already decoded.

v1.9.4

24 Jul 19:45

Fixed

  • When using HttpLoader::cacheOnlyWhereUrl() to restrict caching, the filter rule is now not only applied when adding newly loaded responses to the cache, but also when using cached responses. Example: if a response for https://www.example.com/foo is already available in the cache, but $loader->cacheOnlyWhereUrl(Filter::urlPathStartsWith('/bar/')) was called, the cached response is not used.

v1.9.3

05 Jul 08:51

Fixed

  • Add HttpLoader::browser() as a replacement for HttpLoader::browserHelper() and deprecate the browserHelper() method. It's an alias, added just because it reads a little better: $loader->browser()->xyz() vs. $loader->browserHelper()->xyz(). HttpLoader::browserHelper() will be removed in v2.0.
  • Also deprecate HttpLoader::setHeadlessBrowserOptions(), HttpLoader::addHeadlessBrowserOptions() and HttpLoader::setChromeExecutable(). Use $loader->browser()->setOptions(), $loader->browser()->addOptions() and $loader->browser()->setExecutable() instead.

v1.9.2

17 Jun 23:47

Fixed

  • Issue with setting the headless Chrome executable, introduced in v1.9.0.

v1.9.1

17 Jun 15:21

Added

  • New method HeadlessBrowserLoaderHelper::getTimeout() to get the currently configured timeout value (complementing setTimeout(), added in v1.9.0).

v1.9.0

17 Jun 13:15

Added

  • New methods HeadlessBrowserLoaderHelper::setTimeout() and HeadlessBrowserLoaderHelper::waitForNavigationEvent() to allow defining the timeout for the headless Chrome browser in milliseconds (default 30000 = 30 seconds) and the navigation event to wait for when loading a URL: load (the default), DOMContentLoaded, firstMeaningfulPaint, networkIdle, etc. See the sketch below.
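
A short sketch, assuming an HttpLoader instance using the headless browser; the navigation event name string is an assumption:

```php
/** @var \Crwlr\Crawler\Loader\Http\HttpLoader $loader */

// Timeout in milliseconds (default 30000 = 30 seconds):
$loader->browserHelper()->setTimeout(60000);

// Wait for networkIdle instead of the default load event:
$loader->browserHelper()->waitForNavigationEvent('networkIdle');
```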

v1.8.0

05 Jun 00:10

Added

  • New methods Step::keep() and Step::keepAs(), as well as Step::keepFromInput() and Step::keepInputAs(), as alternatives to Step::addToResult() (and Step::addLaterToResult()). The keep() method can be called without any argument to keep all of the output data, with a string to keep a certain key, or with an array to keep a list of keys. If the step yields scalar value outputs (not an associative array or object with keys), you need to use the keepAs() method with the key you want the output value to have in the kept data. The methods keepFromInput() and keepInputAs() work the same, but use the input (not the output) that the step receives. They are most likely only needed on a first step, to keep data from initial inputs (or in a sub crawler, see below). Kept properties can also be accessed with the Step::useInputKey() method, so you can easily reuse properties from multiple steps ago as input (see the usage sketch after this list).
  • New method Step::outputType() with a default implementation returning StepOutputType::Mixed. Please consider implementing this method in all your custom steps, because it is going to be required in v2 of the library. It allows detecting (potential) problems in crawling procedures immediately when starting a run, instead of failing after already running for a while.
  • New method Step::subCrawlerFor(), which allows filling output properties from a full child crawling procedure. As the first argument, give it a key from the step's output that the child crawler uses as input(s). As the second argument, provide a Closure that receives a clone of the current Crawler, without steps and with initial inputs set from the current output. In the Closure, define the child crawling procedure by adding steps as usual, and return it. This makes it possible to build nested output data, scraped from different (sub-)pages, more flexibly and with less complexity than the usual linear crawling procedure with Step::addToResult() (see the sketch after this list).
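
A usage sketch for the new keep methods, assuming an existing $crawler (selectors are illustrative):

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Keep all output properties:
$crawler->addStep(Html::root()->extract(['title' => 'h1', 'date' => '.date'])->keep());

// Keep only one property:
$crawler->addStep(Html::root()->extract(['title' => 'h1', 'date' => '.date'])->keep('title'));

// Scalar outputs need a key, e.g. a step yielding plain URL strings:
$crawler->addStep(Html::getLinks()->keepAs('url'));

// Reuse a kept property from an earlier step as input:
$crawler->addStep(Http::get()->useInputKey('url'));
```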
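And a sketch of Step::subCrawlerFor(); the selectors and the Dom::cssSelector(...)->link() helper for absolute link URLs are assumptions:

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->addStep(
    Html::root()
        ->extract([
            'title' => 'h1',
            'detailUrl' => Dom::cssSelector('a.detail')->link(),
        ])
        ->keep(['title'])
        // The child crawler receives the output's 'detailUrl' values as
        // initial inputs; its results fill that property with nested data.
        ->subCrawlerFor('detailUrl', function (Crawler $subCrawler) {
            return $subCrawler
                ->addStep(Http::get())
                ->addStep(Html::root()->extract(['price' => '.price'])->keep());
        }),
);
```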

Deprecated

  • The Step::addToResult(), Step::addLaterToResult() and Step::keepInputData() methods. Instead, please use the new keep methods. This can mean some migration work for v2, because the add-to-result methods in particular are a pretty central piece of functionality, but the new "keep" methodology (plus the new sub crawler feature) will make a lot of things easier and less complex, and the library will most likely work more efficiently in v2.

Fixed

  • When a cache file was generated with compression and you're trying to read it with a FileCache instance without compression enabled, it now also works: if unserializing the file content fails, it tries decoding the string first before unserializing it again.