Releases: crwlrsoft/crawler
v2.1.3
v2.1.2
v2.1.1
Fixed
- Also add cookies, set during headless browser usage, to the cookie jar. When switching back to the (guzzle) HTTP client the cookies should also be sent.
- Don't call `Loader::afterLoad()` when `Loader::beforeLoad()` was not called before. This can happen when an exception is thrown before the call to the `beforeLoad` hook, but it is caught and the `afterLoad` hook method is called anyway. As this most likely won't make sense to users, the `afterLoad` hook callback functions are simply not called in this case.
- The `Throttler` class now has the protected methods `_internalTrackStartFor()`, `_requestToUrlWasStarted()` and `_internalTrackEndFor()`. When extending the `Throttler` class (be careful, this is not really recommended), they can be used to check whether a request to a URL was actually started before.
v2.1.0
Added
- The new `postBrowserNavigateHook()` method in the `Http` step classes, which allows you to define callback functions that are triggered after the headless browser has navigated to the specified URL. They are called with the chrome-php `Page` object as argument, so you can interact with the page. Also, there is a new class `BrowserAction` providing some simple actions (like wait for element, click element, ...) as Closures via static methods. You can use it like `Http::get()->postBrowserNavigateHook(BrowserAction::clickElement('#element'))`.
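For context, the hook might be wired into a complete crawler roughly as follows. This is a minimal sketch, not the project's official example: the `Crwlr\Crawler\Steps\Loading\Http\Browser\BrowserAction` namespace, the bot name, the URL and the `#element` selector are assumptions, and it presumes the package is installed via Composer and a Chrome executable is available.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
// Assumed namespace of the new BrowserAction class.
use Crwlr\Crawler\Steps\Loading\Http\Browser\BrowserAction;

// 'MyCrawler' is a placeholder bot name.
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

// The hook only has an effect when pages are loaded with the headless browser.
$crawler->getLoader()->useHeadlessBrowser();

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        // After the browser navigated to the URL, click the (hypothetical) element.
        Http::get()->postBrowserNavigateHook(BrowserAction::clickElement('#element'))
    );

$crawler->runAndTraverse();
```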
v2.0.1
v2.0.0
Changed
- BREAKING: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()`, and `BaseStep::keepInputData()`. These methods were deprecated in v1.8.0 and should be replaced with `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()`, and `Step::keepInputAs()`.
- BREAKING: Added the following keep methods to the `StepInterface`: `StepInterface::keep()`, `StepInterface::keepAs()`, `StepInterface::keepFromInput()`, `StepInterface::keepInputAs()`, as well as `StepInterface::keepsAnything()`, `StepInterface::keepsAnythingFromInputData()` and `StepInterface::keepsAnythingFromOutputData()`. If you have a class that implements this interface without extending `Step` (or `BaseStep`), you will need to implement these methods yourself. However, it is strongly recommended to extend `Step` instead.
- BREAKING: With the removal of the `addToResult()` method, the library no longer uses `toArrayForAddToResult()` methods on output objects. Instead, please use `toArrayForResult()`. Consequently, `RespondedRequest::toArrayForAddToResult()` has been renamed to `RespondedRequest::toArrayForResult()`.
- BREAKING: Removed the `result` and `addLaterToResult` properties from `Io` objects (`Input` and `Output`). These properties were part of the `addToResult` feature and are now removed. Instead, use the `keep` property where kept data is added.
- BREAKING: The signature of the `Crawler::addStep()` method has changed. You can no longer provide a result key as the first parameter. Previously, this key was passed to the `Step::addToResult()` method internally. Now, please handle this call yourself.
- BREAKING: The return type of the `Crawler::loader()` method no longer allows `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below. As part of this change, the `UnknownLoaderKeyException` was also removed as it is now obsolete. If you have any references to this class, please make sure to remove them.
- BREAKING: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now extend the `Step` class and use the trait. As multiple loaders are no longer supported, the `addLoader` method was renamed to `setLoader`. Similarly, the methods `useLoader()` and `usesLoader()` for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new `withLoader()` method (e.g., `Http::get()->withLoader($loader)`). The trait now also uses phpdoc template tags for a generic loader type. You can define the loader type by putting `/** @use LoadingStep<MyLoader> */` above `use LoadingStep;` in your step class. Then your IDE and static analysis (if supported) will know what type of loader the trait methods return and accept.
- BREAKING: Removed the `PaginatorInterface` to allow for better extensibility. The old `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator` class has also been removed. Please use the newer, improved version `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`. This newer version has also changed: the first argument `UriInterface $url` is removed from the `processLoaded()` method, as the URL also is part of the request (`Psr\Http\Message\RequestInterface`) which is now the first argument. Additionally, the default implementation of the `getNextRequest()` method is removed. Child implementations must define this method themselves. If your custom paginator still has a `getNextUrl()` method, note that it is no longer needed by the library and will not be called. The `getNextRequest()` method now fulfills its original purpose.
- BREAKING: Removed methods from `HttpLoader`:
  - `$loader->setHeadlessBrowserOptions()` => use `$loader->browser()->setOptions()` instead
  - `$loader->addHeadlessBrowserOptions()` => use `$loader->browser()->addOptions()` instead
  - `$loader->setChromeExecutable()` => use `$loader->browser()->setExecutable()` instead
  - `$loader->browserHelper()` => use `$loader->browser()` instead
- BREAKING: Removed method `RespondedRequest::cacheKeyFromRequest()`. Use `RequestKey::from()` instead.
- BREAKING: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class. This class provides the methods `only()` and `except()` to restrict retries to specific HTTP response status codes. Previously, this method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
- BREAKING: Removed the `Microseconds` class from this package. It has been moved to the `crwlr/utils` package, which you can use instead.
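To make some of the breaking changes above more concrete, here is a rough migration sketch. It is not taken from the upgrade guide: the bot name, URL, selector and the `windowSize` browser option value are illustrative assumptions, and the commented-out lines show how the same thing would presumably have looked in v1.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// 'MyCrawler' is a placeholder bot name.
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$loader = $crawler->getLoader();

// v1: $loader->setHeadlessBrowserOptions(['windowSize' => [1920, 1080]]);
$loader->browser()->setOptions(['windowSize' => [1920, 1080]]);

// v1: $loader->retryCachedErrorResponses()->useHeadlessBrowser();
// In v2 retryCachedErrorResponses() no longer returns the loader,
// so further loader methods are called on the loader separately.
$loader->retryCachedErrorResponses();
$loader->useHeadlessBrowser();

$crawler
    ->input('https://www.example.com/')
    ->addStep(Http::get())
    ->addStep(
        // v1: Html::root()->extract([...])->addToResult()
        Html::root()->extract(['title' => 'h1'])->keep()
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```

Splitting the former method chain after `retryCachedErrorResponses()` is the simplest way to stay compatible, since the returned `RetryManager` does not expose the other loader methods.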
Added
- New methods `FileCache::prolong()` and `FileCache::prolongAll()` to allow prolonging the time to live for cached responses.
Fixed
- The `maxOutputs()` method is now also available and working on `Group` steps.
- Improved warning messages for step validations that are happening before running a crawler.
- A `PreRunValidationException`, thrown when the crawler finds a problem with the setup before actually running, is not only logged as an error via the logger, but also rethrown to the user. This way the user won't get the impression that the crawler ran successfully without looking at the log messages.
Detailed upgrade guide on https://www.crwlr.software/packages/crawler/v2.0/upgrade-guide
v2.0.0-beta.2
Added
- New methods `FileCache::prolong()` and `FileCache::prolongAll()` to allow prolonging the time to live for cached responses.
v2.0.0-beta
Changed
- BREAKING: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()`, and `BaseStep::keepInputData()`. These methods were deprecated in v1.8.0 and should be replaced with `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()`, and `Step::keepInputAs()`.
- BREAKING: With the removal of the `addToResult()` method, the library no longer uses `toArrayForAddToResult()` methods on output objects. Instead, please use `toArrayForResult()`. Consequently, `RespondedRequest::toArrayForAddToResult()` has been renamed to `RespondedRequest::toArrayForResult()`.
- BREAKING: Removed the `result` and `addLaterToResult` properties from `Io` objects (`Input` and `Output`). These properties were part of the `addToResult` feature and are now removed. Instead, use the `keep` property where kept data is added.
- BREAKING: The return type of the `Crawler::loader()` method no longer allows `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step described below.
- BREAKING: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now extend the `Step` class and use the trait. As multiple loaders are no longer supported, the `addLoader` method was renamed to `setLoader`. Similarly, the methods `useLoader()` and `usesLoader()` for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new `withLoader()` method (e.g., `Http::get()->withLoader($loader)`).
- BREAKING: Removed the `PaginatorInterface` to allow for better extensibility. The old `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator` class has also been removed. Please use the newer, improved version `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`. This newer version has also changed: the first argument `UriInterface $url` is removed from the `processLoaded()` method, as the URL also is part of the request (`Psr\Http\Message\RequestInterface`) which is now the first argument. Additionally, the default implementation of the `getNextRequest()` method is removed. Child implementations must define this method themselves. If your custom paginator still has a `getNextUrl()` method, note that it is no longer needed by the library and will not be called. The `getNextRequest()` method now fulfills its original purpose.
- BREAKING: Removed methods from `HttpLoader`:
  - `$loader->setHeadlessBrowserOptions()` => use `$loader->browser()->setOptions()` instead
  - `$loader->addHeadlessBrowserOptions()` => use `$loader->browser()->addOptions()` instead
  - `$loader->setChromeExecutable()` => use `$loader->browser()->setExecutable()` instead
  - `$loader->browserHelper()` => use `$loader->browser()` instead
- BREAKING: Removed method `RespondedRequest::cacheKeyFromRequest()`. Use `RequestKey::from()` instead.
- BREAKING: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class. This class provides the methods `only()` and `except()` to restrict retries to specific HTTP response status codes. Previously, this method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
- BREAKING: Removed the `Microseconds` class from this package. It has been moved to the `crwlr/utils` package, which you can use instead.
v1.10.0
Added
- URL refiners: `UrlRefiner::withScheme()`, `UrlRefiner::withHost()`, `UrlRefiner::withPort()`, `UrlRefiner::withoutPort()`, `UrlRefiner::withPath()`, `UrlRefiner::withQuery()`, `UrlRefiner::withoutQuery()`, `UrlRefiner::withFragment()` and `UrlRefiner::withoutFragment()`.
- New paginator stop rules `PaginatorStopRules::contains()` and `PaginatorStopRules::notContains()`.
- Static method `UserAgent::mozilla5CompatibleBrowser()` to get a `UserAgent` instance with the user agent string `Mozilla/5.0 (compatible)` and also the new method `withMozilla5CompatibleUserAgent` in the `AnonymousHttpCrawlerBuilder` that you can use like this: `HttpCrawler::make()->withMozilla5CompatibleUserAgent()`.
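As an illustration of how the new user agent helper and the URL refiners might be combined, consider the following sketch. It assumes that the refiners live in the `Crwlr\Crawler\Steps\Refiners` namespace, that `withoutQuery()` and `withoutFragment()` take no arguments, and that they are attached via the step's `refineOutput()` method; the URL and the `a.article` selector are placeholders.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
// Assumed namespace of the UrlRefiner class.
use Crwlr\Crawler\Steps\Refiners\UrlRefiner;

// Crawler that identifies itself with "Mozilla/5.0 (compatible)".
$crawler = HttpCrawler::make()->withMozilla5CompatibleUserAgent();

$crawler
    ->input('https://www.example.com/articles')
    ->addStep(Http::get())
    ->addStep(
        // Collect the (hypothetical) article links and strip query and fragment.
        Html::getLinks('a.article')
            ->refineOutput(UrlRefiner::withoutQuery())
            ->refineOutput(UrlRefiner::withoutFragment())
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```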
v1.9.5
Fixed
- Prevent PHP warnings when an HTTP response includes a `Content-Type: application/x-gzip` header, but the content is not actually compressed. This issue also occurred with cached responses, because compressed content is decoded during caching. Upon retrieval from the cache, the header indicated compression, but the content was already decoded.