Releases: crwlrsoft/crawler
v1.6.1
v1.6.0
Added
- Enable dot notation in `Step::addToResult()`, so you can get data from nested output, like: `$step->addToResult(['url' => 'response.url', 'status' => 'response.status', 'foo' => 'bar'])`.
- When a step adds output properties to the result, and the output contains objects, it tries to serialize those objects to arrays by calling `__serialize()`. If you want an object to be serialized differently for that purpose, you can define a `toArrayForAddToResult()` method in that class. When that method exists, it's preferred to the `__serialize()` method.
- Implemented the above-mentioned `toArrayForAddToResult()` method in the `RespondedRequest` class, so on every step that somehow yields a `RespondedRequest` object, you can use the keys `url`, `uri`, `status`, `headers` and `body` with the `addToResult()` method. Previously this only worked for `Http` steps, because they define output key aliases (`HttpBase::outputKeyAliases()`). Now, in combination with the ability to use dot notation when adding data to the result, if your custom step returns nested output like `['response' => RespondedRequest, 'foo' => 'bar']`, you can add response data to the result like this: `$step->addToResult(['url' => 'response.url', 'body' => 'response.body'])`.
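As a rough illustration of the two behaviours described above (dot-notation lookup into nested output, and the preference for `toArrayForAddToResult()` over `__serialize()`), the following self-contained sketch mimics them in plain PHP. It is not the library's implementation: `FakeResponse`, `toResultArray()` and `getByDotNotation()` are hypothetical names made up for this example, and the real `Step` class may resolve paths differently.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for an object such as RespondedRequest.
final class FakeResponse
{
    public function __construct(
        private string $url,
        private int $status,
    ) {}

    /** @return array<string, mixed> */
    public function __serialize(): array
    {
        return ['rawUrl' => $this->url, 'rawStatus' => $this->status];
    }

    /** @return array<string, mixed> */
    public function toArrayForAddToResult(): array
    {
        return ['url' => $this->url, 'status' => $this->status];
    }
}

// Assumed precedence: toArrayForAddToResult() first, then __serialize().
function toResultArray(object $object): array
{
    if (method_exists($object, 'toArrayForAddToResult')) {
        return $object->toArrayForAddToResult();
    }

    if (method_exists($object, '__serialize')) {
        return $object->__serialize();
    }

    return [];
}

// Walks a path like "response.url" through nested arrays and objects.
function getByDotNotation(array $output, string $path): mixed
{
    $current = $output;

    foreach (explode('.', $path) as $segment) {
        if (is_object($current)) {
            $current = toResultArray($current);
        }

        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null; // path does not exist in the output
        }

        $current = $current[$segment];
    }

    return $current;
}

// Nested step output, shaped like the example in the changelog entry.
$output = [
    'response' => new FakeResponse('https://www.example.com/foo', 200),
    'foo' => 'bar',
];

// Result key => path in the output.
$mapping = ['url' => 'response.url', 'status' => 'response.status', 'foo' => 'foo'];

$result = [];

foreach ($mapping as $resultKey => $outputPath) {
    $result[$resultKey] = getByDotNotation($output, $outputPath);
}
```

Because `FakeResponse` defines `toArrayForAddToResult()`, the keys `url` and `status` resolve, while the keys returned by `__serialize()` (`rawUrl`, `rawStatus`) are never consulted.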
Fixed
- Improved the timing of when a store (`Store` class instance) is called by the crawler with a final crawling result. When a crawling step initiates a crawling result (that is, `addToResult()` was called on the step instance), the crawler has to wait for all child outputs (resulting from one step input) before it calls the store, because the child outputs can all add data to the same final result object. Previously, however, the crawler waited not only for all child outputs starting from the step where `addToResult()` was called, but for all children of one initial crawler input. With this change, in a lot of cases, the store is called earlier with finished `Result` objects, and memory usage is lower.
v1.5.3
Fixed
- Merge `HttpBaseLoader` back to `HttpLoader`. It's probably not a good idea to have multiple loaders. At least not multiple loaders just for HTTP. It should be enough to publicly expose the `HeadlessBrowserLoaderHelper` via `HttpLoader::browserHelper()` for the extension steps. But keep the `HttpBase` step, to share the general HTTP functionality implemented there.
v1.5.2
v1.5.1
Fixed
- To be more flexible in building a separate headless browser loader (in an extension package), extract the most basic HTTP loader functionality to a new `HttpBaseLoader` and important functionality for the headless browser loader to a new `HeadlessBrowserLoaderHelper`. Further, also share functionality from the `Http` steps via a new abstract `HttpBase` step. It's considered a fix because there's no new functionality, just refactoring of existing code for better extensibility.
v1.5.0
Added
- The `DomQuery` class (parent of `CssSelector` (`Dom::cssSelector`) and `XPathQuery` (`Dom::xPath`)) has a new method `formattedText()` that uses the new crwlr/html-2-text package to convert the HTML to formatted plain text. You can also provide a customized instance of the `Html2Text` class to the `formattedText()` method.
Fixed
- The `Http::crawl()` step won't yield a page again if a newly found URL responds with a redirect to a previously loaded URL.
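The idea behind this fix can be sketched in a few lines of plain PHP: resolve each queued URL to its final (post-redirect) URL first, and only yield the page if that final URL has not been loaded before. This is a simplified illustration under stated assumptions, not the code used by `Http::crawl()`; the `$redirects` table stands in for real HTTP responses, and `resolveFinalUrl()` and `crawl()` are hypothetical helpers.

```php
<?php

declare(strict_types=1);

// Hypothetical redirect table standing in for real HTTP responses.
$redirects = [
    'https://www.example.com/old-about' => 'https://www.example.com/about',
];

// Follows redirects (with a hop limit) and returns the final URL.
function resolveFinalUrl(string $url, array $redirects): string
{
    $hops = 0;

    while (isset($redirects[$url]) && $hops < 10) {
        $url = $redirects[$url];
        $hops++;
    }

    return $url;
}

// Returns the URLs of the pages that would be yielded, without repeats.
function crawl(array $queue, array $redirects): array
{
    $loaded = [];  // final URLs already yielded, used as a set
    $yielded = [];

    foreach ($queue as $url) {
        $finalUrl = resolveFinalUrl($url, $redirects);

        if (isset($loaded[$finalUrl])) {
            continue; // redirect target was already loaded, skip it
        }

        $loaded[$finalUrl] = true;
        $yielded[] = $finalUrl;
    }

    return $yielded;
}

$pages = crawl([
    'https://www.example.com/',
    'https://www.example.com/about',
    'https://www.example.com/old-about', // redirects to /about
], $redirects);
```

Keying the "already loaded" set by the final URL rather than the requested URL is what prevents the redirected page from being yielded a second time.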