Releases: crwlrsoft/crawler
v1.1.2
v1.1.1
Fixed
- There was an issue when adding multiple associative arrays with the same key to a `Result` object. Say a step produces array output like `['bar' => 'something', 'baz' => 'something else']` and the whole array shall be added to the result property `foo`. When the step produced multiple such array outputs, that led to a result like `['bar' => '...', 'baz' => '...', ['bar' => '...', 'baz' => '...'], ['bar' => '...', 'baz' => '...']]`. Now it's fixed to result in `[['bar' => '...', 'baz' => '...'], ['bar' => '...', 'baz' => '...'], ['bar' => '...', 'baz' => '...']]`.
v1.1.0
Added
- `Http` steps can now receive body and headers from input data (instead of statically defining them via arguments like `Http::method(headers: ...)`) using the new methods `useInputKeyAsBody(<key>)` and `useInputKeyAsHeader(<key>, <asHeader>)` or `useInputKeyAsHeaders(<key>)`. Further, when invoked with associative array input data, the step will by default use the value from `url` or `uri` for the request URL. If the input array contains the URL in a key with a different name, you can use the new `useInputKeyAsUrl(<key>)` method. That was basically already possible with the existing `useInputKey(<key>)` method, because the URL is the main input argument for the step. But if you want to use it in combination with the other new `useInputKeyAsXyz()` methods, you have to use `useInputKeyAsUrl()`, because using `useInputKey(<key>)` would invoke the whole step with that key only.
- `Crawler::runAndDump()` as a simple way to just run a crawler and dump all results, each as an array.
- `addToResult()` now also works with serializable objects.
- If you know certain keys that the output of a step will contain, you can now also define aliases for those keys, to be used with `addToResult()`. The output of an `Http` step (`RespondedRequest`) contains the keys `requestUri` and `effectiveUri`. The aliases `url` and `uri` refer to `effectiveUri`, so `addToResult(['url'])` will add the `effectiveUri` as `url` to the result object.
- The `GetLink` (`Html::getLink()`) and `GetLinks` (`Html::getLinks()`) steps, as well as the abstract `DomQuery` class (parent of `CssSelector` (/`Dom::cssSelector()`) and `XPathQuery` (/`Dom::xPath()`)), now have a method `withoutFragment()` to get links, respectively URLs, without their fragment part.
- The `HttpCrawl` step (`Http::crawl()`) has a new method `useCanonicalLinks()`. If you call it, the step will not yield a response if its canonical link URL was already yielded. And if it discovers a link, and some document pointing to that URL via canonical link was already loaded, it treats it as already loaded. Further, this feature also sets the canonical link URL as the `effectiveUri` of the response.
- All filters can now be negated by calling the `negate()` method, so the `evaluate()` method will return the opposite bool value when called. The `negate()` method returns an instance of `NegatedFilter` that wraps the original filter.
- New method `cacheOnlyWhereUrl()` in the `HttpLoader` class, which takes an instance of the `FilterInterface` as argument. If you define one or multiple filters using this method, the loader will cache only responses for URLs that match all the filters.
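Taken together, the new `useInputKeyAs...()` methods and the `url` output key alias could be used roughly like this. This is only a sketch: the input keys `link` and `payload` and the surrounding setup are made-up examples, not part of the library.

```php
use Crwlr\Crawler\Steps\Loading\Http;

// Assume a previous step yields outputs like:
// ['link' => 'https://www.example.com/api', 'payload' => '{"foo":"bar"}']
$step = Http::post(headers: ['accept' => 'application/json'])
    ->useInputKeyAsUrl('link')      // take the request URL from the 'link' key
    ->useInputKeyAsBody('payload')  // take the request body from the 'payload' key
    ->addToResult(['url', 'body']); // 'url' is an alias for effectiveUri
```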
Fixed
- The `HttpCrawl` step (`Http::crawl()`) by default now removes the fragment part of URLs to not load the same page multiple times, because in almost any case, servers won't respond with different content based on the fragment. That's why this change is considered non-breaking. For the rare cases when servers do respond with different content based on the fragment, you can call the step's new `keepUrlFragment()` method.
- Although the `HttpCrawl` step (`Http::crawl()`) already respected the limit of outputs defined via the `maxOutputs()` method, it actually didn't stop loading pages. The limit had no effect on loading, only on passing outputs (responses) on to the next step. This is fixed in this version.
- A so-called byte order mark at the beginning of a file (or string) can cause issues, so it is now removed when a step's input string starts with a UTF-8 BOM.
- There seems to be an issue in guzzle when it gets a PSR-7 request object with a header with multiple string values (as an array, like `['accept-encoding' => ['gzip', 'deflate', 'br']]`). When testing, it happened that only the last part (in this case `br`) was sent. Therefore the `HttpLoader` now prepares headers before sending (in this case to `['accept-encoding' => ['gzip, deflate, br']]`).
- You can now also use the output key aliases when filtering step outputs. You can even use keys that are only present in the serialized version of an output object.
v1.0.2
v1.0.1
v1.0.0
Added
- New method `Step::refineOutput()` to manually refine step output values. It takes either a `Closure` or an instance of the new `RefinerInterface` as argument. If the step produces array output, you can provide a key from the array output to refine as first argument and the refiner as second argument. You can call the method multiple times, and all the refiners will be applied to the outputs in the order you add them. If you want to refine multiple output array keys with a `Closure`, you can skip providing a key and the `Closure` will receive the full output array for refinement. As mentioned, you can provide an instance of the `RefinerInterface`. There are already a few implementations: `StringRefiner::afterFirst()`, `StringRefiner::afterLast()`, `StringRefiner::beforeFirst()`, `StringRefiner::beforeLast()`, `StringRefiner::betweenFirst()`, `StringRefiner::betweenLast()` and `StringRefiner::replace()`.
- New method `Step::excludeFromGroupOutput()` to exclude a normal step's output from the combined output of a group that it's part of.
- New method `HttpLoader::setMaxRedirects()` to customize the limit of redirects to follow. Works only when using the HTTP client.
- New filters to filter by string length, with the same options as the comparison filters (equal, not equal, greater than, ...).
- New `Filter::custom()` that you can use with a `Closure`, so you're not limited to the available filters only.
- New method `DomQuery::link()` as a shortcut for `DomQuery::attribute('href')->toAbsoluteUrl()`.
- New static method `HttpCrawler::make()` returning an instance of the new class `AnonymousHttpCrawlerBuilder`. This makes it possible to create your own crawler instance with a one-liner like `HttpCrawler::make()->withBotUserAgent('MyCrawler')`. There's also a `withUserAgent()` method to create an instance with a normal (non-bot) user agent.
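The `Step::refineOutput()` method with one of the `StringRefiner` implementations could look roughly like this. A sketch only: the selectors, keys, and the `"€ "` separator are made-up examples, and the exact extraction setup may differ from your use case.

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Refiners\StringRefiner;

// Extract title and price from a page, then strip the currency
// prefix from the 'price' value via a string refiner.
$step = Html::root()
    ->extract(['title' => 'h1', 'price' => '.price'])
    ->refineOutput('price', StringRefiner::afterFirst('€ '));
```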
Changed
- BREAKING: The `FileCache` now also respects the `ttl` (time to live) argument, and by default it is one hour (3600 seconds). If you're using the cache and expect the items to live (basically) forever, please provide a high enough default time to live. When you try to get a cache item that is already expired, its file is immediately deleted.
- BREAKING: The `TooManyRequestsHandler` (and with that also the corresponding constructor argument in the `HttpLoader`) was renamed to `RetryErrorResponseHandler`. It now reacts the same to 503 (Service Unavailable) responses as to 429 (Too Many Requests) responses. If you're actively passing your own instance to the `HttpLoader`, you need to update it.
- You can now have multiple different loaders in a `Crawler`. To use this, return an array containing your loaders from the protected `Crawler::loader()` method, with keys to name them. You can then selectively use them by calling the `Step::useLoader()` method on a loading step with the key of the loader it should use.
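A minimal sketch of the multiple-loaders setup. The loader names `http` and `headless` are made-up, and the exact signature of the `loader()` method may differ from what is shown here:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    // Return an array of named loaders instead of a single loader instance.
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): array
    {
        $headless = new HttpLoader($userAgent, logger: $logger);
        $headless->useHeadlessBrowser();

        return [
            'http' => new HttpLoader($userAgent, logger: $logger),
            'headless' => $headless,
        ];
    }

    // ...
}

// On a loading step, pick the loader by its key:
// $crawler->addStep(Http::get()->useLoader('headless'));
```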
Removed
- BREAKING: The loop feature. The only real-world use case should be paginating listings, and this should be solved with the Paginator feature.
- BREAKING: `Step::dontCascade()` and `Step::cascades()`, because with the change in v0.7 that groups can only produce combined output, there should be no use case for them anymore. If you want to exclude one step's output from the combined group output, you can use the new `Step::excludeFromGroupOutput()` method.
v0.7.0
Added
- New functionality to paginate: There is the new `Paginate` child class of the `Http` step class (easy access via `Http::get()->paginate()`). It takes an instance of the `PaginatorInterface` and uses it to iterate through pagination links. There is one implementation of that interface, the `SimpleWebsitePaginator`. The `Http::get()->paginate()` method uses it by default when called just with a CSS selector to get pagination links. Paginators receive all loaded pages and implement the logic to find pagination links. The paginator class is also called before sending a request, with the request object that is about to be sent as an argument (`prepareRequest()`). This way, it should even be doable to implement more complex pagination functionality, for example when pagination is built using POST requests with query strings in the request body.
- New methods `stopOnErrorResponse()` and `yieldErrorResponses()` that can be used with `Http` steps. By calling `stopOnErrorResponse()`, the step will throw a `LoadingException` when a response has a 4xx or 5xx status code. By calling `yieldErrorResponses()`, even error responses will be yielded and passed on to the next steps (this was default behaviour until this version; see the breaking change below).
- The body of HTTP responses with a `Content-Type` header containing `application/x-gzip` is automatically decoded when `Http::getBodyString()` is used. Therefore, added `ext-zlib` to the suggestions in `composer.json`.
- New methods `addToResult()` and `addLaterToResult()`. `addToResult()` is a single replacement for `setResultKey()` and `addKeysToResult()` (they are removed, see `Changed` below) that can be used for array and non-array output. `addLaterToResult()` is a new method that does not create a `Result` object immediately, but instead adds the output of the current step to all the `Result`s that will later be created originating from the current output.
- New methods `outputKey()` and `keepInputData()` that can be used with any step. Using the `outputKey()` method, the step will convert non-array output to an array and use the key provided as argument to this method as the array key for the output value. The `keepInputData()` method allows you to forward data from the step's input to the output. If the input is non-array, you can define a key using the method's argument. This is useful e.g. if you have data in the initial inputs that you also want to add to the final crawling results.
- New method `createsResult()` that can be used with any step, so you can differentiate if a step creates a `Result` object or just keeps data to add to results later (new `addLaterToResult()` method). But it's primarily relevant for library-internal use.
- The `FileCache` class can now compress the cache data to save disk space. Use the `useCompression()` method to do so.
- New method `retryCachedErrorResponses()` in `HttpLoader`. When called, the loader will only use successful responses (status code < 400) from the cache and therefore retry already cached error responses.
- New method `writeOnlyCache()` in `HttpLoader` to only write to, but not read from, the response cache. Can be used to renew cached responses.
- `Filter::urlPathMatches()` to filter URL paths using a regex.
- Option to provide a chrome executable name to the `chrome-php/chrome` library via `HttpLoader::setChromeExecutable()`.
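The simplest form of the new pagination functionality, using the default `SimpleWebsitePaginator` via a CSS selector, could look like this. A sketch only: the URL and the `.pagination a` selector are made-up examples.

```php
use Crwlr\Crawler\Steps\Loading\Http;

// Load a listing page and keep following the pagination links
// matched by the CSS selector until there are no new ones.
$crawler->input('https://www.example.com/listing');
$crawler->addStep(
    Http::get()->paginate('.pagination a')
);
```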
Changed
- BREAKING: Group steps can now only produce combined outputs, as previously done when the `combineToSingleOutput()` method was called. The method is removed.
- BREAKING: `setResultKey()` and `addKeysToResult()` are removed. Calls to those methods can both be replaced with calls to the new `addToResult()` method.
- BREAKING: `getResultKey()` is also removed along with `setResultKey()`. It's removed without replacement, as it doesn't really make sense any longer.
- BREAKING: Error responses (4xx as well as 5xx) by default won't produce any step outputs any longer. If you want to receive error responses, use the new `yieldErrorResponses()` method.
- BREAKING: Removed the `httpClient()` method in the `HttpCrawler` class. If you want to provide your own HTTP client, implement a custom `loader` method passing your client to the `HttpLoader` instead.
- Deprecated the loop feature (class `Loop` and `Crawler::loop()` method). Probably the only use case is iterating over paginated list pages, which can be done using the new Paginator functionality. It will be removed in v1.0.
- In case of a 429 (Too Many Requests) response, the `HttpLoader` now automatically waits and retries. By default, it retries twice and waits 10 seconds for the first retry and a minute for the second one. In case the response also contains a `Retry-After` header with a value in seconds, it complies with that. Exception: by default it waits at max 60 seconds (you can set your own limit if you want); if the `Retry-After` value is higher, it will stop crawling. If all the retries also receive a 429, it also throws an exception.
- Removed the logger from the `Throttler` as it doesn't log anything.
- Fail silently when a `robots.txt` file can't be parsed.
- Default timeout configuration for the default guzzle HTTP client: `connect_timeout` is 10 seconds and `timeout` is 60 seconds.
- The `validateAndSanitize...()` methods in the abstract `Step` class, when called with an array with one single element, automatically try to use that array element as the input value.
- With the `Html` and `Xml` data extraction steps, you can now add layers to the data that is being extracted, by just adding further `Html`/`Xml` data extraction steps as values in the mapping array that you pass as argument to the `extract()` method.
- The base `Http` step can now also be called with an array of URLs as a single input. Crawl and Paginate steps still require a single URL input.
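Calling the base `Http` step with an array of URLs as one single input could look like this. A sketch only: the URLs are made-up examples, and note the nested array, because the whole list counts as one input.

```php
use Crwlr\Crawler\Steps\Loading\Http;

// One single input that is an array of URLs (hence the nesting).
$crawler->inputs([[
    'https://www.example.com/a',
    'https://www.example.com/b',
]]);
$crawler->addStep(Http::get());
```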
Fixed
- The `CookieJar` now also works with `localhost` or other hosts without a registered domain name.
- Improved the `Sitemap::getUrlsFromSitemap()` step to also work when the `<urlset>` tag contains attributes that would cause the symfony DomCrawler to not find any elements.
- Fixed the possibility of infinite redirects in the `HttpLoader` by adding a redirect limit of 10.
v0.6.0
Added
- New step `Http::crawl()` (class `HttpCrawl` extending the normal `Http` step class) for conventional crawling. It loads all pages of a website (same host or domain) by following links. There are also a lot of options like depth, filtering by paths, and so on.
- New steps `Sitemap::getSitemapsFromRobotsTxt()` (`GetSitemapsFromRobotsTxt`) and `Sitemap::getUrlsFromSitemap()` (`GetUrlsFromSitemap`) to get sitemap (URLs) from a robots.txt file and to get all the URLs from those sitemaps.
- New step `Html::metaData()` to get data from meta tags (and the title tag) in HTML documents.
- New step `Html::schemaOrg()` (`SchemaOrg`) to get schema.org structured data in JSON-LD format from HTML documents.
- The abstract `DomQuery` class (parent of the `CssSelector` and `XPathQuery` classes) now has some methods to narrow the selected matches further: `first()`, `last()`, `nth(n)`, `even()`, `odd()`.
Changed
- BREAKING: Removed `PoliteHttpLoader` and the traits `WaitPolitely` and `CheckRobotsTxt`. Converted the traits to the classes `Throttler` and `RobotsTxtHandler`, which are dependencies of the `HttpLoader`. The `HttpLoader` internally gets default instances of those classes. The `RobotsTxtHandler` will respect robots.txt rules by default if you use a `BotUserAgent`, and it won't if you use a normal `UserAgent`. You can access the loader's `RobotsTxtHandler` via `HttpLoader::robotsTxt()`. You can pass your own instance of the `Throttler` to the loader and also access it via `HttpLoader::throttle()` to change settings.
Fixed
- Getting absolute links via the `GetLink` and `GetLinks` steps and the `toAbsoluteUrl()` method of the `CssSelector` and `XPathQuery` classes now also looks for `<base>` tags in the HTML when resolving the URLs.
- The `SimpleCsvFileStore` can now also save results with nested data (but only up to the second level). It just concatenates the values, separated with a `|`.
v0.5.0
Added
- You can now call the new `useHeadlessBrowser()` method on the `HttpLoader` class to use a headless Chrome browser to load pages. This is enough to get the HTML after executing javascript in the browser. For more sophisticated tasks, a separate loader and/or steps should better be created.
- With the `maxOutputs()` method of the abstract `Step` class, you can now limit how many outputs a certain step should yield at max. That's for example helpful during development, when you want to run the crawler with only a small subset of the data/requests it will actually have to process when you eventually remove the limits. When a step has reached its limit, it won't even call the `invoke()` method any longer until the step is reset after a run.
- With the new `outputHook()` method of the abstract `Crawler` class, you can set a closure that'll receive all the outputs from all the steps. Should be used for debugging reasons only.
- The `extract()` method of the `Html` and `Xml` (children of `Dom`) steps now also works with a single selector instead of an array with a mapping. Sometimes you'll want to just get a simple string output, e.g. for a next step, instead of an array with mapped extracted data.
- In addition to `uniqueOutputs()`, there is now also `uniqueInputs()`. It works exactly like `uniqueOutputs()`, filtering duplicate input values instead. Optionally also by a key, when the expected input is an array or an object.
- In order to be able to also get absolute links when using the `extract()` method of Dom steps, the abstract `DomQuery` class now has a `toAbsoluteUrl()` method. The Dom step will automatically provide the `DomQuery` instance with the base URL, presuming the input was an instance of the `RespondedRequest` class, and resolve the selected value against that base URL.
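Limiting a step's outputs during development could look like this. A sketch only: the URL and the limit value are made-up examples.

```php
use Crwlr\Crawler\Steps\Loading\Http;

// While developing, cap this step at 5 outputs so test runs
// don't process the full set of requests.
$crawler->input('https://www.example.com/listing');
$crawler->addStep(
    Http::get()->maxOutputs(5)
);
```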
Changed
- Removed some not so important log messages.
- Improved the behavior of the group step's `combineToSingleOutput()`. When steps yield multiple outputs, don't combine all yielded outputs into one. Instead, combine the first output from the first step with the first output from the second step, and so on.
- When results are not explicitly composed, but the outputs of the last step are arrays with string keys, those keys are now set on the `Result` object, instead of setting a key `unnamed` with the whole array as its value.
Fixed
- The static methods `Html::getLink()` and `Html::getLinks()` now also work without an argument, like the `GetLink` and `GetLinks` classes.
- When a `DomQuery` (CSS selector or XPath query) doesn't match anything, its `apply()` method now returns `null` (instead of an empty string). When the `Html(/Xml)::extract()` method is used with a single, non-matching selector/query, nothing is yielded. When it's used with an array with a mapping, it yields an array with null values. If the selector for one of the methods `Html(/Xml)::each()`, `Html(/Xml)::first()` or `Html(/Xml)::last()` doesn't match anything, that no longer causes an error; it just won't yield anything.
- Removed the (unnecessary) second argument from the `Loop::withInput()` method, because when `keepLoopingWithoutOutput()` is called and `withInput()` is called after that call, it resets the behavior.
- Fixed an issue when the date format of the expires date in a cookie doesn't have dashes in `d-M-Y` (so `d M Y`).