Releases: nadar/crawler

1.7.1 (5. April 2022)

  • Added a catch for Throwable when parsing PDFs; also updated to the latest version of smalot/pdfparser. A sketch of the pattern follows below.
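
A minimal sketch of the defensive pattern this release describes, using the smalot/pdfparser API; the surrounding crawler integration, the `$pdfBinary` variable, and the fallback behavior are illustrative, not the library's actual code:

```php
<?php

use Smalot\PdfParser\Parser;

// Wrap PDF parsing in a catch for \Throwable so a single malformed
// PDF cannot abort the whole crawl.
try {
    $document = (new Parser())->parseContent($pdfBinary);
    $text = $document->getText();
} catch (\Throwable $e) {
    $text = ''; // treat unparseable PDFs as empty content and move on
}
```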

1.7.0 (10. August 2021)

  • #20 Improved tag stripping in the HTML parser to generate cleaner, more readable output when $stripTags is enabled. Markup like <p>foo</p><p>bar</p> is now handled as foo bar instead of foobar; see the sketch below.
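
A plain-PHP sketch of the technique (not the library's actual implementation): pad tags with a space before stripping them, then collapse whitespace, so adjacent block elements no longer run together.

```php
<?php

// Sketch only: approximates the improved $stripTags behavior.
$html = '<p>foo</p><p>bar</p>';

// Insert a space before each tag, strip tags, then collapse whitespace.
$text = trim(preg_replace('/\s+/', ' ', strip_tags(str_replace('<', ' <', $html))));

echo $text; // "foo bar" instead of the pre-1.7.0 "foobar"
```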

1.6.2 (16. April 2021)

  • #18 Fixed an issue with pages that contain UTF-8 characters in the title tag.

1.6.1 (16. April 2021)

  • #17 Fixed an issue where the crawler group was not generated correctly.

1.6.0 (16. March 2021)

  • #15 Links with rel="nofollow" are no longer followed by default. This can be configured via the HtmlParser::$ignoreRels property; see the example below.
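
For example, to opt back into following rel="nofollow" links, the $ignoreRels property named above can be emptied. The namespace and the assumed default value are assumptions; only the property name comes from this note:

```php
<?php

use Nadar\Crawler\Parsers\HtmlParser; // namespace assumed

$parser = new HtmlParser();

// Emptying $ignoreRels should make the parser follow rel="nofollow"
// links again (default assumed to contain 'nofollow').
$parser->ignoreRels = [];
```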

1.5.0 (13. January 2021)

  • #14 Pass the status code of the response into the parsers and process only HTML and PDF responses with code 200 (OK), as illustrated below.
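
Conceptually, the rule looks like this; the variable name and branch bodies are illustrative, not the library's exact API:

```php
<?php

// Illustrative only: parsers now receive the status code and skip
// anything that is not 200 (OK).
if ($statusCode === 200) {
    // hand the body to the HTML/PDF parsers
} else {
    // e.g. 404 or 500 responses are no longer parsed
}
```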

1.4.0 (13. January 2021)

  • #13 The new Crawler method getCycles() returns the number of times the run() method was called; see the sketch below.
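
A sketch of how the new method can be used; the construction of $crawler is deliberately omitted, and only run() and getCycles() come from this note:

```php
<?php

// Sketch: $crawler setup is elided; run() and getCycles() are the
// methods named in this release note.
$crawler->run();
$crawler->run();

echo $crawler->getCycles(); // 2, since run() was called twice
```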

1.3.0 (20. December 2020)

  • #10 Added a relative URL check to the Url class.
  • #8 Merge the path of a URL when only a query param is provided; see the illustration after this list.
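
A plain-PHP illustration of the #8 path-merge rule (not the library's Url class): a link consisting only of a query string inherits the path of the page it was found on.

```php
<?php

// Illustration only (requires PHP 8 for str_starts_with): resolve a
// query-only link against the page it was found on.
function mergeQueryOnlyLink(string $pageUrl, string $link): string
{
    if (str_starts_with($link, '?')) {
        // keep scheme/host/path of the page, swap in the new query
        return strtok($pageUrl, '?') . $link;
    }

    return $link;
}

echo mergeQueryOnlyLink('https://example.com/blog/post?old=1', '?page=2');
// https://example.com/blog/post?page=2
```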

1.2.1 (17. December 2020)

  • #9 Fixed an issue where the CRAWL_IGNORE tag had no effect. The array value for found links, which equals the link title, is now trimmed.

1.2.0 (14. November 2020)

  • #7 By default, response content larger than 5 MB is no longer passed to the parsers. To turn this behavior off, use 'maxSize' => false, or raise the limit, e.g. 'maxSize' => 15000000 (15 MB); the value must be provided in bytes. The main goal is to keep the PDF parser from running into very high memory consumption. This restriction does not stop the crawler from downloading the URL (whether or not it exceeds the maxSize definition); it only prevents memory leaks when the parsers start to interact with the response content. See the configuration example below.
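
The two configurations mentioned above, side by side. The option name and values come from this note; where exactly the options array is passed (e.g. the Crawler constructor) is an assumption:

```php
<?php

// Values taken from this release note; where the options array is
// consumed is an assumption.
$options = [
    'maxSize' => 15000000, // raise the limit to 15 MB (value in bytes)
];

// ... or disable the size check entirely:
$options = [
    'maxSize' => false,
];
```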