NEWS robotstxt

0.7.15.9000

  • null_to_default typo fixed
  • Updates to function documentation

0.7.15 | 2024-08-24

  • CRAN compliance - Packages which use Internet resources should fail gracefully
  • CRAN compliance - fix R CMD check NOTES.

0.7.14 | 2024-08-24

  • CRAN compliance - Packages which use Internet resources should fail gracefully

0.7.13 | 2020-09-03

  • CRAN compliance - prevent URL forwarding (HTTP 301): add www to URLs

0.7.12 | 2020-09-03

  • CRAN compliance - prevent URL forwarding (HTTP 301): add trailing slashes to URLs

0.7.11 | 2020-09-02

  • CRAN compliance - LICENCE file wording; prevent URL forwarding (HTTP 301)

0.7.10 | 2020-08-19

  • fix problem in parse_robotstxt() - a comment in the last line of a robots.txt file would lead to erroneous parsing - reported by @gittaca, #59 and #60

0.7.9 | 2020-08-02

  • fix problem in is_valid_robotstxt() - the robots.txt validity check was too lax - reported by @gittaca, #58

0.7.8 | 2020-07-22

  • fix problem with domain name extraction - reported by @gittaca, #57
  • fix problem with vArYING CasE in robots.txt field names - reported by @steffilazerte, #55

0.7.7 | 2020-06-17

  • fix problem in rt_request_handler - reported by @MHWauben dmi3kno/polite#28 - patch by @dmi3kno

0.7.6 | 2020-06-13

  • make info available on whether or not results were cached - requested by @dmi3kno, #53

0.7.5 | 2020-06-07

  • fix passing through more parameters from robotstxt() to get_robotstxt() - reported and implemented by @dmi3kno

0.7.3 | 2020-05-29

  • minor : improve printing of robots.txt
  • add request data as attribute to robots.txt
  • add as.list() method for robots.txt
  • adding several paragraphs to the README file
  • major : finishing handlers - quality check, documentation
  • fix : Partial matching warnings #51 - reported by @mine-cetinkaya-rundel

0.7.2 | 2020-05-04

  • minor : changes in dependencies were introducing errors when no scheme/protocol was provided in the URL -- fixed #50

0.7.1 | 2018-01-09

  • minor : modifying robots.txt parser to be more robust against different formatting of robots.txt files -- fixed #48

0.7.0 | 2018-11-27

  • major : introducing HTTP handlers to allow for better interpretation of robots.txt files in case of certain events: redirects, server errors, client errors, suspicious content, ... (see the sketch below)
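
A rough sketch of how such a handler might be supplied. The argument name on_server_error and the exported default object on_server_error_default are assumptions based on this release note; consult the package documentation (e.g. ?rt_request_handler) for the exact API.

```r
library(robotstxt)

# Hedged sketch: the handler argument name and on_server_error_default are
# assumptions -- they control what happens on, e.g., HTTP 5xx responses.
txt <- get_robotstxt(
  domain          = "example.com",           # placeholder domain
  on_server_error = on_server_error_default
)
```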

0.6.4 | 2018-09-14

  • minor : pass through of parameter for content encoding

0.6.3 | 2018-09-14

  • minor : introduced parameter encoding to get_robotstxt(); it defaults to "UTF-8", which the content function used anyway - but now it will not complain about it (see the sketch after this list)
  • minor : added comment to help files specifying use of trailing slash in paths pointing to folders in paths_allowed and robotstxt.
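
A small sketch of both points; the domain is a placeholder and the calls assume network access.

```r
library(robotstxt)

# encoding defaults to "UTF-8"; it is passed explicitly here for clarity.
txt <- get_robotstxt(domain = "example.com", encoding = "UTF-8")

# Use a trailing slash when the path refers to a folder rather than a file.
paths_allowed(paths = "/images/", domain = "example.com")
```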

0.6.2 | 2018-07-18

  • minor : changed from future::future_lapply() to future.apply::future_lapply() to make package compatible with versions of future after 1.8.1

0.6.1 | 2018-05-30

  • minor : package was moved to another repo location and a project status badge was added

0.6.0 | 2018-02-10

  • change/fix : the check function paths_allowed() would not return correct results in some edge cases, indicating that the spiderbar/rep-cpp check method is more reliable and shall be the default and only method: see 1, see 2, see 3

0.5.2 | 2017-11-12

  • fix : rt_get_rtxt() would break on Windows due to trying to readLines() from a folder

0.5.1 | 2017-11-11

  • change : spiderbar is now the non-default, second (experimental) check method
  • fix : there were warnings in case of multiple domain guessing

0.5.0 | 2017-10-07

  • feature : spiderbar's can_fetch() was added; now one can choose which check method to use for checking access rights (see the sketch after this list)
  • feature : use futures (from package future) to speed up retrieval and parsing
  • feature : now there is a get_robotstxts() function which is a 'vectorized' version of get_robotstxt()
  • feature : paths_allowed() now allows checking either via robotstxt-parsed robots.txt files or via functionality provided by the spiderbar package (the latter should be faster by approximately a factor of 10)
  • feature : various functions now have a ssl_verifypeer option (analogous to the libcurl option https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html) which might help with robots.txt file retrieval in some cases
  • change : user_agent for robots.txt file retrieval will now default to: sessionInfo()$R.version$version.string
  • change : robotstxt now assumes it knows how to parse --> if it cannot parse, it assumes that it got no valid robots.txt file, meaning that there are no restrictions
  • fix : valid_robotstxt would not accept some actually valid robots.txt files
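
A combined, hedged sketch of the features listed above; the domains are placeholders, and argument defaults (e.g. the allowed check_method values) may have changed in later releases.

```r
library(robotstxt)
library(future)

plan(multisession)   # futures parallelise retrieval of many robots.txt files

# 'vectorized' retrieval of several robots.txt files at once
rtxts <- get_robotstxts(domain = c("example.com", "example.org"))

# choose the check method and tweak SSL peer verification if needed
paths_allowed(
  paths          = c("/images/", "/search"),
  domain         = "example.com",
  check_method   = "spiderbar",   # or "robotstxt" for the package's own parser
  ssl_verifypeer = 1,
  user_agent     = sessionInfo()$R.version$version.string   # the new default
)
```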

0.4.1 | 2017-08-20

  • restructure : put each function in separate file
  • fix : parsing would go bonkers for the robots.txt of cdc.gov (e.g. combining all robots with all permissions) due to erroneous handling of carriage return characters (reported by @hrbrmstr - thanks)

0.4.0 | 2017-07-14

  • user_agent parameter added to robotstxt() and paths_allowed() to allow for a user-defined HTTP user agent to be sent when retrieving the robots.txt file from a domain (see the sketch below)
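
A brief sketch of the new parameter; the user agent string and domain are placeholders.

```r
library(robotstxt)

# Send a custom HTTP user agent when fetching the robots.txt file.
rt <- robotstxt(domain = "example.com", user_agent = "my-crawler/0.1")

paths_allowed(
  paths      = "/some/path/",
  domain     = "example.com",
  user_agent = "my-crawler/0.1"
)
```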

0.3.4 | 2017-07-08

  • fix : non-robots.txt files (e.g. HTML files returned by the server instead of the requested robots.txt, as with facebook.com) would be handled as if they were non-existent / empty files (reported by @simonmunzert - thanks)
  • fix : UTF-8 encoded robots.txt with BOM (byte order mark) would break parsing although files were otherwise valid robots.txt files

0.3.3 | 2016-12-10

  • updating NEWS file and switching to NEWS.md

0.3.2 | 2016-04-28

  • CRAN publication

0.3.1 | 2016-04-27

0.1.2 | 2016-02-08 ...

  • first feature-complete version on CRAN