- null_to_default typo fixed
- Updates to function documentation
- CRAN compliance - Packages which use Internet resources should fail gracefully
- CRAN compliance - fix R CMD check NOTES.
- CRAN compliance - Packages which use Internet resources should fail gracefully
- CRAN compliance - prevent URL forwarding (HTTP 301): add www to URLs
- CRAN compliance - prevent URL forwarding (HTTP 301): add trailing slashes to URLs
- CRAN compliance - LICENCE file wording; prevent URL forwarding (HTTP 301)
- fix problem in parse_robotstxt() - a comment in the last line of a robots.txt file would lead to erroneous parsing - reported by @gittaca, #59 and #60
- fix problem in is_valid_robotstxt() - robots.txt validity check was too lax - reported by @gittaca, #58
- fix problem with domain name extraction - reported by @gittaca, #57
- fix problem with vArYING CasE in robots.txt field names - reported by @steffilazerte, #55
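A minimal sketch of exercising the parsing and validity fixes above, assuming both functions accept the raw robots.txt text as a single string (as returned by get_robotstxt()); the $permissions element at the end is an assumption about the parser's return structure:

```r
library(robotstxt)

# robots.txt with a mixed-case field name and a comment on its last line,
# i.e. the cases the fixes above address
txt <- paste(
  "User-Agent: *",
  "disallow: /private/",
  "# trailing comment",
  sep = "\n"
)

is_valid_robotstxt(txt)        # validity check, stricter since the fix above
parsed <- parse_robotstxt(txt)
parsed$permissions             # assumption: a data frame of field/useragent/value rows
```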
- fix problem in rt_request_handler - reported by @MHWauben dmi3kno/polite#28 - patch by @dmi3kno
- make info on whether or not results were cached available - requested by @dmi3kno, #53
- fix passing through more parameters from robotstxt() to get_robotstxt() - reported and implemented by @dmi3kno
- minor : improve printing of robots.txt
- add request data as attribute to robots.txt
- add as.list() method for robots.txt
- add several paragraphs to the README file
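A hedged sketch of the additions above, shown on the object returned by get_robotstxt() (requires internet access); the attribute names holding the request data and the cache flag are assumptions, not documented names, and the domain is only an example:

```r
library(robotstxt)

rtxt <- get_robotstxt("github.com")

names(attributes(rtxt))   # request data and cache info should appear here
as.list(rtxt)             # list view of the retrieved robots.txt
print(rtxt)               # improved printing mentioned above
```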
- major : finishing handlers - quality check, documentation
- fix : Partial matching warnings #51 - reported by @mine-cetinkaya-rundel
- minor : changes in dependencies were introducing errors when no scheme/protocol was provided in the URL -- fixed #50
- minor : modifying robots.txt parser to be more robust against different formatting of robots.txt files -- fixed #48
- major : introducing http handlers to allow for better interpretation of robots.txt files in case of certain events: redirects, server errors, client errors, suspicious content, ...
- minor : pass through of parameter for content encoding
- minor : introduced parameter encoding to get_robotstxt() that defaults to "UTF-8", which is what the content function does anyway - but now it will not complain about it
- minor : added comment to help files specifying the use of a trailing slash in paths pointing to folders in paths_allowed and robotstxt
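A small sketch of the two entries above (requires internet access); the domain and paths are only examples:

```r
library(robotstxt)

# explicit UTF-8 decoding of the retrieved robots.txt ("UTF-8" is the default,
# so this call only makes the choice visible)
rt <- get_robotstxt("github.com", encoding = "UTF-8")

# per the help-file note: use a trailing slash when a path denotes a folder
paths_allowed(paths = c("/images/", "/search"), domain = "github.com")
```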
- minor : changed from future::future_lapply() to future.apply::future_lapply() to make the package compatible with versions of future after 1.8.1
- minor : package was moved to other repo location and project status badge was added
- change/fix : check function paths_allowed() would not return correct results in some edge cases, indicating that the spiderbar/rep-cpp check method is more reliable and shall be the default and only method: see 1, see 2, see 3
- fix : rt_get_rtxt() would break on Windows due to trying to readLines() from a folder
- change : spiderbar is now non-default second (experimental) check method
- fix : there were warnings in case of multiple domain guessing
- feature : spiderbar's can_fetch() was added; now one can choose which check method to use for checking access rights
- feature : use futures (from package future) to speed up retrieval and parsing
- feature : now there is a get_robotstxts() function which is a 'vectorized' version of get_robotstxt()
- feature : paths_allowed() now allows checking via either robotstxt's parsed robots.txt files or via functionality provided by the spiderbar package (the latter should be faster by approximately a factor of 10)
- feature : various functions now have an ssl_verifypeer option (analogous to the CURL option https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html) which might help with robots.txt file retrieval in some cases
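A sketch combining the features above, under the assumption that the argument names match the entries (check_method, ssl_verifypeer) and that a future plan set beforehand is picked up for parallel retrieval; domains and paths are only examples:

```r
library(robotstxt)

# optional: set a future plan so retrieval/parsing can run in parallel
future::plan(future::multisession)

# vectorized retrieval of several robots.txt files at once
domains <- c("github.com", "wikipedia.org")
txts    <- get_robotstxts(domains)

# choose the check method explicitly and pass ssl_verifypeer through to curl
paths_allowed(
  paths          = c("/search", "/images/"),
  domain         = "github.com",
  bot            = "*",
  check_method   = "spiderbar",
  ssl_verifypeer = 1
)
```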
- change : user_agent for robots.txt file retrieval will now default to: sessionInfo()$R.version$version.string
- change : robotstxt now assumes it knows how to parse --> if it cannot parse, it assumes it got no valid robots.txt file, meaning that there are no restrictions
- fix : valid_robotstxt would not accept some actually valid robots.txt files
- restructure : put each function in separate file
- fix : parsing would go bonkers for the robots.txt of cdc.gov (e.g. combining all robots with all permissions) due to erroneous handling of the carriage return character (reported by @hrbrmstr - thanks)
- user_agent parameter added to robotstxt() and paths_allowed() to allow for a user-defined HTTP user agent sent when retrieving the robots.txt file from a domain
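A brief sketch of the user_agent parameter described above; the user agent string and domain are only examples:

```r
library(robotstxt)

# send a custom HTTP user agent when fetching the robots.txt
paths_allowed(
  paths      = "/",
  domain     = "github.com",
  user_agent = "mycrawler/0.1 (+https://example.org/crawler)"
)
```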
- fix : non robots.txt files (e.g. html files returned by a server instead of the requested robots.txt / facebook.com) would be handled as if they were non-existent / empty files (reported by @simonmunzert - thanks)
- fix : UTF-8 encoded robots.txt with BOM (byte order mark) would break parsing although files were otherwise valid robots.txt files
- updating NEWS file and switching to NEWS.md
- CRAN publication
- get_robotstxt() tests for HTTP errors and handles them; warnings might be suppressed while implausible HTTP status codes will lead to stopping the function https://github.com/ropenscilabs/robotstxt#5
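A hedged illustration of the behaviour described above: a missing robots.txt yields a warning (and the domain is treated as unrestricted), while implausible status codes stop the function; whether the example domain actually lacks a robots.txt is an assumption:

```r
library(robotstxt)

# if the server answers with e.g. 404, get_robotstxt() warns and the domain is
# treated as unrestricted; wrap in suppressWarnings() when that is expected
rt <- suppressWarnings(get_robotstxt("example.com"))
```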
- dropping R6 dependency and using a list implementation instead https://github.com/ropenscilabs/robotstxt#6
- use caching for get_robotstxt() https://github.com/ropenscilabs/robotstxt#7 / https://github.com/ropenscilabs/robotstxt/commit/90ad735b8c2663367db6a9d5dedbad8df2bc0d23
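A small sketch of the caching behaviour above (requires internet access); the commented force argument for bypassing the cache is an assumption, and the domain is only an example:

```r
library(robotstxt)

system.time(get_robotstxt("github.com"))  # first call hits the network
system.time(get_robotstxt("github.com"))  # repeated call is served from cache

# assumption: a force argument re-downloads even when a cached copy exists
# get_robotstxt("github.com", force = TRUE)
```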
- make explicit, less error prone usage of httr::content(rtxt) https://github.com/ropenscilabs/robotstxt#
- replace usage of missing() for parameter checks with an explicit NULL as the default value for the parameter https://github.com/ropenscilabs/robotstxt#9
- partial match useragent / useragents https://github.com/ropenscilabs/robotstxt#10
- explicit declaration of encoding: encoding="UTF-8" in httr::content() https://github.com/ropenscilabs/robotstxt#11
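The entry above concerns an internal httr call; a minimal stand-alone example of the explicit encoding declaration (the URL is only an example):

```r
library(httr)

resp <- GET("https://github.com/robots.txt")
# explicit encoding avoids httr's encoding guess (and its message)
txt  <- content(resp, as = "text", encoding = "UTF-8")
```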
- first feature complete version on CRAN