Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 0.8.1
What's Changed
- Logging and Behavior Tweaks by @ikreymer in #229
- Fix typos by @stavares843 in #232
- Add crawl log to WACZ by @ikreymer in #231
New Contributors
- @stavares843 made their first contribution in #232
Full Changelog: 0.8.0...0.8.1
Browsertrix Crawler 0.8.0
What's Changed
- Switch to Chrome/Chromium 109 in #184
- Convert to ESM module in #184
- Add ad blocking via request interception (#173)
- new setting: add support for specifying language via the --lang flag by @ikreymer in #186
- Add screenshot functionality by @tw4l in #188
- Remove dead pywb configuration by @edsu in #198
- Use VNC for headful profile creation by @ikreymer in #197
- arg parsing fix: by @ikreymer in #200
- Improve crawler logging by @tw4l in #195
- Add requests[socks] python dependency by @kuechensofa in #201
- Add RedisCrawlState test by @tw4l in #208
- crawl state: add getPendingList() to return pending state from either… by @ikreymer in #205
- Serialize Redis pending pages as JSON objects by @tw4l in #212
- behaviors: don't run behaviors in iframes that are about:blank or are… by @ikreymer in #211
- Bump to Chrome 109, Beta 0.8.0-beta.1 Release by @ikreymer in #215
- Fix --overwrite CLI flag by @tw4l in #220
- deps: bump pywb to 2.7.3 by @ikreymer in #222
- update behaviors to 0.4.1, rename 'Behavior line' -> 'Behavior log' by @ikreymer in #223
New Contributors
- @kuechensofa made their first contribution in #201
Full Changelog: 0.7.1...0.8.0
Browsertrix Crawler 0.8.0 Beta 1
What's Changed
- Improve crawler logging by @tw4l in #195
- Add requests[socks] python dependency by @kuechensofa in #201
- Add RedisCrawlState test by @tw4l in #208
- crawl state: add getPendingList() to return pending state from either… by @ikreymer in #205
- Serialize Redis pending pages as JSON objects by @tw4l in #212
- behaviors: don't run behaviors in iframes that are about:blank or are… by @ikreymer in #211
- Bump to Chrome 109, Beta 0.8.0-beta.1 Release by @ikreymer in #215
New Contributors
- @kuechensofa made their first contribution in #201
Full Changelog: 0.8.0-beta.0...0.8.0-beta.1
Browsertrix Crawler 0.8.0 Beta 0
Key Features
- Switch to Chrome/Chromium 105
- Convert to ESM module
- Add ad blocking via request interception (#173)
- Support for setting browser language (#186)
- Screenshot functionality with different options: current view, full page, and thumbnail (#188)
- Switch to VNC for interactive profile creation, which is now the default; automated creation is available via --automated
What's Changed
- Dev/0.8.0 by @ikreymer in #184
- new setting: add support for specifying language via the --lang flag by @ikreymer in #186
- Add screenshot functionality by @tw4l in #188
- Remove dead pywb configuration by @edsu in #198
- Use VNC for headful profile creation by @ikreymer in #197
- arg parsing fix: by @ikreymer in #200
Full Changelog: 0.7.1...0.8.0-beta.0
Browsertrix Crawler 0.7.1
Browsertrix Crawler 0.7.0
What's Changed
- Update to Chrome/Chromium 101 - (0.7.0 Beta 0) by @ikreymer in #144
- Add --netIdleWait, bump dependencies (0.7.0-beta.2) by @ikreymer in #145
- Update README.md by @atomotic in #147
- Wait Default + Logging Improvements by @ikreymer in #153
- Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements by @ikreymer in #157
- Logging and browser improvements: by @ikreymer in #158
- pending wait: set max pending request wait to 120 seconds by @ikreymer in #161
- Default Wait-Time Improvements by @ikreymer in #162
- Interrupt Handling Fixes by @ikreymer in #167
- Run in Docker as User by @edsu in #171
New Contributors
Full Changelog: 0.6.0...0.7.0
Browsertrix Crawler 0.7.0 Beta 5
What's Changed
- Interrupt Handling Fixes by @ikreymer in #167
- Update to Browsertrix Behaviors 0.3.4 - Fix for lazy-loaded images #165
Full Changelog: 0.7.0-beta.4...0.7.0-beta.5
Browsertrix Crawler 0.7.0 Beta 4
Fixes related to wait times, including:
- netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in default behavior as well
- restart: if the crawl is already done, don't attempt to crawl further. If 'waitOnDone' is set, wait for a signal before exiting.
- bump to puppeteer-core 17.1.2
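The netIdleWait default described above can be read as a small fallback rule. The following is only an illustrative sketch of that rule, not the crawler's actual code: the function name `resolve_net_idle_wait`, its parameters, and the `"prefix"` scope used in the usage lines are assumptions made for this example; only the 15-second and 2-second values and the page/page-spa scope names come from the notes.

```python
from typing import Optional

# Scope types that get the longer default, per the release note.
PAGE_SCOPES = ("page", "page-spa")


def resolve_net_idle_wait(net_idle_wait: Optional[int], scope_type: str) -> int:
    """Return the network-idle wait in seconds (hypothetical helper).

    An explicitly set value always wins. Otherwise the default is
    15 seconds for page/page-spa scopes and 2 seconds for any other scope.
    """
    if net_idle_wait is not None:
        return net_idle_wait
    if scope_type in PAGE_SCOPES:
        return 15
    return 2


# Usage: unset values fall back by scope, explicit values are kept.
single_page_wait = resolve_net_idle_wait(None, "page")
other_scope_wait = resolve_net_idle_wait(None, "prefix")  # "prefix" is an example scope
explicit_wait = resolve_net_idle_wait(30, "page")
```

Under this reading, a single-page crawl waits longer for the network to settle because there is only one page to capture, while multi-page scopes keep the per-page wait short.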
Full Changelog: 0.7.0-beta.3...0.7.0-beta.4
Browsertrix Crawler 0.7.0 Beta 3
What's Changed
- Overhaul of page concurrency system: better detection of windows that are stuck, only reuse same window for every 25 pages, #157
- Logging improvements: pywb.log written with --logging pywb, JS errors logged with --logging jserrors #158
- Avoid getting stuck on pending requests at end of crawl: #161
- Update to Browsertrix Behaviors 0.3.3: Better Crawling of twitter and autoplay of videos
- Update to pywb 2.6.8: Includes better rewriting of embedded twitter videos.
Full Changelog: 0.7.0-beta.2...0.7.0-beta.3
Browsertrix Crawler 0.7.0 Beta 2
Fixes include:
- Default --waitUntil set to load instead of load,networkidle2, due to occasional hanging waiting for both
- Add --netIdleWait to specify wait for network idle after load (defaults to 10 seconds)
- Update to puppeteer 16.1.0
- Logging: if pywb logging is enabled, write logs to collection dir ./logs/pywb.log and ./logs/redis.log
- Logging: reduce logging by not printing duplicate behavior status logs
- pywb/openssl: allow 'unsafe legacy renegotiation' to avoid errors capturing sites that use older ssl
Full Changelog: 0.7.0-beta.1...0.7.0-beta.2