- Developed and used for basic automation of continuous Heritrix jobs from 2020/05 until 2022, with Heritrix instances on dedicated virtual machines / containers in Webarchive CZ.
- Runs on a single VM/container as a single instance
- Dedicated heritrix user
- Accessible cron for dedicated user
- Installation of Java
- Installation of Heritrix version 3
- Setting up all system links / FS shares during deployment of the crawling server, and installation of the Heritrix engine itself, are not covered by this readme
- For more complex jobs, DB templates, AMQP messaging, and asynchronous event processing should be used instead
- Continuous suite is composed of several scripts and specialised functions for passive monitoring and simple running of high-intensity continuous crawls
- Consists of:
- project directory - Specific continuous dir
- stat-checks - Directory for statistics sample logs
- runtime-logs - Directory for runtime logs
- central directory - General dir for all projects
- archival directory - structure not included here; it depends on each case and the overall archival strategy, as does the FS structure
- project directory - Specific continuous dir
- pull and deploy
```sh
cd <Crawler-config-dir>
git clone https://github.com/WebarchivCZ/continuous-suite.git
cd continuous-suite
mv project-dir <Project-dir>
```
- create a new cron record
```sh
heritrix@crawler:~$ crontab -e
6 4,9,13,17,21 * * * <project_dir>/continuous-suite.sh >> <project_dir>/runtime-logs/<project_name>-cs-`date +\%Y\%m`.log 2>&1
```
- Before running, it is necessary to configure paths and variables (see below, in 2. Customisations)
- main project directory and settings for each continuous project
- create a separate directory for each instance of a continuous crawl
- necessary to set up the project path
- installation
- by deploying project-dir for the concrete continuous project type
```sh
cd <Project-dir>
ls
continuous-suite.sh          # Main running script, flow of events
settings-continuous.cfg.sh   # Main settings, sourced by the main script
seeds.txt                    # Actual seeds - can be customised, e.g. tsv import
stat-checks                  # Directory for aggregated crawl sample statistics in tsv format, e.g. Continuous-Cov19-2023-02-01-Cov19.techlog.tsv
runtime-logs                 # Logs of cron runtime
```
- main settings for the crawler, the crawl, and continuous-suite
- important - fill in with actual paths and dates
- specification - typ - key identifier composed as Type_ProjectName; expected to be the same as the project directory name
- logins and passwords are set up in a local .sh, not here
- settings categories
- Crawler variables and paths
- Crawl Metadata
- Dates (automatic)
- Project
- Paths
- Organizational
- Crawl Quantitative - Dynamic crawler values
- Seeds source - TSV
- Stats
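- a minimal sketch of how settings-continuous.cfg.sh may look, following the categories above; the variable names mirror the %...% placeholders of crawler-beans.template documented below, and all concrete values and paths are illustrative assumptions only:
```sh
# Illustrative sketch only - not the shipped settings file. Variable names
# mirror the %...% placeholders of crawler-beans.template; values are examples.

## Crawler variables and paths
ADDRESS="X.X.X.X"                 # crawler IP address
PORT="8443"                       # Heritrix web UI port

## Dates (automatic)
DNES=$(date +%Y-%m-%d)            # "dnes" = today
YEAR=$(date +%Y)

## Project
TYP="Continuous-Cov19"            # Type_ProjectName, same as the project directory
TYP_LC="${TYP,,}"                 # lowercase variant (bash 4+)
SHORT_N="Cov19"

## Paths (assumed locations)
CORE_STORE="/mnt/archives"        # WARC store root
CRAWLER_JOBS="/opt/heritrix/jobs" # Heritrix jobs directory

## Crawl Quantitative - dynamic crawler values (example numbers)
MAX_TOETHREADS="150"
BALANCE_REPLENISHAM="3000"
MAX_TIMESECONDS="10800"
POOL_MAXACTIVE="20"
```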
- main script with the event flow, crawler set-up, and app logic
- uses other supporting scripts:
- project dir
- settings-continuous.cfg.sh
- crawler-beans.template
- central dir
- start-crawler.sh
- stat-checker.sh
- archive dir
- archive_logs.sh
- on first deploy:
- change the target for sourcing settings-continuous.cfg.sh to match the project path:
```sh
source <Crawler-config-dir>/continuous-suite/<Project-dir>/settings-continuous.cfg.sh
```
- structure:
- 1. Initiation of variables
- 2. Function definitions
  - helper functions
- 3. Seeds reactualization
  - reactualization of seeds, optional
- 4.A Crawl Initiation - set-up and deploy
  - deploys crawler-beans.cxml with the actual variables (see the substitution sketch under the template settings below)
- 4.B Crawl Initiation - crawler initiation
  - restart of the Heritrix crawler
- 5. Crawl - Basic Event Flow
- 6. Archiving and operational logs cleaning
- the actual crawl flow follows the basic sequential crawl flow (see the sketch below):
  - Crawl Initiation
  - Crawl Launch
  - Crawl Unpausing - depends on the crawler settings
  - Crawl Running
  - Crawl Termination
  - Crawl Teardown
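- a hedged sketch of this sequence driven over the standard Heritrix 3 REST API with curl; the variables come from settings-continuous.cfg.sh as sketched above, and the sleep intervals are illustrative:
```sh
# Sketch only: drive the sequential crawl flow via the Heritrix 3 engine API.
JOB_URL="https://${ADDRESS}:${PORT}/engine/job/${TYP}"

# Crawl Initiation, Launch, Unpausing
for ACTION in build launch unpause; do
    curl -s -k -u "${login}:${pass}" --digest -d "action=${ACTION}" "${JOB_URL}"
    sleep 10    # give the engine time to reach the next state
done

# Crawl Running - stat-checker.sh samples the job page in the meantime

# Crawl Termination and Teardown
for ACTION in terminate teardown; do
    curl -s -k -u "${login}:${pass}" --digest -d "action=${ACTION}" "${JOB_URL}"
    sleep 10
done
```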
- supporting template file for crawl
- on deploy:
- necessary credentials - either include your own or comment them out
  - set up credentialStore for the domains facebook.com, twitter.com, etc.
- customize local beans (ad hoc)
  - rejectLocalCalendars.regexList
  - surtPrefixes
  - rejectLocalCalendars-sheet
  - rejectLocalTraps-sheet
  - SurtPrefixesSheetAssociation
- structure of the main settings and their template (change only ad hoc, per project / project type / structure)
- basic settings
- metadata.jobName=%TYP% %DNES%-%SHORT_N%
- metadata.operator=%ACTUAL_OPERATOR%
- metadata.description=%M_COMMENT%
- warcWriter.prefix=%TYP%-%DNES%-%SHORT_N%_%CRAWLER_HOST%-
- warcWriter.storePaths=%CORE_STORE%/%TYP_LC%/%TYP%-%DNES%-%SHORT_N%
- duplication reduction and ops
- historyBdb.dir=%CRAWLER_JOBS%/history/%YEAR%-history-state-year
- bdb.dir=%CRAWLER_JOBS%/states/%TYP_LC%/%YEAR%
- crawlController.scratchDir=%CRAWLER_JOBS%/scratchx%TYP_LC%
- limits and performance settings
- frontier.balanceReplenishAmount=%BALANCE_REPLENISHAM%
- crawlController.maxToeThreads=%MAX_TOETHREADS%
- crawlLimiter.maxTimeSeconds=%MAX_TIMESECONDS%
- #tooManyHopsDecideRule.maxHops=%MAX_HOPS%
- #transclusionDecideRules.maxTransHops=%MAX_TRANSHOPS%
- #transclusionDecideRules.maxSpeculativeHops=%MAX_SPECHOPS%
- #scope.maxHops=%MAX_HOPS%
- #scope.maxTransHops=%MAX_TRANSHOPS%
- #scope.maxSpeculativeHops=%MAX_SPECHOPS%
- warcWriter.poolMaxActive=%POOL_MAXACTIVE%
- metadata
- metadata.operatorContactUrl=%OPER_WEB%
- metadata.operatorFrom=%OPER_MAIL%
- metadata.organization=%OPER_ORGANIZATIONFULL%
- metadata.audience=%OPER_AUDIENCE%
- metadata.userAgentTemplate=%OPER_TEMPLATE%
- storePaths
- %CORE_STORE%/%TYP_LC%/%TYP%-%DNES%-%SHORT_N%
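- a hedged sketch (assumed, not the suite's actual code) of how step 4.A can expand the %...% placeholders above into a concrete crawler-beans.cxml on deploy:
```sh
# Sketch only: substitute settings-continuous.cfg.sh values into the template.
source "<Project-dir>/settings-continuous.cfg.sh"
sed -e "s|%TYP%|${TYP}|g" \
    -e "s|%TYP_LC%|${TYP_LC}|g" \
    -e "s|%DNES%|${DNES}|g" \
    -e "s|%SHORT_N%|${SHORT_N}|g" \
    -e "s|%YEAR%|${YEAR}|g" \
    -e "s|%CORE_STORE%|${CORE_STORE}|g" \
    -e "s|%CRAWLER_JOBS%|${CRAWLER_JOBS}|g" \
    -e "s|%MAX_TOETHREADS%|${MAX_TOETHREADS}|g" \
    crawler-beans.template > "${CRAWLER_JOBS}/${TYP}/crawler-beans.cxml"
```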
- contains scripts for central, multipurpose use, not only for continuous-suite
```sh
/opt/heritrix
start-crawler.sh   # Crawler starting script
stat-checker.sh    # Statistics checking script
```
- creates statistics logs (".techlog.tsv")
- usage:
- automatically from continuous-suite.sh
- manual:
```sh
<central-dir-path>/stat-checker.sh <runtime_seconds> Continuous-Cov19-YYYY-MM-DD-Cov19 <ip_address> <port>
```
- set up variables
- type, ip address, file name, and runtime in seconds are taken from the main settings (settings-continuous.cfg.sh)
```sh
runtime="${1} seconds"   # runtime in seconds
TYP=${2}                 # "Continuous-X"
FName=${3}               # Path+Filename
ADDRESS=${4}             # IP address X.X.X.X
PORT=${5}                # port
```
- login and password need to be set up locally
```sh
login="<actual-login>"
pass="<actual-password>"
```
- starts the crawler with the defined parameters
- on deploy:
- change <actual-crawler-login>, <actual-crawler-password>
- usage:
- automatically from continuous-suite.sh
- manual:
```sh
start-crawler.sh <crawler_ip> <port> <java-XmxInMB> <java-XmsInMB> <HERITRIX_HOME> <JAVA_HOME>
```
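- a hedged sketch of what start-crawler.sh presumably does with these arguments; CRAWLER_LOGIN / CRAWLER_PASSWORD stand for the credentials changed on deploy:
```sh
# Sketch only: map the positional arguments onto a Heritrix 3 start-up.
CRAWLER_IP=${1}; PORT=${2}; XMX_MB=${3}; XMS_MB=${4}
HERITRIX_HOME=${5}; JAVA_HOME=${6}
export JAVA_HOME
export JAVA_OPTS="-Xmx${XMX_MB}m -Xms${XMS_MB}m"   # heap sizes in MB

# bin/heritrix: -a sets the admin login:password, -b the bind address,
# -p the web UI port
"${HERITRIX_HOME}/bin/heritrix" \
    -a "${CRAWLER_LOGIN}:${CRAWLER_PASSWORD}" \
    -b "${CRAWLER_IP}" \
    -p "${PORT}"
```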
- recommended structure for logging: /mnt/archives/<year>/<continuous-name>, including a script from the toolbox for storing logs
- external script, created by R. Kreibich and updated by P. Habetinova
- added for inspiration
- missing from the scope of this project
- it depends on the indexation processes of the project and the archive's needs
- basic process-related structure with datetime
- samples at the configured frequency; should cover the main aspects of the crawl runtime
| date | elapsedMilliseconds | lastReachedState | novel | dupByHash | warcNovelContentBytes | warcNovelUrls | activeQueues | snoozedQueues | exhaustedQueues | busyThreads | congestionRatio | currentKiBPerSec | currentDocsPerSecond | usedBytes | alertCount |
| ----- | ---------- | ---- | ---------- | ---------- | ---------- | ------ | ---- | ---- | ---- | ---- | --- | --- | ---- | --- | ---- |
| 09:09:32 | 33102 | PAUSE | 0 | 0 | | | 0 | 0 | 0 | 0 | | 0 | 0 | 1293253128 | 0 |
| 09:10:34 | 93431 | RUN | 60365985 | 88987568 | 60366378 | 1392 | 492 | 229 | 120 | 150 | 12.997361 | 2124 | 35 | 1463790776 | 0 |
| 09:11:38 | 155843 | RUN | 169967391 | 286983358 | 169967914 | 2711 | 596 | 396 | 194 | 150 | 10.399267 | 5432 | 98 | 1384826816 | 0 |
| 09:12:40 | 218913 | RUN | 318117678 | 623741404 | 318118201 | 4458 | 712 | 542 | 295 | 150 | 9.520231 | 7712 | 102 | 3413708416 | 0 |
- accessible in the archived crawl dirs, created by Heritrix as standard
Continuous suite is free software; you can redistribute it and/or modify it under the terms of the GNU GPL 3, with the exception of secrets generally and the path and date customisations in settings-continuous.cfg.sh.