Packaged utilities
Sawood Alam edited this page May 11, 2017
When OpenWayback is built from source with `mvn package`, the build includes executable scripts that are useful for tasks such as indexing.
Below is a list of the packaged utility scripts (also available in the Docker image).
$ bdb-client
Usage: DBPATH DBNAME -w
Read lines from STDIN, inserting into BDBJE at
DBPATH named DBNAME, creating DB if needed.
Usage: DBPATH DBNAME -r [PREFIX]
Dump lines from BDBJE at path DBPATH named DBNAME
to STDOUT. If PREFIX is specified, only output records
beginning with PREFIX, otherwise output all records
$ bin-search
Usage: PREFIX FILE1 [FILE2] ...
$ cdx-indexer
USAGE:
cdx-indexer [-format FORMAT|-identity] FILE
cdx-indexer [-format FORMAT|-identity] FILE CDXFILE
Create a CDX format index from ARC or WARC file
FILE at CDXFILE or to STDOUT.
With -identity, perform no url canonicalization.
With -format, output CDX in format FORMAT.
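As a rough sketch of the two-argument form above, the following shell function indexes every `.warc.gz` file in a directory into a matching `.cdx` file. The function name `index_warcs`, the `CDX_INDEXER` override variable, and the directory layout are assumptions made for illustration; only the `cdx-indexer FILE CDXFILE` invocation comes from the usage text.

```shell
#!/bin/sh
# Hypothetical helper: index each WARC in SRC_DIR into OUT_DIR/<name>.cdx.
# CDX_INDEXER is an assumed override so the command path can be changed.
CDX_INDEXER="${CDX_INDEXER:-cdx-indexer}"

index_warcs() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for warc in "$src_dir"/*.warc.gz; do
        # Skip the literal pattern when no file matches.
        [ -e "$warc" ] || continue
        name=$(basename "$warc" .warc.gz)
        # Two-argument form from the usage text: FILE, then CDXFILE.
        "$CDX_INDEXER" "$warc" "$out_dir/$name.cdx" || return 1
    done
}
```

With the scripts on `PATH`, a call such as `index_warcs /data/warcs /data/cdx` (both paths are placeholders) would leave one CDX file per WARC in the output directory.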
$ cdx-sample
Need path to CDX argument 1
USAGE: ./cdx-sample PATH NUM
Create a split file for use with Wayback hadoop indexing code on STDOUT.
Finds approximate offsets at host boundaries for file at PATH, producing
a split file with NUM parts, which indicates the number of reduce tasks.
$ create-test-arc
USAGE: srcDir tgtDir [arc_prefix]
$ location-client
USAGE:
[lookup|add|remove|sync] ...
lookup LOCATION-DB-URL ARC
emit all known URLs for arc ARC
add LOCATION-DB-URL ARC URL
inform locationDB that ARC is located at URL
remove LOCATION-DB-URL ARC URL
remove reference to ARC at URL in locationDB
sync LOCATION-DB-URL DIR DIR-URL
scan directory DIR, and submit all ARC files therein
to locationDB at url DIR-URL/ARC
get-mark LOCATION-DB-URL
emit an identifier for the current marker in the
locationDB log. These identifiers can be used with the
mark-range operation.
mark-range LOCATION-DB-URL START END
emit to STDOUT one line with the name of all ARC files
added to the locationDB between marks START and END
add-stream LOCATION-DB-URL
read lines from STDIN formatted like:
NAME<SPACE>URL
and for each line, inform locationDB that file NAME is
located at URL
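The `add-stream` input format can be produced from a directory listing. The sketch below prints one `NAME<SPACE>URL` line per archive file; the function name `emit_location_lines`, the file extensions it matches, and the example URL are assumptions for illustration, not part of OpenWayback.

```shell
#!/bin/sh
# Hypothetical helper: print one "NAME URL" line per archive file in DIR,
# where URL is BASE_URL/NAME, matching the add-stream input format.
emit_location_lines() {
    dir=$1
    base_url=${2%/}   # drop one trailing slash, if present
    for path in "$dir"/*.arc.gz "$dir"/*.warc.gz; do
        # Skip the literal pattern when no file matches.
        [ -e "$path" ] || continue
        name=$(basename "$path")
        printf '%s %s/%s\n' "$name" "$base_url" "$name"
    done
}

# Assumed usage (LOCATION-DB-URL is the placeholder from the usage text):
#   emit_location_lines /data/arcs http://example.org/arcs \
#       | location-client add-stream LOCATION-DB-URL
```

Because the format separates the name from the URL with a single space, this sketch is only suitable for file names that contain no spaces.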
$ url-client
$ warc-header
USAGE: tgtWarc fieldsSrc id
tgtWarc is the path to the target WARC.gz
fieldsSrc is the path to the text of the record
make sure each line is terminated by \r\n
and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
of the header record... header...
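The line-ending requirements for `fieldsSrc` are easy to get wrong with an ordinary text editor. A minimal sketch of one way to prepare such a file follows; the function name `write_fields_file` and the sample field in the usage comment are hypothetical, and the only behavior taken from the usage text is that each line ends in `\r\n` and the file closes with a blank `\r\n`-terminated line.

```shell
#!/bin/sh
# Hypothetical helper: read field lines from STDIN and write them to the file
# named by $1, each terminated by \r\n, followed by one blank \r\n line.
write_fields_file() {
    out=$1
    cr=$(printf '\r')
    : > "$out" || return 1
    while IFS= read -r line || [ -n "$line" ]; do
        # Strip a trailing carriage return so CRLF input is not doubled.
        line=${line%"$cr"}
        # Drop empty input lines; the only blank line is the final one.
        [ -n "$line" ] || continue
        printf '%s\r\n' "$line" >> "$out"
    done
    printf '\r\n' >> "$out"
}

# Assumed usage, following the "tgtWarc fieldsSrc id" argument order:
#   printf 'software: example\n' | write_fields_file fields.txt
#   warc-header target.warc.gz fields.txt my-id
```

Writing the terminators with `printf` keeps the result independent of the platform's default line endings.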
$ zipline-manifest
Usage: ZIPLINES_PATH
$ zl-bin-search
USAGE:
zl-bin-search [-format FORMAT] [-max MAX_BLOCKS] SUMMARY LOCATION KEY
Search a ziplined compressed CDX format index for key
KEY to STDOUT. SUMMARY and LOCATION are paths to the
block summary and file location files.
With -format, output CDX in format FORMAT.
With -max, limit search at most MAX_BLOCKS blocks.
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git