Configure Multiple Access Points For Multiple CDX Collections
This document describes the step-by-step configuration of separate access points for individual collections. Every collection is a set of ARC/WARC files that is indexed in CDX files. To save storage space, ARC/WARC files can be compressed, in which case they have the file extension .arc.gz or .warc.gz.
To illustrate the step-by-step configuration, we will use an example with two collections, art and news. Each collection has a couple of .warc.gz files (other supported formats would work as well). Suppose these collections are stored in the following directory structure:
$ tree /archives
/archives
└── collections
    ├── art
    │   ├── art-20140313083412-000.warc.gz
    │   └── art-20140422132637-001.warc.gz
    └── news
        ├── news-20140315112738-000.warc.gz
        └── news-20140418034624-001.warc.gz
Suppose that our Wayback server has the domain name wayback.example.com and we want to set up three access points as follows:
- The /art/ access point only searches in the art collection.
- The /news/ access point only searches in the news collection.
- The /all/ access point searches in all the collections and gives the composite result.
The default Wayback server comes pre-configured to use a BDB index (Berkeley DB), which enables automatic indexing of a small collection and is suitable for a single access point. For large-scale collections with multiple access points, however, manually generated CDX indexes are preferred.
In this case we will need one or more CDX indexes for each collection, along with path indexes. A path index is a simple sorted text file with two TAB-separated columns: the first column contains the ARC/WARC file name and the second column contains the corresponding full path to the file (or the full path including the domain name if the file is on a remote host). A utility called cdx-indexer is shipped with the Wayback download (it can be found in the bin directory) to generate a CDX index from ARC/WARC files. For large collections we might want to write a script that automates CDX generation by calling the shipped cdx-indexer script internally.
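The path index format described above is simple enough to generate with a short script. The following is a minimal Python sketch, not part of the Wayback distribution: the function names and the list of accepted file suffixes are assumptions made for illustration. It walks a collection directory and emits one "file name, TAB, full path" line per archive file, sorted by code point (for ASCII file names this matches the order produced by `LC_ALL=C sort`).

```python
import os

# Assumed set of archive suffixes; adjust to the formats you actually store.
ARCHIVE_SUFFIXES = (".arc", ".arc.gz", ".warc", ".warc.gz")


def path_index_lines(paths):
    """Return sorted 'file name<TAB>full path' lines for the given paths."""
    return sorted(f"{os.path.basename(p)}\t{p}" for p in paths)


def find_archives(root):
    """Yield the absolute path of every ARC/WARC file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(ARCHIVE_SUFFIXES):
                yield os.path.abspath(os.path.join(dirpath, name))


def write_path_index(root, out_path):
    """Write the path index for one collection directory to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in path_index_lines(find_archives(root)):
            out.write(line + "\n")
```

For the example layout, a call such as `write_path_index("/archives/collections/art", "art-path-idx.txt")` would write the path index for the art collection. Keeping the line formatting in its own function makes it easy to reuse for files on a remote host, where the second column would be built differently.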
[TODO: Write a separate guide to describe the CDX generation.]
Suppose that we have generated one CDX file and one path index file for the art collection, and similarly for the news collection. There can be more than one CDX file per collection, but for the sake of simplicity we keep one CDX file per collection. We have also created an additional path index file that lists the files and paths of both collections (it can be created by merging the two path index files and sorting the result). Suppose that our archives directory now has the following structure:
$ tree /archives
/archives
├── collections
│   ├── art
│   │   ├── art-20140313083412-000.warc.gz
│   │   └── art-20140422132637-001.warc.gz
│   └── news
│       ├── news-20140315112738-000.warc.gz
│       └── news-20140418034624-001.warc.gz
├── cdx-idx
│   ├── index-art.cdx
│   └── index-news.cdx
└── path-idx
    ├── art-path-idx.txt
    ├── news-path-idx.txt
    └── all-path-idx.txt
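The combined all-path-idx.txt listed above is the sorted merge of the two per-collection path indexes. A minimal Python sketch of that merge follows. The function names are hypothetical, and dropping duplicate lines is an assumption (it has no effect when the collections share no files, as in this example).

```python
def merge_path_index_lines(*indexes):
    """Merge several iterables of path index lines into one sorted list.

    Blank lines are skipped and exact duplicates are dropped (an assumption;
    the two example collections have no files in common anyway).
    """
    merged = set()
    for lines in indexes:
        for line in lines:
            line = line.rstrip("\r\n")
            if line:
                merged.add(line)
    return sorted(merged)


def merge_path_index_files(in_paths, out_path):
    """Read the path index files in in_paths and write the merged index."""
    indexes = []
    for path in in_paths:
        with open(path, encoding="utf-8") as handle:
            indexes.append(handle.readlines())
    with open(out_path, "w", encoding="utf-8") as out:
        for line in merge_path_index_lines(*indexes):
            out.write(line + "\n")
```

With the layout above, `merge_path_index_files(["/archives/path-idx/art-path-idx.txt", "/archives/path-idx/news-path-idx.txt"], "/archives/path-idx/all-path-idx.txt")` would regenerate the combined index whenever either collection changes.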
First of all we need to install Apache Tomcat, if it is not already installed. Once Tomcat is up and running, it will have a webapps directory. In our case it is located at /var/lib/tomcat7/webapps, but the location may differ depending on how Tomcat is configured on your machine. Next we need to obtain the latest copy of OpenWayback and install it. Please refer to the How to Install guide for further details. In this setup we will assume that you have installed Wayback as the ROOT application. You can choose to name it anything else, but the configuration is easier for the ROOT application.
Now we will focus on the configuration files available in the WEB-INF directory of the Wayback application. Let's have a look at the default wayback.xml file. Comments and unnecessary commented-out blocks have been removed to reduce the number of lines:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd"
       default-init-method="init">
  <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="properties">
      <value>
        wayback.basedir=/tmp/wayback
        wayback.urlprefix=http://localhost:8080/wayback/
      </value>
    </property>
  </bean>
  <bean id="waybackCanonicalizer" class="org.archive.wayback.util.url.AggressiveUrlCanonicalizer" />
  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB">
    <property name="bdbPath" value="${wayback.basedir}/file-db/db/" />
    <property name="bdbName" value="DB1" />
    <property name="logPath" value="${wayback.basedir}/file-db/db.log" />
  </bean>
  <!--
  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
    <property name="path" value="${wayback.basedir}/path-index.txt" />
  </bean>
  -->
  <import resource="BDBCollection.xml"/>
  <!--
  <import resource="CDXCollection.xml"/>
  <import resource="RemoteCollection.xml"/>
  <import resource="NutchCollection.xml"/>
  -->
  <import resource="ArchivalUrlReplay.xml"/>
  <bean name="+" class="org.archive.wayback.webapp.ServerRelativeArchivalRedirect">
    <property name="matchPort" value="8080" />
    <property name="useCollection" value="true" />
  </bean>
  <bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
    <property name="accessPointPath" value="http://localhost:8080/wayback/"/>
    <property name="internalPort" value="8080"/>
    <property name="serveStatic" value="true" />
    <property name="bounceToReplayPrefix" value="false" />
    <property name="bounceToQueryPrefix" value="false" />
    <property name="replayPrefix" value="${wayback.urlprefix}" />
    <property name="queryPrefix" value="${wayback.urlprefix}" />
    <property name="staticPrefix" value="${wayback.urlprefix}" />
    <property name="collection" ref="localbdbcollection" />
    <!--
    <property name="collection" ref="localcdxcollection" />
    -->
    <property name="replay" ref="archivalurlreplay" />
    <property name="query">
      <bean class="org.archive.wayback.query.Renderer">
        <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
      </bean>
    </property>
    <property name="uriConverter">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
        <property name="replayURIPrefix" value="${wayback.urlprefix}"/>
      </bean>
    </property>
    <property name="parser">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
        <property name="maxRecords" value="10000" />
      </bean>
    </property>
  </bean>
</beans>
[A work in progress...]
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git