Tagging design ideas

Archive-It givens:

Each WARC belongs to one collection
Each collection belongs to one organization
Each collection can be public or private, and users can toggle this at will; the setting is stored in a Postgres database
There are 10k+ collections and 1k+ organizations
The collection ID is baked into the filename (but not the organization ID)

Archive-It wants:

We want to be able to use a simple CDX server (OutbackCDX)
We want to be able to filter a query based on both the collection ID and organization ID
We want to be able to query all public organizations

Archive-It currently does:

Query all CDX entries for a given URL
Filter based on WARC name for organizations and collections:
- For collections, filter on WARC filename, which contains collection ID
- For organizations, query the DB for every collection the organization owns and build a regex a la ARCHIVEIT-(coll1|coll2|...)
- For private collections, rely on set of rules targeted at collection ID
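
The organization case above can be sketched roughly as follows. This is only an illustration, not Archive-It's actual code: the class and method names are hypothetical, the collection IDs are assumed to have already been fetched from the database, and the trailing `-` after the group is an assumption added so that collection 12 does not also match collection 123.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; names are illustrative, not part of wayback or OutbackCDX.
class OrgFilenameFilter {
    // Build ARCHIVEIT-(coll1|coll2|...)- from the collection IDs an organization owns.
    static Pattern forCollections(List<Long> collectionIds) {
        String alternation = collectionIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("|"));
        return Pattern.compile("ARCHIVEIT-(" + alternation + ")-");
    }

    // A capture belongs to the organization if its WARC filename starts with the prefix.
    static boolean matches(Pattern orgPattern, String warcFilename) {
        return orgPattern.matcher(warcFilename).lookingAt();
    }

    public static void main(String[] args) {
        Pattern p = forCollections(List.of(12L, 345L));
        System.out.println(matches(p, "ARCHIVEIT-12-0001.warc.gz"));  // true
        System.out.println(matches(p, "ARCHIVEIT-123-0001.warc.gz")); // false
    }
}
```

The filenames in `main` are made up for the example. Note that the pattern grows linearly with the number of collections an organization owns, which is part of the motivation for the options below.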

Options

~~Extend OutbackCDX with tagging through column families~~ OutbackCDX is already using those.
~~Triplicate data: one index with all data; one index per collection; one index per organization~~ Unlikely to work with RocksDB, given a) the number of files involved, and b) how much RocksDB keeps in memory for each index.
Implement a tagging index
Continue filtering on filename
- Filter as now: fetch all records and filter in wayback
- Implement a filter QS arg (e.g /cdx?url=....&filter=filename:ARCHIVEIT-collID)
Plugin that can talk to Postgres

Tagging index

| Key | Value |
| --- | --- |
| OutbackCDX Record Key | space-separated list of collection and organization (prefix org-) IDs? |

Might be possible to have multiple tagging indices (e.g: one for collections, one for orgs).

As an implementation detail, we would need to find a way to name these indices so that they are not mistaken for the databases of existing indices (_tags/coll or something?)
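
To make the proposed value format concrete, here is a minimal sketch of encoding and decoding it. Everything here is an assumption for illustration: `TagValue` is not an existing OutbackCDX class, collection IDs are written bare, and organization IDs carry the `org-` prefix suggested in the table.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical codec for the proposed tag value; not an existing OutbackCDX class.
class TagValue {
    // Encode one collection ID and its organization ID as "<collId> org-<orgId>".
    static String encode(long collId, long orgId) {
        return collId + " org-" + orgId;
    }

    // Split a stored value back into its individual tags, ignoring extra whitespace.
    static Set<String> decode(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String tag : value.trim().split("\\s+")) {
            if (!tag.isEmpty()) tags.add(tag);
        }
        return tags;
    }

    static boolean hasCollection(String value, long collId) {
        return decode(value).contains(String.valueOf(collId));
    }

    static boolean hasOrganization(String value, long orgId) {
        return decode(value).contains("org-" + orgId);
    }

    public static void main(String[] args) {
        String value = encode(12, 3);
        System.out.println(value);                     // 12 org-3
        System.out.println(hasOrganization(value, 3)); // true
        System.out.println(hasCollection(value, 3));   // false
    }
}
```

The prefix is what keeps collection 3 and organization 3 from colliding in a single index. With separate indices per tag type, the prefix would no longer be needed.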

Will also depend on the following.

Filter QS arg

OutbackCDX ~~could grow~~ has grown a query-string argument, filter, that accepts one (or more?†) key/value pair(s) to run the query results through on the way out. This could cover the current mechanism of filtering on the WARC name, filter=filename:ARCHIVEIT-(coll1|coll2|...)-.*, or, to go with the tagging index above, filter=tag.coll:coll1 or filter=tag.org:org1.

† The current implementation is a single filter. It can and should be fixed to accept multiple. - Alex
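
As a rough sketch of what accepting multiple filters could look like, the following parses `field:regex` pairs and combines them. This is not OutbackCDX's implementation: `FilterArgs` is a hypothetical name, a capture is modelled as a plain map of field name to value, the regex is assumed to match the whole field (in line with the trailing `.*` in the example above), and combining filters with AND is an assumption too.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; a capture is modelled as a map of field name to value.
class FilterArgs {
    // Parse one "field:regex" pair, splitting on the first colon only.
    static Predicate<Map<String, String>> parse(String spec) {
        int colon = spec.indexOf(':');
        if (colon < 0) throw new IllegalArgumentException("Expected field:regex, got " + spec);
        String field = spec.substring(0, colon);
        Pattern pattern = Pattern.compile(spec.substring(colon + 1));
        return capture -> {
            String value = capture.get(field);
            // Assumption: a missing field never matches, and the regex must match the whole value.
            return value != null && pattern.matcher(value).matches();
        };
    }

    // Combine several filter arguments; a capture must pass all of them.
    static Predicate<Map<String, String>> parseAll(List<String> specs) {
        Predicate<Map<String, String>> combined = capture -> true;
        for (String spec : specs) combined = combined.and(parse(spec));
        return combined;
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> f =
                parseAll(List.of("filename:ARCHIVEIT-(12|345)-.*", "tag.org:org1"));
        System.out.println(f.test(Map.of("filename", "ARCHIVEIT-12-0001.warc.gz", "tag.org", "org1"))); // true
        System.out.println(f.test(Map.of("filename", "ARCHIVEIT-12-0001.warc.gz")));                    // false
    }
}
```

Splitting on the first colon only matters because the regex itself may contain colons.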

Handling the `all` access point

If we go with a tagging index, how do we handle the all (all public collections) access point?

Add a public (or private) tag and reindex when a collection is toggled between private and public.
Wayback generates a very large tag:coll1|coll2|... filter for all private or public collection ids (costs ~1ms for 10k ids, ~10ms for 100k, ~100ms for 1m).
Add an API to OutbackCDX for registering named access points with a given set of tags (or regex filters?).
Plugin (see below)
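
The named access point option could look something like the sketch below. It is purely illustrative: `AccessPointRegistry` and its methods are hypothetical names for the API proposed above, tags are plain strings, and a capture is assumed to be visible if it carries at least one tag registered for the access point.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry for the proposed API; names are illustrative only.
class AccessPointRegistry {
    private final Map<String, Set<String>> tagsByAccessPoint = new ConcurrentHashMap<>();

    // Register or replace the set of tags visible through an access point.
    void register(String accessPoint, Set<String> tags) {
        tagsByAccessPoint.put(accessPoint, Set.copyOf(tags));
    }

    // A capture is visible if at least one of its tags is registered for the access point.
    boolean isVisible(String accessPoint, Set<String> captureTags) {
        Set<String> allowed = tagsByAccessPoint.getOrDefault(accessPoint, Set.of());
        for (String tag : captureTags) {
            if (allowed.contains(tag)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AccessPointRegistry registry = new AccessPointRegistry();
        registry.register("all", Set.of("12", "345"));
        System.out.println(registry.isVisible("all", Set.of("12"))); // true
        // Collection 12 is toggled to private: re-register without it, no reindexing needed.
        registry.register("all", Set.of("345"));
        System.out.println(registry.isVisible("all", Set.of("12"))); // false
    }
}
```

Compared with adding a public tag to every record, toggling a collection here only replaces one set in memory, at the cost of OutbackCDX having to hold the tag sets for each access point.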

Plugin

OutbackCDX could grow a plugin API that looks something like:

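import java.util.BitSet;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArchiveItPlugin implements FilterPlugin {
   static final Pattern ACCESS_POINT_REGEX = Pattern.compile("all|coll-([0-9]+)|org-([0-9]+)");

   // Placeholders: both lookups would be refreshed periodically from Postgres
   BitSet privateCollections = periodicallyLoadSomehowFromPostgres();
   Map<Long,Long> orgIdForCollId = periodicallyLoadSomehowFromPostgres();

   public Predicate<Capture> newFilter(Query query) {
       Matcher m = ACCESS_POINT_REGEX.matcher(query.getAccessPoint());
       if (!m.matches()) return null;
       String collGroup = m.group(1); // null unless the access point is coll-N
       String orgGroup = m.group(2);  // null unless the access point is org-N

       if (collGroup != null) {
           long collId = Long.parseLong(collGroup);
           return capture -> getCollId(capture) == collId;
       } else if (orgGroup != null) {
           long orgId = Long.parseLong(orgGroup);
           return capture -> getOrgId(capture) == orgId;
       } else { // all (everything except private collections)
           return capture -> !privateCollections.get(Math.toIntExact(getCollId(capture)));
       }
   }

   // Helpers to parse filename or read it from the tags field if we add that
   private static final Pattern WARC_REGEX = Pattern.compile("ARCHIVEIT-([0-9]+)-([0-9]+)\\.warc\\.gz");
   long getCollId(Capture capture) {
       Matcher m = WARC_REGEX.matcher(capture.file);
       if (!m.matches()) throw new IllegalArgumentException("Unexpected WARC name: " + capture.file);
       return Long.parseLong(m.group(1));
   }
   // Returns -1 for unknown collections, which matches no organization
   long getOrgId(Capture capture) { return orgIdForCollId.getOrDefault(getCollId(capture), -1L); }
}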
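
The "periodically load somehow" part could be sketched along these lines. This is one possible approach and not part of any existing plugin API: `PrivateCollectionsCache` is a hypothetical name, the `Supplier` stands in for the actual Postgres query, and the loader is assumed to return a fresh `BitSet` that nobody mutates afterwards.

```java
import java.util.BitSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical holder for the private-collection set; the Supplier stands in for a Postgres query.
class PrivateCollectionsCache {
    private final Supplier<BitSet> loader;
    private volatile BitSet snapshot;

    PrivateCollectionsCache(Supplier<BitSet> loader) {
        this.loader = loader;
        this.snapshot = loader.get(); // load once up front so filters never see null
    }

    // Swap in a whole new snapshot; readers see either the old set or the new one.
    void refresh() {
        snapshot = loader.get();
    }

    boolean isPrivate(int collId) {
        return snapshot.get(collId);
    }

    // Re-run the loader at a fixed interval on a single daemon thread.
    // Note: if the loader throws, scheduleAtFixedRate stops scheduling further runs.
    ScheduledExecutorService startRefreshing(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "private-collections-refresh");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleAtFixedRate(this::refresh, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) {
        BitSet loaded = new BitSet();
        loaded.set(12);
        PrivateCollectionsCache cache = new PrivateCollectionsCache(() -> (BitSet) loaded.clone());
        System.out.println(cache.isPrivate(12));  // true
        System.out.println(cache.isPrivate(345)); // false
    }
}
```

Swapping a whole snapshot means a collection toggled to private stays visible through the all access point until the next refresh, so the refresh period bounds how stale the filter can be.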
