
Tagging design ideas

Alex Osborne edited this page Feb 20, 2019 · 15 revisions

Archive-It givens:

  • Each WARC belongs to one collection
  • Each collection belongs to one organization
  • Each collection can be public or private, and users can toggle this at will; the setting is stored in a Postgres DB
  • There are 10k+ collections and 1k+ organizations
  • The collection ID is baked into the filename (but not the organization ID)

Archive-It wants:

  • We want to be able to use a simple CDX server (OutbackCDX)
  • We want to be able to filter a query based on both the collection ID and organization ID
  • We want to be able to query all public organizations

Archive-It currently does:

  • Query all CDX entries for a given URL
  • Filter based on WARC name for organizations and collections:
    • For collections, filter on WARC filename, which contains collection ID
    • For organizations, query the DB for every collection the organization owns and build a regex a la ARCHIVEIT-(coll1|coll2|...)
    • For private collections, rely on a set of rules targeted at collection ID
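
The organization filter described above could be sketched roughly as follows. This is a minimal illustration rather than Archive-It's actual code: the class and method names are hypothetical, the collection IDs are assumed to come from the Postgres lookup, and the trailing `-.*` after the collection ID is an assumption about the filename layout.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: names and filename layout are assumptions for illustration.
final class OrgFilenameFilter {
    // Builds ARCHIVEIT-(coll1|coll2|...)-.* from the collection IDs an organization owns.
    static Pattern forCollections(List<Long> collectionIds) {
        if (collectionIds.isEmpty()) {
            throw new IllegalArgumentException("organization has no collections");
        }
        String alternation = collectionIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("|"));
        // The "-" after the group stops collection 12 from matching ARCHIVEIT-123-...
        return Pattern.compile("ARCHIVEIT-(" + alternation + ")-.*");
    }

    // True if the whole WARC filename matches the organization's pattern.
    static boolean accepts(Pattern pattern, String warcFilename) {
        return pattern.matcher(warcFilename).matches();
    }
}
```

Compiling the pattern once per query and reusing it for every CDX record keeps the per-record cost to a single regex match.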

Options

  1. Extend OutbackCDX with tagging through column families. OutbackCDX is already using those.
  2. Triplicate data: one index with all data; one index per collection; one index per organization. Unlikely to be workable with RocksDB, given a) the number of files involved and b) how much RocksDB keeps in memory for each index.
  3. Implement a tagging index
  4. Continue filtering on filename
    • Filter as now: fetch all records and filter in wayback
    • Implement a filter QS arg (e.g. /cdx?url=....&filter=filename:ARCHIVEIT-collID)
  5. Plugin that can talk to Postgres

Tagging index

Key: OutbackCDX record key
Value: space-separated list of collection and organization (prefix org-) IDs?
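
As a rough sketch of how such a value might be read back, assuming bare numbers are collection IDs and tokens prefixed with `org-` are organization IDs. The `TagValue` class and the exact value format (for example `12 345 org-7`) are hypothetical, since the format is still an open question here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical value format: "12 345 org-7"
// (bare numbers are collections, the org- prefix marks organizations).
final class TagValue {
    final Set<Long> collectionIds = new HashSet<>();
    final Set<Long> organizationIds = new HashSet<>();

    static TagValue parse(String value) {
        TagValue tags = new TagValue();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) continue; // a blank value yields no tags
            if (token.startsWith("org-")) {
                tags.organizationIds.add(Long.parseLong(token.substring(4)));
            } else {
                tags.collectionIds.add(Long.parseLong(token));
            }
        }
        return tags;
    }
}
```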

Might be possible to have multiple tagging indices (e.g: one for collections, one for orgs).

As an implementation detail, we would need to find a way to name these indices so that they are not mistaken for the databases of existing indices (_tags/coll or something?)

Will also depend on the following.

Filter QS arg

OutbackCDX has grown a query-string argument, filter, that accepts one (or more?†) key/value pair(s) to run the query results through on the way out. This could cover the current mechanism of filtering on the WARC name, filter=filename:ARCHIVEIT-(coll1|coll2|...)-.*, or, to go with the tagging index above, filter=tag.coll:coll1 or filter=tag.org:org1.

† The current implementation is a single filter. It can and should be fixed to accept multiple. - Alex
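
To make the multi-filter idea concrete, here is a minimal sketch of parsing `field:regex` specs and combining them. It is not OutbackCDX's implementation: the `QueryFilters` class is hypothetical, a record is modelled as a plain field-to-value map, the regex is assumed to match the whole field value, and multiple filters are assumed to be ANDed together.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: a record is modelled as a field-name -> value map.
final class QueryFilters {
    // Parses one "field:regex" spec, e.g. "filename:ARCHIVEIT-(12|345)-.*".
    static Predicate<Map<String, String>> parse(String spec) {
        int colon = spec.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected field:regex but got: " + spec);
        }
        String field = spec.substring(0, colon);
        Pattern pattern = Pattern.compile(spec.substring(colon + 1));
        return rec -> {
            String value = rec.get(field);
            // Assumption: a missing field never matches.
            return value != null && pattern.matcher(value).matches();
        };
    }

    // Combines several specs; a record must pass all of them (assumed AND semantics).
    static Predicate<Map<String, String>> parseAll(List<String> specs) {
        Predicate<Map<String, String>> combined = rec -> true;
        for (String spec : specs) {
            combined = combined.and(parse(spec));
        }
        return combined;
    }
}
```

Splitting on the first colon only means the regex itself may contain colons.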

Handling the all access point

If we go with a tagging index, how do we handle the all (all public collections) access point?

  1. Add a public (or private) tag and reindex when a collection is toggled between private and public.
  2. Wayback generates a very large tag:coll1|coll2|... filter for all private or public collection ids (costs ~1ms for 10k ids, ~10ms for 100k, ~100ms for 1m).
  3. Add an API to OutbackCDX for registering named access points with a given set of tags (or regex filters?).
  4. Plugin (see below)
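
Option 3 might look something like the following in-memory registry. This is only a sketch under stated assumptions: `AccessPointRegistry` and its methods are hypothetical, no such API is known to exist in OutbackCDX, and an access point is assumed to allow a capture when the capture carries at least one of the registered tags.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of named access points; not an existing OutbackCDX API.
final class AccessPointRegistry {
    private final Map<String, Set<String>> tagsByAccessPoint = new ConcurrentHashMap<>();

    // Registers or replaces an access point, e.g. when a collection is toggled.
    void register(String name, Set<String> allowedTags) {
        tagsByAccessPoint.put(name, Set.copyOf(allowedTags));
    }

    // True if the capture carries at least one tag allowed for the access point.
    boolean allows(String name, Set<String> captureTags) {
        Set<String> allowed = tagsByAccessPoint.get(name);
        if (allowed == null) return false; // unknown access point
        for (String tag : captureTags) {
            if (allowed.contains(tag)) return true;
        }
        return false;
    }
}
```

With this shape, toggling a collection between private and public means re-registering the all access point with an updated tag set instead of reindexing.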

Plugin

OutbackCDX could grow a plugin API that looks something like:

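import java.util.BitSet;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// FilterPlugin, Query and Capture are the proposed plugin API types; they do not exist yet.
class ArchiveItPlugin implements FilterPlugin {
   static final Pattern ACCESS_POINT_REGEX = Pattern.compile("all|coll-([0-9]+)|org-([0-9]+)");

   // Both are refreshed periodically from Postgres (loading mechanism still to be decided)
   BitSet privateCollections = periodicallyLoadSomehowFromPostgres();
   Map<Long,Long> orgIdForCollId = periodicallyLoadSomehowFromPostgres();

   public Predicate<Capture> newFilter(Query query) {
       Matcher m = ACCESS_POINT_REGEX.matcher(query.getAccessPoint());
       if (!m.matches()) return null; // not an access point this plugin handles
       String collId = m.group(1); // null unless the access point is coll-N
       String orgId = m.group(2);  // null unless the access point is org-N

       if (collId != null) {
           long id = Long.parseLong(collId);
           return capture -> getCollId(capture) == id;
       } else if (orgId != null) {
           long id = Long.parseLong(orgId);
           return capture -> getOrgId(capture) == id;
       } else { // all (everything except private collections)
           // BitSet is indexed by int, so collection IDs are assumed to fit in an int
           return capture -> !privateCollections.get((int) getCollId(capture));
       }
   }

   // Helpers to parse filename or read it from the tags field if we add that
   private static final Pattern WARC_REGEX = Pattern.compile("ARCHIVEIT-([0-9]+)-([0-9]+)\\.warc\\.gz");
   long getCollId(Capture capture) {
       Matcher m = WARC_REGEX.matcher(capture.file);
       // matches() must succeed before group() can be read
       if (!m.matches()) throw new IllegalArgumentException("Unexpected WARC filename: " + capture.file);
       return Long.parseLong(m.group(1));
   }
   // Returns -1 for collections with no known organization, so they match no org filter
   long getOrgId(Capture capture) { return orgIdForCollId.getOrDefault(getCollId(capture), -1L); }
}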