Tagging design ideas
- Each WARC belongs to one collection
- Each collection belongs to one organization
- Each collection can be public or private, and users can toggle this at will; the setting is stored in a Postgres database
- There are 10k+ collections and 1k+ organizations
- The collection ID is baked into the filename (but not the organization ID)
- We want to be able to use a simple CDX server (OutbackCDX)
- We want to be able to filter a query based on both the collection ID and organization ID
- We want to be able to query all public collections
- Query all CDX entries for a given URL
- Filter based on WARC name for organizations and collections:
  - For collections, filter on the WARC filename, which contains the collection ID
  - For organizations, query the DB for every collection the organization owns and build a regex à la `ARCHIVEIT-(coll1|coll2|...)`
- For private collections, rely on set of rules targeted at collection ID
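The organization case of the current approach can be sketched as follows. This is only an illustration, not the existing Wayback code: the class name `OrgFilenameFilter` and the sample filenames are hypothetical, and the collection IDs are assumed to have already been fetched from the database. The trailing `-.*` matters, because without it collection `202` would also match files from collection `2020`.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OrgFilenameFilter {
    // Builds a pattern such as ARCHIVEIT-(101|202)-.* from the collection IDs
    // that the database reports for one organization.
    static Pattern forCollections(List<Long> collectionIds) {
        if (collectionIds.isEmpty()) {
            throw new IllegalArgumentException("organization owns no collections");
        }
        String alternatives = collectionIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("|"));
        return Pattern.compile("ARCHIVEIT-(" + alternatives + ")-.*");
    }

    public static void main(String[] args) {
        // Hypothetical IDs and filenames, for illustration only.
        Pattern p = forCollections(List.of(101L, 202L));
        System.out.println(p.pattern());
        System.out.println(p.matcher("ARCHIVEIT-202-20190101.warc.gz").matches());
        System.out.println(p.matcher("ARCHIVEIT-2020-20190101.warc.gz").matches());
    }
}
```

The pattern grows linearly with the number of collections an organization owns, which is part of why the alternatives below are being considered.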
- Extend OutbackCDX with tagging through column families. OutbackCDX is already using those.
- Triplicate data: one index with all data; one index per collection; one index per organization. Unlikely, given a) the number of files involved, and b) how much RocksDB keeps in memory for each index.
- Implement a tagging index
- Continue filtering on filename
  - Filter as now: fetch all records and filter in wayback
  - Implement a `filter` QS arg (e.g. `/cdx?url=....&filter=filename:ARCHIVEIT-collID`)
- Plugin that can talk to Postgres
Key | Value |
---|---|
OutbackCDX Record Key | space-separated list of collection and organization (prefix org-) IDs? |
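To make the proposed value format concrete, here is a minimal sketch of parsing one tagging-index value. It assumes the layout suggested in the table (space-separated IDs, with organization IDs carrying an `org-` prefix); the class name `TagValue` and the sample value are hypothetical, and nothing here reflects an existing OutbackCDX API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TagValue {
    final Set<String> collections = new LinkedHashSet<>();
    final Set<String> organizations = new LinkedHashSet<>();

    // Parses a value such as "101 202 org-7": entries with the org- prefix
    // are organization IDs, everything else is a collection ID.
    static TagValue parse(String value) {
        TagValue tags = new TagValue();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) continue; // the value was empty
            if (token.startsWith("org-")) {
                tags.organizations.add(token.substring("org-".length()));
            } else {
                tags.collections.add(token);
            }
        }
        return tags;
    }
}
```

For example, `TagValue.parse("101 202 org-7")` yields collections `101` and `202` and organization `7`.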
It might be possible to have multiple tagging indices (e.g. one for collections, one for orgs).
As an implementation detail, we would need to find a way to name these indices so that they are not mistaken for databases backing existing indices (`_tags/coll` or something?).
Will also depend on the following.
OutbackCDX has grown a query-string argument, `filter`, that accepts one (or more?†) key/value pair(s) to run the query results through on the way out. This could include the current mechanism of filtering on the WARC name, `filter=filename:ARCHIVEIT-(coll1|coll2|...)-.*`, or, to go with the tagging index, `filter=tag.coll:coll1` or `filter=tag.org:org1`.
† The current implementation is a single filter. It can and should be fixed to accept multiple. - Alex
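As a rough illustration of what accepting multiple filters could look like, the sketch below parses several `field:regex` specs and requires a record to pass all of them. It is not OutbackCDX's implementation: the `field:regex` shape is inferred from the examples above, the record is modelled as a plain field-to-value map, and the class name `QueryFilters` is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class QueryFilters {
    // Turns one "field:regex" spec into a predicate over a record,
    // modelled here as a field-name-to-value map (an assumption).
    static Predicate<Map<String, String>> parse(String spec) {
        int colon = spec.indexOf(':'); // split on the first colon only
        if (colon < 1) {
            throw new IllegalArgumentException("expected field:regex, got " + spec);
        }
        String field = spec.substring(0, colon);
        Pattern regex = Pattern.compile(spec.substring(colon + 1));
        return record -> {
            String value = record.get(field);
            return value != null && regex.matcher(value).matches();
        };
    }

    // A record passes only if every filter accepts it.
    static Predicate<Map<String, String>> all(List<String> specs) {
        Predicate<Map<String, String>> combined = record -> true;
        for (String spec : specs) {
            combined = combined.and(parse(spec));
        }
        return combined;
    }
}
```

Combining with AND is one possible choice; whether repeated `filter` arguments should be ANDed or ORed is a design decision this page does not settle.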
If we go with a tagging index, how do we handle the `all` (all public collections) access point?
- Add a `public` (or `private`) tag and reindex when a collection is toggled between private and public.
- Wayback generates a very large `tag:coll1|coll2|...` filter for all private or public collection IDs (costs ~1ms for 10k IDs, ~10ms for 100k, ~100ms for 1M).
- Add an API to OutbackCDX for registering named access points with a given set of tags (or regex filters?).
- Plugin (see below)
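The option of registering named access points could look roughly like the in-memory sketch below. Everything in it is hypothetical (the class `AccessPointRegistry`, its methods, and the idea that a capture's tags arrive as a set of strings); it only illustrates the shape of the idea and says nothing about how registration would be exposed over HTTP.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class AccessPointRegistry {
    private final Map<String, Set<String>> tagsByAccessPoint = new ConcurrentHashMap<>();

    // Registers (or replaces) the set of tags a named access point may see.
    void register(String name, Set<String> tags) {
        tagsByAccessPoint.put(name, new HashSet<>(tags));
    }

    // A capture is visible if it carries at least one registered tag.
    // Unknown access points see nothing.
    boolean isVisible(String accessPoint, Set<String> captureTags) {
        Set<String> allowed =
                tagsByAccessPoint.getOrDefault(accessPoint, Collections.emptySet());
        return !Collections.disjoint(allowed, captureTags);
    }
}
```

Under this sketch, toggling a collection between private and public would mean re-registering the `all` access point with an updated tag set rather than reindexing.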
OutbackCDX could grow a plugin API that looks something like:
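
    import java.util.BitSet;
    import java.util.Map;
    import java.util.function.Predicate;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class ArchiveItPlugin implements FilterPlugin {
        static final Pattern ACCESS_POINT_REGEX = Pattern.compile("all|coll-([0-9]+)|org-([0-9]+)");
        // Placeholders: how and how often these are refreshed from Postgres is left open
        BitSet privateCollections = periodicallyLoadSomehowFromPostgres();
        Map<Long, Long> orgIdForCollId = periodicallyLoadSomehowFromPostgres();

        public Predicate<Capture> newFilter(Query query) {
            Matcher m = ACCESS_POINT_REGEX.matcher(query.getAccessPoint());
            if (!m.matches()) return null; // not an access point this plugin handles
            if (m.group(1) != null) {
                long collId = Long.parseLong(m.group(1));
                return capture -> getCollId(capture) == collId;
            } else if (m.group(2) != null) {
                Long orgId = Long.valueOf(m.group(2));
                return capture -> orgId.equals(getOrgId(capture));
            } else { // all (everything except private collections)
                return capture -> {
                    long collId = getCollId(capture);
                    return collId >= 0 && !privateCollections.get((int) collId);
                };
            }
        }

        // Helpers to parse the filename, or read it from the tags field if we add that
        private static final Pattern WARC_REGEX = Pattern.compile("ARCHIVEIT-([0-9]+)-([0-9]+)\\.warc\\.gz");

        // Returns -1 when the filename does not look like an Archive-It WARC
        long getCollId(Capture capture) {
            Matcher m = WARC_REGEX.matcher(capture.file);
            return m.matches() ? Long.parseLong(m.group(1)) : -1;
        }

        // Returns null when the collection is not known to belong to any organization
        Long getOrgId(Capture capture) { return orgIdForCollId.get(getCollId(capture)); }
    }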