Refactoring ideas

#Change API to use UsableURI wherever possible instead of strings.

##Motivation Most places in Open Wayback where a URI is expected, it is represented as a string. This makes the API fragile because you never know if the URI is valid, meaning that every class ideally needs to validate the URI before using it. If so, lots of code is duplicated across classes and processing time wasted by validating a URI that is probably validated before. If it is not validated everywhere, then we’re not able to ensure that validation or normalization of the URI is done the same for every possible configuration.

Another problem is that there is a chance that changes to how URIs are handled, are not updated everywhere they should. One such example could be support for international domain names (IDN). The rules for rewriting UTF-8 host names to puny-code and back, should be implemented once and be available in every class using it.

##Changes The proposal is to replace most places in the API where a URI is expected as a String to use UsableURI instead. This will probably also require additions to the UsableURI class to support some of the standard rewriting and splitting of URIs. Maybe the UsableURI class should add more support for SURTs as well.

I also question the use of the factory-pattern for generating UsableURI instances. If UsableURI was immutable this pattern would be ok, but it isn't. A lot of set-methods inherited from org.apache.commons.httpclient.URI makes it possible to alter the URI without enforcing the logic in the factory class. It would be better to override methods in UsableURI to make sure values always are parsed the same way. If we ommit the factory class, it would probably be a good idea to make the constructor private and add static instantiation methods to avoid confusion when constructing UsableURI from uri string or SURT string.

public class UsableURI extends LaxURI implements CharSequence, Serializable {
  public static UsableURI fromUriString(String uri)
  public static UsableURI fromUri(java.net.URI uri)
  public static UsableURI fromSurtString(String surt)
  ...
}

If we would like to keep the factory, then UsableURI should be immutable. This could be achieved in two ways:

Override all setter-methods with private no-ops to make UsableURI immutable.
Let UsableURI be an interface and only expose methods which doesn't change the value without creating new instances.

Access Check and Exclusion

(this is still a pile of random thoughts - I'd welcome comments/clean ups - Kenji) There are two frameworks for filtering out captures: one for ResourceIndex/CaptureSearchResult and another for CDXServer/CDXLine. As we plan to consolidate index implementation into CDXServer, I focus on CDXServer version of capture filtering here. In order to reuse filtering components originally written for ResourceIndex framework, there is some dirty bridge work in CDXServer/CDXLine. That's one area needing clean-ups.

Two kinds of filtering are recognized to date:

scope filtering (for hosting multiple collections on top of single Wayback index)
access control (for prohibiting playback of certain captures, often controlled by external filtering rule database)

Scope filtering is currently tightly coupled with CompositeAccessPoint, and specific to use case at Internet Archive. Access control can also change visibility of CDX fields as well as filtering out CDX lines altogether.

Key difference between these two is:

scope filtering is silent; there's no need to communicate to the user that captures are being filtered out. whereas,
access control needs to communicate what filtering took effect and how (ex. "excluded by robots.txt" etc.) This communication is not well implemented in my opinion (more later).

Another (minor) difference is:

scope filtering is usually statically configured, whereas,
access control often vary by client (username, IP address etc.)

Current Implementation

Description of classes involved.

AuthChecker

This interface is a factory of CDXAccessFilter. CDXServer is configured with an implementation of this interface at startup. CDXAccessFilter is a per-session filtering object.

AuthChecker also grants permissions to AuthToken. Primary implementation PrivTokenAuthChecker is configured with a list of pre-defined tokens, and determines whether a user (principal; represented by AuthToken) has certain permissions.

Mix-up of these two functionality is a legacy of stand-alone CDXServer implementation. Common approach is to have separate authentication/authorization component, and let other parts of application consult with principal object for user permissions. With this architecture, we can remove isAllUrlAccessAllowed and isAllCdxFieldAccessAllowed methods. It is harder to reuse authentication/authorization functionality in Wayback because of this mix-up.

getPublicCdxFields method appears to be unused. Don't know why it must be part of AuthChecker interface. PrivTokenAuthChecker has setPublicCdxFields(String), which updates publicCdxFormat property with FieldSplitFormat object. There's no setPublicCdxFormat(FieldSplitFormat) method.

getPublicCdxFormat is used by CDXServer.writeCdxResponse method: if (!authChecker.isAllCdxFieldAccessAllowed(authToken)) { outputFields = this.authChecker.getPublicCdxFormat(); } This property could be moved to CDXServer.

WaybackAPAuthChecker has been superseded by AccessPointAuthChecker and there is no known user of WaybackAPAuthChecker currently. Its base class WaybackAuthChecker has no other sub-classes. These two classes can be removed.

AuthToken

Name of this class indicates close tie to PrivTokenAuthChecker. This is more like a Principal class defined by JAAS. authToken field is the name of principal. cachedAllUrlAllow, cachedAllCdxAllow and ignoreRobots are permissions.

AuthToken is abused to pass AccessPoint to AuthChecker. Its sole sub-class APContextAuthToken saves AccessPoint object passed to its constructor for AuthChecker implementation (ex. AccessPointAuthChecker) to build collection-specific CDXAccessFilter. By introduction of AccessPoint.createExclusionFilter method, this method has become a standard way of instantiating ExclusionFilter. We should add CollectionContext parameter to CDXServer.getCdx.

setAllCdxFieldsAllow() method and setIgnoreRobots(boolean) methods are worth special note. These methods are used by EmbeddedCDXServerIndex for configuring APContextAuthToken for internal use of CDXServer

CDXAccessFilter

An interface for per-session filtering object. It defines two methods:

boolean includeUrl(String urlKey, String originalUrl)
boolean includeCapture(CDXLine line)

includeUrl is called (by CDXServer.getCdx(CDXQuery, AuthToken)) just once for URL, before any calls to includeCapture, to check for per-URL filtering. This method exists so as to quickly detect per-URL exclusion, even before loading the first line of CDX.

Another purpose of this method is to communicate the act of filtering. If this method returns false, CDXServer will silently return empty result; There's no way to tell if the URL has never been captured, or excluded per-URL basis. To communicate the act of filtering, AccessCheckFilter (primary implementation of CDXAccessFilter) throws RuntimeIOException wrapping an instance of AccessControlException carrying more information on the type of filtering applied. Wayback defines four sub-classes of AccessControlException:

AdministrativeAccessControlException - excluded by other (possibly manually set up) policy rules
RobotControlAccessControlException - excluded by robots.txt rules
RobotNotAvailableException - unused in CDXServer
RobotTimeOutAccessControlException - unused in CDXServer EmbeddedCDXServerIndex.doQuery catches wrapping RuntimeIOException and re-throws inner AccessControlException. We should be able to make AuthChecker and CDXAccessFilter throw AccessControlException directly.

Semantics of the first two exceptions are loose. AccessCheckFilter throws AdministrativeAccessControlException when whatever ExclusionFilter given to its adminFilter parameter returns values other than FILTER_INCLUDE. Similarly it throws RobotAccessControlException whenever its robotsFilter returns non-FILTER_INCLUDE value. As such, AccessCheckFilter is not extensible to allow other types of exclusions. For this reason, recent changes are moving away from this approach; New AccessPointAuthChecker creates AccessCheckFilter with just one ExclusionFilter object, returned by AccessPoint.createExclusionFilter().

ExclusionFilter originates from ResourceIndex/CaptureSearchResult framework. AccessCheckFilter uses it so that existing exclusion filters (most notably OracleExclusionFilter and StaticMapExclusionFilter) can be reused with CDXServer. As its filterObject method needs CaptureSearchResult as an argument, AccessCheckFilter creates temporary wrapper CaptureSearchResult object for every CDXLine (CDXLine cannot implement CaptureSearchResult as it is not an interface). This is very inefficient.

Older code did not have this issue because filtering was run in CDXToCaptureSearchResultsWriter who already had CaptureSearchResult objects. This old method is still supported, but strongly discouraged as it screws up CDX query result if exclusion and collapsing are combined. CDXWriter should focus on converting CDXLine to final output - other CDXWriter implementations, like PlainTextWriter, do not implement exclusion.

Possible Refactoring & API Changes

Authentication/Authorization

Adopt design pattern from JAAS:

AuthToken as Subject
AuthChecker as LoginModule (or LoginContext) This implies:
moving permission attributes and test methods from AuthChecker to AuthToken
add a new method to AuthChecker interface, that corresponds to LoginModule#initialize and LoginModule#login
remove createAccessFilter method from AuthChecker, embed the essential portion of its functionality in CDXServer

Adopting full JAAS would make Wayback configuration way too complex without clear benefits. Unless other requirements arise, simplified framework would suffice. Alignment with JAAS design makes it easier to understand and extend.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly