-
Notifications
You must be signed in to change notification settings - Fork 276
Refactoring ideas
#Change API to use UsableURI wherever possible instead of strings.
##Motivation Most places in Open Wayback where a URI is expected, it is represented as a string. This makes the API fragile because you never know if the URI is valid, meaning that every class ideally needs to validate the URI before using it. If so, lots of code is duplicated across classes and processing time wasted by validating a URI that is probably validated before. If it is not validated everywhere, then we’re not able to ensure that validation or normalization of the URI is done the same for every possible configuration.
Another problem is that there is a chance that changes to how URIs are handled, are not updated everywhere they should. One such example could be support for international domain names (IDN). The rules for rewriting UTF-8 host names to puny-code and back, should be implemented once and be available in every class using it.
##Changes The proposal is to replace most places in the API where a URI is expected as a String to use UsableURI instead. This will probably also require additions to the UsableURI class to support some of the standard rewriting and splitting of URIs. Maybe the UsableURI class should add more support for SURTs as well.
I also question the use of the factory-pattern for generating UsableURI instances. If UsableURI was immutable this pattern would be ok, but it isn't. A lot of set-methods inherited from org.apache.commons.httpclient.URI
makes it possible to alter the URI without enforcing the logic in the factory class. It would be better to override methods in UsableURI to make sure values always are parsed the same way. If we ommit the factory class, it would probably be a good idea to make the constructor private and add static instantiation methods to avoid confusion when constructing UsableURI from uri string or SURT string.
public class UsableURI extends LaxURI implements CharSequence, Serializable {
public static UsableURI fromUriString(String uri)
public static UsableURI fromUri(java.net.URI uri)
public static UsableURI fromSurtString(String surt)
...
}
If we would like to keep the factory, then UsableURI should be immutable. This could be achieved in two ways:
- Override all setter-methods with private no-ops to make UsableURI immutable.
- Let UsableURI be an interface and only expose methods which doesn't change the value without creating new instances.
(this is still a pile of random thoughts - I'd welcome comments/clean ups - Kenji) There are two frameworks for filtering out captures: one for ResourceIndex
/CaptureSearchResult
and another for CDXServer
/CDXLine
. As we plan to consolidate index implementation into CDXServer
, I focus on CDXServer
version of capture filtering here. In order to reuse filtering components originally written for ResourceIndex
framework, there is some dirty bridge work in CDXServer
/CDXLine
. That's one area needing clean-ups.
Two kinds of filtering are recognized to date:
- scope filtering (for hosting multiple collections on top of single Wayback index)
- access control (for prohibiting playback of certain captures, often controlled by external filtering rule database)
Scope filtering is currently tightly coupled with CompositeAccessPoint
, and specific to use case at Internet Archive. Access control can also change visibility of CDX fields as well as filtering out CDX lines altogether.
Key difference between these two is:
- scope filtering is silent; there's no need to communicate to the user that captures are being filtered out. whereas,
- access control needs to communicate what filtering took effect and how (ex. "excluded by robots.txt" etc.) This communication is not well implemented in my opinion (more later).
Another (minor) difference is:
- scope filtering is usually statically configured, whereas,
- access control often vary by client (username, IP address etc.)
Description of classes involved.
This interface is a factory of CDXAccessFilter
. CDXServer is configured with an implementation of this interface at startup. CDXAccessFilter
is a per-session filtering object.
AuthChecker
also grants permissions to AuthToken
. Primary implementation PrivTokenAuthChecker
is configured with a list of pre-defined tokens, and determines whether a user (principal; represented by AuthToken
) has certain permissions.
Mix-up of these two functionality is a legacy of stand-alone CDXServer implementation. Common approach is to have separate authentication/authorization component, and let other parts of application consult with principal object for user permissions. With this architecture, we can remove isAllUrlAccessAllowed
and isAllCdxFieldAccessAllowed
methods. It is harder to reuse authentication/authorization functionality in Wayback because of this mix-up.
getPublicCdxFields
method appears to be unused. Don't know why it must be part of AuthChecker interface. PrivTokenAuthChecker
has setPublicCdxFields(String)
, which updates publicCdxFormat
property with FieldSplitFormat
object. There's no setPublicCdxFormat(FieldSplitFormat)
method.
getPublicCdxFormat
is used by CDXServer.writeCdxResponse
method:
if (!authChecker.isAllCdxFieldAccessAllowed(authToken)) {
outputFields = this.authChecker.getPublicCdxFormat();
}
This property could be moved to CDXServer
.
WaybackAPAuthChecker
has been superseded by AccessPointAuthChecker
and there is no known user of WaybackAPAuthChecker
currently. Its base class WaybackAuthChecker
has no other sub-classes. These two classes can be removed.
Name of this class indicates close tie to PrivTokenAuthChecker
. This is more like a Principal
class defined by JAAS.
authToken
field is the name of principal. cachedAllUrlAllow
, cachedAllCdxAllow
and ignoreRobots
are permissions.
AuthToken
is abused to pass AccessPoint
to AuthChecker
. Its sole sub-class APContextAuthToken
saves AccessPoint
object passed to its constructor for AuthChecker
implementation (ex. AccessPointAuthChecker
) to build collection-specific CDXAccessFilter
. By introduction of AccessPoint.createExclusionFilter
method, this method has become a standard way of instantiating ExclusionFilter
. We should add CollectionContext
parameter to CDXServer.getCdx
.
setAllCdxFieldsAllow()
method and setIgnoreRobots(boolean)
methods are worth special note. These methods are used by EmbeddedCDXServerIndex
for configuring APContextAuthToken
for internal use of CDXServer
An interface for per-session filtering object. It defines two methods:
- boolean
includeUrl(String urlKey, String originalUrl)
- boolean includeCapture(CDXLine line)
includeUrl
is called (by CDXServer.getCdx(CDXQuery, AuthToken)
) just once for URL, before any calls to includeCapture
, to check for per-URL filtering. This method exists so as to quickly detect per-URL exclusion, even before loading the first line of CDX.
Another purpose of this method is to communicate the act of filtering. If this method returns false
, CDXServer will silently return empty result; There's no way to tell if the URL has never been captured, or excluded per-URL basis. To communicate the act of filtering, AccessCheckFilter
(primary implementation of CDXAccessFilter
) throws RuntimeIOException
wrapping an instance of AccessControlException
carrying more information on the type of filtering applied. Wayback defines four sub-classes of AccessControlException
:
-
AdministrativeAccessControlException
- excluded by other (possibly manually set up) policy rules -
RobotControlAccessControlException
- excluded by robots.txt rules -
RobotNotAvailableException
- unused inCDXServer
-
RobotTimeOutAccessControlException
- unused inCDXServer
EmbeddedCDXServerIndex.doQuery
catches wrappingRuntimeIOException
and re-throws innerAccessControlException
. We should be able to makeAuthChecker
andCDXAccessFilter
throwAccessControlException
directly.
Semantics of the first two exceptions are loose. AccessCheckFilter
throws AdministrativeAccessControlException
when whatever ExclusionFilter
given to its adminFilter
parameter returns values other than FILTER_INCLUDE
. Similarly it throws RobotAccessControlException
whenever its robotsFilter
returns non-FILTER_INCLUDE
value. As such, AccessCheckFilter
is not extensible to allow other types of exclusions. For this reason, recent changes are moving away from this approach; New AccessPointAuthChecker
creates AccessCheckFilter
with just one ExclusionFilter
object, returned by AccessPoint.createExclusionFilter()
.
ExclusionFilter
originates from ResourceIndex
/CaptureSearchResult
framework. AccessCheckFilter
uses it so that existing exclusion filters (most notably OracleExclusionFilter
and StaticMapExclusionFilter
) can be reused with CDXServer
.
As its filterObject
method needs CaptureSearchResult
as an argument, AccessCheckFilter
creates temporary wrapper CaptureSearchResult
object for every CDXLine
(CDXLine
cannot implement CaptureSearchResult
as it is not an interface). This is very inefficient.
Older code did not have this issue because filtering was run in CDXToCaptureSearchResultsWriter
who already had CaptureSearchResult
objects. This old method is still supported, but strongly discouraged as it screws up CDX query result if exclusion and collapsing are combined. CDXWriter
should focus on converting CDXLine
to final output - other CDXWriter
implementations, like PlainTextWriter
, do not implement exclusion.
Adopt design pattern from JAAS:
- AuthToken as Subject
- AuthChecker as LoginModule (or LoginContext) This implies:
- moving permission attributes and test methods from AuthChecker to AuthToken
- add a new method to AuthChecker interface, that corresponds to LoginModule#initialize and LoginModule#login
- remove
createAccessFilter
method fromAuthChecker
, embed the essential portion of its functionality inCDXServer
Adopting full JAAS would make Wayback configuration way too complex without clear benefits. Unless other requirements arise, simplified framework would suffice. Alignment with JAAS design makes it easier to understand and extend.
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git