November 2016

During November 2016 we focused on improving the performance and stability of dor_indexing_app and its associated systems.

Summary

Reduced latency by ~70%, from ~11s to ~3s per request
Met throughput target of 3 days for full reindex (1.2M objects)
Implemented performance instrumentation via New Relic
Performed detailed analysis and exposed bottlenecks in code and systems during realistic workloads
Improved stability and documentation of pipeline and systems
Implemented monitoring of all systems
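
As a rough sanity check on the latency and throughput figures above, the following sketch works through the arithmetic. It assumes a steady-state load, so that concurrency is roughly throughput times latency (Little's law); the variable names are illustrative and the worker counts are estimates, not measured values.

```ruby
# Back-of-the-envelope check of the summary figures.
# Assumption: steady state, so concurrency = throughput * latency.

objects         = 1_200_000   # full reindex size, from the summary
target_days     = 3           # throughput target, from the summary
latency_before  = 11.0        # seconds per request (approximate)
latency_after   = 3.0         # seconds per request (approximate)
seconds_per_day = 24 * 60 * 60

# Relative latency improvement: (11 - 3) / 11
latency_reduction = (latency_before - latency_after) / latency_before

# Sustained rate needed to finish the reindex within the target window
required_rate = objects / (target_days * seconds_per_day).to_f

# Estimated number of requests that must be in flight at once
workers_before = (required_rate * latency_before).ceil
workers_after  = (required_rate * latency_after).ceil

puts format('latency reduction: %.0f%%', latency_reduction * 100)
puts format('required rate: %.2f objects/s', required_rate)
puts "concurrent requests needed at ~11s: #{workers_before}"
puts "concurrent requests needed at ~3s:  #{workers_after}"
```

Under these assumptions the target needs about 4.6 objects per second, which takes roughly 14 requests in flight at ~3s each, compared with roughly 51 at ~11s each.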

dor_indexing_app
- Added third node to cluster to improve throughput
- Tuned concurrency parameters between ActiveMQ and reindexer
- Installed New Relic for performance analysis and instrumentation
- Added instrumentation metrics to indexing logs
- Upgraded VM performance by improving hosting configuration
dor-services
- Reduced redundant calls to the workflow services
- Refactored to avoid unnecessary calls to retrieve collection information
- Parsed XSLT scripts only once, rather than on every request
- Avoided unnecessary reloads of collection objects
- Reduced query traffic to Solr cloud
- Cached the Fedora client certificate store to avoid unnecessary reinitialization
DOR (Fedora)
- Removed traffic from fedora.apim.access messaging (unused, but high volume)
- PENDING: Upgrade NFS storage appliance
- PENDING: Upgrade VM performance by improving hosting configuration
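
Several of the dor-services changes above (parsing XSLT scripts once, caching the certificate store) follow the same parse-once, reuse-many pattern. The sketch below illustrates that pattern only; it is not the dor-services code. `ParsedOnceCache` and the file name `example.xslt` are hypothetical, and the parser block is a stand-in for a real stylesheet parse.

```ruby
# Hypothetical helper: runs an expensive parse once per key and reuses the result.
class ParsedOnceCache
  def initialize(&parser)
    @parser = parser
    @cache  = {}
    @lock   = Mutex.new # guards the cache when requests run in multiple threads
  end

  # Returns the cached value for key, parsing it only on the first lookup.
  def fetch(key)
    @lock.synchronize do
      @cache.fetch(key) { @cache[key] = @parser.call(key) }
    end
  end
end

parse_count = 0
stylesheets = ParsedOnceCache.new do |path|
  parse_count += 1
  # Stand-in for the expensive step; a real application might parse the
  # stylesheet file here (for example with Nokogiri::XSLT).
  "parsed:#{path}"
end

# Three simulated requests share one parse.
3.times { stylesheets.fetch('example.xslt') }
puts "parses performed: #{parse_count}"
```

The trade-off is that a changed stylesheet is not picked up until the process restarts, which is usually acceptable for files deployed with the application.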

dor_indexing_app
- Full stack monitoring (Fedora, Workflow Service, Sul-Solr, SulMQ)
- OkComputer-based monitoring
- Clarified and documented API; deprecated GET routes
- Upgraded to Rails 5 and adopted Honeybadger
dor-services
- Refactored model hierarchy and upgraded ActiveFedora to 8.x
- Used identityMetadata as authoritative model definition
- Relied more on stanford-mods for metadata extraction
- Removed unused and unmaintained code
Argo
- Delegated reindexing from internals to dor_indexing_app services
- Removed unused /dor routes and associated code
DOR (Fedora)
- Direct monitoring
Workflow Service
- Direct monitoring
- Changed performance configuration of Oracle database server
- PENDING: Isolate the service onto its own VMs
ActiveMQ
- Direct monitoring
- Background reindexing when idle -- takes ~3 days for 1.2M objects using this method

~70% of the time is spent in the to_solr method, running application logic
~30% of the time is spent in calls to external services: DOR, Solr, and Workflow, in that order
Performance is sensitive to VM CPU availability and DOR performance
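
A time breakdown like the one above can be approximated by timing labelled sections of a request and reporting each section's share of the total. The sketch below is a minimal illustration using only core Ruby; `TimingLog` and the labels are hypothetical, the `sleep` calls stand in for real work, and it is not how the New Relic instrumentation collects its figures.

```ruby
# Hypothetical collector: accumulates wall-clock time per labelled section.
class TimingLog
  attr_reader :totals

  def initialize
    @totals = Hash.new(0.0)
  end

  # Times the block with a monotonic clock and returns the block's value.
  def measure(label)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @totals[label] += Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  # Percentage of total measured time spent in each section.
  def shares
    total = @totals.values.sum
    return {} if total.zero?

    @totals.transform_values { |seconds| (100.0 * seconds / total).round(1) }
  end
end

timings = TimingLog.new
timings.measure(:application_logic) { sleep 0.02 } # stand-in for to_solr work
timings.measure(:external_services) { sleep 0.01 } # stand-in for DOR/Solr/Workflow calls
puts timings.shares.inspect
```

Writing the per-section totals to the indexing log for each request gives a breakdown that can be aggregated later.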