christinklez edited this page Nov 15, 2021 · 16 revisions

Rikolti

Rikolti is a new project in development at the California Digital Library. It will replace our existing harvesting infrastructure for Calisphere, which aggregates digital cultural heritage content from hundreds of organizations throughout California. A standard set of requirements informs this work, including ingesting from multiple sources, mapping to a standard metadata model, pulling in content files, and producing a search index.

Background and Context

The existing Calisphere harvester uses deprecated, unsupported technologies from DPLA’s version 1 harvesting stack, developed in 2013. The harvesting system is too central to our operations to let it degrade on unsupported, undocumented, and outdated technologies. Further, an assessment that we conducted in 2016 signaled an opportunity to shift technologies in order to streamline processes and support the scalability of the service. We also analyzed existing technologies, including Supplejack, Combine, and Ingestion3, but these were not readily adaptable to Calisphere’s identified use cases and requirements.

The Rikolti system is designed to be modular and fast, using current, well-supported systems:

  • Modular components will allow us to re-run individual pieces as needed: for example, we can iterate on a metadata mapping independently from other steps. Modular components also make it easier to develop and test locally, and allow for better error handling at each step.
  • Rapid processing will let us handle data much more quickly, which opens up new opportunities for reporting on, understanding, and sharing that data.
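The modularity described above can be sketched in a few lines. This is a minimal illustration, not the Rikolti implementation: the `Store` dictionary stands in for persisted stage outputs, and the stage names, field names, and sample records are all hypothetical. It shows a mapping stage being revised and re-run without repeating the fetch.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for persisted stage outputs.
Store = Dict[str, List[dict]]


def fetch_stage(store: Store) -> None:
    """Pretend to fetch source records; a real fetcher would call a remote source."""
    store["fetched"] = [{"ttl": "Golden Gate Bridge"}, {"ttl": "Mission Dolores"}]


def make_map_stage(mapping: Dict[str, str]) -> Callable[[Store], None]:
    """Build a mapping stage that renames source fields to target fields."""

    def map_stage(store: Store) -> None:
        # Reads only the fetch stage's output, so it can be re-run on its own.
        store["mapped"] = [
            {mapping.get(key, key): value for key, value in record.items()}
            for record in store["fetched"]
        ]

    return map_stage


store: Store = {}
fetch_stage(store)                          # run the fetch once
make_map_stage({"ttl": "title"})(store)     # first mapping attempt
make_map_stage({"ttl": "dc_title"})(store)  # revised mapping, no re-fetch needed
```

Because each stage only depends on the stored output of the previous one, a stage can also be tested locally against a small saved sample.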

Rikolti also investigates new features that speak to specific needs identified by Calisphere’s contributors, such as full-text indexing of the text appearing in object content files (e.g., PDFs).

Tech/framework (to be) used

[Diagram: Proposed Rikolti structure]

The proposed Rikolti design identifies the following technologies:

  • Serverless solutions: AWS Lambda runs quick, small jobs, which makes it well suited to building small, modular components such as thumbnail generation and metadata fetching. AWS Glue is a serverless solution for running Apache Spark, a processing engine designed to quickly process big data.
  • Data storage: AWS Simple Storage Service (S3) will store harvest data.
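As a rough sketch of how a small Lambda job and S3 storage might fit together, the example below follows the standard Python Lambda handler signature (`event`, `context`) and writes one fetched page through boto3's `put_object`. The event shape, the S3 key layout, and the `make_handler` helper are assumptions made for illustration; they are not taken from the Rikolti code. The fetch and write steps are injectable so the handler can be exercised locally without AWS.

```python
import json
from typing import Callable, Optional


def write_to_s3(bucket: str, key: str, body: bytes) -> None:
    """Default writer; needs boto3 and AWS credentials when actually called."""
    import boto3  # imported lazily so the sketch runs locally without AWS

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


def make_handler(
    fetch: Callable[[str], bytes],
    write: Callable[[str, str, bytes], None] = write_to_s3,
):
    """Build a Lambda-style handler from a fetch function and a writer."""

    def lambda_handler(event: dict, context: Optional[object] = None) -> dict:
        # Hypothetical event shape: {"collection_id": ..., "url": ..., "bucket": ...}
        body = fetch(event["url"])
        # Hypothetical key layout for storing the fetched page.
        key = f"vernacular_metadata/{event['collection_id']}/page-0.json"
        write(event["bucket"], key, body)
        return {
            "statusCode": 200,
            "body": json.dumps({"key": key, "bytes": len(body)}),
        }

    return lambda_handler


# Local usage with fakes in place of the network and S3.
written = {}
handler = make_handler(
    fetch=lambda url: b'{"records": []}',
    write=lambda bucket, key, body: written.update({(bucket, key): body}),
)
result = handler(
    {"collection_id": "123", "url": "https://example.org/api", "bucket": "demo-bucket"}
)
```

Keeping the job this small is what makes it a good fit for Lambda: one invocation fetches one page and writes one object.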

The green components in the diagram above have been prototyped:

  • Metadata fetcher: Uses AWS Lambda to fetch the vernacular metadata and content files. The fetcher on our current harvesting stack is written in Python 2 and will be updated to Python 3.
  • Full-text indexing: Uses AWS Textract to index the text within content files, such as PDFs.
  • Metadata transformation: Uses AWS Glue and Apache Spark to conduct ETL (extract, transform, load) processes that homogenize the different metadata coming from our hundreds of sources; reads input from and writes output to S3.
  • Search index: Uses Elasticsearch (likely deployed through the AWS Elasticsearch Service) to power the Calisphere search function.
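The metadata transformation step can be illustrated with a small sketch of the "homogenize" idea. It is written in plain Python rather than Spark, so it shows only the mapping logic, not how a Glue job would run it. The source types (`source_a`, `source_b`), their field names, and the two-field target model are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def as_list(value: object) -> List[str]:
    """Normalize a missing, single, or repeated value into a list of strings."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def map_source_a(record: Record) -> Record:
    """Mapper for a hypothetical source that uses 'name' and 'created'."""
    return {"title": as_list(record.get("name")), "date": as_list(record.get("created"))}


def map_source_b(record: Record) -> Record:
    """Mapper for a hypothetical source that uses 'titles' and 'year'."""
    return {"title": as_list(record.get("titles")), "date": as_list(record.get("year"))}


# One mapper per source type; adding a new source means adding one entry here.
MAPPERS: Dict[str, Callable[[Record], Record]] = {
    "source_a": map_source_a,
    "source_b": map_source_b,
}


def homogenize(source_type: str, records: List[Record]) -> List[Record]:
    """Map every record from one source into the shared target shape."""
    mapper = MAPPERS[source_type]
    return [mapper(record) for record in records]


mapped = (
    homogenize("source_a", [{"name": "Harbor view", "created": "1923"}])
    + homogenize("source_b", [{"titles": ["Street scene", "Calle"], "year": 1950}])
)
```

After mapping, records from both sources share the same field names and value types, which is what lets a single search index serve all of them.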

Development plan

Rikolti development is organized around quarterly milestones. In a nutshell, the goal for 2021 is to harvest a number of collections through to the Calisphere front-end using the Rikolti stack, while demonstrating that old and new systems can be run in parallel.

Fall 2020 (release date: December 22, 2020)

  • Determine a high-level approach and overall task list. This will include a plan for migration of a limited amount of code and data.

Winter 2021 (release date: March 31, 2021)

  • Scale up the proof of concept to work with the top three enrichment chain types. Look into workflow management systems and operator interfaces (Airflow, Steps).

Spring 2021 (release date: June 30, 2021)

  • Have a new end-to-end demo that we can use to gather feedback from Shared DAMS (Nuxeo) campuses on their content, specifically regarding full-text indexing.

Summer 2021 (release date: October 30, 2021)

  • Have a new end-to-end demo (through to the Calisphere frontend) that further develops Rikolti components, specifically around Nuxeo image harvesting and complex objects.

Stay tuned for additional updates on Rikolti development and target sprint goals. Track the project boards for the latest updates.

What is “rikolti”?

Rikolti [riˈkolti] is the Esperanto word meaning “to harvest.”

Esperanto is a constructed international auxiliary language, created in 1887 by Polish ophthalmologist L. L. Zamenhof. Zamenhof's goal was to create an easy and flexible language that would serve as a universal second language to foster world peace and international understanding. The word "Esperanto" translates into English as "one who hopes." Summarized from https://en.wikipedia.org/wiki/Esperanto.

Credits

Rikolti is in development by staff at the California Digital Library.
