
Use Cases

Jack Park edited this page Apr 29, 2018 · 2 revisions

Goal

The primary goal of tq-asr-documizer is to convert documents into machine-processable resources. Text documents are processed in these steps:

  1. Create an instance of IDocument, which serves as a container for all metadata, text resources, and processing state records.
  2. Where paragraphs exist, create collections of IParagraph objects, which serve as containers for metadata, processing state records, and sentences.
  3. Create an ISentence object for every sentence; each serves as a container for the sentence itself, its metadata, and its processing state records.
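The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the classes `DocumentSketch`, `ParagraphSketch`, and `SentenceSketch` are hypothetical stand-ins for IDocument, IParagraph, and ISentence, whose real definitions are not shown on this page. The blank-line paragraph split and the punctuation-based sentence split are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ISentence: holds the sentence text,
// its metadata, and its processing state records.
class SentenceSketch {
    final String text;
    final Map<String, String> metadata = new LinkedHashMap<>();
    final List<String> processingStates = new ArrayList<>();

    SentenceSketch(String text) {
        this.text = text;
    }
}

// Hypothetical stand-in for IParagraph: holds metadata,
// processing state records, and sentences.
class ParagraphSketch {
    final Map<String, String> metadata = new LinkedHashMap<>();
    final List<String> processingStates = new ArrayList<>();
    final List<SentenceSketch> sentences = new ArrayList<>();
}

// Hypothetical stand-in for IDocument: the top-level container.
class DocumentSketch {
    final String id;
    final Map<String, String> metadata = new LinkedHashMap<>();
    final List<String> processingStates = new ArrayList<>();
    final List<ParagraphSketch> paragraphs = new ArrayList<>();

    DocumentSketch(String id) {
        this.id = id;
    }
}

class DocumizerSketch {
    // Builds the document -> paragraph -> sentence hierarchy.
    // Assumption: paragraphs are separated by blank lines and sentences
    // end with '.', '!' or '?' followed by whitespace.
    static DocumentSketch documize(String id, String rawText) {
        DocumentSketch doc = new DocumentSketch(id);
        for (String block : rawText.split("\\R\\s*\\R")) {
            String trimmed = block.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            ParagraphSketch paragraph = new ParagraphSketch();
            for (String s : trimmed.split("(?<=[.!?])\\s+")) {
                if (!s.isEmpty()) {
                    paragraph.sentences.add(new SentenceSketch(s));
                }
            }
            paragraph.processingStates.add("sentences-split");
            doc.paragraphs.add(paragraph);
        }
        doc.processingStates.add("paragraphs-split");
        return doc;
    }

    public static void main(String[] args) {
        DocumentSketch doc = documize("doc-1",
                "First sentence. Second sentence!\n\nA new paragraph.");
        for (ParagraphSketch p : doc.paragraphs) {
            System.out.println("paragraph with " + p.sentences.size() + " sentence(s)");
        }
    }
}
```

A production sentence splitter would need to handle abbreviations, decimals, and citations, which this naive regular expression does not.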

Use Cases

Batch Processing

Static Collections

A core function is to process document collections, including but not limited to:

  • PubMed abstracts
  • PubMed full text documents
  • Textbooks from PDF files (e.g. open textbooks)
  • Other documents in PDF and other file formats

Dynamic Collections

Dynamic collections are those driven by:

  • Web spiders
  • Carrot2 clustered searches

On Demand Processing

This is essentially a local web service: components in the OpenSherlock ecosystem can request a search on a topic or the retrieval of a particular URL.

  • In the case of a web search, the system performs the search and harvests the documents it receives.
  • In the case of a particular URL, the system fetches and harvests the page.

In all cases, the system maintains a record of every document it has already fetched. Unless instructed otherwise, it does not re-fetch a document that is already on record.
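The fetch-once behavior can be illustrated with a small sketch. It assumes documents are identified by their URL string and that the record is an in-memory set. The class `FetchRegistrySketch`, its `fetchIfNew` method, and the `forceRefetch` flag are hypothetical names chosen for illustration; how the actual system stores its record or accepts an override is not described here.

```java
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of "do not re-fetch unless instructed".
class FetchRegistrySketch {
    // Record of every URL already fetched (assumption: in memory only).
    private final Set<String> fetched = ConcurrentHashMap.newKeySet();
    // Pluggable fetcher so the sketch needs no network access;
    // it is assumed to return a non-null page body.
    private final Function<String, String> fetcher;

    FetchRegistrySketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the fetched content, or empty if the URL is already on
    // record and the caller did not ask for a re-fetch.
    Optional<String> fetchIfNew(String url, boolean forceRefetch) {
        boolean firstTime = fetched.add(url); // false if already recorded
        if (!firstTime && !forceRefetch) {
            return Optional.empty();
        }
        return Optional.of(fetcher.apply(url));
    }

    public static void main(String[] args) {
        FetchRegistrySketch registry =
                new FetchRegistrySketch(url -> "content of " + url);
        String url = "https://example.org/page";
        System.out.println(registry.fetchIfNew(url, false).isPresent());
        System.out.println(registry.fetchIfNew(url, false).isPresent());
        System.out.println(registry.fetchIfNew(url, true).isPresent());
    }
}
```

Note that this sketch records a URL before the fetch completes, so a failed fetch would still be marked as seen; a real implementation would likely record only successful fetches and persist the record across runs.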
