Use Cases
Jack Park edited this page Apr 29, 2018 · 2 revisions
The primary goal of tq-asr-documizer is to convert documents into machine-processable resources. In this system, text documents are processed in these steps:
- Create an IDocument instance, which serves as a container for all metadata, text resources, and processing state records.
- Where paragraphs exist, create collections of IParagraph objects, which serve as containers for metadata, processing state records, and sentences.
- Create an ISentence object for every sentence. These serve as containers for all metadata, the sentence itself, and all processing state records.
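The container hierarchy described above can be illustrated with a minimal Python sketch. It is not the tq-asr-documizer implementation: the `Document`, `Paragraph`, and `Sentence` classes are hypothetical stand-ins for IDocument, IParagraph, and ISentence, and the blank-line paragraph boundary, the punctuation-based sentence splitter, and the state labels are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical stand-in for ISentence: sentence text, metadata, state records."""
    text: str
    metadata: dict = field(default_factory=dict)
    state: list = field(default_factory=list)


@dataclass
class Paragraph:
    """Hypothetical stand-in for IParagraph: metadata, state records, sentences."""
    sentences: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    state: list = field(default_factory=list)


@dataclass
class Document:
    """Hypothetical stand-in for IDocument: metadata, text, state records."""
    text: str
    paragraphs: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    state: list = field(default_factory=list)


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_document(text, metadata=None):
    doc = Document(text=text, metadata=dict(metadata or {}))
    # Assumption: blank lines mark paragraph boundaries.
    for block in re.split(r"\n\s*\n", text):
        if not block.strip():
            continue
        para = Paragraph(metadata={"index": len(doc.paragraphs)})
        for sentence_text in split_sentences(block):
            para.sentences.append(Sentence(text=sentence_text))
        para.state.append("sentences-split")  # illustrative state label
        doc.paragraphs.append(para)
    doc.state.append("paragraphs-split")  # illustrative state label
    return doc
```

Each level keeps its own metadata and processing state, so a later processing stage could mark an individual sentence as handled without touching its paragraph or document.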
In general, a core function is to process document collections such as, but not limited to:
- PubMed abstracts
- PubMed full text documents
- Textbooks from PDF files (e.g. open textbooks)
- Other documents from PDF and other files
Dynamic collections are those driven by:
- Web spiders
- Carrot2 clustered searches
This is fundamentally a local web service: operations elsewhere in the OpenSherlock ecosystem can request a search on a topic or the retrieval of a particular URL.
- In the case of a web search, the system performs the search and harvests the returned documents.
- In the case of a particular URL, the system fetches and harvests the page.
In all cases, the system maintains a record of every document it has already fetched. Unless otherwise instructed, it will not re-fetch a document that is already on record.
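The fetch-once behavior can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `Harvester` class, its method names, and the `force` flag are hypothetical, the record is a plain in-memory dictionary keyed by URL (a real system would presumably persist it), and the `fetch` and `search` callables are injected so the sketch runs without network access.

```python
class Harvester:
    """Hypothetical harvester that remembers which URLs it has already fetched."""

    def __init__(self, fetch, search):
        self._fetch = fetch    # assumed callable: url -> page text
        self._search = search  # assumed callable: topic -> list of result URLs
        self._fetched = {}     # record of fetched documents, keyed by URL

    def harvest_url(self, url, force=False):
        # Skip the fetch when the URL is on record, unless the caller forces it.
        if url in self._fetched and not force:
            return self._fetched[url]
        page = self._fetch(url)
        self._fetched[url] = page
        return page

    def harvest_topic(self, topic, force=False):
        # Run the search, then pass every result through the same record check.
        return [self.harvest_url(url, force) for url in self._search(topic)]


# Usage with stand-in fetch and search functions (no network involved).
fetch_log = []

def fake_fetch(url):
    fetch_log.append(url)
    return "content of " + url

def fake_search(topic):
    return ["http://example.org/" + topic + "/1", "http://example.org/" + topic + "/2"]

harvester = Harvester(fake_fetch, fake_search)
harvester.harvest_topic("genes")                                 # fetches both results
harvester.harvest_url("http://example.org/genes/1")              # on record, no fetch
harvester.harvest_url("http://example.org/genes/1", force=True)  # explicit re-fetch
```

Routing topic searches through `harvest_url` means both request types share one record, so a page found by a search is not fetched again when it is later requested directly by URL.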