louismullie edited this page Feb 7, 2012 · 72 revisions

Treat is a toolkit for natural language processing and computational linguistics. It provides a common API for a number of existing tools in C, Ruby and Java for document retrieval, parsing, annotation, and information extraction.


**Current features**
  • Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus)
  • Text retrieval with indexing and full-text search (Ferret)
  • Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju)
  • Word inflectors, including stemmers, conjugators, declensors, and number inflectors
  • Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages)
  • Language, date, time and named entity extraction, as well as coreference resolution
  • Topic extraction (LDA or Reuters-trained model)
  • Simple text statistics (frequency, TF*IDF)
  • Serialization of annotated entities to YAML or XML format
  • Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats
  • Linguistic resources, including full ISO-639-1 and ISO-639-2 support, and tag alignments for five treebanks.
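
To make the TF*IDF statistic listed above concrete, here is a minimal plain-Ruby sketch of one common formulation. It does not use Treat's API: the method names `term_frequency`, `inverse_document_frequency` and `tf_idf` are hypothetical, chosen for illustration, and the documents are assumed to be already tokenized into arrays of strings.

```ruby
# Illustrative sketch only; these helpers are NOT part of Treat.

# Term frequency: occurrences of the term divided by the document length.
def term_frequency(term, tokens)
  return 0.0 if tokens.empty?
  tokens.count(term).to_f / tokens.size
end

# Inverse document frequency: natural log of
# (number of documents / number of documents containing the term).
# Returns 0.0 when the term appears in no document.
def inverse_document_frequency(term, corpus)
  containing = corpus.count { |tokens| tokens.include?(term) }
  return 0.0 if containing.zero?
  Math.log(corpus.size.to_f / containing)
end

# TF*IDF is the product of the two scores above.
def tf_idf(term, tokens, corpus)
  term_frequency(term, tokens) * inverse_document_frequency(term, corpus)
end

# A toy corpus of pre-tokenized documents (assumed input format).
corpus = [
  %w[the cat sat on the mat],
  %w[the dog chased the cat],
  %w[a bird sang]
]

corpus.each_with_index do |tokens, i|
  puts "doc #{i}: cat=#{tf_idf('cat', tokens, corpus).round(4)} " \
       "bird=#{tf_idf('bird', tokens, corpus).round(4)}"
end
```

A term that occurs in few documents, such as `bird` here, scores higher in the document that contains it than a term spread across most of the corpus.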

**Caveats/Planned features**
  • The few native Ruby statistics algorithms are slow. Some of the highly recursive code in the core Tree and Entity classes will be ported to inline C.
  • The XML deserializer is currently broken and needs to be fixed.
  • The API to the Stanford Coreference Resolver and the NER system will need to be integrated with the parser to allow retrieval of coreferences/tags at the same time as the parse tree. Currently, it is only possible to retrieve them separately.
  • Tests need to be improved for extractors and processors.
  • A faster WordNet API in Java will be interfaced.

**License**

This software is released under the GPL and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
