louismullie edited this page Feb 7, 2012 · 72 revisions

Treat is a toolkit for natural language processing and computational linguistics. It provides a common API for a number of existing tools in C, Ruby and Java for document retrieval, parsing, annotation, and information extraction.

Current features

  • Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus)
  • Text retrieval with indexing and full-text search (Ferret)
  • Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju)
  • Word inflectors, including stemmers, conjugators, declensors, and number inflection
  • Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages)
  • Language, date, time and named entity extraction, as well as coreference resolution
  • Topic extraction (LDA or Reuters-trained model)
  • Simple text statistics (frequency, TF*IDF)
  • Serialization of annotated entities to YAML or XML format
  • Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats
  • Linguistic resources, including full ISO-639-1 and ISO-639-2 support, and tag alignments for five treebanks
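
To give a feel for what the TF*IDF statistic in the list above measures, here is a minimal plain-Ruby sketch. It does not use Treat's API; the `tf_idf` helper and the toy corpus are made up for illustration, and the weighting shown (raw term frequency times the natural log of inverse document frequency) is only one of several common variants.

```ruby
# Plain-Ruby TF*IDF sketch. Not Treat's implementation; `tf_idf` is a
# hypothetical helper and the corpus is a toy example.
# Each document is an array of lowercased tokens.
def tf_idf(term, doc, corpus)
  return 0.0 if doc.empty?

  # Term frequency: share of the document's tokens that equal the term.
  tf = doc.count(term).to_f / doc.size

  # Document frequency: number of documents that contain the term.
  docs_with_term = corpus.count { |d| d.include?(term) }
  return 0.0 if docs_with_term.zero?

  # Inverse document frequency: terms found in fewer documents weigh more.
  idf = Math.log(corpus.size.to_f / docs_with_term)
  tf * idf
end

corpus = [
  %w[hungary is a country in europe],
  %w[budapest is the capital of hungary],
  %w[ruby is a programming language]
]

corpus.each_with_index do |doc, i|
  puts "doc #{i}: #{tf_idf('hungary', doc, corpus).round(4)}"
end
```

A word that occurs in every document (here, "is") scores zero, which is the point of the IDF factor: it discounts terms that do not help tell documents apart.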

New in version 0.2.4

Here's how to access the Ferret API:

c = Collection 'folder'                                    # The folder is searched recursively for supported file types.
c.index                                                    # Indexes are stored under folder/.index
c.search(:ferret, :q => 'hungary').each do |doc|           # Search for documents containing the word "hungary".
  # Do processing/annotation/extraction
  puts doc.file         
end
c.serialize(:yaml, :file => 'test.yaml')                   # Save annotated entities to reopen later.

Pretty sweet, no?
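
If you're wondering what indexing buys you before a search, the rough idea is an inverted index: a map from each word to the documents that contain it. The sketch below is a toy version in plain Ruby. It is not how Ferret works internally and it doesn't touch Treat's API; `build_index`, `search` and the sample documents are hypothetical names chosen for the example.

```ruby
require 'set'

# Toy inverted index (illustration only, not Ferret's implementation).
# Maps each lowercased word to the set of document names containing it.
def build_index(docs)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  docs.each do |name, text|
    text.downcase.scan(/\w+/).each { |word| index[word] << name }
  end
  index
end

# Look up a single word; returns matching document names, sorted.
def search(index, query)
  index.fetch(query.downcase, Set.new).to_a.sort
end

docs = {
  'a.txt' => 'Hungary joined the European Union in 2004.',
  'b.txt' => 'Ruby is a programming language.',
  'c.txt' => 'Budapest is the capital of Hungary.'
}

index = build_index(docs)
puts search(index, 'hungary')
```

Building the index costs one pass over the text, after which each query is a single hash lookup instead of a scan of every file.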

Caveats/Planned features

  • The few native Ruby statistics algorithms are slow. Some of the highly recursive code in the core Tree and Entity classes will be ported to inline C.
  • The XML deserializer is currently broken and needs to be fixed.
  • The API to the Stanford Coreference Resolver and the NER system will need to be integrated with the parser to allow retrieval of coreferences/tags at the same time as the parse tree. Currently, it is only possible to retrieve them separately.
  • Tests need to be improved for extractors and processors.
  • A faster Java-based WordNet API will be integrated.

License

This software is released under the GPL and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
