Home

Treat is a toolkit for natural language processing and computational linguistics. It provides a common API for a number of existing tools written in C, Ruby and Java that perform document retrieval, parsing, annotation, and information extraction.

Resources

Read the latest documentation.
See how to install Treat.
Learn how to use Treat.
Help out by contributing to the project.
View a list of papers about tools included in this toolkit.
Tutorials coming soon!

Current features

Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus)
Document indexing and retrieval (Ferret)
Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju)
Word inflectors, including stemmers, conjugators, declensors, and number inflection
Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages)
Language, date, time and named entity extraction, as well as coreference resolution
Topic extraction (LDA or Reuters-trained model)
Simple text statistics (frequency, TF*IDF)
Serialization of annotated entities to YAML or XML format
Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats
Linguistic resources, including full ISO-639-1 and ISO-639-2 support, and tag alignments for five treebanks
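
To make the "simple text statistics" item above concrete, here is a minimal plain-Ruby sketch of TF*IDF. It is not Treat's implementation and does not use Treat's API; the method names (term_frequency, inverse_document_frequency, tf_idf) and the sample documents are assumptions made for illustration, and the IDF variant used here is the plain log(N / document count) form.

```ruby
# Plain-Ruby TF*IDF sketch (illustrative only, not Treat's code).
# Each document is an array of already-tokenized, lowercased words.

# Term frequency: occurrences of the term divided by the document length.
def term_frequency(term, doc)
  return 0.0 if doc.empty?
  doc.count(term).to_f / doc.size
end

# Inverse document frequency: log(N / number of documents containing the term).
# Returns 0.0 when no document contains the term, to avoid dividing by zero.
def inverse_document_frequency(term, docs)
  containing = docs.count { |doc| doc.include?(term) }
  return 0.0 if containing.zero?
  Math.log(docs.size.to_f / containing)
end

# TF*IDF score of a term in one document, relative to the whole collection.
def tf_idf(term, doc, docs)
  term_frequency(term, doc) * inverse_document_frequency(term, docs)
end

# Hypothetical sample collection.
SAMPLE_DOCS = [
  %w[hungary is a country in europe],
  %w[the parser builds a tree],
  %w[a tree has many leaves]
].freeze

SAMPLE_DOCS.first.uniq.each do |term|
  puts format('%-8s %.4f', term, tf_idf(term, SAMPLE_DOCS.first, SAMPLE_DOCS))
end
```

A word that appears in every document (here, "a") scores zero, while a word found in only one document scores highest in that document.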

New in version 0.2.4

Here's how to access the Ferret API:

c = Collection 'folder'                                    # Recursively searched for supported file types.
c.index                                                    # Indexes are stored under folder/.index
c.search(:ferret, :q => 'hungary').each do |doc|           # Search for a document with the word "hungary".
  # Do processing/annotation/extraction
  puts doc.file         
end
c.serialize(:yaml, :file => 'test.yaml')                   # Save annotated entities to reopen later.

Pretty sweet, no?
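
For readers curious what saving annotated entities "to reopen later" can involve, the sketch below round-trips a small tree through YAML using only Ruby's standard library. It is an illustration under stated assumptions, not Treat's serializer: the Node struct, its fields, and the hash layout are hypothetical, and the format Treat actually writes may differ.

```ruby
require 'yaml'

# Hypothetical stand-in for an annotated entity (not a Treat class).
Node = Struct.new(:type, :value, :features, :children)

# Convert a node and its descendants to plain hashes and arrays,
# which YAML can dump and safely load again.
def node_to_hash(node)
  {
    'type'     => node.type,
    'value'    => node.value,
    'features' => node.features,
    'children' => node.children.map { |child| node_to_hash(child) }
  }
end

# Rebuild the node tree from the plain-hash form.
def hash_to_node(hash)
  Node.new(hash['type'], hash['value'], hash['features'],
           hash['children'].map { |child| hash_to_node(child) })
end

def dump_tree(node)
  YAML.dump(node_to_hash(node))
end

def load_tree(yaml)
  hash_to_node(YAML.safe_load(yaml))
end

# Hypothetical sample tree: one sentence with two tagged words.
sentence = Node.new('sentence', 'Hungary votes.', { 'language' => 'english' }, [
  Node.new('word', 'Hungary', { 'tag' => 'NNP' }, []),
  Node.new('word', 'votes',   { 'tag' => 'VBZ' }, [])
])

yaml = dump_tree(sentence)
puts yaml
puts load_tree(yaml) == sentence
```

Going through plain hashes keeps the YAML free of Ruby-specific object tags, so it can be read back with YAML.safe_load.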

Caveats/Planned features

The few native Ruby statistics algorithms are slow. Some of the highly recursive code in the core Tree and Entity classes will be ported to inline C.
The XML deserializer is currently broken and needs to be fixed.
The API to the Stanford Coreference Resolver and the NER system will need to be integrated with the parser to allow retrieval of coreferences/tags at the same time as the parse tree. Currently, it is only possible to retrieve them separately.
Tests need to be improved for extractors and processors.
An interface to a faster, Java-based WordNet API will be added.

License

This software is released under the GPL and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
