Home
louismullie edited this page Feb 7, 2012 · 72 revisions
Treat is a toolkit for natural language processing and computational linguistics. It provides a common API for a number of existing tools in C, Ruby and Java for document retrieval, parsing, annotation, and information extraction.
Resources
- Read the latest documentation.
- See how to install Treat.
- Learn how to use Treat.
- Help out by contributing to the project.
- View a list of papers about tools included in this toolkit.
- Tutorials coming soon!
Current features
- Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus)
- Document indexing and retrieval (Ferret)
- Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju)
- Word inflectors, including stemmers, conjugators, declensors, and number inflection
- Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages)
- Language, date, time and named entity extraction, as well as coreference resolution
- Topic extraction (LDA or Reuters-trained model)
- Simple text statistics (frequency, TF*IDF)
- Serialization of annotated entities to YAML or XML format
- Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats
- Linguistic resources, including full ISO-639-1 and ISO-639-2 support, and tag alignments for five treebanks
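To make the "TF*IDF" statistic in the list above concrete, here is a minimal plain-Ruby sketch of the textbook formula (term frequency multiplied by the log of inverse document frequency). It does not use Treat and is not necessarily how Treat computes the value; the method names `tokenize`, `term_frequencies`, `inverse_document_frequencies` and `tf_idf`, the naive ASCII tokenizer, and the sample documents are all assumptions made for illustration.

```ruby
# Plain-Ruby TF*IDF sketch. All names are illustrative and not part of Treat.

# Split text into lowercase ASCII word tokens (a deliberately naive tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Term frequency: occurrences of each term divided by the document length.
def term_frequencies(tokens)
  counts = Hash.new(0)
  tokens.each { |t| counts[t] += 1 }
  counts.transform_values { |c| c.to_f / tokens.size }
end

# Inverse document frequency: log(N / number of documents containing the term).
def inverse_document_frequencies(tokenized_docs)
  doc_counts = Hash.new(0)
  tokenized_docs.each { |tokens| tokens.uniq.each { |t| doc_counts[t] += 1 } }
  n = tokenized_docs.size.to_f
  doc_counts.transform_values { |df| Math.log(n / df) }
end

# Returns one hash per document, mapping each term to its TF*IDF score.
def tf_idf(docs)
  tokenized = docs.map { |d| tokenize(d) }
  idf = inverse_document_frequencies(tokenized)
  tokenized.map do |tokens|
    term_frequencies(tokens).map { |term, tf| [term, tf * idf[term]] }.to_h
  end
end

# Hypothetical sample documents.
docs = [
  'Hungary is a country in Central Europe',
  'Budapest is the capital of Hungary',
  'Ruby is a programming language'
]
scores = tf_idf(docs)

# Print the three highest-scoring terms of each document.
scores.each_with_index do |doc_scores, i|
  top = doc_scores.sort_by { |_, s| -s }.first(3).map(&:first)
  puts "doc #{i}: #{top.join(', ')}"
end
```

A term that appears in every document (here, "is") gets an inverse document frequency of zero, so it drops out of the ranking, while terms unique to one document score highest.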
New in version 0.2.4
Here's how to access the Ferret API:
c = Collection 'folder' # Recursively searched for supported file types.
c.index # Indexes are stored under folder/.index
c.search(:ferret, :q => 'hungary').each do |doc| # Search for a document with the word "hungary".
# Do processing/annotation/extraction
puts doc.file
end
c.serialize(:yaml, :file => 'test.yaml') # Save annotated entities to reopen later.
Pretty sweet, no?
Caveats/Planned features
- The few native Ruby statistics algorithms are slow. Some of the highly recursive code in the core Tree and Entity classes will be ported to inline C.
- The XML deserializer is currently broken and needs to be fixed.
- The API to the Stanford Coreference Resolver and the NER system will need to be integrated with the parser to allow retrieval of coreferences/tags at the same time as the parse tree. Currently, it is only possible to retrieve them separately.
- Tests need to be improved for extractors and processors.
- A faster WordNet API in Java will be interfaced.
License
This software is released under the GPL and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.