DSL for textual entities

Treat is built around a tree structure that represents a wide variety of structural elements found within a text, from text collections to documents, paragraphs and words. The Treat DSL (mixed in by calling include Treat::Core::DSL) lets you easily build these textual entities from the data you have. For example, the following loads each file found in a folder into a collection object:

c = collection './folder'
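
All of the examples on this page assume that the gem has been loaded and the DSL mixed in beforehand; a minimal setup sketch:

require 'treat'
include Treat::Core::DSL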

Documents can be created from various formats. The .txt format is supported natively, but HTML, DOC and PDF are also supported with the proper [binaries](https://github.com/louismullie/treat/wiki/Manual#installing-binaries).

d = document 'text.txt'
d = document 'page.html'
d = document 'essay.doc'
d = document 'paper.pdf'

You can even load images directly, provided Google's Ocropus is installed:

d = document 'image.jpg'

Use port install ocropus poppler antiword graphviz (via MacPorts) to install all of the binaries that Treat supports.

If you input a URL, a local copy of the file found at that location will be downloaded automatically:

d = document 'http://www.website.com/page.html'

Setting Treat.core.verbosity.silence = false enables progress bars for downloads (among other things).
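
For instance, to watch the progress bar while fetching a remote page (a sketch using only the setting just described):

Treat.core.verbosity.silence = false
d = document 'http://www.website.com/page.html'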

Finally, you can create entities from plain Ruby strings. Any type of entity can be created from a string, except for documents and collections. For example:

sent = sentence 'Those who dream by day know the most.'
phra = phrase 'who dream by day'
word = word 'who'

You can also compose entities in trees any way you like:

phra = phrase 'Obama', 'Sarkozy', 'Meeting'

para = paragraph 'Obama and Sarkozy met on January 1st to ' +
  'investigate the possibility of a new rescue plan. Nicolas ' +
  'Sarkozy is to meet Merkel next Tuesday in Berlin.'

sect = section title(phra), para
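
Printing the tree, as demonstrated in the next example, is a quick way to verify the composed structure:

sect.print_tree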

A simple example: Tokenization

You can tokenize a sentence as follows:

s = sentence 'Those who dream by day know most.'
s.tokenize
s.print_tree

To apply a function recursively on a tree, use apply. For example, to tag all tokens in our tokenized sentence:

s.apply :tag

Any "task" (such as part-of-speech tagging) that can be "applied" is an abstraction for several different workers that can offer to perform that task. In this case, several workers are available:

s.apply :tag => :stanford # or
s.apply :tag => :brill    # or
s.apply :tag => :lingua

If language detection is turned on, the worker used to perform a task is also chosen based on the tag set available for the detected language.
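
Turning detection on is a configuration change; a sketch, assuming the Treat.core.language.detect setting described in the manual:

Treat.core.language.detect = true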

# Penn Treebank tags.
sent = sentence "This is an English sentence, prove it to me!"
sent.apply(:parse).print_tree

# Stuttgart-Tübingen tags. (The German sentence reads: "Because of their
# anniversary, we are preparing an excursion to Munich for our parents.")
sent = sentence "Wegen ihres Jahrestages bereiten wir unseren " +
                "Eltern eine Exkursion nach München vor."
sent.apply(:parse).print_tree

# Paris7 tags. (The French sentence reads: "A sentence in French to bamboozle the English.")
sent = sentence "Une phrase en Français pour entourlouper les Anglais."
sent.apply(:tokenize, :tag).print_tree

Iterating Entities

In this example, we'll show how to iterate over the subtree of any entity, either (a) by entity type or (b) by POS category.

# Iterate all the words.
s.each_word do |word|
  puts word.tag
  puts word.stem
end

# Tag all words with category.
s.apply :category

# May be called for any POS.
s.each_noun do |noun|
  puts noun.hyponyms
  puts noun.hypernyms
end
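
The same iterator pattern works for any other POS category once :category has been applied; for instance, for verbs (assuming the WordNet-backed :synonyms task, by analogy with the hyponyms above):

s.each_verb do |verb|
  puts verb.synonyms
end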

Text structuring, sentence segmentation and syntactic parsing

Here we'll create a section of text, split out the title and paragraph, segment the sentences and parse the syntactic structure using the Stanford parser.

section = section "Obama-Sarkozy Meeting\n" +
"Obama and Sarkozy met on January 1st to investigate " +
"the possibility of a new rescue plan. President " +
"Sarkozy is to meet Merkel next Tuesday in Berlin."

# Chunk: split the titles and paragraphs.
# Segment: perform sentence segmentation.
# Parse: parse the syntax of each sentence.
section.apply :chunk, :segment, :parse

# View the tree structure.
section.print_tree

Once this is done, it's easy to get quick information about the text:

# Get some basic info on the text.
puts section.title
puts section.sentence_count
puts section.word_count

section.apply :category
puts section.noun_count
puts section.frequency_of 'president'

section.each_phrase_with_tag('NP') do |phrase|
  puts phrase.to_s
end

DSL for Machine Learning

In addition to the DSL for textual entities, Treat provides a DSL to create and solve machine learning problems. As an example, let's take the following task: given a set of sentences retrieved from a Wikipedia page, determine which ones are non-informative based on punctuation count and word count.

First, we define the question :is_junk and specify that it applies to sentences.

qn = question(:is_junk, :sentence)

Then, we can define our classification problem, which is to solve the above question using punctuation and word counts:

pb = problem(qn,
  feature(:punctuation_count),
  feature(:word_count))

Let's get some documents to work on: one will be used for training, and the other for evaluation. We then need to preprocess both documents by chunking, segmentation and tokenization.

d1 = document('http://en.wikipedia.org/wiki/NOD_mouse')
d2 = document('http://en.wikipedia.org/wiki/Academic_studies_about_Wikipedia')
[d1,d2].apply(:chunk, :segment, :tokenize)
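
The sentence index ranges used below depend on the content of each page at the time it is fetched, so it is worth checking the counts first (sentence_count was introduced earlier):

puts d1.sentence_count
puts d2.sentence_count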

We set the annotation corresponding to our question (:is_junk) on the first document, for training. We also define a separate annotation (arbitrarily called :is_true_junk) on the second document, to be used as a gold standard for evaluation:

# Answer our problem to create a training set.
d1.sentences[0..16].each  { |s| s.set :is_junk, 0 }
d1.sentences[17..-1].each { |s| s.set :is_junk, 1 }

# Define our gold standard results for evaluation.
d2.sentences[0..80].each  { |s| s.set :is_true_junk, 0 }
d2.sentences[81..-1].each { |s| s.set :is_true_junk, 1 }

The last step is to export from the annotated document a training data set conforming to our problem. We can then use that data set to classify new sentences.

d_set = d1.export(pb)

# Initialize counters, then classify each sentence
# of the test document and tally the results.
tp = fp = tn = fn = 0
d2.sentences.each do |s|
  pred = s.classify(:id3, training: d_set)
  if pred == 1
    tp += 1 if s.is_true_junk == 1
    fp += 1 if s.is_true_junk == 0
  else
    tn += 1 if s.is_true_junk == 0
    fn += 1 if s.is_true_junk == 1
  end
end

puts "Precision: #{tp/(tp + fp)}"
puts "Recall: #{tp/(tp + fn)}"

Success! Our simple algorithm has 90% precision and 92% recall on our test document.
