Elephant Bird Lucene

isnotinvain edited this page Jan 3, 2013 · 13 revisions

Elephant-Bird provides two modules that make it easier to build and query Lucene indexes in HDFS from either a MapReduce job or a Pig job.

Module layout

Elephant-Bird has two Lucene modules:

  • elephant-bird-lucene, which contains
    • LuceneIndexOutputFormat for creating Lucene indexes in HDFS from an MR job
    • LuceneIndexInputFormat for querying Lucene indexes in HDFS from an MR job
    • HdfsMergeTool for merging Lucene indexes in HDFS
  • elephant-bird-pig-lucene, which contains
    • LuceneIndexStorage, which wraps LuceneIndexOutputFormat
    • LuceneIndexLoader, which wraps LuceneIndexInputFormat

Creating indexes

First, create a class that extends LuceneIndexOutputFormat. This is where you define exactly how your index is built, including which Lucene features to use. The primary functions of this class are to build a Lucene Document from a given key and value, and to provide an analyzer.

Here's an example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.Version;

import com.twitter.elephantbird.mapreduce.output.LuceneIndexOutputFormat;

/**
 * This OutputFormat assumes that the key is a user id, and the value is the text of a tweet.
 * It builds an index that is searchable by tokens in the tweet text, and by user id.
 */
public class TweetIndexOutputFormat extends LuceneIndexOutputFormat<LongWritable, Text> {
  // create some lucene Fields. These can be anything you'd like, such as DocFields
  public static final String TWEET_TEXT_FIELD = "tweet_text";
  public static final String USER_ID_FIELD = "user_id";
  private final Field tweetTextField = new TextField(TWEET_TEXT_FIELD, "", Field.Store.YES);
  private final Field userIdField = new LongField(USER_ID_FIELD, 0L, Field.Store.YES);
  private final Document doc = new Document();

  public TweetIndexOutputFormat() {
    doc.add(tweetTextField);
    doc.add(userIdField);
  }

  // This is where we convert an MR key/value pair into a Lucene Document.
  // This part is up to you, depending on how you want your data indexed / stored / tokenized / etc.
  @Override
  protected Document buildDocument(LongWritable userId, Text tweetText) throws IOException {
    tweetTextField.setStringValue(tweetText.toString());
    userIdField.setLongValue(userId.get());
    return doc;
  }

  // Provide an analyzer to use. If you don't want to use an analyzer
  // (if your data is pre-tokenized perhaps) you can simply not override this method.
  @Override
  protected Analyzer newAnalyzer(Configuration conf) {
    return new SimpleAnalyzer(Version.LUCENE_40);
  }
}
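
The example above makes tweets searchable "by tokens in the tweet text", so it helps to see which tokens SimpleAnalyzer actually produces. SimpleAnalyzer splits text at every non-letter character and lower-cases the result. The sketch below approximates that behavior in plain Java with no Lucene dependency. The class SimpleAnalyzerSketch and its tokenize method are hypothetical names used only for illustration; they are not part of Lucene or Elephant-Bird, and the sketch ignores details such as Lucene's maximum token length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or Elephant-Bird.
class SimpleAnalyzerSketch {

  // Approximates SimpleAnalyzer: split at every non-letter character
  // and lower-case each resulting token.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      if (Character.isLetter(cp)) {
        // Letters extend the current token.
        current.appendCodePoint(Character.toLowerCase(cp));
      } else if (current.length() > 0) {
        // Any non-letter (digit, punctuation, whitespace) ends the token.
        tokens.add(current.toString());
        current.setLength(0);
      }
      i += Character.charCount(cp);
    }
    // Flush the final token if the text ended on a letter.
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // prints [hello, world, it, s]
    System.out.println(tokenize("Hello @World, it's 2013!"));
  }
}
```

Note that digits and punctuation are dropped entirely, so a tweet indexed this way would not be findable by "2013" or by the full "@World" mention. If that matters for your data, return a different analyzer from newAnalyzer.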