Elephant Bird Lucene
isnotinvain edited this page Jan 3, 2013 · 13 revisions
Elephant-Bird provides two modules that make it easier to build and query Lucene indexes in HDFS from either a MapReduce job or a Pig job.
The two Lucene modules are:
- elephant-bird-lucene, which contains:
  - LuceneIndexOutputFormat, for creating Lucene indexes in HDFS from a MR job
  - LuceneIndexInputFormat, for querying Lucene indexes in HDFS from a MR job
  - HdfsMergeTool, for merging Lucene indexes in HDFS
- elephant-bird-pig-lucene, which contains:
  - LuceneIndexStorage, which wraps LuceneIndexOutputFormat
  - LuceneIndexLoader, which wraps LuceneIndexInputFormat
First, create a class that extends LuceneIndexOutputFormat. This is where you set up exactly how your index is built, including which features of Lucene to use. The primary functions of this class are to build a Lucene Document from a given key and value, and to provide an analyzer.
Here's an example:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.Version;

import com.twitter.elephantbird.mapreduce.output.LuceneIndexOutputFormat;

/**
 * This OutputFormat assumes that the key is a user id, and the value is the text of a tweet.
 * It builds an index that is searchable by tokens in the tweet text, and by user id.
 */
public class TweetIndexOutputFormat extends LuceneIndexOutputFormat<LongWritable, Text> {

  // create some lucene Fields. These can be anything you'd like, such as DocFields
  public static final String TWEET_TEXT_FIELD = "tweet_text";
  public static final String USER_ID_FIELD = "user_id";

  private final Field tweetTextField = new TextField(TWEET_TEXT_FIELD, "", Field.Store.YES);
  private final Field userIdField = new LongField(USER_ID_FIELD, 0L, Field.Store.YES);
  private final Document doc = new Document();

  public TweetIndexOutputFormat() {
    doc.add(tweetTextField);
    doc.add(userIdField);
  }

  // This is where we convert an MR key value pair into a lucene Document
  // This part is up to you, depending on how you want your data indexed / stored / tokenized / etc.
  @Override
  protected Document buildDocument(LongWritable userId, Text tweetText) throws IOException {
    tweetTextField.setStringValue(tweetText.toString());
    userIdField.setLongValue(userId.get());
    return doc;
  }

  // Provide an analyzer to use. If you don't want to use an analyzer
  // (if your data is pre-tokenized perhaps) you can simply not override this method.
  @Override
  protected Analyzer newAnalyzer(Configuration conf) {
    return new SimpleAnalyzer(Version.LUCENE_40);
  }
}
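To get a feel for which tokens the tweet text becomes searchable by, it helps to know that Lucene's SimpleAnalyzer splits text at non-letter characters and lowercases the result. The sketch below imitates that behavior in plain Java so it can be run without Lucene or Hadoop on the classpath. The class name SimpleAnalyzerSketch and its tokenize method are hypothetical and are not part of Lucene or Elephant-Bird. The sketch also ignores details of the real analyzer, such as its maximum token length.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in that approximates the tokenization performed by
 * Lucene's SimpleAnalyzer: split at non-letters, then lowercase.
 * This is an illustration only, not a Lucene or Elephant-Bird class.
 */
final class SimpleAnalyzerSketch {

  /** Returns the lowercased runs of letters found in the given text. */
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int codePoint = text.codePointAt(i);
      i += Character.charCount(codePoint);
      if (Character.isLetter(codePoint)) {
        // Letters extend the current token, lowercased as they are added.
        current.appendCodePoint(Character.toLowerCase(codePoint));
      } else if (current.length() > 0) {
        // Any non-letter (digit, space, punctuation) ends the current token.
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Prints [indexing, tweets, with, elephant, bird]
    System.out.println(tokenize("Indexing 42 tweets with Elephant-Bird!"));
  }
}
```

Note that digits are dropped entirely and hyphenated words are split, so under this analyzer a tweet containing "Elephant-Bird" would match the tokens "elephant" and "bird" rather than the hyphenated form. If that is not what you want for your data, newAnalyzer is the place to return a different analyzer.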