Skip to content

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

License

Notifications You must be signed in to change notification settings

yuxutw/elephant-bird

 
 

Repository files navigation

Elephant Bird

About

Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDe, HBase miscellanea, etc. The majority of these are in production at Twitter running over data every day.

Join the conversation about Elephant-Bird on the developer mailing list.

License

Apache licensed.

Quickstart

  1. Get the code: git clone git://github.com/kevinweil/elephant-bird.git
  2. Build the jar: ant (or ant -p to view all targets)
  3. Explore what's available: ant javadoc and ant examples

Note: For any of the LZO-based code, make sure that the native LZO libraries are on your java.library.path. Generally this is done by setting JAVA_LIBRARY_PATH in pig-env.sh or hadoop-env.sh. You can also add lines like

PIG_OPTS=-Djava.library.path=/path/to/my/libgplcompression/dir

to pig-env.sh. See the instructions for Hadoop-LZO for more details.

There are a few simple examples that use the input formats. Note how the Protocol Buffer-based formats work, and also note that the examples build file uses the custom codegen stuff. See below for more about that.

NOTE: This is an experimental branch for working with Pig 0.8. It may not work. Caveat emptor.

Maven repository

Elephant Bird takes advantage of Github's raw interface and self-hosts a maven repository inside the git repo itself. To use the maven repo, simply add https://raw.github.com/kevinweil/elephant-bird/master/repo as a maven repo in the system you use to manage dependencies.

For example, with Ivy you would add the following resolver in ivysettings.xml:

<ibiblio name="elephant-bird-repo" m2compatible="true"
         root="https://raw.github.com/kevinweil/elephant-bird/master/repo"/>

And include elephant-bird as a dependency in ivy.xml:

<dependency org="com.twitter" name="elephant-bird" rev="${elephant-bird.version}"/>

Version compatibility

  1. Protocol Buffers 2.3 (not compatible with 2.4+)
  2. Pig 0.8, 0.9 (not compatible with 0.7 and below)
  3. Hive 0.7 (with HIVE-1616)
  4. Thrift 0.5
  5. Mahout 0.6
  6. Cascading2 (as the API is evolving, see libraries.properties for the currently supported version)

Protocol Buffer and Thrift compiler dependencies

Elephant Bird requires Protocol Buffer compiler version 2.3 at build time, as generated classes are used internally. Thrift compiler version 0.5.0 is required to generate classes used in tests. As these are native-code tools they must be installed on the build machine (java library dependencies are pulled from maven repositories during the build).

Contents

Hadoop Input and Output Formats

Elephant-Bird provides input and output formats for working with working with a variety of plaintext formats stored in LZO compressed files.

  • JSON data
  • Line-based data (TextInputFormat but for LZO)
  • W3C logs

Additionally, protocol buffers and thrift messages can be stored in a variety of file formats.

  • Block-based, into generic bytes
  • Line-based, base64 encoded
  • SequenceFile
  • RCFile

Hadoop API wrappers

Hadoop provides two API implementations: the the old-style org.apache.hadoop.mapred and new-style org.apache.hadoop.mapreduce packages. Elephant-Bird provides wrapper classes that allow unmodified usage of mapreduce input and output formats in contexts where the mapred interface is required.

For more information, see DeprecatedInputFormatWrapper.java and DeprecatedOutputFormatWrapper.java

Hadoop Writables

  • Elephant-Bird provides protocol buffer and thrift writables for directly working with these formats in map-reduce jobs.

Pig Support

Loaders and storers are available for the input and output formats listed above. Additionally, pig-specific features include:

  • JSON loader (including nested structures)
  • Regex-based loader
  • Includes converter interface for turning Tuples into Writables and vice versa
  • Provides implementations to convert generic Writables, Thrift, Protobufs, and other specialized classes, such as Apache Mahout's VectorWritable.

Hive Support

Elephant-Bird has experimental Hive support. For more information, see How to use Elephant Bird with Hive.

Utilities

  • Counters in Pig
  • Protocol Buffer utilities
  • Thrift utilities
  • Conversions from Protocol Buffers and Thrift messages to Pig tuples
  • Conversions from Thrift to Protocol Buffer's DynamicMessage
  • Reading and writing block-based Protocol Buffer format (see ProtobufBlockWriter)

Working with Thrift and Protocol Buffers in Hadoop

We provide InputFormats, OutputFormats, Pig Load / Store functions, Hive SerDes, and Writables for working with Thrift and Google Protocol Buffers. We haven't written up the docs yet, but look at ProtobufMRExample.java, ThriftMRExample.java, people_phone_number_count.pig, people_phone_number_count_thrift.pig under examples directory for reflection-based dynamic usage. We also provide utilities for generating Protobuf-specific Loaders, Input/Output Formats, etc, if for some reason you want to avoid the dynamic bits.

Protobuf Codegen?

Note: this is not strictly required for working with Protocol Buffers in Hadoop. We can do most of this dynamically. Some people like having specific classes, though, so this functionality is available since protobuf 2.3 makes it so easy to do.

In protobuf 2.3, Google introduced the notion of a Protocol Buffer plugin that lets you hook in to their code generation elegantly, with all the parsed metadata available. We use this in com.twitter.elephantbird.proto.HadoopProtoCodeGenerator to generate code for each Protocol Buffer. The HadoopProtoCodeGenerator expects as a first argument a yml file consisting of keys and lists of classnames. For each Protocol Buffer file read in (say from my_file.proto), it looks up the basename (my_file) in the yml file. If a corresponding list exists, it expects each element is a classname of a class deriving from com.twitter.elephantbird.proto.ProtoCodeGenerator. These classes implement a method to set the filename, and a method to set the generated code contents of the file. You can add your own by creating such a derived class and including it in the list of classnames for the Protocol Buffer file key. That is, if you want to apply the code generators in com.twitter.elephantbird.proto.codegen.ProtobufWritableGenerator and com.twitter.elephantbird.proto.codegen.LzoProtobufBytesToPigTupleGenerator to every protobuf in the file my_file.proto, then your config file should have a section that looks like

my_file:
  - com.twitter.elephantbird.proto.codegen.ProtobufWritableGenerator
  - com.twitter.elephantbird.proto.codegen.LzoProtobufBytesToPigTupleGenerator

There are examples in the examples subdirectory showing how to integrate this code generation into a build, both for generating Java files pre-jar and for generating other types of files from Protocol Buffer definitions post-compile (there are examples that do this to generate Pig loaders for a set of Protocol Buffers).

Hadoop SequenceFiles and Pig

Reading and writing Hadoop SequenceFiles with Pig is supported via classes SequenceFileLoader and SequenceFileStorage. These classes make use of a WritableConverter interface, allowing pluggable conversion of key and value instances to and from Pig data types.

Here's a short example: Suppose you have SequenceFile<Text, LongWritable> data sitting beneath path input. We can load that data with the following Pig script:

REGISTER '/path/to/elephant-bird.jar';

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';

pairs = LOAD 'input' USING $SEQFILE_LOADER (
  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
) AS (key: chararray, value: long);

To store {key: chararray, value: long} data as SequenceFile<Text, LongWritable>, the following may be used:

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';

STORE pairs INTO 'output' USING $SEQFILE_STORAGE (
  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
);

For details, please see Javadocs in the following classes:

How To Contribute

Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on github.

Each new release since 2.1.3 has a tag. The latest version on master is what we are actively running on Twitter's hadoop clusters daily, over hundreds of terabytes of data.

Contributors

Major contributors are listed below. Lots of others have helped too, thanks to all of them! See git logs for credits.

About

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 99.9%
  • Shell 0.1%