This project provides a client library in Java that allows Microsoft Azure DocumentDB to act as an input source or output sink for MapReduce, Hive and Pig jobs.
To get the binaries of this library as distributed by Microsoft, ready for use within your project, you can use GitHub releases.
To get the source code of the connector via git, run:
git clone git://github.com/Azure/azure-documentdb-hadoop.git
To download a copy of the source code, click "Download ZIP" on the right side of the page or click here.
To get the binaries of this library as distributed by Microsoft, ready for use within your project, you can use Maven.
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-documentdb-hadoop</artifactId>
    <version>1.2.0</version>
</dependency>
Install the DocumentDB Hadoop Connector onto HDInsight clusters through custom action scripts. Full instructions can be found here.
- Java Development Kit 7
- Apache Hadoop & YARN 2.4.0
  - Apache Pig 0.12.1
  - Apache Hive & HCatalog 0.13.1
- Apache Hadoop & YARN 2.6.0
  - Apache Pig 0.14.0
  - Apache Hive & HCatalog 0.14.0
- HDI 3.1 (Getting started with HDInsight)
- HDI 3.2
- Microsoft Azure DocumentDB Java SDK 1.6.0 (com.microsoft.azure / azure-documentdb / 1.6.0)
When using Hive:
- OpenX Technologies JsonSerde 1.3.1-SNAPSHOT (org.openx.data / json-serde-parent / 1.3.1-SNAPSHOT); the GitHub repo can be found here
Please download the jars and add them to your build path.
To use this client library with Azure DocumentDB, you need to first create an account.
// Import Hadoop and Hadoop Connector Classes
import org.apache.hadoop.conf.Configuration;

import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.hadoop.ConfigurationUtil;
import com.microsoft.azure.documentdb.hadoop.DocumentDBInputFormat;
import com.microsoft.azure.documentdb.hadoop.DocumentDBOutputFormat;
import com.microsoft.azure.documentdb.hadoop.DocumentDBWritable;
// Set Configurations
Configuration conf = new Configuration();
final String host = "Your DocumentDB Endpoint";
final String key = "Your DocumentDB Primary Key";
final String dbName = "Your DocumentDB Database Name";
final String inputCollNames = "Your DocumentDB Input Collection Name[s]";
final String outputCollNames = "Your DocumentDB Output Collection Name[s]";
final String query = "[Optional] Your DocumentDB Query";
final String outputStringPrecision = "[Optional] Number of bytes to use for String indexes";
final String offerType = "[Optional] Your performance level for Output Collection Creations";
final String upsert = "[Optional] Boolean to enable or disable document upsert";
conf.set(ConfigurationUtil.DB_HOST, host);
conf.set(ConfigurationUtil.DB_KEY, key);
conf.set(ConfigurationUtil.DB_NAME, dbName);
conf.set(ConfigurationUtil.INPUT_COLLECTION_NAMES, inputCollNames);
conf.set(ConfigurationUtil.OUTPUT_COLLECTION_NAMES, outputCollNames);
conf.set(ConfigurationUtil.QUERY, query);
conf.set(ConfigurationUtil.OUTPUT_STRING_PRECISION, outputStringPrecision);
conf.set(ConfigurationUtil.OUTPUT_COLLECTIONS_OFFER, offerType);
conf.set(ConfigurationUtil.UPSERT, upsert);
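These settings plug into a standard Hadoop job through the connector's input and output formats. The following map-only sketch is illustrative, not the shipped sample: it assumes the input format yields LongWritable keys and DocumentDBWritable values, and that the output format accepts Text/DocumentDBWritable pairs, so verify both against the connector version you use.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

import com.microsoft.azure.documentdb.hadoop.DocumentDBInputFormat;
import com.microsoft.azure.documentdb.hadoop.DocumentDBOutputFormat;
import com.microsoft.azure.documentdb.hadoop.DocumentDBWritable;

public class PassThroughJob {

    // Emits every input document unchanged under a constant key
    // (key/value types are assumptions, as noted above).
    public static class PassThroughMapper
            extends Mapper<LongWritable, DocumentDBWritable, Text, DocumentDBWritable> {
        @Override
        protected void map(LongWritable key, DocumentDBWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("doc"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // ... the conf.set(...) calls shown above go here ...

        Job job = Job.getInstance(conf, "documentdb-passthrough");
        job.setJarByClass(PassThroughJob.class);

        // Read from and write to DocumentDB via the connector's formats.
        job.setInputFormatClass(DocumentDBInputFormat.class);
        job.setOutputFormatClass(DocumentDBOutputFormat.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only: documents flow straight to the output format

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DocumentDBWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}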
Full MapReduce sample can be found here.
CREATE EXTERNAL TABLE DocumentDB_Hive_Table( COLUMNS )
STORED BY 'com.microsoft.azure.documentdb.hive.DocumentDBStorageHandler'
tblproperties (
'DocumentDB.endpoint' = 'Your DocumentDB Endpoint',
'DocumentDB.key' = 'Your DocumentDB Primary Key',
'DocumentDB.db' = 'Your DocumentDB Database Name',
'DocumentDB.inputCollections' = 'Your DocumentDB Input Collection Name[s]',
'DocumentDB.query' = '[Optional] Your DocumentDB Query' );
CREATE EXTERNAL TABLE Hive_DocumentDB_Table( COLUMNS )
STORED BY 'com.microsoft.azure.documentdb.hive.DocumentDBStorageHandler'
tblproperties (
'DocumentDB.endpoint' = 'Your DocumentDB Endpoint',
'DocumentDB.key' = 'Your DocumentDB Primary Key',
'DocumentDB.db' = 'Your DocumentDB Database Name',
'DocumentDB.outputCollections' = 'Your DocumentDB Output Collection Name[s]',
'DocumentDB.outputStringPrecision' = '[Optional] Number of bytes to use for String indexes',
'DocumentDB.outputCollectionsOffer' = '[Optional] Your performance level for Output Collection Creations',
'DocumentDB.upsert' = '[Optional] Boolean to enable or disable document upsert');
INSERT INTO TABLE Hive_DocumentDB_Table
SELECT COLUMNS FROM DocumentDB_Hive_Table;
Full Hive sample can be found here.
data = LOAD 'Your DocumentDB Endpoint'
USING com.microsoft.azure.documentdb.hadoop.pig.DocumentDBLoader(
'Your DocumentDB Primary Key',
'Your DocumentDB Database Name',
'Your DocumentDB Input Collection Name[s]',
'[Optional] Your DocumentDB SQL Query' );
STORE data INTO 'Your DocumentDB Endpoint'
USING com.microsoft.azure.documentdb.hadoop.pig.DocumentDBStorage(
'Your DocumentDB Primary Key',
'Your DocumentDB Database Name',
'Your DocumentDB Output Collection Name[s]',
'[Optional] Your performance level for Output Collection Creations',
'[Optional] Number of bytes to use for String indexes',
'[Optional] Boolean to enable or disable document upsert');
Full Pig sample can be found here.
- When outputting to DocumentDB, your output collection will require capacity for an additional stored procedure. The stored procedure will remain in your collection for reuse.
- The Hadoop Connector automatically sets your indexes to range indexes with max precision on strings and numbers. More information can be found here.
- The connector supports a configurable upsert option. Upsert defaults to true and will overwrite documents with the same id within the same collection; see the snippet after these notes for disabling it.
- Reads and writes to DocumentDB will be counted against your provisioned throughput for each collection.
- Output to DocumentDB collections is performed in batches, distributed round-robin across the specified output collections.
- The connector supports a configurable offer option. The offer configuration lets users set the performance tier of newly created collections (it does not apply when outputting to an already existing collection).
- The connector supports output to partitioned collections, but it will not automatically create partitioned collections for Hadoop job outputs.
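For example, upsert can be switched off (and a performance tier chosen for collections the connector creates) through the same configuration keys shown in the MapReduce snippet above. A minimal sketch; "S2" is just an example offer value:

import org.apache.hadoop.conf.Configuration;

import com.microsoft.azure.documentdb.hadoop.ConfigurationUtil;

// Disable upsert so existing documents with the same id are left intact,
// and request the S2 performance tier for any output collections the
// connector creates ("S2" is an example value, not a recommendation).
Configuration conf = new Configuration();
conf.set(ConfigurationUtil.UPSERT, "false");
conf.set(ConfigurationUtil.OUTPUT_COLLECTIONS_OFFER, "S2");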
Be sure to check out the Microsoft Azure Developer Forums on MSDN or the Developer Forums on Stack Overflow if you have trouble with the provided code. Also, check out our tutorial for more information.
If you would like to become an active contributor to this project please follow the instructions provided in Azure Projects Contribution Guidelines.
If you encounter any bugs with the library please file an issue in the Issues section of the project.