HenryCaiHaiying edited this page Dec 2, 2014 · 22 revisions

(Gobblin logo)

Gobblin is an open source, unified data ingestion framework for bringing significant amounts of data from internal data sources (data generated on premise) and external data sources (data sourced from external web sites) into one central repository (HDFS) for analysis.

As companies move toward a data-driven decision making business model, an increasing number of business products are driven by insights from data generated internally on premise or sourced externally from public web sites and web services. Gobblin was developed to make ingesting such big data easy:

  • Centralized data lake: standardized data formats, directory layouts;
  • Standardized catalog of lightweight transformations: security filters, schema evolution, type conversion, etc;
  • Data quality measurements and enforcement: schema validation, data audits, etc;
  • Scalable ingest: auto-scaling, fault-tolerance, etc;
  • Ease of operations: centralized monitoring, enforcement of SLAs, etc;
  • Ease of use: self-serve on-boarding of new datasets to minimize time and involvement of engineers.

Features

Support Matrix

Gobblin supports the following combinations of data sources and protocols:

  • The types of data sources: RDBMS, distributed NoSQL, event streams, log files, etc.;
  • The types of data transport protocols: file copies over HTTP or SFTP, JDBC, REST, Kafka, Databus, vendor-specific APIs, etc.;
  • The semantics of the data bundles: increments, appends, full dumps, change streams, etc.;
  • The types of the data flows: batch, streaming.
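In practice, a job picks one combination from this matrix and declares it in a job configuration file. The sketch below is illustrative only: the key names follow the style of Gobblin's example job configs, but the exact keys and class names (especially the source class) are assumptions that should be verified against the current documentation.

```properties
# Hypothetical Gobblin job config sketch -- key names and class names are
# illustrative assumptions, not a verified working configuration.
job.name=ExampleIngest
job.group=Examples

# Which source/protocol to pull from (class name is hypothetical)
source.class=gobblin.example.ExampleSource

# Where and how to land the data in the central repository
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
```

Separating the source, writer, and publisher into independent config keys is what lets one framework cover the whole matrix of sources, protocols, and flow types above.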

Community

Requirements

  • Java 1.6+
  • Gradle 1.12+
  • Hadoop 1.2.1+ or Hadoop 2.3.0+
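Before building, it can help to confirm the JDK on the build machine meets the Java 1.6+ requirement. A minimal sketch of such a check is below; it parses a sample `java -version` banner (the `banner` variable is a hard-coded example, so the snippet runs anywhere — on a real machine, capture the banner with `java -version 2>&1` instead).

```shell
# Minimal sketch: check a "java -version" banner against the 1.6+ requirement.
# The banner below is a hard-coded sample; replace it with the real output of
#   java -version 2>&1
banner='java version "1.8.0_292"'

# Extract the "major.minor" part of the version string, e.g. 1.8
ver=$(printf '%s' "$banner" | sed -n 's/.*"\([0-9]*\.[0-9]*\).*/\1/p')
major=${ver%%.*}
minor=${ver#*.}

if [ "$major" -gt 1 ] || [ "$minor" -ge 6 ]; then
  echo "Java $ver meets the 1.6+ requirement"
else
  echo "Java $ver is too old for Gobblin"
fi
```

The same pattern can be applied to the Gradle and Hadoop version strings if you want a full prerequisite check.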

Quickstart Guides and Examples
