hive.html

<!DOCTYPE html>
<html>
  <head>
    <title>Title</title>
    <meta charset="utf-8">
    <style>
      @import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
      @import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
      @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);

      body { font-family: 'Droid Serif'; }
      h1, h2, h3 {
        font-family: 'Yanone Kaffeesatz';
        font-weight: normal;
      }

      th, td {
          padding: 5px;
          border: 1px solid black;
      }
      .remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
    </style>
  </head>
  <body>
    <textarea id="source">

class: center, middle

# Hive

---

# Hive - A Petabyte Scale Data Warehouse Using Hadoop
### Authors: Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy
### Presenter: Neil Seward

---

# Contents

1. Defining Facebook's Problem
2. Contributions of Paper
3. HiveQL
4. HDFS
5. System Architecture
6. Demo
7. Conclusions

---

# Defining Facebook's Problem
- Facebook's data storage prior to 2008 was done on a trditional RDBMS

- Their data was rapidly expanding and was already taking too long to perform menial queries.

- Simple queries would take days to complete.
???

---

# Defining Facebook's Problem
- Facebook started to investigate Hadoop MapReduce as a solution.

- Allthough beneficial, the query time saved was lost in constructing queries in a MapReduce manor.

- Many developers at Facebook took days to construct these simple queries.
???

---

# Defining Facebook's Problem
- As a result, Facebook developed Hive.

- Their objective was to create a scalable infrastructure that operated like a traditional RDBMS would.

- Hive was to have columns, tables, and traditional SQL operation commands.
???

---

# Contributions
- Hive offers traditional SQL queries and data management on a distributed MapReduce storage system.

- Hive's custom serialization offers users the ability to store custom data types.

- Hive's HDFS and the underlying architecture provide's a 20% decrease in runtime compared to traditional MapReduce architectures at the time.


???

---


# The Hive Model
- Hive structures data into tables, columns, rows, and partitions.

- The system supports many common datatypes: int, float, string, etc.

- The system can also support complex and custom datatypes defined through serialization.

???

---


# HiveQL
- HiveQL is a subset of SQl.

- It contains extended operations that are useul in a distributed architecture.

- Although it is a subset of SQL, there are some limitations.

???

---
# HiveQL
- HiveQL can only support ANSI JOIN statements.

<br>
<img src="img/hive-join.png" style="width: 700px; height: 400px"/>

???

---
# HiveQL
### Custom MapReduce
- HiveQL has support for using custom map and reduce scripts in queries.

- This allows for more customization queries and additional operations like classification, clustering, etc.

<img src="img/hive-custom-mapreduce.png" style="width: 800px; height: 300px"/>

???
Note that Hive also supports clustering operations to reduce the load put on the reduce operations.
Note that Hive supports the interchanging of FROM, SELECT, MAP, and REDUCE within a sub query.

---
# HiveQL
### Custom MapReduce Data

<img src="img/hive-custom-mapreduce-data.png" style="width: 800px; height: 150px"/>

---
# HiveQL
### Custom Map

<img src="img/hive-custom-map.png" style="width: 800px; height: 300px"/>

---
# HiveQL
### Custom Reduce

<img src="img/hive-custom-reduce.png" style="width: 600px; height: 400px"/>
???


---
# HiveQL
### Serialization
- HiveQL supports custom data types with serialization.

- The default serialization is done through LazySerDe.

---
# HiveQL
### Serialization
- SerDe serializes rows into objects.

- In the table definition, HiveQL can specify row, column, list, and map value delimiters for serialization.
<br>
<br>
<img src="img/hive-serialization.png" style="width: 750px; height: 250px"/>

???
Note that collection items are delimiters for the elements in a list. List items are rows in this case.
Map keys are delimiters for items in a map.


---
# HDFS
- <b>Tables</b>: stored in directory.

- <b>Partitions</b>: stored in subdirectory within table directory.

- <b>Buckets</b>: stored as a file within a table or partition directory.

---
# HDFS
### Tables
- Tables can have many partitions, or operate without partitions.

- Tables are partitioned with custom PARTITIONED BY functions.

<br>
<img src="img/hive-create-partition.png" style="width: 750px; height: 100px"/>

???
Note that the table will be create with partitions existing for every 'date' and 'hour'

---
# HDFS
### Partitions
- Partitions allow Hive to perform queries much faster.

- They enable partition pruning, which will allow Hive to only query sub directories of tables.

<br>
<img src="img/hive-single-partition.png" style="width: 800px; height: 300px"/>
???
The first creates a partition with data from table t and the second creates and empty partition.

---
# HDFS
### Buckets
- Buckets can be used to scan files within partitions or tables.

<br>
<img src="img/hive-create-buckets.png" style="width: 800px; height: 300px"/>
???
Note that this queries the 2nd bucket file out of the 32 buckets in the table.
Note that these buckets can be used for clustering tables.

---

# System Architecture
### Summary
The main building blocks of Hive are:
<br>

- Metastore

- Compiler

- Optimizer

- MapReduce Execution

---

# System Architecture
### Metastore
- Metastore stores location data and metadata on all tables, partitions, columns, etc.

- This information is stored in a traditional RDBMS (MySQL)

- There are often duplicate metastore servers in case one is terminated or is hung.

???
The info is stored in a RDBMS because the compiler needs this info quick.
Without metastore, there is no structure for the data stored in HDFS.

---

# System Architecture
### Query Compilation and Optimization
- The compiler and optimizer takes in the HiveSQL queries and translates them into map and reduce tasks.
<br>
<br>
<img src="img/hive-compiler.png" style="width: 700px; height: 300px"/>
[1]

???

- The analyser transforms the syntax trees into an internal query representation with query blocks.
- The analyser queries the metastore for information on tables to insert into internal queries.
- The logical plan generator will take the internal queries and order them in a logical plan in an operation tree.
- The query plan generator converts the ordered logical plan into a series of map-reduce jobs.

---

# System Architecture - Parser
<img src="img/hive-example-parse.png" style="width: 800px; height: 125px"/>

[1]

---

# System Architecture - Parser
<img src="img/hive-parser.png" style="width: 700px; height: 450px"/>

[1]
???
- The parser converts queries intox syntax trees.
- The first block is the column selection from the underlying tables
- The second block contains the destination information for the overwrite operation
- The third block contains the join information and ON clause information for the two tables.

---

# System Architecture - Semantic Analyzer
<img src="img/hive-semantic-1.png" style="width: 750px; height: 450px"/>

[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.

---

# System Architecture - Semantic Analyzer
<img src="img/hive-semantic-2.png" style="width: 750px; height: 450px"/>

[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.

---

# System Architecture - Semantic Analyzer
<img src="img/hive-semantic-3.png" style="width: 750px; height: 450px"/>

[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.

---

# System Architecture - Logical Plan
<img src="img/hive-logical-plan.png" style="width: 650px; height: 450px"/>

[1]
???
- The logical plan generator will take the internal queries and order them in a logical plan in an operation tree.


---

# System Architecture
### Optimization
There are several optimization steps that occur to prune the logical plan created by the semantic analyser.

- Column pruning

- Predicate pushdown

- Partition pruning

- Map side joins

???
-Note predicate pushdown has to deal with the immediate execution of queries as they come in.

---

# System Architecture
### Optimization - Pruning

- Column pruning ensures that only the specified columns are selected when processing rows.

- Partition pruning ensures that only sub directories of tables are traversed when querying tables.

---

# System Architecture
### Optimization - Joins

- Map side joins allows the replication of small tables inside of map workers so that the join can be part of mapping.

- Map side joins reduce the load sent to reduce tasks.

---

# System Architecture
### Optimization - Predicate Pushdown

- Predicate pushdown prioritizes certain operations in a query in order to lessen load.

<img src="img/hive-predicate.png" style="width: 650px; height: 150px"/>

???
- The WHERE clause is evaluated first on the two tables before the join with predicate pushdown.

---

# System Architecture - Query Plan
<img src="img/hive-query-plan.png" style="width: 600px; height: 450px"/>

[1]
???
- The query plan generator converts the optimized logical plan into a series of map-reduce jobs.
- The execution engine then takes this plan and executes based on dependencies.
---

# Demo

---

# Conclusions

- Hive provides an easy to use query language to extract data from Hadoop MapReduce backends.

- The HDFS with partitions and buckets contribute to faster queries through pruning.

- Outside of pruning, Hive's data structuring and optimization provides a system that performs ~20% faster than traditional Hadoop systems during that time.

---

# References

[1] http://www.slideshare.net/recruitcojp/internal-hive


    </textarea>
    <script src="https://gnab.github.io/remark/downloads/remark-latest.min.js">
    </script>
    <script>
      var slideshow = remark.create();
    </script>
  </body>
</html>