-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhive.html
412 lines (270 loc) · 10 KB
/
hive.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
<!DOCTYPE html>
<html>
<head>
<title>Title</title>
<meta charset="utf-8">
<style>
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'Droid Serif'; }
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: normal;
}
th, td {
padding: 5px;
border: 1px solid black;
}
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
</style>
</head>
<body>
<textarea id="source">
class: center, middle
# Hive
---
# Hive - A Petabyte Scale Data Warehouse Using Hadoop
### Authors: Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy
### Presenter: Neil Seward
---
# Contents
1. Defining Facebook's Problem
2. Contributions of Paper
3. HiveQL
4. HDFS
5. System Architecture
6. Demo
7. Conclusions
---
# Defining Facebook's Problem
- Facebook's data storage prior to 2008 was done on a trditional RDBMS
- Their data was rapidly expanding and was already taking too long to perform menial queries.
- Simple queries would take days to complete.
???
---
# Defining Facebook's Problem
- Facebook started to investigate Hadoop MapReduce as a solution.
- Allthough beneficial, the query time saved was lost in constructing queries in a MapReduce manor.
- Many developers at Facebook took days to construct these simple queries.
???
---
# Defining Facebook's Problem
- As a result, Facebook developed Hive.
- Their objective was to create a scalable infrastructure that operated like a traditional RDBMS would.
- Hive was to have columns, tables, and traditional SQL operation commands.
???
---
# Contributions
- Hive offers traditional SQL queries and data management on a distributed MapReduce storage system.
- Hive's custom serialization offers users the ability to store custom data types.
- Hive's HDFS and the underlying architecture provide's a 20% decrease in runtime compared to traditional MapReduce architectures at the time.
???
---
# The Hive Model
- Hive structures data into tables, columns, rows, and partitions.
- The system supports many common datatypes: int, float, string, etc.
- The system can also support complex and custom datatypes defined through serialization.
???
---
# HiveQL
- HiveQL is a subset of SQl.
- It contains extended operations that are useul in a distributed architecture.
- Although it is a subset of SQL, there are some limitations.
???
---
# HiveQL
- HiveQL can only support ANSI JOIN statements.
<br>
<img src="img/hive-join.png" style="width: 700px; height: 400px"/>
???
---
# HiveQL
### Custom MapReduce
- HiveQL has support for using custom map and reduce scripts in queries.
- This allows for more customization queries and additional operations like classification, clustering, etc.
<img src="img/hive-custom-mapreduce.png" style="width: 800px; height: 300px"/>
???
Note that Hive also supports clustering operations to reduce the load put on the reduce operations.
Note that Hive supports the interchanging of FROM, SELECT, MAP, and REDUCE within a sub query.
---
# HiveQL
### Custom MapReduce Data
<img src="img/hive-custom-mapreduce-data.png" style="width: 800px; height: 150px"/>
---
# HiveQL
### Custom Map
<img src="img/hive-custom-map.png" style="width: 800px; height: 300px"/>
---
# HiveQL
### Custom Reduce
<img src="img/hive-custom-reduce.png" style="width: 600px; height: 400px"/>
???
---
# HiveQL
### Serialization
- HiveQL supports custom data types with serialization.
- The default serialization is done through LazySerDe.
---
# HiveQL
### Serialization
- SerDe serializes rows into objects.
- In the table definition, HiveQL can specify row, column, list, and map value delimiters for serialization.
<br>
<br>
<img src="img/hive-serialization.png" style="width: 750px; height: 250px"/>
???
Note that collection items are delimiters for the elements in a list. List items are rows in this case.
Map keys are delimiters for items in a map.
---
# HDFS
- <b>Tables</b>: stored in directory.
- <b>Partitions</b>: stored in subdirectory within table directory.
- <b>Buckets</b>: stored as a file within a table or partition directory.
---
# HDFS
### Tables
- Tables can have many partitions, or operate without partitions.
- Tables are partitioned with custom PARTITIONED BY functions.
<br>
<img src="img/hive-create-partition.png" style="width: 750px; height: 100px"/>
???
Note that the table will be create with partitions existing for every 'date' and 'hour'
---
# HDFS
### Partitions
- Partitions allow Hive to perform queries much faster.
- They enable partition pruning, which will allow Hive to only query sub directories of tables.
<br>
<img src="img/hive-single-partition.png" style="width: 800px; height: 300px"/>
???
The first creates a partition with data from table t and the second creates and empty partition.
---
# HDFS
### Buckets
- Buckets can be used to scan files within partitions or tables.
<br>
<img src="img/hive-create-buckets.png" style="width: 800px; height: 300px"/>
???
Note that this queries the 2nd bucket file out of the 32 buckets in the table.
Note that these buckets can be used for clustering tables.
---
# System Architecture
### Summary
The main building blocks of Hive are:
<br>
- Metastore
- Compiler
- Optimizer
- MapReduce Execution
---
# System Architecture
### Metastore
- Metastore stores location data and metadata on all tables, partitions, columns, etc.
- This information is stored in a traditional RDBMS (MySQL)
- There are often duplicate metastore servers in case one is terminated or is hung.
???
The info is stored in a RDBMS because the compiler needs this info quick.
Without metastore, there is no structure for the data stored in HDFS.
---
# System Architecture
### Query Compilation and Optimization
- The compiler and optimizer takes in the HiveSQL queries and translates them into map and reduce tasks.
<br>
<br>
<img src="img/hive-compiler.png" style="width: 700px; height: 300px"/>
[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.
- The analyser queries the metastore for information on tables to insert into internal queries.
- The logical plan generator will take the internal queries and order them in a logical plan in an operation tree.
- The query plan generator converts the ordered logical plan into a series of map-reduce jobs.
---
# System Architecture - Parser
<img src="img/hive-example-parse.png" style="width: 800px; height: 125px"/>
[1]
---
# System Architecture - Parser
<img src="img/hive-parser.png" style="width: 700px; height: 450px"/>
[1]
???
- The parser converts queries intox syntax trees.
- The first block is the column selection from the underlying tables
- The second block contains the destination information for the overwrite operation
- The third block contains the join information and ON clause information for the two tables.
---
# System Architecture - Semantic Analyzer
<img src="img/hive-semantic-1.png" style="width: 750px; height: 450px"/>
[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.
---
# System Architecture - Semantic Analyzer
<img src="img/hive-semantic-2.png" style="width: 750px; height: 450px"/>
[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.
---
# System Architecture - Semantic Analyzer
<img src="img/hive-semantic-3.png" style="width: 750px; height: 450px"/>
[1]
???
- The analyser transforms the syntax trees into an internal query representation with query blocks.
---
# System Architecture - Logical Plan
<img src="img/hive-logical-plan.png" style="width: 650px; height: 450px"/>
[1]
???
- The logical plan generator will take the internal queries and order them in a logical plan in an operation tree.
---
# System Architecture
### Optimization
There are several optimization steps that occur to prune the logical plan created by the semantic analyser.
- Column pruning
- Predicate pushdown
- Partition pruning
- Map side joins
???
-Note predicate pushdown has to deal with the immediate execution of queries as they come in.
---
# System Architecture
### Optimization - Pruning
- Column pruning ensures that only the specified columns are selected when processing rows.
- Partition pruning ensures that only sub directories of tables are traversed when querying tables.
---
# System Architecture
### Optimization - Joins
- Map side joins allows the replication of small tables inside of map workers so that the join can be part of mapping.
- Map side joins reduce the load sent to reduce tasks.
---
# System Architecture
### Optimization - Predicate Pushdown
- Predicate pushdown prioritizes certain operations in a query in order to lessen load.
<img src="img/hive-predicate.png" style="width: 650px; height: 150px"/>
???
- The WHERE clause is evaluated first on the two tables before the join with predicate pushdown.
---
# System Architecture - Query Plan
<img src="img/hive-query-plan.png" style="width: 600px; height: 450px"/>
[1]
???
- The query plan generator converts the optimized logical plan into a series of map-reduce jobs.
- The execution engine then takes this plan and executes based on dependencies.
---
# Demo
---
# Conclusions
- Hive provides an easy to use query language to extract data from Hadoop MapReduce backends.
- The HDFS with partitions and buckets contribute to faster queries through pruning.
- Outside of pruning, Hive's data structuring and optimization provides a system that performs ~20% faster than traditional Hadoop systems during that time.
---
# References
[1] http://www.slideshare.net/recruitcojp/internal-hive
</textarea>
<script src="https://gnab.github.io/remark/downloads/remark-latest.min.js">
</script>
<script>
var slideshow = remark.create();
</script>
</body>
</html>