Hadoop Hive

Yandex.Cloud, S3, HDFS, Hive, MapReduce, TEZ, YARN, HiveSQL, CLI, Shell, Hadoop Cluster Administration

Payment type	Date	Tips average amount	Passengers total
Cash	2020-01-31	999.99	112

1,2020-04-01 00:41:22,2020-04-01 01:01:53,1,1.20,1,N,41,24,2,5.5,0.5,0.5,0,0,0.3,6.8,0

Learn more about the data source here
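To make the sample row above easier to read, here is a minimal Python sketch that maps its 18 comma-separated values to named fields. The column names are an assumption: they follow the public NYC TLC yellow taxi trip record layout, which the row appears to match, and they are not confirmed by this repository.

```python
import csv
from datetime import datetime
from io import StringIO

# Assumed column names (NYC TLC yellow taxi layout); not taken from this repository.
COLUMNS = [
    "vendor_id", "tpep_pickup_datetime", "tpep_dropoff_datetime",
    "passenger_count", "trip_distance", "ratecode_id", "store_and_fwd_flag",
    "pulocation_id", "dolocation_id", "payment_type", "fare_amount", "extra",
    "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge",
    "total_amount", "congestion_surcharge",
]

# The sample row shown in this README.
SAMPLE_ROW = (
    "1,2020-04-01 00:41:22,2020-04-01 01:01:53,1,1.20,1,N,41,24,2,"
    "5.5,0.5,0.5,0,0,0.3,6.8,0"
)


def parse_trip(line):
    """Split one CSV line and pair each value with its assumed column name."""
    values = next(csv.reader(StringIO(line)))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    trip = dict(zip(COLUMNS, values))
    # Convert the fields the showcase relies on to typed values.
    trip["tpep_pickup_datetime"] = datetime.strptime(
        trip["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S"
    )
    trip["passenger_count"] = int(trip["passenger_count"])
    trip["payment_type"] = int(trip["payment_type"])
    trip["tip_amount"] = float(trip["tip_amount"])
    return trip


trip = parse_trip(SAMPLE_ROW)
print(trip["tpep_pickup_datetime"].date(), trip["payment_type"], trip["tip_amount"])
```

Under this assumed layout, the tenth value is the payment type id and the fourteenth is the tip amount, which are the two fields the showcase aggregates on.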

Deploying a Hadoop cluster using a Yandex.Cloud solution:
Creating a bucket using the Yandex.Cloud S3 solution.
Uploading the data (database) to the created S3 bucket using distcp.
Creating and configuring the database (database). Setting the Hive configuration to use TEZ.
- Creating the "payment" dimension table according to the description of the data format. The storage format is Parquet.
- The fields are named id and name. Filling the dimension table.
- Access utility: Hive CLI.
- Tables are created as external (EXTERNAL) to prevent data loss when a table is dropped.
Creating the trips tables on top of the existing data in CSV format. trips is partitioned by the trip start day and stored as Parquet, so a query that filters by day reads only the matching partitions instead of scanning the whole table.
Configuring partitions, then transforming and loading the data into the fact tables.
Creating the data showcase using a materialized view and MAPJOIN.
Creating a shell script for automatic showcase creation.
Rebuilding the showcase.
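The showcase logic in the steps above (join the trips fact data with the small payment dimension table, then aggregate per payment type and day) can be sketched in plain Python. This only illustrates the idea and is not the project's HiveQL: the payment names other than Cash, the tuple layout, and the sample trips are assumptions made up for the example. Keeping the small dimension table in an in-memory dictionary mirrors what a map-side join (MAPJOIN) does in Hive.

```python
from collections import defaultdict
from datetime import date

# Hypothetical contents of the "payment" dimension table (id -> name).
PAYMENT = {1: "Credit card", 2: "Cash"}

# Hypothetical trips: (payment type id, trip start day, tip amount, passengers).
TRIPS = [
    (2, date(2020, 4, 1), 0.0, 1),
    (2, date(2020, 4, 1), 1.0, 2),
    (1, date(2020, 4, 1), 2.5, 1),
    (1, date(2020, 4, 2), 3.5, 3),
]


def build_showcase(trips, payment):
    """Return rows of (payment name, day, average tip, total passengers)."""
    # Accumulate [tip sum, trip count, passenger sum] per (payment id, day).
    totals = defaultdict(lambda: [0.0, 0, 0])
    for payment_id, day, tip, passengers in trips:
        acc = totals[(payment_id, day)]
        acc[0] += tip
        acc[1] += 1
        acc[2] += passengers
    rows = []
    for (payment_id, day), (tip_sum, count, passenger_sum) in sorted(totals.items()):
        # The dictionary lookup plays the role of the join with the dimension table.
        name = payment.get(payment_id, "Unknown")
        rows.append((name, day.isoformat(), round(tip_sum / count, 2), passenger_sum))
    return rows


showcase = build_showcase(TRIPS, PAYMENT)
print("Payment type\tDate\tTips average amount\tPassengers total")
for row in showcase:
    print("\t".join(str(value) for value in row))
```

Grouping by the trip start day matches the partitioning key of trips, so in Hive a rebuild restricted to one day would only need to read that day's partition.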
