Yandex.Cloud
S3
HDFS
HIVE
MapReduce
TEZ
YARN
HiveSQL
CLI
Shell
Hadoop Cluster Administration
Task: to provide constant access to cold data, create a star schema and build a showcase (data mart) of the form:
| Payment type | Date | Tips average amount | Passengers total |
|---|---|---|---|
| Cash | 2020-01-31 | 999.99 | 112 |

Sample row of the source data:

`1,2020-04-01 00:41:22,2020-04-01 01:01:53,1,1.20,1,N,41,24,2,5.5,0.5,0.5,0,0,0.3,6.8,0`
Learn more about the data source here
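
To make the sample row easier to read, here is a minimal Python sketch that splits it into named fields. The column names are an assumption: this README does not list them, but the 18 values appear to match the NYC TLC yellow taxi trip record layout. The helper `parse_trip` is hypothetical and keeps only the fields the showcase needs.

```python
import csv
from datetime import datetime

# Assumed column order (not stated in this README): NYC TLC yellow taxi layout.
COLUMNS = [
    "vendor_id", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "ratecode_id", "store_and_fwd_flag", "pu_location_id",
    "do_location_id", "payment_type", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount",
    "congestion_surcharge",
]

SAMPLE = (
    "1,2020-04-01 00:41:22,2020-04-01 01:01:53,1,1.20,1,N,41,24,2,"
    "5.5,0.5,0.5,0,0,0.3,6.8,0"
)


def parse_trip(line):
    """Split one CSV line and keep the fields the showcase needs."""
    values = next(csv.reader([line]))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    pickup = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "trip_date": pickup.date().isoformat(),  # day of trip start (partition key)
        "payment_type": int(row["payment_type"]),
        "tip_amount": float(row["tip_amount"]),
        "passenger_count": int(row["passenger_count"]),
    }


if __name__ == "__main__":
    print(parse_trip(SAMPLE))
```

Under the assumed layout, the tenth value is the payment type id and the second value is the pickup timestamp, whose date part is what the trips table is partitioned by.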
- Deploying a Hadoop cluster using a `Yandex.Cloud` solution.
- Creating a bucket using the `S3` Yandex.Cloud solution.
- Loading the data (database) into the created `s3` bucket using `distcp`.
- Creating and configuring the database (database). Setting the `Hive - TEZ` configuration.
- Creating the dimension table "payment" according to the description of the data format. The storage format is parquet.
- The fields are named id and name. Filling the dimension table.
- Using the access utility `Hive CLI`.
- Tables are created as external (`external`) to prevent data loss.
- Creating the trips tables built on top of the existing data in `csv` format. trips is partitioned by the day the trip started, and the storage format is `parquet`, so a query that filters by date reads only the matching partitions.
- Configuring partitions, transforming the data and loading it into the fact tables.
- Creating the data showcase using a materialized view and `MAPJOIN`.
- Creating a terminal script for automatic showcase creation.
- Rebuilding the showcase.
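
The showcase step can be pictured with a small Python sketch. It is not the HiveQL used in this project; it only illustrates what a map-side join (`MAPJOIN`) followed by the aggregation computes. The small payment dimension is held in memory as a dictionary, the larger trips data is streamed past it, and tips are averaged and passengers summed per payment type and date. The function name `build_showcase`, the sample rows and the id-to-name mapping are all hypothetical.

```python
from collections import defaultdict

# Hypothetical contents of the small "payment" dimension table (id -> name).
PAYMENT = {1: "Credit card", 2: "Cash"}

# Hypothetical fact rows: (payment type id, trip date, tip amount, passengers).
TRIPS = [
    (2, "2020-04-01", 0.0, 1),
    (1, "2020-04-01", 2.5, 2),
    (1, "2020-04-01", 1.5, 1),
    (2, "2020-04-02", 1.0, 3),
]


def build_showcase(trips, payment):
    """Join each fact row to the in-memory dimension, then aggregate."""
    # Per (payment name, date): [sum of tips, number of trips, sum of passengers]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for payment_id, trip_date, tip, passengers in trips:
        name = payment.get(payment_id)  # hash lookup instead of a shuffle join
        if name is None:
            continue  # inner join: drop rows with no matching dimension entry
        acc = totals[(name, trip_date)]
        acc[0] += tip
        acc[1] += 1
        acc[2] += passengers
    return [
        (name, day, round(tips / count, 2), pax)
        for (name, day), (tips, count, pax) in sorted(totals.items())
    ]


if __name__ == "__main__":
    print("Payment type | Date | Tips average amount | Passengers total")
    for row in build_showcase(TRIPS, PAYMENT):
        print(" | ".join(str(value) for value in row))
```

The dictionary lookup is the reason a map-side join suits this schema: the dimension table has only a handful of rows, so it fits in memory on every worker and the large fact table never has to be shuffled for the join.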