This guide provides step-by-step instructions for setting up Hive tables and running queries in a Hadoop environment using Docker. It includes commands for creating external and staging tables, loading data, and querying with Hive.
Ensure Docker is installed on your system. Place the data file `data.csv` in the host directory that will be mounted into the container, so that it appears at `/home/hadoop/data` inside the container.
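For reference, `data.csv` is assumed to be a headerless, comma-delimited file whose columns match the `Address119_staging` schema defined below (Id, Name, Age, Gender, City); the rows here are illustrative placeholders:

```csv
1,Alice,21,F,K
2,Bob,23,M,D
3,Carol,22,F,K
```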
**Run the Docker Container**

Start the Hadoop container with the required port and volume mappings:

```bash
docker run -p 9870:9870 -p 8088:8088 -v D:/hedoop:/home/hadoop/data -it --name=hadoop macio232/hadoop-pseudo-distributed-mode
```
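To confirm the container is up, you can list running containers filtered by name:

```bash
docker ps --filter name=hadoop
```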
**Start Hadoop Services**

Start the container, open a shell inside it, and launch the Hadoop daemons:

```bash
docker start hadoop
docker exec -it hadoop /bin/bash   # opens a shell inside the container
start-all.sh                       # run from the container shell
```
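To verify that the daemons came up, running `jps` inside the container should list processes such as NameNode, DataNode, and ResourceManager:

```bash
jps
```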
**Create Directories and Upload the CSV File to HDFS**

```bash
hdfs dfs -mkdir -p /user/hadoop/folder119
hdfs dfs -put /home/hadoop/data/data.csv /user/hadoop/folder119/
```
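To verify the upload:

```bash
hdfs dfs -ls /user/hadoop/folder119/
```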
**Open the Hive CLI**

```bash
hive
```
**Check Databases**

List the available databases and switch to the student database, creating it if necessary:

```sql
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS student119;
USE student119;
```
Create an external partitioned table named `Info119`:

```sql
CREATE EXTERNAL TABLE Info119 (
    Id INT,
    Name STRING,
    Age INT
)
PARTITIONED BY (Gender STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/folder119/info/';
```
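To double-check the table definition, partition columns, and storage location, Hive's `DESCRIBE FORMATTED` statement can be used:

```sql
DESCRIBE FORMATTED Info119;
```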
Create a staging table that holds the raw CSV rows before they are split into partitions (Hive ignores trailing fields the table does not declare, so the City column in `data.csv` is simply dropped here):

```sql
CREATE TABLE Info119_staging (
    Id INT,
    Name STRING,
    Age INT,
    Gender STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```
**Load Data into the Staging Table**

Load the data from the HDFS path into the staging table:

```sql
LOAD DATA INPATH '/user/hadoop/folder119/data.csv' OVERWRITE INTO TABLE Info119_staging;
```
**Insert Data with Partitioning into Info119**

Enable dynamic partitioning and insert data from the staging table:

```sql
SET hive.exec.dynamic.partition=true;  -- enabled by default in recent Hive releases
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE Info119 PARTITION (Gender)
SELECT Id, Name, Age, Gender FROM Info119_staging;
```
**Retrieve Data from Info119**

```sql
SELECT * FROM Info119;
```
**Check Partitions in Info119**

```sql
SHOW PARTITIONS Info119;
```
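Given the `M` and `F` gender values used in the queries below, the output should look roughly like:

```
gender=F
gender=M
```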
**View Data Files in HDFS**

From the container shell, inspect the partitioned data files (note that Hive lowercases the partition column name in the directory path):

```bash
hdfs dfs -ls /user/hadoop/folder119/info
hdfs dfs -cat /user/hadoop/folder119/info/gender=F/000000_0
```
Create an external partitioned table for address data:

```sql
DROP TABLE IF EXISTS Address119;
CREATE EXTERNAL TABLE Address119 (
    Id INT
)
PARTITIONED BY (City STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/folder119/address/';
```
Create a staging table covering all five CSV columns:

```sql
CREATE TABLE Address119_staging (
    Id INT,
    Name STRING,
    Age INT,
    Gender STRING,
    City STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```
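One caveat before the next step: `LOAD DATA INPATH` moves its source file, so the earlier load into `Info119_staging` removed `data.csv` from `/user/hadoop/folder119/`. Re-upload it from the container shell first:

```bash
hdfs dfs -put /home/hadoop/data/data.csv /user/hadoop/folder119/
```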
**Load Data into Address119_staging**

```sql
LOAD DATA INPATH '/user/hadoop/folder119/data.csv' OVERWRITE INTO TABLE Address119_staging;
```
**Insert Data with Partitioning into Address119**

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE Address119 PARTITION (City)
SELECT Id, City FROM Address119_staging;
```
**Retrieve Data from Address119**

```sql
SELECT * FROM Address119;
```
**Check Partitions in Address119**

```sql
SHOW PARTITIONS Address119;
```
**Filter by Gender in Info119**

Retrieve records for a specific gender. Because Gender is the partition column, each query reads only the matching partition directory:

```sql
SELECT * FROM Info119 WHERE Gender = 'M';
SELECT * FROM Info119 WHERE Gender = 'F';
```
**Join Info119 and Address119 on Id**

```sql
SELECT i.Id, i.Name, i.Age, i.Gender, a.City
FROM Info119 i
JOIN Address119 a ON i.Id = a.Id;
```
**Join with a Filter**

Retrieve data for females in a specific city:

```sql
SELECT i.Id, i.Name, i.Age, i.Gender, a.City
FROM Info119 i
JOIN Address119 a ON i.Id = a.Id
WHERE i.Gender = 'F' AND a.City = 'K';
```
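The partitioned tables also support aggregate queries; for instance, a quick way to sanity-check row counts per gender partition:

```sql
SELECT Gender, COUNT(*) AS total
FROM Info119
GROUP BY Gender;
```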
This README covers Hive table setup, data loading, and querying in a Hadoop environment using Docker. Adjust paths as needed based on your setup.