This guide provides step-by-step instructions for setting up Hive tables and running queries in a Hadoop environment using Docker. It includes commands for creating external and staging tables, loading data, and querying with Hive.
Ensure Docker is installed on your system. Place the data file `data.csv` in the host directory that will be mounted into the container, so that it appears at `/home/hadoop/data` inside the container.
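For reference, `data.csv` is assumed to be a headerless, comma-delimited file whose columns match the `Address119_staging` schema defined below (Id, Name, Age, Gender, City); the rows here are illustrative placeholders:

```csv
1,Alice,21,F,K
2,Bob,23,M,D
3,Carol,22,F,K
```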
**Run the Docker Container**

Start the Hadoop container with the required port and volume mappings:

```bash
docker run -p 9870:9870 -p 8088:8088 -v D:/hedoop:/home/hadoop/data -it --name=hadoop macio232/hadoop-pseudo-distributed-mode
```
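To confirm the container is up, you can list running containers filtered by name:

```bash
docker ps --filter name=hadoop
```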
**Start Hadoop Services**

Start the container, open a shell inside it, and launch the Hadoop daemons:

```bash
docker start hadoop
docker exec -it hadoop /bin/bash   # opens a shell inside the container
start-all.sh                       # run from the container shell
```
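To verify that the daemons came up, running `jps` inside the container should list processes such as NameNode, DataNode, and ResourceManager:

```bash
jps
```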
**Create Directories and Upload the CSV File to HDFS**

```bash
hdfs dfs -mkdir -p /user/hadoop/folder119
hdfs dfs -put /home/hadoop/data/data.csv /user/hadoop/folder119/
```
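To verify the upload:

```bash
hdfs dfs -ls /user/hadoop/folder119/
```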
**Open the Hive CLI**

```bash
hive
```
**Check Databases**

List the available databases and switch to the student database, creating it if necessary:

```sql
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS student119;
USE student119;
```
Create an external partitioned table named `Info119`:

```sql
CREATE EXTERNAL TABLE Info119 (
    Id INT,
    Name STRING,
    Age INT
)
PARTITIONED BY (Gender STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/folder119/info/';
```
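To double-check the table definition, partition columns, and storage location, Hive's `DESCRIBE FORMATTED` statement can be used:

```sql
DESCRIBE FORMATTED Info119;
```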
Create a staging table that holds the raw CSV rows before they are split into partitions (Hive ignores trailing fields the table does not declare, so the City column in `data.csv` is simply dropped here):

```sql
CREATE TABLE Info119_staging (
    Id INT,
    Name STRING,
    Age INT,
    Gender STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```
**Load Data into the Staging Table**

Load the data from the HDFS path into the staging table:

```sql
LOAD DATA INPATH '/user/hadoop/folder119/data.csv' OVERWRITE INTO TABLE Info119_staging;
```
**Insert Data with Partitioning into Info119**

Enable dynamic partitioning and insert data from the staging table:

```sql
SET hive.exec.dynamic.partition=true;  -- enabled by default in recent Hive releases
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE Info119 PARTITION (Gender)
SELECT Id, Name, Age, Gender FROM Info119_staging;
```
**Retrieve Data from Info119**

```sql
SELECT * FROM Info119;
```
**Check Partitions in Info119**

```sql
SHOW PARTITIONS Info119;
```
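Given the `M` and `F` gender values used in the queries below, the output should look roughly like:

```
gender=F
gender=M
```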
**View Data Files in HDFS**

From the container shell, inspect the partitioned data files (note that Hive lowercases the partition column name in the directory path):

```bash
hdfs dfs -ls /user/hadoop/folder119/info
hdfs dfs -cat /user/hadoop/folder119/info/gender=F/000000_0
```
Create an external partitioned table for address data:

```sql
DROP TABLE IF EXISTS Address119;
CREATE EXTERNAL TABLE Address119 (
    Id INT
)
PARTITIONED BY (City STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/folder119/address/';
```
Create a staging table covering all five CSV columns:

```sql
CREATE TABLE Address119_staging (
    Id INT,
    Name STRING,
    Age INT,
    Gender STRING,
    City STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```
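One caveat before the next step: `LOAD DATA INPATH` moves its source file, so the earlier load into `Info119_staging` removed `data.csv` from `/user/hadoop/folder119/`. Re-upload it from the container shell first:

```bash
hdfs dfs -put /home/hadoop/data/data.csv /user/hadoop/folder119/
```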
**Load Data into Address119_staging**

```sql
LOAD DATA INPATH '/user/hadoop/folder119/data.csv' OVERWRITE INTO TABLE Address119_staging;
```
**Insert Data with Partitioning into Address119**

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE Address119 PARTITION (City)
SELECT Id, City FROM Address119_staging;
```
**Retrieve Data from Address119**

```sql
SELECT * FROM Address119;
```
**Check Partitions in Address119**

```sql
SHOW PARTITIONS Address119;
```
**Filter by Gender in Info119**

Retrieve records for a specific gender. Because Gender is the partition column, each query reads only the matching partition directory:

```sql
SELECT * FROM Info119 WHERE Gender = 'M';
SELECT * FROM Info119 WHERE Gender = 'F';
```
**Join Info119 and Address119 on Id**

```sql
SELECT i.Id, i.Name, i.Age, i.Gender, a.City
FROM Info119 i
JOIN Address119 a ON i.Id = a.Id;
```
**Join with a Filter**

Retrieve data for females in a specific city:

```sql
SELECT i.Id, i.Name, i.Age, i.Gender, a.City
FROM Info119 i
JOIN Address119 a ON i.Id = a.Id
WHERE i.Gender = 'F' AND a.City = 'K';
```
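The partitioned tables also support aggregate queries; for instance, a quick way to sanity-check row counts per gender partition:

```sql
SELECT Gender, COUNT(*) AS total
FROM Info119
GROUP BY Gender;
```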
This README covers Hive table setup, data loading, and querying in a Hadoop environment using Docker. Adjust paths as needed based on your setup.