26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,26 @@
# Changelog
All notable changes to this project will be documented in this file.


0.0.2
-----
Major
* conf/dfs.sh Removed $HADOOP_PREFIX/sbin/start-yarn.sh (not needed here; YARN is started from run.sh)
* conf/dfs.sh Added $HADOOP_PREFIX/sbin/start-dfs.sh (required)
* conf/dfs.sh Added hdfs dfs -chmod 777 /tmp/hadoop-yarn/staging
(Required for the impersonated Hive user to be able to start MapReduce jobs)
* conf/run.sh Added $HADOOP_PREFIX/sbin/start-dfs.sh (was missing, required)
* conf/mapred-site.xml Added mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.java.opts.
Otherwise MapReduce jobs crash the JVM with Java heap space errors; even hadoop-mapreduce-examples-2.8.5.jar fails. See https://community.cloudera.com/t5/Support-Questions/Map-and-Reduce-Error-Java-heap-space/td-p/45874
* Hadoop “Unable to load native-hadoop library for your platform” warning fixed (added ENV LD_LIBRARY_PATH)
(major performance impact; see the verification sketch below)
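
The native-library fix can be verified with Hadoop's built-in check (a sketch; `hadoop checknative` is part of the stock Hadoop 2.x CLI, run inside the container):

```
# With LD_LIBRARY_PATH pointing at $HADOOP_HOME/lib/native, this should
# report "hadoop: true" instead of the "Unable to load native-hadoop
# library for your platform" warning.
$HADOOP_PREFIX/bin/hadoop checknative -a
```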


Minor
* conf/run.sh is copied to /etc/run.sh (not to the root directory)
* conf/core-site.xml fs.default.name key changed to fs.defaultFS
* conf/hdfs-site.xml dfs.name.dir key changed to dfs.namenode.name.dir
* conf/hdfs-site.xml dfs.data.dir key changed to dfs.datanode.data.dir
* Added nano to the Dockerfile
* LICENSE.txt added
* CHANGELOG.md added
23 changes: 17 additions & 6 deletions Dockerfile
@@ -9,9 +9,19 @@ RUN yum update -y && \
tar \
curl \
wget \
net-tools
net-tools \
nano

# setup ssh
#disable coloring for nano, see https://stackoverflow.com/a/55597765/1137529
RUN echo "syntax \"disabled\" \".\"" > ~/.nanorc; echo "color green \"^$\"" >> ~/.nanorc

#workaround for nano
#Odd caret/cursor behavior in nano within SSH session,
#see https://github.com/Microsoft/WSL/issues/1436#issuecomment-480570997
ENV TERM eterm-color


# setup ssh, see http://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-common/SingleCluster.html
RUN ssh-keygen -A
RUN ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
RUN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
@@ -36,6 +46,7 @@ ENV HADOOP_COMMON_HOME $HADOOP_HOME
ENV HADOOP_HDFS_HOME $HADOOP_HOME
ENV YARN_HOME $HADOOP_HOME
ENV HADOOP_COMMON_LIB_NATIVE_DIR $HADOOP_HOME/lib/native
ENV LD_LIBRARY_PATH $HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
ENV PATH $PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

# config hadoop
@@ -52,12 +63,12 @@ RUN cd /usr/local/hadoop && ln -s ./apache-hive-2.3.5-bin hive
ENV HIVE_HOME $HADOOP_HOME/hive

RUN chown -R root:root /usr/local/hadoop-2.8.5

RUN $HADOOP_PREFIX/bin/hdfs namenode -format
COPY dfs.sh .
RUN ./dfs.sh
COPY ./dfs.sh /etc/dfs.sh

RUN /etc/dfs.sh

COPY run.sh /etc/run.sh
COPY ./run.sh /etc/run.sh

# clean
RUN rm hadoop-2.8.5.tar.gz apache-hive-2.3.5-bin.tar.gz
32 changes: 26 additions & 6 deletions README.md
@@ -8,6 +8,14 @@ Tested with
- Docker 18.09.2
- bash 3.2.57

## Contains
- Amazon Linux
- Java Open JDK 8
- Apache Hadoop 2.8.5
- Apache Hive 2.3.5
- Hadoop configured in pseudo-distributed mode (single-node YARN);
see http://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-common/SingleCluster.html for details.

## Build

Clone repo
@@ -19,35 +27,45 @@ git clone git@github.com:ops-guru/docker-hive.git
Create image

```
docker build . -t amzn-hive-image
docker build . -t opsguruhub/docker-hive
```

You may want to do some cleanup first:

```
docker container stop local-hive; docker rm local-hive; docker rmi opsguruhub/docker-hive
```

## DockerHub

The image is available on DockerHub:

```
docker run -d opsguruhub/docker-hive:0.0.1
docker pull opsguruhub/docker-hive
```

## Test

Ensure that you don't have a running container:

```
docker container stop local-hive; docker rm local-hive
```

Run the image

```
CONTAINER_ID=$(docker run -d opsguruhub/docker-hive)
docker run -p 8030-8088:8030-8088 -p 10000:10000 -p 10002:10002 -d --name local-hive opsguruhub/docker-hive
```

Wait for the services to start

```
docker logs $CONTAINER_ID
docker logs local-hive
```
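
If you prefer to wait programmatically rather than watch the logs, one option is to poll the NameNode's safemode status until HDFS accepts writes (a sketch, assuming the `local-hive` container name used above):

```
# Block until the NameNode reports "Safe mode is OFF".
until docker exec local-hive /usr/local/hadoop/bin/hdfs dfsadmin -safemode get 2>/dev/null | grep -q OFF; do
  sleep 5
done
```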

Or you can open a bash shell:

```
docker exec -it local-hive bash
```

Start the Beeline client and connect to Hive

```
docker exec -it $CONTAINER_ID /usr/local/hadoop/hive/bin/beeline -u jdbc:hive2://localhost:10000 -n "" -p ""
docker exec -it local-hive /usr/local/hadoop/hive/bin/beeline -u jdbc:hive2://localhost:10000 -n "" -p ""
```

Now you should be able to query
@@ -60,4 +78,6 @@ Now you should be able to query
| default |
+----------------+
1 row selected (1.921 seconds)
```
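
As a further smoke test you can run a short end-to-end query non-interactively; the INSERT also exercises MapReduce, including the staging-directory permissions fixed in this change (a sketch; the `smoke_test` table name is illustrative):

```
docker exec -it local-hive /usr/local/hadoop/hive/bin/beeline \
  -u jdbc:hive2://localhost:10000 -n "" -p "" \
  -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); INSERT INTO smoke_test VALUES (1); SELECT * FROM smoke_test;"
```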

You can access the YARN web UI at http://localhost:8088/cluster
2 changes: 1 addition & 1 deletion conf/core-site.xml
@@ -1,6 +1,6 @@
<configuration>
<property>
<name>fs.default.name</name>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
26 changes: 13 additions & 13 deletions conf/hdfs-site.xml
@@ -1,16 +1,16 @@
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
18 changes: 17 additions & 1 deletion conf/mapred-site.xml
@@ -3,4 +3,20 @@
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
<property>
<name>mapreduce.map.memory.mb</name>
<value>5120</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>5210</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx4g</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx4g</value>
</property>
</configuration>
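
A quick way to confirm these memory settings are sufficient is to run the bundled examples jar that previously failed with heap errors (a sketch, assuming the stock Hadoop 2.8.5 layout under $HADOOP_PREFIX; run inside the container):

```
# A tiny MapReduce job (estimate pi: 2 maps, 5 samples each); it should
# now complete instead of crashing with "Java heap space".
$HADOOP_PREFIX/bin/hadoop jar \
  $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar pi 2 5
```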
12 changes: 9 additions & 3 deletions dfs.sh
@@ -3,15 +3,21 @@
/usr/sbin/sshd -D &

echo "starting fs"

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

$HADOOP_PREFIX/sbin/start-yarn.sh
$HADOOP_PREFIX/sbin/start-dfs.sh

echo "creating folders"

$HADOOP_PREFIX/bin/hdfs dfs -mkdir /user
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /user/root
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /user/hive
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /user/hive/warehouse
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /tmp
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /tmp/hadoop-yarn
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /tmp/hadoop-yarn/staging
$HADOOP_PREFIX/bin/hdfs dfs -chmod 777 /tmp
$HADOOP_PREFIX/bin/hdfs dfs -chmod 777 /user/hive/warehouse
$HADOOP_PREFIX/bin/hdfs dfs -chmod 777 /tmp/hadoop-yarn
$HADOOP_PREFIX/bin/hdfs dfs -chmod 777 /tmp/hadoop-yarn/staging
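
To confirm the staging directory is world-writable for the impersonated Hive user, listing its parent should show drwxrwxrwx (a sketch):

```
# Expect permissions drwxrwxrwx on /tmp/hadoop-yarn/staging.
$HADOOP_PREFIX/bin/hdfs dfs -ls /tmp/hadoop-yarn
```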
2 changes: 1 addition & 1 deletion run.sh
@@ -3,7 +3,7 @@
/usr/sbin/sshd -D &

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh

cd $HIVE_HOME && bin/schematool -initSchema -dbType derby