To provide an easily accessible environment for running and developing Hive.
Isolates work on different branches/etc by leveraging container isolation; X11 apps can still run like "normal" applications (I tend to run a separate Eclipse instance for every patch I'm actually working on).
Full isolation makes it easier to customize everything toward the goal...all ports can be bound/etc.
You may also run hive inside...
There is a prebaked image which contains some build tools in the image itself - that image is used at ci.hive.apache.org to run tests.
Ability to run some version of Hive as a standalone container; let's launch one with:
docker run --rm -d -p 10000:10000 -v hive-dev-box_work:/work kgyrtkirk/hive-dev-box:bazaar
The above will initialize the metastore and launch a nodemanager/resourcemanager and Hive as separate processes inside the container (in a screen session).
- you may choose different versions by setting: HIVE_VERSION, TEZ_VERSION or HADOOP_VERSION
- add -v hive-dev-box_data:/data to enable a persistent metastore/warehouse (see the example below)
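For example, a sketch combining both options (assuming the versions are passed in as container environment variables via -e; the version numbers are just the ones used as examples later in this document):

# pin component versions and keep metastore/warehouse data across runs
docker run --rm -d -p 10000:10000 \
  -e HIVE_VERSION=2.3.5 \
  -e TEZ_VERSION=0.8.4 \
  -e HADOOP_VERSION=3.1.0 \
  -v hive-dev-box_work:/work \
  -v hive-dev-box_data:/data \
  kgyrtkirk/hive-dev-box:bazaar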
There are sometimes bug reports against earlier releases, but testing these out can be problematic - running and switching between versions is cumbersome. I was using some Vagrant-based box which was useful for doing this...
I've been working on Hive (and sometimes on other projects) for the last couple of years - and since QA runs may come back after 8-12 hours, I work on multiple patches simultaneously. However, working on several patches simultaneously has its own problems. Let me go through all the approaches I was using earlier:
- basic approach: use a single workspace - and switch the branch...
  - unquestionably this is the simplest
  - after switching the branch, a full rebuild is necessary
- 1 for each: use multiple copies of hive, each with an isolated maven cache
  - pro:
    - capability to run maven commands simultaneously on multiple patches
  - con:
    - one of the patches has to be "active" for an IDE to be able to use it
    - it falls short when it comes to working on a patch simultaneously in multiple projects (hive+tez+hadoop)
    - after some time it eats up space...
- dockerized/virtualized development environment
  - pro:
    - everything is isolated
    - because I'm no longer bound to my natural environment, I may change a lot of things without interfering with anything else
    - easier to "clean up" after submitting the patch (just delete the container)
    - ability to have IDEs running for multiple patches at the same time
  - con:
    - isolated environment; configuration changes might get lost
    - may waste disk space...
The aim of this project is to:
- provide an easier way to test-drive hive releases:
  - upstream apache releases
  - HDP/CDP/CDH releases
  - in-development builds
- provide an environment for developing hive patches
# build and launch the hive-dev-box container
./hdb run hive-test
# after building the container you will get a prompt inside it
# initialize the metastore with
reinit_metastore
# everything should be ready to launch hive
hive_launch
# exit with CTRL+A CTRL+\ to kill all processes
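Since the processes run in a screen session (as mentioned above), you can also detach instead of killing everything - assuming the default GNU screen keybindings are in effect:

# press CTRL+A then d to detach, leaving the processes running
# reattach to the session later with:
screen -r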
- on Linux-based systems you are already running an X server (see the note after this list if apps cannot connect)
- macOS users should follow: https://medium.com/@mreichelt/how-to-show-x11-windows-within-docker-on-mac-50759f4b65cb
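On Linux, if an X11 app started inside the container cannot connect to your display, a common (permissive) workaround is to allow local clients on the X server - this is a general X11 tip, not something specific to hive-dev-box:

# allow local (non-network) clients, e.g. containers sharing the X socket,
# to connect to the X server; note that this is quite permissive
xhost +local: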
Every container will be reaching out for almost the same artifacts, so employing an artifact cache "makes sense" in this case :D
# start artifactory instance
./start_artifactory.bash
Once it's running, the start_artifactory command will show a few commands you will need to execute to configure the instance.
After that you will be able to access Artifactory at http://127.0.0.1:8081/ by using admin/admin to log in.
This instance will be linked to the running development environment(s) automatically.
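To quickly check that the instance is up, Artifactory's stock REST ping endpoint can be used (the URL below assumes the default /artifactory context path on port 8081):

# should print OK once artifactory is ready
curl -u admin:admin http://127.0.0.1:8081/artifactory/api/system/ping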
Add an export to your .bashrc or similar, like:
# to have a shared folder between all the dev containers and also the host system:
export HIVE_DEV_BOX_HOST_DIR=$HOME/hdb
The dev environment will assume that you are working on upstream patches, and will always open a new branch forked from master. If you skip this, things may not work - you will be left to do these things manually; in case you are using the HIVE_SOURCES env variable you might not need to set it anyway.
# make sure to load the new env variables for bash
. .bashrc
# and also create the host dir beforehand
mkdir $HIVE_DEV_BOX_HOST_DIR
# invoking with an argument names the container; that name will also be the preferred name for the workspace and the development branch
./hdb run HIVE-12121-asd
# when the terminal comes up
# issuing the following command will clone the sources based on your srcs dsl
srcs hive
# enter the hive dir and create a local branch based on your requirements
cd hive
git branch `hostname` apache/master
# if you need...patch the sources:
cdpd-patcher hive
# run a full rebuild
rebuild
# you may run eclipse
dev_eclipse
A shorter version exists for initializing upstream patch development
./hdb run HIVE-12121-asd
# this will clone the sources; create a branch named after the container's hostname; run a rebuild and open eclipse
hive_patch_development
beyond the "obvious" /bin
and /lib
folders there are some which might make it more clear how this works:
/work
- used to store downloaded and expanded artifacts
- if you switch to, say, apache hive 3.1.1 and then to some other version, you shouldn't need to wait for the download and expansion again
- this is mounted as a docker volume and shared between the containers
- files under /work are not changed
/active
- the /work folder may contain a number of versions of the same component; symbolic links here point to the actually used versions
- at any point, running ls -l /active gives a brief overview of the active components
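A hypothetical listing, trimmed to the symlinks - the component names and versions are illustrative only, matching the sw examples later in this document:

$ ls -l /active
hadoop -> /work/hadoop-3.1.0
hive -> /work/apache-hive-2.3.5-bin
tez -> /work/apache-tez-0.8.4-bin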
/home/dev
- this is the development home
/home/dev/hive
- the Hive sources; in case
HIVE_SOURCES
is set at launch time; this folder will be mapped to that directory on the host
/home/dev/host
- this is a directory shared with the host; can be used to exchange files (something.patch)
- will also contain the workspace "template"
- the bin directory under this folder will be linked as /home/dev/bin so that scripts can be shared between containers and the host
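As a sketch of how this sharing can be used (hello.sh is a made-up name; HIVE_DEV_BOX_HOST_DIR must be set as described earlier):

# on the host: drop a helper script into the shared bin directory
mkdir -p $HIVE_DEV_BOX_HOST_DIR/bin
cat > $HIVE_DEV_BOX_HOST_DIR/bin/hello.sh <<'EOF'
#!/bin/bash
echo "hello from a script shared with every container"
EOF
chmod +x $HIVE_DEV_BOX_HOST_DIR/bin/hello.sh
# inside a container the script should show up as /home/dev/bin/hello.sh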
- run NAME
  - starts a new container with NAME - without attaching to it
- enter NAME
  - enters the container
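For example (HIVE-99999-example is a made-up container name):

# start a new container without attaching to it
./hdb run HIVE-99999-example
# attach to it later
./hdb enter HIVE-99999-example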
# create a symlink to hive-dev-box/hdb from a location on your PATH; e.g. $HOME/bin
ln -s $PWD/hdb $HOME/bin/hdb
# enable bash_completion for hdb
# add the following line to .bashrc
. <($HOME/bin/hdb bash_completion)
# use hadoop 3.1.0
sw hadoop 3.1.0
# use hive 2.3.5
sw hive 2.3.5
# use tez 0.8.4
sw tez 0.8.4
The reinit_metastore command will:
- optionally switch to a different metastore implementation
- wipe it clean
- populate the schema and load sysdb

reinit_metastore [derby|postgres|mysql]
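Putting the pieces together - a sketch that test-drives an older release against a freshly initialized postgres metastore, using only the commands shown above (the version numbers are the examples from earlier):

# switch components, reinitialize the metastore, then start hive
sw hive 2.3.5
sw tez 0.8.4
reinit_metastore postgres
hive_launch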