In this quickstart guide, we'll walk through the steps for ROCm installation, run a few TensorFlow workloads, and discuss FAQs and tips.
For basic installation instructions for ROCm and TensorFlow, please see this doc.
We also provide Docker images for quick deployment on Docker Hub: https://hub.docker.com/r/rocm/tensorflow
Now that we've got ROCm and TensorFlow installed, we'll want to clone the tensorflow/models
repo, which provides numerous useful workloads:
cd ~
git clone https://github.com/tensorflow/models.git
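Before running the workloads below, it's worth confirming that TensorFlow can actually see the GPU. A minimal check, using the TF1-era device_lib API that these tutorials target:
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can use; a working ROCm GPU shows up
# as an entry with device_type "GPU".
print(device_lib.list_local_devices())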
The following sections include the instructions for running various workloads. They also include expected results, which may vary slightly from run to run.
Here are the basic instructions for the MNIST convolutional workload:
cd ~/models/tutorials/image/mnist
python3 ./convolutional.py
And here is what we expect to see:
Step 0 (epoch 0.00), 165.1 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 8.0 ms
Minibatch loss: 3.232, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 7.6%
Step 200 (epoch 0.23), 8.1 ms
Minibatch loss: 3.355, learning rate: 0.010000
Minibatch error: 9.4%
Validation error: 4.4%
Step 300 (epoch 0.35), 8.1 ms
Minibatch loss: 3.147, learning rate: 0.010000
Minibatch error: 3.1%
Validation error: 2.9%
...
Step 8500 (epoch 9.89), 7.2 ms
Minibatch loss: 1.609, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 1.0%
Test error: 0.8%
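For reference, the "error" figures above are 100 minus top-1 accuracy. A sketch of the computation, mirroring the tutorial's error_rate helper (exact names may differ):
import numpy as np

def error_rate(predictions, labels):
    # predictions: (batch, num_classes) scores; labels: (batch,) integer class ids
    correct = np.sum(np.argmax(predictions, axis=1) == labels)
    return 100.0 - (100.0 * correct / predictions.shape[0])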
Details for this workload can be found at this link.
For the CIFAR-10 workload, we'll run two simultaneous processes from different terminals: one for training and one for evaluation.
Run the training:
cd ~/models/tutorials/image/cifar10
export ROCR_VISIBLE_DEVICES=0
python3 ./cifar10_train.py
You should see output similar to this:
2017-10-04 17:33:39.246053: step 0, loss = 4.66 (72.3 examples/sec; 1.770 sec/batch)
2017-10-04 17:33:39.536988: step 10, loss = 4.64 (4399.5 examples/sec; 0.029 sec/batch)
2017-10-04 17:33:39.794230: step 20, loss = 4.49 (4975.8 examples/sec; 0.026 sec/batch)
2017-10-04 17:33:40.050329: step 30, loss = 4.33 (4998.1 examples/sec; 0.026 sec/batch)
2017-10-04 17:33:40.255417: step 40, loss = 4.36 (6241.7 examples/sec; 0.021 sec/batch)
2017-10-04 17:33:40.448037: step 50, loss = 4.40 (6644.5 examples/sec; 0.019 sec/batch)
2017-10-04 17:33:40.640150: step 60, loss = 4.20 (6662.7 examples/sec; 0.019 sec/batch)
2017-10-04 17:33:40.832118: step 70, loss = 4.23 (6667.8 examples/sec; 0.019 sec/batch)
2017-10-04 17:33:41.017503: step 80, loss = 4.30 (6904.7 examples/sec; 0.019 sec/batch)
2017-10-04 17:33:41.208288: step 90, loss = 4.21 (6709.0 examples/sec; 0.019 sec/batch)
Note: If you have a second GPU, you can run the evaluation in parallel with the training. To do so, change ROCR_VISIBLE_DEVICES to your second GPU's ID (e.g. export ROCR_VISIBLE_DEVICES=1) before launching the evaluation. If you only have a single GPU, it is best to wait until training is complete; otherwise you risk running out of device memory.
To run the evaluation:
cd ~/models/tutorials/image/cifar10
export ROCR_VISIBLE_DEVICES=0
python3 ./cifar10_eval.py
Using the most recent training checkpoints, this script indicates how often the top prediction matches the true label of the image. You should see periodic output similar to this:
2017-10-05 18:34:40.288277: precision @ 1 = 0.118
2017-10-05 18:39:45.989197: precision @ 1 = 0.118
2017-10-05 18:44:51.644702: precision @ 1 = 0.836
2017-10-05 18:49:57.354438: precision @ 1 = 0.836
2017-10-05 18:55:02.960087: precision @ 1 = 0.856
2017-10-05 19:00:08.752611: precision @ 1 = 0.856
2017-10-05 19:05:14.307137: precision @ 1 = 0.861
...
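Here, "precision @ 1" is top-1 accuracy: the fraction of test images whose highest-scoring prediction matches the true label. In numpy terms, roughly:
import numpy as np

def precision_at_1(logits, labels):
    # logits: (batch, num_classes); labels: (batch,) integer class ids
    return float(np.mean(np.argmax(logits, axis=1) == labels))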
Details can be found at this link.
Next, for the ResNet workload, set up the CIFAR-10 dataset:
cd ~/models/research/resnet
curl -o cifar-10-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
tar -xzf cifar-10-binary.tar.gz
ln -s ./cifar-10-batches-bin ./cifar10
Train ResNet:
python3 ./resnet_main.py --train_data_path=cifar10/data_batch* \
--log_root=/tmp/resnet_model \
--train_dir=/tmp/resnet_model/train \
--dataset='cifar10' \
--num_gpus=1
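For context, resnet_main.py trains a ResNet, whose defining ingredient is the residual (skip) connection. A minimal TF1-style sketch of one residual block, illustrative only rather than the script's actual code:
import tensorflow as tf

def residual_block(x, filters):
    # Two 3x3 convolutions with a skip connection around them; assumes
    # the input already has `filters` channels so the shapes match.
    shortcut = x
    y = tf.layers.conv2d(x, filters, 3, padding="same", activation=tf.nn.relu)
    y = tf.layers.conv2d(y, filters, 3, padding="same")
    return tf.nn.relu(y + shortcut)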
Here are the expected results (note the precision
metric in particular):
INFO:tensorflow:loss = 2.53745, step = 1, precision = 0.125
INFO:tensorflow:loss = 1.9379, step = 101, precision = 0.40625
INFO:tensorflow:loss = 1.68374, step = 201, precision = 0.421875
INFO:tensorflow:loss = 1.41583, step = 301, precision = 0.554688
INFO:tensorflow:loss = 1.37645, step = 401, precision = 0.5625
...
INFO:tensorflow:loss = 0.485584, step = 4001, precision = 0.898438
...
Details can be found at this link.
Here's how to run the ImageNet classification workload (the script also accepts an --image_file argument if you want to classify your own image):
cd ~/models/tutorials/image/imagenet
python3 ./classify_image.py
Here are the expected results:
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296)
custard apple (score = 0.00147)
earthstar (score = 0.00117)
Details on the tf_cnn_benchmarks can be found at this link.
Here are the basic instructions:
# Grab the repo
cd $HOME
git clone -b cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
cd benchmarks
# Run the training benchmark (e.g. ResNet-50)
python3 ./scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --num_gpus=1
Details on the PerfZero ResNet-50 benchmark can be found at this link.
Here are the basic instructions:
# Grab Perfzero
git clone https://github.com/tensorflow/benchmarks
# Grab the model
git clone https://github.com/ROCmSoftwarePlatform/models rocm-models
# Install prerequisites and set exports
cd rocm-models
pip3 install --upgrade pip setuptools
export PYTHONPATH="$PYTHONPATH:$HOME/rocm-models"
export HIP_HIDDEN_FREE_MEM=500
pip3 install --user -r official/requirements.txt
pip3 install py_cpuinfo==5
# Run the training benchmark
# This benchmark configuration uses the following parameters:
# - 1 GPU
# - model = resnet50 v1.5
# - precision = fp32
# - batch size = 64
python3 /root/benchmarks/perfzero/lib/benchmark.py --gcloud_key_file_url="" --python_path=models --benchmark_methods=official.r1.resnet.estimator_benchmark.Resnet50EstimatorBenchmarkSynth.benchmark_graph_1_gpu
# The following command is similar to the previous but uses 8 GPUs
python3 /root/benchmarks/perfzero/lib/benchmark.py --gcloud_key_file_url="" --python_path=models --benchmark_methods=official.r1.resnet.estimator_benchmark.Resnet50EstimatorBenchmarkSynth.benchmark_graph_8_gpu
As a temporary workaround, if your workload runs out of device memory, you can either reduce the batch size or set config.gpu_options.allow_growth = True.
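In TF1-style session code, that workaround looks like the following minimal sketch (TF2 code would use tf.config.experimental.set_memory_growth instead):
import tensorflow as tf

# Allocate GPU memory on demand rather than grabbing it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)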
We build ROCm Docker images for every tensorflow-rocm commit. These images have the latest tensorflow-rocm installed and are intended for testing.
Docker image name: rocm<version>-<commit hash>
Latest docker image names: rocm<version>-latest and latest
Pull instructions: $ docker pull rocm/tensorflow-autobuilds:latest
We also build dev images for every dependency change. These images have the latest dependencies needed to build tensorflow-rocm and are intended for development.
Docker image name: dev-<commit hash>
Latest docker image name: dev-latest
Pull instructions: $ docker pull rocm/tensorflow-autobuilds:dev-latest
To finetune GPT-2 with Hugging Face Transformers and TensorFlow 2.x on AMD GPUs, first enter a standard ROCm docker image:
alias drun='sudo docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -w /dockerx'
drun rocm/tensorflow-autobuilds:rocm4.2.0-latest
Clone the OpenAI GPT-2 output dataset repository and run its download script. This downloads the training sets for the various model sizes and stores them in a directory named "data", created wherever the command is run.
mkdir -p /data/tf-gpt-2 && cd /data/tf-gpt-2
git clone https://github.com/openai/gpt-2-output-dataset.git
if [ -d "/data/tf-gpt-2/data" ]
then
echo "Directory TF GPT-2 data exists."
else
echo "Directory TF GPT-2 data does not exist, pulling data."
pip3 install tqdm
python3 gpt-2-output-dataset/download_dataset.py
fi
pip3 install transformers jsonlines
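With jsonlines installed, you can sanity-check the downloaded data before training. A quick sketch; the small-117M.train.jsonl filename is an assumption based on the gpt-2-output-dataset naming scheme:
import jsonlines

# Print the start of the first few training records.
with jsonlines.open("/data/tf-gpt-2/data/small-117M.train.jsonl") as reader:
    for i, record in enumerate(reader):
        print(record["text"][:80])
        if i >= 2:
            break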
cd ~ && git clone https://github.com/ROCmSoftwarePlatform/transformers
# Script to train the Small 117M model
python3 transformers/scripts/gpt2-tf2/gpt2_train.py "Small" "/data/tf-gpt-2/data/" 1 1
# Script to train the Medium 345M model
python3 transformers/scripts/gpt2-tf2/gpt2_train.py "Medium" "/data/tf-gpt-2/data/" 1 1
# Script to train the Large 762M model
python3 transformers/scripts/gpt2-tf2/gpt2_train.py "Large" "/data/tf-gpt-2/data/" 1 1
# Script to train the XL 1542M model
python3 transformers/scripts/gpt2-tf2/gpt2_train.py "XL" "/data/tf-gpt-2/data/" 1 1
The first argument selects the GPT-2 model: "Small" (117M), "Medium" (345M), "Large" (762M), or "XL" (1542M). Using "Small" not only loads the pretrained small model but also selects the matching data file in the data folder. The second argument is the location of the training data. The third argument tells the trainer how many epochs to train for, and the final argument controls dataset truncation: if it is 1, the dataset is truncated to 1000 records; if 0, the full dataset (~250000 records) is used. After finetuning, the script also evaluates the model with the appropriate dataset (Small, Medium, Large, XL).
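Under the hood, these scripts build on the Hugging Face TF2 GPT-2 classes. As a minimal sketch (illustrative, not the scripts' actual code), loading a pretrained model and computing a language-modeling loss looks roughly like this; the Hub names "gpt2", "gpt2-medium", "gpt2-large", and "gpt2-xl" correspond to Small/Medium/Large/XL:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # "Small" (117M)
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Tokenize a sample and compute the causal LM loss; the model shifts
# the labels internally, so labels can simply be the input ids.
enc = tokenizer("ROCm runs TensorFlow on AMD GPUs", return_tensors="tf")
outputs = model(enc["input_ids"], labels=enc["input_ids"])
print("loss:", float(outputs.loss))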