Evaluation of power controls and counters on general-purpose Graphics Processing Units (GPUs)

General-purpose graphics processing units (GPUs) are becoming vital processing elements in high-end computing systems. A single advanced GPU may consume about as much power (300 watts) as a high-performance computing (HPC) node. Next-generation HPC systems could have up to 16 GPUs per node, requiring multiple kilowatts per node. Consequently, it is essential to study the characteristics of GPU power consumption and control to inform future procurement and design decisions.

This repository is intended to a) launch GPU benchmarks, b) generate GPU metrics data, including power consumption, for the launched benchmarks, and c) analyze GPU power consumption patterns.

Getting Started

These instructions will help you get a copy of the project up and running on your GPU-enabled HPC node so that you can:

launch a GPU benchmark/workload
generate GPU metrics data
analyze the GPU power consumption pattern of the benchmark/workload
visualize the collected data

See deployment for notes on how to deploy the project on a GPU-enabled HPC system.

Prerequisites

NVIDIA CUDA Toolkit 10.1 or higher. For details, see NVIDIA CUDA Installation
Python 3 with the required modules (e.g., numpy, pandas, matplotlib)
A compiled workload/benchmark

Overall Deployment Framework

The following diagram shows the GPU Power Control & Counters Collection Framework.

The framework consists of three modules:

The Test module includes two directories (applications and data), two scripts (app_loadscaling and app_run2run), and analyze_data.

The applications directory is a placeholder for the executable application whose power consumption behavior is to be analyzed.
The data directory stores the GPU metrics data collected while the application runs.
The app_loadscaling script launches the application with different inputs, i.e., matrix sizes of 5K, 10K, 15K, 20K, and 25K.
The app_run2run script launches the application multiple times to capture its run-to-run power variations.
analyze_data analyzes the GPU power data by calling the related functions in the analysis module.

The Utility module includes control and profile. control enforces the required GPU power control, executes the profile script, and finally resets the GPU control parameters to their default configuration. profile launches the application and starts the nvidia-smi command to collect GPU metrics. Once the application completes, profile terminates the nvidia-smi command.
The Power Analysis Engine analyzes the workload's power consumption from different perspectives, including a grouped bar plot (of all workloads with different power controls), a box plot (summary of the power consumption of different benchmarks), power variations, power consumption with varying load, and performance analysis (GFLOP/s, GB/s).
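The sampler lifecycle described for profile (start collecting, run the application, stop collecting) can be sketched as follows. This is a minimal illustration, not the repository's actual profile script: the function name run_with_sampler is hypothetical, and the demo at the bottom uses stand-in Python commands in place of nvidia-smi and the real workload so that it runs on any machine.

```python
import subprocess
import sys


def run_with_sampler(sampler_cmd, app_cmd):
    """Start a background sampler, run the application, then stop the sampler.

    Both arguments are argv lists. Returns the application's exit code.
    (Hypothetical helper; the repository's profile script may differ.)
    """
    sampler = subprocess.Popen(sampler_cmd)
    try:
        # Block until the workload finishes; the sampler keeps running meanwhile.
        app_exit_code = subprocess.run(app_cmd).returncode
    finally:
        # Always stop the sampler, even if the workload could not be launched.
        sampler.terminate()
        try:
            sampler.wait(timeout=5)
        except subprocess.TimeoutExpired:
            sampler.kill()
            sampler.wait()
    return app_exit_code


if __name__ == "__main__":
    # Stand-ins: a long-running "sampler" and a short "workload".
    demo_sampler = [sys.executable, "-c", "import time; time.sleep(60)"]
    demo_app = [sys.executable, "-c", "print('workload done')"]
    print("exit code:", run_with_sampler(demo_sampler, demo_app))
```

On a GPU node, sampler_cmd would be the nvidia-smi invocation and app_cmd the benchmark executable. The try/finally ensures the sampler is stopped even when the workload exits abnormally.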

Running the tests

In order to generate the GPU metrics data and then perform different aspects of power analysis, follow these steps:

Clone the repository on the target GPU-enabled HPC node:

git clone https://github.com/nsfcac/gpupowermodel.git

Configuration of Utility Functions

GPU Control parameters

Edit the control script to enable or disable individual control functions (e.g., power limit, frequency):

vi gpupower/utility/control
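As an illustration of the kind of controls involved, the sketch below builds nvidia-smi command lines for a power limit and application clocks, plus the matching reset commands. It is an assumption about what such a script might issue, not a copy of the control script: the function names are hypothetical, the numeric values in the usage note are placeholders, and the commands are only constructed here, not executed (running them typically requires administrative privileges).

```python
def power_control_commands(gpu_index, power_limit_watts=None, app_clocks_mhz=None):
    """Return nvidia-smi argv lists that apply the requested controls.

    app_clocks_mhz is a (memory_mhz, graphics_mhz) pair. Hypothetical helper.
    """
    base = ["nvidia-smi", "-i", str(gpu_index)]
    commands = []
    if power_limit_watts is not None:
        # -pl sets the board power limit in watts.
        commands.append(base + ["-pl", str(power_limit_watts)])
    if app_clocks_mhz is not None:
        memory_mhz, graphics_mhz = app_clocks_mhz
        # -ac sets application clocks as "<memory>,<graphics>" in MHz.
        commands.append(base + ["-ac", f"{memory_mhz},{graphics_mhz}"])
    return commands


def reset_commands(gpu_index, default_power_limit_watts):
    """Return argv lists that restore the default power limit and clocks."""
    base = ["nvidia-smi", "-i", str(gpu_index)]
    return [
        base + ["-pl", str(default_power_limit_watts)],
        base + ["-rac"],  # -rac resets application clocks to their defaults
    ]
```

For example, power_control_commands(0, power_limit_watts=200) yields a single command that caps GPU 0 at 200 W. The default limit for a given board can be read beforehand with nvidia-smi --query-gpu=power.default_limit --format=csv.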

GPU Profile Parameters

Edit the profile script to change the profile parameters (e.g., nvidia-smi sampling rate, adding or removing GPU metrics):

vi gpupower/utility/profile
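To make the two profile parameters concrete, the following sketch assembles an nvidia-smi sampling command from a list of metrics and a sampling interval. The function name, the default metric list, and the output file name are illustrative assumptions; the actual profile script may select different metrics or options.

```python
# Example metric names accepted by nvidia-smi --query-gpu (an illustrative subset).
DEFAULT_METRICS = ["timestamp", "power.draw", "utilization.gpu", "temperature.gpu"]


def build_sampling_command(metrics, interval_ms, output_csv, gpu_index=0):
    """Return the nvidia-smi argv list for periodic CSV sampling.

    Hypothetical helper: it only builds the command and does not run it.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(metrics),
        "--format=csv,nounits",
        f"--loop-ms={interval_ms}",    # sampling period in milliseconds
        f"--filename={output_csv}",    # write samples to a file instead of stdout
    ]
```

Calling build_sampling_command(DEFAULT_METRICS, 100, "gpu_metrics.csv") gives a command that samples GPU 0 every 100 ms. Adding or removing a metric is then a one-line change to the list.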

Data Generation

cd gpupower/test

Collect GPU metrics data across repeated runs of the application (run-to-run variation):

./app_run2run.sh

Collect GPU metrics data for the application with different matrix sizes:

./app_loadscaling.sh

Data Analysis

Edit analyze_data.py available at: gpupower/test
Run analyze_data.py to generate power analysis plots:

python3 analyze_data.py
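For readers who want a quick sanity check of the collected data before plotting, here is a small standard-library sketch that summarizes the power column of an nvidia-smi CSV log. It is not part of analyze_data.py; the function name summarize_power is hypothetical, and it assumes the log has a header row containing a column whose name starts with power.draw.

```python
import csv
import io
import statistics


def summarize_power(csv_text):
    """Summarize the power.draw column of nvidia-smi CSV output.

    Returns a dict with the sample count and min/mean/max/std in watts.
    Non-numeric entries such as "[N/A]" are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    power_col = next(
        (i for i, name in enumerate(header) if name.strip().startswith("power.draw")),
        None,
    )
    if power_col is None:
        raise ValueError("no power.draw column found in header")

    samples = []
    for row in reader:
        if len(row) <= power_col:
            continue  # skip blank or truncated lines
        value = row[power_col].replace("W", "").strip()  # tolerate a trailing unit
        try:
            samples.append(float(value))
        except ValueError:
            continue  # e.g. "[N/A]"
    if not samples:
        raise ValueError("no numeric power samples found")

    return {
        "samples": len(samples),
        "min_w": min(samples),
        "mean_w": statistics.fmean(samples),
        "max_w": max(samples),
        "std_w": statistics.pstdev(samples),
    }
```

A typical use is summarize_power(open(path).read()) for each file in the data directory; comparing mean_w and std_w across repeated runs gives a first impression of run-to-run variation.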

Testing with Historical GPU data

TBD

Contributing

Further contributions to enhance and extend this work are welcome.

Technical Support

If you encounter any technical issue in reproducing these results, you are welcome to contact the Texas Tech University (TTU) developer: ghazanfar.ali@ttu.edu

Authors

Mr. Ghazanfar Ali Sanpal, PhD student, Texas Tech University
Mr. Mert Side, PhD student, Texas Tech University
Dr. Sridutt Bhalachandra, Lawrence Berkeley National Laboratory
Dr. Nicholas Wright, Lawrence Berkeley National Laboratory

License

This project is licensed under the BSD 3-Clause License.

Acknowledgments

The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation. This research is supported in part by the National Science Foundation under grants CNS-1817094, OAC-1835892, and CNS-1939140 (A U.S. National Science Foundation Industry-University Cooperative Research Center on Cloud and Autonomic Computing).
