forked from facebookincubator/AITemplate
Jenkins ci #40
Open
illsilin wants to merge 32 commits into amd-develop from jenkins-ci
Commits
24749d4  add jenkinsfile  (illsilin)
68438c1  put all test commands into a bash script  (illsilin)
1e8025a  fix syntax  (illsilin)
5b77355  remove unneccessary line  (illsilin)
997af70  remove cron trigger  (illsilin)
918b046  upgrade rocm to 5.4.3  (illsilin)
e19f6fa  get rid of execute_cmd  (illsilin)
0b68790  move python packages installation into the docker  (illsilin)
43099de  do not re-install AIT  (illsilin)
97b00eb  chmod for run_tests.sh  (illsilin)
aa7d847  add torchvision and torchaudio, set new HF cache to suppress errors  (illsilin)
ba27cfb  move HF cache to a different path  (illsilin)
43af738  create cache folder in steps  (illsilin)
9ad5bd2  assume /home/jenkins exists  (illsilin)
0af561b  use pre-built folder in docker for HF cache  (illsilin)
c489ef9  temporarily disable vit tests and update log paths  (illsilin)
0060f2e  skip all tests and go to SD, update dockerfile  (illsilin)
0954e8f  reduce the number of build threads by half  (illsilin)
955826c  further reduce the number of building threads  (illsilin)
763154b  change the order of archiving and stashing the logs  (illsilin)
457a488  test stashing the logs  (illsilin)
895afda  re-enable tests  (illsilin)
5ac7a91  only stash log files  (illsilin)
3fbddce  fix the parsing script  (illsilin)
6379419  minor changes to performance scripts  (illsilin)
2ec4432  minor changes to performance scripts  (illsilin)
5019556  rename logs, update processing  (illsilin)
2292048  report which files are being parsed  (illsilin)
ad7c61b  clean-up any old logs before unstashing new ones  (illsilin)
fe37b9e  optimize dockerfile  (fsx950223)
a72f4b9  reduce the number of tests in regular CI, add daily QA  (illsilin)
2cc565b  Merge branch 'jenkins-ci' of github.com:ROCmSoftwarePlatform/AITempla…  (illsilin)
@@ -0,0 +1,264 @@
def rocmnode(name) {
    return 'rocmtest && miopen && ' + name
}

def show_node_info() {
    sh """
        echo "NODE_NAME = \$NODE_NAME"
        lsb_release -sd
        uname -r
        ls /opt/ -la
    """
}

def runShell(String command){
    def responseCode = sh returnStatus: true, script: "${command} > tmp.txt"
    def output = readFile(file: "tmp.txt")
    echo "tmp.txt contents: $output"
    return (output != "")
}

def getDockerImageName(){
    def img
    img = "${env.CK_DOCKERHUB}:ait_rocm${params.ROCMVERSION}"
    return img
}

def getDockerImage(Map conf=[:]){
    env.DOCKER_BUILDKIT=1
    def prefixpath = conf.get("prefixpath", "/opt/rocm") // prefix:/opt/rocm
    def no_cache = conf.get("no_cache", false)
    def dockerArgs = "--build-arg BUILDKIT_INLINE_CACHE=1 --build-arg PREFIX=${prefixpath} --build-arg ROCMVERSION='${params.ROCMVERSION}' "
    echo "Docker Args: ${dockerArgs}"
    def image = getDockerImageName()
    //Check if image exists
    def retimage
    try
    {
        echo "Pulling image: ${image}"
        retimage = docker.image("${image}")
        retimage.pull()
    }
    catch(Exception ex)
    {
        error "Unable to locate image: ${image}"
    }
    return [retimage, image]
}

def build_ait(Map conf=[:]){

    def build_cmd = """
        export ROCM_PATH=/opt/rocm
        export ROC_USE_FGS_KERNARG=0
        python3 -c "import torch; print(torch.__version__)"
    """

    def cmd = conf.get("cmd", """
        ${build_cmd}
    """)

    echo cmd
    sh cmd
}

def Run_Step(Map conf=[:]){
    show_node_info()

    env.HSA_ENABLE_SDMA=0
    checkout scm

    def image = getDockerImageName()
    def prefixpath = conf.get("prefixpath", "/opt/rocm")

    // Jenkins is complaining about the render group
    def dockerOpts="--device=/dev/kfd --device=/dev/dri --group-add video --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
    if (conf.get("enforce_xnack_on", false)) {
        dockerOpts = dockerOpts + " --env HSA_XNACK=1 "
    }
    def dockerArgs = "--build-arg PREFIX=${prefixpath} --build-arg ROCMVERSION='${params.ROCMVERSION}' "
    def variant = env.STAGE_NAME
    def retimage

    gitStatusWrapper(credentialsId: "${status_wrapper_creds}", gitHubContext: "Jenkins - ${variant}", account: 'ROCmSoftwarePlatform', repo: 'AITemplate') {
        try {
            (retimage, image) = getDockerImage(conf)
            withDockerContainer(image: image, args: dockerOpts) {
                timeout(time: 5, unit: 'MINUTES'){
                    sh 'PATH="/opt/rocm/opencl/bin:/opt/rocm/opencl/bin/x86_64:$PATH" clinfo | tee clinfo.log'
                    if ( runShell('grep -n "Number of devices:.*. 0" clinfo.log') ){
                        throw new Exception ("GPU not found")
                    }
                    else{
                        echo "GPU is OK"
                    }
                }
            }
        }
        catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e){
            echo "The job was cancelled or aborted"
            throw e
        }

        withDockerContainer(image: image, args: dockerOpts + ' -v=/var/jenkins/:/var/jenkins') {
            timeout(time: 24, unit: 'HOURS') {
                build_ait(conf)
                dir("examples"){
                    // \$HF_TOKEN is escaped so that the shell, not Groovy, expands the secret
                    if (params.RUN_FULL_QA){
                        sh "./run_qa.sh \$HF_TOKEN ${env.BRANCH_NAME} ${NODE_NAME} ${params.ROCMVERSION}"
                    }
                    else{
                        sh "./run_tests.sh \$HF_TOKEN ${env.BRANCH_NAME} ${NODE_NAME} ${params.ROCMVERSION}"
                    }
                }
                dir("examples/01_resnet-50"){
                    archiveArtifacts "01_resnet50.log"
                    stash includes: "01_resnet50.log", name: "01_resnet50.log"
                }
                dir("examples/03_bert"){
                    archiveArtifacts "03_bert.log"
                    stash includes: "03_bert.log", name: "03_bert.log"
                }
                dir("examples/04_vit"){
                    archiveArtifacts "04_vit.log"
                    stash includes: "04_vit.log", name: "04_vit.log"
                }
                dir("examples/05_stable_diffusion/"){
                    archiveArtifacts "05_sdiff.log"
                    stash includes: "05_sdiff.log", name: "05_sdiff.log"
                }
            }
        }
    }
    return retimage
}

def Run_Step_and_Reboot(Map conf=[:]){
    try{
        Run_Step(conf)
    }
    catch(e){
        echo "throwing error exception while building AIT"
        echo 'Exception occurred: ' + e.toString()
        throw e
    }
    finally{
        if (!conf.get("no_reboot", false)) {
            reboot()
        }
    }
}

def process_results(Map conf=[:]){
    env.HSA_ENABLE_SDMA=0
    checkout scm
    def image = getDockerImageName()
    def prefixpath = "/opt/rocm"

    // Jenkins is complaining about the render group
    def dockerOpts="--cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
    if (conf.get("enforce_xnack_on", false)) {
        dockerOpts = dockerOpts + " --env HSA_XNACK=1 "
    }

    def variant = env.STAGE_NAME
    def retimage

    gitStatusWrapper(credentialsId: "${status_wrapper_creds}", gitHubContext: "Jenkins - ${variant}", account: 'ROCmSoftwarePlatform', repo: 'AITemplate') {
        try {
            (retimage, image) = getDockerImage(conf)
        }
        catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e){
            echo "The job was cancelled or aborted"
            throw e
        }
    }

    withDockerContainer(image: image, args: dockerOpts + ' -v=/var/jenkins/:/var/jenkins') {
        timeout(time: 1, unit: 'HOURS'){
            try{
                dir("examples"){
                    // clean up any old logs, then unstash perf files to master
                    sh "rm -rf *.log"
                    unstash "01_resnet50.log"
                    unstash "03_bert.log"
                    unstash "04_vit.log"
                    unstash "05_sdiff.log"
                    sh "python3 process_results.py"
                }
            }
            catch(e){
                echo "throwing error exception while processing performance test results"
                echo 'Exception occurred: ' + e.toString()
                throw e
            }
        }
    }
}

//launch amd-develop branch daily at 17:00 UT in FULL_QA mode
CRON_SETTINGS = BRANCH_NAME == "amd-develop" ? '''0 17 * * * % RUN_FULL_QA=true''' : ""

pipeline {
    agent none
    triggers {
        parameterizedCron(CRON_SETTINGS)
    }
    options {
        parallelsAlwaysFailFast()
    }
    parameters {
        string(
            name: 'ROCMVERSION',
            defaultValue: '5.4.3',
            description: 'Specify which ROCM version to use: 5.4.3 (default).')
        booleanParam(
            name: "RUN_FULL_QA",
            defaultValue: false,
            description: "Select whether to run small set of performance tests (default) or full QA")
    }
    environment{
        dbuser = "${dbuser}"
        dbpassword = "${dbpassword}"
        dbsship = "${dbsship}"
        dbsshport = "${dbsshport}"
        dbsshuser = "${dbsshuser}"
        dbsshpassword = "${dbsshpassword}"
        status_wrapper_creds = "${status_wrapper_creds}"
        HF_TOKEN = "${HF_TOKEN}"
        DOCKER_BUILDKIT = "1"
    }
    stages{
        stage("Build AITemplate")
        {
            parallel
            {
                stage("Build AIT and Run Tests")
                {
                    agent{ label rocmnode("gfx908 || gfx90a") }
                    steps{
                        Run_Step_and_Reboot(no_reboot:true, prefixpath: '/usr/local')
                    }
                }
            }
        }
        stage("Process Performance Test Results")
        {
            when {
                beforeAgent true
                expression { params.RUN_FULL_QA.toBoolean() }
            }
            parallel
            {
                stage("Process results"){
                    agent { label 'mici' }
                    steps{
                        process_results()
                    }
                }
            }
        }
    }
}
Put the pip install in a RUN command, and install sympy, recordtype, parameterized, einops, and jinja2 too.
Also add the pinned lint Python packages: pip install ufmt==2.0.1 click==8.1.3 black==22.12.0 flake8==5.0.4.
Please go ahead and add any packages you need.
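
For reference, the Dockerfile change requested in the review could look roughly like the fragment below. This is only a sketch, not the change actually made in this PR: it assumes the image already provides python3 and pip3, and the `--no-cache-dir` flag is an addition of this sketch to keep the layer small. The package names and pinned versions are the ones listed in the review comment.

```dockerfile
# Sketch only: a fragment to place after the base image and Python setup.
# Assumes pip3 is already available in the image.
RUN pip3 install --no-cache-dir \
        sympy recordtype parameterized einops jinja2 \
        ufmt==2.0.1 click==8.1.3 black==22.12.0 flake8==5.0.4
```

Keeping all packages in a single RUN instruction produces one image layer, so a change to the package list invalidates only that layer's cache.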