rocm debugging
d3v-null committed Nov 21, 2024
1 parent 5383146 commit e3f3a47
Showing 4 changed files with 119 additions and 0 deletions.
56 changes: 56 additions & 0 deletions 7800.md
@@ -0,0 +1,56 @@


```txt
ok
test fee::ffi::tests::test_calc_jones_gpu_via_ffi ... [Thread 0x7fffebfc7640 (LWP 17958) exited]
[New Thread 0x7fffebfc7640 (LWP 17959)]
Thread 104 "mwa_hyperbeam-8" received signal SIGBUS, Bus error.
[Switching to thread 104, lane 0 (AMDGPU Lane 1:1:1:1/0 (0,0,0)[0,0,0])]
0x00007fff68171608 in jones_p1sin_device (nmax=<error reading variable: Cannot access memory at address private_lane#0x446c>, theta=<error reading variable: Cannot access memory at address private_lane#0x4470>, p1sin_out=<error reading variable: Cannot access memory at address private_lane#0x4478>,
p1_out=<error reading variable: Cannot access memory at address private_lane#0x4480>) at src/fee/gpu/fee.h:306
306 p1sin_out[i] = Pm_sin_merged[modified];
(gdb) set args "fee::ffi::tests::test_calc_jones_gpu_via_ffi"
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/dev/src/mwa_hyperbeam/target/debug/deps/mwa_hyperbeam-8ec1f3aea357db81 "fee::ffi::tests::test_calc_jones_gpu_via_ffi"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
running 1 test
[New Thread 0x7fffebfc7640 (LWP 18447)]
[New Thread 0x7fffeb7ff640 (LWP 18448)]
[New Thread 0x7fffeaffe640 (LWP 18449)]
[Thread 0x7fffeaffe640 (LWP 18449) exited]
[New Thread 0x7fffe8bff640 (LWP 18450)]
[Thread 0x7fffe8bff640 (LWP 18450) exited]
[New Thread 0x7fffebb3f640 (LWP 18451)]
Thread 7 "mwa_hyperbeam-8" received signal SIGSEGV, Segmentation fault.
[Switching to thread 7, lane 0 (AMDGPU Lane 1:1:1:34/0 (16,0,0)[32,0,0])]
0x00007fffebb9ca8c in fee_kernel (coeffs=..., azs=0x0, zas=0x0, num_directions=0, norm_jones=0x0, latitude_rad=0x0, iau_order=0, fee_jones=0x0) at src/fee/gpu/fee.h:381
381 const JONES *norm_jones, const FLOAT *latitude_rad, const int iau_order, JONES *fee_jones) {
```



Works with a single direction (`N_DIRS=1`):

```bash
HYPERBEAM_HIP_ARCH=gfx1101 N_DIRS=1 DEBUG=1 RAYON_NUM_THREADS=1 cargo test --tests "fee::ffi::tests::test_calc_jones_gpu_via_ffi" --features=hip -- --test-threads=1
```

Doesn't work with `N_DIRS=999`:

```bash
HYPERBEAM_HIP_ARCH=gfx1101 N_DIRS=999 DEBUG=1 RAYON_NUM_THREADS=1 cargo test --tests "fee::ffi::tests::test_calc_jones_gpu_via_ffi" --features=hip -- --test-threads=1
```
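
Since the test passes at `N_DIRS=1` and fails at `N_DIRS=999`, one way to narrow things down is to bisect for the smallest failing value. The following is a minimal sketch, not something from this commit: `bisect_first_failure` and `run_test` are hypothetical names, and the search assumes that failure is monotonic in `N_DIRS` (every value above the first failing one also fails), which has not been verified here.

```bash
#!/usr/bin/env bash
# Find the smallest N in (LO, HI] for which COMMAND fails.
# Assumptions: COMMAND passes at LO, fails at HI, and failure is monotonic in N.
# Usage: bisect_first_failure LO HI COMMAND [ARGS...]
# COMMAND is run with N_DIRS set to each candidate value.
bisect_first_failure() {
    local lo=$1 hi=$2 mid
    shift 2
    while (( hi - lo > 1 )); do
        mid=$(( (lo + hi) / 2 ))
        if N_DIRS=$mid "$@" >/dev/null 2>&1; then
            lo=$mid   # still passes: the first failure is above mid
        else
            hi=$mid   # fails: the first failure is at or below mid
        fi
    done
    echo "$hi"
}

# Example use with the test command from this file (run_test is a hypothetical wrapper):
# run_test() {
#     HYPERBEAM_HIP_ARCH=gfx1101 DEBUG=1 RAYON_NUM_THREADS=1 \
#         cargo test --tests "fee::ffi::tests::test_calc_jones_gpu_via_ffi" \
#         --features=hip -- --test-threads=1
# }
# bisect_first_failure 1 999 run_test
```

Each probe reruns the full test, so the search costs roughly ten test runs for the range 1 to 999.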

```bash
# Build the test binary without running it, then debug the newest one under rocgdb.
HYPERBEAM_HIP_ARCH=gfx1101 DEBUG=1 cargo test --no-run
export test_bin=$(ls -1t target/debug/deps/mwa_hyperbeam-???????????????? | head -n 1)
N_DIRS=128 RAYON_NUM_THREADS=1 rocgdb --args "$test_bin" --test-threads=1 "fee::ffi::tests::test_calc_jones_gpu_via_ffi"

# Same steps as a one-liner, loading the rocgdbinit command file.
HYPERBEAM_HIP_ARCH=gfx1101 DEBUG=1 cargo test --no-run && export test_bin=$(ls -1t target/debug/deps/mwa_hyperbeam-???????????????? | head -n 1) N_DIRS=128 RAYON_NUM_THREADS=1 && rocgdb -x /home/dev/src/mwa_hyperbeam/rocgdbinit --args "$test_bin" --test-threads=1 "fee::ffi::tests::test_calc_jones_gpu_via_ffi"
```
2 changes: 2 additions & 0 deletions rocgdbinit
@@ -0,0 +1,2 @@
set amdgpu precise-memory on
run
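
The `rocgdbinit` file above stops at `run`, which leaves the session interactive once the fault is hit. As a rough sketch of a non-interactive variant, the helper below writes a slightly longer command file that also prints a backtrace and the thread list after the program stops. `write_rocgdb_commands` is a hypothetical helper name; `set pagination off`, `backtrace` and `info threads` are standard GDB commands that rocgdb is assumed to inherit, and only the `precise-memory` setting comes from this commit.

```bash
#!/usr/bin/env bash
# Write a rocgdb command file to the path given as $1.
# Commands after `run` execute once the program stops (for example on a signal).
write_rocgdb_commands() {
    local out=$1
    cat > "$out" <<'EOF'
set pagination off
set amdgpu precise-memory on
run
backtrace
info threads
EOF
}

# Example use (assumes rocgdb accepts GDB's -batch option and $test_bin is set):
# write_rocgdb_commands /tmp/rocgdb.cmds
# N_DIRS=128 RAYON_NUM_THREADS=1 rocgdb -batch -x /tmp/rocgdb.cmds \
#     --args "$test_bin" --test-threads=1 "fee::ffi::tests::test_calc_jones_gpu_via_ffi"
```

If the program exits normally, `backtrace` has no stack to print and the rest of the command file is skipped.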
47 changes: 47 additions & 0 deletions rocmtest.sh
@@ -0,0 +1,47 @@
LIBCLANG_PATH=/opt/rocm/llvm/lib RUST_BACKTRACE=1 cargo run --example fee_hip 1000 /opt/cal/mwa_full_embedded_element_pattern.h5



module load singularity/3.11.4-nohost
cd $MYSOFTWARE
[ -f mwa_full_embedded_element_pattern.h5 ] || wget http://ws.mwatelescope.org/static/mwa_full_embedded_element_pattern.h5
git clone https://github.com/MWATelescope/mwa_hyperbeam.git --branch=setonix
cd mwa_hyperbeam

cat <<EOF > test.sh
#!/bin/bash
export ROCM_FULLVER=\$(/opt/rocm/bin/hipconfig --version 2>&1)
echo "ROCM Version: \$ROCM_FULLVER" | tee -a fee_hip.log
echo "start: \$(date -Is)" | tee -a fee_hip.log
export RUSTUP_HOME=/tmp/rust CARGO_HOME=/tmp/cargo PATH=/tmp/cargo/bin:\$PATH
mkdir -p -m755 \$RUSTUP_HOME \$CARGO_HOME
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --quiet \
--profile=minimal --default-toolchain=1.74
# rustup writes its env file under CARGO_HOME, not \$HOME/.cargo
. "\$CARGO_HOME/env"
export ROCM_PATH=\${ROCM_PATH:-/opt/rocm}
export LIBCLANG_PATH=\$ROCM_PATH/llvm/lib RUST_BACKTRACE=1
# seq 0 10 | sed 's/^/scale=99; a=e(/; s/$/ * l(10)); scale=0; a\/1/' | bc -l
for ndir in 1 9 99 999 9999 99999 999999; do
echo "ndir=\$ndir" | tee -a fee_hip.log
time cargo run --example=fee_hip --features=hip --quiet -- \$ndir mwa_full_embedded_element_pattern.h5 | tee -a fee_hip.log
done
echo "end: \$(date -Is)" | tee -a fee_hip.log
EOF
chmod +x test.sh
for ROCM_VER in 5.4.6 5.6.1 5.7.3; do # 6.0.2 6.1; do
echo $ROCM_VER;
export TAG="v0.3.0-setonix-rocm${ROCM_VER}"
# singularity pull --force docker://d3vnull0/hyperdrive:$TAG
# NOTE: the image below is pinned to ROCm 6.1, so ROCM_VER currently only affects TAG.
singularity exec --rocm \
    --bind "$PWD:/hyperbeam" \
    --workdir /hyperbeam \
    --writable-tmpfs \
    --cleanenv \
    docker://rocm/dev-ubuntu-22.04:6.1-complete \
    ./test.sh
done

# docker://d3vnull0/hyperdrive:$TAG \
# --bind $PWD/../hip-sys:/hip-sys
# docker://quay.io/pawsey/rocm-mpich-base:rocm${ROCM_VER}-mpich3.4.3-ubuntu22
14 changes: 14 additions & 0 deletions test.sh
@@ -0,0 +1,14 @@
#!/bin/bash
echo "ROCM Version: $(/opt/rocm/bin/hipconfig --version 2>&1)" | tee -a fee_hip.log
echo "start: $(date -Is)" | tee -a fee_hip.log
export RUSTUP_HOME=/tmp/rust CARGO_HOME=/tmp/cargo PATH=/tmp/cargo/bin:$PATH
mkdir -p -m755 $RUSTUP_HOME $CARGO_HOME
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --quiet --profile=minimal --default-toolchain=1.74
# rustup writes its env file under CARGO_HOME, not $HOME/.cargo
. "$CARGO_HOME/env"
export ROCM_PATH=${ROCM_PATH:-/opt/rocm}
export LIBCLANG_PATH=$ROCM_PATH/llvm/lib RUST_BACKTRACE=1
for ndir in 1 10 100 1000 10000 1000000; do
echo "ndir=$ndir" | tee -a fee_hip.log
cargo run --example=fee_hip --features=hip -- "$ndir" mwa_full_embedded_element_pattern.h5 | tee -a fee_hip.log
done
echo "end: $(date -Is)" | tee -a fee_hip.log
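
In `test.sh`, each `cargo run` is piped into `tee`, so the exit status the loop sees is `tee`'s, and a crash in the example can go unnoticed unless the log is read. A minimal sketch of one way to keep the real status, using bash's `PIPESTATUS` array; `run_logged` is a hypothetical helper name and is not part of this commit.

```bash
#!/usr/bin/env bash
# Run a command, append its stdout and stderr to a log file, and return the
# command's own exit status rather than tee's.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    local log=$1
    shift
    "$@" 2>&1 | tee -a "$log"
    # PIPESTATUS[0] holds the exit status of the first command in the pipeline.
    return "${PIPESTATUS[0]}"
}

# Example use in the ndir loop:
# for ndir in 1 10 100; do
#     run_logged fee_hip.log cargo run --example=fee_hip --features=hip -- \
#         "$ndir" mwa_full_embedded_element_pattern.h5 \
#         || echo "ndir=$ndir failed with status $?" | tee -a fee_hip.log
# done
```

Setting `set -o pipefail` at the top of the script is an alternative, though it reports only that some stage of the pipeline failed, not which one.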
