v0.6.0 - 2022-04-18

Build

  • cmake: support OpenMP for benchmark executable target when performing linking
  • cmake: reorganize benchmark device-side libs and pass the missing HIP_NVCC_FLAGS to flat src
  • cmake: add a cmake option SPMV_BUILD_BENCHMARK to disable/enable benchmark building
  • cmake: reorganize cmake script under directory benchmark
  • cmake: add cmake config for benchmark executable binary
  • cmake: set CXX standard to 14 for using std::unique_ptr
  • cmake: move clipp.h header to directory third-party

Docs

  • changelog: change repo url in changelog.md to github
  • changelog: update changelog for v0.6.0
  • readme: add document of fetching dependency clipp
  • suitesparse-dl: add document for "matrices with the same name"
  • suitesparse-dl: add document of extracting matrix market from .tar.gz file

Feat

  • access row_offset correctly when host-side access to the device row_offset is not supported
  • benchmark: change style of performance statistics log: split header and data
  • benchmark: add benchmark support for SpMV-acc strategies and cuSPARSE in func test_spmv
  • benchmark: stronger verification for benchmark
  • benchmark: destroy host/device data after test
  • benchmark: verify after outputting statistics
  • benchmark: add the adaptive line kernel to benchmark
  • benchmark: benchmark support for cub CsrMV on CUDA
  • benchmark: benchmark support for hola CsrMV on CUDA
  • benchmark: throw runtime_error of hola device SpMV
  • benchmark: record preproc time, calc time and destroy time in statistics data
  • benchmark: record strategy name in statistics data
  • benchmark: disable benchmarking the default sequential spmv method on device side
  • benchmark: benchmark support for rocsparse SpMV: vector and csr-adaptive
  • benchmark: benchmark support for hola CsrMV on hip
  • benchmark: disable benchmark of thread-row strategy in spmv-acc
  • cli: show more detailed error message in HIP_CHECK macro
  • cli: add matrix market format reading support
  • cli: use clipp to parse command line argument
  • cli: support to read csr-binary format converted by suitesparse-dl conv sub-command
  • cli: update the cli and benchmark help message: value of -f can be bin
  • cli: add OpenMP support for data sorting when converting COO to CSR
  • cli: add OpenMP support to matrix-market input parsing
  • common: support reduction on CUDA platform and make DPP reduction available only on ROCm
  • common: add compatible code to make nontemporal-load/store, fma and atomicAdd work on nvidia
  • common: mv data-load assembly to platform/rocm and add compatible data loading code for CUDA
  • kernel-light: support the case of "warp size is 32" (NVIDIA GPU) for light strategy
  • kernel-line: replace the origin line method with adaptive-line method
  • kernel-line: adjust the value of ROW_NUM in adaptive line calculation
  • kernel-line: add adaptive line-kernel: switch between line and vector-row
  • kernel-line: combine the LDS space used by line and vector-row algorithm in adaptive line
  • kernel-line-enhance: support memory coalescing in vector reduction when storing data back
  • kernel-line-enhance: add a static assert to check vector number and rows assigned to a block
  • kernel-vector-row: support to move data from thread 0 of vectors to front lanes on CUDA
  • kernel-vector-row: support the case of "warp size is 32" (NVIDIA GPU)
  • scripts: add a compiler wrapper to remove unrecognized argument -std=gnu++14 for nvcc
  • suitesparse-dl: add cli flags to specify the data dir and the number of goroutines for downloading
  • suitesparse-dl: support to convert matrix with integer type as value
  • suitesparse-dl: add feature of generating sbatch file from matrices data for job submitting
  • suitesparse-dl: add fetch subcommand for fetching collection metadata
  • suitesparse-dl: download matrices into different NNZ categories
  • suitesparse-dl: add cli flags parsing and use dl sub-command for downloading
  • suitesparse-dl: support to convert matrix market format to binary CSR file format
  • suitesparse-dl: stop downloading if there is any error in downloading process
  • suitesparse-dl: skip the matrix downloading if the file exists
  • suitesparse-dl: download matrix to temporary file and then rename to the final file
  • suitesparse-dl: add a tool for downloading matrices of matrix market format from SuiteSparse
  • third-party: specify version and apply changes when or after downloading hola
  • third-party: add a prompt to set the WARP_SIZE after downloading hip-hola

Fix

  • benchmark: correct verification for cub and hola spmv which only compute y=Ax
  • benchmark: add sync before launching kernel to obtain a more accurate kernel execution time
  • benchmark: fix cub SpMV invalid device function
  • cli: fix incorrect nnz assertion while reading matrix market format
  • cli: fix Segmentation fault while using csr matrix data after variable csr_reader is released
  • cmake: add missing path third-party/include to cmake include dir in benchmark building
  • cmake: fix typos and incorrect path in benchmark cmake script so that hola-hip compiles
  • compile: correct possible compiling error "error: size of array 'buf' is negative"
  • compile: fix possible compiling error of determining template type under some compilers
  • compile: fix compiling errors in benchmark and cli building
  • compile: fix compiling issues (with OpenMP support) on nvidia platform
  • compile: add the missing template param for func dpp_wf_reduce_sum on ROCm platform
  • demo: fix possible crash when allocating buffer memory for csr file reading in demo
  • kernel-flat: fix compiling error: flat_sparse_spmv declaration does not match the definition
  • kernel-line: fix incorrect data storing logic of vector-row method in adaptive line
  • kernel-line-enhance: fix vector-based reduction that could get stuck on NVIDIA platform
  • suitesparse-dl: fix incorrect "NNZ" values stored into binary csr file header
  • suitesparse-dl: fix compiling errors in fetch sub-command
  • suitesparse-dl: fix parsing error when the matrix data type is complex
  • third-party: bump hola-hip version to fix building errors from hipFree and CSR<double> template

Merge

  • Merge pull request #27 from hpcde/fix-error-of-nvidia-support
  • Merge pull request #25 from hpcde/feature-nvidia-support
  • Merge pull request #35 from hpcde/fix-hola-hip-building-errors
  • benchmark: Merge pull request #38 from hpcde/hotfix-benchmark-building-and-testing
  • benchmark: Merge pull request #36 from hpcde/enhanced-HIP_CHECK-and-verify-for-cli-and-benchmark
  • benchmark: Merge pull request #30 from hpcde/benchmarks-improves
  • benchmark: Merge pull request #28 from hpcde/feature-benchmark
  • benchmark: Merge pull request #39 from hpcde/refactor-benchmark
  • cli: Merge pull request #26 from hpcde/feature-openmp-matrix-market-reading
  • cli: Merge pull request #31 from hpcde/feature-csr-binary-reader
  • cli: Merge pull request #22 from hpcde/feature-matrix_market_support
  • kernel-line: Merge pull request #32 from hpcde/feature-adaptive-line
  • kernel-line-enhance: Merge pull request #34 from hpcde/fix-line-enhance-shlf_down-reduce-stucked
  • kernel-line-enhance: Merge pull request #37 from hpcde/opt-line-enhance-vector-reduction
  • suitesparse-dl: Merge pull request #33 from hpcde/hotfix-suitesparse-dl-conv
  • suitesparse-dl: Merge pull request #24 from hpcde/feature-suitesparse-dl-sbatch-gen
  • suitesparse-dl: merge: Merge pull request #23 from hpcde/feature-suitesparse-downloader

Perf

  • cli: remove unnecessary data copy when converting matrix-market to CSR
  • cli: replace stringstream with user-customized line parser to parse matrix-market format
  • cli: read the whole file into buffer and then parse while loading matrix-market format
  • kernel-line-enhance: move a part of sum reduction (local & global shift) to out of rounds-loop

Refactor

  • benchmark: use macro SPMV_BENCHMARK to call spmv benchmark
  • benchmark: apply a simpler approach to benchmark different algorithms
  • cli: refactor code of matrix-market reading and parsing: split header and body reading
  • cli: move implementation code of verification.h to cpp file
  • cli: rename class mtx_reader -> csr_mtx_reader, file csr.hpp -> sparse_format.h
  • kernel-line: use param BLOCK_LDS_SIZE, rather than MAX_ROW_NNZ, to specify LDS size

Pull Requests

  • Merge pull request #29 from hpcde/feature-suitesparse-dl-binary-csr-convert

v0.5.0 - 2021-10-15

Chore

  • add LICENSE file

Docs

  • changelog: update changelog for v0.5.0
  • changelog: update changelog
  • readme: add citing information

Feat

  • kernel-adaptive: use adaptive line-enhance when the matrix is small or is short-row matrix
  • kernel-adaptive: apply line strategy (one-pass line) to short-row matrices in adaptive strategy
  • kernel-line-enhance: add adaptive line_enhance, which set kernel template parameters adaptively

Merge

  • kernel-adaptive: Merge branch 'revert-adaptive-line-enhance-case' into branch main
  • kernel-adaptive: Merge pull request #20 from hpcde/feature-adaptive-line-enhance
  • kernel-adaptive: Merge pull request #19 from hpcde/feature-kernel-line-for-short-row-matrix

Revert

  • kernel-adaptive: don't use adaptive line-enhance if a large matrix has short rows

Build

  • cmake: fix cmake find OpenMP errors while OpenMP target mode is not supported
  • cmake: change the files (only headers in src/acc dir) and lib (kernel lib) for installation

Docs

  • readme: change kernel strategy in example build script, and add docs of min rocsparse version

Feat

  • cli: refactor the cli part of reading file, calling spmv, verifying, and put source to cli dir
  • cmake: set default value of cmake option SPMV_OMP_ENABLED_FLAG to OFF
  • kernel-adaptive: change condition of using flat strategy to "matrix nnz is larger than 2^23"
  • kernel-flat: add a new vector based reduction method for flat kernel strategy
  • kernel-flat: support the case "vec number is larger than reduce rows" in vec-level reduction
  • kernel-flat: add a new vector based reduction method with memory coalescing
  • kernel-flat: support to set flat kernel template parameters adaptively using average row nnz
  • kernel-flat: use config FLAT_ONE_PASS_ADAPTIVE to enable/disable adaptive flat
  • kernel-flat: apply adaptive flat to kernel strategy FLAT
  • kernel-line-enhance: create kernel strategy line-enhance with empty implementation
  • kernel-line-enhance: add kernel func implementation of strategy line-enhance
  • tool: use muellan/clipp lib for cli argument parsing
  • tools: feature of showing matrix nnz by part or row

Fix

  • kernel-flat: correct the wrong length of array break_points

Merge

  • Merge pull request #14 from hpcde/feature-more-flexible-configs
  • cli: Merge pull request #11 from hpcde/feature-new-cli
  • cmake: Merge pull request #12 from hpcde/fix-cmake-find-omp
  • kernel-flat: Merge pull request #10 from hpcde/feature-kernel-flat-vec-reduction
  • kernel-flat: Merge pull request #13 from hpcde/feature-kernel-flat-one-pass
  • kernel-line: Merge pull request #15 from hpcde/feature-kernel-line-one-pass
  • kernel-line-enhance: Merge pull request #18 from hpcde/feature-kernel-strategy-line-enhance
  • tools: Merge pull request #16 from hpcde/feature-csr-tool-block-nnz

Perf

  • cli: add OpenMP support for the new cli to parse the input matrix
  • kernel-flat: add another flat imp: each block only iterate one pass in its lifetime
  • kernel-line: set default kernel threads number (HIP_THREAD) to 256 for line-one-pass
  • kernel-line: add optimization of loading 2 matrix values each time in line-one-pass
  • kernel-line: add R param, which can assign each block R*blockDim.x/row_max_nnz rows
  • kernel-line: add another line imp: each block only iterate one pass in its lifetime
  • kernel-line-enhance: new reduce method: divide block into vectors and use vector for reduction

Refactor

  • use wavefront size generated by cmake configure (config by cmake option WAVEFRONT_SIZE)
  • kernel-flat: instantiate flat parameters (reduce method, vec size and R) in func template
  • kernel-flat: move all reduction kernel functions to file flat_reduce.hpp
  • kernel-flat: move direct reduction code in flat strategy to new func flat_reduce_direct
  • kernel-line-enhance: move direct reduction to a new func line_enhance_direct_reduce

Pull Requests

  • Merge pull request #17 from hpcde/feature-adaptive-flat

v0.4.0 - 2021-08-31

Build

  • cmake: gen strategy config from cmake and apply strategy via the macro in generated header
  • cmake: enable to build all strategies code via cmake script

Chore

  • changelog: update CHANGELOG.md for v0.3.0
  • examples: reorder the cases in alphabetical order and correct cases information
  • kernel-thread-row: improved code comments of kernel function kernel_thread_row_v2

Docs

  • changelog: update change log for v0.4.0
  • develop: update document of adding a new kernel strategy in README.md

Feat

  • kernel-adaptive: add config for the new kernel strategy adaptive
  • kernel-adaptive: apply matrix data dividing based vector-row if it is available
  • kernel-adaptive: add basic implementation of adaptive method (no data blocks dividing)
  • kernel-thread-row: remove useless __syncthreads in kernel func kernel_thread_row_block_v2
  • kernel-thread-row: call kernel func kernel_thread_row_block_v2 (x remapping at block level)
  • kernel-thread-row: apply vector x remapping when macro OPT_THREAD_ROW_REMAP_VEC_X is defined
  • kernel-thread-row: add another block level imp: one thread only process a row in its lifetime
  • kernel-thread-row: add another imp of "thread-row" strategy with calculation in block level
  • kernel-vector-row: adaptive kernel: apply different VECTOR_SIZE to data blocks of matrix A

Fix

  • kernel-flat: fix incorrect results when reduction size is larger than threads number in block
  • kernel-thread-row: correct the limit of "template param N can only be 1" at block level
  • kernel-vector-row: fix the wrong kernel func called in native vector-row while VECTOR_SIZE is 8

Merge

  • Merge pull request #4 from hpcde/fix-flat-and-vector-row-bugs
  • Merge branch 'cmake-enable-build-all-strategies' into 'main'
  • kernel-adaptive: Merge pull request #1 from hpcde/feature-kernel-adaptive into branch main
  • kernel-flat: Merge pull request #2 from hpcde/opt-on-kernel-flat
  • kernel-thread-row: Merge pull request #9 from hpcde/feat-thread-row-single
  • kernel-thread-row: Merge pull request #6 from hpcde/opt-thread-row-in-block-level
  • kernel-thread-row: Merge pull request #8 from hpcde/opt-thread-row-tune-kernel-config
  • kernel-thread-row: Merge pull request #7 from hpcde/opt-thread-row-in-block-level-x-remap
  • kernel-vector-row: Merge branch 'enhance-vector-row' into feature-kernel-adaptive
  • kernel-vector-row: Merge pull request #3 from hpcde/opt-vector-row-access-y-coalescing
  • kernel-vector-row: Merge branch 'enhance-vector-row' into 'main'
  • thread-row: Merge pull request #5 from hpcde/opt-thread-row-remap-vec_x

Perf

  • kernel-flat: move barrier __syncthreads() ahead (to the place before loading matrix data)
  • kernel-flat: apply loop unrolling to data loading of csr matrix and x vector
  • kernel-thread-row: tune the BLOCK dim of thread-row at block level (with x remapping) to 512
  • kernel-thread-row: use LDS and __shfl to load start/end row index of block to threads in block
  • kernel-thread-row: make the memory accessing mode of $x$ as column-first mode at block level
  • kernel-thread-row: apply global_load_dwordx4/x2 to load 2 double/int to LDS at block level
  • kernel-thread-row: tune grid dim for native thread-row and optimization "vec x remapping"
  • kernel-thread-row: break the row-loop if the start row index is larger than or equal to matrix rows
  • kernel-thread-row: use native C++ code, instead of assembly, to load 2 matrix values into LDS
  • kernel-thread-row: tune kernel config of thread-row at block level to <<<7010, 512>>>
  • kernel-thread-row: add a new kernel func for remapping memory access pattern of vector $x$
  • kernel-vector-row: memory coalescing of loading and storing y from vectors
  • kernel-vector-row: memory access coalescing at block level when loading and storing vector y
  • kernel-vector-row: add vector y memory coalescing support for adaptive vector-row strategy
  • kernel-vector-row: simple load-balance of wavefronts number on data blocks

Refactor

  • functions renaming to solve signature conflict when building all strategies into one lib
  • kernel-flat: use template param THRAEDS, instead of blockDim.x, as threads num in block
  • kernel-thread-row: refactor macro control on different optimization kernels in thread-row
  • kernel-thread-row: move kernel func kernel_thread_row_v2 to file thread_row_x_remap.inl
  • kernel-thread-row: mv kernel native_thread_row to native_thread_row.cpp to fix build error
  • kernel-thread-row: use constexpr to replace C macro for thread-row optimization selection
  • kernel-vector-row: extract the imp of y memory coalescing to store_y_with_coalescing

Revert

  • kernel-vector-row: recover the missing native vector-row method

Style

  • kernel-thread-row: code format of file thread_row_block_x_remap.hpp

v0.3.0 - 2021-07-15

Build

  • cmake: add ability to config CU number for kernel-line strategy and generate global config

Chore

  • update the author and creation date recorded in source files

Feat

  • use AVAILABLE_CU in cmake generated building_config.h to config LDS size of Block
  • kernel-flat: add support for multiple rounds of csr_val*vec_x in a Block loop
  • kernel-flat: make algorithm compatible with the case of "one row is cut by more than 2 blocks"
  • kernel-flat: add base implementation of flat strategy
  • kernel-thread-row: fall back to native thread-row method when the nnz per row is large
  • kernel-thread-row: new thread-row strategy, but with more coherent memory access
  • line: add implementation of kernel strategy line

Fix

  • kernel-line: fix incorrect computation of variable block_end_row_id
  • kernel-line: fix incorrect reduction when threads number in Block is less than rows processed

Merge

  • cmake: Merge branch 'feature-cmake-config-CUs' into 'main'
  • kernel-flat: Merge branch 'feature-kernel-flat' into 'main'
  • kernel-line: Merge branch 'feature-kernel-line' into 'main'
  • kernel-thread-row: Merge branch 'opt-thread-row-for-small-row-2' into 'main'
  • kernel-thread-row: Merge branch 'opt-thread-row-for-small-row' into 'main'

Perf

  • kernel-line: use min (instead of if) to obtain array index for matrix-vector multiplication
  • kernel-thread-row: replace memory access of wavefront row start/end with broadcast (__shfl)
  • kernel-thread-row: use nontemporal load and store to load/store y vector from/to device memory
  • kernel-thread-row: load 2 matrix values and column indexes in each loop for multiplication
  • kernel-thread-row: tune block and grid dim to achieve better performance for new kernel func
  • kernel-thread-row: remove unnecessary if in multiplication and reduction step

Refactor

  • kernel-thread-row: rename kernel func from device_sparse_spmv_acc to native_thread_row

v0.2.4 - 2021-07-06

Feat

  • utils: add utils macro to count memory bandwidth

Merge

  • kernel-vector-row: Merge branch 'opt-vector-pipeline' into 'main'
  • utils: Merge branch 'benchmark-memory-bandwidth' into 'main'

Perf

  • kernel-vector-row: new optimization: load matrix data and vector x asynchronously in pipeline

Refactor

  • kernel-vector-row: move imp of loading next x vector to function load_vec_x_into_reg
  • kernel-vector-row: move pipeline implementation to another func vector_row_kernel_pipeline

v0.2.3 - 2021-06-29

Chore

  • kernel-vector-row: add comment about row-loading in spmv_vector_row_kernel_double_buffer
  • typos: fix typos in vector-row kernel: sync -> async

Feat

  • kernel-vector-row: basic implementation of vector-row kernel strategy for double buffer
  • tools: better cli and sub-command "dist" to show nnz distribution
  • tools: add a tool for dumping nnz of each row in csr matrix

Fix

  • kernel-wf-row: fix compiling issue of "global_mem_ops.hpp not found"

Merge

  • Merge branch 'code-refactor-and-compiling-fixes' into 'main'
  • kernel-vector-row: Merge branch 'opt-vector-double-buffer' into 'main'
  • tools: Merge branch 'feature-csr-tools' into 'main'

Perf

  • kernel-vector-row: add a new vector-row optimization: preload next row when calculating
  • kernel-vector-row: load next data into buffer for later usage when performing calculation

Refactor

  • move enum sparse_operation to file src/api/types.h
  • kernel-vector-row: remove macro ASYNC_LOAD in func spmv_vector_row_kernel_double_buffer
  • kernel-vector-row: move imp of loading a row of csr data to function load_row_into_reg
  • kernel-vector-row: move double buffer optimization implementation to opt_double_buffer.hpp

v0.2.2 - 2021-06-27

Fix

  • kernel-vector-row: fix the condition of using LDS as buffer
  • kernel-vector-row: fall back to normal vector-row strategy if the LDS is not enough

Merge

  • kernel-vector-row: Merge branch 'fix-fallback-vector-row-when-exceed-LDS' into 'main'

Refactor

  • kernel-vector-row: move row calculation of a vector to a new function vector_calc_a_row

v0.2.1 - 2021-06-23

Merge

  • kernel-vector-row: Merge branch 'opt-kernel-vector-row-memory-access' into 'main'

Perf

  • kernel-vector-row: change gridDim.x value (total blocks on GPU) to 512 to improve performance
  • kernel-vector-row: use global_load_dwordx4/global_load_dwordx2 to load 2 double/int to LDS
  • kernel-vector-row: also load column index data to LDS to achieve memory accessing optimization
  • kernel-vector-row: memory accessing optimization: load matrix rows to LDS and then use them

v0.2.0 - 2021-06-14

Build

  • cmake: move cmake generated file building_config.h from cmake source dir to cmake binary dir

Chore

  • add git-chglog tool config and generated file CHANGELOG.md

Docs

  • correct typos in README.md file
  • run: update running document due to the new input reading (from file)

Feat

  • demo: sync demo code for reading csr matrix from large data set
  • demo: add csr data files for performance test
  • demo: update upstream demo code for reading CSR matrix from file
  • examples: echo job node list before running jobs
  • examples: add new batch files for using large data set (run all cases and run Hardesty3 case)
  • examples: update sbatch script to run all test cases

Fix

  • cmake: fix cmake error "unsupported kernel strategy vector_row" when using the strategy
  • demo: sync demo to fix wrong rows number when reading csr matrix in demo code
  • kernel-light: correct variable names in light strategy implementation so that it compiles

Merge

  • cmake: Merge branch 'fix-cmake-config-unsupported-kernel-strategy'
  • demo: Merge branch 'demo-read-csr-file' into 'main'
  • kernel-group-row: Merge branch 'feature-kernel-wf-group-row' into 'main'
  • kernel-vector-row: Merge branch 'opt-kernel-vector-row'
  • kernel-wf-row: Merge branch 'remove-i32-to-u64-conversion' into 'main'

Perf

  • kernel-group-row: rm useless rowptr[0] when computing average non-zeros (it is always 0)
  • kernel-vector-row: apply fma instruction to kernel-vector-row strategy
  • kernel-wf-row: rm unnecessary v_ashrrev_i32_e32 while converting col index (int) to address

Refactor

  • move file utils.h to common directory
  • kernel-group-row: implementation refactor of group-row and light strategies
  • kernel-group-row: add two new kernel strategies, "light" and "group_row", for calculating multiple rows per wf
  • kernel-group-row: variables renaming of group-row and light imp to satisfy the code style
  • kernel-group-row: remove macros in group-row and light kernel strategies
  • kernel-vector-row: rename kernel strategy group_row to vector_row

Style

  • kernel-group-row: add comments, resort headers for group-row and light kernel strategies imp

v0.1.1 - 2021-06-03

Perf

  • kernel-wf-row: use shift operator, rather than multiplication and division
  • kernel-wf-row: use global_load_dwordx4 to load 2 double in inner loop to reduce mem access

Refactor

  • kernel-wf-row: apply const to variables in function device_spmv_wf_row_default

v0.1.0 - 2021-06-03

Build

  • cmake: add path config of kernel source files directory for different kernel strategy
  • cmake: add cmake building script

Chore

  • add clang-format config file
  • add .gitignore file
  • examples: add sbatch script for running the program with GPU support
  • kernel-thread-row: remove unused file src/acc/hip-thread-row/CMakeLists.txt
  • kernel-wf-row: remove expired comments (about KERNEL_STRATEGY option) in config.cmake file

Docs

  • build: add building document for GPU side and CPU side
  • develop: add document for creating and using a kernel strategy
  • verify: add building document for device/CPU side verification

Feat

  • wrapper file acc/hip/spmv_hip_acc_imp.h in file Csrsparse.hpp
  • compiling: make compilation pass if cmake option HIP_ENABLE_FLAG is OFF
  • demo: add official demo code
  • demo: sync latest official demo code: replace clock counter std::clock with gettimeofday
  • kernel-block-row-ordinary: one block computes one row with ordinary method
  • kernel-row-wf: kernel strategy 'row-wf' implementation for processing one row in a wavefront
  • kernel-sync-wf-row: kernel strategy sync-wf-row implementation for processing one row in a wavefront with synchronous thread
  • kernel-thread-row: new strategy: one thread computes a row of A
  • kernel-wf-row-reg: kernel strategy wf-row-reg for processing one row in a wavefront with register __shfl_down

Fix

  • compiling: make compilation and validation pass
  • demo: sync the latest demo code, which fixed a bug of matrix A generation: "A is always dense"
  • demo: fix building error "use of undeclared identifier 'cout'" due to missing iostream header
  • macro: add missing macro gpu generated by cmake configure_file
  • reg-reduce: correct the incorrect use of the __shfl_down function to fix failed results validation

Merge

  • Merge branch 'refactor-merge-wf-row' into 'main'
  • Merge branch 'sync-wf-row' into refactor-merge-wf-row
  • Merge branch 'make-compiling-passed' into 'main'
  • Merge branch 'feature-kernel-wf-row-reg' into 'main'
  • Merge branch 'feature-strategy-row-wf' into 'main'
  • Merge branch 'feature-kernel-strategy' into 'main'
  • cmake: Merge branch 'cmake-configs' into 'main'
  • demo: Merge branch 'official-demo-code' into 'main'
  • kernel-block-row-ordinary: Merge branch 'feature-kernel-block-row-ordinary' into 'main'
  • kernel-thread-row: Merge branch 'feature-strategy-thread-row' into 'main'
  • kernel-wf-row: Merge branch 'opt-wf-row-strategy' into 'main'
  • kernel-wf-row: Merge branch 'opt-row-wf-strategy' into main

Perf

  • kernel-wf-row: move alpha multiplication out of the inner col-loop
  • kernel-wf-row: adjust kernel threads number and block number to make full use of CUs of GPU
  • kernel-wf-row: adjust threads number in a block to make full use of SIMDs in GPU

Refactor

  • kernel-thread-row: code refactor that may improve performance: move variables out of loop
  • kernel-wf-row: merge 3 wavefront reduce methods (default, LDS, Reg) all into wf-row strategy
  • kernel-wf-row: rename strategy 'row-wf' to 'wf-row'

Style

  • demo: code format of official demo code
  • kernel-block-row-ordinary: code format for implementation code of block-row-ordinary strategy
  • kernel-thread-row: code format of implementation of thread-row strategy
  • kernel-wf-row-reg: removed unused code