mex interface for a CUDA implementation of Stephen Boyd's ADMM group lasso solver, with the extra feature of testing multiple lambdas in parallel.
http://www.stanford.edu/~boyd/papers/admm/group_lasso/group_lasso.html
This implementation is a mostly literal translation of the solver, with the added ability to test up to 31 lambdas in parallel. It can operate on a dense matrix A of any shape, but A should be at least size (32,32); otherwise the infernal MATLAB mex overhead will not make the call worthwhile.
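
As a rough illustration of the per-group step that such a solver repeats for every lambda, here is a minimal host-side C++ sketch of the block soft-thresholding z-update used in the reference MATLAB solver. The function names `block_shrink` and `z_update` are hypothetical and are not taken from this repository, and the sketch assumes `p` holds group sizes, as in the reference solver. Each lambda only changes the threshold `lambda / rho`, so the updates for different lambdas are independent of one another.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Block soft-thresholding on v[start .. start+len):
// scales the block by max(0, 1 - kappa / ||block||_2) and writes it to out.
inline void block_shrink(const std::vector<float>& v, std::size_t start,
                         std::size_t len, float kappa, std::vector<float>& out) {
    float sq = 0.0f;
    for (std::size_t i = start; i < start + len; ++i) sq += v[i] * v[i];
    const float nrm = std::sqrt(sq);
    // When the block norm is at or below kappa the whole group is zeroed.
    const float scale = (nrm > kappa) ? (1.0f - kappa / nrm) : 0.0f;
    for (std::size_t i = start; i < start + len; ++i) out[i] = scale * v[i];
}

// z-update for a single lambda. ASSUMPTION: p lists the size of each group
// and the sizes sum to v.size().
inline std::vector<float> z_update(const std::vector<float>& v,
                                   const std::vector<int>& p,
                                   float lambda, float rho) {
    std::vector<float> z(v.size(), 0.0f);
    std::size_t start = 0;
    for (int len : p) {
        block_shrink(v, start, static_cast<std::size_t>(len), lambda / rho, z);
        start += static_cast<std::size_t>(len);
    }
    assert(start == v.size());
    return z;
}
```

Calling `z_update` once per lambda on the same input gives one column of z per lambda, which is the work a parallel version can spread across concurrent GPU streams.
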
inputs from MATLAB are (in order)
0) Matrix A (m,n) of single precision floating point numbers (32 bit) in DENSE form. It must be passed into mex in TRANSPOSE form because the implementation uses row-major format (m and n are adjusted internally)
1) vector b (m,1) single precision floating point numbers
2) vector p (K,1) of 32 bit integers, where K = Psize is the number of partitions
3) vector u (n,1) single precision floating point numbers
4) vector z (n,1) single precision floating point numbers
5) float (single) rho
6) float (single) alpha
7) integer max_iter
8) float (single) abstol
9) float (single) reltol
10) lambda array
11) 32 bit integer array that returns the number of iterations until convergence for each lambda (the array size equals the number of lambdas)
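
The transpose requirement for A follows from storage order: MATLAB stores matrices column-major, so the buffer of the (n,m) matrix A' has the same byte layout as a row-major (m,n) matrix A. The C++ sketch below only demonstrates that index identity; the helper names are hypothetical and do not come from this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major lookup (MATLAB's layout): element (r, c) of a matrix with
// `rows` rows lives at index r + c * rows.
inline float col_major_at(const std::vector<float>& buf, std::size_t rows,
                          std::size_t r, std::size_t c) {
    return buf[r + c * rows];
}

// Row-major lookup: element (i, j) of a matrix with `cols` columns lives at
// index i * cols + j.
inline float row_major_at(const std::vector<float>& buf, std::size_t cols,
                          std::size_t i, std::size_t j) {
    return buf[i * cols + j];
}

// Reads buf both as column-major A' (n x m) and as row-major A (m x n) and
// checks that A'(j, i) and A(i, j) always name the same element.
inline bool transpose_matches_row_major(const std::vector<float>& buf,
                                        std::size_t m, std::size_t n) {
    assert(buf.size() == m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            if (col_major_at(buf, n, j, i) != row_major_at(buf, n, i, j))
                return false;
    return true;
}
```

On the MATLAB side this corresponds to passing `single(A')` rather than `single(A)` as the first argument.
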
outputs are (in order)
0) matrix u (n,num_lambdas) single precision floating point numbers, one column per lambda
1) matrix z (n,num_lambdas) single precision floating point numbers, one column per lambda
2) vector num_iters (num_lambdas,1) 32-bit integer array
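
The iteration counts in num_iters depend on abstol and reltol. For reference, the stopping test in the original MATLAB solver compares the primal and dual residual norms against tolerances built from those two inputs. The C++ sketch below restates that test; it assumes this port keeps the same criterion, and the names `norm2` and `converged` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm of a vector.
inline float norm2(const std::vector<float>& v) {
    float sq = 0.0f;
    for (float e : v) sq += e * e;
    return std::sqrt(sq);
}

// ADMM stopping test as in the reference MATLAB solver:
//   ||x - z||            < sqrt(n)*abstol + reltol*max(||x||, ||z||)
//   ||rho*(z - z_old)||  < sqrt(n)*abstol + reltol*||rho*u||
inline bool converged(const std::vector<float>& x, const std::vector<float>& z,
                      const std::vector<float>& z_old,
                      const std::vector<float>& u,
                      float rho, float abstol, float reltol) {
    const std::size_t n = x.size();
    assert(z.size() == n && z_old.size() == n && u.size() == n);
    float r_sq = 0.0f;  // squared primal residual norm
    float s_sq = 0.0f;  // squared dual residual norm
    for (std::size_t i = 0; i < n; ++i) {
        const float r = x[i] - z[i];
        const float s = rho * (z[i] - z_old[i]);
        r_sq += r * r;
        s_sq += s * s;
    }
    const float base = std::sqrt(static_cast<float>(n)) * abstol;
    const float eps_pri = base + reltol * std::max(norm2(x), norm2(z));
    const float eps_dual = base + reltol * rho * norm2(u);
    return std::sqrt(r_sq) < eps_pri && std::sqrt(s_sq) < eps_dual;
}
```

A lambda that never satisfies this test would run until max_iter is reached.
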
NOTE: compile with --use_fast_math and, for better parallel performance, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 32 if using the Tesla line of GPUs.
Testing was done with the default CUDA_DEVICE_MAX_CONNECTIONS=8; if testing more lambdas than that, increase it to the number of lambdas.
NOTE: the GPU was not overclocked; it was running at the stock 706 MHz.
Dimensions of A | Number of lambdas | MATLAB time (6-core 3.9 GHz CPU) | CUDA mex time | CUDA speedup |
---|---|---|---|---|
1920x956 | 17 | 463 ms | 26 ms | 17.8x |
1133x1545 | 17 | 862 ms | 72 ms | 11.9x |
2000x1243 | 24 | 1023 ms | 49 ms | 20.8x |
1111x1537 | 24 | 1144 ms | 89 ms | 12.85x |
5000x1491 | 30 | 1369 ms | 75 ms | 18.25x |
10000x1142 | 30 | 1329 ms | 62 ms | 21.43x |
NOTE: The first call of any GPU-related mex interface from MATLAB will be at least 10x slower than subsequent calls, due to initial context setup. In general, MATLAB adds 10-20 ms of running time compared with a clean C++ API library call to the same function.
The solver performs better on 'skinny' matrices (where num_rows >= num_cols) because that shape requires fewer operations.
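
One way to see the extra work for 'fat' matrices is the x-update in the reference MATLAB solver, which branches on shape. The formulas below restate that reference code; whether this CUDA port uses exactly the same branching is an assumption.

```latex
% q is the right-hand side shared by both cases
q = A^{T} b + \rho\,(z - u)

% skinny case (m \ge n): one solve with a cached n-by-n Cholesky factor
x = \left(A^{T} A + \rho I_{n}\right)^{-1} q

% fat case (m < n): a solve with a cached m-by-m factor, plus extra
% matrix-vector products with A and A^{T} in every iteration
x = \frac{q}{\rho} - \frac{1}{\rho^{2}}\, A^{T} \left(I_{m} + \tfrac{1}{\rho} A A^{T}\right)^{-1} A\, q
```

In the fat case each iteration pays for two additional matrix-vector products on top of the triangular solves.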