Use unified memory for scalar indexing of permutation matrices #313

Merged on Oct 2, 2024 (5 commits)

tgymnich
Member

No description provided.

@tgymnich
Member Author

How should we approach doing the same for scalar indexing inside of GPUArrays.jl?

@maleadt
Member

maleadt commented Mar 11, 2024

I recently reworked indexing in GPUArrays to make exactly this possible, see JuliaGPU/GPUArrays.jl#499 and JuliaGPU/CUDA.jl#2138 for an implementation.

@tgymnich tgymnich mentioned this pull request Apr 10, 2024
@christiangnrd christiangnrd mentioned this pull request Jul 23, 2024
2 tasks
@tgymnich tgymnich marked this pull request as ready for review September 27, 2024 13:25
Contributor

@christiangnrd christiangnrd left a comment

This looks good to me.

Does this include scalar indexing inside of GPUArrays.jl? If not, should an issue be filed so we don't forget to eventually get to it?

I also noticed some stuff that's not really in the scope of this PR so I'll submit a separate one.

@maleadt maleadt changed the title Use unified memory for scalar indexing Use unified memory for scalar indexing of permutation matrices Oct 2, 2024
@maleadt
Member

maleadt commented Oct 2, 2024

> Does this include scalar indexing inside of GPUArrays.jl? If not, should an issue be filed so we don't forget to eventually get to it?

#443

Contributor

@github-actions github-actions bot left a comment

Metal Benchmarks

| Benchmark suite | Current: 45fe9d1 | Previous: 71b784e | Ratio |
|---|---|---|---|
| private array/construct | 27791.666666666668 ns | 23715.25 ns | 1.17 |
| private array/broadcast | 461833.5 ns | 474145.5 ns | 0.97 |
| private array/random/randn/Float32 | 997792 ns | 994125 ns | 1.00 |
| private array/random/randn!/Float32 | 622708 ns | 644458.5 ns | 0.97 |
| private array/random/rand!/Int64 | 565833 ns | 569958 ns | 0.99 |
| private array/random/rand!/Float32 | 581375 ns | 606250 ns | 0.96 |
| private array/random/rand/Int64 | 885500 ns | 831750 ns | 1.06 |
| private array/random/rand/Float32 | 963375 ns | 897625 ns | 1.07 |
| private array/copyto!/gpu_to_gpu | 489791 ns | 660666 ns | 0.74 |
| private array/copyto!/cpu_to_gpu | 744542 ns | 555208 ns | 1.34 |
| private array/copyto!/gpu_to_cpu | 566020.5 ns | 709417 ns | 0.80 |
| private array/accumulate/1d | 1431521 ns | 1430125 ns | 1.00 |
| private array/accumulate/2d | 1475146 ns | 1499500 ns | 0.98 |
| private array/iteration/findall/int | 2276479.5 ns | 2210520.5 ns | 1.03 |
| private array/iteration/findall/bool | 2036750 ns | 2041209 ns | 1.00 |
| private array/iteration/findfirst/int | 1685416.5 ns | 1704833 ns | 0.99 |
| private array/iteration/findfirst/bool | 1667834 ns | 1645334 ns | 1.01 |
| private array/iteration/scalar | 2383604 ns | 2430625 ns | 0.98 |
| private array/iteration/logical | 3429416.5 ns | 3432895.5 ns | 1.00 |
| private array/iteration/findmin/1d | 1792438 ns | 1763667 ns | 1.02 |
| private array/iteration/findmin/2d | 1377625 ns | 1353479 ns | 1.02 |
| private array/reductions/reduce/1d | 795479.5 ns | 730853.5 ns | 1.09 |
| private array/reductions/reduce/2d | 725292 ns | 709708 ns | 1.02 |
| private array/reductions/mapreduce/1d | 783999.5 ns | 800041 ns | 0.98 |
| private array/reductions/mapreduce/2d | 718375 ns | 713125 ns | 1.01 |
| private array/permutedims/4d | 951520.5 ns | 949333 ns | 1.00 |
| private array/permutedims/2d | 924896 ns | 930958 ns | 0.99 |
| private array/permutedims/3d | 999959 ns | 1018708.5 ns | 0.98 |
| private array/copy | 865333 ns | 582583 ns | 1.49 |
| latency/precompile | 4410091792 ns | 4403995333 ns | 1.00 |
| latency/ttfp | 6887169042 ns | 6895957979 ns | 1.00 |
| latency/import | 725338979.5 ns | 723655188 ns | 1.00 |
| integration/metaldevrt | 753042 ns | 757604 ns | 0.99 |
| integration/byval/slices=1 | 1542104 ns | 1623541 ns | 0.95 |
| integration/byval/slices=3 | 8907292 ns | 8853854 ns | 1.01 |
| integration/byval/reference | 1589042 ns | 1573521 ns | 1.01 |
| integration/byval/slices=2 | 2730792 ns | 2624459 ns | 1.04 |
| kernel/indexing | 462750 ns | 455583 ns | 1.02 |
| kernel/indexing_checked | 435021 ns | 461916 ns | 0.94 |
| kernel/launch | 10958 ns | 10875 ns | 1.01 |
| metal/synchronization/stream | 19208 ns | 19250 ns | 1.00 |
| metal/synchronization/context | 19708 ns | 19791 ns | 1.00 |
| shared array/construct | 23972.25 ns | 23972.166666666668 ns | 1.00 |
| shared array/broadcast | 466584 ns | 478708 ns | 0.97 |
| shared array/random/randn/Float32 | 1017250 ns | 987500 ns | 1.03 |
| shared array/random/randn!/Float32 | 626250 ns | 641062.5 ns | 0.98 |
| shared array/random/rand!/Int64 | 573875 ns | 576520.5 ns | 1.00 |
| shared array/random/rand!/Float32 | 585292 ns | 592333.5 ns | 0.99 |
| shared array/random/rand/Int64 | 849417 ns | 870458 ns | 0.98 |
| shared array/random/rand/Float32 | 888729 ns | 935229 ns | 0.95 |
| shared array/copyto!/gpu_to_gpu | 537375 ns | 546667 ns | 0.98 |
| shared array/copyto!/cpu_to_gpu | 83875 ns | 94125 ns | 0.89 |
| shared array/copyto!/gpu_to_cpu | 85709 ns | 84208 ns | 1.02 |
| shared array/accumulate/1d | 1428958.5 ns | 1434979 ns | 1.00 |
| shared array/accumulate/2d | 1491271 ns | 1497729 ns | 1.00 |
| shared array/iteration/findall/int | 2012833 ns | 1971125 ns | 1.02 |
| shared array/iteration/findall/bool | 1778541 ns | 1777500 ns | 1.00 |
| shared array/iteration/findfirst/int | 1415437.5 ns | 1410291 ns | 1.00 |
| shared array/iteration/findfirst/bool | 1392125 ns | 1388708 ns | 1.00 |
| shared array/iteration/scalar | 189208 ns | 189562.5 ns | 1.00 |
| shared array/iteration/logical | 3203417 ns | 3205291 ns | 1.00 |
| shared array/iteration/findmin/1d | 1491375 ns | 1479229 ns | 1.01 |
| shared array/iteration/findmin/2d | 1389708 ns | 1373083.5 ns | 1.01 |
| shared array/reductions/reduce/1d | 678083 ns | 616666 ns | 1.10 |
| shared array/reductions/reduce/2d | 711229.5 ns | 716854.5 ns | 0.99 |
| shared array/reductions/mapreduce/1d | 686437 ns | 686417 ns | 1.00 |
| shared array/reductions/mapreduce/2d | 716667 ns | 710584 ns | 1.01 |
| shared array/permutedims/4d | 950125 ns | 960250 ns | 0.99 |
| shared array/permutedims/2d | 923583 ns | 925458.5 ns | 1.00 |
| shared array/permutedims/3d | 998875 ns | 1015208.5 ns | 0.98 |
| shared array/copy | 867500 ns | 598354.5 ns | 1.45 |
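
For readers checking the Ratio column: it appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1 would mean the current commit is slower. That reading is an assumption drawn from the rows above, not something the comment states. The helper name `ratio` and the choice of rows in this Python sketch are illustrative only.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Assumed definition of the Ratio column: current / previous, 2 decimals."""
    return round(current_ns / previous_ns, 2)


# A few rows copied from the table above: name -> (current ns, previous ns).
rows = {
    "private array/construct": (27791.666666666668, 23715.25),
    "private array/copyto!/gpu_to_gpu": (489791, 660666),
    "private array/copy": (865333, 582583),
}

for name, (current_ns, previous_ns) in rows.items():
    print(f"{name}: {ratio(current_ns, previous_ns):.2f}")
```

Under this assumption the three sample rows reproduce the reported ratios (1.17, 0.74 and 1.49).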

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt maleadt merged commit ff7c7eb into main Oct 2, 2024
2 checks passed
@maleadt maleadt deleted the unified-memory-linalg branch October 2, 2024 12:14
3 participants