Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make memorynew intrinsic #55913

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

oscardssmith
Copy link
Member

@oscardssmith oscardssmith commented Sep 28, 2024

This speeds up making new Memorys and allow the compiler to better understand what's going on, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

Benchmarks:

julia> f(T) = Memory{T}(undef, 2);

julia> @btime f(Int);
  7.879 ns (1 allocation: 48 bytes) #before
  3.959 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Union{Int, Nothing})
  8.426 ns (1 allocation: 48 bytes) #before
  4.017 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Any);
  8.637 ns (1 allocation: 48 bytes) #before
  3.920 ns (1 allocation: 48 bytes) #after 

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

src/cgutils.cpp Outdated Show resolved Hide resolved
@nsajko nsajko added the arrays [a, r, r, a, y, s] label Sep 28, 2024
@oscardssmith oscardssmith added the performance Must go faster label Sep 29, 2024
@oscardssmith
Copy link
Member Author

@gbaraldi so with LLVM assertions enabled I'm getting

julia: /home/oscardssmith/Documents/Code/julia/deps/srccache/llvm-julia-18.1.7-2/llvm/lib/IR/Instructions.cpp:685: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

[398134] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7108b6c2871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN4llvm8CallInst4initEPNS_12FunctionTypeEPNS_5ValueENS_8ArrayRefIS4_EENS5_INS_17OperandBundleDefTIS4_EEEERKNS_5TwineE at /home/oscardssmith/Documents/Code/julia/usr/bin/../lib/libLLVM.so.18.1jl (unknown line)
CallInst at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1692 [inlined]
Create at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1540 [inlined]
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2398
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2420 [inlined]
emit_memorynew at /home/oscardssmith/Documents/Code/julia/src/cgutils.cpp:4555 [inlined]

which is on the line that does ctx.builder.CreateCall(prepare_call(jl_alloc_genericmemory_unchecked_func), {ptls, nbytes, cg_typ});. Any ideas?

@gbaraldi
Copy link
Member

gbaraldi commented Oct 4, 2024

I'd print everyone involved here with the way I showed you yesterday thing->print(dbgs()), So print the func and the arguments

@oscardssmith
Copy link
Member Author

This now works! For simple examples like Memory{Int}(undef, 3) it's about 7ns compared to 11 previously. Also this version is almost able to be analyzed by our escape analysis to delete the Memory when it doesn't (we just need to teach the llvm-alloc-helpers about gc_loaded, and for non @inbounds accesses we need to teach LLVM that the boundschecks are dead).

@gbaraldi
Copy link
Member

gbaraldi commented Oct 4, 2024

As an example of what is possible.

Allocopt was able to go from

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  %memoryref_mem = call dereferenceable(40) ptr addrspace(10) @julia.gc_alloc_obj(ptr nonnull %current_task1, i64 40, ptr addrspace(10) addrspacecast (ptr @"+Core.GenericMemory#771.jit" to ptr addrspace(10))), !dbg !14
  %0 = addrspacecast ptr addrspace(10) %memoryref_mem to ptr addrspace(11), !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr addrspace(11) %0, i64 0, i32 1, !dbg !14
  %2 = call nonnull ptr @julia.pointer_from_objref(ptr addrspace(11) %0) #4, !dbg !14
  %3 = getelementptr inbounds i8, ptr %2, i64 16, !dbg !14
  store ptr %3, ptr addrspace(11) %1, align 8, !dbg !14
  store i64 3, ptr addrspace(11) %0, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) %memoryref_mem, ptr %3), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

to. Removing the allocation. Which likely would allow it to just return the 11

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %memoryref_mem = alloca [40 x i8], align 16
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  call void @llvm.lifetime.start.p0(i64 40, ptr %memoryref_mem)
  %0 = freeze [40 x i8] undef, !dbg !14
  store [40 x i8] %0, ptr %memoryref_mem, align 1, !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr %memoryref_mem, i64 0, i32 1, !dbg !14
  %2 = getelementptr inbounds i8, ptr %memoryref_mem, i64 16, !dbg !14
  store ptr %2, ptr %1, align 8, !dbg !14
  store i64 3, ptr %memoryref_mem, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) null, ptr %2), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

@oscardssmith
Copy link
Member Author

julia> @btime Memory{Int8}(undef, 3)
  7.169 ns (1 allocation: 32 bytes) # before
  3.986 ns (1 allocation: 32 bytes) # after

@giordano
Copy link
Contributor

giordano commented Oct 4, 2024

FixedSizeArrays.jl is going to be much nicer 🙂

src/cgutils.cpp Outdated Show resolved Hide resolved
oscardssmith added a commit that referenced this pull request Oct 7, 2024
combined with #55913, the compiler is smart enough to fully remove
```
function f()
    m = Memory{Int}(undef, 3)
    @inbounds m[1] = 2
    @inbounds m[2] = 2
    @inbounds m[3] = 4
    @inbounds return m[1] + m[2] + m[3]
end
```
src/ccall.cpp Outdated Show resolved Hide resolved
oscardssmith added a commit that referenced this pull request Oct 10, 2024
combined with #55913, the
compiler is smart enough to fully remove
```
function f()
    m = Memory{Int}(undef, 3)
    @inbounds m[1] = 2
    @inbounds m[2] = 2
    @inbounds m[3] = 4
    @inbounds return m[1] + m[2] + m[3]
end
```
@giordano
Copy link
Contributor

Can you please add an llvm pass test for #56030 (comment) (removing all memory for a simple case where the Memory object doesn't escape)?

@oscardssmith
Copy link
Member Author

Do you want an actual LLVM pass, or can I just write a test for 0 allocations?

@giordano
Copy link
Contributor

I think an llvm test would be more robust, but probably a simple zero-allocation test would do the job as well.

@oscardssmith
Copy link
Member Author

LOL. This test is so good it broke a doctest in performance tips. We're testing to show that you get allocations if you have "bad" code that allocates arrays, but now it doesn't allocate :laughing

@oscardssmith
Copy link
Member Author

This is now on top of #55995 (to figure out why we weren't optimizing correctly), but other than that, I think this is good to go!

@giordano
Copy link
Contributor

Maybe a test of no allocations in simple cases as discussed above? 🙂

base/compiler/tfuncs.jl Outdated Show resolved Hide resolved
@@ -905,6 +906,12 @@ macro goto(name::Symbol)
end

# linear indexing
function getindex(A::GenericMemory, i::Int)
@_noub_if_noinbounds_meta
@boundscheck ult_int(bitcast(UInt, sub_int(i, 1)), bitcast(UInt, A.length)) || throw_boundserror(A, (i,))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just make this the bootstrap method too? The boundscheck macro is just expanded simply to @_boundscheck &&, so this seems compatible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants