Extend tile reduction operations with additional operators #21
Conversation
What is the point of having an `axis` function?
@AntonOresten I agree. I noted this in the commit related to it.
I am still working on it and on scan operations. The axis convenience came up while working on the scan ops; I just included it in this pull request. Either way, it's going to be a consensus decision as we go forward.
Maybe the reduction methods can be exposed through …
Ideally yes.

using Test
using CUDA
using cuTile
import cuTile as ct

function reduceKernel(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1}, tile_size::ct.Constant{Int})
    bid = ct.bid(1)                          # block index along the first grid dimension
    tile = ct.load(a, bid, (tile_size[],))   # load one tile of tile_size elements
    result = ct.reduce_min(tile, Val(1))     # reduce along axis 1
    ct.store(b, bid, result)                 # one result per tile
    return nothing
end

sz = 32
N = 2^15
a = CUDA.rand(Float32, N)
b = CUDA.zeros(Float32, cld(N, sz))
CUDA.@sync ct.launch(reduceKernel, cld(length(a), sz), a, b, ct.Constant(sz))
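A quick host-side sanity check for the example above could look like this (a sketch, assuming the kernel above has run; a_host and expected are names introduced here, not part of the package):

a_host = Array(a)
# each output element should be the minimum of its 32-element input tile
expected = [minimum(@view a_host[(i-1)*sz+1 : i*sz]) for i in 1:cld(N, sz)]
@test Array(b) == expected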
Yay, something! I don't have a Blackwell GPU; I rented a vast.ai 5080 GPU. Worth it, I guess.
Ah, I think there is a valid use of the …
Force-pushed from 1a44b31 to 3820e74.
I'm not a fan of the `reduce_{op}`-style functions …
@maleadt You spoke my mind verbatim. I have a package that is very heavy on very complicated scan, reduce, and top-k-like computations with event-triggering state machine transitions, and none of the simple reduce_{op} functions above actually help me with that goal. I would love something like you suggested, and that's my hope for this package. I am just pushing the reduce_{op} functions while exploring the package itself and understanding what it takes. I actually went down the path you suggested first, to check whether it's possible to generate code for a custom Julia function. Also, the package I have, with CUDA.jl-based custom kernels, is unique on its own, and I don't have a baseline to compare against a reference implementation the way other DL and AI packages and pipelines do. I am hoping this library will also help me prepare a baseline. Otherwise I will have to go the C++ path and deal with permutations and combinations of configurations, kernel splitting, fusion, etc., which is more daunting than the actual research work. Appreciate the work you have put into this.
…both float and integer types

- Add reduce_mul, reduce_min, reduce_and, reduce_or, reduce_xor intrinsic functions
- Remove AbstractFloat constraint from reduce_sum and reduce_max
- Add corresponding emit_intrinsic! handlers for new operations
- Add integer encode_reduce_body methods for and, or, xor operations

Summary of Additions

| Function | Symbol | Types |
|----------|--------|-------|
| `reduce_sum` | `:add` | Any |
| `reduce_max` | `:max` | Any |
| `reduce_mul` | `:mul` | Any |
| `reduce_min` | `:min` | Any |
| `reduce_and` | `:and` | Integer only |
| `reduce_or` | `:or` | Integer only |
| `reduce_xor` | `:xor` | Integer only |
- Add axis(i::Integer) -> Val{i-1} convenience function
- Use axis(i) instead of raw Val for self-documenting axis selection

The axis convenience is a small helper function around `Val`. But I see reduce is already one-based, so I'm not sure whether we should go with it. It doesn't harm anything; it's just a convenience.
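For reference, the helper described above is tiny; a minimal sketch, assuming the one-based convention from the commit message:

axis(i::Integer) = Val(i - 1)   # 1-based Julia axis -> 0-based Val for the intrinsics

axis(1)  # Val{0}()
axis(2)  # Val{1}()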
…patch

- Add wrapper functions in operations.jl for reduce_mul, reduce_min, reduce_and, reduce_or, reduce_xor with appropriate type constraints
- Refactor identity value selection to use dispatch instead of an if-else chain
- Correct identity values:
  - add: 0.0
  - max: -Inf (float) or 0 (int)
  - mul: 1.0
  - min: +Inf (float) or typemax(Int64) (int)
  - and: 0 (interpreted as -1 bits by backend)
  - or: 0
  - xor: 0
- Add IntIdentity struct to bytecode/writer.jl for proper integer identity encoding
- Add encode_tagged_int! function for encoding integer identity attributes (tag 0x01)
- Dispatch encode_identity! on identity type for proper encoding
- Update reduce_identity to return IntIdentity for integer operations
- Import ReduceIdentity, FloatIdentity, IntIdentity in intrinsics.jl

Identity values are now properly typed:
- Float operations → FloatIdentity
- Integer operations → IntIdentity
The intrinsics were updated to support all types, but the wrapper functions in operations.jl still had a T <: AbstractFloat constraint, causing method lookup failures for integer types.
- reduce_sum, reduce_max, reduce_mul, reduce_min now use T <: Number
- Provides type safety while supporting all numeric types
- More self-documenting than an unconstrained T
- IntegerIdentity now has a signed::Bool field
- encode_tagged_int! encodes signed values with zigzag varint, unsigned values with plain varint
- Add is_signed() helper that checks T <: SignedInteger
- Update all reduce_identity calls to pass is_signed(T)
- Abstract type is now called OperationIdentity to reflect use by both reduce and scan operations
- FloatIdentity and IntegerIdentity now inherit from OperationIdentity
- Updated comments and docs to reflect the broader scope
- Updated the import in intrinsics.jl
- IdentityOp: abstract type for binary operation identities
- FloatIdentityOp: concrete type for float identities
- IntegerIdentityOp: concrete type for integer identities (with signed field)
- Applied consistently across writer.jl, encodings.jl, intrinsics.jl, and core.jl
- T <: Integer && !(T <: Unsigned) correctly identifies:
  - Int32, Int64, etc. as signed (true)
  - UInt32, UInt64, etc. as unsigned (false)
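A minimal sketch of that check (the function name follows the commit messages; the body is just the predicate above):

is_signed(::Type{T}) where {T} = T <: Integer && !(T <: Unsigned)

is_signed(Int32)    # true
is_signed(UInt64)   # false
is_signed(Float32)  # false (not an Integer)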
Ensures type-consistent encoding for Int8, Int16, etc.

intrinsics: use type-dependent identity values for reduce ops
- add: zero(T)
- max: typemin(T)
- mul: one(T)
- min: typemax(T)
- and: is_signed ? -1 : typemax(T) for proper bit representation
- or, xor: zero(T)
Fixes encoding error for UInt32 (9223372036854775807 does not fit in 32 bits).

Update core.jl: fix reduce_min identity to use typemax(T) instead of typemax(Int64)
- For UInt32, typemax(UInt32) = 4294967295 fits in 32 bits
- typemax(Int64) = 9223372036854775807 does not fit and caused the encoding error
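Put together, the final type-dependent identities could be expressed with dispatch roughly like this (a sketch; reduce_identity is named in the commits, but this exact signature is an assumption):

reduce_identity(::Val{:add}, ::Type{T}) where {T} = zero(T)
reduce_identity(::Val{:mul}, ::Type{T}) where {T} = one(T)
reduce_identity(::Val{:max}, ::Type{T}) where {T} = typemin(T)
reduce_identity(::Val{:min}, ::Type{T}) where {T} = typemax(T)
reduce_identity(::Val{:and}, ::Type{T}) where {T<:Integer} = T <: Unsigned ? typemax(T) : T(-1)  # all bits set
reduce_identity(::Val{:or},  ::Type{T}) where {T<:Integer} = zero(T)
reduce_identity(::Val{:xor}, ::Type{T}) where {T<:Integer} = zero(T)

reduce_identity(Val(:min), UInt32)  # 0xffffffff, fits in 32 bits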
test: add comprehensive reduce operations tests
- Tests for all reduce ops: add, mul, min, max, and, or, xor
- Tests for Float32, Float64, Int32, UInt32, Int8 types
- Tests for axis 0 and axis 1 reductions
- Compares GPU results against CPU reference implementations
- Includes UInt32 and Int8 tests for the identity encoding fix
Used an agent to create the tests, hence the wrath.
Prepares for reuse by scan operations. Function is shape-agnostic and depends only on operation type and element type.
Julia's reduce with dims= requires an explicit init for the &, |, ⊻ operators. Use typemax(T) for AND (the identity with all bits set).
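For example, a CPU reference for the bitwise ops might look like this (a sketch; the array shape and element type here are arbitrary):

A = rand(UInt16, 8, 4)
and_ref = reduce(&, A; dims=1, init=typemax(UInt16))  # AND identity: all bits set
or_ref  = reduce(|, A; dims=1, init=zero(UInt16))
xor_ref = reduce(⊻, A; dims=1, init=zero(UInt16))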
The original code tried to convert Int64 directly to UInt64, which fails for negative values like typemin(Int32) = -2147483648. Zigzag encoding maps n to (n << 1) ⊻ (n >> 63), enabling proper encoding of negative integers in varint format.
Zigzag encoding: (n << 1) ⊻ (n >> 63) properly handles negative values like typemin(Int32) = -2147483648. Unsigned values use plain varint encoding since they don't need zigzag.
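In Julia the mapping itself is a one-liner (a sketch of the 64-bit case, separate from the package's actual encoder):

zigzag(n::Int64) = reinterpret(UInt64, (n << 1) ⊻ (n >> 63))

zigzag(Int64(0))            # 0x0000000000000000
zigzag(Int64(-1))           # 0x0000000000000001
zigzag(Int64(1))            # 0x0000000000000002
zigzag(Int64(-2147483648))  # 0x00000000ffffffff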
The correct implementation is in src/bytecode/basic.jl:
function encode_signed_varint!(buf, x)
    # zigzag-style mapping: non-negative x becomes 2x, negative x becomes 2|x|-1,
    # so small magnitudes encode to few varint bytes
    x = x << 1
    if x < 0
        x = ~x
    end
    encode_varint!(buf, x)
end
The duplicate in writer.jl was shadowing the correct one.
For unsigned integer types like UInt16, UInt32, the comparison must use unsigned signedness, not the default SignednessSigned. This fixes wrong reduction results for unsigned types where signed comparison was causing values to be interpreted incorrectly (e.g., 0xFFFF interpreted as -1).
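A small illustration of the failure mode (plain Julia, outside the package):

x = 0xFFFF             # UInt16 with all bits set
reinterpret(Int16, x)  # -1: under signed comparison this looks smaller than everything
x > 0x0001             # true under the correct unsigned comparison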
- writer.jl: Changed IntegerIdentityOp.value from Int64 to UInt128 to store full 64-bit unsigned values. Added mask_to_width() to mask values to the correct bit width before zigzag encoding.
- core.jl: Added to_uint128() helper to convert signed/unsigned values to UInt128 via bit reinterpretation for proper identity value storage.
- examples/reducekernel.jl: Added comprehensive tests for all reduce operations (min, max, sum, xor, or, and) on UInt16/32/64, Int16/32/64, and Float16/32/64.

Fixes:
- UInt64 reduce_min and reduce_and now work correctly
- Int16 reduce_max and reduce_and now work correctly
- All small integer types (Int8, Int16, Int32) now encode properly with SLEB128
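The masking step could look roughly like this (the name mask_to_width comes from the commit message; the body is an assumption):

mask_to_width(x::UInt128, nbits::Integer) = x & ((UInt128(1) << nbits) - 1)

mask_to_width(UInt128(0xFFFFFFFFFFFFFFFF), 32)  # == 0xffffffff, only the low 32 bits kept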
- Use a kernel factory pattern matching the reducekernel.jl style
- Replace per-row loops with a simpler broadcasting approach
- Consolidate tests using @testset for loops over element types and operations
- Streamline the CPU reference computation
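A rough sketch of the consolidated layout (CPU side only; the op table and element types are illustrative, and the GPU launch and comparison are elided):

using Test

ops = (:add => +, :mul => *, :min => min, :max => max)
@testset "reduce $(name) over $(T)" for T in (Float32, Int32), (name, jlop) in ops
    A = T <: Integer ? rand(T(1):T(4), 8, 8) : rand(T, 8, 8)
    cpu_ref = reduce(jlop, A; dims=1)  # CPU reference
    @test size(cpu_ref) == (1, 8)      # a real test would compare the GPU result here
end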
If this is replaced by #37, could you close this one?
#37 is a partial, simpler PR. Closing this one; I will open multiple PRs covering this.
This PR extends the tile reduction operations to support a broader range of reduction operators for both floating-point and integer types.
Changes
1. Core Intrinsics (src/compiler/intrinsics/core.jl)
- Add reduce_mul, reduce_min operations (work with any element type)
- Add reduce_and, reduce_or, reduce_xor bitwise operations (integer-only)
- Remove the AbstractFloat constraint from reduce_sum and reduce_max
- Add emit_intrinsic! handlers for the new operations
- Add integer encode_reduce_body methods for the and, or, xor operations

2. Operations (src/language/operations.jl)
- Add an axis(i::Integer) helper function for 1-based to 0-based axis conversion
- Write e.g. ct.cumsum(tile, ct.axis(1)) instead of a raw Val

3. Documentation (README.md)
- Add the axis(i) helper to the reductions section

Supported Reductions
- reduce_sum
- reduce_mul
- reduce_max
- reduce_min
- reduce_and
- reduce_or
- reduce_xor