Conversation

@arhik (Contributor) commented Jan 11, 2026

This PR extends the tile reduction operations to support a broader range of reduction operators for both floating-point and integer types.

Changes

1. Core Intrinsics (src/compiler/intrinsics/core.jl)

  • Added reduce_mul and reduce_min operations (work with any element type)
  • Added reduce_and, reduce_or, reduce_xor bitwise operations (integer-only)
  • Removed AbstractFloat constraint from reduce_sum and reduce_max
  • Added corresponding emit_intrinsic! handlers for new operations
  • Added integer encode methods for and, or, xor operations

2. Operations (src/language/operations.jl)

  • Added axis(i::Integer) helper function for converting 1-based axes to 0-based (see the sketch below)
  • Provides a self-documenting API: ct.cumsum(tile, ct.axis(1)) instead of a raw Val
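
A minimal sketch of the helper as described in this PR (per the commit note below, axis(i::Integer) returns Val(i - 1)):

# axis: convert a 1-based user-facing axis into the 0-based Val
# expected by the intrinsics layer.
axis(i::Integer) = Val(i - 1)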

3. Documentation (README.md)

  • Documented new reduce operations in the operations table
  • Added axis(i) helper to the reductions section

Supported Reductions

| Operation | Types | Description |
|-----------|-------|-------------|
| reduce_sum | Any | Sum along axis |
| reduce_mul | Any | Product along axis |
| reduce_max | Any | Maximum along axis |
| reduce_min | Any | Minimum along axis |
| reduce_and | Integer | Bitwise AND along axis |
| reduce_or | Integer | Bitwise OR along axis |
| reduce_xor | Integer | Bitwise XOR along axis |
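
For reference, a hedged usage sketch inside a kernel body, mirroring the reduce_min example later in this thread (tile is a previously loaded tile; Val(1) selects the first axis):

s = ct.reduce_sum(tile, Val(1))   # sum along the first axis
m = ct.reduce_max(tile, Val(1))   # maximum along the first axis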

@AntonOresten (Contributor)

What is the point of having an axis function? I don't think axes need to be corrected to 0-based indexing that early, and they don't need to be Val since they can be constant-propagated anyway.

@arhik marked this pull request as draft January 11, 2026 14:25
@arhik (Contributor, Author) commented Jan 11, 2026

@AntonOresten I agree.

I noted this in the commit related to it.

reduce ops update and axis convenience
axis is a small convenience helper around Val.

But I see reduce is already one-based. Not sure if we should go with it. It doesn't harm anything; it's just a convenience.

@arhik (Contributor, Author) commented Jan 11, 2026

I am still working on this and on scan operations. The axis convenience came up while working on scan ops; I just included it in this pull request. Either way, it's going to be a consensus decision as we go forward.

@AntonOresten (Contributor) commented Jan 11, 2026

Maybe the reduction methods can be exposed through Base.reduce, dispatching on the binary operator (+, max, * etc). The ct.reduce_* methods currently drop the reduced dimension, unlike Base.reduce. This might lead to a slippery slope, turning reshape, permute, transpose, etc. into methods of their Base counterparts, but then there's no nice counterpart for extract.
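
For illustration, a minimal sketch of that dispatch, keeping the dimension-dropping semantics of the current intrinsics; the Tile type name is hypothetical and this is not cuTile API:

# Hedged sketch: expose reductions through Base.reduce by dispatching
# on the binary operator.
Base.reduce(::typeof(+),   t::Tile; dims) = ct.reduce_sum(t, Val(dims))
Base.reduce(::typeof(*),   t::Tile; dims) = ct.reduce_mul(t, Val(dims))
Base.reduce(::typeof(max), t::Tile; dims) = ct.reduce_max(t, Val(dims))
Base.reduce(::typeof(min), t::Tile; dims) = ct.reduce_min(t, Val(dims))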

@arhik (Contributor, Author) commented Jan 11, 2026

Ideally yes.

using Test
using CUDA
using cuTile
import cuTile as ct

function reduceKernel(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1}, tile_size::ct.Constant{Int})
    bid = ct.bid(1)                         # 1-based block index along the first grid axis
    tile = ct.load(a, bid, (tile_size[],))  # load this block's tile of tile_size elements
    result = ct.reduce_min(tile, Val(1))    # minimum along the first (only) axis
    ct.store(b, bid, result)                # one scalar result per block
    return nothing
end

sz = 32
N = 2^15
a = CUDA.rand(Float32, N)
b = CUDA.zeros(Float32, cld(N, sz))
CUDA.@sync ct.launch(reduceKernel, cld(length(a), sz), a, b, ct.Constant(sz))

julia> b[1:5]
5-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.005346642
0.0034002797
0.029549925
0.14612293
0.0142624555

julia> minimum(a[1:32])
0.005346642f0

julia> minimum(a[33:64])
0.0034002797f0

@arhik (Contributor, Author) commented Jan 11, 2026

Yay, something! I don't have a Blackwell GPU, so I rented a vast.ai 5080 GPU. Worth it, I guess.

@AntonOresten (Contributor) commented Jan 11, 2026

Ah, I think there is a valid use of the axis helper, but it is not user-facing. I was rather confused by the emphasis and documentation of this one helper.

@arhik force-pushed the tilereduce branch 2 times, most recently from 1a44b31 to 3820e74 on January 11, 2026 20:41
@arhik marked this pull request as ready for review January 12, 2026 07:43
@arhik marked this pull request as draft January 12, 2026 07:58
@arhik marked this pull request as ready for review January 12, 2026 11:10
@maleadt (Member) commented Jan 13, 2026

I'm not a fan of the reduce_max etc. functions; those were just placeholders. We should probably expose these using more idiomatic constructs, like max or cumsum. If possible, by actually implementing something underlying like mapreduce! rather than Base.sum directly, though I'm not sure how realistic that is (initially it could error out on unsupported configurations, but longer term it could be extended to allow some of the more interesting combinations, like summing over a function or reducing only some dimensions).
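
One hedged reading of that suggestion; tile_mapreduce and its wiring are hypothetical, not existing cuTile API:

# Sketch of an underlying mapreduce-like primitive that initially errors
# on unsupported configurations, as suggested above.
function tile_mapreduce(f, op, t; dims)
    f === identity || error("mapping functions are not supported yet")
    op === (+) && return ct.reduce_sum(t, Val(dims))
    op === (*) && return ct.reduce_mul(t, Val(dims))
    op === max && return ct.reduce_max(t, Val(dims))
    op === min && return ct.reduce_min(t, Val(dims))
    error("unsupported reduction operator: $op")
end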

@arhik (Contributor, Author) commented Jan 13, 2026

@maleadt You spoke my mind verbatim. I have a package that is very heavy on complicated scan, reduce, and top-k-like computations with event-triggered state machine transitions, and none of the simple reduce_{ops} above actually help me with my goal. I would love something like you suggested, and that's my hope for this package. I am just pushing the reduce_{ops} while exploring the package and understanding what it takes. I actually went down the path you suggested first, to check whether it's possible to generate code for a custom Julia function.

Also, the package I have, with CUDA.jl-based custom kernels, is unique on its own, and I don't have a reference implementation to compare against, a baseline that other DL and AI packages and pipelines enjoy. I am hoping this library will help me prepare a baseline too. Otherwise I will have to go the C++ route and deal with permutations and combinations of configurations, kernel splitting, fusion, etc., which is more daunting than the actual research work.

Appreciate the work you have put into this.

arhik added 13 commits January 16, 2026 16:56
…both float and integer types

- Add reduce_mul, reduce_min, reduce_and, reduce_or, reduce_xor intrinsic functions
- Remove AbstractFloat constraint from reduce_sum and reduce_max
- Add corresponding emit_intrinsic! handlers for new operations
- Add integer encode_reduce_body methods for and, or, xor operations

Summary of Additions

| Function | Symbol | Types |
|----------|--------|-------|
| `reduce_sum` | `:add` | Any |
| `reduce_max` | `:max` | Any |
| `reduce_mul` | `:mul` | Any |
| `reduce_min` | `:min` | Any |
| `reduce_and` | `:and` | Integer only |
| `reduce_or` | `:or` | Integer only |
| `reduce_xor` | `:xor` | Integer only |
- Add axis(i::Integer) -> Val{i-1} convenience function
- Use instead of raw Val for self-documenting axis selection
axis is a small convenience helper around `Val`.

But I see reduce is already one-based. Not sure if we should go with it. It doesn't harm anything; it's just a convenience.
…patch

- Add wrapper functions in operations.jl for reduce_mul, reduce_min, reduce_and,
  reduce_or, reduce_xor with appropriate type constraints
- Refactor identity value selection to use dispatch instead of if-else chain
- Correct identity values:
  - add: 0.0
  - max: -Inf (float) or 0 (int)
  - mul: 1.0
  - min: +Inf (float) or typemax(Int64) (int)
  - and: 0 (interpreted as -1 bits by backend)
  - or: 0
  - xor: 0
- Add IntIdentity struct to bytecode/writer.jl for proper integer identity encoding
- Add encode_tagged_int! function for encoding integer identity attributes (tag 0x01)
- Dispatch encode_identity! on identity type for proper encoding
- Update reduce_identity to return IntIdentity for integer operations
- Import ReduceIdentity, FloatIdentity, IntIdentity in intrinsics.jl

Identity values now properly typed:
- Float operations → FloatIdentity
- Integer operations → IntIdentity
The intrinsics were updated to support all types but the wrapper functions
in operations.jl still had T <: AbstractFloat constraint, causing method
lookup failures for integer types.
- reduce_sum, reduce_max, reduce_mul, reduce_min now use T <: Number
- Provides type safety while supporting all numeric types
- More self-documenting than unconstrained T
- IntegerIdentity now has signed::Bool field
- encode_tagged_int! encodes signed with zigzag varint, unsigned with plain varint
- Add is_signed() helper that checks T <: SignedInteger
- Update all reduce_identity calls to pass is_signed(T)
- Abstract type now called OperationIdentity to reflect use by both reduce and scan operations
- FloatIdentity and IntegerIdentity now inherit from OperationIdentity
- Updated comments and docs to reflect the broader scope
- Updated import in intrinsics.jl
arhik added 16 commits January 16, 2026 17:00
- IdentityOp: abstract type for binary operation identities
- FloatIdentityOp: concrete type for float identities
- IntegerIdentityOp: concrete type for integer identities (with signed field)
- Applied consistently across writer.jl, encodings.jl, intrinsics.jl, and core.jl
- T <: Integer && !(T <: Unsigned) correctly identifies:
  - Int32, Int64, etc. as signed (true)
  - UInt32, UInt64, etc. as unsigned (false)
Ensures type-consistent encoding for Int8, Int16, etc.

intrinsics: use type-dependent identity values for reduce ops

- add: zero(T)
- max: typemin(T)
- mul: one(T)
- min: typemax(T)
- and: is_signed ? -1 : typemax(T) for proper bit representation
- or, xor: zero(T)

Fixes encoding error for UInt32 (9223372036854775807 does not fit in 32 bits)
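
A minimal sketch of those type-dependent identities (the reduce_identity name appears in an earlier commit; the exact signatures here are assumptions):

# Hedged sketch: identity element per operation and element type.
reduce_identity(::Val{:add}, ::Type{T}) where {T} = zero(T)
reduce_identity(::Val{:mul}, ::Type{T}) where {T} = one(T)
reduce_identity(::Val{:max}, ::Type{T}) where {T} = typemin(T)
reduce_identity(::Val{:min}, ::Type{T}) where {T} = typemax(T)
reduce_identity(::Val{:and}, ::Type{T}) where {T<:Integer} = T <: Signed ? T(-1) : typemax(T)  # all bits set
reduce_identity(::Val{:or},  ::Type{T}) where {T<:Integer} = zero(T)
reduce_identity(::Val{:xor}, ::Type{T}) where {T<:Integer} = zero(T)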

Update core.jl

fix reduce_min identity to use typemax(T) instead of typemax(Int64)

- For UInt32, typemax(UInt32) = 4294967295 fits in 32 bits
- typemax(Int64) = 9223372036854775807 does not fit and caused encoding error
test: add comprehensive reduce operations tests

- Tests for all reduce ops: add, mul, min, max, and, or, xor
- Tests for Float32, Float64, Int32, UInt32, Int8 types
- Tests for axis 0 and axis 1 reductions
- Compares GPU results against CPU reference implementations
- Includes UInt32 and Int8 tests for identity encoding fix
Used an agent to create the tests, hence the wrath.
Prepares for reuse by scan operations. Function is shape-agnostic
and depends only on operation type and element type.
Julia's reduce with dims= requires an explicit init for the &, |, ⊻ operators.
Use typemax(T) for AND (its identity, with all bits set).
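
For the CPU reference side, that looks like the following sketch (variable names are illustrative):

# Base.reduce with dims= has no known neutral element for &, |, ⊻,
# so an explicit init is required.
ref_and = reduce(&, A; dims=1, init=typemax(eltype(A)))  # AND identity: all bits set
ref_or  = reduce(|, A; dims=1, init=zero(eltype(A)))
ref_xor = reduce(⊻, A; dims=1, init=zero(eltype(A)))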
The original code tried to convert Int64 directly to UInt64, which fails
for negative values like typemin(Int32) = -2147483648.

Zigzag encoding maps: (n << 1) ⊻ (n >> 63), enabling proper encoding of
negative integers in varint format.
Zigzag encoding: (n << 1) ⊻ (n >> 63) properly handles negative values
like typemin(Int32) = -2147483648.

Unsigned values use plain varint encoding since they don't need zigzag.
The correct implementation is in src/bytecode/basic.jl:
  function encode_signed_varint!(buf, x)
      x = x << 1
      if x < 0
          x = ~x
      end
      encode_varint!(buf, x)
  end

The duplicate in writer.jl was shadowing the correct one.
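
To make the mapping concrete, a small demo of zigzag on 64-bit values (illustration only, not repo code):

# Zigzag sends small-magnitude signed values to small unsigned codes.
zigzag(n::Int64) = reinterpret(UInt64, (n << 1) ⊻ (n >> 63))
zigzag(Int64(0))   # 0x0000000000000000
zigzag(Int64(-1))  # 0x0000000000000001
zigzag(Int64(1))   # 0x0000000000000002
zigzag(Int64(-2))  # 0x0000000000000003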
For unsigned integer types like UInt16, UInt32, the comparison must use
unsigned signedness, not the default SignednessSigned.

This fixes wrong reduction results for unsigned types where signed comparison
was causing values to be interpreted incorrectly (e.g., 0xFFFF interpreted as -1).
- writer.jl: Changed IntegerIdentityOp.value from Int64 to UInt128 to store full
  64-bit unsigned values. Added mask_to_width() to mask values to correct bit
  width before zigzag encoding.

- core.jl: Added to_uint128() helper to convert signed/unsigned values to
  UInt128 via bit reinterpretation for proper identity value storage.

- examples/reducekernel.jl: Added comprehensive tests for all reduce operations
  (min, max, sum, xor, or, and) on UInt16/32/64, Int16/32/64, and Float16/32/64.

Fixes:
- UInt64 reduce_min and reduce_and now work correctly
- Int16 reduce_max and reduce_and now work correctly
- All small integer types (Int8, Int16, Int32) now encode properly with SLEB128
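
A hedged sketch of the two helpers this commit names; the signatures are assumptions, not the actual cuTile code:

# to_uint128: store the raw bit pattern of an integer identity in UInt128.
to_uint128(v::Signed)   = UInt128(reinterpret(unsigned(typeof(v)), v))
to_uint128(v::Unsigned) = UInt128(v)

# mask_to_width: keep only the low `width` bits before zigzag encoding.
mask_to_width(v::UInt128, width::Int) = v & ((UInt128(1) << width) - 1)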
- Use kernel factory pattern matching reducekernel.jl style
- Replace per-row loops with simpler broadcasting approach
- Consolidate tests using @testset for loops over element types and operations
- Streamline CPU reference computation
@AntonOresten (Contributor)

If this is replaced by #37, could you close this one?

@arhik (Contributor, Author) commented Jan 18, 2026

#37 is a partial, simpler PR. Closing this one; I will open multiple PRs covering this work.

@arhik closed this Jan 18, 2026