Conversation

@arhik (Contributor) commented Jan 11, 2026

This PR extends the tile reduction operations to support a broader range of reduction operators for both floating-point and integer types.

Changes

1. Core Intrinsics (src/compiler/intrinsics/core.jl)

  • Added reduce_mul and reduce_min operations (work with any element type)
  • Added reduce_and, reduce_or, reduce_xor bitwise operations (integer-only)
  • Removed AbstractFloat constraint from reduce_sum and reduce_max
  • Added corresponding emit_intrinsic! handlers for new operations
  • Added integer encode methods for and, or, xor operations

2. Operations (src/language/operations.jl)

  • Added axis(i::Integer) helper function for converting 1-based axes to 0-based (see the sketch below)
  • Provides a self-documenting API: ct.cumsum(tile, ct.axis(1)) instead of a raw Val
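
A minimal sketch of the helper as described in this PR (per the commit note below, axis(i::Integer) returns Val(i - 1)):

# axis: convert a 1-based user-facing axis into the 0-based Val
# expected by the intrinsics layer.
axis(i::Integer) = Val(i - 1)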

3. Documentation (README.md)

  • Documented new reduce operations in the operations table
  • Added axis(i) helper to the reductions section

Supported Reductions

| Operation | Types | Description |
|-----------|-------|-------------|
| reduce_sum | Any | Sum along axis |
| reduce_mul | Any | Product along axis |
| reduce_max | Any | Maximum along axis |
| reduce_min | Any | Minimum along axis |
| reduce_and | Integer | Bitwise AND along axis |
| reduce_or | Integer | Bitwise OR along axis |
| reduce_xor | Integer | Bitwise XOR along axis |
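
For reference, a hedged usage sketch inside a kernel body, mirroring the reduce_min example later in this thread (tile is a previously loaded tile; Val(1) selects the first axis):

s = ct.reduce_sum(tile, Val(1))   # sum along the first axis
m = ct.reduce_max(tile, Val(1))   # maximum along the first axis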

@AntonOresten (Contributor)

What is the point of having an axis function? I don't think axes need to be corrected to 0-based indexing that early, and they don't need to be Val since they can be constant-propagated anyway.

@arhik marked this pull request as draft January 11, 2026 14:25
@arhik (Contributor, Author) commented Jan 11, 2026

@AntonOresten I agree.

I noted this in the commit related to it.

reduce ops update and axis convenience
axis is a small convenience helper around Val.

But I see reduce is already one-based. Not sure if we should go with it. It doesn't harm anything; it's just a convenience.

@arhik (Contributor, Author) commented Jan 11, 2026

I am still working on this and on scan operations. The axis convenience came up while working on scan ops; I just included it in this pull request. Either way, it's going to be a consensus decision as we go forward.

@AntonOresten (Contributor) commented Jan 11, 2026

Maybe the reduction methods can be exposed through Base.reduce, dispatching on the binary operator (+, max, * etc). The ct.reduce_* methods currently drop the reduced dimension, unlike Base.reduce. This might lead to a slippery slope, turning reshape, permute, transpose, etc. into methods of their Base counterparts, but then there's no nice counterpart for extract.
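
For illustration, a minimal sketch of that dispatch, keeping the dimension-dropping semantics of the current intrinsics; the Tile type name is hypothetical and this is not cuTile API:

# Hedged sketch: expose reductions through Base.reduce by dispatching
# on the binary operator.
Base.reduce(::typeof(+),   t::Tile; dims) = ct.reduce_sum(t, Val(dims))
Base.reduce(::typeof(*),   t::Tile; dims) = ct.reduce_mul(t, Val(dims))
Base.reduce(::typeof(max), t::Tile; dims) = ct.reduce_max(t, Val(dims))
Base.reduce(::typeof(min), t::Tile; dims) = ct.reduce_min(t, Val(dims))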

@arhik (Contributor, Author) commented Jan 11, 2026

Ideally yes.

using Test
using CUDA
using cuTile
import cuTile as ct

function reduceKernel(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1}, tile_size::ct.Constant{Int})
    bid = ct.bid(1)                         # 1-based block index along the first grid axis
    tile = ct.load(a, bid, (tile_size[],))  # load this block's tile of tile_size elements
    result = ct.reduce_min(tile, Val(1))    # minimum along the first (only) axis
    ct.store(b, bid, result)                # one scalar result per block
    return nothing
end

sz = 32
N = 2^15
a = CUDA.rand(Float32, N)
b = CUDA.zeros(Float32, cld(N, sz))
CUDA.@sync ct.launch(reduceKernel, cld(length(a), sz), a, b, ct.Constant(sz))

julia> b[1:5]
5-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.005346642
0.0034002797
0.029549925
0.14612293
0.0142624555

julia> minimum(a[1:32])
0.005346642f0

julia> minimum(a[33:64])
0.0034002797f0

@arhik (Contributor, Author) commented Jan 11, 2026

Yay, something! I don't have a Blackwell GPU, so I rented a vast.ai 5080 GPU. Worth it, I guess.

@AntonOresten (Contributor) commented Jan 11, 2026

Ah, I think there is a valid use of the axis helper, but it is not user-facing. I was rather confused by the emphasis and documentation of this one helper.

@arhik force-pushed the tilereduce branch 2 times, most recently from 1a44b31 to 3820e74 on January 11, 2026 20:41
@arhik marked this pull request as ready for review January 12, 2026 07:43
@arhik marked this pull request as draft January 12, 2026 07:58
@arhik marked this pull request as ready for review January 12, 2026 11:10
@maleadt (Member) commented Jan 13, 2026

I'm not a fan of the reduce_max etc. functions; those were just placeholders. We should probably expose these using more idiomatic constructs, like max or cumsum. If possible, by actually implementing something underlying like mapreduce! rather than Base.sum directly, though I'm not sure how realistic that is (initially it could error out on unsupported configurations, but longer term it could be extended to allow some of the more interesting combinations, like summing over a function or reducing only some dimensions).
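
One hedged reading of that suggestion; tile_mapreduce and its wiring are hypothetical, not existing cuTile API:

# Sketch of an underlying mapreduce-like primitive that initially errors
# on unsupported configurations, as suggested above.
function tile_mapreduce(f, op, t; dims)
    f === identity || error("mapping functions are not supported yet")
    op === (+) && return ct.reduce_sum(t, Val(dims))
    op === (*) && return ct.reduce_mul(t, Val(dims))
    op === max && return ct.reduce_max(t, Val(dims))
    op === min && return ct.reduce_min(t, Val(dims))
    error("unsupported reduction operator: $op")
end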

@arhik (Contributor, Author) commented Jan 13, 2026

@maleadt You spoke my mind verbatim. I have a package that is very heavy on complicated scan, reduce, and top-k-like computations with event-triggered state machine transitions, and none of the simple reduce_{ops} above actually help me with my goal. I would love something like you suggested, and that's my hope for this package. I am just pushing the reduce_{ops} while exploring the package and understanding what it takes. I actually went down the path you suggested first, to check whether it's possible to generate code for a custom Julia function.

Also, the package I have, with CUDA.jl-based custom kernels, is unique on its own, and I don't have a reference implementation to compare against, a baseline that other DL and AI packages and pipelines enjoy. I am hoping this library will help me prepare a baseline too. Otherwise I will have to go the C++ route and deal with permutations and combinations of configurations, kernel splitting, fusion, etc., which is more daunting than the actual research work.

Appreciate the work you have put into this.

arhik added 13 commits January 16, 2026 16:56
…both float and integer types

- Add reduce_mul, reduce_min, reduce_and, reduce_or, reduce_xor intrinsic functions
- Remove AbstractFloat constraint from reduce_sum and reduce_max
- Add corresponding emit_intrinsic! handlers for new operations
- Add integer encode_reduce_body methods for and, or, xor operations

Summary of Additions

| Function | Symbol | Types |
|----------|--------|-------|
| `reduce_sum` | `:add` | Any |
| `reduce_max` | `:max` | Any |
| `reduce_mul` | `:mul` | Any |
| `reduce_min` | `:min` | Any |
| `reduce_and` | `:and` | Integer only |
| `reduce_or` | `:or` | Integer only |
| `reduce_xor` | `:xor` | Integer only |
- Add axis(i::Integer) -> Val{i-1} convenience function
- Use instead of raw Val for self-documenting axis selection
axis is a small convenience helper around `Val`.

But I see reduce is already one-based. Not sure if we should go with it. It doesn't harm anything; it's just a convenience.
…patch

- Add wrapper functions in operations.jl for reduce_mul, reduce_min, reduce_and,
  reduce_or, reduce_xor with appropriate type constraints
- Refactor identity value selection to use dispatch instead of if-else chain
- Correct identity values:
  - add: 0.0
  - max: -Inf (float) or 0 (int)
  - mul: 1.0
  - min: +Inf (float) or typemax(Int64) (int)
  - and: 0 (interpreted as -1 bits by backend)
  - or: 0
  - xor: 0
- Add IntIdentity struct to bytecode/writer.jl for proper integer identity encoding
- Add encode_tagged_int! function for encoding integer identity attributes (tag 0x01)
- Dispatch encode_identity! on identity type for proper encoding
- Update reduce_identity to return IntIdentity for integer operations
- Import ReduceIdentity, FloatIdentity, IntIdentity in intrinsics.jl

Identity values now properly typed:
- Float operations → FloatIdentity
- Integer operations → IntIdentity
The intrinsics were updated to support all types but the wrapper functions
in operations.jl still had T <: AbstractFloat constraint, causing method
lookup failures for integer types.
- reduce_sum, reduce_max, reduce_mul, reduce_min now use T <: Number
- Provides type safety while supporting all numeric types
- More self-documenting than unconstrained T
- IntegerIdentity now has signed::Bool field
- encode_tagged_int! encodes signed with zigzag varint, unsigned with plain varint
- Add is_signed() helper that checks T <: SignedInteger
- Update all reduce_identity calls to pass is_signed(T)
- Abstract type now called OperationIdentity to reflect use by both reduce and scan operations
- FloatIdentity and IntegerIdentity now inherit from OperationIdentity
- Updated comments and docs to reflect the broader scope
- Updated import in intrinsics.jl
arhik added 16 commits January 16, 2026 17:00
- IdentityOp: abstract type for binary operation identities
- FloatIdentityOp: concrete type for float identities
- IntegerIdentityOp: concrete type for integer identities (with signed field)
- Applied consistently across writer.jl, encodings.jl, intrinsics.jl, and core.jl
- T <: Integer && !(T <: Unsigned) correctly identifies:
  - Int32, Int64, etc. as signed (true)
  - UInt32, UInt64, etc. as unsigned (false)
Ensures type-consistent encoding for Int8, Int16, etc.

intrinsics: use type-dependent identity values for reduce ops

- add: zero(T)
- max: typemin(T)
- mul: one(T)
- min: typemax(T)
- and: is_signed ? -1 : typemax(T) for proper bit representation
- or, xor: zero(T)

Fixes encoding error for UInt32 (9223372036854775807 does not fit in 32 bits)
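
A minimal sketch of those type-dependent identities (the reduce_identity name appears in an earlier commit; the exact signatures here are assumptions):

# Hedged sketch: identity element per operation and element type.
reduce_identity(::Val{:add}, ::Type{T}) where {T} = zero(T)
reduce_identity(::Val{:mul}, ::Type{T}) where {T} = one(T)
reduce_identity(::Val{:max}, ::Type{T}) where {T} = typemin(T)
reduce_identity(::Val{:min}, ::Type{T}) where {T} = typemax(T)
reduce_identity(::Val{:and}, ::Type{T}) where {T<:Integer} = T <: Signed ? T(-1) : typemax(T)  # all bits set
reduce_identity(::Val{:or},  ::Type{T}) where {T<:Integer} = zero(T)
reduce_identity(::Val{:xor}, ::Type{T}) where {T<:Integer} = zero(T)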

Update core.jl

fix reduce_min identity to use typemax(T) instead of typemax(Int64)

- For UInt32, typemax(UInt32) = 4294967295 fits in 32 bits
- typemax(Int64) = 9223372036854775807 does not fit and caused encoding error
test: add comprehensive reduce operations tests

- Tests for all reduce ops: add, mul, min, max, and, or, xor
- Tests for Float32, Float64, Int32, UInt32, Int8 types
- Tests for axis 0 and axis 1 reductions
- Compares GPU results against CPU reference implementations
- Includes UInt32 and Int8 tests for identity encoding fix
Used an agent to create the tests, hence the wrath.
Prepares for reuse by scan operations. Function is shape-agnostic
and depends only on operation type and element type.
Julia's reduce with dims= requires an explicit init for the &, |, ⊻ operators.
Use typemax(T) for AND (its identity, with all bits set).
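
For the CPU reference side, that looks like the following sketch (variable names are illustrative):

# Base.reduce with dims= has no known neutral element for &, |, ⊻,
# so an explicit init is required.
ref_and = reduce(&, A; dims=1, init=typemax(eltype(A)))  # AND identity: all bits set
ref_or  = reduce(|, A; dims=1, init=zero(eltype(A)))
ref_xor = reduce(⊻, A; dims=1, init=zero(eltype(A)))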
The original code tried to convert Int64 directly to UInt64, which fails
for negative values like typemin(Int32) = -2147483648.

Zigzag encoding maps: (n << 1) ⊻ (n >> 63), enabling proper encoding of
negative integers in varint format.
Zigzag encoding: (n << 1) ⊻ (n >> 63) properly handles negative values
like typemin(Int32) = -2147483648.

Unsigned values use plain varint encoding since they don't need zigzag.
The correct implementation is in src/bytecode/basic.jl:
  function encode_signed_varint!(buf, x)
      x = x << 1
      if x < 0
          x = ~x
      end
      encode_varint!(buf, x)
  end

The duplicate in writer.jl was shadowing the correct one.
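
To make the mapping concrete, a small demo of zigzag on 64-bit values (illustration only, not repo code):

# Zigzag sends small-magnitude signed values to small unsigned codes.
zigzag(n::Int64) = reinterpret(UInt64, (n << 1) ⊻ (n >> 63))
zigzag(Int64(0))   # 0x0000000000000000
zigzag(Int64(-1))  # 0x0000000000000001
zigzag(Int64(1))   # 0x0000000000000002
zigzag(Int64(-2))  # 0x0000000000000003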
For unsigned integer types like UInt16, UInt32, the comparison must use
unsigned signedness, not the default SignednessSigned.

This fixes wrong reduction results for unsigned types where signed comparison
was causing values to be interpreted incorrectly (e.g., 0xFFFF interpreted as -1).
- writer.jl: Changed IntegerIdentityOp.value from Int64 to UInt128 to store full
  64-bit unsigned values. Added mask_to_width() to mask values to correct bit
  width before zigzag encoding.

- core.jl: Added to_uint128() helper to convert signed/unsigned values to
  UInt128 via bit reinterpretation for proper identity value storage.

- examples/reducekernel.jl: Added comprehensive tests for all reduce operations
  (min, max, sum, xor, or, and) on UInt16/32/64, Int16/32/64, and Float16/32/64.

Fixes:
- UInt64 reduce_min and reduce_and now work correctly
- Int16 reduce_max and reduce_and now work correctly
- All small integer types (Int8, Int16, Int32) now encode properly with SLEB128
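
A hedged sketch of the two helpers this commit names; the signatures are assumptions, not the actual cuTile code:

# to_uint128: store the raw bit pattern of an integer identity in UInt128.
to_uint128(v::Signed)   = UInt128(reinterpret(unsigned(typeof(v)), v))
to_uint128(v::Unsigned) = UInt128(v)

# mask_to_width: keep only the low `width` bits before zigzag encoding.
mask_to_width(v::UInt128, width::Int) = v & ((UInt128(1) << width) - 1)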
- Use kernel factory pattern matching reducekernel.jl style
- Replace per-row loops with simpler broadcasting approach
- Consolidate tests using @testset for loops over element types and operations
- Streamline CPU reference computation
@AntonOresten (Contributor)

If this is replaced by #37, could you close this one?

@arhik (Contributor, Author) commented Jan 18, 2026

#37 is a partial, simpler PR. Closing this one; I will open multiple PRs covering this work.

@arhik closed this Jan 18, 2026