Thanks for your nice work! I am planning to validate the efficiency of the chunk usage, but I found that it is commented out in the RzLinear code. I also noticed the following loop:
```python
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    # Note that for simplicity, we don't apply a mask here.
    # This means that if K is not a multiple of BLOCK_SIZE_K,
    # this will access out-of-bounds memory and produce an
    # error or (worse!) incorrect results.
    # TODO(Keren): Add K checks
    #offs_k += BLOCK_SIZE_K TODO(aditya) this throws error map::at (do not know why)
    offs_k = k * BLOCK_SIZE_K + tl.arange(0, BLOCK_SIZE_K)
    a_mask = (offs_cm[:, None] < M) & (offs_k[None, :] < K)
    b_mask = (offs_k[:, None] < K) & (offs_cn[None, :] < N)
    a = tl.load(a_ptrs, mask=a_mask, other=a_zero)
    b = tl.load(b_ptrs, mask=b_mask, other=b_zero)
    # We accumulate along the K dimension
    c += tl.dot(a, b, allow_tf32=allow_tf32)
    # Advance the ptrs to the next K block
    a_ptrs += BLOCK_SIZE_K * stride_ak
    #TODO(aditya) temp int64 fix
    #b_ptrs = b_offset + ((k + 1) * R3 + pid_n * R2 +
    #                     R1) % R0 % (H - BLOCK_SIZE_K * BLOCK_SIZE_N)
    b_ptrs = b_offset + ((((k+1) * R3 + pid_n * R2 + R1)%R0) * R0 + (((k+1) * R7 + pid_n * R5 + R4)%R0)) % (H - BLOCK_SIZE_K * BLOCK_SIZE_N)
```
in RzLinearForward.py. My understanding is that the current pid_n loads values from K*BLOCK_SIZE_K addresses, which together compose the chunk. Is my understanding right?
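
To make the question concrete, here is a minimal sketch in plain Python (not Triton) of how I read the index arithmetic in the last line of the loop. The helper name `chunk_base_offsets` and all the numeric values are hypothetical, chosen only for illustration. I am also assuming that the pointer for the first K block (k = 0) is computed before the loop with the same formula, and that each tile occupies `BLOCK_SIZE_K * BLOCK_SIZE_N` elements starting at the hashed base; neither is visible in the snippet above.

```python
def chunk_base_offsets(K, BLOCK_SIZE_K, BLOCK_SIZE_N, H, pid_n, R):
    """Hashed base offset into the weight buffer for each K block.

    R = (R0, ..., R7) stands in for the hash constants; the values used
    in the example below are made up. R6 is unpacked but unused, matching
    the quoted kernel line.
    """
    R0, R1, R2, R3, R4, R5, R6, R7 = R
    # Ceiling division, equivalent to tl.cdiv(K, BLOCK_SIZE_K)
    num_k_blocks = -(-K // BLOCK_SIZE_K)
    # Largest base offset that still leaves room for one full tile
    limit = H - BLOCK_SIZE_K * BLOCK_SIZE_N
    bases = []
    for k in range(num_k_blocks):
        hi = (k * R3 + pid_n * R2 + R1) % R0
        lo = (k * R7 + pid_n * R5 + R4) % R0
        bases.append((hi * R0 + lo) % limit)
    return bases


# Hypothetical sizes and constants, for illustration only
bases = chunk_base_offsets(K=64, BLOCK_SIZE_K=32, BLOCK_SIZE_N=4,
                           H=1000, pid_n=1, R=(7, 1, 2, 3, 4, 5, 6, 8))
print(bases)  # one base offset per K block
```

If this reading is right, a given pid_n visits `tl.cdiv(K, BLOCK_SIZE_K)` hashed base offsets, one per K block, and loads one tile from each.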