Is chunk usage implemented here in FakeRoast or in the RzLinear repository? #3

@huang-hz

Thanks for your nice work! I am planning to validate the efficiency of chunk usage, but I find that it is commented out in the RzLinear code. Meanwhile, I also noticed this loop

```python
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    # Note that for simplicity, we don't apply a mask here.
    # This means that if K is not a multiple of BLOCK_SIZE_K,
    # this will access out-of-bounds memory and produce an
    # error or (worse!) incorrect results.
    # TODO(Keren): Add K checks
    #offs_k += BLOCK_SIZE_K TODO(aditya) this throws error map::at (do not know why)
    offs_k = k * BLOCK_SIZE_K + tl.arange(0, BLOCK_SIZE_K)
    a_mask = (offs_cm[:, None] < M) & (offs_k[None,:] < K)
    b_mask = (offs_k[:, None] < K) & (offs_cn[None,:] < N)
    a = tl.load(a_ptrs, mask=a_mask, other=a_zero)
    b = tl.load(b_ptrs, mask=b_mask, other=b_zero)
    # We accumulate along the K dimension
    c += tl.dot(a, b, allow_tf32=allow_tf32)
    # Advance the ptrs to the next K block
    a_ptrs += BLOCK_SIZE_K * stride_ak
    #TODO(aditya) temp int64 fix
    #b_ptrs = b_offset + ((k + 1) * R3 + pid_n * R2 +
    #                     R1) % R0 % (H - BLOCK_SIZE_K * BLOCK_SIZE_N)
    b_ptrs = b_offset + ((((k+1) * R3 + pid_n * R2 + R1)%R0) * R0 + (((k+1) * R7 + pid_n * R5 + R4)%R0)) % (H - BLOCK_SIZE_K * BLOCK_SIZE_N)
```

in `RzLinearForward.py`. My understanding is that the current `pid_n` loads values from K*BLOCK_SIZE_K addresses, which together make up the chunk. Is my understanding right?
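To make the question concrete, here is a minimal plain-Python sketch of how I read the `b_ptrs` update in the loop. It only reproduces the index arithmetic from that line. The helper names (`hashed_block_start`, `block_starts_for_pid_n`) and all numeric values are hypothetical, and I am assuming that the initial `b_ptrs` before the loop uses the same formula with `k = 0` and that `b_offset` covers one tile of `BLOCK_SIZE_K * BLOCK_SIZE_N` elements, neither of which is shown in the snippet.

```python
def hashed_block_start(k, pid_n, R, H, BLOCK_SIZE_K, BLOCK_SIZE_N):
    """Start offset into the compressed weight array for tile (k, pid_n).

    Mirrors the arithmetic of the b_ptrs update in the loop.
    R is a tuple (R0, ..., R7); R6 does not appear in that line.
    """
    R0, R1, R2, R3, R4, R5, _R6, R7 = R
    hi = (k * R3 + pid_n * R2 + R1) % R0
    lo = (k * R7 + pid_n * R5 + R4) % R0
    # The modulus keeps a tile of BLOCK_SIZE_K * BLOCK_SIZE_N elements
    # starting at this offset inside an array of length H.
    return (hi * R0 + lo) % (H - BLOCK_SIZE_K * BLOCK_SIZE_N)


def block_starts_for_pid_n(pid_n, K, R, H, BLOCK_SIZE_K, BLOCK_SIZE_N):
    """One start offset per iteration of the K loop for a fixed pid_n."""
    num_k_blocks = -(-K // BLOCK_SIZE_K)  # ceiling division, like tl.cdiv
    return [
        hashed_block_start(k, pid_n, R, H, BLOCK_SIZE_K, BLOCK_SIZE_N)
        for k in range(num_k_blocks)
    ]


# Hypothetical hash constants and sizes, chosen only for illustration.
R = (4093, 17, 31, 101, 7, 53, 0, 211)
starts = block_starts_for_pid_n(
    pid_n=2, K=128, R=R, H=1 << 16, BLOCK_SIZE_K=32, BLOCK_SIZE_N=32
)
print(starts)
```

If this reading is correct, a given `pid_n` gets one hashed start offset per K block, that is `tl.cdiv(K, BLOCK_SIZE_K)` offsets in total, and each offset is followed by one tile's worth of elements.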
