Flashinfer==0.2.0 precision error when tested on vLLM unit tests #736

Open
Dr-Left opened this issue Jan 13, 2025 · 0 comments
Dr-Left commented Jan 13, 2025

I tested flashinfer==0.2.0 against the vLLM unit tests. During prefill, the precision mismatch seems unacceptable (~0.04 absolute error), and all prefill tests failed on the tensor-mismatch assertion (logs below).

Even when I rewrote the naive reference attention in fp32, the issue was still there and the difference between the two results remained unchanged.
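For context, a minimal sketch of the kind of fp32 naive reference I used (the exact reference implementation lives in the gist below; function name and tensor layout here are illustrative, assuming `(seq, heads, head_dim)` tensors):

```python
import torch

def naive_attention_fp32(q, k, v, scale):
    # Upcast everything to fp32, compute softmax(Q K^T * scale) V,
    # then cast back to the input dtype for comparison against the kernel.
    q32, k32, v32 = q.float(), k.float(), v.float()
    attn = torch.einsum("qhd,khd->hqk", q32, k32) * scale  # per-head logits
    attn = torch.softmax(attn, dim=-1)
    out = torch.einsum("hqk,khd->qhd", attn, v32)
    return out.to(q.dtype)
```

Even with this fully-fp32 reference, the ~0.04 absolute gap against the flashinfer output persisted, which suggests the discrepancy is not accumulation error in the fp16 reference.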

Env:
torch==2.3.0+cu121
vllm==0.6.4
cuda version: 12.1

Test file adapted from the vLLM codebase at tests/kernels/test_flashinfer.py

Full test code at https://gist.github.com/Dr-Left/ec3336767ae860a964501d9f0fbb35c0

Test failed logs:

============================= test session starts ==============================
platform linux -- Python 3.9.21, pytest-8.3.4, pluggy-1.5.0 -- ~/.conda/envs/vllm/bin/python
cachedir: .pytest_cache
rootdir: ~/draft_dp/vllm
configfile: pyproject.toml
plugins: anyio-4.7.0
collecting ... collected 480 items / 336 deselected / 144 selected

vllm/tests/kernels/test_flashinfer.py::test_flashinfer_prefill_with_paged_kv[None-dtype0-16-128-num_heads0-seq_lens0] 2025-01-13 08:20:38,634 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-01-13 08:21:26,992 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-01-13 08:21:27,021 - INFO - flashinfer.jit: Loading JIT ops: page
2025-01-13 08:21:38,031 - INFO - flashinfer.jit: Finished loading JIT ops: page
FAILED [  0%]

        qo_indptr = [0]
        kv_indptr = [0]
        kv_indices = []
        kv_last_page_lens = []
        for i in range(num_seqs):
            seq_len = kv_lens[i]
            assert seq_len > 0
            num_blocks = (seq_len + block_size - 1) // block_size
            kv_indices.extend(block_tables[i, :num_blocks])
            kv_indptr.append(kv_indptr[-1] + num_blocks)
            kv_last_page_len = seq_len % block_size
            if kv_last_page_len == 0:
                kv_last_page_len = block_size
            kv_last_page_lens.append(kv_last_page_len)
            qo_indptr.append(qo_indptr[-1] + query_lens[i])

        qo_indptr = torch.tensor(qo_indptr, dtype=torch.int32)
        kv_indptr = torch.tensor(kv_indptr, dtype=torch.int32)
        kv_indices = torch.tensor(kv_indices, dtype=torch.int32)
        kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32)
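To make the paged-KV metadata construction above concrete, here is the same loop run with made-up sequence lengths (the `kv_lens`/`query_lens` values are illustrative, not the ones vLLM's test parametrizes):

```python
# Same metadata logic as the test, with illustrative lengths.
block_size = 16
kv_lens = [16, 17]      # second sequence spills one token onto a second page
query_lens = [16, 17]

qo_indptr, kv_indptr, kv_last_page_lens = [0], [0], []
for seq_len, q_len in zip(kv_lens, query_lens):
    num_blocks = (seq_len + block_size - 1) // block_size  # ceil division
    kv_indptr.append(kv_indptr[-1] + num_blocks)
    last = seq_len % block_size
    # A full final page is reported as block_size, not 0.
    kv_last_page_lens.append(block_size if last == 0 else last)
    qo_indptr.append(qo_indptr[-1] + q_len)

print(kv_indptr)         # [0, 1, 3]
print(kv_last_page_lens) # [16, 1]
print(qo_indptr)         # [0, 16, 33]
```

The `% block_size == 0` special case matters: a sequence that exactly fills its last page must report a last-page length of `block_size`, not 0.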

        workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8)
        torch.set_default_device("cpu")
        wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(
            workspace_buffer, "NHD", backend="fa3")
        wrapper.begin_forward(
            qo_indptr,
            kv_indptr,
            kv_indices,
            kv_last_page_lens,
            num_query_heads,
            num_kv_heads,
            head_size,
            block_size,
            q_data_type=dtype,
            kv_data_type=dtype,
        )
        ...
                                    scale=scale,
                                    soft_cap=soft_cap)
>       torch.testing.assert_close(output, ref_output, atol=1e-2, rtol=1e-2), \
            f"{torch.max(torch.abs(output - ref_output))}"
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 1776 / 276480 (0.6%)
E       Greatest absolute difference: 0.0408935546875 at index (1, 1, 23) (up to 0.01 allowed)
E       Greatest relative difference: 453.25 at index (1, 0, 118) (up to 0.01 allowed)

vllm/tests/kernels/test_flashinfer.py:269: AssertionError
----------------------------- Captured stderr call -----------------------------
2025-01-13 08:20:38,634 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-01-13 08:21:26,992 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-01-13 08:21:27,021 - INFO - flashinfer.jit: Loading JIT ops: page
2025-01-13 08:21:38,031 - INFO - flashinfer.jit: Finished loading JIT ops: page
=============================== warnings summary ===============================
vllm/vllm/connections.py:8
  ~/draft_dp/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
  No module named 'vllm._version'
    from vllm.version import __version__ as VLLM_VERSION

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED vllm/tests/kernels/test_flashinfer.py::test_flashinfer_prefill_with_paged_kv[None-dtype0-16-128-num_heads0-seq_lens0] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1776 / 276480 (0.6%)
Greatest absolute difference: 0.0408935546875 at index (1, 1, 23) (up to 0.01 allowed)
Greatest relative difference: 453.25 at index (1, 0, 118) (up to 0.01 allowed)
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
=========== 1 failed, 336 deselected, 1 warning in 61.75s (0:01:01) ============
yzh119 self-assigned this Jan 13, 2025