
H100 receipts pack: M≪N work-shrink exact retrieval shows 308× lower J/query at N=20M (public-safe) #4672

@StanByriukov02

Description


Hi TensorRT team, I'm posting this as a routing request to the right CUDA-performance or inference-performance integration owner.

I have a public-safe H100 results pack (I will attach a zip to this issue). It contains:

  • raw benchmark JSON outputs + measured energy receipts (NVML / nvidia-smi sampling)
  • explicit PASS/FAIL gates
  • a compact one-pager summary (summary_public.json) + schema + a tiny validator script
  • the Python harness used to produce the key measurement, so the measurement side is not a black box

Headline result (from a single H100, short steady window; exact top‑1 check):
N=20,000,000 candidates, query_len=256

  • full_scan_top1: p95 ≈ 37.523 ms, energy/query ≈ 4.46297 J
  • range_scan_top1 (M≪N work‑shrink/routing): p95 ≈ 0.11414 ms, energy/query ≈ 0.0144809 J
    => ~308× lower J/query and ~329× lower p95 latency, with top‑1 exactness preserved.
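
As a quick sanity check, the headline ratios follow directly from the figures quoted above (a standalone recomputation, not part of the pack's tooling):

```python
# Recompute the headline ratios from the quoted measurements.
full_p95_ms, range_p95_ms = 37.523, 0.11414   # p95 latency, ms
full_j, range_j = 4.46297, 0.0144809          # energy per query, J

energy_ratio = full_j / range_j               # ≈ 308.2
latency_ratio = full_p95_ms / range_p95_ms    # ≈ 328.7
print(f"energy: {energy_ratio:.1f}x, latency: {latency_ratio:.1f}x")
```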

Why I think this matters:
This is a scaling-regime shift: the baseline cost scales with N, while the routed path scales with M≪N. Beyond some N the baseline becomes infeasible (an OOM wall for explicit N×N fp16 materialization), while the routed path still runs.
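
To illustrate the OOM wall, a back-of-the-envelope estimate, assuming "explicit N×N fp16 materialization" means a full pairwise fp16 score matrix (my reading, not a detail taken from the pack):

```python
# Rough memory footprint of an explicit N x N fp16 matrix at N = 20M.
N = 20_000_000
BYTES_FP16 = 2

matrix_bytes = N * N * BYTES_FP16      # 8e14 bytes
print(matrix_bytes / 1e12, "TB")       # 800.0 TB, vs ~80 GB of H100 HBM
```

At N = 20M that matrix is roughly four orders of magnitude larger than a single H100's HBM, which is why the routed M≪N path keeps running where the materialized baseline cannot.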

How to verify quickly after attaching the zip:

  1. Open README.txt in the zip (it points to the exact JSON field paths).
  2. Check the main record:
    prototypes/prototype_ctdr_landauer_lab/benchmarks/results/gpu_2025-12-19/joules_query_prefix_range.json
    Look for:
    • delta.energy_per_query_ratio_full_over_range
    • passfail.range_scan.pass == true
    • correctness.range.ok == true
  3. Optional: validate the pack summary (no external deps):
    python partner_packet_nvidia/public_teaser/pack_tools/validate_summary_public.py partner_packet_nvidia/public_teaser/assets/summary_public.json
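
For context, the three gates in step 2 can also be checked programmatically. A minimal sketch, assuming the dotted field paths above map one-to-one to JSON nesting (`check_record` and the stand-in values are mine, not part of the pack):

```python
def check_record(rec: dict) -> float:
    """Assert the pass/fail and correctness gates, then return the energy ratio."""
    assert rec["passfail"]["range_scan"]["pass"] is True
    assert rec["correctness"]["range"]["ok"] is True
    return rec["delta"]["energy_per_query_ratio_full_over_range"]

# Stand-in record shaped like the fields named in step 2; in practice you
# would json.load() the joules_query_prefix_range.json file instead.
sample = {
    "delta": {"energy_per_query_ratio_full_over_range": 308.2},
    "passfail": {"range_scan": {"pass": True}},
    "correctness": {"range": {"ok": True}},
}
print(check_record(sample))
```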

What I’m asking:
Who is the right person or team to evaluate this (CUDA perf / inference-perf integration)? If you can route me, I can share a short runbook and provide the full reproducible harness plus implementation details.

(For clarity: the pack contains no kernel source/PTX/SASS.)

CTDR_public_pack_20251219.zip
