Add GPU profiler #1997


Draft: jainapurva wants to merge 31 commits into main from gh/jainapurva/25/head

Conversation

jainapurva (Contributor) commented Apr 1, 2025

  • Introduces functions to upload trace files, generate Perfetto UI URLs, and profile model and memory usage (see the sketch after this list).
  • Adds comprehensive tests to validate profiler and memory profiling functionalities.
  • Updates benchmark configuration and runner logic to integrate these new profiling features.
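For context, a rough sketch of the profiling-and-trace flow these utilities wrap (illustrative only; the helper name and signature below are assumptions, not the PR's actual code):

```
import torch
from torch.profiler import ProfilerActivity, profile

def generate_model_profile_sketch(model, example_input, trace_path: str) -> str:
    # Illustrative sketch: profile one forward pass on CPU (and CUDA if available)
    # and export a Chrome-format trace that can be opened in the Perfetto UI.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
        with torch.no_grad():
            model(example_input)
    prof.export_chrome_trace(trace_path)
    return trace_path
```

The exported trace can then be uploaded (e.g., via upload_trace_file) and shared through a Perfetto UI URL, which is what the utilities added to benchmarks/microbenchmarks/utils.py provide.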


pytorch-bot bot commented Apr 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1997

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e367b21 with merge base 9516764:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 1, 2025
jainapurva added a commit that referenced this pull request Apr 1, 2025
)

ghstack-source-id: a5f8301acb77a180a395aa8dd4c1aa9c2ccd7522
ghstack-comment-id: 2770609971
Pull Request resolved: #1997
@jainapurva jainapurva changed the title Add profiler and Perfetto UI link with comprehensive tests (#1984, #1992) Add profiler and Perfetto UI link with comprehensive tests Apr 1, 2025
@jainapurva jainapurva requested a review from Copilot April 1, 2025 20:33

@Copilot Copilot AI left a comment


Pull Request Overview

This pull request adds profiling capabilities and integrates a Perfetto UI link to the benchmark framework while also enhancing memory profiling and test coverage.

  • Introduces functions to upload trace files, generate Perfetto UI URLs, and profile model and memory usage.
  • Adds comprehensive tests to validate profiler and memory profiling functionalities.
  • Updates benchmark configuration and runner logic to integrate these new profiling features.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File | Description
--- | ---
benchmarks/microbenchmarks/utils.py | New functions added for trace file upload, Perfetto URL generation, and model/memory profiling.
benchmarks/microbenchmarks/test/test_benchmark_profiler.py | New tests added to verify profiler and memory profiling functionality.
benchmarks/microbenchmarks/test/benchmark_config.yml | Updated configuration to enable profiling for specific models.
benchmarks/microbenchmarks/benchmark_runner.py | Enhanced error handling and conditional inclusion of benchmark results.
benchmarks/microbenchmarks/benchmark_inference.py | Integrated profiling steps into the inference benchmark run.
Comments suppressed due to low confidence (2)

benchmarks/microbenchmarks/utils.py:95

            "https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/" + manifold_path

benchmarks/microbenchmarks/utils.py:81

  • [nitpick] Consider adding a test case to simulate a failure in upload_trace_file (e.g., by mocking subprocess.run to return a non-zero exit code) to ensure that error handling in generate_model_profile works as expected; a hedged sketch of such a test follows below.
        print("[ERROR] Upload failed, maybe the trace file exists.")
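A hedged sketch of such a test (the import path, function signature, and expected return value are assumptions; adapt it to the actual helper in benchmarks/microbenchmarks/utils.py):

```
import subprocess
import unittest
from unittest import mock

class TestUploadTraceFileFailure(unittest.TestCase):
    def test_upload_failure_is_handled(self):
        # Assumed import path and signature; adjust to the real helper.
        from benchmarks.microbenchmarks.utils import upload_trace_file

        failed = subprocess.CompletedProcess(args=["manifold", "put"], returncode=1)
        with mock.patch("subprocess.run", return_value=failed):
            result = upload_trace_file("/tmp/trace.json")
        # Assumed behavior: the helper logs the failure and returns None so that
        # generate_model_profile can skip the Perfetto link gracefully.
        self.assertIsNone(result)
```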

@jainapurva jainapurva changed the title Add profiler and Perfetto UI link with comprehensive tests Add GPU profiler Apr 1, 2025
)

ghstack-source-id: a5f8301acb77a180a395aa8dd4c1aa9c2ccd7522
ghstack-comment-id: 2770609971
Pull Request resolved: #1997
@jainapurva jainapurva force-pushed the gh/jainapurva/25/head branch from ace610f to 967ea76 Compare April 4, 2025 06:00
@jainapurva jainapurva mentioned this pull request Apr 4, 2025
@jainapurva jainapurva mentioned this pull request Apr 4, 2025
@jainapurva jainapurva force-pushed the gh/jainapurva/25/head branch from 4acb7e8 to dd9f50d Compare April 4, 2025 16:55
[ghstack-poisoned]
This was referenced Apr 4, 2025
@jainapurva jainapurva marked this pull request as draft April 4, 2025 19:56
@jainapurva jainapurva requested a review from Copilot April 7, 2025 17:51

@Copilot Copilot AI left a comment


Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

```
DEFAULT_TTL_SEC = 28 * 24 * 60 * 60
file_name = os.path.basename(local_path)
manifold_path = os.path.join(
    MANIFOLD_FOLDER, f"{os.getlogin()}_{str(uuid.uuid4())}_{file_name}"
```

Copilot AI Apr 7, 2025


Consider using getpass.getuser() instead of os.getlogin() to handle non-interactive environments better, as os.getlogin() may fail in such contexts.

Suggested change

```
-MANIFOLD_FOLDER, f"{os.getlogin()}_{str(uuid.uuid4())}_{file_name}"
+MANIFOLD_FOLDER, f"{getpass.getuser()}_{str(uuid.uuid4())}_{file_name}"
```

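A small defensive variant of the same idea (a sketch, not code from this PR), falling back only when os.getlogin() is unavailable:

```
import getpass
import os

def current_user() -> str:
    # os.getlogin() requires a controlling terminal and can raise OSError in CI jobs,
    # daemons, and other non-interactive environments. getpass.getuser() falls back to
    # the LOGNAME/USER/LNAME/USERNAME environment variables and the password database.
    try:
        return os.getlogin()
    except OSError:
        return getpass.getuser()
```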

@jainapurva jainapurva added the topic: performance and benchmark labels Apr 8, 2025
metascroy and others added 8 commits April 8, 2025 14:08
Differential Revision: D71503133

Pull Request resolved: #1991
Differential Revision: D72179480

Pull Request resolved: #1998
Differential Revision: D71370592

Pull Request resolved: #1994
Summary:
fixing CI before branch cut

Test Plan:
python test/quantization/test_galore_quant.py

and CI

Reviewers:

Subscribers:

Tasks:

Tags:
Differential Revision: D71370597

Pull Request resolved: #2004
Differential Revision: D71370604

Pull Request resolved: #2006
jerryzh168 and others added 20 commits April 8, 2025 14:08
…ention

Differential Revision: D71370603

Pull Request resolved: #2008
Differential Revision: D71370598

Pull Request resolved: #2010
Differential Revision: D71370602

Pull Request resolved: #2011
Add KleidiAI gemm kernels (#2000)

Summary:

This PR pulls in two new KleidiAI kernels:
* kai_matmul_clamp_f32_qai8dxp1x4_qsi4c32p8x4_1x8_neon_dotprod (GEMV)
* kai_matmul_clamp_f32_qai8dxp4x4_qsi4c32p8x4_4x8_neon_dotprod (GEMM)

and adds them for automatic mr-based kernel selection when TORCHAO_ENABLE_ARM_NEON_DOT is set.  It also adds new tests for these kernels, and refactors the kleidiai testing code so that in future new kleidiai kernels can be tested with a one line addition:

```
TEST(
    test_linear_8bit_act_xbit_weight,
    matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod) {
  test_linear_8bit_act_xbit_weight_kleidiai<
      matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod>();
}
```

The existing testing code (kept for additional coverage) depended on code generation.

Reviewed By: Jack-Khuu

Differential Revision: D72179835
remove float8nocompile CI since it's flaky on sm89
**Summary:** Previously, `Int8DynActInt4QATQuantizer` had
slightly diverging numerics between the prepare and convert
steps. This is because the prepare step uses quantization
primitives shared with AQT (specifically `quantize_affine`
and `dequantize_affine`), while the convert step relies on
old ops from the `torch.ops.quantized_decomposed` namespace.
The diverging numerics is negligible for small models, but
the quantization errors begin to compound for larger models
with many linear layers.

More specifically, there are three different places where the
divergence occurs during activation quantization:

1. **Choose qparams.** The prepare step casts the qparams to
`torch.float32`, whereas the convert step casts the scales to
`torch.float64` and zero points to `torch.int64`.

2. **Quantize.** The prepare step rounds before adding the zero point
and uses torch functions, while the convert step adds the zero point
before rounding and uses torch tensor methods (see the small numeric
illustration after step 3).
```
# prepare step: round first, then add the zero point (torch functions)
x = torch.clamp(
    torch.round(x * (1.0 / scale)) + zero_point, qmin, qmax,
)

# convert step: add the zero point first, then round (tensor methods)
x = (
    x.mul(1.0 / scale)
    .add(zero_point)
    .round()
    .clamp(qmin, qmax)
    .to(quantize_dtype)
)
```

3. **Dequantize.** The prepare step casts to `torch.int32`
before subtracting the zero points, and casts back to the original
dtype before multiplying by the scale. The convert step only casts
at the very end.
```
# prepare step: integer subtraction of the zero point, cast back before scaling
x = x.to(torch.int32) - zero_point.to(torch.int32)
x = x.to(orig_dtype)
x = x * scale

# convert step: subtract and scale first, cast only at the very end
x = x - zero_point
x = x * scale
x = x.to(orig_dtype)
```
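For intuition, here is a tiny illustration (not part of the original commit message) of how the two orderings in step 2 can disagree at a tie point, assuming torch's round-half-to-even behavior:

```
import torch

x, scale, zero_point = torch.tensor([0.75]), 0.5, 1     # x / scale = 1.5, a tie case
prepare = torch.round(x * (1.0 / scale)) + zero_point   # round first, then add: round(1.5) + 1 = 3.0
convert = x.mul(1.0 / scale).add(zero_point).round()    # add first, then round: round(2.5) = 2.0
print(prepare.item(), convert.item())                   # 3.0 2.0 -- off by one quantized step
```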

This commit makes the convert path use the same torchao
quantization primitives as the prepare path, thereby resolving
the 3 above differences. Now, the prepare and convert steps match
exactly in terms of numerics over many trials.

**Test Plan:**
python test/quantization/test_qat.py -k test_fake_quantize_per_token_vs_convert
python test/quantization/test_qat.py -k test_qat_8da4w_prepare_vs_convert
stack-info: PR: #2022, branch: drisspg/stack/46
…2017)

Replace torchao.prototype.parq with facebookresearch/parq submodule
Differential Revision: D72413684

Pull Request resolved: #2023
* Fix slice and padding for TensorCoreTiledLayout for int4 weight only quantization

Summary:
Previously some of the code paths were not exercised, so the bug was not discovered.

There are some bugs related to the slice operation and padding: scale and zero_point
were not padded before, which resulted in errors when padding was required.

Test Plan:
python test/dtypes/test_affine_quantized.py -k test_slice

Reviewers:

Subscribers:

Tasks:

Tags:

* skip if no cuda

* update callsites for post_process

* add back missing post process

* adding missing arg for floatx
**Summary:** Fixes the issue where
`Int4WeightEmbeddingQATQuantizer`'s convert path assigned the
scales and zero points to the wrong attributes ("scales" and
"zeros" instead of "scale" and "zero point"), and also ensures
the precisions are correctly set.

**Test Plan:**
python test/quantization/test_qat.py -k test_qat_4w_embedding
* Add gguf q4_k_s quantization

Summary:
Didn't implement the algorithm to choose_qparams from gguf, since it's complicated, e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28

but implemented a simple choose_qparams that can fit the gguf format:
Q4_K: w = q * block_scale(6-bit) + block_min(6-bit)
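A minimal sketch (added here for illustration; not torchao's actual implementation, and the function name and block size are hypothetical) of such a min/max-based choose_qparams for this block format:

```
import torch

def choose_qparams_q4k_sketch(w: torch.Tensor, block_size: int = 32):
    # Illustrative only: per-block min/max qparams for w ≈ q * block_scale + block_min,
    # with 4-bit q in [0, 15]. The scales/mins stay in float here; gguf's Q4_K further
    # quantizes them to 6 bits, and the reference algorithm is more involved.
    blocks = w.reshape(-1, block_size)
    block_min = blocks.min(dim=1, keepdim=True).values
    block_scale = (blocks.max(dim=1, keepdim=True).values - block_min).clamp(min=1e-8) / 15.0
    q = torch.clamp(torch.round((blocks - block_min) / block_scale), 0, 15)
    return q, block_scale, block_min
```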

Test Plan:
python test/prototype/test_gguf_quant.py

Reviewers:

Subscribers:

Tasks:

Tags:

* fix

* test with phi4

* pre-commit run

* update

* run precommit

* format
Differential Revision: D72435827

Pull Request resolved: #2029
…lyConfig (#1972)

* up

* up

* up

* up

* up

* up

* up

* up
Labels
benchmark · CLA Signed · topic: performance
10 participants