Add GPU profiler #1997
Conversation
jainapurva commented Apr 1, 2025 (edited)
- Introduces functions to upload trace files, generate Perfetto UI URLs, and profile model and memory usage.
- Adds comprehensive tests to validate profiler and memory profiling functionalities.
- Updates benchmark configuration and runner logic to integrate these new profiling features.
Stack from ghstack (oldest at bottom):
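For context, here is a minimal sketch of what a trace-generating profiling helper along these lines might look like, built on torch.profiler. The function name, signature, and behavior are illustrative assumptions, not the PR's actual code:

```
# Hypothetical sketch (not the PR's actual code) of a model-profiling helper
# built on torch.profiler; names and signature are illustrative assumptions.
import torch
from torch.profiler import ProfilerActivity, profile


def generate_model_profile(model, example_input, profile_file_path):
    """Run one inference pass under the profiler and export a trace file."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        with torch.no_grad():
            model(example_input)
    # The exported Chrome trace JSON can be loaded in the Perfetto UI
    prof.export_chrome_trace(profile_file_path)
    return profile_file_path
```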
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1997
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit e367b21 with merge base 9516764. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull Request Overview
This pull request adds profiling capabilities and integrates a Perfetto UI link to the benchmark framework while also enhancing memory profiling and test coverage.
- Introduces functions to upload trace files, generate Perfetto UI URLs, and profile model and memory usage.
- Adds comprehensive tests to validate profiler and memory profiling functionalities.
- Updates benchmark configuration and runner logic to integrate these new profiling features.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
File | Description
--- | ---
benchmarks/microbenchmarks/utils.py | New functions added for trace file upload, Perfetto URL generation, and model/memory profiling.
benchmarks/microbenchmarks/test/test_benchmark_profiler.py | New tests added to verify profiler and memory profiling functionality.
benchmarks/microbenchmarks/test/benchmark_config.yml | Updated configuration to enable profiling for specific models.
benchmarks/microbenchmarks/benchmark_runner.py | Enhanced error handling and conditional inclusion of benchmark results.
benchmarks/microbenchmarks/benchmark_inference.py | Integrated profiling steps into the inference benchmark run.
Comments suppressed due to low confidence (2)

benchmarks/microbenchmarks/utils.py:95

- [nitpick] Consider using an f-string to construct the Perfetto UI URL for improved readability, e.g. `f"https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/{manifold_path}"`, instead of string concatenation:

```
"https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/" + manifold_path
```

benchmarks/microbenchmarks/utils.py:81

- [nitpick] Consider adding a test case that simulates a failure in upload_trace_file (e.g., by mocking subprocess.run to return a non-zero exit code) to ensure that the error handling in generate_model_profile works as expected:

```
print("[ERROR] Upload failed, maybe the trace file exists.")
```
jainapurva force-pushed the branch from ace610f to 967ea76, and later from 4acb7e8 to dd9f50d.
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
```
DEFAULT_TTL_SEC = 28 * 24 * 60 * 60

file_name = os.path.basename(local_path)
manifold_path = os.path.join(
    MANIFOLD_FOLDER, f"{os.getlogin()}_{str(uuid.uuid4())}_{file_name}"
)
```
Consider using getpass.getuser() instead of os.getlogin() to handle non-interactive environments better, as os.getlogin() may fail in such contexts.
MANIFOLD_FOLDER, f"{os.getlogin()}_{str(uuid.uuid4())}_{file_name}" | |
MANIFOLD_FOLDER, f"{getpass.getuser()}_{str(uuid.uuid4())}_{file_name}" |
Differential Revision: D71503133 Pull Request resolved: #1991
Differential Revision: D72179480 Pull Request resolved: #1998
Differential Revision: D71370592 Pull Request resolved: #1994
stack-info: PR: #1999, branch: drisspg/stack/45
Summary: fixing CI before branch cut. Test Plan: python test/quantization/test_galore_quant.py and CI.
* up * up
Differential Revision: D71370597 Pull Request resolved: #2004
Differential Revision: D71370604 Pull Request resolved: #2006
…ention Differential Revision: D71370603 Pull Request resolved: #2008
Differential Revision: D71370598 Pull Request resolved: #2010
Differential Revision: D71370602 Pull Request resolved: #2011
Add KleidiAI gemm kernels (#2000)

Summary: This PR pulls in two new KleidiAI kernels:

* kai_matmul_clamp_f32_qai8dxp1x4_qsi4c32p8x4_1x8_neon_dotprod (GEMV)
* kai_matmul_clamp_f32_qai8dxp4x4_qsi4c32p8x4_4x8_neon_dotprod (GEMM)

and adds them for automatic mr-based kernel selection when TORCHAO_ENABLE_ARM_NEON_DOT is set. It also adds new tests for these kernels, and refactors the KleidiAI testing code so that in the future new KleidiAI kernels can be tested with a one-line addition:

```
TEST(
    test_linear_8bit_act_xbit_weight,
    matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod) {
  test_linear_8bit_act_xbit_weight_kleidiai<
      matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod>();
}
```

The existing testing code (kept for more coverage) depended on code generation.

Reviewed By: Jack-Khuu. Differential Revision: D72179835
remove float8nocompile CI since it's flaky on sm89
**Summary:** Previously, `Int8DynActInt4QATQuantizer` had slightly diverging numerics between the prepare and convert steps. This is because the prepare step uses quantization primitives shared with AQT (specifically `quantize_affine` and `dequantize_affine`), while the convert step relies on old ops from the `torch.ops.quantized_decomposed` namespace. The diverging numerics are negligible for small models, but the quantization errors begin to compound for larger models with many linear layers. More specifically, there are three different places where the divergence occurs during activation quantization:

1. **Choose qparams.** The prepare step casts the qparams to `torch.float32`, whereas the convert step casts the scales to `torch.float64` and zero points to `torch.int64`.

2. **Quantize.** The prepare step performs the round before adding the zero point and uses torch functions, while the convert step adds before rounding and uses torch tensor methods (a small illustration of how these orderings can disagree follows this commit message):

```
# prepare
x = torch.clamp(
    torch.round(x * (1.0 / scale)) + zero_point,
    qmin,
    qmax,
)
# convert
x = (
    x.mul(1.0 / scale)
    .add(zero_point)
    .round()
    .clamp(qmin, qmax)
    .to(quantize_dtype)
)
```

3. **Dequantize.** The prepare step casts to `torch.int32` before subtracting the zero point, and casts back to the original dtype before multiplying by the scale. The convert step only casts at the very end:

```
# prepare
x = x.to(torch.int32) - zero_point.to(torch.int32)
x = x.to(orig_dtype)
x = x * scale
# convert
x = x - zero_point
x = x * scale
x = x.to(orig_dtype)
```

This commit makes the convert path use the same torchao quantization primitives as the prepare path, thereby resolving the three differences above. Now, the prepare and convert steps match exactly in terms of numerics over many trials.

**Test Plan:**
python test/quantization/test_qat.py -k test_fake_quantize_per_token_vs_convert
python test/quantization/test_qat.py -k test_qat_8da4w_prepare_vs_convert
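The ordering difference in step 2 is not just stylistic. Here is a tiny illustration (not from the PR; values chosen for the example) of how round-then-add and add-then-round can disagree at exact .5 ties, because torch.round uses round-half-to-even:

```
# Round-then-add vs. add-then-round diverge at .5 ties under
# round-half-to-even: round(0.5) = 0 but round(1.5) = 2.
import torch

x = torch.tensor([1.0])
inv_scale, zero_point, qmin, qmax = 0.5, 1.0, -8.0, 7.0

prepare = torch.clamp(torch.round(x * inv_scale) + zero_point, qmin, qmax)
convert = x.mul(inv_scale).add(zero_point).round().clamp(qmin, qmax)

print(prepare.item(), convert.item())  # 1.0 vs 2.0
```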
stack-info: PR: #2022, branch: drisspg/stack/46
…2017) Replace torchao.prototype.parq with facebookresearch/parq submodule
Differential Revision: D72413684 Pull Request resolved: #2023
…1967) * up * up * up * up
* Fix slice and padding for TensorCoreTiledLayout for int4 weight-only quantization. Summary: Previously some of the code paths were not exercised, so the bug went undiscovered, but there were bugs related to the slice operation and padding: scale and zero_point were not padded beforehand, which resulted in errors when padding was required. Test Plan: python test/dtypes/test_affine_quantized.py -k test_slice
* skip if no cuda
* update callsites for post_process
* add back missing post process
* adding missing arg for floatx
**Summary:** Fixes the issue where `Int4WeightEmbeddingQATQuantizer`'s convert path assigned the scales and zero points to the wrong attributes ("scales" and "zeros" instead of "scale" and "zero point"), and also ensures the precisions are correctly set. **Test Plan:** python test/quantization/test_qat.py -k test_qat_4w_embedding
* Add gguf q4_k_s quantization. Summary: Didn't implement the algorithm to choose_qparams from gguf, since it's complicated (e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28), but implemented a simple choose_qparams that can fit the gguf format: Q4_K: w = q * block_scale(6-bit) + block_min(6-bit). A minimal sketch of such a scheme follows this list. Test Plan: python test/prototype/test_gguf_quant.py
* fix
* test with phi4
* pre-commit run
* update
* run precommit
* format
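As an illustration of the simple scheme described above (w ≈ q * block_scale + block_min), here is a minimal per-block qparams sketch. The function name, block size, and layout details are assumptions for illustration, not the PR's implementation:

```
# Minimal sketch (not the PR's code) of per-block asymmetric 4-bit qparams
# for a Q4_K-style layout: w ≈ q * scale + min, with q in [0, 15].
import torch


def choose_qparams_q4k(w: torch.Tensor, block_size: int = 32):
    blocks = w.reshape(-1, block_size)
    w_min = blocks.min(dim=1, keepdim=True).values
    w_max = blocks.max(dim=1, keepdim=True).values
    scale = torch.clamp((w_max - w_min) / 15.0, min=1e-8)  # 16 levels for 4 bits
    q = torch.clamp(torch.round((blocks - w_min) / scale), 0, 15).to(torch.uint8)
    return q, scale, w_min


q, scale, w_min = choose_qparams_q4k(torch.randn(4, 32))
dequant = q.float() * scale + w_min  # reconstruction for sanity-checking
```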
…lyConfig (#1972) * up * up * up * up * up * up * up * up