Add GPU profiler #1997


Draft: jainapurva wants to merge 31 commits into main from gh/jainapurva/25/head

Conversation

jainapurva (Contributor) commented Apr 1, 2025

  • Introduces functions to upload trace files, generate Perfetto UI URLs, and profile model and memory usage (see the sketch after this list).
  • Adds comprehensive tests to validate profiler and memory profiling functionalities.
  • Updates benchmark configuration and runner logic to integrate these new profiling features.
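For context, a rough sketch of the profiling-and-trace flow these utilities wrap (illustrative only; the helper name and signature below are assumptions, not the PR's actual code):

```
import torch
from torch.profiler import ProfilerActivity, profile

def generate_model_profile_sketch(model, example_input, trace_path: str) -> str:
    # Illustrative sketch: profile one forward pass on CPU (and CUDA if available)
    # and export a Chrome-format trace that can be opened in the Perfetto UI.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
        with torch.no_grad():
            model(example_input)
    prof.export_chrome_trace(trace_path)
    return trace_path
```

The exported trace can then be uploaded (e.g., via upload_trace_file) and shared through a Perfetto UI URL, which is what the utilities added to benchmarks/microbenchmarks/utils.py provide.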


pytorch-bot bot commented Apr 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1997

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e367b21 with merge base 9516764:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 1, 2025
jainapurva added a commit that referenced this pull request Apr 1, 2025
)

ghstack-source-id: a5f8301acb77a180a395aa8dd4c1aa9c2ccd7522
ghstack-comment-id: 2770609971
Pull Request resolved: #1997
@jainapurva jainapurva changed the title Add profiler and Perfetto UI link with comprehensive tests (#1984, #1992) Add profiler and Perfetto UI link with comprehensive tests Apr 1, 2025
@jainapurva jainapurva requested a review from Copilot April 1, 2025 20:33

@Copilot Copilot AI left a comment


Pull Request Overview

This pull request adds profiling capabilities and integrates a Perfetto UI link to the benchmark framework while also enhancing memory profiling and test coverage.

  • Introduces functions to upload trace files, generate Perfetto UI URLs, and profile model and memory usage.
  • Adds comprehensive tests to validate profiler and memory profiling functionalities.
  • Updates benchmark configuration and runner logic to integrate these new profiling features.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File | Description
--- | ---
benchmarks/microbenchmarks/utils.py | New functions added for trace file upload, Perfetto URL generation, and model/memory profiling.
benchmarks/microbenchmarks/test/test_benchmark_profiler.py | New tests added to verify profiler and memory profiling functionality.
benchmarks/microbenchmarks/test/benchmark_config.yml | Updated configuration to enable profiling for specific models.
benchmarks/microbenchmarks/benchmark_runner.py | Enhanced error handling and conditional inclusion of benchmark results.
benchmarks/microbenchmarks/benchmark_inference.py | Integrated profiling steps into the inference benchmark run.
Comments suppressed due to low confidence (2)

benchmarks/microbenchmarks/utils.py:95

            "https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/" + manifold_path

benchmarks/microbenchmarks/utils.py:81

  • [nitpick] Consider adding a test case to simulate a failure in upload_trace_file (e.g., by mocking subprocess.run to return a non-zero exit code) to ensure that error handling in generate_model_profile works as expected; a hedged sketch of such a test follows below.
        print("[ERROR] Upload failed, maybe the trace file exists.")
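A hedged sketch of such a test (the import path, function signature, and expected return value are assumptions; adapt it to the actual helper in benchmarks/microbenchmarks/utils.py):

```
import subprocess
import unittest
from unittest import mock

class TestUploadTraceFileFailure(unittest.TestCase):
    def test_upload_failure_is_handled(self):
        # Assumed import path and signature; adjust to the real helper.
        from benchmarks.microbenchmarks.utils import upload_trace_file

        failed = subprocess.CompletedProcess(args=["manifold", "put"], returncode=1)
        with mock.patch("subprocess.run", return_value=failed):
            result = upload_trace_file("/tmp/trace.json")
        # Assumed behavior: the helper logs the failure and returns None so that
        # generate_model_profile can skip the Perfetto link gracefully.
        self.assertIsNone(result)
```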

@jainapurva jainapurva changed the title Add profiler and Perfetto UI link with comprehensive tests Add GPU profiler Apr 1, 2025
)

ghstack-source-id: a5f8301acb77a180a395aa8dd4c1aa9c2ccd7522
ghstack-comment-id: 2770609971
Pull Request resolved: #1997
@jainapurva jainapurva force-pushed the gh/jainapurva/25/head branch from ace610f to 967ea76 Compare April 4, 2025 06:00
@jainapurva jainapurva mentioned this pull request Apr 4, 2025
@jainapurva jainapurva mentioned this pull request Apr 4, 2025
@jainapurva jainapurva force-pushed the gh/jainapurva/25/head branch from 4acb7e8 to dd9f50d Compare April 4, 2025 16:55
[ghstack-poisoned]
This was referenced Apr 4, 2025
@jainapurva jainapurva marked this pull request as draft April 4, 2025 19:56
@jainapurva jainapurva requested a review from Copilot April 7, 2025 17:51

@Copilot Copilot AI left a comment


Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

```
DEFAULT_TTL_SEC = 28 * 24 * 60 * 60
file_name = os.path.basename(local_path)
manifold_path = os.path.join(
    MANIFOLD_FOLDER, f"{os.getlogin()}_{str(uuid.uuid4())}_{file_name}"
```

Copilot AI Apr 7, 2025


Consider using getpass.getuser() instead of os.getlogin() to handle non-interactive environments better, as os.getlogin() may fail in such contexts.

Suggested change

```
-MANIFOLD_FOLDER, f"{os.getlogin()}_{str(uuid.uuid4())}_{file_name}"
+MANIFOLD_FOLDER, f"{getpass.getuser()}_{str(uuid.uuid4())}_{file_name}"
```

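A small defensive variant of the same idea (a sketch, not code from this PR), falling back only when os.getlogin() is unavailable:

```
import getpass
import os

def current_user() -> str:
    # os.getlogin() requires a controlling terminal and can raise OSError in CI jobs,
    # daemons, and other non-interactive environments. getpass.getuser() falls back to
    # the LOGNAME/USER/LNAME/USERNAME environment variables and the password database.
    try:
        return os.getlogin()
    except OSError:
        return getpass.getuser()
```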

@jainapurva jainapurva added the topic: performance and benchmark labels Apr 8, 2025
metascroy and others added 8 commits April 8, 2025 14:08
Differential Revision: D71503133

Pull Request resolved: #1991
Differential Revision: D72179480

Pull Request resolved: #1998
Differential Revision: D71370592

Pull Request resolved: #1994
Summary:
fixing CI before branch cut

Test Plan:
python test/quantization/test_galore_quant.py

and CI

Reviewers:

Subscribers:

Tasks:

Tags:
Differential Revision: D71370597

Pull Request resolved: #2004
Differential Revision: D71370604

Pull Request resolved: #2006
jerryzh168 and others added 20 commits April 8, 2025 14:08
…ention

Differential Revision: D71370603

Pull Request resolved: #2008
Differential Revision: D71370598

Pull Request resolved: #2010
Differential Revision: D71370602

Pull Request resolved: #2011
Add KleidiAI gemm kernels (#2000)

Summary:

This PR pulls in two new KleidiAI kernels:
* kai_matmul_clamp_f32_qai8dxp1x4_qsi4c32p8x4_1x8_neon_dotprod (GEMV)
* kai_matmul_clamp_f32_qai8dxp4x4_qsi4c32p8x4_4x8_neon_dotprod (GEMM)

and adds them for automatic mr-based kernel selection when TORCHAO_ENABLE_ARM_NEON_DOT is set.  It also adds new tests for these kernels, and refactors the kleidiai testing code so that in future new kleidiai kernels can be tested with a one line addition:

```
TEST(
    test_linear_8bit_act_xbit_weight,
    matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod) {
  test_linear_8bit_act_xbit_weight_kleidiai<
      matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod>();
}
```

The existing testing code (kept for additional coverage) depended on code generation.

Reviewed By: Jack-Khuu

Differential Revision: D72179835
remove float8nocompile CI since it's flaky on sm89
**Summary:** Previously, `Int8DynActInt4QATQuantizer` had
slightly diverging numerics between the prepare and convert
steps. This is because the prepare step uses quantization
primitives shared with AQT (specifically `quantize_affine`
and `dequantize_affine`), while the convert step relies on
old ops from the `torch.ops.quantized_decomposed` namespace.
The diverging numerics is negligible for small models, but
the quantization errors begin to compound for larger models
with many linear layers.

More specifically, there are three different places where the
divergence occurs during activation quantization:

1. **Choose qparams.** The prepare step casts the qparams to
`torch.float32`, whereas the convert step casts the scales to
`torch.float64` and zero points to `torch.int64`.

2. **Quantize.** The prepare step rounds before adding the zero point
and uses torch functions, while the convert step adds the zero point
before rounding and uses torch tensor methods (see the small numeric
illustration after step 3).
```
# prepare step: round first, then add the zero point (torch functions)
x = torch.clamp(
    torch.round(x * (1.0 / scale)) + zero_point, qmin, qmax,
)

# convert step: add the zero point first, then round (tensor methods)
x = (
    x.mul(1.0 / scale)
    .add(zero_point)
    .round()
    .clamp(qmin, qmax)
    .to(quantize_dtype)
)
```

3. **Dequantize.** The prepare step casts to `torch.int32`
before subtracting the zero points, and casts back to the original
dtype before multiplying by the scale. The convert step only casts
at the very end.
```
# prepare step: integer subtraction of the zero point, cast back before scaling
x = x.to(torch.int32) - zero_point.to(torch.int32)
x = x.to(orig_dtype)
x = x * scale

# convert step: subtract and scale first, cast only at the very end
x = x - zero_point
x = x * scale
x = x.to(orig_dtype)
```
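For intuition, here is a tiny illustration (not part of the original commit message) of how the two orderings in step 2 can disagree at a tie point, assuming torch's round-half-to-even behavior:

```
import torch

x, scale, zero_point = torch.tensor([0.75]), 0.5, 1     # x / scale = 1.5, a tie case
prepare = torch.round(x * (1.0 / scale)) + zero_point   # round first, then add: round(1.5) + 1 = 3.0
convert = x.mul(1.0 / scale).add(zero_point).round()    # add first, then round: round(2.5) = 2.0
print(prepare.item(), convert.item())                   # 3.0 2.0 -- off by one quantized step
```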

This commit makes the convert path use the same torchao
quantization primitives as the prepare path, thereby resolving
the 3 above differences. Now, the prepare and convert steps match
exactly in terms of numerics over many trials.

**Test Plan:**
python test/quantization/test_qat.py -k test_fake_quantize_per_token_vs_convert
python test/quantization/test_qat.py -k test_qat_8da4w_prepare_vs_convert
stack-info: PR: #2022, branch: drisspg/stack/46
…2017)

Replace torchao.prototype.parq with facebookresearch/parq submodule
Differential Revision: D72413684

Pull Request resolved: #2023
* Fix slice and padding for TensorCoreTiledLayout for int4 weight only quantization

Summary:
Previously some of the code paths were not exercised, so the bug was not discovered.

There are some bugs related to the slice operation and padding: scale and zero_point
were not padded before, which resulted in errors when padding was required.

Test Plan:
python test/dtypes/test_affine_quantized.py -k test_slice

Reviewers:

Subscribers:

Tasks:

Tags:

* skip if no cuda

* update callsites for post_process

* add back missing post process

* adding missing arg for floatx
**Summary:** Fixes the issue where
`Int4WeightEmbeddingQATQuantizer`'s convert path assigned the
scales and zero points to the wrong attributes ("scales" and
"zeros" instead of "scale" and "zero point"), and also ensures
the precisions are correctly set.

**Test Plan:**
python test/quantization/test_qat.py -k test_qat_4w_embedding
* Add gguf q4_k_s quantization

Summary:
Didn't implement the algorithm to choose_qparams from gguf, since it's complicated, e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28

but implemented a simple choose_qparams that can fit the gguf format:
Q4_K: w = q * block_scale(6-bit) + block_min(6-bit)
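A minimal sketch (added here for illustration; not torchao's actual implementation, and the function name and block size are hypothetical) of such a min/max-based choose_qparams for this block format:

```
import torch

def choose_qparams_q4k_sketch(w: torch.Tensor, block_size: int = 32):
    # Illustrative only: per-block min/max qparams for w ≈ q * block_scale + block_min,
    # with 4-bit q in [0, 15]. The scales/mins stay in float here; gguf's Q4_K further
    # quantizes them to 6 bits, and the reference algorithm is more involved.
    blocks = w.reshape(-1, block_size)
    block_min = blocks.min(dim=1, keepdim=True).values
    block_scale = (blocks.max(dim=1, keepdim=True).values - block_min).clamp(min=1e-8) / 15.0
    q = torch.clamp(torch.round((blocks - block_min) / block_scale), 0, 15)
    return q, block_scale, block_min
```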

Test Plan:
python test/prototype/test_gguf_quant.py

Reviewers:

Subscribers:

Tasks:

Tags:

* fix

* test with phi4

* pre-commit run

* update

* run precommit

* format
Differential Revision: D72435827

Pull Request resolved: #2029
…lyConfig (#1972)

* up

* up

* up

* up

* up

* up

* up

* up
Labels
benchmark · CLA Signed · topic: performance
10 participants