
profile with kineto for small kernels #80

Closed
wants to merge 81 commits

Conversation

amirakb89

This commit allows the user to profile tiny kernels with the Kineto profiler to eliminate the CPU overhead of the kernel launch from the measurement.
The following argument needs to be passed to run with this feature:
--export-trace
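
A minimal sketch of the idea using the standard `torch.profiler` (Kineto) API, not the benchmark's own flags: read kernel times from the exported trace so CPU-side launch overhead does not pollute the measurement. The matmul is only a stand-in for the tiny kernel under test.
```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(128, 128, device="cuda")
b = torch.randn(128, 128, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        torch.matmul(a, b)          # stand-in for the tiny kernel under test
    torch.cuda.synchronize()

# Kernel-level timings come from the Kineto trace, free of launch overhead.
prof.export_chrome_trace("small_kernel_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```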

r-barnes and others added 22 commits December 16, 2024 16:42
Reviewed By: jwfromm

Differential Revision: D67293350

fbshipit-source-id: 76ee573031729fd918cbdc0133c14f3fbbe3decf
Summary:
X-link: facebookresearch/FBGEMM#592

X-link: facebookresearch/FBGEMM#568

Pull Request resolved: pytorch#3488

- Break up D66310520 into backend and frontend diffs

Reviewed By: leitian

Differential Revision: D66986498

fbshipit-source-id: 1779a9a2a4611eda1298afc0e840839c7da46b10
Summary:
Pull Request resolved: pytorch#3484

X-link: facebookresearch/FBGEMM#565

Use a 3D grid to reduce the risk of running into grid-size overflow in
generate_vbe_metadata
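
A hedged sketch of the general pattern (not the actual generate_vbe_metadata launcher): factor a flat block count into a 3-D grid whose y/z dimensions stay under the 65535 CUDA limit, and reconstruct the linear block id inside the kernel with a `bid < total_blocks` guard, since the 3-D grid may over-provision blocks.
```python
import math

def split_grid(total_blocks: int, max_dim: int = 65535):
    """Factor a flat block count into (grid_x, grid_y, grid_z)."""
    grid_x = min(total_blocks, max_dim)
    grid_y = min(math.ceil(total_blocks / grid_x), max_dim)
    grid_z = math.ceil(total_blocks / (grid_x * grid_y))
    return grid_x, grid_y, grid_z

# Inside the kernel, the flat block id is then reconstructed as
#   bid = blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z)
# and guarded with `if bid >= total_blocks: return`.
print(split_grid(10_000_000_000))  # a count that would overflow a 1-D int32 index
```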

Reviewed By: r-barnes

Differential Revision: D66948760

fbshipit-source-id: 505d9b72e0d74d1707e4aa0ab9af48f26cf18b4a
…3454)

Summary:
Pull Request resolved: pytorch#3454

X-link: facebookresearch/FBGEMM#538

This is 2/2 of enabling bounds check V2 for APS FM. Following APS principles, we would like to surface the V2 switch up to the APS user config, so this diff extends the existing BoundsCheckMode with V2 counterparts and passes the version flag into the operator.

This diff enables V2 via a backward-compatible update of the modes with a V2 prefix, which makes it intuitive for users to switch. A sketch of the shape of that change is shown below.

More context can be found in https://docs.google.com/document/d/1hEhk2isMOXuWPyQJxiOzNq0ivfECsZUT7kT_IBmou_I/edit?tab=t.0#heading=h.q89rllowo3eb
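
A hedged sketch of the backward-compatible mode extension described above; the member names and values are illustrative, not the real BoundsCheckMode numbering.
```python
from enum import IntEnum

class BoundsCheckMode(IntEnum):
    FATAL = 0
    WARNING = 1
    IGNORE = 2
    # V2 counterparts added alongside the existing modes, so callers opt in
    # simply by switching to the V2-prefixed name in their config.
    V2_FATAL = 3
    V2_WARNING = 4
    V2_IGNORE = 5

def is_v2(mode: BoundsCheckMode) -> bool:
    # The version flag passed down into the operator.
    return mode.name.startswith("V2_")
```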

Reviewed By: sryap

Differential Revision: D66512098

fbshipit-source-id: d2181a82462ca1c2c93360d4108766edeb38d000
Summary:
Pull Request resolved: pytorch#3444

X-link: facebookresearch/FBGEMM#530

This diff adds support for true dynamic M as is found in grouped_gemm. To do so, we add a new `zero_start_index_M` argument that must be provided by the user and indicates the number of non-zero M in each tensor. One nice thing about this approach is that we can now do a single kernel call to set up the gemm arguments.

We make `zero_start_index_M` optional, as it requires fixed N and K. When N and K vary across groups, we use the previous static-shape approach.
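
An eager reference for the `zero_start_index_M` contract, as a sketch only; the shapes and the call here are illustrative and do not match the FBGEMM operator's signature.
```python
import torch

G, max_M, N, K = 4, 256, 512, 128
x = torch.randn(G, max_M, K)   # padded activations, one slice per group
w = torch.randn(G, N, K)       # fixed N and K across groups

# Number of valid (non-padding) M rows in each group.
zero_start_index_M = torch.tensor([256, 17, 0, 93])

# Eager reference for what the fused grouped gemm computes per group:
out = torch.zeros(G, max_M, N)
for g in range(G):
    m = int(zero_start_index_M[g])
    out[g, :m] = x[g, :m] @ w[g].T
```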

Reviewed By: bradleyhd, jiawenliu64

Differential Revision: D66682886

fbshipit-source-id: 9c4554dba9becf33fcc87cd1b01266fead716916
Summary:
Pull Request resolved: pytorch#3509

X-link: facebookresearch/FBGEMM#593

When calculating num_threads and groups_per_thread to distribute work, the rounding accumulates and effectively expands the input space.

for example (the new UT), when input tensor is (1, 2^31 - 8),
```
a.numel: 2147483640
num_threads: 46341
groups_per_thread: 1449
num_groups: 67108864
num_threads * groups_per_threads= 67148109 > num_groups
```

In the kernel, when we try to access memory, input_start = num_threads * groups_per_threads * pid, so when pid is large we end up reading data outside the input.
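
The numbers from the unit test above, plus a sketch of a guard; the clamp shown is an illustration of the idea, not necessarily the exact fix in this diff.
```python
# Numbers from the commit message, for an input of shape (1, 2**31 - 8):
numel = 2147483640
num_groups = 67108864
num_threads = 46341
groups_per_thread = 1449

# The two roundings compound, so the implied iteration space is larger
# than the actual number of groups:
assert num_threads * groups_per_thread == 67148109
assert num_threads * groups_per_thread > num_groups

# A thread's starting offset derived from these rounded values can therefore
# point past the end of the tensor for a large pid.  One way to guard against
# that (sketch) is to clamp each thread's group range before computing offsets:
pid = num_threads - 1
group_start = min(groups_per_thread * pid, num_groups)
group_end = min(group_start + groups_per_thread, num_groups)
```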

Reviewed By: jwfromm

Differential Revision: D67369392

fbshipit-source-id: 62c28fe3a94911a10921e233ff5ae42097e9dbb4
…3508)

Summary:
Pull Request resolved: pytorch#3508

X-link: facebookresearch/FBGEMM#589

Reland D66990975 with a fix for the NaN issue observed during the LLaMa4 17B model run with the fp8_rowwise FFN.

Specifically, offset was not properly updated when loading/storing data.

Reviewed By: jwfromm

Differential Revision: D67303282

fbshipit-source-id: 334d32019424de6daff4261b1d5ebe3c977fdabd
Summary:
X-link: facebookresearch/FBGEMM#597

Pull Request resolved: pytorch#3516

Added pyper configuration for MX4 group size.

Reviewed By: irobert0126, renganxu

Differential Revision: D67407064

fbshipit-source-id: a23765777879491836fcb9f1a00ba8f1e1b26b76
…orch#3512)

Summary:
X-link: facebookresearch/FBGEMM#596

Pull Request resolved: pytorch#3512

Reviewed By: avikchaudhuri

Differential Revision: D67381311

fbshipit-source-id: 345264f99d6f4b77508b4ea95fe20b3482ad1f04
Summary:
X-link: facebookresearch/FBGEMM#599

- Fix the CMake minimum version in conda install
- Fix issue with missing `librhash.so.0` when installing `gcc`
- Fix build issues with bazel, and upgrade bazel version to latest

Pull Request resolved: pytorch#3514

Reviewed By: spcyppt

Differential Revision: D67435456

Pulled By: q10

fbshipit-source-id: 2fe53c59251df3633771b2b6b0d97c15a33df7b6
Summary:
Pull Request resolved: pytorch#3519

X-link: facebookresearch/FBGEMM#601

For extremely large inputs, we found that boundary-check values were large enough to cause integer overflow. This resulted in Triton triggering masking for all loads and stores, which led to garbage outputs.

This diff fixes the issue by more carefully doing int64 upcasting for super large tensors. After this change, all super large tests pass.
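
A hedged Triton sketch of the int64 upcast, for illustration only (not the kernels touched by this diff): with more than 2**31 elements, offsets computed in int32 wrap around and the bounds mask rejects every element.
```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    # Upcast before the multiply so the offsets stay correct past 2**31.
    pid = tl.program_id(0).to(tl.int64)
    offs = pid * BLOCK + tl.arange(0, BLOCK).to(tl.int64)
    mask = offs < n_elements
    vals = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, vals, mask=mask)
```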

Reviewed By: qchip

Differential Revision: D67495115

fbshipit-source-id: dcea639a7343d5782823f103a0572870aa496b05
Summary:
X-link: facebookresearch/FBGEMM#602

Pull Request resolved: pytorch#3520

In the diff D66465811 we introduced a bulk initialization function `_insert_all_kv` for ssd tensors. However, large tensors take a long time to fully initialize, and ideally this can happen in the background so it doesn't increase TTFB of the training jobs.

This change does exactly that: it moves the initialization to a separate thread, allowing other initialization in the training job, like reading data, to happen concurrently. To avoid pushing synchronization into user space, this change introduces a getter and setter for ssd_db, which ensure initialization is fully done before the weights are used.
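
A minimal sketch of the pattern with assumed names (this is not the FBGEMM class itself): kick off the expensive bulk initialization on a background thread, and make every access to the handle wait for it, so callers never see a half-built db.
```python
import threading

class LazySsdDb:
    def __init__(self, weights):
        self._ssd_db = None
        self._init_thread = threading.Thread(
            target=self._bulk_init, args=(weights,), daemon=True
        )
        self._init_thread.start()       # overlaps with e.g. data loading

    def _bulk_init(self, weights):
        self._ssd_db = dict(enumerate(weights))  # stand-in for _insert_all_kv

    @property
    def ssd_db(self):                   # getter: block until init is finished
        self._init_thread.join()
        return self._ssd_db

    @ssd_db.setter
    def ssd_db(self, value):            # setter: don't race the init thread
        self._init_thread.join()
        self._ssd_db = value
```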

Reviewed By: duduyi2013, drdarshan, jiayulu

Differential Revision: D67480511

fbshipit-source-id: 6faf54621fc6e26a9791ac23e48aa7890329077a
Summary:
X-link: facebookresearch/FBGEMM#605

- Set the --plat-name explicitly to `manylinux_2_28`

Pull Request resolved: pytorch#3521

Reviewed By: spcyppt

Differential Revision: D67538191

Pulled By: q10

fbshipit-source-id: b2f8cc0b81c7e46bd2e380c03a6fa68da11786d6
Summary:
Pull Request resolved: pytorch#3342

X-link: facebookresearch/FBGEMM#436

A new optional optimizer state `row_counter` is added to Adam to perform bias correction per embedding row. `row_counter` serves as the iteration counter when a row (an index) occurs and is used to do bias correction.

Without rowwise bias correction (existing Adam),
```
m_hat_t = m_t / (1.0 - powf(beta1, iter));
v_hat_t = v_t / (1.0 - powf(beta2, iter));
```

With rowwise bias correction enabled.
```
// when index `idx` occurs
_row_counter = row_counter[idx] + 1;
m_hat_t = m_t / (1.0 - powf(beta1, _row_counter));
v_hat_t = v_t / (1.0 - powf(beta2, _row_counter));
```

This request is from IG to allow all the models to be scaled on sparse features with expected 1.5% NE on Stories.

-------

**__The functionality is not set by default.__** Frontend: D64848802

To enable the bias correction, `use_rowwise_bias_correction` needs to be set to True through extra_optimizer_config.
```
extra_optimizer_config = UserEnabledConfigDefinition(use_rowwise_bias_correction=True)
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (E, D, M, compute_device) for (E, D, M) in zip(Es, Ds, managed)
    ],
    optimizer=OptimType.Adam,
    extra_optimizer_config=extra_optimizer_config,
    ...
)
```
------
**__Performance from Kineto__** (unweighted)
```
                   Baseline* |  default** | enabled***
forward  | cpu  |   2.293 s  |   2.188 s  |   2.043 s
         | cuda |  12.512 ms |  12.539 ms |  12.547 ms
backward | cpu  |  69.861 ms |  66.546 ms |  65.880 ms
         | cuda | 103.429 ms | 103.395 ms | 103.130 ms
```
\* Baseline: before changes
\** default: default setting; use_bias_correction = False
\*** enabled: use_bias_correction = True

Reviewed By: sryap

Differential Revision: D64808460

fbshipit-source-id: 9706bcc4601b370f4d67c81b833fb1cd46377a6c
Summary:
X-link: facebookresearch/FBGEMM#611

I'm switching to 3.12 to build FBGEMM docs for now.  The trigger of the failure is that we now have torch 3.13t as an experimental nightly build.

cc atalman There seems to be a mix-up in how 3.13 and 3.13t are used in this workflow. This could cause trouble when people try to use torch 3.13 and 3.13t.

### Testing

https://github.com/pytorch/FBGEMM/actions/runs/12474407777/job/34816361009?pr=3531

Pull Request resolved: pytorch#3531

Reviewed By: huydhn

Differential Revision: D67612996

Pulled By: q10

fbshipit-source-id: aeec242e2a5919dd4cf521f1a1a727b48c354bf6
Summary:
X-link: facebookresearch/FBGEMM#612

This reverts commit 5c16f4b.  This is not needed anymore after pytorch/pytorch#143423.  I think this will also fix the issue with building torchrec CPU https://github.com/pytorch/FBGEMM/actions/runs/12470608879/job/34806045264?pr=3528#step:18:219

### Testing

https://github.com/pytorch/FBGEMM/actions/runs/12470608879

Pull Request resolved: pytorch#3528

Reviewed By: q10

Differential Revision: D67602736

Pulled By: huydhn

fbshipit-source-id: ec5888acacd96295dd6dfe26e5fa87b28810b2bc
Summary:
Pull Request resolved: pytorch#3524

X-link: facebookresearch/FBGEMM#606

as title

Reviewed By: jiayulu

Differential Revision: D67563019

fbshipit-source-id: 6c349c2bdc2b97f1455e52dc102fe674a75c375f
Summary:
Pull Request resolved: pytorch#3525

X-link: facebookresearch/FBGEMM#607

After calling tensor.detach(), tensor.requires_grad automatically switches to False, following the same pattern as the PMT tensor.
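
The standard PyTorch behavior being relied on here:
```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.detach()
assert y.requires_grad is False   # detach() returns a tensor outside autograd
```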

Reviewed By: jiayulu

Differential Revision: D67563018

fbshipit-source-id: 43f2437327f471351e436cd019b423841773ac99
Summary:
X-link: facebookresearch/FBGEMM#613

- Add support for CUDA 12.6 builds in OSS

Pull Request resolved: pytorch#3503

Reviewed By: spcyppt

Differential Revision: D67662576

Pulled By: q10

fbshipit-source-id: f9f6b16d7a9f9153b4afdbbbb504ad47bf908095
Summary:
X-link: facebookresearch/FBGEMM#609

Pull Request resolved: pytorch#3527

Masking is needed to avoid out-of-bounds references for the last row. This fixes an illegal memory access error.
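
A hedged Triton sketch of the kind of guard involved, not the actual kernel from this diff: when the row count is not a multiple of the block size, the last block must mask the rows it covers past the end of the tensor.
```python
import triton
import triton.language as tl

@triton.jit
def scale_rows_kernel(x_ptr, out_ptr, n_rows, n_cols, BLOCK_M: tl.constexpr):
    pid = tl.program_id(0)
    rows = pid * BLOCK_M + tl.arange(0, BLOCK_M)
    # The last block can cover rows past n_rows; without this mask the loads
    # and stores below would touch memory beyond the tensor.
    row_mask = rows < n_rows
    x = tl.load(x_ptr + rows * n_cols, mask=row_mask, other=0.0)
    tl.store(out_ptr + rows * n_cols, x * 2.0, mask=row_mask)
```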

Reviewed By: jianyuh

Differential Revision: D67588103

fbshipit-source-id: 7ea9203bea8212e7ab340891b1d49d7c9eb85d39
Summary:
X-link: facebookresearch/FBGEMM#614

Pull Request resolved: pytorch#3532

Adding a warp-specialized GEMM kernel. A couple of highlights:

- `tl.async_task` warp partitioning. Rewrote the well-known flattened 1D-loop persistent GEMM kernel as a more intuitive, natural warp-specialized 2D-loop persistent kernel with a producer/consumer cooperative model.
- One kernel for both: autotuning is enabled across WS and non-WS configurations with the same kernel, since WS may not always be beneficial. We have also updated the tutorial to support multiple modes, including non-WS.
- Use regular loads instead of TMA for scale loading within each consumer.
- Compiler-automated omission of accumulator initialization. The first iteration of the matmul K-loop does not need an accumulator when computing the output that is fed to the next iteration as the accumulator, so peeling the first iteration out of the K-loop avoids the zero-initialization of the accumulator. `tl.assume` is used to handle the case where the loop doesn't run at all.

Reviewed By: jianyuh

Differential Revision: D67676051

fbshipit-source-id: 4c552e37c358dc48d19b26ea3019c7afcb1ef18a
@avbokovoy avbokovoy left a comment

Overall, LGTM

jwfromm and others added 5 commits December 31, 2024 09:48
Summary:
Pull Request resolved: pytorch#3522

X-link: facebookresearch/FBGEMM#603

It turns out that setting up the grouped gemm kernel arguments can be a significant overhead. This diff more carefully checks the number of groups to dispatch to either a hipmemcpy-based approach, which works well when there are 16 or more groups, or a series of kernels that directly set the GPU memory for each group. For smaller numbers of groups, this approach provides a pretty substantial speedup.

For example when running this command:
```
buck2 run @//mode/{opt,amd-gpu,inplace} -c fbcode.enable_gpu_sections=true -c fbcode.triton_backend=amd -c fbcode.rocm_arch=mi300 //deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/bench:quantize_bench -- --kernels=bf16_baseline,ck_rowwise,ck_rowwise_grouped --grouped --M=128,64 --N=8192,8192 --K=8192,4096 --no_cuda_graph
```

Performance before was:
```
ck_rowwise_grouped sim: 17.438.
ck_rowwise_grouped ms: 0.136.
ck_rowwise_grouped TFLOPS: 158.115.
ck_rowwise_grouped GB/s: 773.976.
```

Performance after is:
```
ck_rowwise_grouped sim: 17.438.
ck_rowwise_grouped ms: 0.112.
ck_rowwise_grouped TFLOPS: 192.489.
ck_rowwise_grouped GB/s: 942.236.
```

Reviewed By: jianyuh, mxz297

Differential Revision: D67531231

fbshipit-source-id: e1a6f4af969993dc755ae83de8d6008ddc966391
Summary:
Pull Request resolved: pytorch#3534

X-link: facebookresearch/FBGEMM#616

This diff cleans up some of the APIs for FBGEMM grouped gemm and updates CUTLASS bf16 grouped gemm to use a single kernel launch to initialize gemm arguments. This should help reduce overhead a bit.

The only notable API change exposed to the user is that all grouped gemm functions now return lists of outputs, where bf16 previously returned a single tensor blob. This does mean that in some cases we'll have to do an extra `torch.stack` to unify the groups. If this turns out to be costly, I think we can instead have two grouped gemm implementations, one for dynamic shapes (which returns a single tensor) and one for static shapes, which returns a list of tensors.
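
A sketch of the caller-side change; `grouped_gemm` below is a hypothetical stand-in for the FBGEMM op, used only to show the list-of-tensors return and the extra stack.
```python
import torch

def grouped_gemm(xs, ws):
    # Stand-in for the FBGEMM op: each group is a separate matmul, and the
    # result is now a list of per-group tensors rather than one blob.
    return [x @ w.T for x, w in zip(xs, ws)]

xs = [torch.randn(64, 128) for _ in range(4)]
ws = [torch.randn(256, 128) for _ in range(4)]

outputs = grouped_gemm(xs, ws)      # list of 4 tensors, each (64, 256)
blob = torch.stack(outputs)         # extra stack when one tensor is required
```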

Reviewed By: jianyuh

Differential Revision: D67423469

fbshipit-source-id: 1016b84856cf19942e6d1763ab982766b700475d
Summary:
Pull Request resolved: pytorch#3526

X-link: facebookresearch/FBGEMM#608

Implementation of CK based BF16 Grouped Gemm. Currently performance is quite poor :(

Reviewed By: zjing14

Differential Revision: D67261862

fbshipit-source-id: 98d38c7f238ccbc97769c6b3a36e1d1540c1a6ce
Summary:
Pull Request resolved: pytorch#3530

X-link: facebookresearch/FBGEMM#610

This small diff adds the new Cutlass fp8 grouped gemm operators to quantize bench for easier performance profiling. I also adjusted the groups argument to quantize_bench to make it a bit easier to specify input shapes.

Reviewed By: jianyuh

Differential Revision: D67610633

fbshipit-source-id: 8f2873aa5bbb5024312c86e811350fed61285b83
Summary:
X-link: facebookresearch/FBGEMM#615

Pull Request resolved: pytorch#3533

Reviewed By: spcyppt

Differential Revision: D67725264

Pulled By: q10

fbshipit-source-id: 3f7206f47781f0d4916a808017743639f8c1e5af
q10 and others added 3 commits January 14, 2025 10:45
Summary:
X-link: facebookresearch/FBGEMM#654

- Update ROCm and CUDA versions in docs

Pull Request resolved: pytorch#3569

Reviewed By: spcyppt

Differential Revision: D68137294

Pulled By: q10

fbshipit-source-id: 880357b71a1376d87cbeedd8aee21d1e9371978b
Summary:
Pull Request resolved: pytorch#3568

X-link: facebookresearch/FBGEMM#653

- Move FP32 kernels to OSS.  This diff does not integrate the FP32 kernels into the OSS build just yet, as there appear to be some build issues; I will follow up on this in a future diff

Reviewed By: embg

Differential Revision: D68119470

fbshipit-source-id: 77372c0ca59a92ab6d2f1a4598b8c884b51080ca
…ytorch#3396)

Summary:
X-link: facebookresearch/FBGEMM#655

Pull Request resolved: pytorch#3396

It will be used by the FP8 quantization of Q, which is a 3-D tensor.

With it, we don't need a special reshape of the scale outside the kernel call.

Reviewed By: jiawenliu64

Differential Revision: D66196131

fbshipit-source-id: beb4e99eb0c8b7adcb444a60f4929629115c853f
@amirakb89 amirakb89 force-pushed the profile_with_kineto branch from 41bfd75 to 8fbb074 Compare January 15, 2025 00:05
q10 and others added 8 commits January 14, 2025 23:40
Summary:
X-link: facebookresearch/FBGEMM#621

X-link: facebookresearch/FBGEMM#466

Pull Request resolved: pytorch#3375

- Add `index_t` support to TBE training backward kernels

Reviewed By: basilwong

Differential Revision: D65933410

fbshipit-source-id: 4817a873cc1a56e872bf6384c7ae709f471efb32
Summary:
Pull Request resolved: pytorch#3572

X-link: facebookresearch/FBGEMM#658

In practice, there are cases where inputs will have a 0 dimension, indicating that the inputs are empty. Our custom functions should gracefully return an empty tensor for such inputs, as PyTorch does. This diff adds this behavior to all custom operators and updates the tests to make sure the functionality works as expected.
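
A sketch of the behavior described above, using a hypothetical op rather than the operators touched by this diff:
```python
import torch

def my_custom_op(x: torch.Tensor) -> torch.Tensor:
    # Empty inputs (any dimension of size 0) short-circuit to an empty result,
    # mirroring built-in PyTorch ops, instead of launching a kernel.
    if x.numel() == 0:
        return torch.empty_like(x)
    return x * 2  # stand-in for the real computation

assert my_custom_op(torch.empty(0, 8)).shape == (0, 8)
```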

Reviewed By: jasonjk-park, jiawenliu64

Differential Revision: D68170745

fbshipit-source-id: 0fc329eea7f0af3fe8dd591551dbbdace4eea1e5
Summary:
X-link: facebookresearch/FBGEMM#643

Pull Request resolved: pytorch#3558

A new optional optimizer state `row_counter` is added to Adam to perform bias correction per embedding row. `row_counter` serves as the iteration counter when a row (an index) occurs and is used to do bias correction.

Without rowwise bias correction (existing Adam),
```
m_hat_t = m_t / (1.0 - powf(beta1, iter));
v_hat_t = v_t / (1.0 - powf(beta2, iter));
```

With rowwise bias correction enabled.
```
// when index `idx` occurs
_row_counter = row_counter[idx] + 1;
m_hat_t = m_t / (1.0 - powf(beta1, _row_counter));
v_hat_t = v_t / (1.0 - powf(beta2, _row_counter));
```

This request is from IG to allow all the models to be scaled on sparse features with expected 1.5% NE on Stories.

-------

**__The functionality is not set by default.__** Frontend: D64848802

To enable the bias correction, `use_rowwise_bias_correction` needs to be set to True through extra_optimizer_config.
```
extra_optimizer_config = UserEnabledConfigDefinition(use_rowwise_bias_correction=True)
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (E, D, M, compute_device) for (E, D, M) in zip(Es, Ds, managed)
    ],
    optimizer=OptimType.Adam,
    extra_optimizer_config=extra_optimizer_config,
    ...
)
```
------
**__Performance__**
```
                   Baseline* |  default** | enabled***
forward  | cpu  |   2.293 s  |   2.188 s  |   2.043 s
         | cuda |  12.512 ms |  12.539 ms |  12.547 ms
backward | cpu  |  69.861 ms |  66.546 ms |  65.880 ms
         | cuda | 103.429 ms | 103.395 ms | 103.130 ms
```
\* Baseline: before changes
\** default: default setting; use_bias_correction = False
\*** enabled: use_bias_correction = True

Reviewed By: sryap

Differential Revision: D64848802

fbshipit-source-id: be0a3f29d59478a1cae9d03e1eba39852bc87b39
Summary:
X-link: facebookresearch/FBGEMM#622

Pull Request resolved: pytorch#3377

X-link: facebookresearch/FBGEMM#468

- Add `index_t` support to TBE training backward kernels

Reviewed By: basilwong

Differential Revision: D65960050

fbshipit-source-id: d647ef90b45d9e930310cdcba159159d8e773213
…torch#3574)

Summary:
X-link: facebookresearch/FBGEMM#662

As https://github.com/pytorch/FBGEMM/blob/4f620223837d68303097775db0afbcff8013603d/.github/scripts/fbgemm_gpu_postbuild.bash#L20 uses `patchelf`, it should be in requirements.txt

Pull Request resolved: pytorch#3574

Reviewed By: spcyppt

Differential Revision: D68224847

Pulled By: q10

fbshipit-source-id: 5d91bed8aafa1db533c01df03ce33b4f043cb884
Summary:
Pull Request resolved: pytorch#3575

X-link: facebookresearch/FBGEMM#660

Some consumers of KVTensorWrapper build CPU-only packages. This diff makes the following changes to avoid linking against CUDA libraries:
- put KVTensorWrapper in its own header file
- add a dummy cpu target for KVTensorWrapper

Reviewed By: q10, sryap

Differential Revision: D68060586

fbshipit-source-id: 3fb4ade32108d557d2e1d19b629449867f0f0e7b
Summary:
X-link: facebookresearch/FBGEMM#663

Since pytorch#3266 is merged, the v2 forward kernel should be tested for ROCm devices as well.

cc: liligwu

Pull Request resolved: pytorch#3573

Reviewed By: leitian, spcyppt

Differential Revision: D68237138

Pulled By: q10

fbshipit-source-id: 3f743303c13ae79976dda273ffb75deee57fd11f
Summary:
Pull Request resolved: pytorch#3577

X-link: facebookresearch/FBGEMM#659

This diff enables fast FP8 GEMM for the memory-bound regime by adding the TRT-LLM FP8 CUDA gemm to FBGEMM. In addition to the original kernel, this diff extends the kernel to:
- Support pytorch operations
- Support cuda graph with handling scale as tensor
- Support larger dim M
- Support benchmark/unittest

For decode attn linear shapes:
- When BS=1, TRT-LLM FP8 gemm brings 2x speedup compared to BF16, while FP8 cutlass gemm’s perf is similar to BF16
- When BS>4, TRT-LLM FP8 gemm does not bring perf gain

This TRT-LLM kernel is based on tensorwise quantization, not rowwise.
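
A hedged sketch of tensorwise (as opposed to rowwise) FP8 scaling; the function name is illustrative and this is only the quantization scheme the kernel expects, not the kernel itself. Keeping the scale as a tensor is what lets it be captured inside a CUDA graph instead of being baked in as a Python float.
```python
import torch

def quantize_fp8_tensorwise(x: torch.Tensor):
    # One scale for the whole tensor (tensorwise), kept as a tensor so it can
    # be updated inside a CUDA graph rather than recompiled in.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale
```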

Reviewed By: jwfromm

Differential Revision: D68193920

fbshipit-source-id: fbf34e283e9430a8fed63ddb91781ade321012e3
@amirakb89 amirakb89 force-pushed the profile_with_kineto branch from 8fbb074 to 9b1ec89 Compare January 16, 2025 17:56
Summary:
Pull Request resolved: pytorch#3576

X-link: facebookresearch/FBGEMM#661

This diff includes:

1. Allow specifying split-K in the common template
2. Add a few more instances
3. Update tuning for some decode shapes

Reviewed By: jwfromm

Differential Revision: D68233557

fbshipit-source-id: 168fb31a2bb281a2879babcdb50751330e01c798
@amirakb89 amirakb89 force-pushed the profile_with_kineto branch from 9b1ec89 to 5bb4730 Compare January 16, 2025 19:47
@amirakb89 amirakb89 force-pushed the profile_with_kineto branch from e2808f3 to 5308694 Compare January 16, 2025 21:33
@amirakb89 (Author)

PR merged in upstream (#3850).

@amirakb89 amirakb89 closed this Jan 20, 2025