forked from pytorch/FBGEMM
Ifu 2023 12 06 #49
Merged
Conversation
…, int_nbit_split_embedding_codegen_lookup_function (pytorch#2102) Summary: Pull Request resolved: pytorch#2102 Main goal is to make torchrec inference modules dynamo traceable with dynamic shapes. Adding meta function for `int_nbit_split_embedding_codegen_lookup_function`, changing `int` to `SymInt` in schema. Adding meta function for `block_bucketize_sparse_features` (RW sharding) Reviewed By: ezyang Differential Revision: D50741802 Privacy Context Container: L1138451 fbshipit-source-id: d0b9788e57c4551bc6f0c3ce364723d470e70f16
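For context, a minimal sketch of what this kind of meta-function registration looks like with the torch.library API (PyTorch >= 2.2); the namespace, op name, and signature below are illustrative stand-ins, not FBGEMM's actual schema:
```python
import torch

# Hypothetical op in a scratch namespace; FBGEMM's real schema differs.
lib = torch.library.Library("mylib", "FRAGMENT")
lib.define("lookup(Tensor weights, Tensor offsets, SymInt total_D) -> Tensor")

@torch.library.impl_abstract("mylib::lookup")
def lookup_meta(weights, offsets, total_D):
    # Runs on FakeTensors during tracing: compute output metadata only.
    # Because total_D is a SymInt in the schema, the output shape stays
    # symbolic under dynamic shapes instead of being specialized to a
    # constant baked into the graph.
    B = offsets.numel() - 1
    return weights.new_empty([B, total_D])
```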
Summary: Auto-generated with ``` fbgs "}; // namespace" -l | sort | uniq | sed 's/fbsource.//' | xargs -n 50 sed -i 's_}; // namespace_} // namespace_' ``` Reviewed By: dmm-fb Differential Revision: D51029740 fbshipit-source-id: 177e3f6e6b0ab7e986b1147952cd5e2f59d4b1fc
Differential Revision: D51029740 Original commit changeset: 177e3f6e6b0a Original Phabricator Diff: D51029740 fbshipit-source-id: c71ff386342902f2cfa6552d6a834ea3f2475e32
…pu] Remove test_aot_dispatch_static* tests from opcheck tests" for test or build failures (pytorch#2115) Summary: Pull Request resolved: pytorch#2115 This diff is reverting D50941763 D50941763: [fbgemm_gpu] Remove test_aot_dispatch_static* tests from opcheck tests by zou3519 has been identified to be causing the following test or build failures: Tests affected: - [deeplearning/fbgemm/fbgemm_gpu:sparse_ops_test - test_aot_dispatch_static__test_permute_indices (deeplearning.fbgemm.fbgemm_gpu.test.sparse_ops_test.SparseOpsTest)](https://www.internalfb.com/intern/test/844425044733284/) Here's the Multisect link: https://www.internalfb.com/multisect/3468682 Here are the tasks that are relevant to this breakage: We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it. If you believe this diff has been generated in error you may Commandeer and Abandon it. Reviewed By: q10 Differential Revision: D51011831 fbshipit-source-id: e61dda526a7d915eb36bf6a6be8fb8583410a9a6
Summary: Pull Request resolved: pytorch#2116 We use `reorder_batched_ad_indices` to [rebatch id_score_list weights](https://www.internalfb.com/code/fbsource/[e3bbe1eaf65e]/fbcode/caffe2/caffe2/fb/predictor/rebatch/GPURebatchUtils.cpp?lines=305), which are quantized to BFloat16. However, BFloat16 is currently not supported in `reorder_batched_ad_indices`; see error trace: P868895010 This diff adds support for the BFloat16 dtype. Reviewed By: YazhiGao Differential Revision: D50817983 fbshipit-source-id: 4949acac8d1524dc10c7931e28bdfcabd2e94477
Summary: Pull Request resolved: pytorch#2117 - Migrate the merge pooled embedding operator definitions to the cpu source file to avoid undefined ops errors when building the CPU-only variant of OSS FBGEMM_GPU Reviewed By: sryap Differential Revision: D51040487 fbshipit-source-id: 20961b13e0ba10000e693267fb8324161cd62831
Summary: X-link: pytorch/pytorch#112851 We've made the following changes: - The new way to use the API is `m.impl_abstract_pystub(module, context)`. Every subsequent m.def of an op inside the TORCH_LIBRARY block gives the op the `impl_abstract_pystub`. - Added a mechanism to determine if an operator was defined in Python or C++. Library.define in Python appends the op to a global set, which is analogous to what we do for tracking Library.impl. - If someone does `torch.library.impl_abstract` in Python for an operator, then we require that it has an `impl_abstract_pystub` specified and we also check that the module in the `impl_abstract_pystub` is the same as the module where the call to `torch.library.impl_abstract` exists. - Unfortunately we can't check the "context" (which is the buck target on buck-based systems) because buck sits above us. bypass-github-export-checks Reviewed By: ezyang Differential Revision: D50972148 fbshipit-source-id: 34ab31493d9bccd0d351b463cfb508ba0b05eef4
Summary: Pull Request resolved: pytorch#2118 We missed a line in the original diff. After Ed's D50886564, test_aot_dispatch_static and test_aot_dispatch_dynamic are redundant, so we're just going to go with one of them to reduce complexity and overall test times. Reviewed By: williamwen42 Differential Revision: D51051732 fbshipit-source-id: 8be518dd562264f5933eb2cf01d58db425af4204
Differential Revision: D50972148 Original commit changeset: 34ab31493d9b Original Phabricator Diff: D50972148 fbshipit-source-id: 102516e7d0defe2049d4758ef54f32fd1a4a499b
Reviewed By: zou3519 Differential Revision: D50895307 fbshipit-source-id: a5c0709fd1e8d96c38f446e1c922bb2ff54909d0
Summary: X-link: pytorch/pytorch#113182 We've made the following changes: - The new way to use the API is `m.impl_abstract_pystub(module, context)`. Every subsequent m.def of an op inside the TORCH_LIBRARY block gives the op the `impl_abstract_pystub`. - Added a mechanism to determine if an operator was defined in Python or C++. Library.define in Python appends the op to a global set, which is analogous to what we do for tracking Library.impl. - If someone does `torch.library.impl_abstract` in Python for an operator, then we require that it has an `impl_abstract_pystub` specified and we also check that the module in the `impl_abstract_pystub` is the same as the module where the call to `torch.library.impl_abstract` exists. - Unfortunately we can't check the "context" (which is the buck target on buck-based systems) because buck sits above us. bypass-github-export-checks Reviewed By: ezyang Differential Revision: D51080493 fbshipit-source-id: 703e349b8071799a6218b2e03369d1b02c150505
Reviewed By: zou3519 Differential Revision: D50893098 fbshipit-source-id: 0ef9d0fa47ad94bb0be62191f966e829fac0d17d
Reviewed By: zou3519 Differential Revision: D50895693 fbshipit-source-id: e6b774369446661486667252ee28aad2a3318cc2
Summary: The meta fn for merge_pooled_embeddings should carry device information (FakeTensor). Before this change, dynamo export fails with ``` File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/5408a275d581d7c2/scripts/ads_pt2_inference/__pt2_cli__/pt2_cli#link-tree/torch/_subclasses/fake_tensor.py", line 1264, in merge_devices raise RuntimeError( torch._dynamo.exc.TorchRuntimeError: Failed running call_function fbgemm.permute_pooled_embs_auto_grad(*(FakeTensor(..., device='meta', size=(10, 5124)), FakeTensor(..., device='cuda:0', size=(40,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(39,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(40,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(39,), dtype=torch.int64)), **{}): Unhandled FakeTensor Device Propagation for fbgemm.permute_pooled_embs.default, found two different devices meta, cuda:0 ``` Reviewed By: ezyang Differential Revision: D51121648 Privacy Context Container: L1138451 fbshipit-source-id: cb9c8d320072a4e0e266f4752602d5635cbf43bc
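The general fix pattern, sketched with a hypothetical signature: derive the meta output from an input tensor so it inherits the input's (fake) device rather than defaulting to 'meta':
```python
import torch

# A meta/abstract impl must produce its output on the input's device so
# FakeTensor device propagation sees a concrete device. Allocating with
# torch.empty(..., device="meta") instead is what produces the
# "found two different devices meta, cuda:0" error in the trace above.
def merge_pooled_meta(pooled, offsets):  # hypothetical signature
    return pooled.new_empty(pooled.shape)  # same dtype *and* device as input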
Summary: Pull Request resolved: pytorch#2112 Move all source files related to `embedding_inplace_update` into its own directory Reviewed By: spcyppt Differential Revision: D50993688 fbshipit-source-id: 50bb307c1d7164667005b525d0b318adf9e3c695
Summary: Pull Request resolved: pytorch#2125 Original commit changeset: 50bb307c1d71 Original Phabricator Diff: D50993688 Reviewed By: drdarshan Differential Revision: D51165476 fbshipit-source-id: 6382e32b2683bf65571442d1d16b5a799f661c68
Summary: Pull Request resolved: pytorch#2127 Move all source files related to `embedding_inplace_update` into its own directory Reviewed By: spcyppt Differential Revision: D51170862 fbshipit-source-id: 726ae5df567e99e861cc47180ce11ffd87a71519
Summary: X-link: pytorch/pytorch#113201 Pull Request resolved: pytorch#2119 Logs show these ops are being used with PT2, so we are grandfathering in these ops to the pt2_compliant tag. Most of these ops are tested, some aren't. bypass-github-export-checks Reviewed By: ezyang Differential Revision: D51076460 fbshipit-source-id: b08efb10fef0a0437a6c09cf0ac7f374f3b308ab
…ch#2129) Summary: Pull Request resolved: pytorch#2129 Title Reviewed By: zou3519 Differential Revision: D51211981 fbshipit-source-id: 14d4201670b9998288bc2bcf9b476730cd66c6c1
Summary: Pull Request resolved: pytorch#2132 Add FakeTensor support for segment_sum_csr by adding an impl_abstract, following: https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU/edit#heading=h.ptttacy8y1u9 Reviewed By: bdhirsh Differential Revision: D51296192 fbshipit-source-id: 8918ddc45e1ba570c8148c3ac172a4d96240e010
Summary: Pull Request resolved: pytorch#2137 D51296192 breaks OSS CI e.g. https://github.com/pytorch/FBGEMM/actions/runs/6873908944/job/18695145904?pr=2107 We have to support PyTorch 2.1, which doesn't have this attribute defined. Fortunately, there is a workaround available, where we can replace torch.library.impl_abstract with the locally-defined impl_abstract Reviewed By: q10 Differential Revision: D51362600 fbshipit-source-id: fa1424e495f19cf494e29595439a062d63139dee
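A minimal sketch of such a compatibility shim (the locally-defined fallback in the actual diff may do more than a no-op; this is just the shape of the workaround):
```python
import torch

try:
    # torch.library.impl_abstract is available in PyTorch >= 2.2.
    impl_abstract = torch.library.impl_abstract
except AttributeError:
    # PyTorch 2.1 fallback: a no-op decorator factory, so modules that
    # register abstract impls still import and run eagerly.
    def impl_abstract(qualname):
        def decorator(fn):
            return fn
        return decorator
```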
…orch#2107) Summary: Pull Request resolved: pytorch#2107 This diff adds support for variable bucket sizes in block_bucketize_sparse_features for RW sharding. E.g., given bucket_sizes_pos of [[0,5,15], [0,10,13]]: for batch 0, indices in [0,5) are assigned to bucket 0 and indices in [5,15) to bucket 1; for batch 1, indices in [0,10) go to bucket 0 and indices in [10,13) to bucket 1. The new index is the original index minus the lower boundary of its bucket (bucket_sizes_pos[new_bucket_id]); i.e., for batch 0, index 12 is assigned to bucket 1 and the new index is 12 - 5 = 7. Reviewed By: jiayisuse Differential Revision: D50868649 fbshipit-source-id: b1901bce5c539acc4d6a8e11eb74d1ad4ec86195
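The bucket assignment is ordinary interval arithmetic; a small sketch reproducing the worked example (batch 0, index 12):
```python
import torch

# Boundaries for batch 0 from the example: buckets [0, 5) and [5, 15).
bucket_pos = torch.tensor([0, 5, 15])
index = torch.tensor(12)

# right=True assigns a boundary value like 5 to the bucket it opens, [5, 15).
bucket_id = torch.searchsorted(bucket_pos, index, right=True) - 1
new_index = index - bucket_pos[bucket_id]
print(bucket_id.item(), new_index.item())  # 1 7  (new index = 12 - 5)
```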
Summary: Pull Request resolved: pytorch#2136 Add a test for D50868649. The backend changes have to land first: D50868649 puts the CUDA/C++ changes in one diff, and the Python changes go in this diff. Otherwise we get a compatibility check failure: https://www.internalfb.com/servicelab/experiment/2501798166/trial/2502750974 Reviewed By: jingsh Differential Revision: D51314019 fbshipit-source-id: 4bd9ebd09b0f2a4201f6fc87cfc0c6fdcb1e2392
Summary: Pull Request resolved: pytorch#2133 - Migrate layout_transform_ops into its own directory Reviewed By: spcyppt Differential Revision: D51290598 fbshipit-source-id: 99ff3f170b29b999a761f327f909053b3bf9f2a6
Summary: Pull Request resolved: pytorch#2139 Add generated opcheck tests to `TestFP8RowwiseQuantizationConversion` in quantize_ops_test.py in order to fix `fbgemm::FloatToFP8RowwiseQuantized` and `fbgemm::FP8RowwiseQuantizedToFloat` Reviewed By: zou3519 Differential Revision: D51375136 fbshipit-source-id: 2683253b4b6f5e18e2792a27e19cad3fb1719833
…orch#2077) Summary: Pull Request resolved: pytorch#2077 Add an auto-vectorized implementation of the int4 CPU-TBE kernel to accelerate the ARM platform. Also adds the relevant unit tests and benchmarks. This is based on my auto-vectorization attempts in D50142928 and bangshengtang's further optimizations, including D50212552, D50213049, D50243725 and D50303751 Reviewed By: bangshengtang Differential Revision: D50289383 fbshipit-source-id: 11320ef18883dabd0739852ac8b083fb1bc2883b
Summary: Pull Request resolved: pytorch#2131 This diff addresses illegal-memory-access issues with the FP8 quantization kernel in large-batch training on APS, S373293. Specifically, it fails in `_FP8rowwise_to_float_gpu_t` when dequantizing a large tensor. The issue is index overflow when the tensor is large. We fix this by using `PackedTensorAccessor64` and `int64_t`. This also fixes the test result mismatch (e.g., P869606114). Reviewed By: sryap Differential Revision: D51187587 fbshipit-source-id: 612c9ef5d8096547e8e2362824a7ce72d3d44ca6
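The overflow is easy to see from the index arithmetic alone (the shape below is a hypothetical stand-in for a large dequantization buffer):
```python
# Why 32-bit indexing overflows on large tensors: the linear offset
# row * ncols + col exceeds INT32_MAX once a tensor holds > 2**31 elements.
nrows, ncols = 2_000_000, 1_200  # hypothetical large FP8 buffer
linear = (nrows - 1) * ncols + (ncols - 1)
print(linear)              # 2399999999
print(linear > 2**31 - 1)  # True: a 32-bit accessor wraps here; use int64
```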
Summary: X-link: pytorch/torchrec#1525 Pull Request resolved: pytorch#2142 As titled Reviewed By: dstaay-fb Differential Revision: D51053207 fbshipit-source-id: 8aee7d967ceec9ea0739f5d2c56d6e541a0d3648
Summary: Pull Request resolved: pytorch#2141 - Migrate ssd_split_embeddings_cache code into its own directory Reviewed By: sryap Differential Revision: D51411817 fbshipit-source-id: 90935128331430b82f20aea673243c173ebda6fb
Summary: Pull Request resolved: pytorch#2140 Benchmark block_bucketize_sparse_features Reviewed By: jiayisuse Differential Revision: D51288847 fbshipit-source-id: dbb8dc705f32bc90fbdb316bdba8923c89d4f606
pytorch#2138) Summary: X-link: pytorch/executorch#1226 Pull Request resolved: pytorch#2138 Landing non-PyTorch portions first; then the PyTorch portions of pytorch/pytorch#101995 will land to Github. Reviewed By: malfet Differential Revision: D51355841 fbshipit-source-id: 4eed885733189f21e342613431b637de72979cb4
Summary: Pull Request resolved: pytorch#2165 - Migrate embedding_bounds_check out of codegen and into split_embeddings_utils Reviewed By: sryap Differential Revision: D51607319 fbshipit-source-id: 55051c08b8041d8d3bba65caa0b991e6be779e9f
…orch#2163) Summary: Pull Request resolved: pytorch#2163 We observe an extremely long startup time before the first `cudaLaunchKernel` after switching to CUDA 12. This degrades the measured performance of TBE when `warmup_runs=0`. This diff modifies `benchmark_requests`, which is used extensively in the TBE benchmarks, to run at least one warmup iteration even when `warmup_runs=0`, so that the first kernel's time is excluded from the profiling result. Moreover, we add `--warmup-runs` to every TBE benchmark to allow users to increase the number of warmup iterations. Reviewed By: q10 Differential Revision: D51603915 fbshipit-source-id: 0f8ea9c089d51735a6c58417b1d8872732327f34
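A stripped-down sketch of the pattern (not FBGEMM's actual `benchmark_requests`): always run at least one warmup iteration so one-time startup cost stays out of the timed region.
```python
import time
import torch

def benchmark(fn, iters: int = 10, warmup_runs: int = 0) -> float:
    # max(..., 1) guarantees one warmup even when the caller asks for 0,
    # keeping the slow first cudaLaunchKernel out of the measurement.
    for _ in range(max(warmup_runs, 1)):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```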
…TBE tests Summary: X-link: pytorch/pytorch#114641 1. We stop using excess memory in generate_opcheck_tests. This is safe because all the individual test utils already ensure that they do not modify the inputs. 2. We re-enable the fbgemm TBE tests (see internal diff, but all of this is open source). They were previously removed because they OOM'ed when run serially; (1) and (3) cut down the memory usage to ~20gb peak. 3. I needed to skip some newly failing generated tests and also some that had an impact on the memory usage. bypass-github-export-checks Reviewed By: sryap, williamwen42 Differential Revision: D51601964 fbshipit-source-id: e3fa9be771ea87b5fcd59b9ea3935506642ad75a
Differential Revision: D51607319 Original commit changeset: 55051c08b804 Original Phabricator Diff: D51607319 fbshipit-source-id: 1453b9ae4278c64b9055c74b4fd301cedd2c475b
…#2168) Summary: Pull Request resolved: pytorch#2168 We reverted a diff due to an incompatibility between the frontend and backend packages during code freeze. Here we reland only the backend part, which will not be picked up by the production package. The frontend part will land after the code freeze. Reviewed By: IvanKobzarev Differential Revision: D51496816 fbshipit-source-id: beac54e7d629e8919d1b280161dba491ed8a3431
Summary: Pull Request resolved: pytorch#2174 It needed an abstract impl. Reviewed By: williamwen42 Differential Revision: D51647394 fbshipit-source-id: 127168a49026f1f1261ab4fe883c05d80450b5e3
Summary: Pull Request resolved: pytorch#2173 See title. Reviewed By: williamwen42 Differential Revision: D51647393 fbshipit-source-id: 2c18a70faea73545f1aa9eb4e190f114ec2fde91
Summary: Pull Request resolved: pytorch#2172 - Added abstract impl in sparse_ops.py. I don't think it's worth splitting up the abstract impls into multiple .py files right now, unless someone comes to us with memory issues. Reviewed By: williamwen42 Differential Revision: D51647392 fbshipit-source-id: 33743bb8f4b7755b5b34a83324822628b4b30e56
Summary: Pull Request resolved: pytorch#2171 Also deleted two skips that were marked as flaky. Those don't appear to actually be flaky. Reviewed By: williamwen42 Differential Revision: D51647391 fbshipit-source-id: 4c70558e2e6d4cb7d2c8566e40f5da06050681d4
Summary: Pull Request resolved: pytorch#2159 Reviewed By: kinto0 Differential Revision: D51595408 fbshipit-source-id: 49bf60a11b37d6daf522f05bcbda5dca01c6a873
Summary: Pull Request resolved: pytorch#2176 The PT2 compliance tests call the op additional times and compare outputs. To make `FloatToFP8RowwiseQuantized` PT2 compliant, this diff initializes the otherwise-uninitialized output values of the FP8 quantize op so that the results are deterministic. Reviewed By: q10 Differential Revision: D51419189 fbshipit-source-id: bf36b51510810ff61912a4fd2d306065958b8cd8
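The determinism trick is just zero-initializing the output buffer; a minimal illustration of the idea (the shapes are arbitrary):
```python
import torch

# torch.empty leaves its buffer uninitialized, so padding or unused bytes in
# the quantized output can differ from call to call and break the bitwise
# output comparison done by the compliance tests. Zero-filling fixes that.
out_nondeterministic = torch.empty(4, 16, dtype=torch.uint8)  # garbage bytes
out_deterministic = torch.zeros(4, 16, dtype=torch.uint8)     # reproducible
```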
…pytorch#2175) Summary: Pull Request resolved: pytorch#2175 Reviewed By: azad-meta Differential Revision: D51673332 fbshipit-source-id: ffad51a6bb2c8746b6595bd6828c5bb8d2bf9e87
Summary: - Update AVX2 and AVX512 flags to account for nvcc as the front-end compiler Pull Request resolved: pytorch#2167 Reviewed By: spcyppt Differential Revision: D51681615 Pulled By: q10 fbshipit-source-id: 231aa051f121ff7a5f6aac56f335442bbd312a49
… more ops as pt2_compliant_tag" for test failure (pytorch#2179) Summary: Pull Request resolved: pytorch#2179 This diff is reverting D51647391 D51647391: Mark some more ops as pt2_compliant_tag by zou3519 has been identified to be causing the following test failure: Tests affected: - [deeplearning/fbgemm/fbgemm_gpu:sparse_ops_test - test_schema__test_permute_indices (deeplearning.fbgemm.fbgemm_gpu.test.sparse_ops_test.SparseOpsTest)](https://www.internalfb.com/intern/test/562950068029445/) Here's the Multisect link: https://www.internalfb.com/multisect/3631441 Here are the tasks that are relevant to this breakage: We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it. If you believe this diff has been generated in error you may Commandeer and Abandon it. Reviewed By: zou3519 Differential Revision: D51687883 fbshipit-source-id: 22d0bd6edd4206e1a6acb9b5f173846783b4479d
Summary: Pull Request resolved: pytorch#2170 - The problem with the original op was that in the Autograd implementation, it needed to call Tensor.item(). This doesn't work with FakeTensors (maybe it can some day in the future). - We create two new ops, `jagged_index_select_2d_forward_v2` and `jagged_index_add_2d_forward_v2` (which is effectively the backward) that do the Tensor.item() calls, and change fbgemm::jagged_index_select's Autograd implementation to call those. - We add abstract impls for those two new ops. - Finally, we move the fbgemm::jagged_index_select implementation to CompositeImplicitAutograd (and delete the CPU/CUDA impls, because those are redundant). Reviewed By: williamwen42, aakhundov Differential Revision: D51670069 fbshipit-source-id: b7ae86dcb02a993ec3bad94a839b707faf4f9098
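For the abstract impls of the new ops, the output row count is data-dependent, which is exactly what Tensor.item() cannot answer on a FakeTensor. A hedged sketch of how torch.library expresses that with an unbacked symbolic size; the namespace, op name, and schema here are hypothetical, not FBGEMM's real ones:
```python
import torch

lib = torch.library.Library("mylib2", "FRAGMENT")  # hypothetical namespace
lib.define(
    "jagged_select_fwd(Tensor values, Tensor lengths, Tensor indices) -> Tensor"
)

@torch.library.impl_abstract("mylib2::jagged_select_fwd")
def jagged_select_fwd_meta(values, lengths, indices):
    # The output row count depends on tensor *data* (the selected lengths),
    # which a FakeTensor cannot read. new_dynamic_size() creates an unbacked
    # symbolic size to stand in for it during tracing.
    ctx = torch.library.get_ctx()
    out_rows = ctx.new_dynamic_size()
    return values.new_empty([out_rows, values.size(1)])
```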
…iant (pytorch#2180) Summary: Pull Request resolved: pytorch#2180 - We fix fbgemm::permute_1d_sparse_data's implementation to never return views. The schema for the operator promises that it is functional. This was causing flaky tests. - fbgemm::permute_2d_sparse_data already has that fix applied, so we just remove the skip Reviewed By: williamwen42 Differential Revision: D51711257 fbshipit-source-id: 388aa89d7b7be3036ec3a620eb1c239e2e02d0dc
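Why a view-returning "functional" op causes flaky tests, in miniature:
```python
import torch

x = torch.arange(4)
out_view = x[:2]     # aliases x: later writes to x mutate "the output"
x[0] = 99
print(out_view[0])   # tensor(99) -- the result changed after the op returned
out_fresh = x[:2].clone()  # a functional op must return fresh storage instead
```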
Summary: Pull Request resolved: pytorch#2181 These passed all generated opcheck tests. Reviewed By: williamwen42 Differential Revision: D51711876 fbshipit-source-id: 26be06c5183c8de2036e455cf21df6a149bd6da8
…pytorch#2184) Summary: Pull Request resolved: pytorch#2184 Reviewed By: azad-meta Differential Revision: D51710423 fbshipit-source-id: a282a011278cff953d19bf0f969d982240e554c4
Summary: Title Reviewed By: zou3519 Differential Revision: D51692611 fbshipit-source-id: 5b9353f8730352ce3a8d7f89124668a8450c8456
Summary: Pull Request resolved: pytorch#2178 - Migrate embedding_bounds_check out of codegen and into split_embeddings_utils Reviewed By: sryap Differential Revision: D51688407 fbshipit-source-id: e021212f7c7123e770998cd6115f4a99461b1562
Differential Revision: D51688407 Original commit changeset: e021212f7c71 Original Phabricator Diff: D51688407 fbshipit-source-id: fa2c01bbd0f48609561dc6a353f9088bb91f0177
Summary: Pull Request resolved: pytorch#2186 Since `linearize_cache_indices` did not support the case where `indices` and `offsets` have different types, we cast both `indices` and `offsets` to the same type, `int64_t`. In some cases this caused memory usage to surge, increasing the peak memory requirement. This diff modifies the `linearize_cache_indices` op to support `indices` and `offsets` of different types. Reviewed By: ehsanardestani Differential Revision: D51723551 fbshipit-source-id: f40c9cc6e8e4435a8cc01702edd85cf843835c07
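Back-of-envelope arithmetic for the old behavior, with a hypothetical index count:
```python
# Upcasting int32 indices to int64 doubles their footprint, which is the
# peak-memory surge described above.
num_indices = 1_000_000_000  # hypothetical count
print(num_indices * 4 / 2**30, "GiB as int32")  # ~3.7 GiB
print(num_indices * 8 / 2**30, "GiB as int64")  # ~7.5 GiB
```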
…rse_features (pytorch#2150) Summary: Pull Request resolved: pytorch#2150 CPU version implementation D50868649 Reviewed By: sryap Differential Revision: D51414310 fbshipit-source-id: 6b447911c902feff5039bd6fc2a69d029621b1cf
…ch.int64 for block_bucketize_pos (pytorch#2188) Summary: Pull Request resolved: pytorch#2188 When block_bucketize_pos uses torch.int64, the TorchRec uneven-sharding test fails. This fix resolves that issue. Reviewed By: sryap Differential Revision: D51713675 fbshipit-source-id: 240e836587d89ed118a9d24e9bf4b267031333b8
…torch#2169) Summary: Pull Request resolved: pytorch#2169 CPU implementation D51288847 Reviewed By: tissue3 Differential Revision: D51533599 fbshipit-source-id: cf9c7fcbe7043f385e97901a95916287c9f618a5
Tests passed locally.