Ifu 2023 12 06 #49

Merged: 74 commits merged into main on Dec 7, 2023
Conversation

@liligwu (Collaborator) commented Dec 6, 2023

No description provided.

Ivan Kobzarev and others added 30 commits November 3, 2023 16:30
…, int_nbit_split_embedding_codegen_lookup_function (pytorch#2102)

Summary:
Pull Request resolved: pytorch#2102

The main goal is to make TorchRec inference modules Dynamo-traceable with dynamic shapes.
This adds a meta function for `int_nbit_split_embedding_codegen_lookup_function` and changes `int` to `SymInt` in its schema, and adds a meta function for `block_bucketize_sparse_features` (RW sharding).
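As a rough illustration (hypothetical op name and heavily simplified signature in a made-up `mylib` namespace, not the real FBGEMM registration, and assuming a PyTorch version where `torch.library.impl_abstract` exists), such a meta/abstract function might look like:

```
import torch

# Hypothetical simplified op for illustration only; the real lookup op has a much
# larger signature. SymInt in the schema lets total_D stay symbolic under dynamo.
lib = torch.library.Library("mylib", "DEF")
lib.define(
    "int_nbit_lookup(Tensor weights, Tensor indices, Tensor offsets, SymInt total_D) -> Tensor"
)

@torch.library.impl_abstract("mylib::int_nbit_lookup")
def int_nbit_lookup_meta(weights, indices, offsets, total_D):
    B = offsets.numel() - 1  # batch size derived from the offsets tensor
    # Only output metadata is produced here; no real kernel runs during tracing.
    return weights.new_empty([B, total_D], dtype=torch.float32)
```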

Reviewed By: ezyang

Differential Revision: D50741802

Privacy Context Container: L1138451

fbshipit-source-id: d0b9788e57c4551bc6f0c3ce364723d470e70f16
Summary:
Auto-generated with
```
fbgs "}; // namespace" -l | sort | uniq | sed 's/fbsource.//' | xargs -n 50 sed -i 's_}; // namespace_} // namespace_'
```

Reviewed By: dmm-fb

Differential Revision: D51029740

fbshipit-source-id: 177e3f6e6b0ab7e986b1147952cd5e2f59d4b1fc
Differential Revision: D51029740

Original commit changeset: 177e3f6e6b0a

Original Phabricator Diff: D51029740

fbshipit-source-id: c71ff386342902f2cfa6552d6a834ea3f2475e32
…pu] Remove test_aot_dispatch_static* tests from opcheck tests" for test or build failures (pytorch#2115)

Summary:
Pull Request resolved: pytorch#2115

This diff is reverting D50941763
D50941763: [fbgemm_gpu] Remove test_aot_dispatch_static* tests from opcheck tests by zou3519 has been identified to be causing the following test or build failures:

Tests affected:
- [deeplearning/fbgemm/fbgemm_gpu:sparse_ops_test - test_aot_dispatch_static__test_permute_indices (deeplearning.fbgemm.fbgemm_gpu.test.sparse_ops_test.SparseOpsTest)](https://www.internalfb.com/intern/test/844425044733284/)

Here's the Multisect link:
https://www.internalfb.com/multisect/3468682
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff. Please note that the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Reviewed By: q10

Differential Revision: D51011831

fbshipit-source-id: e61dda526a7d915eb36bf6a6be8fb8583410a9a6
Summary:
Pull Request resolved: pytorch#2116

We use `reorder_batched_ad_indices` to [rebatch id_score_list weights](https://www.internalfb.com/code/fbsource/[e3bbe1eaf65e]/fbcode/caffe2/caffe2/fb/predictor/rebatch/GPURebatchUtils.cpp?lines=305), which are quantized to BFloat16. However, BFloat16 is currently not supported in `reorder_batched_ad_indices`; see error trace P868895010.

This diff adds support for the BFloat16 dtype.

Reviewed By: YazhiGao

Differential Revision: D50817983

fbshipit-source-id: 4949acac8d1524dc10c7931e28bdfcabd2e94477
Summary:
Pull Request resolved: pytorch#2117

- Migrate the merge pooled embedding operator definitions to the cpu source file to avoid undefined ops errors when building the CPU-only variant of OSS FBGEMM_GPU

Reviewed By: sryap

Differential Revision: D51040487

fbshipit-source-id: 20961b13e0ba10000e693267fb8324161cd62831
Summary:
X-link: pytorch/pytorch#112851

We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
  Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
  the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
  Library.define in Python appends the op to a global set, which is analogous
  to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
  we require that it has an `impl_abstract_pystub` specified and we also check
  that the module in the `impl_abstract_pystub` is the same as the module where
  the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
  buck-based systems) because buck sits above us.
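A minimal, hypothetical sketch of the Python side of this pairing (op, module, and names below are illustrative, and it assumes the op was defined in a C++ TORCH_LIBRARY block that declared the matching `impl_abstract_pystub`):

```
# Contents of (hypothetically) mylib/abstract_impls.py. The C++ definition of
# "mylib::my_op" would have declared m.impl_abstract_pystub("mylib.abstract_impls", ...),
# so the module named there matches the module this registration lives in.
import torch

@torch.library.impl_abstract("mylib::my_op")
def my_op_abstract(x):
    # Abstract/meta impl: only describe output metadata, never touch real data.
    return x.new_empty(x.shape)
```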

bypass-github-export-checks

Reviewed By: ezyang

Differential Revision: D50972148

fbshipit-source-id: 34ab31493d9bccd0d351b463cfb508ba0b05eef4
Summary:
Pull Request resolved: pytorch#2118

We missed a line in the original diff.

After Ed's D50886564, test_aot_dispatch_static and
test_aot_dispatch_dynamic are redundant, so we're just going to go with one
of them to reduce complexity and overall test times.

Reviewed By: williamwen42

Differential Revision: D51051732

fbshipit-source-id: 8be518dd562264f5933eb2cf01d58db425af4204
Differential Revision: D50972148

Original commit changeset: 34ab31493d9b

Original Phabricator Diff: D50972148

fbshipit-source-id: 102516e7d0defe2049d4758ef54f32fd1a4a499b
Reviewed By: zou3519

Differential Revision: D50895307

fbshipit-source-id: a5c0709fd1e8d96c38f446e1c922bb2ff54909d0
Summary:
X-link: pytorch/pytorch#113182

We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
  Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
  the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
  Library.define in Python appends the op to a global set, which is analogous
  to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
  we require that it has an `impl_abstract_pystub` specified and we also check
  that the module in the `impl_abstract_pystub` is the same as the module where
  the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
  buck-based systems) because buck sits above us.

bypass-github-export-checks

Reviewed By: ezyang

Differential Revision: D51080493

fbshipit-source-id: 703e349b8071799a6218b2e03369d1b02c150505
Reviewed By: zou3519

Differential Revision: D50893098

fbshipit-source-id: 0ef9d0fa47ad94bb0be62191f966e829fac0d17d
Reviewed By: zou3519

Differential Revision: D50895693

fbshipit-source-id: e6b774369446661486667252ee28aad2a3318cc2
Summary:
Meta fn should contain device information (FakeTensor) for merge_pooled_embeddings.

Before this change, dynamo export fails with:
```
 File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/5408a275d581d7c2/scripts/ads_pt2_inference/__pt2_cli__/pt2_cli#link-tree/torch/_subclasses/fake_tensor.py", line 1264, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function fbgemm.permute_pooled_embs_auto_grad(*(FakeTensor(..., device='meta', size=(10, 5124)), FakeTensor(..., device='cuda:0', size=(40,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(39,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(40,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(39,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for fbgemm.permute_pooled_embs.default, found two different devices meta, cuda:0
```
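A hedged sketch of the kind of fix being described, with a hypothetical simplified op in a made-up namespace (the real `merge_pooled_embeddings` signature differs): the abstract impl derives its output from an input tensor, so the FakeTensor keeps the input's device instead of `meta`.

```
import torch

lib = torch.library.Library("sketchlib", "DEF")  # illustrative namespace only
lib.define("merge_pooled(Tensor[] pooled, SymInt uncat_dim_size) -> Tensor")

@torch.library.impl_abstract("sketchlib::merge_pooled")
def merge_pooled_abstract(pooled, uncat_dim_size):
    total_dim = sum(t.size(1) for t in pooled)
    # new_empty on an input FakeTensor inherits that tensor's (fake) device, so
    # downstream ops see a consistent device instead of a meta/cuda:0 mix.
    return pooled[0].new_empty([uncat_dim_size, total_dim])
```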

Reviewed By: ezyang

Differential Revision: D51121648

Privacy Context Container: L1138451

fbshipit-source-id: cb9c8d320072a4e0e266f4752602d5635cbf43bc
Summary:
Pull Request resolved: pytorch#2112

Move all source files related to `embedding_inplace_update` into its own directory

Reviewed By: spcyppt

Differential Revision: D50993688

fbshipit-source-id: 50bb307c1d7164667005b525d0b318adf9e3c695
Summary:
Pull Request resolved: pytorch#2125

Original commit changeset: 50bb307c1d71

Original Phabricator Diff: D50993688

Reviewed By: drdarshan

Differential Revision: D51165476

fbshipit-source-id: 6382e32b2683bf65571442d1d16b5a799f661c68
Summary:
Pull Request resolved: pytorch#2127

Move all source files related to `embedding_inplace_update` into its own directory

Reviewed By: spcyppt

Differential Revision: D51170862

fbshipit-source-id: 726ae5df567e99e861cc47180ce11ffd87a71519
Summary:
X-link: pytorch/pytorch#113201

Pull Request resolved: pytorch#2119

Logs show these ops are being used with PT2, so we are grandfathering in these
ops to the pt2_compliant tag. Most of these ops are tested; some aren't.

bypass-github-export-checks

Reviewed By: ezyang

Differential Revision: D51076460

fbshipit-source-id: b08efb10fef0a0437a6c09cf0ac7f374f3b308ab
…ch#2129)

Summary:
Pull Request resolved: pytorch#2129

Title

Reviewed By: zou3519

Differential Revision: D51211981

fbshipit-source-id: 14d4201670b9998288bc2bcf9b476730cd66c6c1
Summary:
Pull Request resolved: pytorch#2132

Add FakeTensor support for segment_sum_csr by adding impl_abstract, following https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU/edit#heading=h.ptttacy8y1u9

Reviewed By: bdhirsh

Differential Revision: D51296192

fbshipit-source-id: 8918ddc45e1ba570c8148c3ac172a4d96240e010
Summary:
Pull Request resolved: pytorch#2137

D51296192 breaks the OSS CI, e.g. https://github.com/pytorch/FBGEMM/actions/runs/6873908944/job/18695145904?pr=2107

We have to support PyTorch 2.1, which doesn't have this attribute defined. Fortunately, a workaround is available: we can replace `torch.library.impl_abstract` with a locally defined `impl_abstract`.
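A hedged sketch of this sort of compatibility shim (the exact fallback FBGEMM uses may differ; this only illustrates guarding on the missing attribute):

```
import torch

# If torch.library.impl_abstract exists (PyTorch >= 2.2), use it; otherwise fall
# back to a local pass-through so imports still succeed on PyTorch 2.1.
if hasattr(torch.library, "impl_abstract"):
    impl_abstract = torch.library.impl_abstract
else:
    def impl_abstract(qualname, func=None, *, lib=None):
        # No abstract-impl registry on older PyTorch; behave as a no-op decorator.
        if func is None:
            return lambda f: f
        return func
```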

Reviewed By: q10

Differential Revision: D51362600

fbshipit-source-id: fa1424e495f19cf494e29595439a062d63139dee
…orch#2107)

Summary:
Pull Request resolved: pytorch#2107

This diff adds support for variable bucket sizes in block_bucketize_sparse_features for RW sharding.
E.g., given bucket_sizes_pos of [[0,5,15], [0,10,13]]:
For batch 0, indices in [0,5) are assigned to bucket 0 and indices in [5,15) to bucket 1.
For batch 1, indices in [0,10) are assigned to bucket 0 and indices in [10,13) to bucket 1.
The new index is the original index minus the lower bound of its assigned bucket in bucket_sizes_pos;
e.g., for batch 0, index 12 is assigned to bucket 1 and its new index is 12 - 5 = 7.
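The rule from this example can be written as a small illustrative sketch (plain Python, not the actual CUDA/C++ kernel):

```
import bisect

def bucketize(index, boundaries):
    # boundaries[b] is the inclusive lower bound of bucket b; the last entry is
    # the exclusive upper bound of the final bucket, e.g. [0, 5, 15] for batch 0.
    bucket = bisect.bisect_right(boundaries, index) - 1
    new_index = index - boundaries[bucket]   # subtract the bucket's lower bound
    return bucket, new_index

assert bucketize(12, [0, 5, 15]) == (1, 7)   # batch 0: index 12 -> bucket 1, new index 7
assert bucketize(3, [0, 10, 13]) == (0, 3)   # batch 1: index 3 -> bucket 0, new index 3
```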

Reviewed By: jiayisuse

Differential Revision: D50868649

fbshipit-source-id: b1901bce5c539acc4d6a8e11eb74d1ad4ec86195
Summary:
Pull Request resolved: pytorch#2136

Add test for D50868649.
We have to land the backend changes first: D50868649 puts the CUDA/C++ changes in one diff, and this diff contains the Python changes.
Otherwise we would hit a compatibility check failure: https://www.internalfb.com/servicelab/experiment/2501798166/trial/2502750974

Reviewed By: jingsh

Differential Revision: D51314019

fbshipit-source-id: 4bd9ebd09b0f2a4201f6fc87cfc0c6fdcb1e2392
Summary:
Pull Request resolved: pytorch#2133

- Migrate layout_transform_ops into its own directory

Reviewed By: spcyppt

Differential Revision: D51290598

fbshipit-source-id: 99ff3f170b29b999a761f327f909053b3bf9f2a6
Summary:
Pull Request resolved: pytorch#2139

Add generated opcheck tests to `TestFP8RowwiseQuantizationConversion` in quantize_ops_test.py in order to fix `fbgemm::FloatToFP8RowwiseQuantized` and `fbgemm::FP8RowwiseQuantizedToFloat`

Reviewed By: zou3519

Differential Revision: D51375136

fbshipit-source-id: 2683253b4b6f5e18e2792a27e19cad3fb1719833
…orch#2077)

Summary:
Pull Request resolved: pytorch#2077

Add an auto-vectorized implementation of the int4 CPU TBE kernel to accelerate the ARM platform. Also adds relevant unit tests and benchmarks.

This is based on my auto-vectorization attempts in D50142928 and bangshengtang's further optimizations, including D50212552, D50213049, D50243725, and D50303751.

Reviewed By: bangshengtang

Differential Revision: D50289383

fbshipit-source-id: 11320ef18883dabd0739852ac8b083fb1bc2883b
Summary:
Pull Request resolved: pytorch#2131

This diff addresses illegal memory access issues with the FP8 quantization kernel in large-batch training on APS, S373293. Specifically, it fails in `_FP8rowwise_to_float_gpu_t` when dequantizing a large tensor. The issue is due to index overflow when the tensor size is large.

We fix this by using `PackedTensorAccessor64` and `int64_t`. This also fixes the test result mismatch (e.g., P869606114).

Reviewed By: sryap

Differential Revision: D51187587

fbshipit-source-id: 612c9ef5d8096547e8e2362824a7ce72d3d44ca6
Summary:
X-link: pytorch/torchrec#1525

Pull Request resolved: pytorch#2142

As titled

Reviewed By: dstaay-fb

Differential Revision: D51053207

fbshipit-source-id: 8aee7d967ceec9ea0739f5d2c56d6e541a0d3648
Summary:
Pull Request resolved: pytorch#2141

- Migrate ssd_split_embeddings_cache code into its own directory

Reviewed By: sryap

Differential Revision: D51411817

fbshipit-source-id: 90935128331430b82f20aea673243c173ebda6fb
Summary:
Pull Request resolved: pytorch#2140

Benchmark block_bucketize_sparse_features

Reviewed By: jiayisuse

Differential Revision: D51288847

fbshipit-source-id: dbb8dc705f32bc90fbdb316bdba8923c89d4f606
swolchok and others added 27 commits November 28, 2023 12:10
pytorch#2138)

Summary:
X-link: pytorch/executorch#1226

Pull Request resolved: pytorch#2138

Landing the non-PyTorch portions first; the PyTorch portions of pytorch/pytorch#101995 will then land on GitHub.

Reviewed By: malfet

Differential Revision: D51355841

fbshipit-source-id: 4eed885733189f21e342613431b637de72979cb4
Summary:
Pull Request resolved: pytorch#2165

- Migrate embedding_bounds_check out of codegen and into split_embeddings_utils

Reviewed By: sryap

Differential Revision: D51607319

fbshipit-source-id: 55051c08b8041d8d3bba65caa0b991e6be779e9f
…orch#2163)

Summary:
Pull Request resolved: pytorch#2163

We observe an extremely long start up time before the first
`cudaLaunchKernel` after switching to CUDA 12.  This causes the
performance of TBE to degrade if `warmup_runs=0`.  This diff modifies
`benchmark_requests` which is used extensively in TBE benchmarks to
run at least one warm up iteration even when `warmup_runs=0` to
exclude the first kernel time from the profiling result. Moreover, we
add `--warmup-runs` to every TBE benchmark to allow users to increase
the number of warmup iterations.
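A hedged sketch of the benchmarking pattern described (illustrative names; the real `benchmark_requests` in FBGEMM handles CUDA events, streams, and more):

```
import time

def benchmark(fn, requests, warmup_runs: int = 0):
    # Always run at least one warm-up call so the unusually slow first
    # cudaLaunchKernel after start-up is excluded from the measured time.
    for req in requests[: max(warmup_runs, 1)]:
        fn(req)
    start = time.perf_counter()
    for req in requests:
        fn(req)
    # A real GPU benchmark would also synchronize the device around the timed region.
    return (time.perf_counter() - start) / len(requests)
```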

Reviewed By: q10

Differential Revision: D51603915

fbshipit-source-id: 0f8ea9c089d51735a6c58417b1d8872732327f34
…TBE tests

Summary:
X-link: pytorch/pytorch#114641

1. We stop using excess memory in generate_opcheck_tests. This is safe because
   all the individual test utils already ensure that they do not modify the
   inputs.
2. We re-enable the fbgemm TBE tests (see internal diff, but all of this is open
   source). They were previously removed because they OOM'ed when run serially;
   (1) and (3) cut down the memory usage to ~20gb peak.
3. I needed to skip some newly failing generated tests and also some that had an
   impact on the memory usage.

bypass-github-export-checks

Reviewed By: sryap, williamwen42

Differential Revision: D51601964

fbshipit-source-id: e3fa9be771ea87b5fcd59b9ea3935506642ad75a
Differential Revision: D51607319

Original commit changeset: 55051c08b804

Original Phabricator Diff: D51607319

fbshipit-source-id: 1453b9ae4278c64b9055c74b4fd301cedd2c475b
…#2168)

Summary:
Pull Request resolved: pytorch#2168

We reverted a diff due to an incompatibility between the frontend and backend packages during the code freeze. Here we reland only the backend part, which will not be picked up by the production package. The frontend part will land after the code freeze.

Reviewed By: IvanKobzarev

Differential Revision: D51496816

fbshipit-source-id: beac54e7d629e8919d1b280161dba491ed8a3431
Summary:
Pull Request resolved: pytorch#2174

It needed an abstract impl.

Reviewed By: williamwen42

Differential Revision: D51647394

fbshipit-source-id: 127168a49026f1f1261ab4fe883c05d80450b5e3
Summary:
Pull Request resolved: pytorch#2173

See title.

Reviewed By: williamwen42

Differential Revision: D51647393

fbshipit-source-id: 2c18a70faea73545f1aa9eb4e190f114ec2fde91
Summary:
Pull Request resolved: pytorch#2172

- Added abstract impl in sparse_ops.py. I don't think it's worth splitting up
  the abstract impls into multiple .py files right now, unless someone comes to
  us with memory issues.

Reviewed By: williamwen42

Differential Revision: D51647392

fbshipit-source-id: 33743bb8f4b7755b5b34a83324822628b4b30e56
Summary:
Pull Request resolved: pytorch#2171

Also deleted two skips that were marked as flaky. Those don't appear to
actually be flaky.

Reviewed By: williamwen42

Differential Revision: D51647391

fbshipit-source-id: 4c70558e2e6d4cb7d2c8566e40f5da06050681d4
Summary: Pull Request resolved: pytorch#2159

Reviewed By: kinto0

Differential Revision: D51595408

fbshipit-source-id: 49bf60a11b37d6daf522f05bcbda5dca01c6a873
Summary:
Pull Request resolved: pytorch#2176

The PT2 compliance tests call the op additional times and compare outputs. To make `FloatToFP8RowwiseQuantized` PT2 compliant, this diff initializes the empty values for FP8 quantize so the output results are deterministic.

Reviewed By: q10

Differential Revision: D51419189

fbshipit-source-id: bf36b51510810ff61912a4fd2d306065958b8cd8
…pytorch#2175)

Summary: Pull Request resolved: pytorch#2175

Reviewed By: azad-meta

Differential Revision: D51673332

fbshipit-source-id: ffad51a6bb2c8746b6595bd6828c5bb8d2bf9e87
Summary:
- Update AVX2 and AVX512 flags to account for nvcc as the front-end compiler

Pull Request resolved: pytorch#2167

Reviewed By: spcyppt

Differential Revision: D51681615

Pulled By: q10

fbshipit-source-id: 231aa051f121ff7a5f6aac56f335442bbd312a49
… more ops as pt2_compliant_tag" for test failure (pytorch#2179)

Summary:
Pull Request resolved: pytorch#2179

This diff is reverting D51647391
D51647391: Mark some more ops as pt2_compliant_tag by zou3519 has been identified to be causing the following test failure:

Tests affected:
- [deeplearning/fbgemm/fbgemm_gpu:sparse_ops_test - test_schema__test_permute_indices (deeplearning.fbgemm.fbgemm_gpu.test.sparse_ops_test.SparseOpsTest)](https://www.internalfb.com/intern/test/562950068029445/)

Here's the Multisect link:
https://www.internalfb.com/multisect/3631441
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff. Please note that the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Reviewed By: zou3519

Differential Revision: D51687883

fbshipit-source-id: 22d0bd6edd4206e1a6acb9b5f173846783b4479d
Summary:
Pull Request resolved: pytorch#2170

- The problem with the original op was that in the Autograd implementation, it
  needed to call Tensor.item(). This doesn't work with FakeTensors (maybe it
  can some day in the future).
- We create two new ops, `jagged_index_select_2d_forward_v2` and
  `jagged_index_add_2d_forward_v2` (which is effectively the backward) that do
  the Tensor.item() calls, and change fbgemm::jagged_index_select's Autograd
  implementation to call those.
- We add abstract impls for those two new ops.
- Finally, we move the fbgemm::jagged_index_select implementation to
  CompositeImplicitAutograd (and delete the CPU/CUDA impls, because those are
  redundant).
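A hedged sketch of the abstract-impl side of this pattern, using a hypothetical namespace and simplified signature (and assuming a PyTorch version with `torch.library.get_ctx`): the data-dependent output length is reported as an unbacked dynamic size instead of being read via `Tensor.item()`.

```
import torch

lib = torch.library.Library("sketchlib", "DEF")  # illustrative namespace only
lib.define(
    "jagged_index_select_2d_forward_v2(Tensor values, Tensor lengths, Tensor indices) -> Tensor"
)

@torch.library.impl_abstract("sketchlib::jagged_index_select_2d_forward_v2")
def jagged_index_select_2d_forward_v2_abstract(values, lengths, indices):
    # The number of output rows depends on the data in `lengths`, so the abstract
    # impl asks the tracing context for an unbacked dynamic size rather than
    # calling .item(), which FakeTensors cannot do.
    out_rows = torch.library.get_ctx().new_dynamic_size()
    return values.new_empty([out_rows, values.size(1)])
```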

Reviewed By: williamwen42, aakhundov

Differential Revision: D51670069

fbshipit-source-id: b7ae86dcb02a993ec3bad94a839b707faf4f9098
…iant (pytorch#2180)

Summary:
Pull Request resolved: pytorch#2180

- We fix fbgemm::permute_1d_sparse_data's implementation to never return
  views. The schema for the operator promises that it is functional. This was
  causing flaky tests.
- fbgemm::permute_2d_sparse_data already has that fix applied, so we just
  remove the skip

Reviewed By: williamwen42

Differential Revision: D51711257

fbshipit-source-id: 388aa89d7b7be3036ec3a620eb1c239e2e02d0dc
Summary:
Pull Request resolved: pytorch#2181

These passed all generated opcheck tests.

Reviewed By: williamwen42

Differential Revision: D51711876

fbshipit-source-id: 26be06c5183c8de2036e455cf21df6a149bd6da8
…pytorch#2184)

Summary: Pull Request resolved: pytorch#2184

Reviewed By: azad-meta

Differential Revision: D51710423

fbshipit-source-id: a282a011278cff953d19bf0f969d982240e554c4
Summary: Title

Reviewed By: zou3519

Differential Revision: D51692611

fbshipit-source-id: 5b9353f8730352ce3a8d7f89124668a8450c8456
Summary:
Pull Request resolved: pytorch#2178

- Migrate embedding_bounds_check out of codegen and into split_embeddings_utils

Reviewed By: sryap

Differential Revision: D51688407

fbshipit-source-id: e021212f7c7123e770998cd6115f4a99461b1562
Differential Revision: D51688407

Original commit changeset: e021212f7c71

Original Phabricator Diff: D51688407

fbshipit-source-id: fa2c01bbd0f48609561dc6a353f9088bb91f0177
Summary:
Pull Request resolved: pytorch#2186

Since `linearize_cache_indices` did not support the case where `indices` and `offsets` have different types, we cast both `indices` and `offsets` to the same type, namely `int64_t`. In some cases, this caused memory usage to surge, increasing the peak memory requirement. This diff modifies the `linearize_cache_indices` op to support `indices` and `offsets` with different types.

Reviewed By: ehsanardestani

Differential Revision: D51723551

fbshipit-source-id: f40c9cc6e8e4435a8cc01702edd85cf843835c07
…rse_features (pytorch#2150)

Summary:
Pull Request resolved: pytorch#2150

CPU version implementation D50868649

Reviewed By: sryap

Differential Revision: D51414310

fbshipit-source-id: 6b447911c902feff5039bd6fc2a69d029621b1cf
…ch.int64 for block_bucketize_pos (pytorch#2188)

Summary:
Pull Request resolved: pytorch#2188

When block_bucketize_pos uses torch.int64, it fails the TorchRec uneven sharding test. This fix resolves that issue.

Reviewed By: sryap

Differential Revision: D51713675

fbshipit-source-id: 240e836587d89ed118a9d24e9bf4b267031333b8
…torch#2169)

Summary:
Pull Request resolved: pytorch#2169

CPU implementation D51288847

Reviewed By: tissue3

Differential Revision: D51533599

fbshipit-source-id: cf9c7fcbe7043f385e97901a95916287c9f618a5
@liligwu liligwu self-assigned this Dec 6, 2023
@liligwu (Collaborator, Author) commented Dec 7, 2023

Tests passed locally
test.log

liligwu merged commit 61a7e50 into main on Dec 7, 2023
38 of 69 checks passed