Pr50749 #365

tstellar · 2021-06-23T21:52:26Z

This change implements call site prioritized BFS profile-guided inlining for the sample profile loader. The new inlining strategy maximizes the benefit of context-sensitive profiles, as mentioned in the follow-up discussion of the CSSPGO RFC. The change does not affect today's AutoFDO because it is opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`).

Motivation

With baseline AutoFDO, the inliner in the sample profile loader only replays previous inlining, and the profile is used only to prune previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple, with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO.

- It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With a context-sensitive profile, the FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat.
- The way it looks at hotness is not accurate. It uses total samples in an inlinee as a proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fallback because call site count and callee entry count are not reliable due to DWARF-based correlation, especially for inlinees. Now, paired with pseudo-probe, we have an accurate call site count and callee entry count, so we can use them to gauge hotness more accurately.
- It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as the hotness proxy, this transitivity within a block magnifies the inaccurate hotness heuristic. With pseudo-probe and the change above, this is no longer an issue for CSSPGO.

New FDO Inliner

Putting all the requirements for CSSPGO together, we need a top-down, call site prioritized BFS inliner. Here are the reasons why each component is needed.

- Top-down: We need a top-down inliner to better leverage the context-sensitive profile, so that inlining is driven by an accurate context profile and the post-inline profile is also accurate. This is already implemented in https://reviews.llvm.org/D70655.
- Size cap: For a top-down inliner, taking function size into account for the inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth, because with top-down inlining we can grow size significantly with a large number of smaller inlinees even if each one individually passes the cost/size check.
- Prioritize call sites: With a size cap, inlining order also becomes important, because if we stop inlining due to the size budget limit, we want to have spent the budget on the most beneficial call sites.
- BFS inline: As with call site prioritization, if we stop inlining due to the size budget limit, we want a balanced inline tree rather than going deep on one call path.

Note that the new inliner avoids repeatedly evaluating the same set of call sites, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of the same call site (TODO).

Speculative indirect call promotion and inlining is also supported now with CSSPGO, just like baseline AutoFDO.

Tunings and knobs

I created tuning knobs for size growth/cap control, and for a hot threshold separate from the CGSCC inliner. The default values are selected based on initial tuning with CSSPGO.

Results

Evaluated with an internal LLVM fork a couple of months ago, plus another change to adjust the hot-threshold cutoff for context profiles (will send up after this one), the new inliner shows a ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using a train-train setup, MonoLTO with the new pass manager, and pseudo-probe.

Note that this is just a starting point. We hope that the new inliner will open up more opportunities with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it).

Differential Revision: https://reviews.llvm.org/D94001 (cherry picked from commit 6bae597)
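The interaction between the priority queue, the hot threshold and the size cap can be illustrated with a toy model. The sketch below is not the code from D94001: `CallSite`, `Function`, `Program` and `inlineTopDown` are hypothetical names, sizes are abstract units, and the model ignores context sensitivity (a real context-sensitive profile would give each exposed call site a count specific to its inline context).

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy model; none of these types exist in LLVM.
struct CallSite {
  std::string Callee;
  uint64_t Count; // call site sample count, assumed accurate
};

struct Function {
  unsigned Size;                   // abstract size units
  std::vector<CallSite> CallSites; // direct call sites in the body
};

using Program = std::unordered_map<std::string, Function>;

// Orders the queue so the hottest call site is popped first.
struct ByCount {
  bool operator()(const CallSite &A, const CallSite &B) const {
    return A.Count < B.Count;
  }
};

// Returns the callees inlined into Root, in inlining order.
std::vector<std::string> inlineTopDown(const Program &P,
                                       const std::string &Root,
                                       uint64_t HotThreshold,
                                       unsigned SizeCap) {
  const Function &F = P.at(Root);
  unsigned Size = F.Size;
  std::priority_queue<CallSite, std::vector<CallSite>, ByCount> Queue;
  for (const CallSite &CS : F.CallSites)
    Queue.push(CS);

  std::vector<std::string> Inlined;
  while (!Queue.empty()) {
    CallSite CS = Queue.top();
    Queue.pop();
    if (CS.Count < HotThreshold)
      break; // everything still queued is colder than this one
    const Function &Callee = P.at(CS.Callee);
    assert(Callee.Size > 0 && "zero-size callees would defeat the size cap");
    if (Size + Callee.Size > SizeCap)
      continue; // over budget; a smaller candidate may still fit
    Size += Callee.Size;
    Inlined.push_back(CS.Callee);
    // Call sites exposed by inlining compete with the ones already queued.
    for (const CallSite &Inner : Callee.CallSites)
      Queue.push(Inner);
  }
  return Inlined;
}
```

Each call site is evaluated once when it is popped, which is the property the note about avoiding repeated evaluation relies on. Because the budget is spent in priority order, the coldest candidates are the ones dropped when the cap is reached.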

Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for the AFDO path. This is a resubmit of D95024, with the build break and an overly tight assertion fixed. Test Plan: (cherry picked from commit 1645f46)

Sample re-annotation is required at LTO time to achieve a reasonable post-inline profile quality. However, we have seen that such LTO-time re-annotation degrades profile quality. This is mainly caused by pre-LTO code duplication done by passes such as loop unrolling, jump threading, and indirect call promotion, where samples corresponding to a source location are aggregated multiple times because of the duplicates.

In this change we are introducing the concept of a distribution factor for pseudo probes, so that samples can be distributed across duplicated probes, scaled by a factor. We hope that optimizations that duplicate code maintain the block frequency information (BFI) well, since probe distribution factors are calculated from it. Distribution factors are updated at the end of the pre-LTO pipeline to reflect an estimated portion of the real execution count.

This change also introduces a pseudo probe verifier that can be run after each IR pass to detect duplicated pseudo probes.

A saturated distribution factor stands for 1.0. A pseudo probe carries a factor with a value ranging from 0.0 to 1.0. A 64-bit integral distribution factor field that represents [0.0, 1.0] is associated with each block probe. Unfortunately this cannot be done for callsite probes due to the size limitation of a 32-bit DWARF discriminator. A 7-bit distribution factor is used instead.

Changes are also needed in the sample profile inliner to deal with prorated callsite counts. Call sites duplicated by pre-LTO passes, when later inlined at LTO time, should have the callee's probes prorated based on the prelink-computed distribution factors. The distribution factors should also be taken into account when computing hotness for inline candidates. Also, indirect call promotion results in multiple callsites. The original samples should be distributed across them. This is fixed by adjusting the callsites' distribution factors.

Reviewed By: wmi Differential Revision: https://reviews.llvm.org/D93264 (cherry picked from commit 3d89b3c)
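To make the fixed-point idea concrete, here is a minimal sketch of encoding a fraction in [0.0, 1.0] as an integer whose saturated value means 1.0, and of prorating a sample count with it. The helper names (`encodeFactor`, `decodeFactor`, `prorate`) and the two saturated values are assumptions made for illustration; the commit message does not spell out the exact encoding, and this is not the code from D93264.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed saturated values: all 64 bits set, and all 7 bits set.
constexpr uint64_t FullFactor64 = UINT64_MAX;
constexpr uint64_t FullFactor7 = 127;

// Encode a fraction in [0.0, 1.0] as an integer in [0, Full].
uint64_t encodeFactor(double Fraction, uint64_t Full) {
  assert(Fraction >= 0.0 && Fraction <= 1.0 && "fraction out of range");
  if (Fraction >= 1.0)
    return Full; // saturated; also avoids an out-of-range double-to-int cast
  return static_cast<uint64_t>(
      std::floor(Fraction * static_cast<double>(Full) + 0.5));
}

// Decode an integer factor back to a fraction in [0.0, 1.0].
double decodeFactor(uint64_t Factor, uint64_t Full) {
  return static_cast<double>(Factor) / static_cast<double>(Full);
}

// Scale the samples attributed to a source location by one probe's factor,
// rounding to the nearest integer.
uint64_t prorate(uint64_t Samples, uint64_t Factor, uint64_t Full) {
  return static_cast<uint64_t>(
      static_cast<double>(Samples) * decodeFactor(Factor, Full) + 0.5);
}
```

As a usage example, suppose a block is duplicated into two copies that execute 75% and 25% of the time, and 1000 samples map to the original source location. Without factors each copy would be annotated with 1000 samples. With the assumed 7-bit encoding, `prorate(1000, encodeFactor(0.75, FullFactor7), FullFactor7)` yields 748 and the 25% copy gets 252, so the total is preserved while the coarse factor introduces a small rounding error per copy.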

The OpenMP device compiler, like other SPMD compilers, assumes that functions are convergent by default to avoid invalid transformations such as the one reported in https://bugs.llvm.org/show_bug.cgi?id=49021. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D95971 (cherry picked from commit 0f0ce3c)

…quires EVEX. This is consistent with the VEX version. It also fixes a sorting issue in the matching table that caused the EVEX version to be prioritized over VEX in intel syntax. Fixes issue [2] from PR48991. (cherry picked from commit c691fe1)

… link errors" This reverts commit f5602e0.

…ard:cf Add a release note workaround for PR47463. Bug: https://bugs.llvm.org/show_bug.cgi?id=47463 Differential Revision: https://reviews.llvm.org/D95435

Fix when time profiling is enabled. Related to: D94855 Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D95398 (cherry picked from commit bb40e67)

…rors A link error occurred when time profiling in libomp was enabled by default, because `libomp` is assumed to be a C library but `libLLVMSupport`, which profiling depends on, is a C++ library. Currently the issue blocks all OpenMP tests in Phabricator. This patch adds a new CMake option `OPENMP_ENABLE_LIBOMP_PROFILING` to enable or disable the feature. It is disabled by default. Note that once time profiling is enabled for `libomp`, it becomes a C++ library. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D95585 (cherry picked from commit c571b16)

As discussed in https://lists.llvm.org/pipermail/cfe-dev/2021-January/067524.html The flag has been removed on the main branch in D95876. Differential revision: https://reviews.llvm.org/D96016

…ines

This only affects the MemorySSA-based implementation.

MemorySSA currently treats lifetime.end intrinsics as not aliasing anything. This breaks MemorySSA-based MemCpyOpt, because we'll happily move a read of a pointer below a lifetime.end intrinsic, as no clobber is reported. I think the MemorySSA modelling here isn't correct: lifetime.end(p) has approximately the same effect as doing a memcpy(p, undef), and should be treated as a clobber. This patch removes the special handling of lifetime.end, leaving alias analysis to handle it appropriately. Differential Revision: https://reviews.llvm.org/D95763

…the kernel doesn't have any argument Currently, if there is no kernel argument, device synchronization is skipped. This can lead to two issues: 1. If there is any device error, it will not be captured; 2. The target region might end before the kernel is done, which is not spec conformant. The test added in this patch only runs on the NVPTX platform, although it will not be executed by Phab at all. It also requires `not`, which is not available on most systems. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D96067 (cherry picked from commit b68a6b0)

We do ship those headers, so the directory name should not be something that can potentially conflict with user-defined directories. This is a cherry-pick of b517568. Differential Revision: https://reviews.llvm.org/D96059

…zed memory Fixes usage of shared_ptr with CFI enabled, which is llvm.org/pr48993. (cherry pick of commit bab7486) Differential Revision: https://reviews.llvm.org/D96063

… defaults e.g. riscv32 Patch by Khem Raj. (cherry pick of commit 85b9c5c) Differential Revision: https://reviews.llvm.org/D96062

stella.stamenova mentioned in https://reviews.llvm.org/D93951 failures on Windows for this test. I'm fixing the macro definitions and disabling the tests for Python versions lower than 3.7. I'll figure out the actual issue with Python 3.6 after the buildbots are fine again. (cherry picked from commit ab5591e)

@stella.stamenova found out that lldb-vscode's Win32 macros were failing when building on Windows targeting POSIX platforms. I'm changing these macros to LLVM_ON_UNIX, which should be more accurate. (cherry picked from commit 0bca9a7)

@mstorsjo

@mstorsjo found a mistake that I made when trying to fix some Windows compilation errors encountered by @stella.stamenova. I was incorrectly using the LLVM_ON_UNIX macro. In any case, proper use of #if defined(_WIN32) should be the actual fix. Differential Revision: https://reviews.llvm.org/D96060 (cherry picked from commit 36496cc)

Differential Revision: https://reviews.llvm.org/D96092 (cherry picked from commit 96fb49c)

during the same evaluation. It looks like the only case for which this matters is determining whether mutable subobjects of a heap allocation can be modified during constant evaluation. (cherry picked from commit 21e8bb8)

…directory tree Currently the clang-tidy provider searches from the root directory up to the target directory, which is the opposite of how clang-tidy searches for config files. As a result, .clang-tidy files are ignored in any subdirectory of a directory that contains a .clang-tidy file. Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D96204 (cherry picked from commit ba3ea9c)
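The intended lookup order, nearest directory first, can be sketched with `std::filesystem`. This is an illustration of the search order only, not the clangd implementation; `configSearchOrder` is a hypothetical helper, and it only computes candidate paths without touching the disk.

```cpp
#include <cassert>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: list the .clang-tidy candidates for File, starting
// with the file's own directory and walking up to the filesystem root.
// A lookup would stop at the first candidate that exists.
std::vector<fs::path> configSearchOrder(const fs::path &File) {
  assert(File.has_parent_path() && "expected a path with a directory part");
  std::vector<fs::path> Order;
  fs::path Dir = File.parent_path();
  while (true) {
    Order.push_back(Dir / ".clang-tidy");
    fs::path Parent = Dir.parent_path();
    if (Parent == Dir)
      break; // the root (or an empty path) is its own parent
    Dir = Parent;
  }
  return Order;
}
```

For `/proj/src/a.cpp` on a POSIX system this gives `/proj/src/.clang-tidy`, then `/proj/.clang-tidy`, then `/.clang-tidy`, so a config file in a subdirectory is found before the one in its parent.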

…vertingCond If the G_BR + G_BRCOND in this combine use the same MBB, then the combine will loop forever. Don't allow that to happen. Differential Revision: https://reviews.llvm.org/D95895 (cherry picked from commit 02d4b36)

variable's destruction if it didn't do so during construction. The standard doesn't give any guidance as to what to do here, but this approach seems reasonable and conservative, and has been proposed to the standard committee. (cherry picked from commit c945dc4)

…rgets As of commit 284f2bf, the DAG Combiner gets rid of the masking of the input to this node if the mask only keeps the bottom 16 bits. This is because the underlying library function does not use the high order bits. However, on PowerPC's ELFv2 ABI, it is the caller that is responsible for clearing the bits from the register. Therefore, the library implementation of __gnu_h2f_ieee will return an incorrect result if the bits aren't cleared. This combine is desired for ARM (and possibly other targets) so this patch adds a query to Target Lowering to check if this zeroing needs to be kept. Fixes: https://bugs.llvm.org/show_bug.cgi?id=49092 Differential revision: https://reviews.llvm.org/D96283 (cherry picked from commit a5222aa)

Due to a clerical error, the sdiv operation was mapping to vdivu and udiv to vdiv, when the opposite mapping is the correct one. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D95869 (cherry picked from commit b4106f9)

As noted in https://reviews.llvm.org/D93459, multi-line descriptions of clEnumValN and the like are formatted poorly. This patch adds support for indenting them correctly. Reviewed By: serge-sans-paille Differential Revision: https://reviews.llvm.org/D93494 (cherry picked from commit e3f0230)

The ldrexd/strexd instructions are not supported on M-class chips; see for example https://developer.arm.com/documentation/dui0489/e/arm-and-thumb-instructions/memory-access-instructions/ldrex-and-strex which says:

> All these 32-bit Thumb instructions are available in ARMv6T2 and above, except that LDREXD and STREXD are not available in the ARMv7-M architecture.

Looking at the ARMv8-M architecture, it appears that these instructions aren't supported either. The Architecture Reference Manual lists ldrex/strex but not ldrexd/strexd: https://developer.arm.com/documentation/ddi0553/bn/

Godbolt example on LLVM 11.0.0, which incorrectly emits ldrexd/strexd instructions: https://llvm.godbolt.org/z/5qqPnE

Differential Revision: https://reviews.llvm.org/D95891 (cherry picked from commit aecdf15)

WenleiHe and others added 30 commits February 3, 2021 19:28

Revert "[OpenMP] Disabled profiling in libomp by default to unblock…

7fad20e

[clang][aarch64][WOA64][docs] Release note for longjmp crash with /gu…

bc2dad1

[OpenMP] Fix building using LLVM_ENABLE_RUNTIMES

66c7b44

Add a release note about deprecating the clang-cl /fallback flag

72f1246

workflows: Update libclang-abi-tests to work with minor release basel…

4e79339

[MemCpyOpt] Add test for incorrect optimization across lifetime (NFC)

81febec

[🍒][libc++] Rename include/support to include/__support

395ef8d

[🍒]Disable CFI in __get_elem to allow casting a pointer to uninitiali…

bc39d53

[🍒][libc++] Fix libcxx build on 32bit architectures with 64bit time_t…

251f329

[AST] Update LVal before evaluating lambda decl fields.

c9fb4a9

Fix "not all control paths return a value" warning. NFCI.

b46924e

tstellar force-pushed the main branch 7 times, most recently from 4fa0a5d to 760ac9d Compare April 2, 2024 06:27

tstellar force-pushed the main branch 2 times, most recently from e122dac to 745c6e6 Compare April 12, 2024 01:51

tstellar force-pushed the main branch 4 times, most recently from 758197f to f8063ff Compare May 10, 2024 23:29

tstellar force-pushed the main branch 2 times, most recently from cec969f to 78a6a05 Compare May 23, 2024 23:43

tstellar force-pushed the main branch 2 times, most recently from 720e09b to d093e40 Compare May 31, 2024 18:11

tstellar force-pushed the main branch 4 times, most recently from 2cd0b2a to a88f2fc Compare June 10, 2024 21:14

tstellar force-pushed the main branch 3 times, most recently from f897b77 to 6e1d1c5 Compare June 18, 2024 14:04

tstellar force-pushed the main branch from 56d39c2 to e6d0fa3 Compare July 1, 2024 16:31

tstellar force-pushed the main branch from d84271d to 0162df0 Compare July 10, 2024 15:29

tstellar force-pushed the main branch 3 times, most recently from bf47e71 to be7f182 Compare July 20, 2024 00:31

tstellar closed this Aug 31, 2024
