Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert failure(PID 13812 [0x000035f4], Thread: 14128 [0x3730]): promoted_bytes (heap_number) == promoted #100035

Closed
AndyAyersMS opened this issue Mar 20, 2024 · 16 comments
Assignees
Labels
area-GC-coreclr Known Build Error Use this to report build issues in the .NET Helix tab

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 20, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=610011
Build error leg or test failing: GC\API\NoGCRegion\Callback_Svr\Callback_Svr.cmd
Pull request: #99992

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "promoted_bytes (heap_number) == promoted",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

FYI @dotnet/gc

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=610011
Error message validated: [promoted_bytes (heap_number) == promoted]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 3/20/2024 7:47:36 PM UTC

Report

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 0
@AndyAyersMS AndyAyersMS added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab labels Mar 20, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Mar 20, 2024
@AndyAyersMS AndyAyersMS added area-GC-coreclr and removed untriaged New issue has not been triaged by the area owner area-Infrastructure-libraries labels Mar 20, 2024
@mangod9
Copy link
Member

mangod9 commented Mar 20, 2024

@markples since we were chatting about incorrect promoted bytes reporting today. Perhaps this is related?

@markples
Copy link
Contributor

The other one occurs when a normal GC interferes with the bookkeeping of a BGC and leads to a BGC assertion. This isn't doing that, but perhaps the reverse could occur.

@mangod9
Copy link
Member

mangod9 commented Mar 21, 2024

Adding @Maoni0 as well in case this is related to any recent change.

@mangod9
Copy link
Member

mangod9 commented Mar 22, 2024

@dotnet/runtime-infrastructure there are many CI job failures for this, but none has a dump attached. It did collect a stack trace locally, so not sure why the dump has not been uploaded?

@BruceForstall
Copy link
Member

@mangod9 @markples Can you disable this test today so pipeline runs are clean, and in particular, so all our outerloop pipelines don't have broad failures this weekend?

@markples
Copy link
Contributor

markples commented Mar 23, 2024

#100167

This appears to be a compiler bug on x64 windows. dprintfs are nops in this build. The relevant asm is extracted (way) below.

C++ (manually preprocessed a bit):

inline
size_t gc_heap::get_promoted_bytes()
{
    if (!survived_per_region)
    {
        dprintf (...);
        return 0;
    }

    dprintf (...);
    size_t promoted = 0;
    for (size_t i = 0; i < region_count; i++)
    {
        if (survived_per_region[i] > 0)
        {
            heap_segment* region = get_region_at_index (i);
            dprintf (...)

            promoted += survived_per_region[i];
        }
    }

    dprintf (...);
    assert (promoted_bytes (heap_number) == promoted);

    return promoted;

Full asm:

00007FFB1FBF51BF  mov         r8,qword ptr [SVR::gc_heap::region_count (07FFB202188A8h)]  
00007FFB1FBF51C6  mov         edi,ecx  
00007FFB1FBF51C8  mov         r12d,dword ptr [__isa_available (07FFB201FD958h)]  
00007FFB1FBF51CF  nop  
        {
            promoted_bytes_global += g_heaps[i]->get_promoted_bytes();
00007FFB1FBF51D0  mov         rax,qword ptr [SVR::gc_heap::g_heaps (07FFB20219198h)]  
00007FFB1FBF51D7  mov         rbx,rcx  
00007FFB1FBF51DA  mov         r10,qword ptr [rdi+rax]  
00007FFB1FBF51DE  mov         r9,qword ptr [r10+540h]  
00007FFB1FBF51E5  test        r9,r9  
00007FFB1FBF51E8  je          SVR::gc_heap::mark_phase+168Ch (07FFB1FBF52CCh)  
00007FFB1FBF51EE  mov         rdx,rcx  
00007FFB1FBF51F1  test        r8,r8  
00007FFB1FBF51F4  je          SVR::gc_heap::mark_phase+1628h (07FFB1FBF5268h)  
00007FFB1FBF51F6  cmp         r8,4  
00007FFB1FBF51FA  jb          SVR::gc_heap::mark_phase+1628h (07FFB1FBF5268h)  
00007FFB1FBF51FC  cmp         r12d,6  
00007FFB1FBF5200  jl          SVR::gc_heap::mark_phase+1628h (07FFB1FBF5268h)  
00007FFB1FBF5202  mov         rax,r8  
00007FFB1FBF5205  xorps       xmm3,xmm3  
00007FFB1FBF5208  and         rax,0FFFFFFFFFFFFFFFCh  
00007FFB1FBF520C  xorps       xmm4,xmm4  
00007FFB1FBF520F  xorps       xmm2,xmm2  
00007FFB1FBF5212  movdqu      xmm0,xmmword ptr [r9+rdx*8]  
00007FFB1FBF5218  vpminuq     xmm1,xmm0,xmm4  
00007FFB1FBF521E  pcmpeqq     xmm1,xmm0  
00007FFB1FBF5223  andnps      xmm1,xmm6  
00007FFB1FBF5226  andnps      xmm1,xmm0  
00007FFB1FBF5229  movdqu      xmm0,xmmword ptr [r9+rdx*8+10h]  
00007FFB1FBF5230  add         rdx,4  
00007FFB1FBF5234  paddq       xmm3,xmm1  
00007FFB1FBF5238  vpminuq     xmm1,xmm0,xmm4  
00007FFB1FBF523E  pcmpeqq     xmm1,xmm0  
00007FFB1FBF5243  andnps      xmm1,xmm6  
00007FFB1FBF5246  andnps      xmm1,xmm0  
00007FFB1FBF5249  paddq       xmm2,xmm1  
00007FFB1FBF524D  cmp         rdx,rax  
00007FFB1FBF5250  jb          SVR::gc_heap::mark_phase+15D2h (07FFB1FBF5212h)  
00007FFB1FBF5252  paddq       xmm2,xmm3  
00007FFB1FBF5256  movdqa      xmm0,xmm2  
00007FFB1FBF525A  psrldq      xmm0,8  
00007FFB1FBF525F  paddq       xmm2,xmm0  
00007FFB1FBF5263  movq        rbx,xmm2  
00007FFB1FBF5268  cmp         rdx,r8  
00007FFB1FBF526B  jae         SVR::gc_heap::mark_phase+164Ah (07FFB1FBF528Ah)  
00007FFB1FBF526D  nop         dword ptr [rax]  
00007FFB1FBF5270  mov         rax,qword ptr [r9+rdx*8]  
00007FFB1FBF5274  test        rax,rax  
00007FFB1FBF5277  lea         rcx,[rbx+rax]  
00007FFB1FBF527B  cmove       rcx,rbx  
00007FFB1FBF527F  inc         rdx  
00007FFB1FBF5282  mov         rbx,rcx  
00007FFB1FBF5285  cmp         rdx,r8  
00007FFB1FBF5288  jb          SVR::gc_heap::mark_phase+1630h (07FFB1FBF5270h)  
00007FFB1FBF528A  mov         eax,dword ptr [r10+1598h]  
00007FFB1FBF5291  shl         eax,4  
00007FFB1FBF5294  movsxd      rcx,eax  
00007FFB1FBF5297  mov         rax,qword ptr [SVR::gc_heap::g_promoted (07FFB20218808h)]  
00007FFB1FBF529E  cmp         qword ptr [rax+rcx*8],rbx  
00007FFB1FBF52A2  je          SVR::gc_heap::mark_phase+168Ah (07FFB1FBF52CAh)  
00007FFB1FBF52A4  lea         r8,[string "promoted_bytes (heap_number) ==@"... (07FFB2008B450h)]  
00007FFB1FBF52AB  mov         edx,5EAAh  
00007FFB1FBF52B0  lea         rcx,[string "D:\\a\\_work\\1\\s\\src\\coreclr\\gc\\g@"... (07FFB20088190h)]  
00007FFB1FBF52B7  call        DbgAssertDialog (07FFB1FCEE980h)  

Relevant asm:

00007FFB1FBF5205  xorps       xmm3,xmm3  
00007FFB1FBF5208  and         rax,0FFFFFFFFFFFFFFFCh  
00007FFB1FBF520C  xorps       xmm4,xmm4  
00007FFB1FBF520F  xorps       xmm2,xmm2  
// loop
00007FFB1FBF5212  movdqu      xmm0,xmmword ptr [r9+rdx*8]  
00007FFB1FBF5218  vpminuq     xmm1,xmm0,xmm4  
00007FFB1FBF521E  pcmpeqq     xmm1,xmm0  
00007FFB1FBF5223  andnps      xmm1,xmm6  
00007FFB1FBF5226  andnps      xmm1,xmm0  
00007FFB1FBF5229  movdqu      xmm0,xmmword ptr [r9+rdx*8+10h]  
00007FFB1FBF5230  add         rdx,4  
00007FFB1FBF5234  paddq       xmm3,xmm1  
00007FFB1FBF5238  vpminuq     xmm1,xmm0,xmm4  
00007FFB1FBF523E  pcmpeqq     xmm1,xmm0  
00007FFB1FBF5243  andnps      xmm1,xmm6  
00007FFB1FBF5246  andnps      xmm1,xmm0  
00007FFB1FBF5249  paddq       xmm2,xmm1  
00007FFB1FBF524D  cmp         rdx,rax  
00007FFB1FBF5250  jb          SVR::gc_heap::mark_phase+15D2h (07FFB1FBF5212h)  

If I'm reading it correctly, the minimum is always going to be zero due to xmm4 being zero. The compare will generate all 1s if the array value is zero, all 0s otherwise. I haven't found the definition of xmm6 (but from other examples is likely to be all 1s) or what the double and-not is doing. Perhaps replacing both of those with andnps xmm1 xmm0 is the intention? (though also note that all of this is to prevent the addition of zero into the sum)

[edit: details in last paragraph]

@Maoni0
Copy link
Member

Maoni0 commented Mar 23, 2024

adding a bit of detail why we suspected this was a compiler issue to begin with - we could see the 2 values were indeed the same if the calculation in the loop was done correctly. I looked at multiple repros and if I added up what's in survived_per_region for that heap for all regions (which would be the value of promoted, it was always the same as promoted_bytes for that heap. an example from the dump here -

0:006> ?? coreclr!SVR::gc_heap::g_heaps[1]->survived_per_region
unsigned int64 * 0x000001fe`43dac1f0
0:006> dq 01fe`43dac1f0 01fe`43dac1f0+2a*8 <-- coreclr!SVR::gc_heap::region_count is always 2a
000001fe`43dac1f0  00000000`00000000 00000000`00000000
000001fe`43dac200  00000000`00000000 00000000`00000000
000001fe`43dac210  00000000`00000000 00000000`00000000
000001fe`43dac220  00000000`00000000 00000000`00000000
000001fe`43dac230  00000000`00000000 00000000`00000000
000001fe`43dac240  00000000`00000000 00000000`00000000
000001fe`43dac250  00000000`00000000 00000000`00000000
000001fe`43dac260  00000000`00000000 00000000`00000000
000001fe`43dac270  00000000`00000000 00000000`00000000
000001fe`43dac280  00000000`00000000 00000000`00000000
000001fe`43dac290  00000000`00000000 00000000`00000450 <-- in this particular case this is the only non-zero value, in some cases there are multiple regions with non-zero values.
000001fe`43dac2a0  00000000`00000000 00000000`00000000
000001fe`43dac2b0  00000000`00000000 00000000`00000000
000001fe`43dac2c0  00000000`00000000 00000000`00000000
000001fe`43dac2d0  00000000`00000000 00000000`00000000
000001fe`43dac2e0  00000000`00000000 00000000`00000000
000001fe`43dac2f0  00000000`00000000 00000000`00000000
000001fe`43dac300  00000000`00000000 00000000`00000000
000001fe`43dac310  00000000`00000000 00000000`00000000
000001fe`43dac320  00000000`00000000 00000000`00000000
000001fe`43dac330  00000000`00000000 00000000`00000000
000001fe`43dac340  00000000`00000000
 
0:006> x coreclr!*g_promoted
00007ffe`b652efb8 coreclr!WKS::gc_heap::g_promoted = 0
00007ffe`b6528808 coreclr!SVR::gc_heap::g_promoted = 0x000001bd`aca8e860
0:006> dq 01bd`aca8e860+10*8 l1
000001bd`aca8e8e0  00000000`00000450

but in the debugger promoted is 0. when it does this compare

cmp qword ptr [rax+rcx*8],rbx

rbx which is promoted is 0, so the assert was triggered.

@v-wenyuxu
Copy link

Failed in: runtime-coreclr jitstress 20240321.2

Failed tests:

coreclr windows x64 Checked disabler2r @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress2 @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress1_tiered @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress2_tiered @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitminopts @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress1 @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked tailcallstress @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd

Error message:

 
Assert failure(PID 17192 [0x00004328], Thread: 17220 [0x4344]): promoted_bytes (heap_number) == promoted

CORECLR! SVR::gc_heap::mark_phase + 0x167C (0x00007ffe`2f264ebc)
CORECLR! SVR::gc_heap::gc1 + 0x240 (0x00007ffe`2f256c40)
CORECLR! SVR::gc_heap::garbage_collect + 0x99B (0x00007ffe`2f2569ab)
CORECLR! SVR::gc_heap::gc_thread_function + 0x504 (0x00007ffe`2f258354)
CORECLR! SVR::gc_heap::gc_thread_stub + 0x7E (0x00007ffe`2f2587fe)
CORECLR! <lambda_db5353710d144cc7c2d19ec9a78e4018>::<lambda_invoker_cdecl> + 0x74 (0x00007ffe`2ef1fe04)
KERNEL32! BaseThreadInitThunk + 0x14 (0x00007ffe`600e84d4)
NTDLL! RtlUserThreadStart + 0x21 (0x00007ffe`601e1791)
    File: D:\a\_work\1\s\src\coreclr\gc\gc.cpp:24234
    Image: C:\h\w\B78B099A\p\corerun.exe


Return code:      1
Raw output file:      C:\h\w\B78B099A\w\B19C09A9\uploads\Reports\GC.API\NoGCRegion\Callback_Svr\Callback_Svr.output.txt
Raw output:
BEGIN EXECUTION
 "C:\h\w\B78B099A\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true"  Callback_Svr.dll 
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test failed. Trying to see if dump file was created in C:\cores since 3/22/2024 6:04:51 AM
Processing C:\cores\corerun.exe.17192.dmp
Invoking: C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe -c "$<C:\h\w\B78B099A\t\tmpmuespb.tmp" -z "C:\cores\corerun.exe.17192.dmp"
stdout: 
Microsoft (R) Windows Debugger Version 10.0.18362.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\cores\corerun.exe.17192.dmp]
User Mini Dump File with Full Memory: Only application data is available


************* Path validation summary **************
Response                         Time (ms)     Location
OK                                             C:\h\w\B78B099A\p\PDB
Symbol search path is: C:\h\w\B78B099A\p\PDB
Executable search path is: 
Windows 10 Version 14393 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
10.0.14393.6343 (rs1_release.230913-1727)
Machine Name:
Debug session time: Fri Mar 22 06:04:56.000 2024 (UTC + 0:00)
System Uptime: 0 days 0:26:43.404
Process Uptime: 0 days 0:00:04.000
..............................

************* Symbol Loading Error Summary **************
Module name            Error
ntdll                  The system cannot find the file specified

You can troubleshoot most symbol related issues by turning on symbol loading diagnostics (!sym noisy) and repeating the command that caused symbols to be loaded.
You should also verify that your symbol search path (.sympath) is correct.
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(4328.4344): Unknown exception - code c0000602 (first/second chance not available)
For analysis of this file, run !analyze -v
*** WARNING: Unable to verify checksum for coreclr.dll
KERNELBASE!RaiseFailFastException+0xae:
00007ffe`5d5d96ce 803d4d78120000  cmp     byte ptr [KERNELBASE!local_unwind+0x7fe2 (00007ffe`5d700f22)],0 ds:00007ffe`5d700f22=00
0:006> cdb: Reading initial command '$<C:\h\w\B78B099A\t\tmpmuespb.tmp'
0:006> .load C:\Users\runner\.dotnet\sos\sos.dll
0:006> ~*k

   0  Id: 4328.432c Suspend: 0 Teb: 000000de`0e18d000 Unfrozen
Child-SP          RetAddr           Call Site
000000de`0e37db48 00007ffe`5d536d1f ntdll!ZwWaitForSingleObject+0x14
000000de`0e37db50 00007ffe`2f27efc2 KERNELBASE!WaitForSingleObjectEx+0x8f
000000de`0e37dbf0 00007ffe`2f234929 coreclr!SVR::gc_heap::wait_for_gc_done+0x1b2
000000de`0e37dc50 00007ffe`2f27a921 coreclr!SVR::GCHeap::GarbageCollectGeneration+0x359
000000de`0e37dca0 00007ffe`2f233430 coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x71
(Inline Function) --------`-------- coreclr!SVR::gc_heap::try_allocate_more_space+0xca
(Inline Function) --------`-------- coreclr!SVR:

Stack trace:

   at GC_API._NoGCRegion_Callback_Svr_Callback_Svr_._NoGCRegion_Callback_Svr_Callback_Svr_cmd()
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

More failures:
runtime-coreclr crossgen2 20240321.1

R2R-CG2 windows x64 Checked no_tiered_compilation @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
R2R-CG2 windows x64 Checked @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd

runtime-coreclr pgo 20240322.1

coreclr windows x64 Checked defaultpgo @ Windows.10.Amd64.Open
    - GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd

@AndyAyersMS
Copy link
Member Author

This is having fairly widespread impact, would appreciate it if we could get some kind of mitigation in place soon.

@BruceForstall
Copy link
Member

Did #100167 not work?

Oh, it looks like #100217 is the same symptom on a different test.

@mangod9
Copy link
Member

mangod9 commented Mar 26, 2024

yeah we will need to workaround by disabling optimizations for this method in the interim. Very likely this was triggered by the compiler update from last week.

@VSadov
Copy link
Member

VSadov commented Apr 5, 2024

This affects servicing branches as well. Seen in a 8.0-staging PR:

#100426

@mangod9
Copy link
Member

mangod9 commented Apr 5, 2024

we plan to move to Release version of the toolset. PR for backport: #100622

@jeffschwMSFT
Copy link
Member

removing blocking-clean-ci as it has not failed in 30 days

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 0

@markples
Copy link
Contributor

fixed - see #100598 and #101995

@github-actions github-actions bot locked and limited conversation to collaborators Jun 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr Known Build Error Use this to report build issues in the .NET Helix tab
Projects
None yet
Development

No branches or pull requests

8 participants