testAddDiskNFS troubleshooting #1909

Merged (2 commits) on Nov 16, 2024

martinpitt commented Nov 15, 2024

This fixes cockpit-project/bots#7095

Fixes #1900.

martinpitt commented Nov 15, 2024

Ah! The only place that shows anything useful is our automatically captured console log:

[  129.981339] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  129.982158] #PF: supervisor instruction fetch in kernel mode
[  129.982705] #PF: error_code(0x0010) - not-present page
[  129.983189] PGD 0 P4D 0 
[  129.983452] Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
[  129.983908] CPU: 0 UID: 0 PID: 686 Comm: systemd-journal Kdump: loaded Not tainted 6.11.0-29.3_1538040423.el10.x86_64 #1
[  129.984903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
[  129.985718] RIP: 0010:0x0
[  129.985983] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[  129.986574] RSP: 0018:ffffa9c6c07f3b88 EFLAGS: 00010286
[  129.987336] RAX: 0000000000000000 RBX: ffff8f91a291b700 RCX: 000000007fff0000
[  129.988158] RDX: ffff8f9186cc8000 RSI: ffffa9c6c0645048 RDI: ffffa9c6c07f3ba0
[  129.988931] RBP: 000000007fff0000 R08: 00000000000801c2 R09: 0000558cc06bb1b0
[  129.989708] R10: 00000000ffffff9c R11: 00000000ffffff9c R12: ffffa9c6c07f3ba0
[  129.990446] R13: 000000007fff0000 R14: ffffa9c6c0645000 R15: 0000000000000000
[  129.991185] FS:  00007fc22b264980(0000) GS:ffff8f91c7c00000(0000) knlGS:0000000000000000
[  129.992004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  129.992623] CR2: ffffffffffffffd6 CR3: 000000000a7c4004 CR4: 0000000000372ef0
[  129.993376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  129.994099] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  129.994817] Call Trace:
[  129.995155]  <TASK>
[  129.995475]  ? show_trace_log_lvl+0x1b0/0x2f0
[  129.995977]  ? show_trace_log_lvl+0x1b0/0x2f0
[  129.996453]  ? __seccomp_filter+0xdc/0x520
[  129.996929]  ? __die_body.cold+0x8/0x12
[  129.997390]  ? page_fault_oops+0x146/0x160
[  129.997864]  ? exc_page_fault+0x73/0x160
[  129.998318]  ? asm_exc_page_fault+0x26/0x30
[  129.998796]  __seccomp_filter+0xdc/0x520
[  129.999237]  syscall_trace_enter+0x92/0x1a0
[  129.999717]  ? syscall_exit_to_user_mode+0x32/0x190
[  130.000234]  do_syscall_64+0x12a/0x160
[  130.000680]  ? sock_poll+0x51/0xf0
[  130.001074]  ? ep_done_scan+0xf0/0x140
[  130.001506]  ? ep_send_events+0x288/0x2d0
[  130.001940]  ? ep_poll+0x2db/0x3f0
[  130.002348]  ? rseq_get_rseq_cs+0x1d/0x220
[  130.002798]  ? __pfx_ep_autoremove_wake_function+0x10/0x10
[  130.003352]  ? rseq_ip_fixup+0x8d/0x1d0
[  130.003779]  ? do_epoll_wait+0xc1/0xe0
[  130.004330]  ? switch_fpu_return+0x4e/0xd0
[  130.004816]  ? arch_exit_to_user_mode_prepare.isra.0+0x83/0xb0
[  130.005366]  ? syscall_exit_to_user_mode+0x32/0x190
[  130.005856]  ? do_syscall_64+0x89/0x160
[  130.006245]  ? __task_pid_nr_ns+0x97/0xb0
[  130.006665]  ? syscall_exit_to_user_mode+0x32/0x190
[  130.007115]  ? do_syscall_64+0x89/0x160
[  130.007493]  ? do_syscall_64+0x89/0x160
[  130.007874]  ? clear_bhb_loop+0x25/0x80
[  130.008250]  ? clear_bhb_loop+0x25/0x80
[  130.008619]  ? clear_bhb_loop+0x25/0x80
[  130.008998]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  130.009454] RIP: 0033:0x7fc22ad20458
[  130.009835] Code: 89 7c 24 18 44 89 54 24 0c e8 24 97 f9 ff 44 8b 54 24 0c 8b 54 24 1c 41 89 c0 48 8b 74 24 10 8b 7c 24 18 b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 0c e8 74 97 f9 ff 8b 44
[  130.011349] RSP: 002b:00007ffea415e680 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[  130.011995] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00007fc22ad20458
[  130.012601] RDX: 00000000000801c2 RSI: 0000558cc06bb1b0 RDI: 00000000ffffff9c
[  130.013211] RBP: 00000000ffffff9c R08: 0000000000000000 R09: 0000558cc06eba30
[  130.013819] R10: 0000000000000180 R11: 0000000000000293 R12: 0000558cc06bb1b0
[  130.014418] R13: 00007ffea415e7a8 R14: 00007ffea415e7a8 R15: 0000000000000000
[  130.015029]  </TASK>
[  130.015282] Modules linked in: tun nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs netfs rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace nfs_localio nft_masq nft_reject_ipv4 nf_nat_tftp nf_conntrack_tftp bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill nf_tables intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl vfat fat i2c_piix4 pcspkr virtio_balloon cirrus i2c_smbus joydev sg fuse loop dm_multipath nfnetlink xfs sr_mod cdrom ata_generic ata_piix libata virtio_net crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel net_failover virtio_scsi virtio_blk failover serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[  130.022733] CR2: 0000000000000000

and then it auto-reboots.
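Since the oops only shows up in the captured console log, a small scanner can flag it automatically. The following is a rough sketch, not part of this PR: the function name `find_oopses` is made up for illustration, and the patterns only cover the "NULL pointer dereference" wording seen in the log above (other oops types use different messages).

```python
import re

# Patterns derived from the console log quoted above; other oops kinds
# (general protection fault, BUG_ON, ...) are NOT covered by this sketch.
OOPS_START = re.compile(
    r"BUG: kernel NULL pointer dereference, address: (?P<addr>[0-9a-f]+)"
)
# Only the kernel-mode RIP (code segment 0010); the user-space RIP (0033) is skipped.
RIP_LINE = re.compile(r"RIP: 0010:(?P<rip>\S+)")
# Leading "[  129.981339] " style timestamp.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def find_oopses(console_text):
    """Return one dict per oops, with the fault address and the kernel RIP."""
    found = []
    current = None
    for raw in console_text.splitlines():
        line = TIMESTAMP.sub("", raw)
        start = OOPS_START.search(line)
        if start:
            current = {"address": start.group("addr"), "rip": None}
            found.append(current)
            continue
        rip = RIP_LINE.match(line)
        if rip and current is not None and current["rip"] is None:
            current["rip"] = rip.group("rip")
    return found
```

A CI wrapper could run this over the console log after each test and fail loudly when the list is non-empty, instead of relying on the journal, which does not survive the reboot here.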

That is with the cockpit-project/bots#7095 image, which carries the temporary kernel build. But it crashes the same way with 6.11.0-29.el10.x86_64 on centos-10 (we just don't regularly test cockpit-machines on our own centos-10 image):

[  105.053914] BUG: kernel NULL pointer dereference, address: 00000000000001f0
[  105.054878] #PF: supervisor read access in kernel mode
[  105.055536] #PF: error_code(0x0000) - not-present page
[  105.056084] PGD 0 P4D 0 
[  105.056420] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  105.057015] CPU: 0 UID: 0 PID: 31 Comm: kworker/u4:2 Kdump: loaded Not tainted 6.11.0-29.el10.x86_64 #1
[  105.058037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
[  105.059079] Workqueue: events_unbound flush_memcg_stats_dwork
[  105.059842] RIP: 0010:mem_cgroup_css_rstat_flush+0x289/0x340
[  105.060694] Code: 74 5d 49 89 f9 48 89 bc c6 f8 00 00 00 4d 29 c1 4c 01 8a f8 00 00 00 4c 01 c9 75 43 48 83 c0 01 48 83 c2 08 48 83 f8 1f 74 53 <48> 8b 8a f0 01 00 00 48 85 c9 75 ad 48 63 c8 4c 8b 84 c6 f8 00 00
[  105.063190] RSP: 0018:ffffb39d800fbde8 EFLAGS: 00010086
[  105.063949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fbbb2a69000
[  105.064833] RDX: 0000000000000000 RSI: ffff9fbbc7c00000 RDI: ffffffff8a19a0c0
[  105.065682] RBP: ffffd39d7fc16200 R08: 0000000000000018 R09: 0000000000000000
[  105.066527] R10: ffff9fbb831d4800 R11: 0000000000000000 R12: ffff9fbb83e84000
[  105.067371] R13: ffff9fbb84158000 R14: ffff9fbb83e84000 R15: ffff9fbb83e85000
[  105.068276] FS:  0000000000000000(0000) GS:ffff9fbbc7c00000(0000) knlGS:0000000000000000
[  105.069308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  105.069859] CR2: 00000000000001f0 CR3: 0000000006afa001 CR4: 0000000000372ef0
[  105.070621] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  105.071408] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  105.072188] Call Trace:
[  105.072492]  <TASK>
[  105.072767]  ? show_trace_log_lvl+0x1b0/0x2f0
[  105.073226]  ? show_trace_log_lvl+0x1b0/0x2f0
[  105.073648]  ? cgroup_rstat_flush_locked+0x1dc/0x2d0
[  105.074190]  ? __die_body.cold+0x8/0x12
[  105.074602]  ? page_fault_oops+0x146/0x160
[  105.075048]  ? exc_page_fault+0x73/0x160
[  105.075443]  ? asm_exc_page_fault+0x26/0x30
[  105.075899]  ? mem_cgroup_css_rstat_flush+0x289/0x340
[  105.076398]  ? mem_cgroup_css_rstat_flush+0x1f7/0x340
[  105.076982]  cgroup_rstat_flush_locked+0x1dc/0x2d0
[  105.077695]  cgroup_rstat_flush+0x27/0x80
[  105.078430]  flush_memcg_stats_dwork+0x26/0x50
[  105.078981]  process_one_work+0x174/0x330
[  105.079433]  worker_thread+0x252/0x390
[  105.079830]  ? __pfx_worker_thread+0x10/0x10
[  105.080136]  kthread+0xcf/0x100
[  105.080358]  ? __pfx_kthread+0x10/0x10
[  105.080758]  ret_from_fork+0x31/0x50
[  105.081131]  ? __pfx_kthread+0x10/0x10
[  105.081477]  ret_from_fork_asm+0x1a/0x30
[  105.081868]  </TASK>
[  105.082147] Modules linked in: tun nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs netfs rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace nfs_localio nft_masq nft_reject_ipv4 nf_nat_tftp nf_conntrack_tftp bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill nf_tables intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl i2c_piix4 pcspkr virtio_balloon i2c_smbus cirrus joydev sg fuse loop dm_multipath nfnetlink xfs sr_mod cdrom ata_generic ata_piix libata virtio_net crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel net_failover virtio_blk virtio_scsi failover serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
[  105.089235] CR2: 00000000000001f0

I also confirmed it with our current centos-10 image with 6.11.0-27.el10.x86_64.
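To keep track of which kernel builds crash, the "CPU:" line of each oops can be tabulated. This is only an illustrative sketch: `crash_contexts` is a hypothetical helper, and the pattern assumes the "Not tainted" form seen in the two traces above (a tainted kernel prints a different field there).

```python
import re

# Matches the "CPU: ... PID: <n> Comm: <task> ... Not tainted <release> #<n>" line.
# Assumption: the task name contains no spaces and the kernel is not tainted.
CPU_LINE = re.compile(
    r"PID: (?P<pid>\d+) Comm: (?P<comm>\S+) .*Not tainted (?P<release>\S+)"
)


def crash_contexts(console_text):
    """Yield (kernel release, task name, pid) for each oops "CPU:" line."""
    for line in console_text.splitlines():
        m = CPU_LINE.search(line)
        if m:
            yield m.group("release"), m.group("comm"), int(m.group("pid"))
```

For the two traces above this yields systemd-journal (PID 686) on 6.11.0-29.3_1538040423.el10.x86_64 and kworker/u4:2 (PID 31) on 6.11.0-29.el10.x86_64.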

Reported as https://issues.redhat.com/browse/RHEL-67841

martinpitt changed the title from "test: Disable setroubleshootd for testAddDiskNFS" to "testAddDiskNFS troubleshooting" on Nov 15, 2024
martinpitt force-pushed the setrouble branch 2 times, most recently from 4745282 to 51ecaf5, on November 15, 2024 16:04
martinpitt commented

I am running out of ideas. I have now skipped the test on RHEL 10, which ought to fix the c10s Testing Farm (TF) run as well.
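For readers unfamiliar with image-based skips, here is a generic sketch of the idea using plain `unittest`. It is not the code from this PR: the helper names, the class name, and the image name patterns are assumptions for illustration, as is reading the image name from the `TEST_OS` environment variable.

```python
import fnmatch
import os
import unittest

# Hypothetical image name patterns; the real names depend on the CI setup.
RHEL10_LIKE = ("rhel-10*", "centos-10*")


def is_rhel10_like(image):
    """True if the image name matches one of the RHEL 10 family patterns."""
    return any(fnmatch.fnmatch(image, pattern) for pattern in RHEL10_LIKE)


def skip_on_rhel10(reason):
    """Decorator: skip the test when TEST_OS names a RHEL 10 family image."""
    image = os.environ.get("TEST_OS", "")
    return unittest.skipIf(is_rhel10_like(image), reason)


class TestMachinesDisks(unittest.TestCase):  # hypothetical class name
    @skip_on_rhel10("kernel oops reboots the VM, see RHEL-67841")
    def testAddDiskNFS(self):
        ...  # the actual test body is not shown in this thread
```

Putting the bug reference into the skip reason makes it easy to find and drop the skip once the kernel is fixed.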

NFS produces dozens of permissive=1 SELinux issues; setroubleshoot processes
them massively in parallel, eating up too much RAM and I/O.
Avoid this by disabling setroubleshootd for this test.

https://bugzilla.redhat.com/show_bug.cgi?id=2326499

RHEL 10 got a nasty kernel oops [1] which unceremoniously reboots the VM
without leaving any journal trace; only the QEMU console shows it.

It completely breaks Testing Farm runs (it cannot recover from reboots)
and also breaks our own CI in nasty ways, as after the reboot the
nondestructive recovery fails in all kinds of ways.

Fixes cockpit-project#1900

[1] https://issues.redhat.com/browse/RHEL-67841
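As an illustration of the "disable setroubleshootd for this test" approach from the first commit message, one possible shape is a context manager that masks the unit and restores it afterwards. This is a sketch under assumptions, not the change that was merged: `MaskedUnit` and its `run` parameter are hypothetical, and a real test would run the commands inside the test VM rather than on the host.

```python
import subprocess

UNIT = "setroubleshootd.service"


def mask_command(unit=UNIT):
    # "mask --now" also stops a running instance; a masked unit cannot be
    # started again until it is unmasked.
    return ["systemctl", "mask", "--now", unit]


def unmask_command(unit=UNIT):
    return ["systemctl", "unmask", unit]


class MaskedUnit:
    """Context manager that keeps a systemd unit masked while a test runs.

    `run` is a hypothetical hook: by default it executes locally, but a test
    harness would pass its own function that executes on the test machine.
    """

    def __init__(self, unit=UNIT, run=subprocess.check_call):
        self.unit = unit
        self.run = run

    def __enter__(self):
        self.run(mask_command(self.unit))
        return self

    def __exit__(self, *exc_info):
        # Always restore the unit, even if the test body raised.
        self.run(unmask_command(self.unit))
        return False
```

Restoring the unit in `__exit__` matters for nondestructive test runs, where the same VM is reused by later tests.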
jelly merged commit 423d16a into cockpit-project:main on Nov 16, 2024
27 of 29 checks passed
martinpitt deleted the setrouble branch on November 16, 2024 12:58
Successfully merging this pull request may close these issues.

coredump in c10s testing farm