Q/BUG: process stuck in audit_backlog_wait for a longtime #144
Comments
On 2023-06-15 14:59, Chunwei Chen wrote:
> System: centos 7
> Kernel: Upstream 5.10.168 with some custom change unrelated to audit
> auditd: 2.8.5-4.el7.x86_64
That's pretty old...
> Hi, we had an issue where we were trying to kill a process to be able to umount some filesystem.
"umount some filesystem"... Would that happen to be a user or network filesystem? #100
> … However the process seemed to stuck in audit_backlog_wait for more than 20 seconds.
> We have some code triggered a panic because it couldn't umount the filesystem.
@rgbriggs
Unfortunately it's hard to draw any conclusions using a stacktrace from a custom kernel. Are you able to reproduce the problem on a modern kernel, or at the very least one without any custom patches?
@pcmoore
Sometimes unrelated changes can have a surprising impact :) For that reason, as a general rule we don't provide any upstream support for Linux Kernels with custom patches. Beyond the custom kernel changes, I'm also curious if a modern kernel resolves your problem as there have been some changes related to queue management since v5.10 was released. As Richard already mentioned, that kernel is quite old.
At the very least the line numbers of the stacktrace would be helpful.
There was some discussion about waking blocked processes recently on the audit mailing list:
Ah, I see this is a Nutanix kernel build. Do you work for Nutanix or are you a customer/user? If the former, you should definitely read the mail archive link I posted above.
Yeah, I noticed just now. I work on a different team than Eiichi and we use a different build, but it does seem very similar.
When unbind and bind the device again, kernel will dump below warning:
[ 173.972130] sysfs: cannot create duplicate filename '/devices/platform/soc/4c010010.usb/software_node'
[ 173.981564] CPU: 2 UID: 0 PID: 536 Comm: sh Not tainted 6.12.0-rc6-06344-g2aed7c4a5c56 #144
[ 173.989923] Hardware name: NXP i.MX95 15X15 board (DT)
[ 173.995062] Call trace:
[ 173.997509] dump_backtrace+0x90/0xe8
[ 174.001196] show_stack+0x18/0x24
[ 174.004524] dump_stack_lvl+0x74/0x8c
[ 174.008198] dump_stack+0x18/0x24
[ 174.011526] sysfs_warn_dup+0x64/0x80
[ 174.015201] sysfs_do_create_link_sd+0xf0/0xf8
[ 174.019656] sysfs_create_link+0x20/0x40
[ 174.023590] software_node_notify+0x90/0x100
[ 174.027872] device_create_managed_software_node+0xec/0x108
...
The '4c010010.usb' device is a platform device created during the initcall and is never removed, which causes its associated software node to persist indefinitely.
The existing device_create_managed_software_node() does not provide a corresponding removal function.
Replace device_create_managed_software_node() with the device_add_software_node() and device_remove_software_node() pair to ensure proper addition and removal of software nodes, addressing this issue.
Fixes: a9400f1 ("usb: dwc3: imx8mp: add 2 software managed quirk properties for host mode")
Cc: stable@vger.kernel.org
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Acked-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Link: https://lore.kernel.org/r/20241126032841.2458338-1-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
System: centos 7
Kernel: Upstream 5.10.168 with some custom change unrelated to audit
auditd: 2.8.5-4.el7.x86_64
Hi, we had an issue where we were trying to kill a process to be able to umount some filesystem.
However, the process seemed to be stuck in audit_backlog_wait for more than 20 seconds.
Some of our code then triggered a panic because it couldn't umount the filesystem.
Here's the stack trace
The audit_queue seemed quite empty at the time of the panic.
I think there is a fairness issue at play here.
When a process enters audit_log_start and audit_queue.qlen is large, it decides to wait.
Then, while kauditd is consuming the audit_queue, other threads entering audit_log_start might see a small audit_queue.qlen and bypass the wait. So there's no guarantee when the waiting process will be able to queue.
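To illustrate the race described above, here is a small deterministic Python sketch. It is a toy model, not kernel code: the function name, the per-round ordering (kauditd drains, newcomers run, then the waiter rechecks), and the `qlen <= backlog_limit` admission test are all assumptions made for illustration.

```python
def rounds_until_waiter_enqueues(backlog_limit, newcomers_per_round, max_rounds):
    """Toy model of the suspected unfairness (hypothetical, not kernel code).

    Returns (round in which the waiter finally enqueues or None,
             number of newcomers admitted in the meantime).
    """
    admitted_newcomers = 0
    for round_no in range(1, max_rounds + 1):
        # Assumed step 1: kauditd drains the whole queue, then wakes the waiter.
        qlen = 0
        # Assumed step 2: newcomers get to run before the woken waiter does.
        for _ in range(newcomers_per_round):
            if qlen <= backlog_limit:
                # The newcomer sees a short queue and bypasses the wait.
                qlen += 1
                admitted_newcomers += 1
        # Assumed step 3: the waiter rechecks the queue length.
        if qlen <= backlog_limit:
            return round_no, admitted_newcomers
        # Otherwise the queue is over the limit again and the waiter sleeps.
    return None, admitted_newcomers


if __name__ == "__main__":
    print(rounds_until_waiter_enqueues(4, 2, 10))
    print(rounds_until_waiter_enqueues(4, 10, 10))
```

In this model a light load lets the waiter in on the first round, while a burst of newcomers larger than the limit refills the queue every round and the waiter never gets in, even though newcomers keep being admitted.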
Another part of this issue is that kauditd wakes up only one process per iteration, after it has processed the whole queue. The comment says wake everyone, but the code uses wake_up rather than wake_up_all, even though the waiters use add_wait_queue_exclusive. If the intention is to wake everyone, should we change it to wake_up_all? I think with wake_up_all the chance of our process being stuck for 20 seconds would probably be lower.
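The difference between the two wake-up calls can be sketched with a toy wait queue in Python. This is a simplified userspace model, not the kernel implementation: `wake`, `passes_until_woken` and the task names are hypothetical, and the model assumes that a wake-up stops after `nr_exclusive` exclusive waiters (1 for wake_up, no limit for wake_up_all) and that woken tasks leave the queue.

```python
from collections import deque


def wake(waiters, nr_exclusive):
    """Wake waiters in queue order (toy model).

    Stops once nr_exclusive exclusive waiters have been woken;
    nr_exclusive == 0 means there is no limit.
    """
    woken = []
    budget = nr_exclusive
    while waiters:
        name, exclusive = waiters.popleft()
        woken.append(name)
        if exclusive:
            budget -= 1
            if budget == 0:
                break
    return woken


def passes_until_woken(num_waiters, target, wake_all):
    """Count how many kauditd passes happen before `target` is woken."""
    # Every waiter is exclusive, as with add_wait_queue_exclusive.
    waiters = deque((f"task{i}", True) for i in range(num_waiters))
    passes = 0
    while waiters:
        passes += 1
        # Model wake_up as nr_exclusive=1 and wake_up_all as nr_exclusive=0.
        woken = wake(waiters, 0 if wake_all else 1)
        if target in woken:
            return passes
    return None  # target was never on the queue


if __name__ == "__main__":
    print(passes_until_woken(5, "task4", wake_all=False))
    print(passes_until_woken(5, "task4", wake_all=True))
```

With five exclusive waiters, the last one in this model is woken only on the fifth pass when a single waiter is woken per pass, but on the first pass when all are woken. Being woken still does not guarantee a slot in the queue, since the waiter has to recheck the backlog afterwards.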