Free htab element out of bucket lock #8374

kernel-patches-daemon-bpf · 2025-01-17T10:49:34Z

Pull request for series with
subject: Free htab element out of bucket lock
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=926416

kernel-patches-daemon-bpf · 2025-01-17T10:49:35Z

Upstream branch: b53b63d
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=926416
version: 3

kernel-patches-daemon-bpf · 2025-01-17T12:36:20Z

Upstream branch: b53b63d
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=926416
version: 3

kernel-patches-daemon-bpf · 2025-01-17T12:45:33Z

Upstream branch: b53b63d
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=926416
version: 3

kernel-patches-daemon-bpf · 2025-01-17T14:23:22Z

Upstream branch: f8a0569
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=926416
version: 3

When a bpf_timer is used in an LRU hash map, calling check_and_free_fields() in htab_lru_map_delete_node() invokes bpf_timer_cancel_and_free() to free the bpf_timer. If the timer is running on another CPU, hrtimer_cancel() invokes hrtimer_cancel_wait_running() to spin on the current CPU until the hrtimer callback completes. At that point the deletion has already acquired a raw spin lock (the bucket lock), so to reduce the time the bucket lock is held, move the invocation of check_and_free_fields() out of the bucket lock. However, because htab_lru_map_delete_node() is invoked with the LRU raw spin lock held, the freeing of special fields still happens in a locked scope.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
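The general pattern of unlinking under the lock and freeing after the unlock can be sketched in user space as follows. This is only an illustration under stated assumptions: struct elem, struct bucket, bucket_insert(), bucket_delete() and free_special_fields() are hypothetical names, a pthread mutex stands in for the raw-spin bucket lock, and RCU and the LRU lock are ignored entirely. It is not the code from the patch.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these names are kernel APIs. */
struct elem {
	int key;
	struct elem *next;
	int *special;          /* stands in for a special field such as a bpf_timer */
};

struct bucket {
	pthread_mutex_t lock;  /* stands in for the raw-spin bucket lock */
	struct elem *head;
};

/* Potentially slow cleanup; in the kernel this may wait for a timer callback. */
static void free_special_fields(struct elem *e)
{
	free(e->special);
	e->special = NULL;
}

static int bucket_insert(struct bucket *b, int key)
{
	struct elem *e = calloc(1, sizeof(*e));

	if (!e)
		return -1;
	e->special = malloc(sizeof(*e->special));
	if (!e->special) {
		free(e);
		return -1;
	}
	e->key = key;

	pthread_mutex_lock(&b->lock);
	e->next = b->head;
	b->head = e;
	pthread_mutex_unlock(&b->lock);
	return 0;
}

/* Unlink under the lock, but run the slow cleanup only after unlocking. */
static int bucket_delete(struct bucket *b, int key)
{
	struct elem **pp, *victim = NULL;

	pthread_mutex_lock(&b->lock);
	for (pp = &b->head; *pp; pp = &(*pp)->next) {
		if ((*pp)->key == key) {
			victim = *pp;
			*pp = victim->next;  /* no longer reachable via the bucket */
			break;
		}
	}
	pthread_mutex_unlock(&b->lock);

	if (!victim)
		return -1;

	/* The lock is no longer held, so a slow cleanup does not extend it. */
	free_special_fields(victim);
	free(victim);
	return 0;
}
```

The lock hold time then covers only the list manipulation, which is the effect the commit message describes for the bucket lock.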

Use a goto statement to bail out early when the target element is not found, instead of using a large else branch to handle the more likely case. This change doesn't affect functionality and simply makes the code cleaner.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
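A minimal sketch of the early bail-out style described above, assuming a hypothetical array-backed lookup (struct item, find_item() and lookup_copy() are illustrative names, not kernel functions). The not-found case jumps to a single exit label, so the likely path stays at the top indentation level.

```c
#include <stddef.h>
#include <string.h>

struct item {
	int key;
	char value[16];
};

/* Hypothetical helper: a linear search standing in for a bucket lookup. */
static struct item *find_item(struct item *items, size_t n, int key)
{
	for (size_t i = 0; i < n; i++)
		if (items[i].key == key)
			return &items[i];
	return NULL;
}

/*
 * Copy the value stored for key into out (out_len must be > 0).
 * Returns 0 on success, -1 if the key is absent.
 */
static int lookup_copy(struct item *items, size_t n, int key,
		       char *out, size_t out_len)
{
	struct item *it;
	int ret = 0;

	/* A lock would be taken here in the real code. */
	it = find_item(items, n, key);
	if (!it) {
		ret = -1;
		goto out_unlock;   /* unlikely case: bail out early */
	}

	/* Likely case, not nested inside an else branch. */
	strncpy(out, it->value, out_len - 1);
	out[out_len - 1] = '\0';

out_unlock:
	/* The lock would be released here in the real code. */
	return ret;
}
```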

Freeing the special fields in a map value may acquire a spin lock (e.g., when freeing a bpf_timer). However, the lookup_and_delete_elem procedure already holds a raw spin lock, which violates the lockdep rule. The running context of __htab_map_lookup_and_delete_elem() has already disabled migration, so it is OK to invoke free_htab_elem() after unlocking the bucket lock. Fix the potential problem by freeing the element after unlocking the bucket lock in __htab_map_lookup_and_delete_elem().

Signed-off-by: Hou Tao <houtao1@huawei.com>

During the update procedure, when overwriting an element in a pre-allocated htab, the freeing of old_element is protected by the bucket lock. The bucket lock is necessary because old_element has already been stashed in htab->extra_elems by the time alloc_htab_elem() returns. If old_element were freed after the bucket lock is unlocked, the stashed element could be reused by a concurrent update procedure, and the freeing of old_element would run concurrently with its reuse. However, the invocation of check_and_free_fields() may acquire a spin lock, which violates the lockdep rule because its caller already holds a raw spin lock (the bucket lock). The following warning is reported when such a case occurs:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of the per-cpu extra_elems into two independent parts: reuse the per-cpu extra_elems with the bucket lock held, and refill old_element as the per-cpu extra_elems after the bucket lock is unlocked. However, that would make concurrent overwrite procedures on the same CPU return an unexpected -E2BIG error when the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling of bpf_timer into two steps for PREEMPT_RT:

1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to cancel it again

Considering that the current implementation of hrtimer_cancel() tries to acquire the already-held softirq_expiry_lock when the timer is running, the steps above are reasonable. However, the approach also has a downside: when the timer is running, the cancelling of the timer is delayed when releasing the last map uref. The delay is also fixable (e.g., by breaking the cancelling of the bpf_timer into two parts, one in a locked scope and the other in an unlocked scope), and it can be revised later if necessary.

It is a bit hard to decide on the right Fixes tag. One reason is that the problem depends on PREEMPT_RT, which is enabled in v6.12. Considering that softirq_expiry_lock has existed since v5.4 and bpf_timer was introduced in v5.15, the bpf_timer commit is used in the Fixes tag and an extra Depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
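The two-step cancellation described in this commit message can be modelled in user space roughly as below. This is a single-threaded simulation, not kernel code: struct sim_timer, sim_try_to_cancel(), cancel_and_free() and run_deferred_work() are hypothetical names, the freed flag stands in for actually releasing the timer, and the blocking wait of hrtimer_cancel() is only simulated. The return convention of sim_try_to_cancel() follows that of hrtimer_try_to_cancel() (0 not active, 1 cancelled, -1 callback currently running).

```c
#include <stddef.h>

enum timer_state { TIMER_IDLE, TIMER_QUEUED, TIMER_RUNNING };

/* Hypothetical model of a timer; not a kernel structure. */
struct sim_timer {
	enum timer_state state;
	int freed;                        /* set once the timer is released */
	struct sim_timer *next_deferred;  /* link in the deferred-work list */
};

/* Timers whose cancellation was handed off to the worker. */
static struct sim_timer *deferred_head;

/* Non-blocking cancel: 0 = not active, 1 = cancelled, -1 = callback running. */
static int sim_try_to_cancel(struct sim_timer *t)
{
	if (t->state == TIMER_RUNNING)
		return -1;
	if (t->state == TIMER_QUEUED) {
		t->state = TIMER_IDLE;
		return 1;
	}
	return 0;
}

/*
 * Step 1: safe to call in a locked (atomic) scope because it never waits.
 * Returns 1 if the timer was handed to the worker, 0 if freed immediately.
 */
static int cancel_and_free(struct sim_timer *t)
{
	if (sim_try_to_cancel(t) >= 0) {
		t->freed = 1;
		return 0;
	}
	t->next_deferred = deferred_head;
	deferred_head = t;
	return 1;
}

/*
 * Step 2: runs later in a context that may block (a kworker in the patch).
 * Returns the number of timers processed.
 */
static int run_deferred_work(void)
{
	int n = 0;

	while (deferred_head) {
		struct sim_timer *t = deferred_head;

		deferred_head = t->next_deferred;
		/* Simulated: a blocking cancel has waited for the callback. */
		t->state = TIMER_IDLE;
		t->freed = 1;
		n++;
	}
	return n;
}
```

The trade-off noted in the commit message shows up here as well: a running timer stays unfreed until the deferred work runs.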

The main purpose of the test is to demonstrate the lock problem when freeing a bpf_timer under PREEMPT_RT. When bpf_timer_cancel_and_free() frees a bpf_timer that is running on another CPU, hrtimer_cancel() tries to acquire a spin lock (namely softirq_expiry_lock), but the freeing procedure already holds a raw spin lock. The test creates two threads: one to start timers and the other to free them. The start-timers thread starts the timers and, once the starts complete, wakes up the free-timers thread to free them. After freeing, the free-timers thread wakes up the start-timers thread to complete the current iteration. A loop of 10 iterations is used.

Signed-off-by: Hou Tao <houtao1@huawei.com>
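The hand-off between the two threads can be sketched with a mutex and a condition variable. This shows only the synchronization skeleton under assumed names (struct handshake, start_thread(), free_thread(), run_handshake()); the actual selftest starts and frees bpf timers at the commented points and may synchronize differently.

```c
#include <pthread.h>

#define NR_ITER 10

/* Hypothetical shared state for the start/free hand-off. */
struct handshake {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int started;  /* last iteration whose timers were started */
	int freed;    /* last iteration whose timers were freed */
};

static void *start_thread(void *arg)
{
	struct handshake *h = arg;

	for (int i = 1; i <= NR_ITER; i++) {
		/* The real test would start the timers here. */
		pthread_mutex_lock(&h->lock);
		h->started = i;
		pthread_cond_broadcast(&h->cond);  /* wake the free thread */
		while (h->freed < i)               /* wait until it has freed */
			pthread_cond_wait(&h->cond, &h->lock);
		pthread_mutex_unlock(&h->lock);
	}
	return NULL;
}

static void *free_thread(void *arg)
{
	struct handshake *h = arg;

	for (int i = 1; i <= NR_ITER; i++) {
		pthread_mutex_lock(&h->lock);
		while (h->started < i)             /* wait for the starts */
			pthread_cond_wait(&h->cond, &h->lock);
		pthread_mutex_unlock(&h->lock);

		/* The real test would free the timers here. */

		pthread_mutex_lock(&h->lock);
		h->freed = i;
		pthread_cond_broadcast(&h->cond);  /* wake the start thread */
		pthread_mutex_unlock(&h->lock);
	}
	return NULL;
}

/* Run both threads to completion; returns 0 on success, -1 on failure. */
static int run_handshake(struct handshake *h)
{
	pthread_t starter, freer;

	if (pthread_create(&starter, NULL, start_thread, h))
		return -1;
	if (pthread_create(&freer, NULL, free_thread, h)) {
		/* Release the starter so that it can exit, then reap it. */
		pthread_mutex_lock(&h->lock);
		h->freed = NR_ITER;
		pthread_cond_broadcast(&h->cond);
		pthread_mutex_unlock(&h->lock);
		pthread_join(starter, NULL);
		return -1;
	}
	pthread_join(starter, NULL);
	pthread_join(freer, NULL);
	return 0;
}
```

Using monotonically increasing counters rather than boolean flags keeps each wake-up tied to a specific iteration, so a late or spurious wake-up cannot let either thread run ahead.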

kernel-patches-daemon-bpf · 2025-01-17T14:50:37Z

Upstream branch: e055a46
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=926416
version: 3

kernel-patches-daemon-bpf bot added new bpf-next V3 V3-ci-pass labels Jan 17, 2025

kernel-patches-daemon-bpf bot force-pushed the series/922809=>bpf-next branch from 661f350 to 6d463c1 Compare January 17, 2025 12:36

kernel-patches-daemon-bpf bot force-pushed the series/922809=>bpf-next branch from 6d463c1 to 6b9aaad Compare January 17, 2025 12:45

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch from 075d2f1 to cfe4aae Compare January 17, 2025 14:22

kernel-patches-daemon-bpf bot force-pushed the series/922809=>bpf-next branch from 6b9aaad to 7e8dda0 Compare January 17, 2025 14:23

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch from cfe4aae to c120dfb Compare January 17, 2025 14:49

Hou Tao added 5 commits January 17, 2025 06:50

kernel-patches-daemon-bpf bot force-pushed the series/922809=>bpf-next branch from 7e8dda0 to 051a168 Compare January 17, 2025 14:50
