Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
workqueue: fix subtle pool management issue which can stall whole wor…
…ker_pool commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream. A worker_pool's forward progress is guaranteed by the fact that the last idle worker assumes the manager role to create more workers and summon the rescuers if creating workers doesn't succeed in timely manner before proceeding to execute work items. This manager role is implemented in manage_workers(), which indicates whether the worker may proceed to work item execution with its return value. This is necessary because multiple workers may contend for the manager role, and, if there already is a manager, others should proceed to work item execution. Unfortunately, the function also indicates that the worker may proceed to work item execution if need_to_create_worker() is false at the head of the function. need_to_create_worker() tests the following conditions. pending work items && !nr_running && !nr_idle The first and third conditions are protected by pool->lock and thus won't change while holding pool->lock; however, nr_running can change asynchronously as other workers block and resume and while it's likely to be zero, as someone woke this worker up in the first place, some other workers could have become runnable inbetween making it non-zero. If this happens, manage_worker() could return false even with zero nr_idle making the worker, the last idle one, proceed to execute work items. If then all workers of the pool end up blocking on a resource which can only be released by a work item which is pending on that pool, the whole pool can deadlock as there's no one to create more workers or summon the rescuers. This patch fixes the problem by removing the early exit condition from maybe_create_worker() and making manage_workers() return false iff there's already another manager, which ensures that the last worker doesn't start executing work items. We can leave the early exit condition alone and just ignore the return value but the only reason it was put there is because the manage_workers() used to perform both creations and destructions of workers and thus the function may be invoked while the pool is trying to reduce the number of workers. Now that manage_workers() is called only when more workers are needed, the only case this early exit condition is triggered is rare race conditions rendering it pointless. Tested with simulated workload and modified workqueue code which trigger the pool deadlock reliably without this patch. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Eric Sandeen <[email protected]> Link: http://lkml.kernel.org/g/[email protected] Cc: Dave Chinner <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: franciscofranco <[email protected]> workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream. cancel[_delayed]_work_sync() are implemented using __cancel_work_timer() which grabs the PENDING bit using try_to_grab_pending() and then flushes the work item with PENDING set to prevent the on-going execution of the work item from requeueing itself. try_to_grab_pending() can always grab PENDING bit without blocking except when someone else is doing the above flushing during cancelation. In that case, try_to_grab_pending() returns -ENOENT. In this case, __cancel_work_timer() currently invokes flush_work(). The assumption is that the completion of the work item is what the other canceling task would be waiting for too and thus waiting for the same condition and retrying should allow forward progress without excessive busy looping Unfortunately, this doesn't work if preemption is disabled or the latter task has real time priority. Let's say task A just got woken up from flush_work() by the completion of the target work item. If, before task A starts executing, task B gets scheduled and invokes __cancel_work_timer() on the same work item, its try_to_grab_pending() will return -ENOENT as the work item is still being canceled by task A and flush_work() will also immediately return false as the work item is no longer executing. This puts task B in a busy loop possibly preventing task A from executing and clearing the canceling state on the work item leading to a hang. task A task B worker executing work __cancel_work_timer() try_to_grab_pending() set work CANCELING flush_work() block for work completion completion, wakes up A __cancel_work_timer() while (forever) { try_to_grab_pending() -ENOENT as work is being canceled flush_work() false as work is no longer executing } This patch removes the possible hang by updating __cancel_work_timer() to explicitly wait for clearing of CANCELING rather than invoking flush_work() after try_to_grab_pending() fails with -ENOENT. Link: http://lkml.kernel.org/g/[email protected] v3: bit_waitqueue() can't be used for work items defined in vmalloc area. Switched to custom wake function which matches the target work item and exclusive wait and wakeup. v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if the target bit waitqueue has wait_bit_queue's on it. Use DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu Vizoso. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Rabin Vincent <[email protected]> Cc: Tomeu Vizoso <[email protected]> Tested-by: Jesper Nilsson <[email protected]> Tested-by: Rabin Vincent <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: franciscofranco <[email protected]> workqueue: make sure delayed work run in local cpu commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream. My system keeps crashing with below message. vmstat_update() schedules a delayed work in current cpu and expects the work runs in the cpu. schedule_delayed_work() is expected to make delayed work run in local cpu. The problem is timer can be migrated with NO_HZ. __queue_work() queues work in timer handler, which could run in a different cpu other than where the delayed work is scheduled. The end result is the delayed work runs in different cpu. The patch makes __queue_delayed_work records local cpu earlier. Where the timer runs doesn't change where the work runs with the change. [ 28.010131] ------------[ cut here ]------------ [ 28.010609] kernel BUG at ../mm/vmstat.c:1392! [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [ 28.011860] Modules linked in: [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W4.3.0-rc3+ #634 [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014 [ 28.014160] Workqueue: events vmstat_update [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000 [ 28.015445] RIP: 0010:[<ffffffff8115f921>] [<ffffffff8115f921>]vmstat_update+0x31/0x80 [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297 [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000 [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000 [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640 [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000 [ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000 [ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0 [ 28.020071] Stack: [ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88 [ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8 [ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340 [ 28.020071] Call Trace: [ 28.020071] [<ffffffff8106bd88>] process_one_work+0x1c8/0x540 [ 28.020071] [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540 [ 28.020071] [<ffffffff8106c214>] worker_thread+0x114/0x460 [ 28.020071] [<ffffffff8106c100>] ? process_one_work+0x540/0x540 [ 28.020071] [<ffffffff81071bf8>] kthread+0xf8/0x110 [ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200 [ 28.020071] [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70 [ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200 Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: franciscofranco <[email protected]> workqueue: clear POOL_DISASSOCIATED in rebind_workers() a9ab775 ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved pool locking into rebind_workers() but left "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback(). There is nothing necessarily wrong with it, but there is no benefit either. Let's move it into rebind_workers() and achieve the following benefits: 1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers() as expected. 2) we can guarantee that, when POOL_DISASSOCIATED is clear, the running workers of the pool are on the local CPU (pool->cpu). tj: Minor description update. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: franciscofranco <[email protected]> workqueue: Fix workqueue stall issue after cpu down failure When the hotplug notifier call chain with CPU_DOWN_PREPARE is broken before reaching workqueue_cpu_down_callback(), rebind_workers() adds WORKER_REBOUND flag for running workers. Hence, the nr_running of the pool is not increased when scheduler wakes up the worker. The fix is skipping adding WORKER_REBOUND flag when the worker doesn't have WORKER_UNBOUND flag in CPU_DOWN_FAILED path. Change-Id: I2528e9154f4913d9ec14b63adbcbcd1eaa8a8452 Signed-off-by: Se Wang (Patrick) Oh <[email protected]> Signed-off-by: franciscofranco <[email protected]> workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues Workqueues can be performance or power-oriented. Currently, most workqueues are bound to the CPU they were created on. This gives good performance (due to cache effects) at the cost of potentially waking up otherwise idle cores (Idle from scheduler's perspective. Which may or may not be physically idle) just to process some work. To save power, we can allow the work to be rescheduled on a core that is already awake. Workqueues created with the WQ_UNBOUND flag will allow some power savings. However, we don't change the default behaviour of the system. To enable power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also be overridden by the workqueue.power_efficient boot parameter. tj: Updated config description and comments. Renamed CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT. Signed-off-by: Viresh Kumar <[email protected]> Reviewed-by: Amit Kucheria <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Francisco Franco <[email protected]> workqueue: Add system wide power_efficient workqueues This patch adds system wide workqueues aligned towards power saving. This is done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to 'true'. tj: updated comments a bit. Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Francisco Franco <[email protected]> firmware: use power efficient workqueue for unloading and aborting fw load Allow the scheduler to select the most appropriate CPU for running the firmware load timeout routine and delayed routine for firmware unload. This extends idle residency times and conserves power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Cc: Ming Lei <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Shaibal Dutta <[email protected]> [[email protected]: Rebased to latest kernel, added commit message. Fixed code alignment.] Signed-off-by: Zoran Markovic <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Francisco Franco <[email protected]> net: wireless: move regulatory timeout work to power efficient workqueue For better use of CPU idle time, allow the scheduler to select the CPU on which the timeout work of regulatory settings would be executed. This extends CPU idle residency time and saves power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Cc: "John W. Linville" <[email protected]> Cc: "David S. Miller" <[email protected]> Signed-off-by: Shaibal Dutta <[email protected]> [[email protected]: Rebased to latest kernel. Added commit message.] Signed-off-by: Zoran Markovic <[email protected]> Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Francisco Franco <[email protected]> rcu: Move SRCU grace period work to power efficient workqueue For better use of CPU idle time, allow the scheduler to select the CPU on which the SRCU grace period work would be scheduled. This improves idle residency time and conserves power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Cc: Lai Jiangshan <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Dipankar Sarma <[email protected]> Signed-off-by: Shaibal Dutta <[email protected]> [[email protected]: Rebased to latest kernel version. Added commit message. Fixed code alignment.] Signed-off-by: Zoran Markovic <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]> Signed-off-by: Francisco Franco <[email protected]> net/ipv4: queue work on power efficient wq Workqueue used in ipv4 layer have no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. This doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is enabled. Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Francisco Franco <[email protected]> block: queue work on power efficient wq Block layer uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. Cc: Jens Axboe <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Francisco Franco <[email protected]> PHYLIB: queue work on system_power_efficient_wq Phylib uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces system_wq with system_power_efficient_wq for PHYLIB. Cc: David S. Miller <[email protected]> Cc: [email protected] Signed-off-by: Viresh Kumar <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Francisco Franco <[email protected]> ASoC: pcm: Use the power efficient workqueue for delayed powerdown There is no need to use a normal per-CPU workqueue for delayed power downs as they're not timing or performance critical and waking up a core for them would defeat some of the point. Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Viresh Kumar <[email protected]> Signed-off-by: Francisco Franco <[email protected]> ASoC: compress: Use power efficient workqueue There is no need for the power down work to be done on a per CPU workqueue especially considering the fairly long delay before powerdown. Signed-off-by: Mark Brown <[email protected]> Acked-by: Vinod Koul <[email protected]> Signed-off-by: Francisco Franco <[email protected]> ASoC: jack: Use power efficient workqueue The accessory detect debounce work is not performance sensitive so let the scheduler run it wherever is most efficient rather than in a per CPU workqueue by using the system power efficient workqueue. Signed-off-by: Mark Brown <[email protected]> Acked-by: Viresh Kumar <[email protected]> Signed-off-by: Francisco Franco <[email protected]> net/neighbour: queue work on power efficient wq Workqueue used in neighbour layer have no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. This doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is enabled. Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Francisco Franco <[email protected]> timekeeping: Move clock sync work to power efficient workqueue For better use of CPU idle time, allow the scheduler to select the CPU on which the CMOS clock sync work would be scheduled. This improves idle residency time and conserver power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Signed-off-by: Shaibal Dutta <[email protected]> [[email protected]: Added commit message. Aligned code.] Signed-off-by: Zoran Markovic <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Francisco Franco <[email protected]> net: rfkill: move poll work to power efficient workqueue This patch moves the rfkill poll_work to the power efficient workqueue. This work does not have to be bound to the CPU that scheduled it, hence the selection of CPU that executes it would be left to the scheduler. Net result is that CPU idle times would be extended, resulting in power savings. This behaviour is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. Cc: "John W. Linville" <[email protected]> Cc: "David S. Miller" <[email protected]> Signed-off-by: Shaibal Dutta <[email protected]> [[email protected]: Rebased to latest kernel, added commit message. Fixed workqueue selection after suspend/resume cycle.] Signed-off-by: Zoran Markovic <[email protected]> Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Francisco Franco <[email protected]> usb: move hub init and LED blink work to power efficient workqueue Allow the scheduler to select the best CPU to handle hub initalization and LED blinking work. This extends idle residency times on idle CPUs and conserves power. This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected. [[email protected]: Rebased to latest kernel. Added commit message. Changed reference from system to power efficient workqueue for LEDs in check_highspeed() and hub_port_connect_change().] Acked-by: Alan Stern <[email protected]> Cc: Sarah Sharp <[email protected]> Cc: Xenia Ragiadakou <[email protected]> Cc: Julius Werner <[email protected]> Cc: Krzysztof Mazur <[email protected]> Cc: Matthias Beyer <[email protected]> Cc: Dan Williams <[email protected]> Cc: Mathias Nyman <[email protected]> Cc: Thomas Pugliese <[email protected]> Signed-off-by: Shaibal Dutta <[email protected]> Signed-off-by: Zoran Markovic <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Francisco Franco <[email protected]> block: remove WQ_POWER_EFFICIENT from kblockd blk-mq issues async requests through kblockd. To issue a work request on a specific CPU, kblockd_schedule_delayed_work_on is used. However, the specific CPU choice may not be honored, if the power_efficient option for workqueues is set. blk-mq requires that we have strict per-cpu scheduling, so it wont work properly if kblockd is marked POWER_EFFICIENT and power_efficient is set. Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior. This essentially reverts part of commit 695588f9454b, which added the WQ_POWER_EFFICIENT marker to kblockd. Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Francisco Franco <[email protected]> regulator: core: Use the power efficient workqueue for delayed powerdown There is no need to use a normal per-CPU workqueue for delayed power downs as they're not timing or performance critical and waking up a core for them would defeat some of the point. Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Viresh Kumar <[email protected]> Acked-by: Liam Girdwood <[email protected]> Signed-off-by: Francisco Franco <[email protected]> power: smb135x: queue work on system_power_efficient_wq There doesn't seem to be any real dependency of scheduling these on the cpu which scheduled them, so moving every *_delayed to the power efficient wq save potential needlessly idle cpu wake ups leaving the scheduler to decide the most appropriate cpus to wake up. Signed-off-by: Francisco Franco <[email protected]>
- Loading branch information