From d1590c818da617e769808c65de4c79d604a71a32 Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Fri, 16 Jan 2015 14:21:16 -0500
Subject: [PATCH] workqueue: fix subtle pool management issue which can stall
 whole worker_pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.

A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in a timely
manner before proceeding to execute work items.

This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value.  This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items && !nr_running && !nr_idle

The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume, and while it's
likely to be zero, as someone woke this worker up in the first place,
some other workers could have become runnable in between, making it
non-zero.

If this happens, manage_workers() could return false even with zero
nr_idle, making the worker, the last idle one, proceed to execute work
items.  If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.

We could leave the early exit condition alone and just ignore the
return value, but the only reason it was put there is because
manage_workers() used to perform both creations and destructions of
workers, and thus the function could be invoked while the pool was
trying to reduce the number of workers.  Now that manage_workers() is
called only when more workers are needed, the early exit condition is
only triggered by rare race conditions, rendering it pointless.

Tested with a simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo
Reported-by: Eric Sandeen
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner
Cc: Lai Jiangshan
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: franciscofranco
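For reference, a condensed sketch of the check involved, simplified
from kernel/workqueue.c of this era rather than copied verbatim:

	/*
	 * Sketch of need_to_create_worker()'s effective condition.  The
	 * worklist and nr_idle terms are stable under pool->lock, but
	 * nr_running is maintained from scheduler hooks and can flip to
	 * non-zero at any moment, which is what makes the early exit in
	 * manage_workers() unsafe for the last idle worker.
	 */
	static bool need_to_create_worker(struct worker_pool *pool)
	{
		return !list_empty(&pool->worklist) &&	/* pending work items */
		       !atomic_read(&pool->nr_running) && /* nobody running */
		       !pool->nr_idle;			/* no idle workers left */
	}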
workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s
for PREEMPT_NONE

commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.

cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab the PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
this case, __cancel_work_timer() currently invokes flush_work().  The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too, and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping.

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop, possibly
preventing task A from executing and clearing the canceling state on
the work item, leading to a hang.

 task A                  task B                 worker

                                                executing work
 __cancel_work_timer()
   try_to_grab_pending()
   set work CANCELING
   flush_work()
     block for work completion
                                                completion, wakes up A
                         __cancel_work_timer()
                         while (forever) {
                           try_to_grab_pending()
                             -ENOENT as work is being canceled
                           flush_work()
                             false as work is no longer executing
                         }

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area.  Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo
Reported-by: Rabin Vincent
Cc: Tomeu Vizoso
Tested-by: Jesper Nilsson
Tested-by: Rabin Vincent
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: franciscofranco
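The pre-fix retry loop, condensed here for illustration (it matches
the lines removed in the kernel/workqueue.c hunk below), shows why
task B never sleeps:

	do {
		ret = try_to_grab_pending(work, is_dwork, &flags);
		/*
		 * While task A still owns CANCELING, this keeps
		 * returning -ENOENT, and flush_work() returns false
		 * right away because the work item is not executing.
		 * Nothing here ever sleeps, so with PREEMPT_NONE task
		 * A may never run again to clear CANCELING.
		 */
		if (unlikely(ret == -ENOENT))
			flush_work(work);
	} while (unlikely(ret < 0));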
workqueue: make sure delayed work run in local cpu

commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.

My system keeps crashing with the below message.  vmstat_update()
schedules a delayed work in the current cpu and expects the work to
run in that cpu.  schedule_delayed_work() is expected to make the
delayed work run in the local cpu.  The problem is that the timer can
be migrated with NO_HZ.  __queue_work() queues the work in the timer
handler, which could run in a different cpu than where the delayed
work was scheduled.  The end result is that the delayed work runs in a
different cpu.

The patch makes __queue_delayed_work() record the local cpu earlier.
With the change, where the timer runs no longer determines where the
work runs.

[   28.010131] ------------[ cut here ]------------
[   28.010609] kernel BUG at ../mm/vmstat.c:1392!
[   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[   28.011860] Modules linked in:
[   28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W 4.3.0-rc3+ #634
[   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[   28.014160] Workqueue: events vmstat_update
[   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
[   28.015445] RIP: 0010:[] [] vmstat_update+0x31/0x80
[   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
[   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
[   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
[   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
[   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
[   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
[   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
[   28.020071] Stack:
[   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
[   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
[   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
[   28.020071] Call Trace:
[   28.020071]  [] process_one_work+0x1c8/0x540
[   28.020071]  [] ? process_one_work+0x14b/0x540
[   28.020071]  [] worker_thread+0x114/0x460
[   28.020071]  [] ? process_one_work+0x540/0x540
[   28.020071]  [] kthread+0xf8/0x110
[   28.020071]  [] ? kthread_create_on_node+0x200/0x200
[   28.020071]  [] ret_from_fork+0x3f/0x70
[   28.020071]  [] ? kthread_create_on_node+0x200/0x200

Signed-off-by: Shaohua Li
Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: franciscofranco
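The shape of the fix, condensed from the kernel/workqueue.c hunk
further below:

	/* timer isn't guaranteed to run in this cpu, record earlier */
	if (cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();
	dwork->cpu = cpu;
	timer->expires = jiffies + delay;
	/* always pin the timer; dwork->cpu now names a concrete CPU */
	add_timer_on(timer, cpu);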
workqueue: clear POOL_DISASSOCIATED in rebind_workers()

a9ab775bcadf ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") moved pool locking into rebind_workers() but left
"pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().

There is nothing necessarily wrong with it, but there is no benefit
either.  Let's move it into rebind_workers() and achieve the following
benefits:

1) better readability, POOL_DISASSOCIATED is cleared in
   rebind_workers() as expected.

2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
   running workers of the pool are on the local CPU (pool->cpu).

tj: Minor description update.

Signed-off-by: Lai Jiangshan
Signed-off-by: Tejun Heo
Signed-off-by: franciscofranco

workqueue: Fix workqueue stall issue after cpu down failure

When the hotplug notifier call chain with CPU_DOWN_PREPARE is broken
before reaching workqueue_cpu_down_callback(), rebind_workers() adds
the WORKER_REBOUND flag for running workers.  Hence, the nr_running
count of the pool is not increased when the scheduler wakes up the
worker.  The fix is to skip adding the WORKER_REBOUND flag when the
worker doesn't have the WORKER_UNBOUND flag in the CPU_DOWN_FAILED
path.

Change-Id: I2528e9154f4913d9ec14b63adbcbcd1eaa8a8452
Signed-off-by: Se Wang (Patrick) Oh
Signed-off-by: franciscofranco

workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented
workqueues

Workqueues can be performance or power-oriented.  Currently, most
workqueues are bound to the CPU they were created on.  This gives good
performance (due to cache effects) at the cost of potentially waking
up otherwise idle cores (idle from the scheduler's perspective, which
may or may not be physically idle) just to process some work.  To save
power, we can allow the work to be rescheduled on a core that is
already awake.

Workqueues created with the WQ_UNBOUND flag will allow some power
savings.  However, we don't change the default behaviour of the
system.  To enable power-saving behaviour, a new config option
CONFIG_WQ_POWER_EFFICIENT needs to be turned on.  This option can also
be overridden by the workqueue.power_efficient boot parameter.

tj: Updated config description and comments.  Renamed
    CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

Signed-off-by: Viresh Kumar
Reviewed-by: Amit Kucheria
Signed-off-by: Tejun Heo
Signed-off-by: Francisco Franco

workqueue: Add system wide power_efficient workqueues

This patch adds system wide workqueues aligned towards power saving.
This is done by allocating them with the WQ_UNBOUND flag if
'wq_power_efficient' is set to 'true'.

tj: updated comments a bit.

Signed-off-by: Viresh Kumar
Signed-off-by: Tejun Heo
Signed-off-by: Francisco Franco
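To make the new flag concrete, here is a hypothetical driver-side
sketch; the workqueue name "mydrv_power_efficient" and the work item
mydrv_work are invented for illustration:

	/*
	 * Behaves like a normal per-cpu workqueue by default; becomes
	 * unbound when booted with workqueue.power_efficient=1 or when
	 * CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y.
	 */
	struct workqueue_struct *wq;

	wq = alloc_workqueue("mydrv_power_efficient", WQ_POWER_EFFICIENT, 0);
	if (!wq)
		return -ENOMEM;
	queue_work(wq, &mydrv_work);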
firmware: use power efficient workqueue for unloading and aborting fw
load

Allow the scheduler to select the most appropriate CPU for running the
firmware load timeout routine and the delayed routine for firmware
unload.  This extends idle residency times and conserves power.  This
functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Ming Lei
Cc: Greg Kroah-Hartman
Signed-off-by: Shaibal Dutta
[zoran.markovic@linaro.org: Rebased to latest kernel, added commit
message.  Fixed code alignment.]
Signed-off-by: Zoran Markovic
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Francisco Franco

net: wireless: move regulatory timeout work to power efficient
workqueue

For better use of CPU idle time, allow the scheduler to select the CPU
on which the timeout work of regulatory settings would be executed.
This extends CPU idle residency time and saves power.  This
functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: "John W. Linville"
Cc: "David S. Miller"
Signed-off-by: Shaibal Dutta
[zoran.markovic@linaro.org: Rebased to latest kernel.  Added commit
message.]
Signed-off-by: Zoran Markovic
Signed-off-by: Johannes Berg
Signed-off-by: Francisco Franco

rcu: Move SRCU grace period work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU
on which the SRCU grace period work would be scheduled.  This improves
idle residency time and conserves power.  This functionality is
enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Lai Jiangshan
Cc: "Paul E. McKenney"
Cc: Dipankar Sarma
Signed-off-by: Shaibal Dutta
[zoran.markovic@linaro.org: Rebased to latest kernel version.  Added
commit message.  Fixed code alignment.]
Signed-off-by: Zoran Markovic
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett
Signed-off-by: Francisco Franco

net/ipv4: queue work on power efficient wq

The workqueue used in the ipv4 layer has no real dependency on being
scheduled on the cpu which scheduled it.  On an idle system, it is
observed that an idle cpu wakes up many times just to service this
work.  It would be better if we could schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions.
This doesn't change the existing behavior of the code unless
CONFIG_WQ_POWER_EFFICIENT is enabled.

Signed-off-by: Viresh Kumar
Signed-off-by: David S. Miller
Signed-off-by: Francisco Franco

block: queue work on power efficient wq

The block layer uses workqueues for multiple purposes.  There is no
real dependency on scheduling these on the cpu which scheduled them.
On an idle system, it is observed that an idle cpu wakes up many times
just to service this work.  It would be better if we could schedule it
on a cpu which the scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions.

Cc: Jens Axboe
Signed-off-by: Viresh Kumar
Signed-off-by: Tejun Heo
Signed-off-by: Francisco Franco

PHYLIB: queue work on system_power_efficient_wq

Phylib uses workqueues for multiple purposes.  There is no real
dependency on scheduling these on the cpu which scheduled them.  On an
idle system, it is observed that an idle cpu wakes up many times just
to service this work.  It would be better if we could schedule it on a
cpu which the scheduler believes to be the most appropriate one.

This patch replaces system_wq with system_power_efficient_wq for
PHYLIB.

Cc: David S. Miller
Cc: netdev@vger.kernel.org
Signed-off-by: Viresh Kumar
Acked-by: David S. Miller
Signed-off-by: Tejun Heo
Signed-off-by: Francisco Franco

ASoC: pcm: Use the power efficient workqueue for delayed powerdown

There is no need to use a normal per-CPU workqueue for delayed power
downs as they're not timing or performance critical and waking up a
core for them would defeat some of the point.

Signed-off-by: Mark Brown
Reviewed-by: Viresh Kumar
Signed-off-by: Francisco Franco

ASoC: compress: Use power efficient workqueue

There is no need for the power down work to be done on a per-CPU
workqueue, especially considering the fairly long delay before
powerdown.

Signed-off-by: Mark Brown
Acked-by: Vinod Koul
Signed-off-by: Francisco Franco

ASoC: jack: Use power efficient workqueue

The accessory detect debounce work is not performance sensitive, so
let the scheduler run it wherever is most efficient rather than in a
per-CPU workqueue, by using the system power efficient workqueue.

Signed-off-by: Mark Brown
Acked-by: Viresh Kumar
Signed-off-by: Francisco Franco

net/neighbour: queue work on power efficient wq

The workqueue used in the neighbour layer has no real dependency on
being scheduled on the cpu which scheduled it.  On an idle system, it
is observed that an idle cpu wakes up many times just to service this
work.  It would be better if we could schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions.
This doesn't change the existing behavior of the code unless
CONFIG_WQ_POWER_EFFICIENT is enabled.

Signed-off-by: Viresh Kumar
Signed-off-by: David S. Miller
Signed-off-by: Francisco Franco

timekeeping: Move clock sync work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU
on which the CMOS clock sync work would be scheduled.  This improves
idle residency time and conserves power.  This functionality is
enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Shaibal Dutta
[zoran.markovic@linaro.org: Added commit message.  Aligned code.]
Signed-off-by: Zoran Markovic
Cc: John Stultz
Link: http://lkml.kernel.org/r/1391195904-12497-1-git-send-email-zoran.markovic@linaro.org
Signed-off-by: Thomas Gleixner
Signed-off-by: Francisco Franco
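All of the subsystem conversions above and below follow the same
mechanical pattern, sketched here with a hypothetical dwork and
delay_ms:

	/* before: always queued on system_wq, bound to the local CPU */
	schedule_delayed_work(&dwork, msecs_to_jiffies(delay_ms));

	/*
	 * after: identical behaviour by default; if
	 * workqueue.power_efficient is set, the wq is unbound and the
	 * scheduler may run the work on an already-awake CPU.
	 */
	queue_delayed_work(system_power_efficient_wq, &dwork,
			   msecs_to_jiffies(delay_ms));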
net: rfkill: move poll work to power efficient workqueue

This patch moves the rfkill poll_work to the power efficient
workqueue.  This work does not have to be bound to the CPU that
scheduled it, hence the selection of the CPU that executes it is left
to the scheduler.  The net result is that CPU idle times are extended,
resulting in power savings.  This behaviour is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: "John W. Linville"
Cc: "David S. Miller"
Signed-off-by: Shaibal Dutta
[zoran.markovic@linaro.org: Rebased to latest kernel, added commit
message.  Fixed workqueue selection after suspend/resume cycle.]
Signed-off-by: Zoran Markovic
Signed-off-by: Johannes Berg
Signed-off-by: Francisco Franco

usb: move hub init and LED blink work to power efficient workqueue

Allow the scheduler to select the best CPU to handle hub
initialization and LED blinking work.  This extends idle residency
times on idle CPUs and conserves power.  This functionality is enabled
when CONFIG_WQ_POWER_EFFICIENT is selected.

[zoran.markovic@linaro.org: Rebased to latest kernel.  Added commit
message.  Changed reference from system to power efficient workqueue
for LEDs in check_highspeed() and hub_port_connect_change().]

Acked-by: Alan Stern
Cc: Sarah Sharp
Cc: Xenia Ragiadakou
Cc: Julius Werner
Cc: Krzysztof Mazur
Cc: Matthias Beyer
Cc: Dan Williams
Cc: Mathias Nyman
Cc: Thomas Pugliese
Signed-off-by: Shaibal Dutta
Signed-off-by: Zoran Markovic
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Francisco Franco

block: remove WQ_POWER_EFFICIENT from kblockd

blk-mq issues async requests through kblockd.  To issue a work request
on a specific CPU, kblockd_schedule_delayed_work_on() is used.
However, the specific CPU choice may not be honored if the
power_efficient option for workqueues is set.  blk-mq requires strict
per-cpu scheduling, so it won't work properly if kblockd is marked
POWER_EFFICIENT and power_efficient is set.

Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
This essentially reverts part of commit 695588f9454b, which added the
WQ_POWER_EFFICIENT marker to kblockd.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
Signed-off-by: Francisco Franco

regulator: core: Use the power efficient workqueue for delayed
powerdown

There is no need to use a normal per-CPU workqueue for delayed power
downs as they're not timing or performance critical and waking up a
core for them would defeat some of the point.

Signed-off-by: Mark Brown
Reviewed-by: Viresh Kumar
Acked-by: Liam Girdwood
Signed-off-by: Francisco Franco

power: smb135x: queue work on system_power_efficient_wq

There doesn't seem to be any real dependency on scheduling these on
the cpu which scheduled them, so moving every *_delayed work to the
power efficient wq saves potentially needless idle cpu wake-ups,
leaving the scheduler to decide the most appropriate cpus to wake up.

Signed-off-by: Francisco Franco
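Why kblockd specifically cannot tolerate the flag: a sketch of the
guarantee blk-mq relies on.  kblockd_schedule_delayed_work_on() is the
real block-layer helper; the body shown is a condensed illustration,
not a verbatim copy:

	/*
	 * blk-mq expects this to run dwork on exactly @cpu.  On an
	 * unbound (power-efficient) workqueue the requested CPU is only
	 * a preference, so the strict-placement guarantee would be lost;
	 * hence kblockd must stay a regular per-cpu workqueue.
	 */
	int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
					     unsigned long delay)
	{
		return queue_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
	}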
---
 Documentation/kernel-parameters.txt |  15 +++
 block/blk-ioc.c                     |   3 +-
 block/genhd.c                       |  12 ++-
 drivers/base/firmware_class.c       |   7 +-
 drivers/net/phy/phy.c               |   9 +-
 drivers/power/smb135x-charger.c     |  50 +++++----
 drivers/regulator/core.c            |   5 +-
 drivers/usb/core/hub.c              |  19 ++--
 include/linux/workqueue.h           |  38 ++++++-
 kernel/power/Kconfig                |  20 ++++
 kernel/srcu.c                       |   5 +-
 kernel/time/ntp.c                   |   5 +-
 kernel/workqueue.c                  | 158 +++++++++++++++++++---------
 net/core/neighbour.c                |   5 +-
 net/ipv4/devinet.c                  |  10 +-
 net/rfkill/core.c                   |   9 +-
 net/wireless/reg.c                  |   7 +-
 sound/soc/soc-compress.c            |   5 +-
 sound/soc/soc-jack.c                |   2 +-
 sound/soc/soc-pcm.c                 |   5 +-
 20 files changed, 280 insertions(+), 109 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 56e972a8d48b..1e59d1eb37a2 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3349,6 +3349,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that this also can be controlled per-workqueue for
 			workqueues visible under /sys/bus/workqueue/.
 
+	workqueue.power_efficient
+			Per-cpu workqueues are generally preferred because
+			they show better performance thanks to cache
+			locality; unfortunately, per-cpu workqueues tend to
+			be more power hungry than unbound workqueues.
+
+			Enabling this makes the per-cpu workqueues which
+			were observed to contribute significantly to power
+			consumption unbound, leading to measurably lower
+			power usage at the cost of small performance
+			overhead.
+
+			The default value of this parameter is determined by
+			the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
+
 	x2apic_phys	[X86-64,APIC] Use x2apic physical mode instead of
 			default x2apic cluster mode on platforms
 			supporting x2apic.
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 9c4bb8266bc8..4464c823cff2 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -144,7 +144,8 @@ void put_io_context(struct io_context *ioc)
 	if (atomic_long_dec_and_test(&ioc->refcount)) {
 		spin_lock_irqsave(&ioc->lock, flags);
 		if (!hlist_empty(&ioc->icq_list))
-			schedule_work(&ioc->release_work);
+			queue_work(system_power_efficient_wq,
+				   &ioc->release_work);
 		else
 			free_ioc = true;
 		spin_unlock_irqrestore(&ioc->lock, flags);
diff --git a/block/genhd.c b/block/genhd.c
index 6f612a747810..6acda145311f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1506,9 +1506,11 @@ static void __disk_unblock_events(struct gendisk *disk, bool check_now)
 	intv = disk_events_poll_jiffies(disk);
 	set_timer_slack(&ev->dwork.timer, intv / 4);
 	if (check_now)
-		queue_delayed_work(system_freezable_wq, &ev->dwork, 0);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				   &ev->dwork, 0);
 	else if (intv)
-		queue_delayed_work(system_freezable_wq, &ev->dwork, intv);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				   &ev->dwork, intv);
 out_unlock:
 	spin_unlock_irqrestore(&ev->lock, flags);
 }
@@ -1551,7 +1553,8 @@ void disk_flush_events(struct gendisk *disk, unsigned int mask)
 	spin_lock_irq(&ev->lock);
 	ev->clearing |= mask;
 	if (!ev->block)
-		mod_delayed_work(system_freezable_wq, &ev->dwork, 0);
+		mod_delayed_work(system_freezable_power_efficient_wq,
+				 &ev->dwork, 0);
 	spin_unlock_irq(&ev->lock);
 }
@@ -1644,7 +1647,8 @@ static void disk_check_events(struct disk_events *ev,
 	intv = disk_events_poll_jiffies(disk);
 	if (!ev->block && intv)
-		queue_delayed_work(system_freezable_wq, &ev->dwork, intv);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				   &ev->dwork, intv);
 
 	spin_unlock_irq(&ev->lock);
diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index f0493028fc75..cc0c3b1491d9 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -1006,7 +1006,8 @@ static int _request_firmware_load(struct firmware_priv *fw_priv, bool uevent,
 		dev_set_uevent_suppress(f_dev, false);
 		dev_dbg(f_dev, "firmware: requesting %s\n", buf->fw_id);
 		if (timeout != MAX_SCHEDULE_TIMEOUT)
-			schedule_delayed_work(&fw_priv->timeout_work, timeout);
+			queue_delayed_work(system_power_efficient_wq,
+					   &fw_priv->timeout_work, timeout);
 
 		kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD);
 	}
@@ -1699,8 +1700,8 @@ static void device_uncache_fw_images_work(struct work_struct *work)
  */
 static void device_uncache_fw_images_delay(unsigned long delay)
 {
-	schedule_delayed_work(&fw_cache.work,
-			      msecs_to_jiffies(delay));
+	queue_delayed_work(system_power_efficient_wq, &fw_cache.work,
+			   msecs_to_jiffies(delay));
 }
 
 static int fw_pm_notify(struct notifier_block *notify_block,
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 38f0b312ff85..663d2d0448b7 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -439,7 +439,7 @@ void phy_start_machine(struct phy_device *phydev,
 	phydev->adjust_state = handler;
 
-	schedule_delayed_work(&phydev->state_queue, HZ);
+	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, HZ);
 }
 
 /**
@@ -500,7 +500,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
 	disable_irq_nosync(irq);
 	atomic_inc(&phydev->irq_disable);
 
-	schedule_work(&phydev->phy_queue);
+	queue_work(system_power_efficient_wq, &phydev->phy_queue);
 
 	return IRQ_HANDLED;
 }
@@ -655,7 +655,7 @@ static void phy_change(struct work_struct *work)
 	/* reschedule state queue work to run as soon as possible */
 	cancel_delayed_work_sync(&phydev->state_queue);
-	schedule_delayed_work(&phydev->state_queue, 0);
+	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
 
 	return;
@@ -918,7 +918,8 @@ void phy_state_machine(struct work_struct *work)
 	if (err < 0)
 		phy_error(phydev);
 
-	schedule_delayed_work(&phydev->state_queue, PHY_STATE_TIME * HZ);
+	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
+			   PHY_STATE_TIME * HZ);
 }
 
 static inline void mmd_phy_indirect(struct mii_bus *bus, int prtad, int devad,
diff --git a/drivers/power/smb135x-charger.c b/drivers/power/smb135x-charger.c
index 4048c0a4b3a1..226e3d5756c5 100644
--- a/drivers/power/smb135x-charger.c
+++ b/drivers/power/smb135x-charger.c
@@ -1911,7 +1911,8 @@ static int smb135x_battery_set_property(struct power_supply *psy,
 		smb_stay_awake(&chip->smb_wake_source);
 		chip->bms_check = 1;
 		cancel_delayed_work(&chip->heartbeat_work);
-		schedule_delayed_work(&chip->heartbeat_work,
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->heartbeat_work,
 				   msecs_to_jiffies(0));
 		break;
 	case POWER_SUPPLY_PROP_HEALTH:
@@ -1921,7 +1922,8 @@ static int smb135x_battery_set_property(struct power_supply *psy,
 		smb135x_set_chrg_path_temp(chip);
 		chip->temp_check = 1;
 		cancel_delayed_work(&chip->heartbeat_work);
-		schedule_delayed_work(&chip->heartbeat_work,
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->heartbeat_work,
 				   msecs_to_jiffies(0));
 		break;
 	/* Block from Fuel Gauge */
@@ -2598,7 +2600,8 @@ static void aicl_check_work(struct work_struct *work)
 		chip->aicl_weak_detect = true;
 
 	cancel_delayed_work(&chip->src_removal_work);
-	schedule_delayed_work(&chip->src_removal_work,
+	queue_delayed_work(system_power_efficient_wq,
+			   &chip->src_removal_work,
 			   msecs_to_jiffies(3000));
 	if (!rc) {
 		dev_dbg(chip->dev, "Reached Bottom IC!\n");
@@ -2678,8 +2681,9 @@ static void rate_check_work(struct work_struct *work)
 	chip->rate_check_count++;
 	if (chip->rate_check_count < 6)
-		schedule_delayed_work(&chip->rate_check_work,
-				      msecs_to_jiffies(500));
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->rate_check_work,
+				   msecs_to_jiffies(500));
 }
 
 static void usb_insertion_work(struct work_struct *work)
@@ -2734,8 +2738,9 @@ static void heartbeat_work(struct work_struct *work)
 	    smb135x_get_prop_batt_health(chip, &batt_health)) {
 		dev_warn(chip->dev, "HB Failed to run resume = %d!\n",
 			 (int)chip->resume_completed);
-		schedule_delayed_work(&chip->heartbeat_work,
-				      msecs_to_jiffies(1000));
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->heartbeat_work,
+				   msecs_to_jiffies(1000));
 		return;
 	}
@@ -2833,8 +2838,9 @@ static void heartbeat_work(struct work_struct *work)
 	power_supply_changed(&chip->batt_psy);
 
-	schedule_delayed_work(&chip->heartbeat_work,
-			      msecs_to_jiffies(60000));
+	queue_delayed_work(system_power_efficient_wq,
+			   &chip->heartbeat_work,
+			   msecs_to_jiffies(60000));
 	chip->hb_running = false;
 	if (!usb_present && !dc_present)
 		smb_relax(&chip->smb_wake_source);
@@ -2945,7 +2951,8 @@ static int otg_oc_handler(struct smb135x_chg *chip, u8 rt_stat)
 		return 0;
 	}
 
-	schedule_delayed_work(&chip->ocp_clear_work,
+	queue_delayed_work(system_power_efficient_wq,
+			   &chip->ocp_clear_work,
 			   msecs_to_jiffies(0));
 
 	pr_err("rt_stat = 0x%02x\n", rt_stat);
@@ -2970,7 +2977,8 @@ static int handle_dc_removal(struct smb135x_chg *chip)
 static int handle_dc_insertion(struct smb135x_chg *chip)
 {
 	if (chip->dc_psy_type == POWER_SUPPLY_TYPE_WIRELESS)
-		schedule_delayed_work(&chip->wireless_insertion_work,
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->wireless_insertion_work,
 				   msecs_to_jiffies(DCIN_UNSUSPEND_DELAY_MS));
 	if (chip->dc_psy_type != -EINVAL)
 		power_supply_set_online(&chip->dc_psy,
@@ -3121,8 +3129,9 @@ static int handle_usb_insertion(struct smb135x_chg *chip)
 		smb_stay_awake(&chip->smb_wake_source);
 		chip->apsd_rerun_cnt++;
 		chip->usb_present = 0;
-		schedule_delayed_work(&chip->usb_insertion_work,
-				      msecs_to_jiffies(1000));
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->usb_insertion_work,
+				   msecs_to_jiffies(1000));
 		return 0;
 	}
@@ -3162,8 +3171,9 @@ static int handle_usb_insertion(struct smb135x_chg *chip)
 	chip->charger_rate = POWER_SUPPLY_CHARGE_RATE_NORMAL;
 	chip->rate_check_count = 0;
 	cancel_delayed_work(&chip->rate_check_work);
-	schedule_delayed_work(&chip->rate_check_work,
-			      msecs_to_jiffies(500));
+	queue_delayed_work(system_power_efficient_wq,
+			   &chip->rate_check_work,
+			   msecs_to_jiffies(500));
 
 	return 0;
 }
@@ -3204,8 +3214,9 @@ static int usbin_uv_handler(struct smb135x_chg *chip, u8 rt_stat)
 		if (rc < 0)
 			pr_err("Failed to Disable USBIN UV IRQ\n");
 		cancel_delayed_work(&chip->aicl_check_work);
-		schedule_delayed_work(&chip->aicl_check_work,
-				      msecs_to_jiffies(0));
+		queue_delayed_work(system_power_efficient_wq,
+				   &chip->aicl_check_work,
+				   msecs_to_jiffies(0));
 	}
 	return 0;
 }
@@ -5435,8 +5446,9 @@ static int smb135x_charger_probe(struct i2c_client *client,
 	if (rc < 0)
 		pr_err("failed to set up voltage notifications: %d\n", rc);
 
-	schedule_delayed_work(&chip->heartbeat_work,
-			      msecs_to_jiffies(60000));
+	queue_delayed_work(system_power_efficient_wq,
+			   &chip->heartbeat_work,
+			   msecs_to_jiffies(60000));
 
 	dev_info(chip->dev, "SMB135X version = %s revision = %s successfully probed batt=%d dc = %d usb = %d\n",
 		 version_str[chip->version],
diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index bdc00a196659..7423b8335622 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -1925,8 +1925,9 @@ int regulator_disable_deferred(struct regulator *regulator, int ms)
 	rdev->deferred_disables++;
 	mutex_unlock(&rdev->mutex);
 
-	ret = schedule_delayed_work(&rdev->disable_work,
-				    msecs_to_jiffies(ms));
+	ret = queue_delayed_work(system_power_efficient_wq,
+				 &rdev->disable_work,
+				 msecs_to_jiffies(ms));
 	if (ret < 0)
 		return ret;
 	else
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 96c82a98cd50..b05d7fb135ac 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -506,7 +506,8 @@ static void led_work (struct work_struct *work)
 		changed++;
 	}
 	if (changed)
-		schedule_delayed_work(&hub->leds, LED_CYCLE_PERIOD);
+		queue_delayed_work(system_power_efficient_wq,
+				   &hub->leds, LED_CYCLE_PERIOD);
 }
 
 /* use a short timeout for hub/port status fetches */
@@ -1057,7 +1058,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
 			goto init2;
 #endif
 		PREPARE_DELAYED_WORK(&hub->init_work, hub_init_func2);
-		schedule_delayed_work(&hub->init_work,
+		queue_delayed_work(system_power_efficient_wq,
+				   &hub->init_work,
 				   msecs_to_jiffies(delay));
 
 		/* Suppress autosuspend until init is done */
@@ -1218,7 +1220,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
 		/* Don't do a long sleep inside a workqueue routine */
 		if (type == HUB_INIT2) {
 			PREPARE_DELAYED_WORK(&hub->init_work, hub_init_func3);
-			schedule_delayed_work(&hub->init_work,
+			queue_delayed_work(system_power_efficient_wq,
+					   &hub->init_work,
 					   msecs_to_jiffies(delay));
 			device_unlock(hub->intfdev);
 			return;		/* Continues at init3: below */
@@ -1233,7 +1236,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
 	if (status < 0)
 		dev_err(hub->intfdev, "activate --> %d\n", status);
 	if (hub->has_indicators && blinkenlights)
-		schedule_delayed_work(&hub->leds, LED_CYCLE_PERIOD);
+		queue_delayed_work(system_power_efficient_wq,
+				   &hub->leds, LED_CYCLE_PERIOD);
 
 	/* Scan all ports that need attention */
 	kick_khubd(hub);
@@ -4396,7 +4400,8 @@ check_highspeed (struct usb_hub *hub, struct usb_device *udev, int port1)
 		/* hub LEDs are probably harder to miss than syslog */
 		if (hub->has_indicators) {
 			hub->indicator[port1-1] = INDICATOR_GREEN_BLINK;
-			schedule_delayed_work (&hub->leds, 0);
+			queue_delayed_work(system_power_efficient_wq,
+					   &hub->leds, 0);
 		}
 	}
 	kfree(qual);
@@ -4626,7 +4631,9 @@ static void hub_port_connect_change(struct usb_hub *hub, int port1,
 		if (hub->has_indicators) {
 			hub->indicator[port1-1] = INDICATOR_AMBER_BLINK;
-			schedule_delayed_work (&hub->leds, 0);
+			queue_delayed_work(
+				system_power_efficient_wq,
+				&hub->leds, 0);
 		}
 		status = -ENOTCONN;	/* Don't retry */
 		goto loop_disable;
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 623488fdc1f5..33cbe000424b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -71,7 +71,8 @@ enum {
 	/* data contains off-queue information when !WORK_STRUCT_PWQ */
 	WORK_OFFQ_FLAG_BASE	= WORK_STRUCT_COLOR_SHIFT,
 
-	WORK_OFFQ_CANCELING	= (1 << WORK_OFFQ_FLAG_BASE),
+	__WORK_OFFQ_CANCELING	= WORK_OFFQ_FLAG_BASE,
+	WORK_OFFQ_CANCELING	= (1 << __WORK_OFFQ_CANCELING),
 
 	/*
 	 * When a work item is off queue, its high bits point to the last
@@ -303,6 +304,33 @@ enum {
 	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu instensive workqueue */
 	WQ_SYSFS		= 1 << 6, /* visible in sysfs, see wq_sysfs_register() */
 
+	/*
+	 * Per-cpu workqueues are generally preferred because they tend to
+	 * show better performance thanks to cache locality.  Per-cpu
+	 * workqueues exclude the scheduler from choosing the CPU to
+	 * execute the worker threads, which has an unfortunate side effect
+	 * of increasing power consumption.
+	 *
+	 * The scheduler considers a CPU idle if it doesn't have any task
+	 * to execute and tries to keep idle cores idle to conserve power;
+	 * however, for example, a per-cpu work item scheduled from an
+	 * interrupt handler on an idle CPU will force the scheduler to
+	 * execute the work item on that CPU breaking the idleness, which in
+	 * turn may lead to more scheduling choices which are sub-optimal
+	 * in terms of power consumption.
+	 *
+	 * Workqueues marked with WQ_POWER_EFFICIENT are per-cpu by default
+	 * but become unbound if workqueue.power_efficient kernel param is
+	 * specified.  Per-cpu workqueues which are identified to
+	 * contribute significantly to power-consumption are identified and
+	 * marked with this flag and enabling the power_efficient mode
+	 * leads to noticeable power saving at the cost of small
+	 * performance disadvantage.
+	 *
+	 * http://thread.gmane.org/gmane.linux.kernel/1480396
+	 */
+	WQ_POWER_EFFICIENT	= 1 << 7,
+
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 
@@ -333,11 +361,19 @@ enum {
  *
  * system_freezable_wq is equivalent to system_wq except that it's
  * freezable.
+ *
+ * *_power_efficient_wq are inclined towards saving power and converted
+ * into WQ_UNBOUND variants if 'wq_power_efficient' is enabled; otherwise,
+ * they are same as their non-power-efficient counterparts - e.g.
+ * system_power_efficient_wq is identical to system_wq if
+ * 'wq_power_efficient' is disabled.  See WQ_POWER_EFFICIENT for more info.
  */
 extern struct workqueue_struct *system_wq;
 extern struct workqueue_struct *system_long_wq;
 extern struct workqueue_struct *system_unbound_wq;
 extern struct workqueue_struct *system_freezable_wq;
+extern struct workqueue_struct *system_power_efficient_wq;
+extern struct workqueue_struct *system_freezable_power_efficient_wq;
 
 static inline struct workqueue_struct * __deprecated __system_nrt_wq(void)
 {
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index c584a95ae531..4b8be16f853f 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -271,6 +271,26 @@ config PM_GENERIC_DOMAINS
 	bool
 	depends on PM
 
+config WQ_POWER_EFFICIENT_DEFAULT
+	bool "Enable workqueue power-efficient mode by default"
+	depends on PM
+	default n
+	help
+	  Per-cpu workqueues are generally preferred because they show
+	  better performance thanks to cache locality; unfortunately,
+	  per-cpu workqueues tend to be more power hungry than unbound
+	  workqueues.
+
+	  Enabling workqueue.power_efficient kernel parameter makes the
+	  per-cpu workqueues which were observed to contribute
+	  significantly to power consumption unbound, leading to measurably
+	  lower power usage at the cost of small performance overhead.
+
+	  This config option determines whether workqueue.power_efficient
+	  is enabled by default.
+
+	  If in doubt, say N.
+
 config PM_GENERIC_DOMAINS_SLEEP
 	def_bool y
 	depends on PM_SLEEP && PM_GENERIC_DOMAINS
diff --git a/kernel/srcu.c b/kernel/srcu.c
index 01d5ccb8bfe3..27b654950cc8 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -375,7 +375,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
 	rcu_batch_queue(&sp->batch_queue, head);
 	if (!sp->running) {
 		sp->running = true;
-		schedule_delayed_work(&sp->work, 0);
+		queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
 	}
 	spin_unlock_irqrestore(&sp->queue_lock, flags);
 }
@@ -631,7 +631,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
 	}
 
 	if (pending)
-		schedule_delayed_work(&sp->work, SRCU_INTERVAL);
+		queue_delayed_work(system_power_efficient_wq,
+				   &sp->work, SRCU_INTERVAL);
 }
 
 /*
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index af8d1d4f3d55..419a52cecd20 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -514,12 +514,13 @@ static void sync_cmos_clock(struct work_struct *work)
 		next.tv_sec++;
 		next.tv_nsec -= NSEC_PER_SEC;
 	}
-	schedule_delayed_work(&sync_cmos_work, timespec_to_jiffies(&next));
+	queue_delayed_work(system_power_efficient_wq,
+			   &sync_cmos_work, timespec_to_jiffies(&next));
 }
 
 void ntp_notify_cmos_timer(void)
 {
-	schedule_delayed_work(&sync_cmos_work, 0);
+	queue_delayed_work(system_power_efficient_wq, &sync_cmos_work, 0);
 }
 
 #else
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4611cb80e1a4..70a526972c61 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -273,6 +273,15 @@ static cpumask_var_t *wq_numa_possible_cpumask;
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
 
+/* see the comment above the definition of WQ_POWER_EFFICIENT */
+#ifdef CONFIG_WQ_POWER_EFFICIENT_DEFAULT
+static bool wq_power_efficient = true;
+#else
+static bool wq_power_efficient;
+#endif
+
+module_param_named(power_efficient, wq_power_efficient, bool, 0444);
+
 static bool wq_numa_enabled;		/* unbound NUMA affinity enabled */
 
 /* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
@@ -309,6 +318,10 @@ struct workqueue_struct *system_unbound_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_unbound_wq);
 struct workqueue_struct *system_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_freezable_wq);
+struct workqueue_struct *system_power_efficient_wq __read_mostly;
+EXPORT_SYMBOL_GPL(system_power_efficient_wq);
+struct workqueue_struct *system_freezable_power_efficient_wq __read_mostly;
+EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);
 
 static int worker_thread(void *__worker);
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
@@ -1451,13 +1464,13 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	timer_stats_timer_set_start_info(&dwork->timer);
 
 	dwork->wq = wq;
+	/* timer isn't guaranteed to run in this cpu, record earlier */
+	if (cpu == WORK_CPU_UNBOUND)
+		cpu = raw_smp_processor_id();
 	dwork->cpu = cpu;
 	timer->expires = jiffies + delay;
 
-	if (unlikely(cpu != WORK_CPU_UNBOUND))
-		add_timer_on(timer, cpu);
-	else
-		add_timer(timer);
+	add_timer_on(timer, cpu);
 }
@@ -1929,17 +1942,13 @@ static void pool_mayday_timeout(unsigned long __pool)
  * spin_lock_irq(pool->lock) which may be released and regrabbed
  * multiple times.  Does GFP_KERNEL allocations.  Called only from
  * manager.
- *
- * RETURNS:
- * %false if no action was taken and pool->lock stayed locked, %true
- * otherwise.
  */
-static bool maybe_create_worker(struct worker_pool *pool)
+static void maybe_create_worker(struct worker_pool *pool)
 __releases(&pool->lock)
 __acquires(&pool->lock)
 {
 	if (!need_to_create_worker(pool))
-		return false;
+		return;
 restart:
 	spin_unlock_irq(&pool->lock);
@@ -1956,7 +1965,7 @@ __acquires(&pool->lock)
 		start_worker(worker);
 		if (WARN_ON_ONCE(need_to_create_worker(pool)))
 			goto restart;
-		return true;
+		return;
 	}
 
 	if (!need_to_create_worker(pool))
@@ -1973,7 +1982,7 @@ __acquires(&pool->lock)
 	spin_lock_irq(&pool->lock);
 	if (need_to_create_worker(pool))
 		goto restart;
-	return true;
+	return;
 }
 
 /**
@@ -1986,15 +1995,9 @@ __acquires(&pool->lock)
 * LOCKING:
 * spin_lock_irq(pool->lock) which may be released and regrabbed
 * multiple times.  Called only from manager.
- *
- * RETURNS:
- * %false if no action was taken and pool->lock stayed locked, %true
- * otherwise.
 */
-static bool maybe_destroy_workers(struct worker_pool *pool)
+static void maybe_destroy_workers(struct worker_pool *pool)
 {
-	bool ret = false;
-
 	while (too_many_workers(pool)) {
 		struct worker *worker;
 		unsigned long expires;
@@ -2008,10 +2011,7 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
 		}
 
 		destroy_worker(worker);
-		ret = true;
 	}
-
-	return ret;
 }
 
 /**
@@ -2031,13 +2031,14 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
 * multiple times.  Does GFP_KERNEL allocations.
 *
 * RETURNS:
- * spin_lock_irq(pool->lock) which may be released and regrabbed
- * multiple times.  Does GFP_KERNEL allocations.
+ * %false if the pool doesn't need management and the caller can safely
+ * start processing works, %true if management function was performed and
+ * the conditions that the caller verified before calling the function may
+ * no longer be true.
 */
 static bool manage_workers(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
-	bool ret = false;
 
 	/*
 	 * Managership is governed by two mutexes - manager_arb and
@@ -2061,7 +2062,7 @@ static bool manage_workers(struct worker *worker)
 	 * manager_mutex.
 	 */
 	if (!mutex_trylock(&pool->manager_arb))
-		return ret;
+		return false;
 
 	/*
 	 * With manager arbitration won, manager_mutex would be free in
@@ -2071,7 +2072,6 @@ static bool manage_workers(struct worker *worker)
 		spin_unlock_irq(&pool->lock);
 		mutex_lock(&pool->manager_mutex);
 		spin_lock_irq(&pool->lock);
-		ret = true;
 	}
 
 	pool->flags &= ~POOL_MANAGE_WORKERS;
@@ -2080,12 +2080,12 @@ static bool manage_workers(struct worker *worker)
 	 * Destroy and then create so that may_start_working() is true
 	 * on return.
 	 */
-	ret |= maybe_destroy_workers(pool);
-	ret |= maybe_create_worker(pool);
+	maybe_destroy_workers(pool);
+	maybe_create_worker(pool);
 
 	mutex_unlock(&pool->manager_mutex);
 	mutex_unlock(&pool->manager_arb);
-	return ret;
+	return true;
 }
 
 /**
@@ -2853,19 +2853,57 @@ bool flush_work(struct work_struct *work)
 }
 EXPORT_SYMBOL_GPL(flush_work);
 
+struct cwt_wait {
+	wait_queue_t		wait;
+	struct work_struct	*work;
+};
+
+static int cwt_wakefn(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct cwt_wait *cwait = container_of(wait, struct cwt_wait, wait);
+
+	if (cwait->work != key)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, key);
+}
+
 static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 {
+	static DECLARE_WAIT_QUEUE_HEAD(cancel_waitq);
 	unsigned long flags;
 	int ret;
 
 	do {
 		ret = try_to_grab_pending(work, is_dwork, &flags);
 		/*
-		 * If someone else is canceling, wait for the same event it
-		 * would be waiting for before retrying.
+		 * If someone else is already canceling, wait for it to
+		 * finish.  flush_work() doesn't work for PREEMPT_NONE
+		 * because we may get scheduled between @work's completion
+		 * and the other canceling task resuming and clearing
+		 * CANCELING - flush_work() will return false immediately
+		 * as @work is no longer busy, try_to_grab_pending() will
+		 * return -ENOENT as @work is still being canceled and the
+		 * other canceling task won't be able to clear CANCELING as
+		 * we're hogging the CPU.
+		 *
+		 * Let's wait for completion using a waitqueue.  As this
+		 * may lead to the thundering herd problem, use a custom
+		 * wake function which matches @work along with exclusive
+		 * wait and wakeup.
 		 */
-		if (unlikely(ret == -ENOENT))
-			flush_work(work);
+		if (unlikely(ret == -ENOENT)) {
+			struct cwt_wait cwait;
+
+			init_wait(&cwait.wait);
+			cwait.wait.func = cwt_wakefn;
+			cwait.work = work;
+
+			prepare_to_wait_exclusive(&cancel_waitq, &cwait.wait,
+						  TASK_UNINTERRUPTIBLE);
+			if (work_is_canceling(work))
+				schedule();
+			finish_wait(&cancel_waitq, &cwait.wait);
+		}
 	} while (unlikely(ret < 0));
 
 	/* tell other tasks trying to grab @work to back off */
@@ -2874,6 +2912,16 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 	flush_work(work);
 	clear_work_data(work);
+
+	/*
+	 * Paired with prepare_to_wait() above so that either
+	 * waitqueue_active() is visible here or !work_is_canceling() is
+	 * visible there.
+	 */
+	smp_mb();
+	if (waitqueue_active(&cancel_waitq))
+		__wake_up(&cancel_waitq, TASK_NORMAL, 1, work);
+
 	return ret;
 }
@@ -4126,6 +4174,10 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 	struct workqueue_struct *wq;
 	struct pool_workqueue *pwq;
 
+	/* see the comment above the definition of WQ_POWER_EFFICIENT */
+	if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
+		flags |= WQ_UNBOUND;
+
 	/* allocate wq and format name */
 	if (flags & WQ_UNBOUND)
 		tbl_size = wq_numa_tbl_len * sizeof(wq->numa_pwq_tbl[0]);
@@ -4569,10 +4621,13 @@ static void wq_unbind_fn(struct work_struct *work)
 /**
  * rebind_workers - rebind all workers of a pool to the associated CPU
  * @pool: pool of interest
+ * @force: if it is true, replace WORKER_UNBOUND with WORKER_REBOUND
+ * irrespective of flags of workers. Otherwise, replace the flags only
+ * when workers have WORKER_UNBOUND flag.
  *
  * @pool->cpu is coming online.  Rebind all workers to the CPU.
  */
-static void rebind_workers(struct worker_pool *pool)
+static void rebind_workers(struct worker_pool *pool, bool force)
 {
 	struct worker *worker;
 	int wi;
@@ -4591,6 +4646,7 @@ static void rebind_workers(struct worker_pool *pool)
 						  pool->attrs->cpumask) < 0);
 
 	spin_lock_irq(&pool->lock);
+	pool->flags &= ~POOL_DISASSOCIATED;
 
 	for_each_pool_worker(worker, wi, pool) {
 		unsigned int worker_flags = worker->flags;
@@ -4621,10 +4677,12 @@ static void rebind_workers(struct worker_pool *pool)
 		 * fail incorrectly leading to premature concurrency
 		 * management operations.
 		 */
-		WARN_ON_ONCE(!(worker_flags & WORKER_UNBOUND));
-		worker_flags |= WORKER_REBOUND;
-		worker_flags &= ~WORKER_UNBOUND;
-		ACCESS_ONCE(worker->flags) = worker_flags;
+		if (force || (worker_flags & WORKER_UNBOUND)) {
+			WARN_ON_ONCE(!(worker_flags & WORKER_UNBOUND));
+			worker_flags |= WORKER_REBOUND;
+			worker_flags &= ~WORKER_UNBOUND;
+			ACCESS_ONCE(worker->flags) = worker_flags;
+		}
 	}
 
 	spin_unlock_irq(&pool->lock);
@@ -4693,15 +4751,12 @@ static int __cpuinit workqueue_cpu_up_callback(struct notifier_block *nfb,
 		for_each_pool(pool, pi) {
 			mutex_lock(&pool->manager_mutex);
 
-			if (pool->cpu == cpu) {
-				spin_lock_irq(&pool->lock);
-				pool->flags &= ~POOL_DISASSOCIATED;
-				spin_unlock_irq(&pool->lock);
-
-				rebind_workers(pool);
-			} else if (pool->cpu < 0) {
+			if (pool->cpu == cpu)
+				rebind_workers(pool,
+					       (action & ~CPU_TASKS_FROZEN)
+					       != CPU_DOWN_FAILED);
+			else if (pool->cpu < 0)
 				restore_unbound_workers_cpumask(pool, cpu);
-			}
 
 			mutex_unlock(&pool->manager_mutex);
 		}
@@ -5035,8 +5090,15 @@ static int __init init_workqueues(void)
 					    WQ_UNBOUND_MAX_ACTIVE);
 	system_freezable_wq = alloc_workqueue("events_freezable",
 					      WQ_FREEZABLE, 0);
+	system_power_efficient_wq = alloc_workqueue("events_power_efficient",
+					      WQ_POWER_EFFICIENT, 0);
+	system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
+					      WQ_FREEZABLE | WQ_POWER_EFFICIENT,
+					      0);
 	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
-	       !system_unbound_wq || !system_freezable_wq);
+	       !system_unbound_wq || !system_freezable_wq ||
+	       !system_power_efficient_wq ||
+	       !system_freezable_power_efficient_wq);
 	return 0;
 }
 early_initcall(init_workqueues);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index cdd77161e99f..7aa5aa1f2fa9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -826,7 +826,7 @@ static void neigh_periodic_work(struct work_struct *work)
 	 * ARP entry timeouts range from 1/2 base_reachable_time to 3/2
 	 * base_reachable_time.
 	 */
-	schedule_delayed_work(&tbl->gc_work,
+	queue_delayed_work(system_power_efficient_wq, &tbl->gc_work,
 			      tbl->parms.base_reachable_time >> 1);
 	write_unlock_bh(&tbl->lock);
 }
@@ -1544,7 +1544,8 @@ static void neigh_table_init_no_netlink(struct neigh_table *tbl)
 	rwlock_init(&tbl->lock);
 	INIT_DEFERRABLE_WORK(&tbl->gc_work, neigh_periodic_work);
-	schedule_delayed_work(&tbl->gc_work, tbl->parms.reachable_time);
+	queue_delayed_work(system_power_efficient_wq, &tbl->gc_work,
+			   tbl->parms.reachable_time);
 	setup_timer(&tbl->proxy_timer, neigh_proxy_process, (unsigned long)tbl);
 	skb_queue_head_init_class(&tbl->proxy_queue,
 			&neigh_table_proxy_queue_class);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index b151e0ac7f27..5b85c29d7d98 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -469,7 +469,7 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	inet_hash_insert(dev_net(in_dev->dev), ifa);
 
 	cancel_delayed_work(&check_lifetime_work);
-	schedule_delayed_work(&check_lifetime_work, 0);
+	queue_delayed_work(system_power_efficient_wq, &check_lifetime_work, 0);
 
 	/* Send message first, then call notifier.
 	   Notifier will trigger FIB update, so that
@@ -678,7 +678,8 @@ static void check_lifetime(struct work_struct *work)
 	if (time_before(next_sched, now + ADDRCONF_TIMER_FUZZ_MAX))
 		next_sched = now + ADDRCONF_TIMER_FUZZ_MAX;
 
-	schedule_delayed_work(&check_lifetime_work, next_sched - now);
+	queue_delayed_work(system_power_efficient_wq, &check_lifetime_work,
+			   next_sched - now);
 }
 
 static void set_ifa_lifetime(struct in_ifaddr *ifa, __u32 valid_lft,
@@ -834,7 +835,8 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh)
 		ifa = ifa_existing;
 		set_ifa_lifetime(ifa, valid_lft, prefered_lft);
 		cancel_delayed_work(&check_lifetime_work);
-		schedule_delayed_work(&check_lifetime_work, 0);
+		queue_delayed_work(system_power_efficient_wq,
+				   &check_lifetime_work, 0);
 		rtmsg_ifa(RTM_NEWADDR, ifa, nlh, NETLINK_CB(skb).portid);
 		blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);
 	}
@@ -2299,7 +2301,7 @@ void __init devinet_init(void)
 	register_gifconf(PF_INET, inet_gifconf);
 	register_netdevice_notifier(&ip_netdev_notifier);
 
-	schedule_delayed_work(&check_lifetime_work, 0);
+	queue_delayed_work(system_power_efficient_wq, &check_lifetime_work, 0);
 
 	rtnl_af_register(&inet_af_ops);
diff --git a/net/rfkill/core.c b/net/rfkill/core.c
index c099b4fffd93..9e94dcabfee1 100644
--- a/net/rfkill/core.c
+++ b/net/rfkill/core.c
@@ -800,7 +800,8 @@ void rfkill_resume_polling(struct rfkill *rfkill)
 	if (!rfkill->ops->poll)
 		return;
 
-	schedule_work(&rfkill->poll_work.work);
+	queue_delayed_work(system_power_efficient_wq,
+			   &rfkill->poll_work, 0);
 }
 EXPORT_SYMBOL(rfkill_resume_polling);
 
@@ -908,7 +909,8 @@ static void rfkill_poll(struct work_struct *work)
 	 */
 	rfkill->ops->poll(rfkill, rfkill->data);
 
-	schedule_delayed_work(&rfkill->poll_work,
+	queue_delayed_work(system_power_efficient_wq,
+			   &rfkill->poll_work,
 			   round_jiffies_relative(POLL_INTERVAL));
 }
 
@@ -972,7 +974,8 @@ int __must_check rfkill_register(struct rfkill *rfkill)
 	INIT_WORK(&rfkill->sync_work, rfkill_sync_work);
 
 	if (rfkill->ops->poll)
-		schedule_delayed_work(&rfkill->poll_work,
+		queue_delayed_work(system_power_efficient_wq,
+				   &rfkill->poll_work,
 				   round_jiffies_relative(POLL_INTERVAL));
 
 	if (!rfkill->persistent || rfkill_epo_lock_active) {
diff --git a/net/wireless/reg.c b/net/wireless/reg.c
index a45d07a4b47a..f515d2bde5d0 100644
--- a/net/wireless/reg.c
+++ b/net/wireless/reg.c
@@ -1581,8 +1581,8 @@ static void reg_process_hint(struct regulatory_request *reg_request,
 		break;
 	default:
 		if (reg_initiator == NL80211_REGDOM_SET_BY_USER)
-			schedule_delayed_work(&reg_timeout,
-					      msecs_to_jiffies(3142));
+			queue_delayed_work(system_power_efficient_wq,
+					   &reg_timeout, msecs_to_jiffies(3142));
 		break;
 	}
 }
@@ -2184,7 +2184,8 @@ static int __set_regdom(const struct ieee80211_regdomain *rd)
 	if (!request_wiphy &&
 	    (lr->initiator == NL80211_REGDOM_SET_BY_DRIVER ||
 	     lr->initiator == NL80211_REGDOM_SET_BY_COUNTRY_IE)) {
-		schedule_delayed_work(&reg_timeout, 0);
+		queue_delayed_work(system_power_efficient_wq,
+				   &reg_timeout, 0);
 		return -ENODEV;
 	}
diff --git a/sound/soc/soc-compress.c b/sound/soc/soc-compress.c
index 91a24bf07dad..4e20ea709cc6 100644
--- a/sound/soc/soc-compress.c
+++ b/sound/soc/soc-compress.c
@@ -246,8 +246,9 @@ static int soc_compr_free(struct snd_compr_stream *cstream)
 					SND_SOC_DAPM_STREAM_STOP);
 		} else {
 			rtd->pop_wait = 1;
-			schedule_delayed_work(&rtd->delayed_work,
-				msecs_to_jiffies(rtd->pmdown_time));
+			queue_delayed_work(system_power_efficient_wq,
+					   &rtd->delayed_work,
+					   msecs_to_jiffies(rtd->pmdown_time));
 		}
 	} else {
 		/* capture streams can be powered down now */
diff --git a/sound/soc/soc-jack.c b/sound/soc/soc-jack.c
index 346991e39a14..b7973e494fa3 100644
--- a/sound/soc/soc-jack.c
+++ b/sound/soc/soc-jack.c
@@ -280,7 +280,7 @@ static irqreturn_t gpio_handler(int irq, void *data)
 	if (device_may_wakeup(dev))
 		pm_wakeup_event(dev, gpio->debounce_time + 50);
 
-	schedule_delayed_work(&gpio->work,
+	queue_delayed_work(system_power_efficient_wq, &gpio->work,
 			      msecs_to_jiffies(gpio->debounce_time));
 
 	return IRQ_HANDLED;
diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
index 42015dbc46b4..1699e62aaf9f 100644
--- a/sound/soc/soc-pcm.c
+++ b/sound/soc/soc-pcm.c
@@ -426,8 +426,9 @@ static int soc_pcm_close(struct snd_pcm_substream *substream)
 	} else {
 		/* start delayed pop wq here for playback streams */
 		rtd->pop_wait = 1;
-		schedule_delayed_work(&rtd->delayed_work,
-			msecs_to_jiffies(rtd->pmdown_time));
+		queue_delayed_work(system_power_efficient_wq,
+				   &rtd->delayed_work,
+				   msecs_to_jiffies(rtd->pmdown_time));
 	}
 } else {
 	/* capture streams can be powered down now */