workqueue: fix subtle pool management issue which can stall whole worker_pool

commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.

A worker_pool's forward progress is guaranteed by the fact that the last idle worker assumes the manager role to create more workers and summon the rescuers if creating workers doesn't succeed in a timely manner before proceeding to execute work items.

This manager role is implemented in manage_workers(), which indicates whether the worker may proceed to work item execution with its return value. This is necessary because multiple workers may contend for the manager role, and, if there already is a manager, others should proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed to work item execution if need_to_create_worker() is false at the head of the function. need_to_create_worker() tests the following conditions.

  pending work items && !nr_running && !nr_idle

The first and third conditions are protected by pool->lock and thus won't change while holding pool->lock; however, nr_running can change asynchronously as other workers block and resume and while it's likely to be zero, as someone woke this worker up in the first place, some other workers could have become runnable in between making it non-zero.

If this happens, manage_workers() could return false even with zero nr_idle making the worker, the last idle one, proceed to execute work items. If then all workers of the pool end up blocking on a resource which can only be released by a work item which is pending on that pool, the whole pool can deadlock as there's no one to create more workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from maybe_create_worker() and making manage_workers() return false iff there's already another manager, which ensures that the last worker doesn't start executing work items.

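The change in the return-value contract can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct pool_model`, `create_workers()`, `manage_workers_old()` and `manage_workers_new()` are hypothetical names that mirror the description above, not the kernel's data structures or code, and locking and the rescuer path are left out. A return value of false means the caller may proceed to execute work items; true means it must not proceed directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the state manage_workers() looks at. */
struct pool_model {
	int pending;       /* pending work items */
	int nr_running;    /* can change asynchronously in the real pool */
	int nr_idle;       /* idle workers other than the caller */
	bool has_manager;  /* another worker already holds the manager role */
	int created;       /* workers created by this model */
};

/* pending work items && !nr_running && !nr_idle */
static bool need_to_create_worker(const struct pool_model *p)
{
	return p->pending > 0 && p->nr_running == 0 && p->nr_idle == 0;
}

/* Stand-in for worker creation: add idle workers while they are needed. */
static void create_workers(struct pool_model *p)
{
	assert(p->nr_idle >= 0);
	while (need_to_create_worker(p)) {
		p->nr_idle++;
		p->created++;
	}
}

/* Old contract: false also when the early exit condition triggers. */
static bool manage_workers_old(struct pool_model *p)
{
	if (p->has_manager)
		return false;
	if (!need_to_create_worker(p))
		return false;	/* the last idle worker may start executing */
	create_workers(p);
	return true;
}

/* New contract: false iff there is already another manager. */
static bool manage_workers_new(struct pool_model *p)
{
	if (p->has_manager)
		return false;
	create_workers(p);
	return true;
}
```

With pending = 1, nr_running = 1 and nr_idle = 0 (the racy snapshot described above), the old variant returns false and lets the last idle worker start executing, while the new variant returns true.
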
We can leave the early exit condition alone and just ignore the return value but the only reason it was put there is because the manage_workers() used to perform both creations and destructions of workers and thus the function may be invoked while the pool is trying to reduce the number of workers. Now that manage_workers() is called only when more workers are needed, the only case this early exit condition is triggered is rare race conditions rendering it pointless.

Tested with simulated workload and modified workqueue code which trigger the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Eric Sandeen <[email protected]>
Link: http://lkml.kernel.org/g/[email protected]
Cc: Dave Chinner <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.

cancel[_delayed]_work_sync() are implemented using __cancel_work_timer() which grabs the PENDING bit using try_to_grab_pending() and then flushes the work item with PENDING set to prevent the on-going execution of the work item from requeueing itself.

try_to_grab_pending() can always grab PENDING bit without blocking except when someone else is doing the above flushing during cancelation. In that case, try_to_grab_pending() returns -ENOENT. In this case, __cancel_work_timer() currently invokes flush_work(). The assumption is that the completion of the work item is what the other canceling task would be waiting for too and thus waiting for the same condition and retrying should allow forward progress without excessive busy looping.

Unfortunately, this doesn't work if preemption is disabled or the latter task has real time priority. Let's say task A just got woken up from flush_work() by the completion of the target work item. If, before task A starts executing, task B gets scheduled and invokes __cancel_work_timer() on the same work item, its try_to_grab_pending() will return -ENOENT as the work item is still being canceled by task A and flush_work() will also immediately return false as the work item is no longer executing. This puts task B in a busy loop possibly preventing task A from executing and clearing the canceling state on the work item leading to a hang.

    task A                      task B                      worker

                                                            executing work
    __cancel_work_timer()
      try_to_grab_pending()
      set work CANCELING
      flush_work()
        block for work completion
                                                            completion, wakes up A
                                __cancel_work_timer()
                                while (forever) {
                                  try_to_grab_pending()
                                    -ENOENT as work is being canceled
                                  flush_work()
                                    false as work is no longer executing
                                }

This patch removes the possible hang by updating __cancel_work_timer() to explicitly wait for clearing of CANCELING rather than invoking flush_work() after try_to_grab_pending() fails with -ENOENT.

Link: http://lkml.kernel.org/g/[email protected]

v3: bit_waitqueue() can't be used for work items defined in vmalloc area. Switched to custom wake function which matches the target work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if the target bit waitqueue has wait_bit_queue's on it. Use DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu Vizoso.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Rabin Vincent <[email protected]>
Cc: Tomeu Vizoso <[email protected]>
Tested-by: Jesper Nilsson <[email protected]>
Tested-by: Rabin Vincent <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

workqueue: make sure delayed work run in local cpu

commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.

My system keeps crashing with below message. vmstat_update() schedules a delayed work in current cpu and expects the work runs in the cpu.
schedule_delayed_work() is expected to make delayed work run in local cpu. The problem is timer can be migrated with NO_HZ. __queue_work() queues work in timer handler, which could run in a different cpu other than where the delayed work is scheduled. The end result is the delayed work runs in different cpu. The patch makes __queue_delayed_work record local cpu earlier. Where the timer runs doesn't change where the work runs with the change.

[ 28.010131] ------------[ cut here ]------------
[ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
[ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[ 28.011860] Modules linked in:
[ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W 4.3.0-rc3+ #634
[ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[ 28.014160] Workqueue: events vmstat_update
[ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
[ 28.015445] RIP: 0010:[<ffffffff8115f921>] [<ffffffff8115f921>] vmstat_update+0x31/0x80
[ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
[ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
[ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
[ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
[ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
[ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
[ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
[ 28.020071] Stack:
[ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
[ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
[ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
[ 28.020071] Call Trace:
[ 28.020071] [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
[ 28.020071] [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
[ 28.020071] [<ffffffff8106c214>] worker_thread+0x114/0x460
[ 28.020071] [<ffffffff8106c100>] ? process_one_work+0x540/0x540
[ 28.020071] [<ffffffff81071bf8>] kthread+0xf8/0x110
[ 28.020071] [<ffffffff81071b00>] ? kthread_create_on_node+0x200/0x200
[ 28.020071] [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
[ 28.020071] [<ffffffff81071b00>] ? kthread_create_on_node+0x200/0x200

Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

workqueue: clear POOL_DISASSOCIATED in rebind_workers()

a9ab775 ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved pool locking into rebind_workers() but left "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().

There is nothing necessarily wrong with it, but there is no benefit either. Let's move it into rebind_workers() and achieve the following benefits:

  1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers() as expected.

  2) we can guarantee that, when POOL_DISASSOCIATED is clear, the running workers of the pool are on the local CPU (pool->cpu).

tj: Minor description update.

Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

workqueue: Fix workqueue stall issue after cpu down failure

When the hotplug notifier call chain with CPU_DOWN_PREPARE is broken before reaching workqueue_cpu_down_callback(), rebind_workers() adds WORKER_REBOUND flag for running workers. Hence, the nr_running of the pool is not increased when scheduler wakes up the worker. The fix is skipping adding WORKER_REBOUND flag when the worker doesn't have WORKER_UNBOUND flag in CPU_DOWN_FAILED path.

Change-Id: I2528e9154f4913d9ec14b63adbcbcd1eaa8a8452
Signed-off-by: Se Wang (Patrick) Oh <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues

Workqueues can be performance or power-oriented. Currently, most workqueues are bound to the CPU they were created on. This gives good performance (due to cache effects) at the cost of potentially waking up otherwise idle cores (Idle from scheduler's perspective. Which may or may not be physically idle) just to process some work. To save power, we can allow the work to be rescheduled on a core that is already awake.

Workqueues created with the WQ_UNBOUND flag will allow some power savings. However, we don't change the default behaviour of the system. To enable power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also be overridden by the workqueue.power_efficient boot parameter.

tj: Updated config description and comments. Renamed CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

Signed-off-by: Viresh Kumar <[email protected]>
Reviewed-by: Amit Kucheria <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

workqueue: Add system wide power_efficient workqueues

This patch adds system wide workqueues aligned towards power saving. This is done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to 'true'.

tj: updated comments a bit.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

firmware: use power efficient workqueue for unloading and aborting fw load

Allow the scheduler to select the most appropriate CPU for running the firmware load timeout routine and delayed routine for firmware unload. This extends idle residency times and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Ming Lei <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaibal Dutta <[email protected]>
[[email protected]: Rebased to latest kernel, added commit message. Fixed code alignment.]
Signed-off-by: Zoran Markovic <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

net: wireless: move regulatory timeout work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU on which the timeout work of regulatory settings would be executed. This extends CPU idle residency time and saves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: "John W. Linville" <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Shaibal Dutta <[email protected]>
[[email protected]: Rebased to latest kernel. Added commit message.]
Signed-off-by: Zoran Markovic <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

rcu: Move SRCU grace period work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU on which the SRCU grace period work would be scheduled. This improves idle residency time and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Lai Jiangshan <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Dipankar Sarma <[email protected]>
Signed-off-by: Shaibal Dutta <[email protected]>
[[email protected]: Rebased to latest kernel version. Added commit message. Fixed code alignment.]
Signed-off-by: Zoran Markovic <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

net/ipv4: queue work on power efficient wq

Workqueue used in ipv4 layer have no real dependency of scheduling these on the cpu which scheduled them.

On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is enabled.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

block: queue work on power efficient wq

Block layer uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them.

On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions.

Cc: Jens Axboe <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

PHYLIB: queue work on system_power_efficient_wq

Phylib uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them.

On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one.

This patch replaces system_wq with system_power_efficient_wq for PHYLIB.

Cc: David S. Miller <[email protected]>
Cc: [email protected]
Signed-off-by: Viresh Kumar <[email protected]>
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

ASoC: pcm: Use the power efficient workqueue for delayed powerdown

There is no need to use a normal per-CPU workqueue for delayed power downs as they're not timing or performance critical and waking up a core for them would defeat some of the point.

Signed-off-by: Mark Brown <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

ASoC: compress: Use power efficient workqueue

There is no need for the power down work to be done on a per CPU workqueue especially considering the fairly long delay before powerdown.

Signed-off-by: Mark Brown <[email protected]>
Acked-by: Vinod Koul <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

ASoC: jack: Use power efficient workqueue

The accessory detect debounce work is not performance sensitive so let the scheduler run it wherever is most efficient rather than in a per CPU workqueue by using the system power efficient workqueue.

Signed-off-by: Mark Brown <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

net/neighbour: queue work on power efficient wq

Workqueue used in neighbour layer have no real dependency of scheduling these on the cpu which scheduled them.

On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is enabled.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

timekeeping: Move clock sync work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU on which the CMOS clock sync work would be scheduled. This improves idle residency time and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Shaibal Dutta <[email protected]>
[[email protected]: Added commit message. Aligned code.]
Signed-off-by: Zoran Markovic <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

net: rfkill: move poll work to power efficient workqueue

This patch moves the rfkill poll_work to the power efficient workqueue. This work does not have to be bound to the CPU that scheduled it, hence the selection of CPU that executes it would be left to the scheduler. Net result is that CPU idle times would be extended, resulting in power savings.

This behaviour is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: "John W. Linville" <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Shaibal Dutta <[email protected]>
[[email protected]: Rebased to latest kernel, added commit message. Fixed workqueue selection after suspend/resume cycle.]
Signed-off-by: Zoran Markovic <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

usb: move hub init and LED blink work to power efficient workqueue

Allow the scheduler to select the best CPU to handle hub initialization and LED blinking work. This extends idle residency times on idle CPUs and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

[[email protected]: Rebased to latest kernel. Added commit message. Changed reference from system to power efficient workqueue for LEDs in check_highspeed() and hub_port_connect_change().]

Acked-by: Alan Stern <[email protected]>
Cc: Sarah Sharp <[email protected]>
Cc: Xenia Ragiadakou <[email protected]>
Cc: Julius Werner <[email protected]>
Cc: Krzysztof Mazur <[email protected]>
Cc: Matthias Beyer <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mathias Nyman <[email protected]>
Cc: Thomas Pugliese <[email protected]>
Signed-off-by: Shaibal Dutta <[email protected]>
Signed-off-by: Zoran Markovic <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

block: remove WQ_POWER_EFFICIENT from kblockd

blk-mq issues async requests through kblockd. To issue a work request on a specific CPU, kblockd_schedule_delayed_work_on is used. However, the specific CPU choice may not be honored, if the power_efficient option for workqueues is set.

blk-mq requires that we have strict per-cpu scheduling, so it won't work properly if kblockd is marked POWER_EFFICIENT and power_efficient is set.

Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.

This essentially reverts part of commit 695588f9454b, which added the WQ_POWER_EFFICIENT marker to kblockd.

Signed-off-by: Matias Bjørling <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

regulator: core: Use the power efficient workqueue for delayed powerdown

There is no need to use a normal per-CPU workqueue for delayed power downs as they're not timing or performance critical and waking up a core for them would defeat some of the point.

Signed-off-by: Mark Brown <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Acked-by: Liam Girdwood <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

power: smb135x: queue work on system_power_efficient_wq

There doesn't seem to be any real dependency of scheduling these on the cpu which scheduled them, so moving every *_delayed to the power efficient wq saves potentially needless idle cpu wake ups, leaving the scheduler to decide the most appropriate cpus to wake up.

Signed-off-by: Francisco Franco <[email protected]>
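
The conversions in this series rest on one rule stated in the messages above: a workqueue marked WQ_POWER_EFFICIENT is allocated with WQ_UNBOUND only when the power_efficient setting is true, and otherwise keeps its per-cpu behaviour. A minimal userspace sketch of that rule follows. `resolve_wq_flags()` is a hypothetical helper, and the flag bit values are assumptions chosen for illustration, not taken from this diff.

```c
#include <stdbool.h>

/* Illustrative flag bits; the numeric values are assumptions of this sketch. */
enum {
	WQ_UNBOUND         = 1 << 1,
	WQ_POWER_EFFICIENT = 1 << 7,
};

/*
 * Hypothetical helper: return the flags a workqueue would effectively be
 * allocated with, given the requested flags and the power_efficient setting.
 */
static unsigned int resolve_wq_flags(unsigned int flags, bool power_efficient)
{
	/* Only workqueues that opted in are turned into unbound ones. */
	if ((flags & WQ_POWER_EFFICIENT) && power_efficient)
		flags |= WQ_UNBOUND;
	return flags;
}
```

Under this model a workqueue that did not request WQ_POWER_EFFICIENT is never changed, which matches the statement that the default behaviour of the system stays the same.
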
franciscofranco · Oct 6, 2016 · d1590c8
1 parent dc3cedb
commit d1590c8
Showing 20 changed files with 280 additions and 109 deletions.
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
@@ -3349,6 +3349,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that this also can be controlled per-workqueue for
 			workqueues visible under /sys/bus/workqueue/.
 
+	workqueue.power_efficient
+			Per-cpu workqueues are generally preferred because
+			they show better performance thanks to cache
+			locality; unfortunately, per-cpu workqueues tend to
+			be more power hungry than unbound workqueues.
+
+			Enabling this makes the per-cpu workqueues which
+			were observed to contribute significantly to power
+			consumption unbound, leading to measurably lower
+			power usage at the cost of small performance
+			overhead.
+
+			The default value of this parameter is determined by
+			the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
+
 	x2apic_phys	[X86-64,APIC] Use x2apic physical mode instead of
 			default x2apic cluster mode on platforms
 			supporting x2apic.

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
@@ -144,7 +144,8 @@ void put_io_context(struct io_context *ioc)
 	if (atomic_long_dec_and_test(&ioc->refcount)) {
 		spin_lock_irqsave(&ioc->lock, flags);
 		if (!hlist_empty(&ioc->icq_list))
-			schedule_work(&ioc->release_work);
+			queue_work(system_power_efficient_wq,
+					&ioc->release_work);
 		else
 			free_ioc = true;
 		spin_unlock_irqrestore(&ioc->lock, flags);

diff --git a/block/genhd.c b/block/genhd.c
@@ -1506,9 +1506,11 @@ static void __disk_unblock_events(struct gendisk *disk, bool check_now)
 	intv = disk_events_poll_jiffies(disk);
 	set_timer_slack(&ev->dwork.timer, intv / 4);
 	if (check_now)
-		queue_delayed_work(system_freezable_wq, &ev->dwork, 0);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				&ev->dwork, 0);
 	else if (intv)
-		queue_delayed_work(system_freezable_wq, &ev->dwork, intv);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				&ev->dwork, intv);
 out_unlock:
 	spin_unlock_irqrestore(&ev->lock, flags);
 }
@@ -1551,7 +1553,8 @@ void disk_flush_events(struct gendisk *disk, unsigned int mask)
 	spin_lock_irq(&ev->lock);
 	ev->clearing |= mask;
 	if (!ev->block)
-		mod_delayed_work(system_freezable_wq, &ev->dwork, 0);
+		mod_delayed_work(system_freezable_power_efficient_wq,
+				&ev->dwork, 0);
 	spin_unlock_irq(&ev->lock);
 }
 
@@ -1644,7 +1647,8 @@ static void disk_check_events(struct disk_events *ev,
 
 	intv = disk_events_poll_jiffies(disk);
 	if (!ev->block && intv)
-		queue_delayed_work(system_freezable_wq, &ev->dwork, intv);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				&ev->dwork, intv);
 
 	spin_unlock_irq(&ev->lock);
 

diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
@@ -1006,7 +1006,8 @@ static int _request_firmware_load(struct firmware_priv *fw_priv, bool uevent,
 		dev_set_uevent_suppress(f_dev, false);
 		dev_dbg(f_dev, "firmware: requesting %s\n", buf->fw_id);
 		if (timeout != MAX_SCHEDULE_TIMEOUT)
-			schedule_delayed_work(&fw_priv->timeout_work, timeout);
+			queue_delayed_work(system_power_efficient_wq,
+					   &fw_priv->timeout_work, timeout);
 
 		kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD);
 	}
@@ -1699,8 +1700,8 @@ static void device_uncache_fw_images_work(struct work_struct *work)
  */
 static void device_uncache_fw_images_delay(unsigned long delay)
 {
-	schedule_delayed_work(&fw_cache.work,
-			msecs_to_jiffies(delay));
+	queue_delayed_work(system_power_efficient_wq, &fw_cache.work,
+			   msecs_to_jiffies(delay));
 }
 
 static int fw_pm_notify(struct notifier_block *notify_block,

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
@@ -439,7 +439,7 @@ void phy_start_machine(struct phy_device *phydev,
 {
 	phydev->adjust_state = handler;
 
-	schedule_delayed_work(&phydev->state_queue, HZ);
+	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, HZ);
 }
 
 /**
@@ -500,7 +500,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
 	disable_irq_nosync(irq);
 	atomic_inc(&phydev->irq_disable);
 
-	schedule_work(&phydev->phy_queue);
+	queue_work(system_power_efficient_wq, &phydev->phy_queue);
 
 	return IRQ_HANDLED;
 }
@@ -655,7 +655,7 @@ static void phy_change(struct work_struct *work)
 
 	/* reschedule state queue work to run as soon as possible */
 	cancel_delayed_work_sync(&phydev->state_queue);
-	schedule_delayed_work(&phydev->state_queue, 0);
+	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
 
 	return;
 
@@ -918,7 +918,8 @@ void phy_state_machine(struct work_struct *work)
 	if (err < 0)
 		phy_error(phydev);
 
-	schedule_delayed_work(&phydev->state_queue, PHY_STATE_TIME * HZ);
+	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
+			PHY_STATE_TIME * HZ);
 }
 
 static inline void mmd_phy_indirect(struct mii_bus *bus, int prtad, int devad,

diff --git a/drivers/power/smb135x-charger.c b/drivers/power/smb135x-charger.c
@@ -1911,7 +1911,8 @@ static int smb135x_battery_set_property(struct power_supply *psy,
 		smb_stay_awake(&chip->smb_wake_source);
 		chip->bms_check = 1;
 		cancel_delayed_work(&chip->heartbeat_work);
-		schedule_delayed_work(&chip->heartbeat_work,
+		queue_delayed_work(system_power_efficient_wq,
+			&chip->heartbeat_work,
 			msecs_to_jiffies(0));
 		break;
 	case POWER_SUPPLY_PROP_HEALTH:
@@ -1921,7 +1922,8 @@ static int smb135x_battery_set_property(struct power_supply *psy,
 		smb135x_set_chrg_path_temp(chip);
 		chip->temp_check = 1;
 		cancel_delayed_work(&chip->heartbeat_work);
-		schedule_delayed_work(&chip->heartbeat_work,
+		queue_delayed_work(system_power_efficient_wq,
+			&chip->heartbeat_work,
 			msecs_to_jiffies(0));
 		break;
 	/* Block from Fuel Gauge */
@@ -2598,7 +2600,8 @@ static void aicl_check_work(struct work_struct *work)
 		chip->aicl_weak_detect = true;
 
 	cancel_delayed_work(&chip->src_removal_work);
-	schedule_delayed_work(&chip->src_removal_work,
+	queue_delayed_work(system_power_efficient_wq,
+		&chip->src_removal_work,
 		msecs_to_jiffies(3000));
 	if (!rc) {
 		dev_dbg(chip->dev, "Reached Bottom IC!\n");
@@ -2678,8 +2681,9 @@ static void rate_check_work(struct work_struct *work)
 
 	chip->rate_check_count++;
 	if (chip->rate_check_count < 6)
-		schedule_delayed_work(&chip->rate_check_work,
-				      msecs_to_jiffies(500));
+		queue_delayed_work(system_power_efficient_wq,
+				&chip->rate_check_work,
+				msecs_to_jiffies(500));
 }
 
 static void usb_insertion_work(struct work_struct *work)
@@ -2734,8 +2738,9 @@ static void heartbeat_work(struct work_struct *work)
 	    smb135x_get_prop_batt_health(chip, &batt_health)) {
 		dev_warn(chip->dev, "HB Failed to run resume = %d!\n",
 			 (int)chip->resume_completed);
-		schedule_delayed_work(&chip->heartbeat_work,
-				      msecs_to_jiffies(1000));
+			queue_delayed_work(system_power_efficient_wq,
+				&chip->heartbeat_work,
+				msecs_to_jiffies(1000));
 		return;
 	}
 
@@ -2833,8 +2838,9 @@ static void heartbeat_work(struct work_struct *work)
 
 	power_supply_changed(&chip->batt_psy);
 
-	schedule_delayed_work(&chip->heartbeat_work,
-			      msecs_to_jiffies(60000));
+	queue_delayed_work(system_power_efficient_wq,
+		&chip->heartbeat_work,
+		msecs_to_jiffies(60000));
 	chip->hb_running = false;
 	if (!usb_present && !dc_present)
 		smb_relax(&chip->smb_wake_source);
@@ -2945,7 +2951,8 @@ static int otg_oc_handler(struct smb135x_chg *chip, u8 rt_stat)
 		return 0;
 	}
 
-	schedule_delayed_work(&chip->ocp_clear_work,
+	queue_delayed_work(system_power_efficient_wq,
+		&chip->ocp_clear_work,
 		msecs_to_jiffies(0));
 
 	pr_err("rt_stat = 0x%02x\n", rt_stat);
@@ -2970,7 +2977,8 @@ static int handle_dc_removal(struct smb135x_chg *chip)
 static int handle_dc_insertion(struct smb135x_chg *chip)
 {
 	if (chip->dc_psy_type == POWER_SUPPLY_TYPE_WIRELESS)
-		schedule_delayed_work(&chip->wireless_insertion_work,
+		queue_delayed_work(system_power_efficient_wq,
+			&chip->wireless_insertion_work,
 			msecs_to_jiffies(DCIN_UNSUSPEND_DELAY_MS));
 	if (chip->dc_psy_type != -EINVAL)
 		power_supply_set_online(&chip->dc_psy,
@@ -3121,8 +3129,9 @@ static int handle_usb_insertion(struct smb135x_chg *chip)
 		smb_stay_awake(&chip->smb_wake_source);
 		chip->apsd_rerun_cnt++;
 		chip->usb_present = 0;
-		schedule_delayed_work(&chip->usb_insertion_work,
-				      msecs_to_jiffies(1000));
+		queue_delayed_work(system_power_efficient_wq,
+			&chip->usb_insertion_work,
+			msecs_to_jiffies(1000));
 		return 0;
 	}
 
@@ -3162,8 +3171,9 @@ static int handle_usb_insertion(struct smb135x_chg *chip)
 	chip->charger_rate =  POWER_SUPPLY_CHARGE_RATE_NORMAL;
 	chip->rate_check_count = 0;
 	cancel_delayed_work(&chip->rate_check_work);
-	schedule_delayed_work(&chip->rate_check_work,
-			      msecs_to_jiffies(500));
+	queue_delayed_work(system_power_efficient_wq,
+		&chip->rate_check_work,
+		msecs_to_jiffies(500));
 	return 0;
 }
 
@@ -3204,8 +3214,9 @@ static int usbin_uv_handler(struct smb135x_chg *chip, u8 rt_stat)
 			if (rc < 0)
 				pr_err("Failed to Disable USBIN UV IRQ\n");
 			cancel_delayed_work(&chip->aicl_check_work);
-			schedule_delayed_work(&chip->aicl_check_work,
-					      msecs_to_jiffies(0));
+			queue_delayed_work(system_power_efficient_wq,
+				&chip->aicl_check_work,
+				msecs_to_jiffies(0));
 		}
 		return 0;
 	}
@@ -5435,8 +5446,9 @@ static int smb135x_charger_probe(struct i2c_client *client,
 	if (rc < 0)
 		pr_err("failed to set up voltage notifications: %d\n", rc);
 
-	schedule_delayed_work(&chip->heartbeat_work,
-			      msecs_to_jiffies(60000));
+	queue_delayed_work(system_power_efficient_wq,
+			&chip->heartbeat_work,
+			msecs_to_jiffies(60000));
 
 	dev_info(chip->dev, "SMB135X version = %s revision = %s successfully probed batt=%d dc = %d usb = %d\n",
 			version_str[chip->version],

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
@@ -1925,8 +1925,9 @@ int regulator_disable_deferred(struct regulator *regulator, int ms)
 	rdev->deferred_disables++;
 	mutex_unlock(&rdev->mutex);
 
-	ret = schedule_delayed_work(&rdev->disable_work,
-				    msecs_to_jiffies(ms));
+	ret = queue_delayed_work(system_power_efficient_wq,
+				 &rdev->disable_work,
+				 msecs_to_jiffies(ms));
 	if (ret < 0)
 		return ret;
 	else

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
@@ -506,7 +506,8 @@ static void led_work (struct work_struct *work)
 		changed++;
 	}
 	if (changed)
-		schedule_delayed_work(&hub->leds, LED_CYCLE_PERIOD);
+		queue_delayed_work(system_power_efficient_wq,
+				&hub->leds, LED_CYCLE_PERIOD);
 }
 
 /* use a short timeout for hub/port status fetches */
@@ -1057,7 +1058,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
 				goto init2;
 #endif
 			PREPARE_DELAYED_WORK(&hub->init_work, hub_init_func2);
-			schedule_delayed_work(&hub->init_work,
+			queue_delayed_work(system_power_efficient_wq,
+					&hub->init_work,
 					msecs_to_jiffies(delay));
 
 			/* Suppress autosuspend until init is done */
@@ -1218,7 +1220,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
 		/* Don't do a long sleep inside a workqueue routine */
 		if (type == HUB_INIT2) {
 			PREPARE_DELAYED_WORK(&hub->init_work, hub_init_func3);
-			schedule_delayed_work(&hub->init_work,
+			queue_delayed_work(system_power_efficient_wq,
+					&hub->init_work,
 					msecs_to_jiffies(delay));
 			device_unlock(hub->intfdev);
 			return;		/* Continues at init3: below */
@@ -1233,7 +1236,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
 	if (status < 0)
 		dev_err(hub->intfdev, "activate --> %d\n", status);
 	if (hub->has_indicators && blinkenlights)
-		schedule_delayed_work(&hub->leds, LED_CYCLE_PERIOD);
+		queue_delayed_work(system_power_efficient_wq,
+				&hub->leds, LED_CYCLE_PERIOD);
 
 	/* Scan all ports that need attention */
 	kick_khubd(hub);
@@ -4396,7 +4400,8 @@ check_highspeed (struct usb_hub *hub, struct usb_device *udev, int port1)
 		/* hub LEDs are probably harder to miss than syslog */
 		if (hub->has_indicators) {
 			hub->indicator[port1-1] = INDICATOR_GREEN_BLINK;
-			schedule_delayed_work (&hub->leds, 0);
+			queue_delayed_work(system_power_efficient_wq,
+					&hub->leds, 0);
 		}
 	}
 	kfree(qual);
@@ -4626,7 +4631,9 @@ static void hub_port_connect_change(struct usb_hub *hub, int port1,
 				if (hub->has_indicators) {
 					hub->indicator[port1-1] =
 						INDICATOR_AMBER_BLINK;
-					schedule_delayed_work (&hub->leds, 0);
+					queue_delayed_work(
+						system_power_efficient_wq,
+						&hub->leds, 0);
 				}
 				status = -ENOTCONN;	/* Don't retry */
 				goto loop_disable;
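
The call sites converted above all queue onto system_power_efficient_wq. As a rough userspace model (not kernel code), the sketch below shows how a workqueue marked WQ_POWER_EFFICIENT is understood to end up per-cpu or unbound depending on the workqueue.power_efficient parameter. The helper name resolve_wq_flags() is hypothetical, and the WQ_UNBOUND value is assumed to match the 3.10-era header; only WQ_POWER_EFFICIENT = 1 << 7 is taken from this patch.

```c
#include <stdbool.h>

/*
 * Flag bits modelled on include/linux/workqueue.h.
 * WQ_POWER_EFFICIENT is the value added by this patch; the WQ_UNBOUND
 * value is an assumption about the surrounding header.
 */
enum {
	WQ_UNBOUND         = 1 << 1,
	WQ_POWER_EFFICIENT = 1 << 7,
};

/*
 * Hypothetical helper: returns the effective flags of a workqueue.
 * A WQ_POWER_EFFICIENT workqueue stays per-cpu unless the
 * power_efficient mode is enabled, in which case it is treated as
 * unbound. Workqueues without the flag are left untouched.
 */
static unsigned int resolve_wq_flags(unsigned int flags, bool power_efficient)
{
	if ((flags & WQ_POWER_EFFICIENT) && power_efficient)
		flags |= WQ_UNBOUND;
	return flags;
}
```

With the mode disabled, resolve_wq_flags(WQ_POWER_EFFICIENT, false) leaves WQ_UNBOUND clear, which corresponds to system_power_efficient_wq behaving like system_wq, the queue that schedule_delayed_work() used before this change.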

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
@@ -71,7 +71,8 @@ enum {
 	/* data contains off-queue information when !WORK_STRUCT_PWQ */
 	WORK_OFFQ_FLAG_BASE	= WORK_STRUCT_COLOR_SHIFT,
 
-	WORK_OFFQ_CANCELING	= (1 << WORK_OFFQ_FLAG_BASE),
+	__WORK_OFFQ_CANCELING	= WORK_OFFQ_FLAG_BASE,
+	WORK_OFFQ_CANCELING	= (1 << __WORK_OFFQ_CANCELING),
 
 	/*
 	 * When a work item is off queue, its high bits point to the last
@@ -303,6 +304,33 @@ enum {
 	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu instensive workqueue */
 	WQ_SYSFS		= 1 << 6, /* visible in sysfs, see wq_sysfs_register() */
 
+	/*
+	 * Per-cpu workqueues are generally preferred because they tend to
+	 * show better performance thanks to cache locality.  Per-cpu
+	 * workqueues exclude the scheduler from choosing the CPU to
+	 * execute the worker threads, which has an unfortunate side effect
+	 * of increasing power consumption.
+	 *
+	 * The scheduler considers a CPU idle if it doesn't have any task
+	 * to execute and tries to keep idle cores idle to conserve power;
+	 * however, for example, a per-cpu work item scheduled from an
+	 * interrupt handler on an idle CPU will force the scheduler to
+	 * execute the work item on that CPU breaking the idleness, which in
+	 * turn may lead to more scheduling choices which are sub-optimal
+	 * in terms of power consumption.
+	 *
+	 * Workqueues marked with WQ_POWER_EFFICIENT are per-cpu by default
+	 * but become unbound if workqueue.power_efficient kernel param is
+	 * specified.  Per-cpu workqueues which are found to contribute
+	 * significantly to power consumption are marked with this
+	 * flag, and enabling the power_efficient mode leads to
+	 * noticeable power saving at the cost of a small performance
+	 * disadvantage.
+	 *
+	 * http://thread.gmane.org/gmane.linux.kernel/1480396
+	 */
+	WQ_POWER_EFFICIENT	= 1 << 7,
+
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 
@@ -333,11 +361,19 @@ enum {
  *
  * system_freezable_wq is equivalent to system_wq except that it's
  * freezable.
+ *
+ * *_power_efficient_wq are inclined towards saving power and converted
+ * into WQ_UNBOUND variants if 'wq_power_efficient' is enabled; otherwise,
+ * they are the same as their non-power-efficient counterparts - e.g.
+ * system_power_efficient_wq is identical to system_wq if
+ * 'wq_power_efficient' is disabled.  See WQ_POWER_EFFICIENT for more info.
  */
 extern struct workqueue_struct *system_wq;
 extern struct workqueue_struct *system_long_wq;
 extern struct workqueue_struct *system_unbound_wq;
 extern struct workqueue_struct *system_freezable_wq;
+extern struct workqueue_struct *system_power_efficient_wq;
+extern struct workqueue_struct *system_freezable_power_efficient_wq;
 
 static inline struct workqueue_struct * __deprecated __system_nrt_wq(void)
 {