sched: Fix bug in average nr_running and nr_iowait calculation

sched_get_nr_running_avg() returns the average nr_running and nr_iowait
task counts since it was last invoked. Fix several bugs in their
calculation:

 * sched_update_nr_prod() needs to consider that the nr_running count
   can change by more than 1 when the CFS_BANDWIDTH feature is used

 * sched_get_nr_running_avg() needs to sum up the nr_iowait count
   across all cpus, rather than just one

 * sched_get_nr_running_avg() could race with sched_update_nr_prod(),
   as a result of which it could use a curr_time that is behind a
   cpu's 'last_time' value. That would lead to an erroneous calculation
   of the average nr_running or nr_iowait.

While at it, also fix a bug in the BUG_ON() check in
sched_update_nr_prod() and remove the unnecessary nr_running argument
to sched_update_nr_prod().

Change-Id: I46737614737292fae0d7204c4648fb9b862f65b2
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

Conflicts:
        kernel/sched/fair.c

Signed-off-by: franciscofranco <[email protected]>

sched: Remove one division operation in find_busiest_queue()

Remove one division operation in find_busiest_queue() by using
crosswise multiplication:

        wl_i / power_i > wl_j / power_j :=
        wl_i * power_j > wl_j * power_i

Signed-off-by: Joonsoo Kim <[email protected]>
[ Expanded the changelog. ]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Optimize find_busiest_queue()

Use for_each_cpu_and() and thereby avoid computing the capacity for
CPUs we know we're not interested in.

Reviewed-by: Paul Turner <[email protected]>
Reviewed-by: Preeti U Murthy <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing

Only one task can replace the waker.

Signed-off-by: Kirill Tkhai <[email protected]>
CC: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/idle: Avoid spurious wakeup IPIs

Because mwait_idle_with_hints() gets called from !idle context it must
call current_clr_polling(). This however means that resched_task() is
very likely to send an IPI even when we were polling:

  CPU0                                          CPU1

  if (current_set_polling_and_test())
    goto out;

  __monitor(&ti->flags);
  if (!need_resched())
    __mwait(eax, ecx);
                                                set_tsk_need_resched(p);
                                                smp_mb();
out:
  current_clr_polling();
                                                if (!tsk_is_polling(p))
                                                  smp_send_reschedule(cpu);

So while it is correct (extra IPIs aren't a problem, whereas a missed
IPI would be) it is a performance problem (for some).

Avoid this issue by using fetch_or() to atomically set NEED_RESCHED and
test if POLLING_NRFLAG is set. Since a CPU stuck in mwait is unlikely
to modify the flags word, contention on the cmpxchg is unlikely and
thus we should mostly succeed in a single go.
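
As a rough illustration of the fetch_or() approach, a user-space C11
sketch with made-up flag names and a made-up helper, not the kernel
implementation: the old value returned by the atomic fetch-or tells
the waker whether the target CPU was polling on the flags word, so the
reschedule IPI can be skipped in that case.

  /* Sketch only: FLAG_* and set_nr_and_not_polling() are illustrative. */
  #include <stdatomic.h>
  #include <stdbool.h>

  #define FLAG_NEED_RESCHED   (1u << 0)
  #define FLAG_POLLING_NRFLAG (1u << 1)

  static bool set_nr_and_not_polling(atomic_uint *flags)
  {
      /* Atomically set NEED_RESCHED and observe the previous flags word. */
      unsigned int old = atomic_fetch_or(flags, FLAG_NEED_RESCHED);

      /* Send an IPI only if the remote CPU was not polling on this word. */
      return !(old & FLAG_POLLING_NRFLAG);
  }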
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Nicolas Pitre <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Fix inaccurate accounting for real-time task

It is possible that rq->clock_task was not updated in put_prev_task(),
in which case we can potentially overcharge a real-time task for time
it did not run. This is because clock_task could be stale and not
represent the exact time the real-time task started running. Fix this
by forcing an update of rq->clock_task when a real-time task starts
running.

Change-Id: I8320bb4e47924368583127b950d987925e8e6a6c
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpumask: Fix cpumask leak in partition_sched_domains()

If doms_new is NULL, partition_sched_domains() will reset ndoms_cur to
0, and free the old sched domains with free_sched_domains(doms_cur,
ndoms_cur). As ndoms_cur is 0, the cpumask will not be freed.

Signed-off-by: Xiaotian Feng <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Implement fast idling of CPUs when the system is partially loaded

When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempting to pull a job to a cpu before putting it to idle is
unnecessary and can be skipped. This patch adds an indicator so the
scheduler can know when there is no more than 1 active job on any CPU
in the system, and skip needless job pulls.

On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec delay when we go through a full load
balance to try to pull a job from all the other cpus. While 0.1 msec
was spent on processing the request and generating a response, the
0.13 msec load balance overhead was actually more than the actual work
being done. This overhead can be skipped much of the time for lightly
loaded systems.

With this patch, we tested with a netperf request/response workload
that has the server busy with half the cpus in a 4 socket system. We
found the patch eliminated 75% of the load balance attempts before
idling a cpu.

The overhead of setting/clearing the indicator is low, as we already
gather the necessary info while we call add_nr_running() and
update_sd_lb_stats(). We switch to full load balancing immediately if
any cpu gets more than one job on its run queue in add_nr_running().
We clear the indicator to avoid load balancing when we detect that no
cpu has more than one job while we scan the run queues in
update_sg_lb_stats(). We are aggressive in turning on the load
balancing and opportunistic in skipping it.
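
A minimal sketch of such an indicator, assuming a hypothetical shared
structure and helper names rather than the kernel's root_domain: the
flag is set cheaply on enqueue when any runqueue holds more than one
runnable task, cleared from the stats scan, and consulted by newly
idle CPUs before paying for a full balance pass.

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Hypothetical stand-in for the shared per-domain state. */
  struct overload_sketch {
      atomic_int overload;    /* nonzero if any CPU has > 1 runnable task */
  };

  /* Called from the enqueue path with the runqueue's new task count. */
  static void note_nr_running(struct overload_sketch *od, unsigned int nr)
  {
      if (nr > 1)
          atomic_store(&od->overload, 1);
  }

  /* Cleared from the stats scan when no CPU has more than one task. */
  static void clear_overload(struct overload_sketch *od)
  {
      atomic_store(&od->overload, 0);
  }

  /* Newly idle CPUs skip the expensive balance pass when not overloaded. */
  static bool should_idle_balance(struct overload_sketch *od)
  {
      return atomic_load(&od->overload) != 0;
  }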
Signed-off-by: Tim Chen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Jason Low <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Peter Hurley <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <[email protected]>
(Patch adapted for 3.10)
Signed-off-by: franciscofranco <[email protected]>

sched/balancing: Prevent the reselection of a previous env.dst_cpu if some tasks are pinned

Currently it is new_dst_cpu that is prevented from being reselected,
not dst_cpu. This can result in attempting to pull tasks to this_cpu
twice.

Signed-off-by: Vladimir Davydov <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/281f59b6e596c718dd565ad267fc38f5b8e5c995.1379265590.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Update rq clock before calling check_preempt_curr()

check_preempt_curr() of the fair class needs an up-to-date sched clock
value to update the runtime stats of the current task of the target's
rq.

When a task is woken up, activate_task() is usually called right
before ttwu_do_wakeup() unless the task is still in the runqueue. In
the latter case we need to update the rq clock explicitly because
activate_task() isn't there to do the job for us.

Change-Id: I2dc5521223c89c0da4ad8e103d023e14ee8a6647
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Force sleep on consecutive sched_yields

If a task sched_yields to itself continuously, force the task to sleep
in sched_yield. This will lower the CPU load of this task, thereby
lowering the cpu frequency and improving power.

Added a stat variable to track how many times we sleep due to these
consecutive sched_yields. Also added sysctl knobs to control the
number of consecutive sched_yields before the sleep kicks in and the
duration of the sleep in us.

Bug 1424617

Change-Id: Ie92412b8b900365816e17237fcbd0aac6e9c94ce
Signed-off-by: Sai Gurrappadi <[email protected]>
Reviewed-on: http://git-master/r/358455
Reviewed-by: Wen Yi <[email protected]>
Reviewed-by: Peter Zu <[email protected]>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Diwakar Tundlam <[email protected]>

Conflicts:
        kernel/sched/core.c

Signed-off-by: franciscofranco <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for the
need to notify the cpufreq governor on task migrations or wakeups.
However the woken task could exit between the time p->pi_lock is
released and the time the test for notification is run. As a result,
the test for notification could refer to an exited task.
task_notify_on_migrate(p) could thus lead to an invalid memory
reference. Fix this by running the test for notification with the
task's pi_lock held.
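
A minimal pthread sketch of the locking pattern, with hypothetical
names; not the kernel code. The point is simply that the field is read
while the lock still pins the task, rather than after the lock is
dropped.

  #include <pthread.h>
  #include <stdbool.h>

  struct task_sketch {
      pthread_mutex_t pi_lock;    /* also taken by the exit path */
      bool notify_on_migrate;
  };

  static bool wake_and_test_notify(struct task_sketch *p)
  {
      bool notify;

      pthread_mutex_lock(&p->pi_lock);
      /* ... wakeup work ... */
      notify = p->notify_on_migrate;   /* test while the lock pins the task */
      pthread_mutex_unlock(&p->pi_lock);

      /* 'p' must not be dereferenced again here without a new reference. */
      return notify;
  }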
Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
[francisco: slightly changed the original commit to make it compile and
be useful at the same time]
Signed-off-by: franciscofranco <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more
tasks on the runqueue. Then, in idle_balance(), it checks whether a
task should be pulled into the current runqueue, assuming it will go
to idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window
where another cpu may put a task in our runqueue, so we won't go to
idle but we have filled the idle_stamp, thinking we will.

This patch closes the window by checking, after taking the lock again,
whether the runqueue has been modified even though no task was pulled,
so we won't go to idle right after in the __schedule() function.

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Avoid throttle_cfs_rq() racing with period_timer stopping

throttle_cfs_rq() doesn't check to make sure that period_timer is
running, and while update_curr/assign_cfs_runtime does, a concurrently
running period_timer on another cpu could cancel itself between this
cpu's update_curr and throttle_cfs_rq(). If there are no other cfs_rqs
running in the tg to restart the timer, this causes the cfs_rq to be
stranded forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer
is inactive. (Also add some sched_debug lines for cfs_bandwidth.)

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented
out this fails, with the full patch it passes.

Change-Id: I27ba9f78190ccfc2f50d1c147640d6d2aa31832f
Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock

tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to force the
period timer restart. It's not safe, because it can lead to a
deadlock, described in commit 927b54fccbf0:

"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."

Three CPUs must be involved:

  CPU0                        CPU1                        CPU2

  take rq->lock               period timer fired
  ...                         take cfs_b lock
  ...                         ...
                                                          tg_set_cfs_bandwidth()
  throttle_cfs_rq()           release cfs_b lock          take cfs_b lock
  ...                         distribute_cfs_runtime()    timer_active = 0
  take cfs_b->lock            wait for rq->lock           ...
  __start_cfs_bandwidth()
  {wait for timer callback
   break if timer_active == 1}

So, CPU0 and CPU1 are deadlocked.

Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
wait for period timer callbacks (ignoring cfs_b->timer_active) and
restart the timer explicitly.
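
A minimal pthread sketch of the safe ordering described above, using a
worker thread as a stand-in for the period timer and hypothetical
names; not the kernel code. The updater never waits for the callback
while holding the lock the callback itself needs: it waits first, then
takes the lock, updates the parameters and restarts the period work
explicitly.

  #include <pthread.h>
  #include <stdbool.h>

  struct bandwidth_sketch {
      pthread_mutex_t lock;       /* protects quota/period state */
      pthread_t period_thread;    /* stands in for the period hrtimer */
      bool period_active;
  };

  static void *period_callback(void *arg)
  {
      struct bandwidth_sketch *b = arg;

      pthread_mutex_lock(&b->lock);     /* the callback needs the lock */
      /* ... distribute runtime to runqueues ... */
      b->period_active = false;
      pthread_mutex_unlock(&b->lock);
      return NULL;
  }

  static void set_bandwidth(struct bandwidth_sketch *b)
  {
      /* Wait for any in-flight callback *before* taking its lock. */
      pthread_join(b->period_thread, NULL);

      pthread_mutex_lock(&b->lock);
      /* ... update quota and period ... */
      b->period_active = true;          /* restart the period work explicitly */
      pthread_create(&b->period_thread, NULL, period_callback, b);
      pthread_mutex_unlock(&b->lock);
  }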
Change-Id: Ib108546ea8d144fb7d2c2ae59bea6ab3632b51f4
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%[email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/balancing: Reduce the rate of needless idle load balancing

The current no_hz idle load balancer does load balancing for *all*
idle cpus, even though the time due for load balancing on a particular
idle cpu could still be a while in the future. This introduces a much
higher load balancing rate than what is necessary.

The patch changes the behavior by doing idle load balancing on behalf
of an idle cpu only when it is due for load balancing.

On SGI's systems with over 3000 cores, the cpu responsible for idle
balancing got overwhelmed with idle balancing, which introduced a lot
of OS noise to workloads. This patch fixes the issue.

Signed-off-by: Tim Chen <[email protected]>
Acked-by: Russ Anderson <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Hedi Berriche <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Peter Hurley <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Stop searching for tasks in newidle balance if there are runnable tasks

It was found that when running some workloads (such as AIM7) on large
systems with many cores, CPUs do not remain idle for long. Thus, tasks
can wake up and get enqueued while we are doing idle balancing.

In this patch, while traversing the domains in idle balance, in
addition to checking for pulled_task, we add an extra check of
this_rq->nr_running to determine whether we should stop searching for
tasks to pull. If there are runnable tasks on this rq, we stop
traversing the domains. This reduces the chance that idle balance
delays a task from running.

This patch resulted in approximately a 6% performance improvement when
running a Java Server workload on an 8 socket machine.

Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

IKSWM-5880: sched/rt: really force updates rq clock in pick_next_task_rt

In the original patch "sched: Fix inaccurate accounting for real-time
task", update_rq_clock() is skipped if skip_clock_update is zero/-1.
Fix this bug so that the original patch really forces an update of the
rq clock; the realtime task's exec_start will then be accurate.
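
A minimal sketch of the accounting pattern at stake, with hypothetical
structures and a monotonic clock as the time source; not the kernel
code. The clock is refreshed unconditionally right before the
start-of-execution timestamp is taken, so later deltas
(now - exec_start) cannot be inflated by a stale clock.

  #include <stdint.h>
  #include <time.h>

  struct rq_clock_sketch {
      uint64_t clock_task;        /* per-runqueue task clock, in ns */
      int skip_clock_update;
  };

  struct rt_task_sketch {
      uint64_t exec_start;        /* timestamp the task started running */
  };

  static uint64_t read_ns_clock(void)
  {
      struct timespec ts;

      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
  }

  static void pick_rt_task(struct rq_clock_sketch *rq, struct rt_task_sketch *p)
  {
      /* Force a clock refresh even if an earlier path asked to skip it. */
      rq->skip_clock_update = 0;
      rq->clock_task = read_ns_clock();

      p->exec_start = rq->clock_task;   /* accurate start of execution */
  }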
Change-Id: I235917648f1631f9f037fa12e8322488dcfa08a2
Signed-off-by: Jiangli Yuan <[email protected]>
Reviewed-on: http://gerrit.mot.com/788261
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver <[email protected]>
Tested-by: Jira Key <[email protected]>
Reviewed-by: Zhi-Ming Yuan <[email protected]>
Reviewed-by: Yi-Wei Zhao <[email protected]>
Reviewed-by: Russell Knize <[email protected]>
Reviewed-by: Christopher Fries <[email protected]>
Submit-Approved: Jira Key <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()

Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p
on two CPUs at the same time.

Even though the overlap is very limited (the task is in the middle of
being scheduled out), it could still result in corruption of the
scheduler data structures.

  CPU0                                          CPU1

  set_current_state(...)

    <preempt_schedule>
      context_switch(X, Y)
        prepare_lock_switch(Y)
          Y->on_cpu = 1;
        finish_lock_switch(X)
          store_release(X->on_cpu, 0);

                                                try_to_wake_up(X)
                                                  LOCK(p->pi_lock);

                                                  t = X->on_cpu; // 0

      context_switch(Y, X)
        prepare_lock_switch(X)
          X->on_cpu = 1;
        finish_lock_switch(Y)
          store_release(Y->on_cpu, 0);
    </preempt_schedule>

    schedule();
      deactivate_task(X);
      X->on_rq = 0;

                                                  if (X->on_rq) // false

                                                  if (t) while (X->on_cpu)
                                                    cpu_relax();

    context_switch(X, ..)
      finish_lock_switch(X)
        store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cputime: Fix jiffies based cputime assumption on steal accounting

The steal guest time accounting code assumes that cputime_t is based
on jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
is based on nsecs, steal_account_process_tick() passes the delta in
jiffies to account_steal_time(), which then accounts it as if it were
a value in nsecs.

As a result, accounting 1 second of steal time (with HZ=100 that would
be 100 jiffies) is spuriously accounted as 100 nsecs. As such,
/proc/stat may report 0 values of steal time even when two guests have
run concurrently for a few seconds on the same host and same CPU.

In order to fix this, let's convert the nsecs based steal delta to
cputime instead of jiffies by using the right conversion API.

Given that the steal time is stored in cputime_t and this type can
have a smaller granularity than nsecs, we only account the rounded
converted value and leave the remaining nsecs for the next deltas.

Reported-by: Huiqingding <[email protected]>
Reported-by: Marcelo Tosatti <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
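
A minimal sketch of the remainder-carrying conversion described above,
assuming HZ=100 and a hypothetical per-CPU remainder variable; not the
kernel implementation. Only whole coarse units are accounted each time
and the leftover nanoseconds are carried into the next delta, so no
steal time is lost to rounding.

  #include <stdint.h>

  #define NSEC_PER_JIFFY (1000000000ull / 100)   /* HZ = 100 assumed */

  static uint64_t steal_remainder_ns;   /* nanoseconds not yet accounted */

  /* Returns the number of whole jiffies of steal time to account now. */
  static uint64_t account_steal_delta(uint64_t steal_delta_ns)
  {
      uint64_t total = steal_delta_ns + steal_remainder_ns;
      uint64_t jiffies_to_account = total / NSEC_PER_JIFFY;

      steal_remainder_ns = total % NSEC_PER_JIFFY;  /* carry the rest */
      return jiffies_to_account;
  }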