sched: Fix bug in average nr_running and nr_iowait calculation
sched_get_nr_running_avg() returns average nr_running and nr_iowait
task count since it was last invoked. Fix several bugs in their
calculation.

* sched_update_nr_prod() needs to consider that nr_running count can
  change by more than 1 when CFS_BANDWIDTH feature is used

* sched_get_nr_running_avg() needs to sum up nr_iowait count across
  all cpus, rather than just one

* sched_get_nr_running_avg() could race with sched_update_nr_prod(),
  as a result of which it could use curr_time which is behind a cpu's
  'last_time' value. That would lead to erroneous calculation of
  average nr_running or nr_iowait.

While at it, also fix a bug in the BUG_ON() check in
sched_update_nr_prod() and remove the unnecessary nr_running
argument from that function.

Change-Id: I46737614737292fae0d7204c4648fb9b862f65b2
Signed-off-by: Srivatsa Vaddagiri <[email protected]>

Conflicts:

	kernel/sched/fair.c

Signed-off-by: franciscofranco <[email protected]>

sched: Remove one division operation in find_busiest_queue()

Remove one division operation in find_busiest_queue() by using
crosswise multiplication:

	wl_i / power_i > wl_j / power_j :=
	wl_i * power_j > wl_j * power_i
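
A minimal sketch of the rewritten comparison (illustrative only; the helper
name is hypothetical, the variable names follow the changelog):

	/*
	 * Sketch: decide which runqueue is busier by comparing
	 * load-per-capacity without a division, via cross-multiplication.
	 */
	static bool busier(unsigned long wl_i, unsigned long power_i,
			   unsigned long wl_j, unsigned long power_j)
	{
		/* wl_i / power_i > wl_j / power_j  <=>  wl_i * power_j > wl_j * power_i */
		return wl_i * power_j > wl_j * power_i;
	}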

Signed-off-by: Joonsoo Kim <[email protected]>
[ Expanded the changelog. ]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

Signed-off-by: franciscofranco <[email protected]>

sched/fair: Optimize find_busiest_queue()

Use for_each_cpu_and() and thereby avoid computing the capacity for
CPUs we know we're not interested in.
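
Roughly the shape of the change (a sketch of the iteration pattern, not the
exact hunk from this tree):

	/* Before (sketch): visit every cpu in the group, then filter. */
	for_each_cpu(i, sched_group_cpus(group)) {
		if (!cpumask_test_cpu(i, env->cpus))
			continue;
		/* ... compute capacity and weighted load for cpu i ... */
	}

	/* After (sketch): visit only cpus present in both masks. */
	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
		/* ... compute capacity and weighted load for cpu i ... */
	}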

Reviewed-by: Paul Turner <[email protected]>
Reviewed-by: Preeti U Murthy <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing

Only one task can replace the waker.

Signed-off-by: Kirill Tkhai <[email protected]>
CC: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/idle: Avoid spurious wakeup IPIs

Because mwait_idle_with_hints() gets called from !idle context it must
call current_clr_polling(). This however means that resched_task() is
very likely to send an IPI even when we were polling:

  CPU0					CPU1

  if (current_set_polling_and_test())
    goto out;

  __monitor(&ti->flags);
  if (!need_resched())
    __mwait(eax, ecx);
					set_tsk_need_resched(p);
					smp_mb();
out:
  current_clr_polling();
					if (!tsk_is_polling(p))
					  smp_send_reschedule(cpu);

So while it is correct (extra IPIs aren't a problem, whereas a missed
IPI would be) it is a performance problem (for some).

Avoid this issue by using fetch_or() to atomically set NEED_RESCHED
and test if POLLING_NRFLAG is set.

Since a CPU stuck in mwait is unlikely to modify the flags word,
contention on the cmpxchg is unlikely and thus we should mostly
succeed in a single go.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Nicolas Pitre <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Fix inaccurate accounting for real-time task

It is possible that rq->clock_task was not updated in put_prev_task(),
in which case we can potentially overcharge a real-time task for time
it did not run. This is because clock_task could be stale and not
represent the exact time the real-time task started running.

Fix this by forcing update of rq->clock_task when real-time task
starts running.
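
A hedged sketch of the idea (the helper name is illustrative; the real change
lives in the RT pick/put paths):

	/* Sketch: refresh rq->clock_task before stamping exec_start. */
	static void rt_start_accounting(struct rq *rq, struct task_struct *p)
	{
		update_rq_clock(rq);		/* avoid charging stale time */
		p->se.exec_start = rq->clock_task;
	}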

Change-Id: I8320bb4e47924368583127b950d987925e8e6a6c
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpumask: Fix cpumask leak in partition_sched_domains()

If doms_new is NULL, partition_sched_domains() will reset ndoms_cur
to 0, and free old sched domains with free_sched_domains(doms_cur, ndoms_cur).
As ndoms_cur is 0, the cpumask will not be freed.

Signed-off-by: Xiaotian Feng <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Implement fast idling of CPUs when the system is partially loaded

When a system is lightly loaded (i.e. no more than 1 job per cpu),
attempting to pull a job to a cpu before putting it to idle is unnecessary
and can be skipped.  This patch adds an indicator so the scheduler can know
when there is no more than 1 active job on any CPU in the system and can
skip needless job pulls.

On a 4 socket machine with a request/response kind of workload from
clients, we saw about 0.13 msec of delay when we went through a full load
balance to try to pull a job from all the other cpus.  Since only 0.1 msec
was spent processing the request and generating a response, the 0.13 msec
load balance overhead was more than the actual work being done.
This overhead can be skipped much of the time for lightly loaded systems.

With this patch, we tested with a netperf request/response workload that
has the server busy with half the cpus in a 4 socket system.  We found
the patch eliminated 75% of the load balance attempts before idling a cpu.

The overhead of setting/clearing the indicator is low, as we already gather
the necessary info while calling add_nr_running() and update_sd_lb_stats().
We switch to full load balancing immediately if any cpu gets more than
one job on its run queue in add_nr_running().  We clear the indicator,
avoiding load balancing, when we detect that no cpu has more than one job
while scanning the run queues in update_sg_lb_stats().  We are aggressive
in turning on load balancing and opportunistic in skipping it.
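
A sketch of how the indicator could be set, based on the description above
(the rd->overload field name follows the upstream patch and is an assumption
for this tree):

	static inline void add_nr_running(struct rq *rq, unsigned count)
	{
		unsigned prev_nr = rq->nr_running;

		rq->nr_running = prev_nr + count;

		/* Flag the root domain once any rq holds more than one job. */
		if (prev_nr < 2 && rq->nr_running >= 2) {
			if (!rq->rd->overload)
				rq->rd->overload = true;
		}
	}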

Signed-off-by: Tim Chen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Jason Low <[email protected]>
Cc: "Paul E.McKenney" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Peter Hurley <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <[email protected]>

(Patch adapted for 3.10)
Signed-off-by: franciscofranco <[email protected]>

sched/balancing: Prevent the reselection of a previous env.dst_cpu if some tasks are pinned

Currently it is new_dst_cpu, not dst_cpu, that is prevented from being
reselected. This can result in attempting to pull tasks to this_cpu twice.
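
A sketch of the fix (flag and label names follow the upstream code of that
era and may differ slightly in this tree): clear dst_cpu from the candidate
mask before overwriting it with new_dst_cpu, so it is dst_cpu that cannot
be reselected.

	if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
		/* Prevent dst_cpu (not new_dst_cpu) from being reselected. */
		cpumask_clear_cpu(env.dst_cpu, env.cpus);

		env.dst_rq	 = cpu_rq(env.new_dst_cpu);
		env.dst_cpu	 = env.new_dst_cpu;
		env.flags	&= ~LBF_SOME_PINNED;
		env.loop	 = 0;
		env.loop_break	 = sched_nr_migrate_break;

		/* retry load balancing towards the new destination */
		goto more_balance;
	}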

Signed-off-by: Vladimir Davydov <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/281f59b6e596c718dd565ad267fc38f5b8e5c995.1379265590.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Update rq clock before calling check_preempt_curr()

check_preempt_curr() of the fair class needs an up-to-date sched clock
value to update the runtime stats of the current task of the target's rq.

When a task is woken up, activate_task() is usually called right before
ttwu_do_wakeup() unless the task is still in the runqueue. In the latter
case we need to update the rq clock explicitly because activate_task()
isn't here to do the job for us.

Change-Id: I2dc5521223c89c0da4ad8e103d023e14ee8a6647
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Force sleep on consecutive sched_yields

If a task sched_yields to itself continuously, force the task to
sleep in sched_yield. This will lower the CPU load of this task,
thereby lowering the cpu frequency and improving power.

Added a stat variable to track how many times we sleep due to these
consecutive sched_yields. Also added sysctl knobs to control the number
of consecutive sched_yields before the sleep kicks in and the
duration of the sleep in us.

Bug 1424617

Change-Id: Ie92412b8b900365816e17237fcbd0aac6e9c94ce
Signed-off-by: Sai Gurrappadi <[email protected]>
Reviewed-on: http://git-master/r/358455
Reviewed-by: Wen Yi <[email protected]>
Reviewed-by: Peter Zu <[email protected]>
GVS: Gerrit_Virtual_Submit
Reviewed-by: Diwakar Tundlam <[email protected]>

Conflicts:

	kernel/sched/core.c

Signed-off-by: franciscofranco <[email protected]>

sched: Fix reference to stale task_struct in try_to_wake_up()

try_to_wake_up() currently drops p->pi_lock and later checks for need
to notify cpufreq governor on task migrations or wakeups. However the
woken task could exit between the time p->pi_lock is released and the
time the test for notification is run. As a result, the test for
notification could refer to an exited task. task_notify_on_migrate(p)
could thus lead to an invalid memory reference.

Fix this by running the test for notification with task's pi_lock
held.

Change-Id: I1c7a337473d2d8e79342a015a179174ce00702e1
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
[francisco: slightly changed the original commit to make it compile
and be useful at the same time]
Signed-off-by: franciscofranco <[email protected]>

sched: Fix race in idle_balance()

The scheduler main function 'schedule()' checks if there are no more tasks
on the runqueue. It then checks in idle_balance() whether a task should be
pulled into the current runqueue, assuming it will go idle otherwise.

But idle_balance() releases the rq->lock in order to look up the sched
domains and takes the lock again right after. That opens a window where
another cpu may put a task in our runqueue, so we won't actually go idle
even though we have filled in idle_stamp, thinking we will.

This patch closes the window by checking, after taking the lock again,
whether the runqueue has been modified without a task having been pulled,
so that we don't go idle right after in the __schedule() function.
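
A sketch of the check added after retaking the lock (based on the description
above; exact placement inside idle_balance() may differ):

	raw_spin_lock(&this_rq->lock);

	/*
	 * While the sched domains were browsed with the lock released,
	 * another cpu may have enqueued a task here. Report it so that
	 * __schedule() does not go idle with a runnable task waiting.
	 */
	if (this_rq->nr_running && !pulled_task)
		pulled_task = 1;

	if (pulled_task)
		this_rq->idle_stamp = 0;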

Change-Id: I1ba126feae6e488509782c63050952ba4e5fc0b8
Signed-off-by: Daniel Lezcano <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched: Avoid throttle_cfs_rq() racing with period_timer stopping

throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.

(Also add some sched_debug lines for cfs_bandwidth.)

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.
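
A sketch of the throttle-side fix (abbreviated; the two-argument
__start_cfs_bandwidth() form matches the hunk visible further below and is
an assumption about this tree's signature):

	/* Inside throttle_cfs_rq(), with cfs_b->lock held: */
	raw_spin_lock(&cfs_b->lock);
	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
	/* If the period timer cancelled itself in a race, restart it. */
	if (!cfs_b->timer_active)
		__start_cfs_bandwidth(cfs_b, false);
	raw_spin_unlock(&cfs_b->lock);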

Change-Id: I27ba9f78190ccfc2f50d1c147640d6d2aa31832f
Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock

tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
force the period timer to restart. This is not safe, because it
can lead to a deadlock, described in commit 927b54fccbf0:
"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."

Three CPUs must be involved:

  CPU0               CPU1                         CPU2
  take rq->lock      period timer fired
  ...                take cfs_b lock
  ...                ...                          tg_set_cfs_bandwidth()
  throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
  ...                distribute_cfs_runtime()     timer_active = 0
  take cfs_b->lock   wait for rq->lock            ...
  __start_cfs_bandwidth()
  {wait for timer callback
   break if timer_active == 1}

So, CPU0 and CPU1 are deadlocked.

Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
wait for period timer callbacks (ignoring cfs_b->timer_active) and
restart the timer explicitly.

Change-Id: Ib108546ea8d144fb7d2c2ae59bea6ab3632b51f4
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%[email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/balancing: Reduce the rate of needless idle load balancing

The current no_hz idle load balancer does load balancing for *all* idle cpus,
even though the next load balance for a particular idle cpu could still be
a while in the future.  This introduces a much higher load balancing rate
than necessary.  The patch changes the behavior by doing idle load balancing
on behalf of an idle cpu only when that cpu is actually due for load
balancing.

On SGI systems with over 3000 cores, the cpu responsible for idle balancing
got overwhelmed with idle balancing and introduced a lot of OS noise
into workloads.  This patch fixes the issue.
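
A sketch of the guard in nohz_idle_balance() (abbreviated; helper calls
elided since their signatures differ between this tree and upstream):

	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
			continue;

		rq = cpu_rq(balance_cpu);

		/* Only balance on behalf of cpus whose own interval expired. */
		if (time_after_eq(jiffies, rq->next_balance)) {
			/* ... update rq clock / idle cpu load and call
			 *     rebalance_domains() for balance_cpu ... */
		}

		if (time_after(this_rq->next_balance, rq->next_balance))
			this_rq->next_balance = rq->next_balance;
	}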

Signed-off-by: Tim Chen <[email protected]>
Acked-by: Russ Anderson <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Hedi Berriche <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: MichelLespinasse <[email protected]>
Cc: Peter Hurley <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/fair: Stop searching for tasks in newidle balance if there are runnable tasks

It was found that when running some workloads (such as AIM7) on large
systems with many cores, CPUs do not remain idle for long. Thus, tasks
can wake/get enqueued while doing idle balancing.

In this patch, while traversing the domains in idle balance, in
addition to checking for pulled_task, we add an extra check for
this_rq->nr_running for determining if we should stop searching for
tasks to pull. If there are runnable tasks on this rq, then we will
stop traversing the domains. This reduces the chance that idle balance
delays a task from running.

This patch resulted in approximately a 6% performance improvement when
running a Java Server workload on an 8 socket machine.
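
A sketch of the extra exit condition in the idle_balance() domain loop
(abbreviated):

	for_each_domain(this_cpu, sd) {
		/* ... try load_balance() in this domain, set pulled_task ... */

		interval = msecs_to_jiffies(sd->balance_interval);
		if (time_after(next_balance, sd->last_balance + interval))
			next_balance = sd->last_balance + interval;

		/*
		 * Stop searching once a task was pulled or tasks became
		 * runnable on this rq while we were balancing.
		 */
		if (pulled_task || this_rq->nr_running > 0)
			break;
	}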

Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

IKSWM-5880: sched/rt: really force an rq clock update in pick_next_task_rt

The original patch "sched: Fix inaccurate accounting for real-time
task" skips the update_rq_clock() call when skip_clock_update is zero/-1.

Fix this so the original patch really forces an rq clock update and the
realtime task's exec_start is accurate.
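
A hedged sketch of the follow-up (the skip_clock_update handling is an
assumption about this tree's rq fields):

	/* In pick_next_task_rt(), make the clock update unconditional: */
	rq->skip_clock_update = 0;	/* don't let a pending skip suppress it */
	update_rq_clock(rq);
	p->se.exec_start = rq->clock_task;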

Change-Id: I235917648f1631f9f037fa12e8322488dcfa08a2
Signed-off-by: Jiangli Yuan <[email protected]>
Reviewed-on: http://gerrit.mot.com/788261
SME-Granted: SME Approvals Granted
SLTApproved: Slta Waiver <[email protected]>
Tested-by: Jira Key <[email protected]>
Reviewed-by: Zhi-Ming Yuan <[email protected]>
Reviewed-by: Yi-Wei Zhao <[email protected]>
Reviewed-by: Russell Knize <[email protected]>
Reviewed-by: Christopher Fries <[email protected]>
Submit-Approved: Jira Key <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()

Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited (the task is in the middle of
being scheduled out), it could still result in corruption of the
scheduler data structures.

        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cputime: Fix jiffies based cputime assumption on steal accounting

The steal guest time accounting code assumes that cputime_t is based on
jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
is based on nsecs, steal_account_process_tick() passes the delta in
jiffies to account_steal_time() which then accounts it as if it's a
value in nsecs.

As a result, accounting 1 second of steal time (with HZ=100 that would
be 100 jiffies) is spuriously accounted as 100 nsecs.

As such /proc/stat may report 0 values of steal time even when two
guests have run concurrently for a few seconds on the same host and
same CPU.

In order to fix this, let's convert the nsecs based steal delta to
cputime instead of jiffies by using the right conversion API.

Given that the steal time is stored in cputime_t and this type can have
a smaller granularity than nsecs, we only account the rounded converted
value and leave the remaining nsecs for the next deltas.

Reported-by: Huiqingding <[email protected]>
Reported-by: Marcelo Tosatti <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
Srivatsa Vaddagiri authored and franciscofranco committed Oct 6, 2016
1 parent 29b0438 commit 3472dd8
Showing 11 changed files with 252 additions and 83 deletions.
3 changes: 2 additions & 1 deletion include/linux/sched.h
@@ -103,7 +103,7 @@ extern unsigned long nr_iowait(void);
extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);

extern void sched_update_nr_prod(int cpu, unsigned long nr, bool inc);
extern void sched_update_nr_prod(int cpu, long delta, bool inc);
extern void sched_get_nr_running_avg(int *avg, int *iowait_avg);

extern void calc_global_load(unsigned long ticks);
@@ -1070,6 +1070,7 @@ struct task_struct {
atomic_t usage;
unsigned int flags; /* per process flags, defined below */
unsigned int ptrace;
unsigned int yield_count;

#ifdef CONFIG_SMP
struct llist_node wake_entry;
3 changes: 2 additions & 1 deletion include/linux/sched/sysctl.h
@@ -59,7 +59,8 @@ extern unsigned int sysctl_sched_nr_migrate;
extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
extern unsigned int sysctl_sched_shares_window;

extern unsigned int sysctl_sched_yield_sleep_duration;
extern int sysctl_sched_yield_sleep_threshold;
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
140 changes: 109 additions & 31 deletions kernel/sched/core.c
@@ -299,6 +299,18 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

/*
* Number of sched_yield calls that result in a thread yielding
* to itself before a sleep is injected in its next sched_yield call
* Setting this to -1 will disable adding sleep in sched_yield
*/
const_debug int sysctl_sched_yield_sleep_threshold = 4;

/*
* Sleep duration in us used when sched_yield_sleep_threshold
* is exceeded.
*/
const_debug unsigned int sysctl_sched_yield_sleep_duration = 50;

/*
* Maximum possible frequency across all cpus. Task demand and cpu
@@ -520,6 +532,39 @@ static inline void init_hrtick(void)
}
#endif /* CONFIG_SCHED_HRTICK */

/*
* cmpxchg based fetch_or, macro so it works for different integer types
*/
#define fetch_or(ptr, val) \
({ typeof(*(ptr)) __old, __val = *(ptr); \
for (;;) { \
__old = cmpxchg((ptr), __val, __val | (val)); \
if (__old == __val) \
break; \
__val = __old; \
} \
__old; \
})

#ifdef TIF_POLLING_NRFLAG
/*
* Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
static bool set_nr_and_not_polling(struct task_struct *p)
{
struct thread_info *ti = task_thread_info(p);
return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
}
#else
static bool set_nr_and_not_polling(struct task_struct *p)
{
set_tsk_need_resched(p);
return true;
}
#endif

/*
* resched_task - mark a task 'to be rescheduled now'.
*
@@ -537,15 +582,14 @@ void resched_task(struct task_struct *p)
if (test_tsk_need_resched(p))
return;

set_tsk_need_resched(p);

cpu = task_cpu(p);
if (cpu == smp_processor_id())

if (cpu == smp_processor_id()) {
set_tsk_need_resched(p);
return;
}

/* NEED_RESCHED must be visible before we test polling */
smp_mb();
if (!tsk_is_polling(p))
if (set_nr_and_not_polling(p))
smp_send_reschedule(cpu);
}

@@ -1522,6 +1566,8 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)

rq = __task_rq_lock(p);
if (p->on_rq) {
/* check_preempt_curr() may use rq clock */
update_rq_clock(rq);
ttwu_do_wakeup(rq, p, wake_flags);
ret = 1;
}
@@ -1633,6 +1679,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
unsigned long flags;
int cpu, success = 0;
unsigned long src_cpu;
int notify = 0;
struct migration_notify_data mnd;

/*
* If we are going to wake up a thread waiting for CONDITION we
@@ -1653,6 +1701,25 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
goto stat;

#ifdef CONFIG_SMP
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
*
* One must be running (->on_cpu == 1) in order to remove oneself
* from the runqueue.
*
* [S] ->on_cpu = 1; [L] ->on_rq
* UNLOCK rq->lock
* RMB
* LOCK rq->lock
* [S] ->on_rq = 0; [L] ->on_cpu
*
* Pairs with the full barrier implied in the UNLOCK+LOCK on rq->lock
* from the consecutive calls to schedule(); the first switching to our
* task, the second putting it to sleep.
*/
smp_rmb();

/*
* If the owning (remote) cpu is still in the middle of schedule() with
* this task as prev, wait until its done referencing the task.
@@ -1683,28 +1750,30 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu);
stat:
ttwu_stat(p, cpu, wake_flags);
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

if (task_notify_on_migrate(p)) {
struct migration_notify_data mnd;
mnd.src_cpu = src_cpu;
mnd.dest_cpu = cpu;
mnd.load = pct_task_load(p);

/*
* Call the migration notifier with mnd for foreground task
* migrations as well as for wakeups if their load is above
* sysctl_sched_wakeup_load_threshold. This would prompt the
* cpu-boost to boost the CPU frequency on wake up of a heavy
* weight foreground task
*/
if ((src_cpu != cpu) || (mnd.load >
sysctl_sched_wakeup_load_threshold))
notify = 1;
}
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

mnd.src_cpu = src_cpu;
mnd.dest_cpu = cpu;
mnd.load = pct_task_load(p);
if (notify)
atomic_notifier_call_chain(&migration_notifier_head,
0, (void *)&mnd);

/*
* Call the migration notifier with mnd for foreground task
* migrations as well as for wakeups if their load is above
* sysctl_sched_wakeup_load_threshold. This would prompt the
* cpu-boost to boost the CPU frequency on wake up of a heavy
* weight foreground task
*/
if ((src_cpu != cpu) || (mnd.load >
sysctl_sched_wakeup_load_threshold))
atomic_notifier_call_chain(&migration_notifier_head,
0, (void *)&mnd);
}
return success;
}

@@ -3197,6 +3266,7 @@ static void __sched __schedule(void)
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
prev->yield_count = 0;
++*switch_count;

context_switch(rq, prev, next); /* unlocks the rq */
@@ -3208,8 +3278,10 @@
*/
cpu = smp_processor_id();
rq = cpu_rq(cpu);
} else
} else {
prev->yield_count++;
raw_spin_unlock_irq(&rq->lock);
}

post_schedule(rq);

@@ -3424,7 +3496,7 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
if (unlikely(!q))
return;

if (unlikely(!nr_exclusive))
if (unlikely(nr_exclusive != 1))
wake_flags = 0;

spin_lock_irqsave(&q->lock, flags);
@@ -4514,6 +4586,8 @@ SYSCALL_DEFINE0(sched_yield)
struct rq *rq = this_rq_lock();

schedstat_inc(rq, yld_count);
if (rq->curr->yield_count == sysctl_sched_yield_sleep_threshold)
schedstat_inc(rq, yield_sleep_count);
current->sched_class->yield_task(rq);

/*
@@ -4525,7 +4599,11 @@ SYSCALL_DEFINE0(sched_yield)
do_raw_spin_unlock(&rq->lock);
sched_preempt_enable_no_resched();

schedule();
if (rq->curr->yield_count == sysctl_sched_yield_sleep_threshold)
usleep_range(sysctl_sched_yield_sleep_duration,
sysctl_sched_yield_sleep_duration + 5);
else
schedule();

return 0;
}
@@ -6965,16 +7043,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}

n = ndoms_cur;
if (doms_new == NULL) {
ndoms_cur = 0;
n = 0;
doms_new = &fallback_doms;
cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
WARN_ON_ONCE(dattr_new);
}

/* Build new domains */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < ndoms_cur && !new_topology; j++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_new[i], doms_cur[j])
&& dattrs_equal(dattr_new, i, dattr_cur, j))
goto match2;
@@ -8117,8 +8196,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
/* restart the period timer (if active) to handle new period expiry */
if (runtime_enabled && cfs_b->timer_active) {
/* force a reprogram */
cfs_b->timer_active = 0;
__start_cfs_bandwidth(cfs_b);
__start_cfs_bandwidth(cfs_b, true);
}
raw_spin_unlock_irq(&cfs_b->lock);

16 changes: 11 additions & 5 deletions kernel/sched/cputime.c
@@ -269,16 +269,22 @@ static __always_inline bool steal_account_process_tick(void)
{
#ifdef CONFIG_PARAVIRT
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal, st = 0;
u64 steal;
cputime_t steal_ct;

steal = paravirt_steal_clock(smp_processor_id());
steal -= this_rq()->prev_steal_time;

st = steal_ticks(steal);
this_rq()->prev_steal_time += st * TICK_NSEC;
/*
* cputime_t may be less precise than nsecs (eg: if it's
* based on jiffies). Lets cast the result to cputime
* granularity and account the rest on the next rounds.
*/
steal_ct = nsecs_to_cputime(steal);
this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct);

account_steal_time(st);
return st;
account_steal_time(steal_ct);
return steal_ct;
}
#endif
return false;
9 changes: 9 additions & 0 deletions kernel/sched/debug.c
@@ -224,6 +224,14 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg",
atomic_read(&cfs_rq->tg->runnable_avg));
#endif
#ifdef CONFIG_CFS_BANDWIDTH
SEQ_printf(m, " .%-30s: %d\n", "tg->cfs_bandwidth.timer_active",
cfs_rq->tg->cfs_bandwidth.timer_active);
SEQ_printf(m, " .%-30s: %d\n", "throttled",
cfs_rq->throttled);
SEQ_printf(m, " .%-30s: %d\n", "throttle_count",
cfs_rq->throttle_count);
#endif
#ifdef CONFIG_CFS_BANDWIDTH
SEQ_printf(m, " .%-30s: %d\n", "tg->cfs_bandwidth.timer_active",
cfs_rq->tg->cfs_bandwidth.timer_active);
@@ -310,6 +318,7 @@ do { \
#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);

P(yld_count);
P(yield_sleep_count);

P(sched_count);
P(sched_goidle);
