From 3f5f78a8878933d6c70d98fd1d9abaa5f30358ad Mon Sep 17 00:00:00 2001
From: Junjie Wu
Date: Mon, 20 Oct 2014 11:32:22 -0700
Subject: [PATCH] cpufreq: stats: Print error messages if table initialization fails

Failing to initialize cpufreq_stats table effectively disables
cpufreq_stats. Such failures should not be silent. Print error messages
when cpufreq_stats fails to create stats table.

Change-Id: I71cc0dd8262c7c6946e169f148ae39bd8f213a96
Signed-off-by: Junjie Wu
Signed-off-by: franciscofranco

cpufreq: stats: Don't update cpufreq_stats_table if it's NULL

When cpufreq_stats_table is not initialized for policy->last_cpu in
CPUFREQ_UPDATE_POLICY_CPU callback, no updates are necessary. Current
implementation dereferences the table unconditionally, causing a crash
if the table is NULL.

Return directly in cpufreq_stats_update_policy_cpu() if
cpufreq_stats_table of policy->last_cpu is NULL.

Change-Id: Ic9ef8120557702791ba5b873532e47cc0a5dc027
Signed-off-by: Junjie Wu
Signed-off-by: franciscofranco

cpufreq: Protect against hotplug in cpufreq_register_driver()

cpufreq_register_driver() could race with CPU hotplug during bootup.
Since hotplug notification is not registered when
subsys_interface_register() is being executed, it's possible cpufreq's
view of online CPUs becomes stale before it registers for hotplug
notification.

Register for hotplug notification first and protect
subsys_interface_register() against hotplug using get/put_online_cpus().

Change-Id: I26b2908f1d167c2becc4e8664c357bb7c6162406
Signed-off-by: Junjie Wu
Signed-off-by: franciscofranco

cpufreq: Return directly in __cpufreq_get if policy is NULL

__cpufreq_get() checks whether policy->cur matches frequency returned
by cpufreq device driver and acts accordingly. However, policy could be
NULL if the CPU is being hotplugged. Current implementation crashes
when trying to dereference the NULL policy.

Return the frequency provided by cpufreq device driver directly if
policy is not available.

Change-Id: I1f2ba4028c259392bf730db37dbec2d8c5ae3fa4
Signed-off-by: Junjie Wu
Signed-off-by: franciscofranco

cpufreq: Add if cpu is online check in show

Make sure CPU is online before proceeding with any "show" ops.
Without this check, the show can race with hotplug and try to show the
details of a stale or non-existent policy.

CRs-Fixed: 689522
Change-Id: Ie791c73cb281bcfc4d722f7c8c10eee07cb11f2e
Signed-off-by: Maria Yu
Signed-off-by: franciscofranco

cpufreq: governor: Be friendly towards latency-sensitive bursty workloads

Cpufreq governors like the ondemand governor calculate the load on the
CPU periodically by employing deferrable timers. A deferrable timer
won't fire if the CPU is completely idle (and there are no other timers
to be run), in order to avoid unnecessary wakeups and thus save CPU
power.

However, the load calculation logic is agnostic to all this, and this
can lead to the problem described below.

Time (ms)  CPU 1

100        Task-A running

110        Governor's timer fires, finds load as 100% in the last
           10ms interval and increases the CPU frequency.

110.5      Task-A running

120        Governor's timer fires, finds load as 100% in the last
           10ms interval and increases the CPU frequency.

125        Task-A went to sleep. With nothing else to do, CPU 1
           went completely idle.

200        Task-A woke up and started running again.

200.5      Governor's deferred timer (which was originally programmed
           to fire at time 130) fires now. It calculates load for the
           time period 120 to 200.5, and finds the load is almost
           zero. Hence it decreases the CPU frequency to the minimum.
210        Governor's timer fires, finds load as 100% in the last
           10ms interval and increases the CPU frequency.

So, after the workload woke up and started running, the frequency was
suddenly dropped to the absolute minimum, and after that there was an
unnecessary delay of 10ms (the sampling period) to increase the CPU
frequency back to a reasonable value. And this pattern repeats for
every wake-up-from-cpu-idle for that workload. This can be quite
undesirable for latency- or response-time-sensitive bursty workloads.
So we need to fix the governor's logic to detect such
wake-up-from-cpu-idle scenarios and start the workload at a reasonably
high CPU frequency.

One extreme solution would be to fake a load of 100% in such scenarios.
But that might lead to undesirable side-effects such as frequency
spikes (which might also need voltage changes), especially if the
previous frequency happened to be very low. We just want to avoid the
stupidity of dropping the frequency to a minimum and then enduring a
needless (and long) delay before ramping it up back again.

So, let us simply carry forward the previous load - that is, let us
just pretend that the 'load' for the current time-window is the same as
the load for the previous window. That way, the frequency and voltage
will continue to be set to whatever values they were set at previously.
This means that bursty workloads will get a chance to influence the CPU
frequency at which they wake up from cpu-idle, based on their past
execution history. Thus, they might be able to avoid suffering from
slow wakeups and long response-times.

However, we should take care not to over-do this. For example, such a
"copy previous load" logic will benefit cases like this (where #
represents busy and . represents idle):

 #####.........#########.........####.........###.....

but it will be detrimental in cases like the one shown below, because
it will retain the high frequency (copied from the previous interval)
even in a mostly idle system:

 #####..............................................

(i.e., the workload finished and the remaining tasks are such that
their busy periods are smaller than the sampling interval, which causes
the timer to always get deferred. So this will make the
copy-previous-load logic copy the initial high load to subsequent idle
periods over and over again, thus keeping the frequency high
unnecessarily.)

So, we modify this copy-previous-load logic such that it is used only
once upon every wakeup-from-idle. Thus if we have 2 consecutive idle
periods, the previous load won't get blindly copied over; cpufreq will
freshly evaluate the load in the second idle interval, thus ensuring
that the system comes back to its normal state. (A simplified sketch of
this logic follows the experimental setup below.)

[ The right way to solve this whole problem is to teach the CPU
frequency governors to also track load on a per-task basis, not just a
per-CPU basis, and then use both data sources intelligently to set the
appropriate frequency on the CPUs. But that involves redesigning the
cpufreq subsystem, so this patch should make the situation bearable
until then. ]

Experimental results:
+-------------------+

I ran a modified version of ebizzy (called 'sleeping-ebizzy') that
sleeps in between its execution such that its total utilization can be
a user-defined value, say 10% or 20% (the higher the utilization
specified, the fewer the sleeps injected). This ebizzy was run with a
single thread, tied to CPU 8.
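Before the traces, here is a minimal sketch of the carry-forward-once
idea described above. This is not the patch itself (the real change
lives in dbs_check_cpu() in the diff further down); the struct and
parameter names here are purely illustrative.

/* Illustrative only: the real logic is in dbs_check_cpu() below. */
struct sample_info {
	unsigned int prev_load;	/* load seen in the previous window */
};

static unsigned int compute_load(struct sample_info *s,
				 unsigned int wall_time_us,
				 unsigned int idle_time_us,
				 unsigned int sampling_rate_us)
{
	unsigned int load;

	/*
	 * A wall time much larger than the sampling rate means the
	 * deferrable timer was deferred, i.e. the CPU just woke up from
	 * a long idle period. Reuse the previous load once instead of
	 * reporting a near-zero load.
	 */
	if (wall_time_us > (2 * sampling_rate_us) && s->prev_load) {
		load = s->prev_load;
		s->prev_load = 0;	/* destructive copy: reuse only once */
	} else {
		load = 100 * (wall_time_us - idle_time_us) / wall_time_us;
		s->prev_load = load;
	}

	return load;
}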
Behavior observed with tracing (sample taken from 40% utilization runs):
------------------------------------------------------------------------

Without patch:
~~~~~~~~~~~~~~
kworker/8:2-12137  416.335742: cpu_frequency: state=2061000 cpu_id=8
kworker/8:2-12137  416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40753  416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137  416.345744: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-12137  416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40753  416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
---------------------------------------------------------------------
      <...>-40753  416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
     <idle>-0      416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
      <...>-40753  416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137  416.505739: cpu_frequency: state=2061000 cpu_id=8
kworker/8:2-12137  416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40753  416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137  416.515742: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-12137  416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy

Observation: Ebizzy went idle at 416.402202, and started running again
at 416.502130. But cpufreq noticed the long idle period, and dropped
the frequency at 416.505739, only to increase it back again at
416.515742, realizing that the workload is in fact CPU bound. Thus
ebizzy needlessly ran at the lowest frequency for almost 13
milliseconds (almost 1 full sample period), and this pattern repeats on
every sleep-wakeup. This could hurt latency-sensitive workloads quite a
lot.

With patch:
~~~~~~~~~~~
kworker/8:2-29802  464.832535: cpu_frequency: state=2061000 cpu_id=8
---------------------------------------------------------------------
kworker/8:2-29802  464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802  464.972536: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-29802  464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
---------------------------------------------------------------------
kworker/8:2-29802  465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802  465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
     <idle>-0      465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
      <...>-40738  465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802  465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2

Observation: Ebizzy went idle at 465.035797, and started running again
at 465.240178. Since ebizzy was the only real workload running on this
CPU, cpufreq retained the frequency at 4.1 GHz throughout the run of
ebizzy, no matter how many times ebizzy slept and woke up in between.
Thus, ebizzy got the benefit of 10ms worth of 4.1 GHz during every
sleep-wakeup (as compared to the run without the patch), and this boost
gave a modest improvement in total throughput, as shown below.

Sleeping-ebizzy records-per-second:
-----------------------------------

Utilization  Without patch  With patch  Difference (Absolute and % values)
    10%         274767        277046        +  2279 (+0.829%)
    20%         543429        553484        + 10055 (+1.850%)
    40%        1090744       1107959        + 17215 (+1.578%)
    60%        1634908       1662018        + 27110 (+1.658%)

A rudimentary and only approximately latency-sensitive workload such as
sleeping-ebizzy itself showed a consistent, noticeable performance
improvement with this patch. Hence, workloads that are truly
latency-sensitive will benefit quite a bit from this change. Moreover,
this is an overall win-win since this patch does not hurt power-savings
at all (because this patch does not reduce the idle time or idle
residency; and the high frequency of the CPU when it goes to cpu-idle
does not affect/hurt the power-savings of deep idle states).

Signed-off-by: Srivatsa S. Bhat
Reviewed-by: Gautham R. Shenoy
Acked-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
Signed-off-by: franciscofranco

cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled

When a CPU is hot removed we'll cancel all the delayed work items via
gov_cancel_work(). Sometimes the delayed work function determines that
it should adjust the delay for all other CPUs that the policy is
managing. If this scenario occurs, the canceling CPU will cancel its
own work but queue up the other CPUs' works to run.

Commit 3617f2 (cpufreq: Fix timer/workqueue corruption due to double
queueing) tried to fix this, but reading governor_enabled is not
protected by cpufreq_governor_lock. Even though od_dbs_timer() checks
governor_enabled before gov_queue_work(), this scenario may occur. For
example:

CPU0                                    CPU1
----                                    ----
cpu_down()
 ...
 __cpufreq_remove_dev()                 od_dbs_timer()
  __cpufreq_governor()                   policy->governor_enabled
   policy->governor_enabled = false;
   cpufreq_governor_dbs()
    case CPUFREQ_GOV_STOP:
     gov_cancel_work(dbs_data, policy);
      cpu0 work is canceled
       timer is canceled
       cpu1 work is canceled
                                         gov_queue_work(*, *, true);
                                          cpu0 work queued
                                          cpu1 work queued
                                          cpu2 work queued
                                          ...
       cpu1 work is canceled
       cpu2 work is canceled
       ...

At the end of the GOV_STOP case cpu0 still has a work queued to run,
although the code is expecting all of the works to be canceled.
__cpufreq_remove_dev() will then proceed to re-initialize all the other
CPUs' works except for the CPU that is going down.
The CPUFREQ_GOV_START case in cpufreq_governor_dbs() will trample over
the queued work and debugobjects will spit out a warning:

WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x14
Modules linked in:
CPU: 1 PID: 1205 Comm: sh Tainted: G W 3.10.0 #200
[] (unwind_backtrace+0x0/0xf8) from [] (show_stack+0x10/0x14)
[] (show_stack+0x10/0x14) from [] (warn_slowpath_common+0x4c/0x68)
[] (warn_slowpath_common+0x4c/0x68) from [] (warn_slowpath_fmt+0x30/0x40)
[] (warn_slowpath_fmt+0x30/0x40) from [] (debug_print_object+0x94/0xbc)
[] (debug_print_object+0x94/0xbc) from [] (__debug_object_init+0xc8/0x3c0)
[] (__debug_object_init+0xc8/0x3c0) from [] (init_timer_key+0x20/0x104)
[] (init_timer_key+0x20/0x104) from [] (cpufreq_governor_dbs+0x1dc/0x68c)
[] (cpufreq_governor_dbs+0x1dc/0x68c) from [] (__cpufreq_governor+0x80/0x1b0)
[] (__cpufreq_governor+0x80/0x1b0) from [] (__cpufreq_remove_dev.isra.12+0x22c/0x380)
[] (__cpufreq_remove_dev.isra.12+0x22c/0x380) from [] (cpufreq_cpu_callback+0x48/0x5c)
[] (cpufreq_cpu_callback+0x48/0x5c) from [] (notifier_call_chain+0x44/0x84)
[] (notifier_call_chain+0x44/0x84) from [] (__cpu_notify+0x2c/0x48)
[] (__cpu_notify+0x2c/0x48) from [] (_cpu_down+0x80/0x258)
[] (_cpu_down+0x80/0x258) from [] (cpu_down+0x28/0x3c)
[] (cpu_down+0x28/0x3c) from [] (store_online+0x30/0x74)
[] (store_online+0x30/0x74) from [] (dev_attr_store+0x18/0x24)
[] (dev_attr_store+0x18/0x24) from [] (sysfs_write_file+0x100/0x180)
[] (sysfs_write_file+0x100/0x180) from [] (vfs_write+0xbc/0x184)
[] (vfs_write+0xbc/0x184) from [] (SyS_write+0x40/0x68)
[] (SyS_write+0x40/0x68) from [] (ret_fast_syscall+0x0/0x48)

In gov_queue_work(), take cpufreq_governor_lock before checking
governor_enabled and queuing the work, and release it after
__gov_queue_work(). This way, governor_enabled is guaranteed not to
change while gov_queue_work() runs.

Signed-off-by: Jane Li
Acked-by: Viresh Kumar
Reviewed-by: Dmitry Torokhov
Signed-off-by: Rafael J. Wysocki
Signed-off-by: franciscofranco

cpufreq: remove race while accessing cur_policy

While accessing cur_policy during the CPUFREQ_GOV_START,
CPUFREQ_GOV_STOP and CPUFREQ_GOV_LIMITS events, the same mutex,
dbs_data->mutex, is not taken, which leads to a race and data
corruption while running a continuous suspend/resume test. This is seen
with the ondemand governor during a suspend/resume test using rtcwake.
Unable to handle kernel NULL pointer dereference at virtual address 00000028
pgd = ed610000
[00000028] *pgd=adf11831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] PREEMPT SMP ARM
Modules linked in: nvhost_vi
CPU: 1 PID: 3243 Comm: rtcwake Not tainted 3.10.24-gf5cf9e5 #1
task: ee708040 ti: ed61c000 task.ti: ed61c000
PC is at cpufreq_governor_dbs+0x400/0x634
LR is at cpufreq_governor_dbs+0x3f8/0x634
pc : []    lr : []    psr: 600f0013
sp : ed61dcb0  ip : 000493e0  fp : c1cc14f0
r10: 00000000  r9 : 00000000  r8 : 00000000
r7 : eb725280  r6 : c1cc1560  r5 : eb575200  r4 : ebad7740
r3 : ee708040  r2 : ed61dca8  r1 : 001ebd24  r0 : 00000000
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5387d  Table: ad61006a  DAC: 00000015
[] (cpufreq_governor_dbs+0x400/0x634) from [] (__cpufreq_governor+0x98/0x1b4)
[] (__cpufreq_governor+0x98/0x1b4) from [] (__cpufreq_set_policy+0x250/0x320)
[] (__cpufreq_set_policy+0x250/0x320) from [] (cpufreq_update_policy+0xcc/0x168)
[] (cpufreq_update_policy+0xcc/0x168) from [] (cpu_freq_notify+0x68/0xdc)
[] (cpu_freq_notify+0x68/0xdc) from [] (notifier_call_chain+0x4c/0x8c)
[] (notifier_call_chain+0x4c/0x8c) from [] (__blocking_notifier_call_chain+0x50/0x68)
[] (__blocking_notifier_call_chain+0x50/0x68) from [] (blocking_notifier_call_chain+0x20/0x28)
[] (blocking_notifier_call_chain+0x20/0x28) from [] (pm_qos_update_bounded_target+0xd8/0x310)
[] (pm_qos_update_bounded_target+0xd8/0x310) from [] (__pm_qos_update_request+0x64/0x70)
[] (__pm_qos_update_request+0x64/0x70) from [] (tegra_pm_notify+0x114/0x134)
[] (tegra_pm_notify+0x114/0x134) from [] (notifier_call_chain+0x4c/0x8c)
[] (notifier_call_chain+0x4c/0x8c) from [] (__blocking_notifier_call_chain+0x50/0x68)
[] (__blocking_notifier_call_chain+0x50/0x68) from [] (blocking_notifier_call_chain+0x20/0x28)
[] (blocking_notifier_call_chain+0x20/0x28) from [] (pm_notifier_call_chain+0x1c/0x34)
[] (pm_notifier_call_chain+0x1c/0x34) from [] (enter_state+0xec/0x128)
[] (enter_state+0xec/0x128) from [] (pm_suspend+0x38/0xa4)
[] (pm_suspend+0x38/0xa4) from [] (state_store+0x70/0xc0)
[] (state_store+0x70/0xc0) from [] (kobj_attr_store+0x14/0x20)
[] (kobj_attr_store+0x14/0x20) from [] (sysfs_write_file+0x104/0x184)
[] (sysfs_write_file+0x104/0x184) from [] (vfs_write+0xd0/0x19c)
[] (vfs_write+0xd0/0x19c) from [] (SyS_write+0x4c/0x78)
[] (SyS_write+0x4c/0x78) from [] (ret_fast_syscall+0x0/0x30)
Code: e1a00006 eb084346 e59b0020 e5951024 (e5903028)
---[ end trace 0488523c8f6b0f9d ]---

Signed-off-by: Bibek Basu
Acked-by: Viresh Kumar
Cc: 3.11+ # 3.11+
Signed-off-by: Rafael J. Wysocki
Signed-off-by: franciscofranco

cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'

'copy_prev_load' was recently added by commit 18b46ab (cpufreq:
governor: Be friendly towards latency-sensitive bursty workloads). It
is actually a bit redundant, as we also have 'prev_load', which can
store any integer value and can be used instead of 'copy_prev_load' by
setting it to zero.

The true load can also turn out to be zero during long idle intervals
(and hence the actual value of 'prev_load' and the overloaded value can
clash). However, this is not a problem because, if the true load was
really zero in the previous interval, it makes sense to evaluate the
load afresh for the current interval rather than copying the previous
load.

So, drop 'copy_prev_load' and use 'prev_load' instead. Update the
comments as well to make this clearer.
There is another change here which was probably missed by Srivatsa
during the last round of updates he made: the unlikely() in the 'if'
statement covered only half of the condition, whereas the whole
condition should come under it. This also silences checkpatch, which
was reporting the following (with the --strict option):

CHECK: Alignment should match open parenthesis
+		if (unlikely(wall_time > (2 * sampling_rate) &&
+			j_cdbs->prev_load)) {

Signed-off-by: Viresh Kumar
Reviewed-by: Srivatsa S. Bhat
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki
Signed-off-by: franciscofranco

cpufreq: Break out early when frequency equals target_freq

Many drivers keep the frequencies in their frequency table in ascending
or descending order. When the governor tries to change to policy->min
or policy->max respectively, cpufreq_frequency_table_target() can then
return on the first iteration, saving some iteration cycles.

So, break out early when a frequency in cpufreq_frequency_table equals
the target one. Testing this during a kernel compilation, using the
ondemand governor with a frequency table in ascending order,
cpufreq_frequency_table_target() returned early on the first iteration
in about 30% of the calls.

Signed-off-by: Stratos Karafotis
Acked-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
Signed-off-by: franciscofranco

Revert "cpufreq: make the "scaling_cur_freq" sysfs entry pollable"

This reverts commit ee974af4c9a77eb1be717eb0934c1815681e3689.

Junjie says: sysfs_notify of scaling_cur_freq is no longer used by
userspace components. Since sysfs_notify contends on a single mutex, it
could add a long delay to CPU frequency scaling. Remove the
sysfs_notify() to speed up CPU frequency scaling.

Signed-off-by: franciscofranco

qcom-cpufreq: Restore CPU frequency during resume

qcom-cpufreq blocks CPU frequency change requests during suspend,
because its dependencies might be suspended. Thus a frequency change
request would fail silently, and the CPU clock won't change until the
first frequency update is requested after the system comes out of
suspend. This creates a period during which the thermal driver cannot
perform frequency mitigation, even though policy->min/max have been
correctly updated.

Check each online CPU's policy during resume to correct any frequency
violation as soon as possible.

Change-Id: I3be79cf91e7d5e361314020c9806b770823c0b72
Signed-off-by: Junjie Wu
Signed-off-by: franciscofranco

cpufreq: Check current frequency in device driver

__cpufreq_driver_target() checks if policy->cur is the same as
target_freq without holding any lock. This function is used by the
governor to directly set the CPU frequency. A governor calling this
function can't hold any cpufreq framework locks due to the possibility
of deadlock. However, this results in a race condition where one thread
could see a stale policy->cur while another thread is changing the CPU
frequency.

Thread A: Governor calls __cpufreq_driver_target(), starts increasing
the frequency but hasn't sent out the CPUFREQ_POSTCHANGE notification
yet.
Thread B: Some other driver (could be thermal mitigation) starts
limiting the frequency using cpufreq_update_policy(). All limits are
applied to policy->min/max, and the final policy->max happens to be the
same as policy->cur. __cpufreq_driver_target() simply returns 0.
Thread A: The governor finishes scaling, and now policy->cur violates
policy->max; this can last until the next CPU frequency scaling
happens.

Shifting the responsibility of checking policy->cur against target_freq
to the cpufreq device driver resolves the race, as long as the device
driver holds a common mutex, as sketched below.
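As a rough illustration of what moving the check into the device driver
looks like (the real change is in msm_cpufreq_target() in the
qcom-cpufreq.c hunk of the diff below; example_driver_target(),
driver_lock and set_hardware_frequency() are made-up names, not real
kernel APIs):

/* Illustrative only: assumes one mutex already serializes scaling. */
static DEFINE_MUTEX(driver_lock);	/* hypothetical */

static int example_driver_target(struct cpufreq_policy *policy,
				 unsigned int target_freq,
				 unsigned int relation)
{
	int ret = 0;

	mutex_lock(&driver_lock);

	/* Compared under the same lock used for scaling, so no race. */
	if (target_freq == policy->cur)
		goto out;

	/* hypothetical helper that programs the clock hardware */
	ret = set_hardware_frequency(policy, target_freq, relation);
out:
	mutex_unlock(&driver_lock);
	return ret;
}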
Change-Id: I6f943228e793a4a4300c58b3ae0143e09ed01d7d
Signed-off-by: Junjie Wu
Signed-off-by: franciscofranco

IKSWL-14278 cpufreq: set policy with user_policy data

When we update/set a cpufreq policy, we should set the new policy with
the user_policy data, and the cpufreq notifier will then limit the
min/max frequency accordingly. This is the right way to update the
cpufreq policy. But we miss it in the store_scaling_governor function,
which causes setting the governor to fail if policy.max is less than
user_policy.min.

Change-Id: Ie21593e146b8d06fa017c87434e85ff33456c20f
Signed-off-by: Lianwei Wang
Reviewed-on: http://gerrit.mot.com/746985
SLTApproved: Slta Waiver
SME-Granted: SME Approvals Granted
Tested-by: Jira Key
Reviewed-by: Ke Lu
Reviewed-by: Ravi Chebolu
Reviewed-by: Christopher Fries
Submit-Approved: Jira Key
Signed-off-by: franciscofranco

cpufreq: Use correct locking for cpufreq_cpu_data

Use the write lock when updating cpufreq_cpu_data, and the read lock
when getting the policy pointer.

CRs-Fixed: 689522
Change-Id: I454f0d575157b3411d369e04097386f50aeaaa1c
Signed-off-by: Maria Yu
Signed-off-by: Francisco Franco

cpufreq: Disable light-weight init/teardown during suspend/resume

Light-weight init/teardown was introduced to preserve file permissions
and reduce suspend/resume latency. However, it doesn't work correctly
if multiple CPUs controlled by the same policy can all go offline.

Suspend and resume removes and adds back non-boot CPUs in the same
order. Say CPU2 and CPU3 are both online when suspend starts. CPU2 goes
offline first, handing the policy and sysfs over to CPU3. Then CPU3
goes offline. Due to the light-weight teardown, CPU3 still owns the
policy and sysfs files. When CPU2 comes online after resume, it calls
update_policy_cpu() to take back policy ownership, but sysfs is not
touched. Now CPU2 is online, with an active policy, but no sysfs
folder. In addition, when CPU3 comes online, it will attempt to create
a symlink while it still owns the real sysfs folder.

To really fix this issue, special handling for sysfs and symlinks is
required during suspend and resume. This requires non-trivial
refactoring of certain functions. Temporarily disable light-weight
init/teardown until a solution is found.

Change-Id: I485483244954129fa405bc5ef1a5e0da5c05a7d5
Signed-off-by: Junjie Wu
Signed-off-by: Francisco Franco

cpufreq: Save state for entire cluster when the last CPU goes offline

When the last CPU within a cluster goes offline, its cpufreq limits and
the current governor are saved so that it can start with the same state
(cpufreq min/max limits and scaling governor) when it comes back online
later. However, if a different CPU from the same cluster comes online
after this, it will restore the state that it saved when it went
offline, which can be different from the state of the CPU that was
offlined last. This can leave the system in an incorrect (i.e., not the
latest) state, since the rest of the CPUs in the same cluster brought
back online later will be associated with the CPU that first came
online.

To prevent such a scenario, save the state for all CPUs in a cluster
when the last CPU goes offline (since the last CPU holds the latest
updates), so that it can be properly restored by whichever CPU comes
online first in that cluster, as sketched below.
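A minimal sketch of that idea follows; the real implementation is the
update_related_cpus() helper added to cpufreq.c in the diff below, and
'saved_policy' here merely stands in for the per-CPU save area (it is
not an actual kernel symbol):

static void save_cluster_state(struct cpufreq_policy *policy)
{
	unsigned int j;

	/* Save limits and governor for every CPU sharing this policy. */
	for_each_cpu(j, policy->related_cpus) {
		per_cpu(saved_policy, j).min = policy->user_policy.min;
		per_cpu(saved_policy, j).max = policy->user_policy.max;
		strlcpy(per_cpu(saved_policy, j).gov,
			policy->governor->name, CPUFREQ_NAME_LEN);
	}
}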
Change-Id: I67cd6bb219b7cc4fd18507ffb9b43ca37dcf0ae7 Signed-off-by: Rohit Gupta Signed-off-by: Francisco Franco cpufreq: force all cores to use the same governor If BCL is enabled, and under certain conditions cpu2 and cpu3 are pegged to max freq when they come back online (if they've been offlined by BCL). If this happens during boot cpu2 and cpu3 lose their policy governor reference (I don't know why) and simply fall back to the default performance governor. Signed-off-by: Francisco Franco Nexus 6: user voltage control Signed-off-by: flar2 Signed-off-by: franciscofranco --- arch/arm/mach-msm/Kconfig | 6 ++ drivers/clk/qcom/clock-krait-8974.c | 71 +++++++++++++++++ drivers/cpufreq/cpufreq.c | 117 +++++++++++++++++++--------- drivers/cpufreq/cpufreq_governor.c | 79 ++++++++++++++++++- drivers/cpufreq/cpufreq_governor.h | 9 +++ drivers/cpufreq/cpufreq_stats.c | 11 ++- drivers/cpufreq/freq_table.c | 8 +- drivers/cpufreq/qcom-cpufreq.c | 67 +++++++++++++++- 8 files changed, 323 insertions(+), 45 deletions(-) diff --git a/arch/arm/mach-msm/Kconfig b/arch/arm/mach-msm/Kconfig index 83053b388372..a07708819615 100644 --- a/arch/arm/mach-msm/Kconfig +++ b/arch/arm/mach-msm/Kconfig @@ -768,6 +768,12 @@ config MSM_DEVFREQ_CPUBW agnostic interface to so that some of the devfreq governors can be shared across SoCs. +config MSM_CPU_VOLTAGE_CONTROL + bool "Userspace CPU Voltage Control" + default y + help + This enables userspace CPU Voltage Control. + config MSM_AVS_HW bool "Enable Adaptive Voltage Scaling (AVS)" default n diff --git a/drivers/clk/qcom/clock-krait-8974.c b/drivers/clk/qcom/clock-krait-8974.c index e59bc2d587da..a1cf5f82e71b 100644 --- a/drivers/clk/qcom/clock-krait-8974.c +++ b/drivers/clk/qcom/clock-krait-8974.c @@ -28,11 +28,14 @@ #include #include #include +#include #include #include "clock.h" + + /* Clock inputs coming into Krait subsystem */ DEFINE_FIXED_DIV_CLK(hfpll_src_clk, 1, NULL); DEFINE_FIXED_DIV_CLK(acpu_aux_clk, 2, NULL); @@ -693,6 +696,74 @@ module_param_string(table_name, table_name, sizeof(table_name), S_IRUGO); static unsigned int pvs_config_ver; module_param(pvs_config_ver, uint, S_IRUGO); +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL +#define CPU_VDD_MAX 1200 +#define CPU_VDD_MIN 600 + +extern int use_for_scaling(unsigned int freq); +static unsigned int cnt; + +ssize_t show_UV_mV_table(struct cpufreq_policy *policy, + char *buf) +{ + int i, freq, len = 0; + unsigned int cpu = 0; + unsigned int num_levels = cpu_clk[cpu]->vdd_class->num_levels; + + if (!buf) + return -EINVAL; + + for (i = 0; i < num_levels; i++) { + freq = use_for_scaling(cpu_clk[cpu]->fmax[i] / 1000); + if (freq < 0) + continue; + + len += sprintf(buf + len, "%dmhz: %u mV\n", freq / 1000, + cpu_clk[cpu]->vdd_class->vdd_uv[i] / 1000); + } + + return len; +} + +ssize_t store_UV_mV_table(struct cpufreq_policy *policy, + char *buf, size_t count) +{ + int i, j; + int ret = 0; + unsigned int val, cpu = 0; + unsigned int num_levels = cpu_clk[cpu]->vdd_class->num_levels; + char size_cur[4]; + + if (cnt) { + cnt = 0; + return -EINVAL; + } + + for (i = 0; i < num_levels; i++) { + if (use_for_scaling(cpu_clk[cpu]->fmax[i] / 1000) < 0) + continue; + + ret = sscanf(buf, "%u", &val); + if (!ret) + return -EINVAL; + + if (val > CPU_VDD_MAX) + val = CPU_VDD_MAX; + else if (val < CPU_VDD_MIN) + val = CPU_VDD_MIN; + + for (j = 0; j < NR_CPUS; j++) + cpu_clk[j]->vdd_class->vdd_uv[i] = val * 1000; + + ret = sscanf(buf, "%s", size_cur); + cnt = strlen(size_cur); + buf += cnt + 1; + } + + return ret; +} +#endif + static int 
clock_krait_8974_driver_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 221d73b50874..a3f60754d17f 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -39,7 +39,7 @@ static struct cpufreq_driver *cpufreq_driver; static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data_fallback); static DEFINE_RWLOCK(cpufreq_driver_lock); -static DEFINE_MUTEX(cpufreq_governor_lock); +DEFINE_MUTEX(cpufreq_governor_lock); static LIST_HEAD(cpufreq_policy_list); #ifdef CONFIG_HOTPLUG_CPU @@ -305,10 +305,8 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy, trace_cpu_frequency(freqs->new, freqs->cpu); srcu_notifier_call_chain(&cpufreq_transition_notifier_list, CPUFREQ_POSTCHANGE, freqs); - if (likely(policy) && likely(policy->cpu == freqs->cpu)) { + if (likely(policy) && likely(policy->cpu == freqs->cpu)) policy->cur = freqs->new; - sysfs_notify(&policy->kobj, NULL, "scaling_cur_freq"); - } break; } } @@ -504,6 +502,9 @@ static ssize_t store_scaling_governor(struct cpufreq_policy *policy, &new_policy.governor)) return -EINVAL; + new_policy.min = new_policy.user_policy.min; + new_policy.max = new_policy.user_policy.max; + ret = cpufreq_set_policy(policy, &new_policy); policy->user_policy.policy = policy->policy; @@ -625,6 +626,14 @@ static ssize_t show_bios_limit(struct cpufreq_policy *policy, char *buf) return sprintf(buf, "%u\n", policy->cpuinfo.max_freq); } +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL +extern ssize_t show_UV_mV_table(struct cpufreq_policy *policy, + char *buf); + +extern ssize_t store_UV_mV_table(struct cpufreq_policy *policy, + const char *buf, size_t count); +#endif + cpufreq_freq_attr_ro_perm(cpuinfo_cur_freq, 0400); cpufreq_freq_attr_ro(cpuinfo_min_freq); cpufreq_freq_attr_ro(cpuinfo_max_freq); @@ -639,6 +648,9 @@ cpufreq_freq_attr_rw(scaling_min_freq); cpufreq_freq_attr_rw(scaling_max_freq); cpufreq_freq_attr_rw(scaling_governor); cpufreq_freq_attr_rw(scaling_setspeed); +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL +cpufreq_freq_attr_rw(UV_mV_table); +#endif static struct attribute *default_attrs[] = { &cpuinfo_min_freq.attr, @@ -652,6 +664,9 @@ static struct attribute *default_attrs[] = { &scaling_driver.attr, &scaling_available_governors.attr, &scaling_setspeed.attr, +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL + &UV_mV_table.attr, +#endif NULL }; @@ -662,10 +677,15 @@ static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf) { struct cpufreq_policy *policy = to_policy(kobj); struct freq_attr *fattr = to_attr(attr); - ssize_t ret; + ssize_t ret = -EINVAL; + + get_online_cpus(); + + if (!cpu_online(policy->cpu)) + goto unlock; if (!down_read_trylock(&cpufreq_rwsem)) - return -EINVAL; + goto unlock; down_read(&policy->rwsem); @@ -676,7 +696,8 @@ static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf) up_read(&policy->rwsem); up_read(&cpufreq_rwsem); - +unlock: + put_online_cpus(); return ret; } @@ -1200,6 +1221,27 @@ static int cpufreq_nominate_new_policy_cpu(struct cpufreq_policy *policy, return cpu_dev->id; } +#ifdef CONFIG_HOTPLUG_CPU +static void update_related_cpus(struct cpufreq_policy *policy) +{ + unsigned int j; + + for_each_cpu(j, policy->related_cpus) { + if (!cpufreq_driver->setpolicy) + strlcpy(per_cpu(cpufreq_policy_save, j).gov, + policy->governor->name, CPUFREQ_NAME_LEN); + per_cpu(cpufreq_policy_save, j).min = policy->user_policy.min; + 
per_cpu(cpufreq_policy_save, j).max = policy->user_policy.max; + pr_debug("Saving CPU%d user policy min %d and max %d\n", + j, policy->user_policy.min, policy->user_policy.max); + } +} +#else +static void update_related_cpus(struct cpufreq_policy *policy) +{ +} +#endif + static int __cpufreq_remove_dev_prepare(struct device *dev, struct subsys_interface *sif, bool frozen) @@ -1235,20 +1277,13 @@ static int __cpufreq_remove_dev_prepare(struct device *dev, } } -#ifdef CONFIG_HOTPLUG_CPU - if (!cpufreq_driver->setpolicy) - strlcpy(per_cpu(cpufreq_policy_save, cpu).gov, - policy->governor->name, CPUFREQ_NAME_LEN); - per_cpu(cpufreq_policy_save, cpu).min = policy->user_policy.min; - per_cpu(cpufreq_policy_save, cpu).max = policy->user_policy.max; - pr_debug("Saving CPU%d user policy min %d and max %d\n", - cpu, policy->user_policy.min, policy->user_policy.max); -#endif - down_read(&policy->rwsem); cpus = cpumask_weight(policy->cpus); up_read(&policy->rwsem); + if (cpus == 1) + update_related_cpus(policy); + if (cpu != policy->cpu) { sysfs_remove_link(&dev->kobj, "cpufreq"); } else if (cpus > 1) { @@ -1275,10 +1310,10 @@ static int __cpufreq_remove_dev_finish(struct device *dev, unsigned long flags; struct cpufreq_policy *policy; - read_lock_irqsave(&cpufreq_driver_lock, flags); + write_lock_irqsave(&cpufreq_driver_lock, flags); policy = per_cpu(cpufreq_cpu_data, cpu); per_cpu(cpufreq_cpu_data, cpu) = NULL; - read_unlock_irqrestore(&cpufreq_driver_lock, flags); + write_unlock_irqrestore(&cpufreq_driver_lock, flags); if (!policy) { pr_debug("%s: No cpu_data found\n", __func__); @@ -1444,14 +1479,22 @@ EXPORT_SYMBOL(cpufreq_quick_get_max); static unsigned int __cpufreq_get(unsigned int cpu) { - struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_data, cpu); + struct cpufreq_policy *policy; unsigned int ret_freq = 0; + unsigned long flags; if (!cpufreq_driver->get) return ret_freq; + read_lock_irqsave(&cpufreq_driver_lock, flags); + policy = per_cpu(cpufreq_cpu_data, cpu); + read_unlock_irqrestore(&cpufreq_driver_lock, flags); + ret_freq = cpufreq_driver->get(cpu); + if (!policy) + return ret_freq; + if (ret_freq && policy->cur && !(cpufreq_driver->flags & CPUFREQ_CONST_LOOPS)) { /* verify no discrepancy between actual and @@ -1473,12 +1516,17 @@ static unsigned int __cpufreq_get(unsigned int cpu) */ unsigned int cpufreq_get(unsigned int cpu) { - struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_data, cpu); + struct cpufreq_policy *policy; unsigned int ret_freq = 0; + unsigned long flags; if (cpufreq_disabled() || !cpufreq_driver) return -ENOENT; + read_lock_irqsave(&cpufreq_driver_lock, flags); + policy = per_cpu(cpufreq_cpu_data, cpu); + read_unlock_irqrestore(&cpufreq_driver_lock, flags); + BUG_ON(!policy); if (!down_read_trylock(&cpufreq_rwsem)) @@ -1696,15 +1744,6 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy, pr_debug("target for CPU %u: %u kHz, relation %u, requested %u kHz\n", policy->cpu, target_freq, relation, old_target_freq); - /* - * This might look like a redundant call as we are checking it again - * after finding index. But it is left intentionally for cases where - * exactly same freq is called again and so we can save on few function - * calls. 
- */ - if (target_freq == policy->cur) - return 0; - if (cpufreq_driver->target) retval = cpufreq_driver->target(policy, target_freq, relation); else if (cpufreq_driver->target_index) { @@ -1956,6 +1995,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, struct cpufreq_policy *new_policy) { int ret = 0, failed = 1; + struct cpufreq_policy *cpu0_policy = NULL; pr_debug("setting new policy for CPU %u: %u - %u kHz\n", new_policy->cpu, new_policy->min, new_policy->max); @@ -1997,6 +2037,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, policy->max = new_policy->max; trace_cpu_frequency_limits(policy->max, policy->min, policy->cpu); + if (new_policy->cpu) + cpu0_policy = cpufreq_cpu_get(0); + pr_debug("new min and max freqs are %u - %u kHz\n", policy->min, policy->max); @@ -2021,7 +2064,10 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, } /* start new governor */ - policy->governor = new_policy->governor; + if (new_policy->cpu && cpu0_policy) + policy->governor = cpu0_policy->governor; + else + policy->governor = new_policy->governor; if (!__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_INIT)) { if (!__cpufreq_governor(policy, CPUFREQ_GOV_START)) { failed = 0; @@ -2120,9 +2166,6 @@ static int cpufreq_cpu_callback(struct notifier_block *nfb, dev = get_cpu_device(cpu); if (dev) { - if (action & CPU_TASKS_FROZEN) - frozen = true; - switch (action & ~CPU_TASKS_FROZEN) { case CPU_ONLINE: __cpufreq_add_dev(dev, NULL, frozen); @@ -2189,7 +2232,11 @@ int cpufreq_register_driver(struct cpufreq_driver *driver_data) cpufreq_driver = driver_data; write_unlock_irqrestore(&cpufreq_driver_lock, flags); + register_hotcpu_notifier(&cpufreq_cpu_notifier); + + get_online_cpus(); ret = subsys_interface_register(&cpufreq_interface); + put_online_cpus(); if (ret) goto err_null_driver; @@ -2212,13 +2259,13 @@ int cpufreq_register_driver(struct cpufreq_driver *driver_data) } } - register_hotcpu_notifier(&cpufreq_cpu_notifier); pr_debug("driver %s up and running\n", driver_data->name); return 0; err_if_unreg: subsys_interface_unregister(&cpufreq_interface); err_null_driver: + unregister_hotcpu_notifier(&cpufreq_cpu_notifier); write_lock_irqsave(&cpufreq_driver_lock, flags); cpufreq_driver = NULL; write_unlock_irqrestore(&cpufreq_driver_lock, flags); diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index e6be63561fa6..1b44496b2d2b 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -36,14 +36,29 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) struct od_dbs_tuners *od_tuners = dbs_data->tuners; struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; struct cpufreq_policy *policy; + unsigned int sampling_rate; unsigned int max_load = 0; unsigned int ignore_nice; unsigned int j; - if (dbs_data->cdata->governor == GOV_ONDEMAND) + if (dbs_data->cdata->governor == GOV_ONDEMAND) { + struct od_cpu_dbs_info_s *od_dbs_info = + dbs_data->cdata->get_cpu_dbs_info_s(cpu); + + /* + * Sometimes, the ondemand governor uses an additional + * multiplier to give long delays. So apply this multiplier to + * the 'sampling_rate', so as to keep the wake-up-from-idle + * detection logic a bit conservative. 
+ */ + sampling_rate = od_tuners->sampling_rate; + sampling_rate *= od_dbs_info->rate_mult; + ignore_nice = od_tuners->ignore_nice_load; - else + } else { + sampling_rate = cs_tuners->sampling_rate; ignore_nice = cs_tuners->ignore_nice_load; + } policy = cdbs->cur_policy; @@ -96,7 +111,46 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) if (unlikely(!wall_time || wall_time < idle_time)) continue; - load = 100 * (wall_time - idle_time) / wall_time; + /* + * If the CPU had gone completely idle, and a task just woke up + * on this CPU now, it would be unfair to calculate 'load' the + * usual way for this elapsed time-window, because it will show + * near-zero load, irrespective of how CPU intensive that task + * actually is. This is undesirable for latency-sensitive bursty + * workloads. + * + * To avoid this, we reuse the 'load' from the previous + * time-window and give this task a chance to start with a + * reasonably high CPU frequency. (However, we shouldn't over-do + * this copy, lest we get stuck at a high load (high frequency) + * for too long, even when the current system load has actually + * dropped down. So we perform the copy only once, upon the + * first wake-up from idle.) + * + * Detecting this situation is easy: the governor's deferrable + * timer would not have fired during CPU-idle periods. Hence + * an unusually large 'wall_time' (as compared to the sampling + * rate) indicates this scenario. + * + * prev_load can be zero in two cases and we must recalculate it + * for both cases: + * - during long idle intervals + * - explicitly set to zero + */ + if (unlikely(wall_time > (2 * sampling_rate) && + j_cdbs->prev_load)) { + load = j_cdbs->prev_load; + + /* + * Perform a destructive copy, to ensure that we copy + * the previous load only once, upon the first wake-up + * from idle. 
+ */ + j_cdbs->prev_load = 0; + } else { + load = 100 * (wall_time - idle_time) / wall_time; + j_cdbs->prev_load = load; + } if (load > max_load) max_load = load; @@ -119,8 +173,9 @@ void gov_queue_work(struct dbs_data *dbs_data, struct cpufreq_policy *policy, { int i; + mutex_lock(&cpufreq_governor_lock); if (!policy->governor_enabled) - return; + goto out_unlock; if (!all_cpus) { /* @@ -135,6 +190,9 @@ void gov_queue_work(struct dbs_data *dbs_data, struct cpufreq_policy *policy, for_each_cpu(i, policy->cpus) __gov_queue_work(i, dbs_data, delay); } + +out_unlock: + mutex_unlock(&cpufreq_governor_lock); } EXPORT_SYMBOL_GPL(gov_queue_work); @@ -314,11 +372,18 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy, for_each_cpu(j, policy->cpus) { struct cpu_dbs_common_info *j_cdbs = dbs_data->cdata->get_cpu_cdbs(j); + unsigned int prev_load; j_cdbs->cpu = j; j_cdbs->cur_policy = policy; j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_cpu_wall, io_busy); + + prev_load = (unsigned int) + (j_cdbs->prev_cpu_wall - j_cdbs->prev_cpu_idle); + j_cdbs->prev_load = 100 * prev_load / + (unsigned int) j_cdbs->prev_cpu_wall; + if (ignore_nice) j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE]; @@ -362,6 +427,11 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy, break; case CPUFREQ_GOV_LIMITS: + mutex_lock(&dbs_data->mutex); + if (!cpu_cdbs->cur_policy) { + mutex_unlock(&dbs_data->mutex); + break; + } mutex_lock(&cpu_cdbs->timer_mutex); if (policy->max < cpu_cdbs->cur_policy->cur) __cpufreq_driver_target(cpu_cdbs->cur_policy, @@ -371,6 +441,7 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy, policy->min, CPUFREQ_RELATION_L); dbs_check_cpu(dbs_data, cpu); mutex_unlock(&cpu_cdbs->timer_mutex); + mutex_unlock(&dbs_data->mutex); break; } return 0; diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h index b5f2b8618949..cc401d147e72 100644 --- a/drivers/cpufreq/cpufreq_governor.h +++ b/drivers/cpufreq/cpufreq_governor.h @@ -134,6 +134,13 @@ struct cpu_dbs_common_info { u64 prev_cpu_idle; u64 prev_cpu_wall; u64 prev_cpu_nice; + /* + * Used to keep track of load in the previous interval. However, when + * explicitly set to zero, it is used as a flag to ensure that we copy + * the previous load to the current interval only once, upon the first + * wake-up from idle. 
+ */ + unsigned int prev_load; struct cpufreq_policy *cur_policy; struct delayed_work work; /* @@ -257,6 +264,8 @@ static ssize_t show_sampling_rate_min_gov_pol \ return sprintf(buf, "%u\n", dbs_data->min_sampling_rate); \ } +extern struct mutex cpufreq_governor_lock; + void dbs_check_cpu(struct dbs_data *dbs_data, int cpu); bool need_load_eval(struct cpu_dbs_common_info *cdbs, unsigned int sampling_rate); diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c index ecc619ca09d0..47c6393d22a9 100644 --- a/drivers/cpufreq/cpufreq_stats.c +++ b/drivers/cpufreq/cpufreq_stats.c @@ -364,12 +364,16 @@ static int __cpufreq_stats_create_table(struct cpufreq_policy *policy, if (per_cpu(cpufreq_stats_table, cpu)) return -EBUSY; stat = kzalloc(sizeof(*stat), GFP_KERNEL); - if ((stat) == NULL) + if ((stat) == NULL) { + pr_err("Failed to alloc cpufreq_stats table\n"); return -ENOMEM; + } ret = sysfs_create_group(&policy->kobj, &stats_attr_group); - if (ret) + if (ret) { + pr_err("Failed to create cpufreq_stats sysfs\n"); goto error_out; + } stat->cpu = cpu; per_cpu(cpufreq_stats_table, cpu) = stat; @@ -384,6 +388,7 @@ static int __cpufreq_stats_create_table(struct cpufreq_policy *policy, stat->time_in_state = kzalloc(alloc_size, GFP_KERNEL); if (!stat->time_in_state) { ret = -ENOMEM; + pr_err("Failed to alloc cpufreq_stats table\n"); goto error_alloc; } stat->freq_table = (unsigned int *)(stat->time_in_state + count); @@ -418,6 +423,8 @@ static void cpufreq_stats_update_policy_cpu(struct cpufreq_policy *policy) struct cpufreq_stats *stat = per_cpu(cpufreq_stats_table, policy->last_cpu); + if (!stat) + return; pr_debug("Updating stats_table for new_cpu %u from last_cpu %u\n", policy->cpu, policy->last_cpu); per_cpu(cpufreq_stats_table, policy->cpu) = per_cpu(cpufreq_stats_table, diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c index 3458d27f63b4..12e2ad0e41cd 100644 --- a/drivers/cpufreq/freq_table.c +++ b/drivers/cpufreq/freq_table.c @@ -135,9 +135,13 @@ int cpufreq_frequency_table_target(struct cpufreq_policy *policy, continue; if ((freq < policy->min) || (freq > policy->max)) continue; + if (freq == target_freq) { + optimal.driver_data = i; + break; + } switch (relation) { case CPUFREQ_RELATION_H: - if (freq <= target_freq) { + if (freq < target_freq) { if (freq >= optimal.frequency) { optimal.frequency = freq; optimal.driver_data = i; @@ -150,7 +154,7 @@ int cpufreq_frequency_table_target(struct cpufreq_policy *policy, } break; case CPUFREQ_RELATION_L: - if (freq >= target_freq) { + if (freq > target_freq) { if (freq <= optimal.frequency) { optimal.frequency = freq; optimal.driver_data = i; diff --git a/drivers/cpufreq/qcom-cpufreq.c b/drivers/cpufreq/qcom-cpufreq.c index 322d8f02f1f8..1fe527169962 100644 --- a/drivers/cpufreq/qcom-cpufreq.c +++ b/drivers/cpufreq/qcom-cpufreq.c @@ -47,6 +47,9 @@ static struct clk *l2_clk; static unsigned int freq_index[NR_CPUS]; static unsigned int max_freq_index; static struct cpufreq_frequency_table *freq_table; +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL +static struct cpufreq_frequency_table *krait_freq_table; +#endif static unsigned int *l2_khz; static bool is_sync; static unsigned long *mem_bw; @@ -167,7 +170,7 @@ static int msm_cpufreq_target(struct cpufreq_policy *policy, unsigned int target_freq, unsigned int relation) { - int ret = -EFAULT; + int ret = 0; int index; struct cpufreq_frequency_table *table; @@ -175,6 +178,9 @@ static int msm_cpufreq_target(struct cpufreq_policy *policy, 
mutex_lock(&per_cpu(cpufreq_suspend, policy->cpu).suspend_mutex); + if (target_freq == policy->cur) + goto done; + if (per_cpu(cpufreq_suspend, policy->cpu).device_suspended) { pr_debug("cpufreq: cpu%d scheduling frequency change " "in suspend.\n", policy->cpu); @@ -363,12 +369,35 @@ static int msm_cpufreq_suspend(void) static int msm_cpufreq_resume(void) { - int cpu; + int cpu, ret; + struct cpufreq_policy policy; for_each_possible_cpu(cpu) { per_cpu(cpufreq_suspend, cpu).device_suspended = 0; } + /* + * Freq request might be rejected during suspend, resulting + * in policy->cur violating min/max constraint. + * Correct the frequency as soon as possible. + */ + get_online_cpus(); + for_each_online_cpu(cpu) { + ret = cpufreq_get_policy(&policy, cpu); + if (ret) + continue; + if (policy.cur <= policy.max && policy.cur >= policy.min) + continue; + ret = cpufreq_update_policy(cpu); + if (ret) + pr_info("cpufreq: Current frequency violates policy min/max for CPU%d\n", + cpu); + else + pr_info("cpufreq: Frequency violation fixed for CPU%d\n", + cpu); + } + put_online_cpus(); + return NOTIFY_DONE; } @@ -495,6 +524,20 @@ static int cpufreq_parse_dt(struct device *dev) freq_table[i].driver_data = i; freq_table[i].frequency = CPUFREQ_TABLE_END; +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL + /* Create frequence table with unrounded values */ + krait_freq_table = devm_kzalloc(dev, (nf + 1) * sizeof(*krait_freq_table), + GFP_KERNEL); + if (!krait_freq_table) + return -ENOMEM; + + *krait_freq_table = *freq_table; + + for (i = 0, j = 0; i < nf; i++, j += 3) + krait_freq_table[i].frequency = data[j]; + krait_freq_table[i].frequency = CPUFREQ_TABLE_END; +#endif + devm_kfree(dev, data); return 0; @@ -536,6 +579,26 @@ const struct file_operations msm_cpufreq_fops = { }; #endif +#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL +int use_for_scaling(unsigned int freq) +{ + unsigned int i, cpu_freq; + + if (!krait_freq_table) + return -EINVAL; + + for (i = 0; krait_freq_table[i].frequency != CPUFREQ_TABLE_END; i++) { + cpu_freq = krait_freq_table[i].frequency; + if (cpu_freq == CPUFREQ_ENTRY_INVALID) + continue; + if (freq == cpu_freq) + return freq; + } + + return -EINVAL; +} +#endif + static int __init msm_cpufreq_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev;