cpufreq: stats: Print error messages if table initialization fails

Failing to initialize the cpufreq_stats table effectively disables
cpufreq_stats. Such failures should not be silent. Print error
messages when cpufreq_stats fails to create the stats table.

Change-Id: I71cc0dd8262c7c6946e169f148ae39bd8f213a96
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
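
The cpufreq_stats diff itself is not among the files shown below, so as
a rough sketch of the intent (the allocation site and message text here
are illustrative, not taken from the actual patch):

	stat = kzalloc(sizeof(*stat), GFP_KERNEL);
	if (!stat) {
		/* Don't fail silently: stats stay disabled from here on. */
		pr_err("cpufreq stats: failed to allocate stats table for cpu%d\n",
		       cpu);
		return -ENOMEM;
	}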

cpufreq: stats: Don't update cpufreq_stats_table if it's NULL

When cpufreq_stats_table is not initialized for policy->last_cpu in the
CPUFREQ_UPDATE_POLICY_CPU callback, no updates are necessary. The
current implementation dereferences the table unconditionally, causing
a crash if the table is NULL.

Return directly in cpufreq_stats_update_policy_cpu() if
cpufreq_stats_table of policy->last_cpu is NULL.

Change-Id: Ic9ef8120557702791ba5b873532e47cc0a5dc027
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
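
A minimal sketch of the guard described above (the function and the
per-CPU table exist under these names in cpufreq_stats.c; the rest of
the body is abbreviated):

	static void cpufreq_stats_update_policy_cpu(struct cpufreq_policy *policy)
	{
		struct cpufreq_stats *stat = per_cpu(cpufreq_stats_table,
						     policy->last_cpu);

		/* No table was ever created for last_cpu: nothing to move. */
		if (!stat)
			return;

		/* ... existing table-migration logic follows ... */
	}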

cpufreq: Protect against hotplug in cpufreq_register_driver()

cpufreq_register_driver() could race with CPU hotplug during
bootup. Since hotplug notification is not yet registered while
subsys_interface_register() is executing, cpufreq's view of the online
CPUs can become stale before it registers for hotplug notification.

Register for hotplug notification first and protect
subsys_interface_register() against hotplug using
get/put_online_cpus().

Change-Id: I26b2908f1d167c2becc4e8664c357bb7c6162406
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpufreq: Return directly in __cpufreq_get if policy is NULL

__cpufreq_get() checks whether policy->cur matches the frequency
returned by the cpufreq device driver and acts accordingly. However,
policy can be NULL if the CPU is being hotplugged, and the current
implementation crashes when trying to dereference the NULL policy.

Return the frequency provided by the cpufreq device driver directly if
the policy is not available.

Change-Id: I1f2ba4028c259392bf730db37dbec2d8c5ae3fa4
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpufreq: Add if cpu is online check in show

Make sure CPU is online before proceeding with any "show"
ops. Without this check, the show can race with hotplug
and try to show the details of a stale or non-existent
policy.

CRs-Fixed: 689522
Change-Id: Ie791c73cb281bcfc4d722f7c8c10eee07cb11f2e
Signed-off-by: Maria Yu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpufreq: governor: Be friendly towards latency-sensitive bursty workloads

Cpufreq governors like the ondemand governor calculate the load on the CPU
periodically by employing deferrable timers. A deferrable timer won't fire
if the CPU is completely idle (and there are no other timers to be run), in
order to avoid unnecessary wakeups and thus save CPU power.

However, the load calculation logic is agnostic to all this, and this can
lead to the problem described below.

Time (ms)          CPU 1

100                Task-A running

110                Governor's timer fires, finds load as 100% in the last
                   10ms interval and increases the CPU frequency.

110.5              Task-A running

120                Governor's timer fires, finds load as 100% in the last
                   10ms interval and increases the CPU frequency.

125                Task-A went to sleep. With nothing else to do, CPU 1
                   went completely idle.

200                Task-A woke up and started running again.

200.5              Governor's deferred timer (which was originally programmed
                   to fire at time 130) fires now. It calculates load for the
                   time period 120 to 200.5, and finds the load is almost zero.
                   Hence it decreases the CPU frequency to the minimum.

210                Governor's timer fires, finds load as 100% in the last
                   10ms interval and increases the CPU frequency.

So, after the workload woke up and started running, the frequency was
suddenly dropped to the absolute minimum, and after that there was an
unnecessary delay of 10ms (the sampling period) before the CPU frequency
was raised back to a reasonable value. This pattern repeats on every
wake-up-from-cpu-idle for that workload, which can be quite undesirable
for latency- or response-time sensitive bursty workloads. So we need to
fix the governor's logic to detect such wake-up-from-cpu-idle scenarios
and start the workload at a reasonably high CPU frequency.

One extreme solution would be to fake a load of 100% in such scenarios. But
that might lead to undesirable side-effects such as frequency spikes (which
might also need voltage changes) especially if the previous frequency happened
to be very low.

We just want to avoid the stupidity of dropping down the frequency to a minimum
and then enduring a needless (and long) delay before ramping it up back again.
So, let us simply carry forward the previous load - that is, let us just pretend
that the 'load' for the current time-window is the same as the load for the
previous window. That way, the frequency and voltage will continue to be set
to whatever values they were set at previously. This means that bursty workloads
will get a chance to influence the CPU frequency at which they wake up from
cpu-idle, based on their past execution history. Thus, they might be able to
avoid suffering from slow wakeups and long response-times.

However, we should take care not to over-do this. (In the original
commit message this was illustrated with ASCII timelines, where #
represented busy and . represented idle.) Such a "copy previous load"
logic benefits a bursty workload that sleeps briefly between busy
periods, but it is detrimental in a mostly idle system, because it
retains the high frequency (copied from the previous interval): the
workload has finished, and the remaining tasks have busy periods
smaller than the sampling interval, which causes the timer to always
get deferred. The copy-previous-load logic would then copy the initial
high load to subsequent idle periods over and over again, keeping the
frequency high unnecessarily.

So, we modify this copy-previous-load logic such that it is used only once
upon every wakeup-from-idle. Thus if we have 2 consecutive idle periods, the
previous load won't get blindly copied over; cpufreq will freshly evaluate the
load in the second idle interval, thus ensuring that the system comes back to
its normal state.

[ The right way to solve this whole problem is to teach the CPU frequency
governors to also track load on a per-task basis, not just a per-CPU basis,
and then use both the data sources intelligently to set the appropriate
frequency on the CPUs. But that involves redesigning the cpufreq subsystem,
so this patch should make the situation bearable until then. ]
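
As a concrete sketch, the carry-forward logic lands in the common
governor's per-CPU load calculation roughly as follows (field names
follow the commit discussion; the surrounding function is abbreviated):

	unsigned int load;

	if (unlikely(wall_time > (2 * sampling_rate) &&
		     j_cdbs->copy_prev_load)) {
		/* Woke from a long idle period: reuse the previous
		 * window's load instead of computing a near-zero one. */
		load = j_cdbs->prev_load;
		/* Use the saved value only once, so two consecutive
		 * long-idle windows don't both inherit a stale load. */
		j_cdbs->copy_prev_load = false;
	} else {
		load = 100 * (wall_time - idle_time) / wall_time;
		j_cdbs->prev_load = load;
		j_cdbs->copy_prev_load = true;
	}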

Experimental results:
+-------------------+

I ran a modified version of ebizzy (called 'sleeping-ebizzy') that sleeps in
between its execution such that its total utilization can be a user-defined
value, say 10% or 20% (the higher the specified utilization, the fewer the
injected sleeps). This ebizzy was run with a single thread, pinned to CPU 8.

Behavior observed with tracing (sample taken from 40% utilization runs):
------------------------------------------------------------------------

Without patch:
~~~~~~~~~~~~~~
kworker/8:2-12137  416.335742: cpu_frequency: state=2061000 cpu_id=8
kworker/8:2-12137  416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40753  416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137  416.345744: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-12137  416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40753  416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
<snip>  ---------------------------------------------------------------------  <snip>
      <...>-40753  416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
     <idle>-0      416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
      <...>-40753  416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137  416.505739: cpu_frequency: state=2061000 cpu_id=8
kworker/8:2-12137  416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40753  416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-12137  416.515742: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-12137  416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy

Observation: Ebizzy went idle at 416.402202, and started running again at
416.502130. But cpufreq noticed the long idle period and dropped the frequency
at 416.505739, only to increase it back again at 416.515742 after realizing
that the workload is in fact CPU bound. Thus ebizzy needlessly ran at the
lowest frequency for almost 13 milliseconds (almost 1 full sample period),
and this pattern repeats on every sleep-wakeup. This can hurt
latency-sensitive workloads quite a lot.

With patch:
~~~~~~~~~~~

kworker/8:2-29802  464.832535: cpu_frequency: state=2061000 cpu_id=8
<snip>  ---------------------------------------------------------------------  <snip>
kworker/8:2-29802  464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802  464.972536: cpu_frequency: state=4123000 cpu_id=8
kworker/8:2-29802  464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
<snip>  ---------------------------------------------------------------------  <snip>
kworker/8:2-29802  465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802  465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
     <idle>-0      465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
      <...>-40738  465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
kworker/8:2-29802  465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      <...>-40738  465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2

Observation: Ebizzy went idle at 465.035797, and started running again at
465.240178. Since ebizzy was the only real workload running on this CPU,
cpufreq retained the frequency at 4.1 GHz throughout its run, no matter how
many times ebizzy slept and woke up in between. Thus ebizzy got the 10ms
worth of 4.1 GHz benefit on every sleep-wakeup (compared with the run
without the patch), and this boost gave a modest improvement in total
throughput, as shown below.

Sleeping-ebizzy records-per-second:
-----------------------------------

Utilization  Without patch  With patch  Difference (Absolute and % values)
    10%         274767        277046        +  2279 (+0.829%)
    20%         543429        553484        + 10055 (+1.850%)
    40%        1090744       1107959        + 17215 (+1.578%)
    60%        1634908       1662018        + 27110 (+1.658%)

Even a rudimentary, only approximately latency-sensitive workload such as
sleeping-ebizzy showed a consistent, noticeable performance improvement with
this patch, so workloads that are truly latency-sensitive should benefit
quite a bit. Moreover, this is an overall win-win, since the patch does not
hurt power savings at all: it does not reduce idle time or idle residency,
and the high frequency of the CPU when it enters cpu-idle does not affect
the power savings of deep idle states.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled

When a CPU is hot removed we'll cancel all the delayed work items via
gov_cancel_work(). Sometimes the delayed work function determines that
it should adjust the delay for all other CPUs that the policy is
managing. If this scenario occurs, the canceling CPU will cancel its own
work but will queue up the other CPUs' works to run.

Commit 3617f2 (cpufreq: Fix timer/workqueue corruption due to double
queueing) tried to fix this, but reading governor_enabled is not
protected by cpufreq_governor_lock, so even though od_dbs_timer() checks
governor_enabled before gov_queue_work(), the scenario below may still
occur:

 CPU0                                        CPU1
 ----                                        ----
 cpu_down()
  ...                                        <work runs>
  __cpufreq_remove_dev()                     od_dbs_timer()
   __cpufreq_governor()                       policy->governor_enabled
    policy->governor_enabled = false;
    cpufreq_governor_dbs()
     case CPUFREQ_GOV_STOP:
      gov_cancel_work(dbs_data, policy);
       cpu0 work is canceled
        timer is canceled
        cpu1 work is canceled
        <waits for cpu1>
                                              gov_queue_work(*, *, true);
                                               cpu0 work queued
                                               cpu1 work queued
                                               cpu2 work queued
                                               ...
        cpu1 work is canceled
        cpu2 work is canceled
        ...

At the end of the GOV_STOP case cpu0 still has a work queued to
run although the code is expecting all of the works to be
canceled. __cpufreq_remove_dev() will then proceed to
re-initialize all the other CPUs works except for the CPU that is
going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs()
will trample over the queued work and debugobjects will spit out
a warning:

WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x14
Modules linked in:
CPU: 1 PID: 1205 Comm: sh Tainted: G        W    3.10.0 #200
[<c01144f0>] (unwind_backtrace+0x0/0xf8) from [<c0111d98>] (show_stack+0x10/0x14)
[<c0111d98>] (show_stack+0x10/0x14) from [<c01272cc>] (warn_slowpath_common+0x4c/0x68)
[<c01272cc>] (warn_slowpath_common+0x4c/0x68) from [<c012737c>] (warn_slowpath_fmt+0x30/0x40)
[<c012737c>] (warn_slowpath_fmt+0x30/0x40) from [<c034c640>] (debug_print_object+0x94/0xbc)
[<c034c640>] (debug_print_object+0x94/0xbc) from [<c034c7f8>] (__debug_object_init+0xc8/0x3c0)
[<c034c7f8>] (__debug_object_init+0xc8/0x3c0) from [<c01360e0>] (init_timer_key+0x20/0x104)
[<c01360e0>] (init_timer_key+0x20/0x104) from [<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c)
[<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c) from [<c04833a8>] (__cpufreq_governor+0x80/0x1b0)
[<c04833a8>] (__cpufreq_governor+0x80/0x1b0) from [<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380)
[<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380) from [<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c)
[<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c) from [<c014fb40>] (notifier_call_chain+0x44/0x84)
[<c014fb40>] (notifier_call_chain+0x44/0x84) from [<c012ae44>] (__cpu_notify+0x2c/0x48)
[<c012ae44>] (__cpu_notify+0x2c/0x48) from [<c068dd40>] (_cpu_down+0x80/0x258)
[<c068dd40>] (_cpu_down+0x80/0x258) from [<c068df40>] (cpu_down+0x28/0x3c)
[<c068df40>] (cpu_down+0x28/0x3c) from [<c068e4c0>] (store_online+0x30/0x74)
[<c068e4c0>] (store_online+0x30/0x74) from [<c03a7308>] (dev_attr_store+0x18/0x24)
[<c03a7308>] (dev_attr_store+0x18/0x24) from [<c0256fe0>] (sysfs_write_file+0x100/0x180)
[<c0256fe0>] (sysfs_write_file+0x100/0x180) from [<c01fec9c>] (vfs_write+0xbc/0x184)
[<c01fec9c>] (vfs_write+0xbc/0x184) from [<c01ff034>] (SyS_write+0x40/0x68)
[<c01ff034>] (SyS_write+0x40/0x68) from [<c010e200>] (ret_fast_syscall+0x0/0x48)

In gov_queue_work(), take cpufreq_governor_lock before checking
governor_enabled and release it after __gov_queue_work(). This
guarantees that governor_enabled cannot change while gov_queue_work()
runs.

Signed-off-by: Jane Li <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Reviewed-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
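
A sketch of gov_queue_work() with the locking described above (this is
why the diff below drops 'static' from cpufreq_governor_lock; the exact
function layout is an assumption):

	void gov_queue_work(struct dbs_data *dbs_data,
			    struct cpufreq_policy *policy,
			    unsigned int delay, bool all_cpus)
	{
		int i;

		mutex_lock(&cpufreq_governor_lock);
		if (!policy->governor_enabled)
			goto out_unlock;

		if (!all_cpus) {
			__gov_queue_work(smp_processor_id(), dbs_data, delay);
		} else {
			for_each_cpu(i, policy->cpus)
				__gov_queue_work(i, dbs_data, delay);
		}

	out_unlock:
		mutex_unlock(&cpufreq_governor_lock);
	}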

cpufreq: remove race while accessing cur_policy

cur_policy is accessed while handling the CPUFREQ_GOV_START,
CPUFREQ_GOV_STOP and CPUFREQ_GOV_LIMITS events without taking the
same mutex, dbs_data->mutex, which leads to a race and data corruption
when running a continuous suspend/resume test. This is seen with the
ondemand governor in a suspend/resume test using rtcwake.

 Unable to handle kernel NULL pointer dereference at virtual address 00000028
 pgd = ed610000
 [00000028] *pgd=adf11831, *pte=00000000, *ppte=00000000
 Internal error: Oops: 17 [#1] PREEMPT SMP ARM
 Modules linked in: nvhost_vi
 CPU: 1 PID: 3243 Comm: rtcwake Not tainted 3.10.24-gf5cf9e5 #1
 task: ee708040 ti: ed61c000 task.ti: ed61c000
 PC is at cpufreq_governor_dbs+0x400/0x634
 LR is at cpufreq_governor_dbs+0x3f8/0x634
 pc : [<c05652b8>] lr : [<c05652b0>] psr: 600f0013
 sp : ed61dcb0 ip : 000493e0 fp : c1cc14f0
 r10: 00000000 r9 : 00000000 r8 : 00000000
 r7 : eb725280 r6 : c1cc1560 r5 : eb575200 r4 : ebad7740
 r3 : ee708040 r2 : ed61dca8 r1 : 001ebd24 r0 : 00000000
 Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
 Control: 10c5387d Table: ad61006a DAC: 00000015
 [<c05652b8>] (cpufreq_governor_dbs+0x400/0x634) from [<c055f700>] (__cpufreq_governor+0x98/0x1b4)
 [<c055f700>] (__cpufreq_governor+0x98/0x1b4) from [<c0560770>] (__cpufreq_set_policy+0x250/0x320)
 [<c0560770>] (__cpufreq_set_policy+0x250/0x320) from [<c0561dcc>] (cpufreq_update_policy+0xcc/0x168)
 [<c0561dcc>] (cpufreq_update_policy+0xcc/0x168) from [<c0561ed0>] (cpu_freq_notify+0x68/0xdc)
 [<c0561ed0>] (cpu_freq_notify+0x68/0xdc) from [<c008eff8>] (notifier_call_chain+0x4c/0x8c)
 [<c008eff8>] (notifier_call_chain+0x4c/0x8c) from [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68)
 [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) from [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28)
 [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) from [<c00aac6c>] (pm_qos_update_bounded_target+0xd8/0x310)
 [<c00aac6c>] (pm_qos_update_bounded_target+0xd8/0x310) from [<c00ab3b0>] (__pm_qos_update_request+0x64/0x70)
 [<c00ab3b0>] (__pm_qos_update_request+0x64/0x70) from [<c004b4b8>] (tegra_pm_notify+0x114/0x134)
 [<c004b4b8>] (tegra_pm_notify+0x114/0x134) from [<c008eff8>] (notifier_call_chain+0x4c/0x8c)
 [<c008eff8>] (notifier_call_chain+0x4c/0x8c) from [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68)
 [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) from [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28)
 [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) from [<c00ac228>] (pm_notifier_call_chain+0x1c/0x34)
 [<c00ac228>] (pm_notifier_call_chain+0x1c/0x34) from [<c00ad38c>] (enter_state+0xec/0x128)
 [<c00ad38c>] (enter_state+0xec/0x128) from [<c00ad400>] (pm_suspend+0x38/0xa4)
 [<c00ad400>] (pm_suspend+0x38/0xa4) from [<c00ac114>] (state_store+0x70/0xc0)
 [<c00ac114>] (state_store+0x70/0xc0) from [<c027b1e8>] (kobj_attr_store+0x14/0x20)
 [<c027b1e8>] (kobj_attr_store+0x14/0x20) from [<c019cd9c>] (sysfs_write_file+0x104/0x184)
 [<c019cd9c>] (sysfs_write_file+0x104/0x184) from [<c0143038>] (vfs_write+0xd0/0x19c)
 [<c0143038>] (vfs_write+0xd0/0x19c) from [<c0143414>] (SyS_write+0x4c/0x78)
 [<c0143414>] (SyS_write+0x4c/0x78) from [<c000f080>] (ret_fast_syscall+0x0/0x30)
 Code: e1a00006 eb084346 e59b0020 e5951024 (e5903028)
 ---[ end trace 0488523c8f6b0f9d ]---

Signed-off-by: Bibek Basu <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: 3.11+ <[email protected]> # 3.11+
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
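
A sketch of the fix in the CPUFREQ_GOV_LIMITS path of
cpufreq_governor_dbs() (the mutex comes from the commit text; the
clamping body is abbreviated):

	case CPUFREQ_GOV_LIMITS:
		mutex_lock(&dbs_data->mutex);
		/* GOV_STOP may already have cleared cur_policy. */
		if (!cpu_cdbs->cur_policy) {
			mutex_unlock(&dbs_data->mutex);
			break;
		}
		mutex_lock(&cpu_cdbs->timer_mutex);
		/* ... clamp cur_policy->cur to the new min/max ... */
		mutex_unlock(&cpu_cdbs->timer_mutex);
		mutex_unlock(&dbs_data->mutex);
		break;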

cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'

'copy_prev_load' was recently added by commit: 18b46ab (cpufreq: governor: Be
friendly towards latency-sensitive bursty workloads).

It actually is a bit redundant, as we also have 'prev_load', which can store
any integer value and can be used instead of 'copy_prev_load' by setting it
to zero.

True load can also turn out to be zero during long idle intervals (and hence the
actual value of 'prev_load' and the overloaded value can clash). However this is
not a problem because, if the true load was really zero in the previous
interval, it makes sense to evaluate the load afresh for the current interval
rather than copying the previous load.

So, drop 'copy_prev_load' and use 'prev_load' instead.

Update the comments as well to make them clearer.

There is another change here which was probably missed by Srivatsa in the
last round of updates: the unlikely() in the 'if' statement covered only
half of the condition, whereas the whole condition should come under it.

This also silences a checkpatch complaint (with the --strict option):

CHECK: Alignment should match open parenthesis
+		if (unlikely(wall_time > (2 * sampling_rate) &&
+						j_cdbs->prev_load)) {

Signed-off-by: Viresh Kumar <[email protected]>
Reviewed-by: Srivatsa S. Bhat <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
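
With 'copy_prev_load' gone, the earlier sketch reduces to the following
(zero doubles as the "don't reuse" flag):

	if (unlikely(wall_time > (2 * sampling_rate) &&
		     j_cdbs->prev_load)) {
		load = j_cdbs->prev_load;
		/* Zero it so the value is reused at most once per wakeup. */
		j_cdbs->prev_load = 0;
	} else {
		load = 100 * (wall_time - idle_time) / wall_time;
		j_cdbs->prev_load = load;
	}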

cpufreq: Break out early when frequency equals target_freq

Many drivers keep the frequencies in their frequency table in ascending
or descending order. When the governor tries to change to policy->min
or policy->max, respectively, cpufreq_frequency_table_target() can then
return on the first iteration, saving some iteration cycles.

So, break out early when a frequency in the cpufreq_frequency_table
equals the target one.

Testing this during a kernel compilation with the ondemand governor
and a frequency table in ascending order, cpufreq_frequency_table_target()
returned early on the first iteration about 30% of the times it was
called.

Signed-off-by: Stratos Karafotis <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
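
A sketch of the early break inside cpufreq_frequency_table_target()'s
scan loop (the 'index' field is the 3.x-era name; the relation handling
is elided):

	for (i = 0; (freq = table[i].frequency) != CPUFREQ_TABLE_END; i++) {
		if (freq == CPUFREQ_ENTRY_INVALID)
			continue;
		if (freq < policy->min || freq > policy->max)
			continue;
		/* Exact hit: no later entry can match better. */
		if (freq == target_freq) {
			optimal.index = i;
			break;
		}
		/* ... CPUFREQ_RELATION_L / CPUFREQ_RELATION_H cases ... */
	}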

Revert "cpufreq: make the "scaling_cur_freq" sysfs entry pollable"

This reverts commit ee974af.

Junjie says:
    sysfs_notify of scaling_cur_freq is no longer used by userspace
    components. Since sysfs_notify contends on a single mutex, it could
    add long delay to CPU frequency scaling.

    Remove the sysfs_notify() to speed up CPU frequency scaling.

Signed-off-by: franciscofranco <[email protected]>

qcom-cpufreq: Restore CPU frequency during resume

qcom-cpufreq blocks CPU frequency change requests during suspend, because
its dependencies might be suspended. A frequency change request would thus
fail silently, and the CPU clock won't change until the first frequency
update requested after the system comes out of suspend. This creates a
window during which the thermal driver cannot perform frequency mitigation,
even though policy->min/max have been correctly updated.

Check each online CPU's policy during resume to correct any frequency
violation as soon as possible.

Change-Id: I3be79cf91e7d5e361314020c9806b770823c0b72
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
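
A sketch of the resume-side check (the notifier name is illustrative;
cpufreq_update_policy() is the real entry point that re-applies
policy->min/max):

	static int msm_cpufreq_pm_event(struct notifier_block *nb,
					unsigned long event, void *data)
	{
		unsigned int cpu;

		switch (event) {
		case PM_POST_SUSPEND:
			/* Dependencies are awake again: correct any frequency
			 * that drifted outside policy->min/max while change
			 * requests were being blocked. */
			for_each_online_cpu(cpu)
				cpufreq_update_policy(cpu);
			break;
		}
		return NOTIFY_DONE;
	}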

cpufreq: Check current frequency in device driver

__cpufreq_driver_target() checks whether policy->cur is the same as
target_freq without holding any lock. This function is used by governors
to set the CPU frequency directly, and a governor calling it can't hold
any cpufreq framework locks, due to deadlock possibility.

However, this results in a race condition in which one thread can see a
stale policy->cur while another thread is changing the CPU frequency.

Thread A: The governor calls __cpufreq_driver_target() and starts
increasing the frequency, but hasn't sent out the CPUFREQ_POSTCHANGE
notification yet.
Thread B: Some other driver (thermal mitigation, say) starts limiting the
frequency using cpufreq_update_policy(). All limits are applied to
policy->min/max, and the final policy->max happens to be the same as
policy->cur, so __cpufreq_driver_target() simply returns 0.
Thread A: The governor finishes scaling; policy->cur now violates
policy->max and can stay that way until the next CPU frequency scaling
happens.

Shifting the responsibility of comparing policy->cur and target_freq to
the cpufreq device driver resolves the race, as long as the device
driver holds a common mutex.

Change-Id: I6f943228e793a4a4300c58b3ae0143e09ed01d7d
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
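
A sketch of the driver-side check (the lock and helper names are
illustrative; the point is that policy->cur is compared only under the
mutex that serializes every frequency change):

	static int msm_cpufreq_target(struct cpufreq_policy *policy,
				      unsigned int target_freq,
				      unsigned int relation)
	{
		int ret = 0;

		mutex_lock(&driver_lock);

		/* policy->cur cannot go stale while driver_lock is held. */
		if (policy->cur == target_freq)
			goto out;

		ret = set_cpu_freq(policy, target_freq);
	out:
		mutex_unlock(&driver_lock);
		return ret;
	}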

IKSWL-14278 cpufreq: set policy with user_policy data

When we update/set a cpufreq policy, we should seed the new policy with
the user_policy data, and the cpufreq notifier will then limit the
min/max freq accordingly.

This is the right way to update the cpufreq policy, but we miss it in
store_scaling_governor(), which causes setting the governor to fail if
policy.max is less than user_policy.min.

Change-Id: Ie21593e146b8d06fa017c87434e85ff33456c20f
Signed-off-by: Lianwei Wang <[email protected]>
Reviewed-on: http://gerrit.mot.com/746985
SLTApproved: Slta Waiver <[email protected]>
SME-Granted: SME Approvals Granted
Tested-by: Jira Key <[email protected]>
Reviewed-by: Ke Lu <[email protected]>
Reviewed-by: Ravi Chebolu <[email protected]>
Reviewed-by: Christopher Fries <[email protected]>
Submit-Approved: Jira Key <[email protected]>
Signed-off-by: franciscofranco <[email protected]>

cpufreq: Use correct locking for cpufreq_cpu_data

Use write lock when updating cpufreq_cpu_data,
and read lock when getting the policy pointer.

CRs-Fixed: 689522
Change-Id: I454f0d575157b3411d369e04097386f50aeaaa1c
Signed-off-by: Maria Yu <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

cpufreq: Disable light-weight init/teardown during suspend/resume

Light-weight init/teardown was introduced to preserve file permissions
and reduce suspend/resume latency. However, it doesn't work correctly if
multiple CPUs controlled by the same policy can all go offline.

Suspend removes, and resume adds back, non-boot CPUs in the same order.
Say CPU2 and CPU3 are both online when suspend starts. CPU2 goes offline
first, handing the policy and sysfs over to CPU3. Then CPU3 goes
offline. Due to the light-weight teardown, CPU3 still owns the policy
and sysfs files.

When CPU2 comes online after resume, it calls update_policy_cpu() to
take back the policy ownership, but sysfs is not touched. Now CPU2 is
online, with an active policy, but no sysfs folder. In addition, when
CPU3 comes online, it will attempt to create a symlink while owning the
real sysfs folder.

To really fix this issue, special handling for sysfs and symlinks is
required during suspend and resume. This requires non-trivial refactoring
of certain functions.

Temporarily disable light-weight init/teardown until a solution is found.

Change-Id: I485483244954129fa405bc5ef1a5e0da5c05a7d5
Signed-off-by: Junjie Wu <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

cpufreq: Save state for entire cluster when the last CPU goes offline

When the last CPU within a cluster goes offline, its cpufreq limits and
current governor are saved, so that the cluster can start with the same
state (cpufreq min/max limits and scaling governor) when it comes back
online later. However, if a different CPU from the same cluster comes
online afterwards, it will restore the state that *it* saved when it
went offline, which can differ from the state of the CPU offlined last.
This can leave the system in an incorrect (i.e. not the latest) state,
since the rest of the CPUs in the cluster brought back online later
will be associated with the CPU that came online first. To prevent such
a scenario, save the state of all CPUs in a cluster when the last CPU
goes offline (since the last CPU holds the latest updates), so that the
state can be properly restored by whichever CPU comes online first in
that cluster.

Change-Id: I67cd6bb219b7cc4fd18507ffb9b43ca37dcf0ae7
Signed-off-by: Rohit Gupta <[email protected]>
Signed-off-by: Francisco Franco <[email protected]>

cpufreq: force all cores to use the same governor

If BCL is enabled, then under certain conditions cpu2 and cpu3 are
pegged to max freq when they come back online (if they've been offlined
by BCL). If this happens during boot, cpu2 and cpu3 lose their policy
governor reference (I don't know why) and simply fall back to the
default performance governor.

Signed-off-by: Francisco Franco <[email protected]>

Nexus 6: user voltage control

Signed-off-by: flar2 <[email protected]>
Signed-off-by: franciscofranco <[email protected]>
Junjie Wu authored and franciscofranco committed Oct 6, 2016
1 parent 3472dd8 commit 3f5f78a
Showing 8 changed files with 323 additions and 45 deletions.
6 changes: 6 additions & 0 deletions arch/arm/mach-msm/Kconfig
@@ -768,6 +768,12 @@ config MSM_DEVFREQ_CPUBW
agnostic interface to so that some of the devfreq governors can be
shared across SoCs.

config MSM_CPU_VOLTAGE_CONTROL
bool "Userspace CPU Voltage Control"
default y
help
This enables userspace CPU Voltage Control.

config MSM_AVS_HW
bool "Enable Adaptive Voltage Scaling (AVS)"
default n
71 changes: 71 additions & 0 deletions drivers/clk/qcom/clock-krait-8974.c
@@ -28,11 +28,14 @@
#include <soc/qcom/clock-local2.h>
#include <soc/qcom/clock-krait.h>
#include <mach/mmi_soc_info.h>
#include <linux/cpufreq.h>

#include <asm/cputype.h>

#include "clock.h"



/* Clock inputs coming into Krait subsystem */
DEFINE_FIXED_DIV_CLK(hfpll_src_clk, 1, NULL);
DEFINE_FIXED_DIV_CLK(acpu_aux_clk, 2, NULL);
@@ -693,6 +696,74 @@ module_param_string(table_name, table_name, sizeof(table_name), S_IRUGO);
static unsigned int pvs_config_ver;
module_param(pvs_config_ver, uint, S_IRUGO);

#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL
#define CPU_VDD_MAX 1200
#define CPU_VDD_MIN 600

extern int use_for_scaling(unsigned int freq);
static unsigned int cnt;

ssize_t show_UV_mV_table(struct cpufreq_policy *policy,
char *buf)
{
int i, freq, len = 0;
unsigned int cpu = 0;
unsigned int num_levels = cpu_clk[cpu]->vdd_class->num_levels;

if (!buf)
return -EINVAL;

for (i = 0; i < num_levels; i++) {
freq = use_for_scaling(cpu_clk[cpu]->fmax[i] / 1000);
if (freq < 0)
continue;

len += sprintf(buf + len, "%dmhz: %u mV\n", freq / 1000,
cpu_clk[cpu]->vdd_class->vdd_uv[i] / 1000);
}

return len;
}

ssize_t store_UV_mV_table(struct cpufreq_policy *policy,
char *buf, size_t count)
{
int i, j;
int ret = 0;
unsigned int val, cpu = 0;
unsigned int num_levels = cpu_clk[cpu]->vdd_class->num_levels;
char size_cur[4];

if (cnt) {
cnt = 0;
return -EINVAL;
}

for (i = 0; i < num_levels; i++) {
if (use_for_scaling(cpu_clk[cpu]->fmax[i] / 1000) < 0)
continue;

ret = sscanf(buf, "%u", &val);
if (!ret)
return -EINVAL;

if (val > CPU_VDD_MAX)
val = CPU_VDD_MAX;
else if (val < CPU_VDD_MIN)
val = CPU_VDD_MIN;

for (j = 0; j < NR_CPUS; j++)
cpu_clk[j]->vdd_class->vdd_uv[i] = val * 1000;

ret = sscanf(buf, "%s", size_cur);
cnt = strlen(size_cur);
buf += cnt + 1;
}

return ret;
}
#endif

static int clock_krait_8974_driver_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
117 changes: 82 additions & 35 deletions drivers/cpufreq/cpufreq.c
@@ -39,7 +39,7 @@ static struct cpufreq_driver *cpufreq_driver;
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data_fallback);
static DEFINE_RWLOCK(cpufreq_driver_lock);
-static DEFINE_MUTEX(cpufreq_governor_lock);
+DEFINE_MUTEX(cpufreq_governor_lock);
static LIST_HEAD(cpufreq_policy_list);

#ifdef CONFIG_HOTPLUG_CPU
@@ -305,10 +305,8 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
trace_cpu_frequency(freqs->new, freqs->cpu);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
CPUFREQ_POSTCHANGE, freqs);
-		if (likely(policy) && likely(policy->cpu == freqs->cpu)) {
+		if (likely(policy) && likely(policy->cpu == freqs->cpu))
			policy->cur = freqs->new;
-			sysfs_notify(&policy->kobj, NULL, "scaling_cur_freq");
-		}
break;
}
}
@@ -504,6 +502,9 @@ static ssize_t store_scaling_governor(struct cpufreq_policy *policy,
&new_policy.governor))
return -EINVAL;

new_policy.min = new_policy.user_policy.min;
new_policy.max = new_policy.user_policy.max;

ret = cpufreq_set_policy(policy, &new_policy);

policy->user_policy.policy = policy->policy;
@@ -625,6 +626,14 @@ static ssize_t show_bios_limit(struct cpufreq_policy *policy, char *buf)
return sprintf(buf, "%u\n", policy->cpuinfo.max_freq);
}

#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL
extern ssize_t show_UV_mV_table(struct cpufreq_policy *policy,
char *buf);

extern ssize_t store_UV_mV_table(struct cpufreq_policy *policy,
const char *buf, size_t count);
#endif

cpufreq_freq_attr_ro_perm(cpuinfo_cur_freq, 0400);
cpufreq_freq_attr_ro(cpuinfo_min_freq);
cpufreq_freq_attr_ro(cpuinfo_max_freq);
@@ -639,6 +648,9 @@ cpufreq_freq_attr_rw(scaling_min_freq);
cpufreq_freq_attr_rw(scaling_max_freq);
cpufreq_freq_attr_rw(scaling_governor);
cpufreq_freq_attr_rw(scaling_setspeed);
#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL
cpufreq_freq_attr_rw(UV_mV_table);
#endif

static struct attribute *default_attrs[] = {
&cpuinfo_min_freq.attr,
@@ -652,6 +664,9 @@ static struct attribute *default_attrs[] = {
&scaling_driver.attr,
&scaling_available_governors.attr,
&scaling_setspeed.attr,
#ifdef CONFIG_MSM_CPU_VOLTAGE_CONTROL
&UV_mV_table.attr,
#endif
NULL
};

@@ -662,10 +677,15 @@ static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
{
struct cpufreq_policy *policy = to_policy(kobj);
struct freq_attr *fattr = to_attr(attr);
-	ssize_t ret;
+	ssize_t ret = -EINVAL;
+
+	get_online_cpus();
+
+	if (!cpu_online(policy->cpu))
+		goto unlock;

	if (!down_read_trylock(&cpufreq_rwsem))
-		return -EINVAL;
+		goto unlock;

down_read(&policy->rwsem);

@@ -676,7 +696,8 @@ static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)

up_read(&policy->rwsem);
up_read(&cpufreq_rwsem);

+unlock:
+	put_online_cpus();
return ret;
}

@@ -1200,6 +1221,27 @@ static int cpufreq_nominate_new_policy_cpu(struct cpufreq_policy *policy,
return cpu_dev->id;
}

#ifdef CONFIG_HOTPLUG_CPU
static void update_related_cpus(struct cpufreq_policy *policy)
{
unsigned int j;

for_each_cpu(j, policy->related_cpus) {
if (!cpufreq_driver->setpolicy)
strlcpy(per_cpu(cpufreq_policy_save, j).gov,
policy->governor->name, CPUFREQ_NAME_LEN);
per_cpu(cpufreq_policy_save, j).min = policy->user_policy.min;
per_cpu(cpufreq_policy_save, j).max = policy->user_policy.max;
pr_debug("Saving CPU%d user policy min %d and max %d\n",
j, policy->user_policy.min, policy->user_policy.max);
}
}
#else
static void update_related_cpus(struct cpufreq_policy *policy)
{
}
#endif

static int __cpufreq_remove_dev_prepare(struct device *dev,
struct subsys_interface *sif,
bool frozen)
@@ -1235,20 +1277,13 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
}
}

-#ifdef CONFIG_HOTPLUG_CPU
-	if (!cpufreq_driver->setpolicy)
-		strlcpy(per_cpu(cpufreq_policy_save, cpu).gov,
-			policy->governor->name, CPUFREQ_NAME_LEN);
-	per_cpu(cpufreq_policy_save, cpu).min = policy->user_policy.min;
-	per_cpu(cpufreq_policy_save, cpu).max = policy->user_policy.max;
-	pr_debug("Saving CPU%d user policy min %d and max %d\n",
-		 cpu, policy->user_policy.min, policy->user_policy.max);
-#endif

down_read(&policy->rwsem);
cpus = cpumask_weight(policy->cpus);
up_read(&policy->rwsem);

+	if (cpus == 1)
+		update_related_cpus(policy);

if (cpu != policy->cpu) {
sysfs_remove_link(&dev->kobj, "cpufreq");
} else if (cpus > 1) {
@@ -1275,10 +1310,10 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
unsigned long flags;
struct cpufreq_policy *policy;

-	read_lock_irqsave(&cpufreq_driver_lock, flags);
+	write_lock_irqsave(&cpufreq_driver_lock, flags);
	policy = per_cpu(cpufreq_cpu_data, cpu);
	per_cpu(cpufreq_cpu_data, cpu) = NULL;
-	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
+	write_unlock_irqrestore(&cpufreq_driver_lock, flags);

if (!policy) {
pr_debug("%s: No cpu_data found\n", __func__);
@@ -1444,14 +1479,22 @@ EXPORT_SYMBOL(cpufreq_quick_get_max);

static unsigned int __cpufreq_get(unsigned int cpu)
{
-	struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_data, cpu);
+	struct cpufreq_policy *policy;
	unsigned int ret_freq = 0;
+	unsigned long flags;

	if (!cpufreq_driver->get)
		return ret_freq;

+	read_lock_irqsave(&cpufreq_driver_lock, flags);
+	policy = per_cpu(cpufreq_cpu_data, cpu);
+	read_unlock_irqrestore(&cpufreq_driver_lock, flags);

	ret_freq = cpufreq_driver->get(cpu);

+	if (!policy)
+		return ret_freq;

if (ret_freq && policy->cur &&
!(cpufreq_driver->flags & CPUFREQ_CONST_LOOPS)) {
/* verify no discrepancy between actual and
@@ -1473,12 +1516,17 @@ static unsigned int __cpufreq_get(unsigned int cpu)
*/
unsigned int cpufreq_get(unsigned int cpu)
{
-	struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_data, cpu);
+	struct cpufreq_policy *policy;
	unsigned int ret_freq = 0;
+	unsigned long flags;

	if (cpufreq_disabled() || !cpufreq_driver)
		return -ENOENT;

+	read_lock_irqsave(&cpufreq_driver_lock, flags);
+	policy = per_cpu(cpufreq_cpu_data, cpu);
+	read_unlock_irqrestore(&cpufreq_driver_lock, flags);

BUG_ON(!policy);

if (!down_read_trylock(&cpufreq_rwsem))
@@ -1696,15 +1744,6 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy,
pr_debug("target for CPU %u: %u kHz, relation %u, requested %u kHz\n",
policy->cpu, target_freq, relation, old_target_freq);

-	/*
-	 * This might look like a redundant call as we are checking it again
-	 * after finding index. But it is left intentionally for cases where
-	 * exactly same freq is called again and so we can save on few function
-	 * calls.
-	 */
-	if (target_freq == policy->cur)
-		return 0;

if (cpufreq_driver->target)
retval = cpufreq_driver->target(policy, target_freq, relation);
else if (cpufreq_driver->target_index) {
@@ -1956,6 +1995,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
struct cpufreq_policy *new_policy)
{
int ret = 0, failed = 1;
struct cpufreq_policy *cpu0_policy = NULL;

pr_debug("setting new policy for CPU %u: %u - %u kHz\n", new_policy->cpu,
new_policy->min, new_policy->max);
@@ -1997,6 +2037,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
policy->max = new_policy->max;
trace_cpu_frequency_limits(policy->max, policy->min, policy->cpu);

if (new_policy->cpu)
cpu0_policy = cpufreq_cpu_get(0);

pr_debug("new min and max freqs are %u - %u kHz\n",
policy->min, policy->max);

@@ -2021,7 +2064,10 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
}

/* start new governor */
-	policy->governor = new_policy->governor;
+	if (new_policy->cpu && cpu0_policy)
+		policy->governor = cpu0_policy->governor;
+	else
+		policy->governor = new_policy->governor;
if (!__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_INIT)) {
if (!__cpufreq_governor(policy, CPUFREQ_GOV_START)) {
failed = 0;
@@ -2120,9 +2166,6 @@ static int cpufreq_cpu_callback(struct notifier_block *nfb,
dev = get_cpu_device(cpu);
if (dev) {

-		if (action & CPU_TASKS_FROZEN)
-			frozen = true;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_ONLINE:
__cpufreq_add_dev(dev, NULL, frozen);
@@ -2189,7 +2232,11 @@ int cpufreq_register_driver(struct cpufreq_driver *driver_data)
cpufreq_driver = driver_data;
write_unlock_irqrestore(&cpufreq_driver_lock, flags);

+	register_hotcpu_notifier(&cpufreq_cpu_notifier);
+
+	get_online_cpus();
	ret = subsys_interface_register(&cpufreq_interface);
+	put_online_cpus();
if (ret)
goto err_null_driver;

@@ -2212,13 +2259,13 @@ int cpufreq_register_driver(struct cpufreq_driver *driver_data)
}
}

-	register_hotcpu_notifier(&cpufreq_cpu_notifier);
pr_debug("driver %s up and running\n", driver_data->name);

return 0;
err_if_unreg:
subsys_interface_unregister(&cpufreq_interface);
err_null_driver:
+	unregister_hotcpu_notifier(&cpufreq_cpu_notifier);
write_lock_irqsave(&cpufreq_driver_lock, flags);
cpufreq_driver = NULL;
write_unlock_irqrestore(&cpufreq_driver_lock, flags);
