git.osdn.net Git - android-x86/kernel.git/log

ignore command that the wifi firmware does not support

ignore 0x242 cmd fails, they are not supported in the wifi firmware

fixes for marvell mwifiex driver

This disables power save, fixes handling of amsdu packets, and adds
checking len on cmd complete.

ACPICA: Update for generic_serial_bus and attrib_raw_process_bytes protocol

Cleanup for this write-then-read protocol. The ACPI specification
is rather unclear for the entire generic_serial_bus, but this
change works correctly on the Surface 3.

Reported-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Erik Schmauss <erik.schmauss@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

platform/x86: surface3_power: MSHW0011 rev-eng implementation

The MSHW0011 device is a chip that replaces the battery firmware
by using ACPI operation regions on the Surface 3.
It is unclear whether or not the chip will be reused somewhere else
(under Windows, the chip is called "Surface Platform Power Driver"
and the driver is provided by Microsoft).

The values have been obtained by reverse engineering, and are subject to
errors. Looks like it works on overall pretty well.

I couldn't manage to get the IRQ correctly triggered, so I am using a
good old polling thread to check for changes. This is something
to be fixed in a later version.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106231
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Stephen Just <stephenjust@gmail.com>

Revert "media: staging: atomisp: Remove driver"

We need mt9m114 for ASUS T100.

This reverts commit 51b8dc5163d2ff2bf04019f8bf7e3bd0e75bb654.

Merge remote-tracking branch 'aosp/android-mainline-tracking' into kernel-4.18

Revert "x86/signal/64: Re-add support for SS in the 64-bit signal context"

This reverts commit 6c25da5ad55d48c41b8909bc1f4e3cd5d85bb499.

Conflicts:
arch/x86/kernel/signal.c
tools/testing/selftests/x86/Makefile

ASoC: add quirk for Surface 3 with bad DMI table

Some Microsoft Surface 3 owners including me encountered a strange issue
when play Android-x86 on the tablet. The DMI table was erased due to
unknown reason and sound doesn't work in Android-x86 (but it's normal
in Windows). See more details:

https://groups.google.com/d/msg/android-x86/z6GDuvV2oWk/mzyg0RQiCAAJ

Since the DMI table is incorrect, kernel won't enable quirk for it.
To workaround such an issue, add quirk for the bad data "OEMB".
It should not affect any product with correct DMI data.

rtc: cmos: Do not export alarm rtc_ops when we do not support alarms

When there is no IRQ configured for the RTC, the rtc-cmos code does not
support alarms, all alarm rtc_ops fail with -EIO / -EINVAL.

The rtc-core expects a rtc driver which does not support rtc alarms to
not have alarm ops at all. Otherwise the wakealarm sysfs attr will read
as empty rather then returning an error, making it impossible for
userspace to find out beforehand if alarms are supported.

A system without an IRQ for the RTC before this patch:
[root@localhost ~]# cat /sys/class/rtc/rtc0/wakealarm
[root@localhost ~]#

After this patch:
[root@localhost ~]# cat /sys/class/rtc/rtc0/wakealarm
cat: /sys/class/rtc/rtc0/wakealarm: No such file or directory
[root@localhost ~]#

This fixes gnome-session + systemd trying to use suspend-then-hibernate,
which causes systemd to abort the suspend when writing the RTC alarm fails.

BugLink: https://github.com/systemd/systemd/issues/9988
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

x86/crypto: aesni - fix crash in cryptomgr_test

Fix a freeze issue in early boot stages happening with x86_64
that could be avoided only with CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y

Bisecting gives bad commit is 1476db2d129d5
("crypto: aesni - Move HashKey computation from stack to gcm_context")

Use unaligned mov instructions to avoid general protection fault

Here follows the console log of the problem:

[    1.377775] general protection fault: 0000 [#1] PREEMPT SMP
[    1.378746] CPU: 3 PID: 958 Comm: cryptomgr_test Not tainted 4.18.0-rc8-android-x86_64-g1a7fa0435ab6 #24
[    1.378746] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[    1.378746] RIP: 0010:aesni_gcm_init+0x89/0x30f
[    1.378746] Code: 0f 6f ca 66 0f 73 fa 08 66 0f 73 d9 08 66 0f eb da 66 0f 70 d1 24 66 0f 76 15 83 11 03 01 66 0f db 15 6b 11 03 01 66 0f ef da <66> 0f 7f 5e 60 66 0f 6f eb 66 0f 70 cb 4e 66 0f ef cb 66 0f 7f 8e
[    1.378746] RSP: 0000:ffff9fef010377d8 EFLAGS: 00010246
[    1.378746] RAX: ffff9fef01037b98 RBX: 0000000000000010 RCX: ffff9ab4de49d050
[    1.378746] RDX: ffff9fef01037b98 RSI: ffff9fef010378c8 RDI: ffff9ab4de49d060
[    1.378746] RBP: 0000000000000000 R08: ffff9ab4de48e000 R09: 0000000000000008
[    1.378746] R10: ffff9ab55e0e1000 R11: 0000000000000000 R12: ffff9ab4de49d050
[    1.378746] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000008
[    1.378746] FS:  0000000000000000(0000) GS:ffff9ab4e3d80000(0000) knlGS:0000000000000000
[    1.378746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.378746] CR2: 0000000000000000 CR3: 000000009980a001 CR4: 00000000000606e0
[    1.378746] Call Trace:
[    1.378746]  ? gcmaes_crypt_by_sg+0x12c/0x610
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x40/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to+0x14d/0x4a0
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? cache_alloc_refill+0x600/0x8e0
[    1.378746]  ? gcmaes_encrypt+0x1b8/0x380
[    1.378746]  ? rfc4106_set_hash_subkey+0x65/0xb0
[    1.378746]  ? aesni_enc+0xf/0x14
[    1.378746]  ? rfc4106_set_hash_subkey+0x65/0xb0
[    1.378746]  ? crypto_aead_setkey+0xa6/0xe0
[    1.378746]  ? try_to_wake_up+0x4b0/0x4b0
[    1.378746]  ? crypto_aead_setkey+0xa6/0xe0
[    1.378746]  ? helper_rfc4106_encrypt+0x91/0xc0
[    1.378746]  ? __test_aead+0xd8c/0x1290
[    1.378746]  ? __kmalloc+0x126/0x200
[    1.378746]  ? crypto_create_tfm+0x39/0xe0
[    1.378746]  ? test_aead+0x33/0xd0
[    1.378746]  ? alg_test_aead+0x4d/0xb0
[    1.378746]  ? alg_test.part.11+0xd7/0x280
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? finish_task_switch+0x90/0x240
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __schedule+0x311/0x881
[    1.378746]  ? __wake_up_common+0x86/0x120
[    1.378746]  ? cryptomgr_probe+0xe0/0xe0
[    1.378746]  ? cryptomgr_test+0x40/0x50
[    1.378746]  ? kthread+0xfa/0x130
[    1.378746]  ? __switch_to_asm+0x34/0x70
[    1.378746]  ? __kthread_parkme+0x70/0x70
[    1.378746]  ? ret_from_fork+0x35/0x40
[    1.378746] Modules linked in:
[    1.490849] ---[ end trace db6c6409ac47aa26 ]---
[    1.491659] RIP: 0010:aesni_gcm_init+0x89/0x30f
[    1.492819] Code: 0f 6f ca 66 0f 73 fa 08 66 0f 73 d9 08 66 0f eb da 66 0f 70 d1 24 66 0f 76 15 83 11 03 01 66 0f db 15 6b 11 03 01 66 0f ef da <66> 0f 7f 5e 60 66 0f 6f eb 66 0f 70 cb 4e 66 0f ef cb 66 0f 7f 8e
[    1.497069] RSP: 0000:ffff9fef010377d8 EFLAGS: 00010246
[    1.498193] RAX: ffff9fef01037b98 RBX: 0000000000000010 RCX: ffff9ab4de49d050
[    1.499901] RDX: ffff9fef01037b98 RSI: ffff9fef010378c8 RDI: ffff9ab4de49d060
[    1.502322] RBP: 0000000000000000 R08: ffff9ab4de48e000 R09: 0000000000000008
[    1.505797] R10: ffff9ab55e0e1000 R11: 0000000000000000 R12: ffff9ab4de49d050
[    1.507597] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000008
[    1.509530] FS:  0000000000000000(0000) GS:ffff9ab4e3d80000(0000) knlGS:0000000000000000
[    1.511639] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.513132] CR2: 0000000000000000 CR3: 000000009980a001 CR4: 00000000000606e0
[    1.514915] note: cryptomgr_test[958] exited with preempt_count 2

Fixes: 1476db2d129d5 ("crypto: aesni - Move HashKey computation from stack to gcm_context")
Reported-by: Mauro Rossi <issor.oruam@gmail.com>
Tested-by: Mauro Rossi <issor.oruam@gmail.com>

drm/radeon: enable ABGR and XBGR formats (v2)

Add support for DRM_FORMAT_{A,X}BGR8888 in atombios_crtc
Swapping of red and blue channels is implemented for radeon chipsets:
DCE2/R6xx and later - crossbar registers are defined and used
DCE1/R5xx - AVIVO_D1GRPH_SWAP_RB bit is used

(v2) Set AVIVO_D1GRPH_SWAP_RB bit in fb_format, using bitwise OR for DCE1 path
Use bitwise OR where required for big endian settings in fb_swap
Use existing code style CHIP_R600 condition, fix typo in R600 blue crossbar

Signed-off-by: Mauro Rossi <issor.oruam@gmail.com>

drm/amdgpu: enable ABGR and XBGR formats (v2)

Add support for DRM_FORMAT_{A,X}BGR8888 in amdgpu with amd dc disabled

(v2) Crossbar registers are defined and used to swap red and blue channels,
keeping the existing coding style in each of the dce modules.
After setting crossbar bits in fb_swap, use bitwise OR for big endian

Signed-off-by: Mauro Rossi <issor.oruam@gmail.com>

drm/amd/display: enable ABGR and XBGR formats (v4)

SURFACE_PIXEL_FORMAT_GRPH_ABGR8888 is supported in amd/display/dc/dc_hw_types.h
and the necessary crossbars register controls to swap red and blue channels
are already implemented in drm/amd/display/dc/dce/dce_mem_input.c

(v4) Logic to handle new formats is added only in amdgpu_dm module.

Signed-off-by: Mauro Rossi <issor.oruam@gmail.com>

Revert "drm/amdgpu: Don't default to DC support for Kaveri and older"

This reverts commit d9fda248046ac035f18a6e663f2f9245b4bf9470.

ALSA: x86: modify modalias to match more devices

The driver declares a modalias platform:hdmi_lpe_audio. However, in
/sys/devices/pci0000:00/0000:00:02.0/hdmi-lpe-audio/modalias of
ASUS VivoStick PC (TS10) it is platform:hdmi-lpe-audio.

Extend the modalias pattern to match this device.

Tested-by: Chih-Wei Huang <cwhuang@linux.org.tw>
Signed-off-by: Chih-Wei Huang <cwhuang@linux.org.tw>

drm: amdgpu,radeon: force amdgpu for si,cik

Current default for si_support and cik_support prefers radeon,
we set the opposite in order to prefer amdgpu,
which means that radeon becomes by default disabled for SI, CIK parts

Comments are not updated as these changes are just for testing

brcmfmac: disable power saving to stabilize wifi connectivity

From https://drive.google.com/open?id=0B4DiU2o72Fbub0U2ZzJaUzl5OEE

brcmfmac: fix setp2p error

From https://drive.google.com/open?id=0B4DiU2o72Fbub0U2ZzJaUzl5OEE

i915: pm: Be less agressive with clockfreq changes on Bay Trail

Bay Trail devices are known to hang when changing the frequency often,
this is discussed in great length in:
https://bugzilla.kernel.org/show_bug.cgi?id=109051

Commit 6067a27d1f01 ("drm/i915: Avoid tweaking evaluation thresholds
on Baytrail v3") is an attempt to workaround this. Several users in
bko109051 report that an earlier version of this patch, v1:
https://bugzilla.kernel.org/attachment.cgi?id=251471

Works better for them and they still see hangs with the merged v3.

Comparing the 2 versions shows that they are indeed not equivalent,
v1 not only skips writing the GEN6_RP* registers from valleyview_set_rps,
as v3 does. It also contained these modifications to i915_irq.c:

     if (pm_iir & GEN6_PM_RP_DOWN_EI_EXPIRED) {
         if (!vlv_c0_above(dev_priv,
                   &dev_priv->rps.down_ei, &now,
-                  dev_priv->rps.down_threshold))
+                  VLV_RP_DOWN_EI_THRESHOLD))
             events |= GEN6_PM_RP_DOWN_THRESHOLD;
         dev_priv->rps.down_ei = now;
     }

     if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
         if (vlv_c0_above(dev_priv,
                  &dev_priv->rps.up_ei, &now,
-                 dev_priv->rps.up_threshold))
+                 VLV_RP_UP_EI_THRESHOLD))
             events |= GEN6_PM_RP_UP_THRESHOLD;
         dev_priv->rps.up_ei = now;
     }

Which use less aggressive up/down thresholds, which results in less
GEN6_PM_RP_*_THRESHOLD events and thus in less calls to intel_set_rps() ->
valleyview_set_rps() -> vlv_punit_write(PUNIT_REG_GPU_FREQ_REQ).
With the last call being the likely cause of the hang.

This commit hardcodes the threshold_up and _down values for Bay Trail to
less aggressive values, reducing the amount of clock frequency changes,
thus avoiding the hangs some people are still seeing with the merged fix.

Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=109051
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

intel_idle: Disable C6N and C6S on Bay Trail

It seems that Bay Trail SoCs sometimes have issues waking from C6,
a lot of users even report Bay Trail devices only being stable
when passing intel_idle.max_cstate=1 to the kernel.

This commits disables the C6 states while leaving the C7 states
available so that the cores can still reach deep sleep states.

There are several indicators that this is part of the solution for
all the users who need to pass intel_idle.max_cstate=1:

1) The "VLP52 EOI Transactions May Not be Sent if Software
   Enters Core C6 During an Interrupt Service Routine" errata.

2) Several users who need intel_idle.max_cstate=1 indicate in bko109051
   (which has over 800 comments!) that using a shell script which
   disables C6N and C6S through sysfs allows them to remove
   intel_idle.max_cstate=1 and still have a stable system which does
   use the C7 states for power-saving.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=109051
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

drm/nouveau: expose atomic ioctl by default (kernel 4.18)

atomic module parameter current default is disabled,
we set to enabled for convenience and simplification

ANDROID: brcmfmac: Use monotonic boot time for BSS TSF (v2)

Reverts 8e6cffb3b42f "brmc80211: dont use jiffies for BSS TSF".
v2: Use monotonic boot time instead of jiffies to avoid overflow
after 5 minutes and problems after system goes into suspend.

Android uses the TSF as timestamp when scanning for WiFi networks
and discards scan results entirely if the TSF is before the time
the scan started.

Use the monotonic boot time to avoid discarding all scan results
when using brcmfmac on Android.

Fixes: 8e6cffb3b42f "brmc80211: dont use jiffies for BSS TSF"

staging: rtl8812au update Kconfig and Makefile

staging: rtl8812au: add updated driver

Cloned from https://github.com/lwfinger/rtl8812au
branch master commit 8a7beb9 (stable driver v4.2.2)
and removed .git/ folder and .gitignore file

staging: rtl8723bu update Kconfig and Makefile

staging: rtl8723bu: add updated driver

Cloned from https://github.com/lwfinger/rtl8723bu
branch master commit 8091632
removed .git/ folder and .gitignore file
forced git add for required .bin firmware files

Input: goodix - enable support for GDIX1002 parts

At least two vendors (Chuwi and Onda) have parts with GDIX1002 id

net: wireless: broadcom: wl: add patch for kernel 4.15

build.mk is modified to apply incremental patch linux-415.patch
which implements the use of timer_setup() instead of init_timer()
for kernel 4.15 and later, thus avoiding following error:

error: implicit declaration of function 'init_timer'

Similar problem arises for device/generic/common/tp_smapi/hdaps.c:782
which requires to use timer_setup() instead of init_timer()
for kernel 4.15 and later

Input: add a new driver for D-WAV MultiTouch Screen

net: wireless: broadcom: wl: add patch for kernel 4.12

build.mk is modified to apply incremental patch linux-412.patch
copied from https://aur.archlinux.org/cgit/aur.git/tree/?h=broadcom-wl

Porting of 0542e4e "add linux412.patch created by wichmannpas"

net: wireless: broadcom: wl: add patch for kernel 4.11

build.mk is modified to apply incremental patch linux-411.patch
copied from https://aur.archlinux.org/cgit/aur.git/tree/?h=broadcom-wl

Porting of 5224162 "add linux411.patch created by wichmannpas"

net: wireless: broadcom: wl: fix kernel >= 4.8 panic

linux-recent.patch was ok up till kernel 4.7
with kernel 4.8 (and 4.9) new changes are critical to avoid crash

build.mk is modified to apply incremental patch linux-48.patch
copied from https://aur.archlinux.org/cgit/aur.git/tree/?h=broadcom-wl

Reference: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=839629

v2: cat all patches to apply them once (cwhuang)

ANDROID: Add hold functionality to schedtune CPU boost

When tasks come and go from a runqueue quickly, this can lead to boost
being applied and removed quickly which sometimes means we cannot raise
the CPU frequency again when we need to (due to the rate limit on
frequency updates). This has proved to be a particular issue for RT tasks
and alternative methods have been used in the past to work around it.

This is an attempt to solve the issue for all task classes and cpufreq
governors by introducing a generic mechanism in schedtune to retain
the max boost level from task enqueue for a minimum period - defined
here as 50ms. This timeout was determined experimentally and is not
configurable.

A sched_feat guards the application of this to tasks - in the default
configuration, task boosting only applied to tasks which have RT
policy. Change SCHEDTUNE_BOOST_HOLD_ALL to true to apply it to all
tasks regardless of class.

It works like so:

Every task enqueue (in an allowed class) stores a cpu-local timestamp.
If the task is not a member of an allowed class (all or RT depending
upon feature selection), the timestamp is not updated.
The boost group will stay active regardless of tasks present until
50ms beyond the last timestamp stored. We also store the timestamp
of the active boost group to avoid unneccesarily revisiting the boost
groups when checking CPU boost level.

If the timestamp is more than 50ms in the past when we check boost then
we re-evaluate the boost groups for that CPU, taking into account the
timestamps associated with each group.

Idea based on rt-boost-retention patches from Joel.

Change-Id: I52cc2d2e82d1c5aa03550378c8836764f41630c1
Suggested-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[forward ported from android-4.9-eas-dev proposal]
(cherry picked from commit a485e8b7bf8e95759e600396feeb7bfb400b6e46)
[ - Trivial cherry-pick conflicts in include/trace/events/sched.h ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

ANDROID: Revert "ANDROID: sched/tune: Initialize raw_spin_lock in boosted_groups"

This reverts commit c5616f2f874faa20b59b116177b99bf3948586df.

If we re-init the per-cpu boostgroup spinlock every time that
we add a new boosted cgroup, we can easily wipe out (reinit)
a spinlock struct while in a critical section. We should only
be setting up the per-cpu boostgroup data, and the spin_lock
initialization need only happen once - which we're already
doing in a postcore_initcall.

For example:

     -------- CPU 0 --------   | -------- CPU1 --------
cgroupX boost group added      |
schedtune_enqueue_task         |
  acquires(bg->lock)           | cgroupY boost group added
                               |  for_each_cpu()
                               |    raw_spin_lock_init(bg->lock)
  releases(bg->lock)           |
      BUG (already unlocked)   |
                               |

This results in the following BUG from the debug spinlock code:
BUG: spinlock already unlocked on CPU#5, rcuop/6/68

Bug: 32668852

Change-Id: I3016702780b461a0cd95e26c538cd18df27d6316
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>

ANDROID: sched/rt: Add schedtune accounting to rt task enqueue/dequeue

rt tasks are currently not eligible for schedtune boosting. Make it so
by adding enqueue/dequeue hooks.

For rt tasks, schedtune only acts as a frequency boosting framework, it
has no impact on placement decisions and the prefer_idle attribute is
not used.

Also prepare schedutil use of boosted util for rt task boosting

With this change, schedtune accounting will include rt class tasks,
however boosting currently only applies to the utilization provided by
fair class tasks. Sum up the tracked CPU utilization applying boost to
the aggregate util instead - this includes RT task util in the boosting
if any tasks are runnable.

Scenario 1, considering one CPU:
1x rt task running, util 250, boost 0
1x cfs task runnable, util 250, boost 50
previous util=250+(50pct_boosted_250) = 887
new      util=50_pct_boosted_500      = 762

Scenario 2, considering one CPU:
1x rt task running, util 250, boost 50
1x cfs task runnable, util 250, boost 0
previous util=250+250                 = 500
new      util=50_pct_boosted_500      = 762

Scenario 3, considering one CPU:
1x rt task running, util 250, boost 50
1x cfs task runnable, util 250, boost 50
previous util=250+(50pct_boosted_250) = 887
new      util=50_pct_boosted_500      = 762

Scenario 4:
1x rt task running, util 250, boost 50
previous util=250                 = 250
new      util=50_pct_boosted_250  = 637

Change-Id: Ie287cbd0692468525095b5024db9faac8b2f4878
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 8e266aebf737262aeca9662254a3e61ccc7f8dec)
[ - Fixed conflicts in sugov related to PELT signals of all classes
  - Fixed conflicts related to boosted_cpu_util being in a different
    location ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

ANDROID: arm64: defconfig: Enable CONFIG_SCHED_TUNE

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Id31c1a0811a1ee45e5533d948dc2faef50624a5a

ANDROID: sched/events: Introduce overutilized trace event

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Andres Oportus <andresoportus@google.com>
(cherry picked from commit 8e45d941282039d5379f4e286e5bd0a2044e105c)
[ - Trivial cherry pick issues
- Changed commit title for consistency ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I78fb5e82223558def0cf16105c233591cda81d5c

ANDROID: sched/events: Introduce rt_rq load tracking trace event

We want to be able to track rt_rq signals same as we do for other RQs.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit 574a2d189695c334ae290f522b098f05398a3765)
[ - Fixed conflicts with the refactored RT util_avg tracking
- Changed commit title for consistency with other tracepoint patches ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I38b64baa50e1ff86019ca4b8b0a04af994880b35

ANDROID: sched/events: Introduce schedtune trace events

Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I2c43bcb37f57844a07aa36e339da00180e65b6c2

ANDROID: sched/events: Introduce find_best_target trace event

Adapated from the existing trace event from android-4.14.

Change-Id: I9785e692fb0af087c236906d7f47fed1b20690f5
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

ANDROID: sched/events: Introduce util_est trace events

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Change-Id: I65e294c454369cbc15a29370d8a13ce358a95c39

ANDROID: sched/events: Introduce task_group load tracking trace event

The trace event key load is mapped to:

(1) load : cfs_rq->tg->load_avg

The cfs_rq owned by the task_group is used as the only parameter for the
trace event because it has a reference to the taskgroup and the cpu.
Using the taskgroup as a parameter instead would require the cpu as a
second parameter. A task_group is global and not per-cpu data. The cpu
key only tells on which cpu the value was gathered.

The following list shows examples of the key=value pairs for:

(1) a task group:

cpu=1 path=/tg1/tg11/tg111 load=517

(2) an autogroup:

cpu=1 path=/autogroup-10 load=1050

We don't maintain a load signal for a root task group.

The trace event is only defined if cfs group scheduling support
(CONFIG_FAIR_GROUP_SCHED) is enabled.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I7de38e6b30a99d7c9887c94c707ded26b383b5f8

ANDROID: sched/events: Introduce sched_entity load tracking trace event

The following trace event keys are mapped to:

(1) load     : se->avg.load_avg

(2) rbl_load : se->avg.runnable_load_avg

(3) util     : se->avg.util_avg

To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
special handling is necessary for non-existent key=value pairs:

path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the
                   sched_entity represents a task.

comm = "(null)" : In case sched_entity represents a task_group.

pid = -1        : In case sched_entity represents a task_group.

The following list shows examples of the key=value pairs in different
configurations for:

(1) a task:

     cpu=0 path=(null) comm=sshd pid=2206 load=102 rbl_load=102  util=102

(2) a taskgroup:

     cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510

(3) an autogroup:

     cpu=0 path=/autogroup-13 comm=(null) pid=-1 load=49 rbl_load=49 util=48

(4) w/o CONFIG_FAIR_GROUP_SCHED:

     cpu=0 path=(null) comm=sshd pid=2211 load=301 rbl_load=301 util=265

The trace event is only defined for CONFIG_SMP.

The helper functions __trace_sched_cpu(), __trace_sched_path() and
__trace_sched_id() are extended to deal with sched_entities as well.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
[ Fixed issues related to the new pelt.c file ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Id2e4d1ddb79c13412c80e4fa4147b9df3b1e212a

ANDROID: sched/events: Introduce cfs_rq load tracking trace event

The following trace event keys are mapped to:

(1) load     : cfs_rq->avg.load_avg

(2) rbl_load : cfs_rq->avg.runnable_load_avg

(2) util     : cfs_rq->avg.util_avg

To let this trace event work for configurations w/ and w/o group
scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following
special handling is necessary for a non-existent key=value pair:

path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED.

The following list shows examples of the key=value pairs in different
configurations for:

(1) a root task_group:

     cpu=4 path=/ load=6 rbl_load=6 util=331

(2) a task_group:

     cpu=1 path=/tg1/tg11/tg111 load=538 rbl_load=538 util=522

(3) an autogroup:

     cpu=3 path=/autogroup-18 load=997 rbl_load=997 util=517

(4) w/o CONFIG_FAIR_GROUP_SCHED:

     cpu=0 path=(null) load=314 rbl_load=314 util=289

The trace event is only defined for CONFIG_SMP.

The helper function __trace_sched_path() can be used to get the length
parameter of the dynamic array (path == NULL) and to copy the path into
it (path != NULL).

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
[ Fixed issues related to the new pelt.c file ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I1107044c52b74ecb3df69f3a45c1e530f0e59b1b

ANDROID: sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG

Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If
CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be
available to be printed in the load tracking trace events provided by
this patch-stack regardless whether CONFIG_SCHED_DEBUG is set or not.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Change-Id: Iecf5466fc837f428ee545ddabe70024c152ff38d

ANDROID: drivers: Introduce a legacy Energy Model loading driver

The Energy Aware Scheduler (EAS) used to rely on statically defined
Energy Models (EMs) in the device tree. Now that EAS uses the EM
framework, the old-style EMs are not usable by default.

To address this issue, introduce a driver able to read DT-based EMs and
to load them in the EM framework, hence making them available to EAS.
Since EAS now uses only the active costs of CPUs, the idle cost and
cluster cost of the old EM are ignored. The driver can be compiled in
using the CONFIG_LEGACY_ENERGY_MODEL_DT Kconfig option (off by default).

The implementation of the driver is highly inspired by the EM loading
code from android-4.14 and before (written by Robin Randhawa
<robin.randhawa@arm.com>), and the arch_topology driver (Juri Lelli
<juri.lelli@redhat.com>).

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I4f525dfb45113ba63f01aaf8e1e809ae6b34dd52

ANDROID: cpufreq: scmi: Register an Energy Model

The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scmi-cpufreq driver which uses
the power costs provided by the firmware.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ie1f2d8da73c8305b8c13f17c26bed331ac415484

FROMLIST: firmware: arm_scmi: add a getter for power of performance states

The SCMI protocol can be used to get power estimates from firmware
corresponding to each performance state of a device. Although these power
costs are already managed by the SCMI firmware driver, they are not
exposed to any external subsystem yet.

Fix this by adding a new get_power() interface to the exisiting perf_ops
defined for the SCMI protocol.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I2ba654a59e9633dcb2ce4128dbd99f2317d0ece8

ANDROID: sched/fair: Also do misfit in overloaded groups

If we can classify the group as overloaded, that overrides
any classification as misfit but we may still have misfit
tasks present. Check the rq we're looking at to see if
this is the case.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[Removed stray reference to rq_has_misfit]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Change-Id: Ida8eb66aa625e34de3fe2ee1b0dd8a78926273d8

ANDROID: sched/fair: Don't balance misfits if it would overload local group

When load balancing in a system with misfit tasks present, if we always
pull a misfit task to the local group this can lead to pulling a running
task from a smaller capacity CPUs to a bigger CPU which is busy. In this
situation, the pulled task is likely not to get a chance to run before
an idle balance on another small CPU pulls it back. This penalises the
pulled task as it is stopped for a short amount of time and then likely
relocated to a different CPU (since the original CPU just did a NEWLY_IDLE
balance and reset the periodic interval).

If we only do this unconditionally for NEWLY_IDLE balance, we can be
sure that any tasks and load which are present on the local group are
related to short-running tasks which we are happy to displace for a
longer running task in a system with misfit tasks present.

However, other balance types should only pull a task if we think
that the local group is underutilized - checking the number of tasks
gives us a conservative estimate here since if they were short tasks
we would have been doing NEWLY_IDLE balances instead.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I710add1ab1139482620b6addc8370ad194791beb

ANDROID: sched/fair: Attempt to improve throughput for asym cap systems

In some systems the capacity and group weights line up to defeat all the
small imbalance correction conditions in fix_small_imbalance, which can
cause bad task placement. Add a new condition if the existing code can't
see anything to fix:

If we have asymmetric capacity, and there are more tasks than CPUs in
the busiest group *and* there are less tasks than CPUs in the local group
then we try to pull something. There could be transient small tasks which
prevent this from working, but on the whole it is beneficial for those
systems with inconvenient capacity/cluster size relationships.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Icf81cde215c082a61f816534b7990ccb70aee409

ANDROID: cpufreq/schedutil: Fix integration of schedtune CPU margin

Schedtune enables to boost CPUs to higher frequencies when boosted tasks
are running. However, this isn't accounted in schedutil yet, which means
that schedtune's margin is ignored when requesting OPPs.

Fix this by calling boosted_cpu_util() instead of cpu_util_cfs() from
schedutil.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Idd20f124c0c683b41502ad02ff80395d583c2077

ANDROID: cpufreq/schedutil: add up/down frequency transition rate limits

The rate-limit tunable in the schedutil governor applies to transitions
to both lower and higher frequencies. On several platforms it is not the
ideal tunable though, as it is difficult to get best power/performance
figures using the same limit in both directions.

It is common on mobile platforms with demanding user interfaces to want
to increase frequency rapidly for example but decrease slowly.

One of the example can be a case where we have short busy periods
followed by similar or longer idle periods. If we keep the rate-limit
high enough, we will not go to higher frequencies soon enough. On the
other hand, if we keep it too low, we will have too many frequency
transitions, as we will always reduce the frequency after the busy
period.

It would be very useful if we can set low rate-limit while increasing
the frequency (so that we can respond to the short busy periods quickly)
and high rate-limit while decreasing frequency (so that we don't reduce
the frequency immediately after the short busy period and that may avoid
frequency transitions before the next busy period).

Implement separate up/down transition rate limits. Note that the
governor avoids frequency recalculations for a period equal to minimum
of up and down rate-limit. A global mutex is also defined to protect
updates to min_rate_limit_us via two separate sysfs files.

Note that this wouldn't change behavior of the schedutil governor for
the platforms which wish to keep same values for both up and down rate
limits.

This is tested with the rt-app [1] on ARM Exynos, dual A15 processor
platform.

Testcase: Run a SCHED_OTHER thread on CPU0 which will emulate work-load
for X ms of busy period out of the total period of Y ms, i.e. Y - X ms
of idle period. The values of X/Y taken were: 20/40, 20/50, 20/70, i.e
idle periods of 20, 30 and 50 ms respectively. These were tested against
values of up/down rate limits as: 10/10 ms and 10/40 ms.

For every test we noticed a performance increase of 5-10% with the
schedutil governor, which was very much expected.

[Viresh]: Simplified user interface and introduced min_rate_limit_us +
mutex, rewrote commit log and included test results.

[1] https://github.com/scheduler-tools/rt-app/

Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
(applied from https://marc.info/?l=linux-kernel&m=147936011103832&w=2)
[trivial adaptations]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[updated rate limiting & fixed conflicts]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 50c26fdb74563ec0cb4a83373d42667f4e83a23e)
[Trivial cherry-pick conflicts]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I18720a83855b196b8e21dcdc8deae79131635b84

ANDROID: cpufreq/schedutil: Select frequency using util_avg for RT

Schedutil always requests max frequency whenever a RT task is running.
Now that we have a better estimate of the utilization of RT runqueues,
it is possible to make a less conservative decision and scale frequency
according to the needs of the RT tasks.

To do so, protect the RT-go-to-max code with a new sched_feature.
The sched_feature is disabled by default, hence favoring energy savings
as required in mobile environments.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ic9f01c8703d4f843addaa0d684012a422fe9f3b8

FROMLIST: sched/fair: add support to tune PELT ramp/decay timings

The PELT half-life is the time [ms] required by the PELT signal to build
up a 50% load/utilization, starting from zero. This time is currently
hardcoded to be 32ms, a value which seems to make sense for most of the
workloads.

However, 32ms has been verified to be too long for certain classes of
workloads. For example, in the mobile space many tasks affecting the
user-experience run with a 16ms or 8ms cadence, since they need to match
the common 60Hz or 120Hz refresh rate of the graphics pipeline.
This contributed so fare to the idea that "PELT is too slow" to properly
track the utilization of interactive mobile workloads, especially
compared to alternative load tracking solutions which provides a
better representation of tasks demand in the range of 10-20ms.

A faster PELT ramp-up time could give some advantages to speed-up the
time required for the signal to stabilize and thus to better represent
task demands in the mobile space. As a downside, it also reduces the
decay time, and thus we forget the load/utilization of sleeping tasks
(or idle CPUs) faster.

Fortunately, since the integration of the utilization estimation
support in mainline kernel:

commit 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT")

a fast decay time is no longer an issue for tasks utilization estimation.
Although estimated utilization does not slow down the decay of blocked
utilization on idle CPUs, for mobile workloads this seems not to be a
major concern compared to the benefits in interactivity responsiveness.

Let's add a compile time option to choose the PELT speed which better
fits for a specific system. By default the current 32ms half-life is
used, but we can also compile a kernel to use a faster ramp-up time of
either 16ms or 8ms. These two configurations have been verified to give
PELT a further improvement in performance, compared to other out-of-tree
load tracking solutions, when it comes to track interactive workloads
thus better supporting both tasks placements and frequencies selections.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
[
backport from LKML:
Message-ID: <20180409165134.707-1-patrick.bellasi@arm.com>
]
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I50569748918b799ac4bf4e7d2b387253080a0fd2

FROMLIST: proc/sched: remove unused sched_time_avg_ms

/proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
Remove it

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: I23ff66bffd8142f02d45f04258d04e360c9d7f73

FROMLIST: sched: remove rt_avg code

rt_avg is no more used anywhere so we can remove all related code

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Fixed trivial conflicts in kernel/sched/sched.h ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I2d620d01574a3cac9054893f04578e4e4b8bde96

FROMLIST: sched: use pelt for scale_rt_capacity()

The utilization of the CPU by rt, dl and interrupts are now tracked with
PELT so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for cfs class.

scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for cfs instead of a scaling factor because rt, dl and
interrupt provide now absolute utilization value.

The same formula as schedutil is used:
irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
but the implementation is different because it doesn't return the same value
and doesn't benefit of the same optimization

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ - Fixed issue with the max freq capping in update_cpu_capacity()
- Fixed compile warning for !CONFIG_IRQ_TIME_ACCOUNTING ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I4a25191bba3b7b19d075f5a95845caebdbcb9c24

FROMLIST: sched: schedutil: remove sugov_aggregate_util()

There is no reason why sugov_get_util() and sugov_aggregate_util()
were in fact separate functions.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[rebased after adding irq tracking and fixed some compilation errors]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: Ibe25e62b8aa108cacdab05dedfc15ed04c06f17b

FROMLIST: cpufreq/schedutil: take into account interrupt

The time spent under interrupt can be significant but it is not reflected
in the utilization of CPU when deciding to choose an OPP. Now that we have
access to this metric, schedutil can take it into account when selecting
the OPP for a CPU.
rqs utilization don't see the time spend under interrupt context and report
their value in the normal context time window. We need to compensate this when
adding interrupt utilization

The CPU utilization is :
  irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg

A test with iperf on hikey (octo arm64) gives:
iperf -c server_address -r -t 5

w/o patch w/ patch
Tx 276 Mbits/sec        304 Mbits/sec +10%
Rx 299 Mbits/sec        328 Mbits/sec +09%

8 iterations
stdev is lower than 1%
Only WFI idle state is enable (shallowest diel state)

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: I41ca4137b64a92337783233f2f97e33cb660cfce

FROMLIST: sched/irq: add irq utilization tracking

interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.

That's also important to note that because
rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.
The CPU utilization is :
avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: I01a71d28264fc3bb703bb0b0919bbc62845b4463

FROMLIST: cpufreq/schedutil: use dl utilization tracking

Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can detect when CPU is fully used so we should run at max.
Otherwise, we keep using the dl bandwidth requirement to define the
utilization of the CPU

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Fixed conflict with EAS ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I12f532fc0f7ec2682d94f94d6e8e128dea36c26a

FROMLIST: sched/dl: add dl_rq utilization tracking

Similarly to what happens with rt tasks, cfs tasks can be preempted by dl
tasks and the cfs's utilization might no longer describes the real
utilization level.
Current dl bandwidth reflects the requirements to meet deadline when tasks are
enqueued but not the current utilization of the dl sched class. We track
dl class utilization to estimate the system utilization.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: I5251dc8e148dfebd8748579742673b6d87134bb5

FROMLIST: cpufreq/schedutil: use rt utilization tracking

Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
can preempt and steal cfs's running time.

rt util_avg is used to take into account the utilization of rt tasks
on the CPU when selecting OPP. If a rt task migrate, the rt utilization
will not migrate but will decay over time. On an overloaded CPU, cfs
utilization reflects the remaining utilization avialable on CPU. When rt
task migrates, the cfs utilization will increase when tasks will start to
use the newly available capacity. At the same pace, rt utilization will
decay and both variations will compensate each other to keep unchanged
overall utilization and will prevent any OPP drop.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: Iaf97832858db8f857036f31b57058cee55e42cde

FROMLIST: sched/rt: add rt_rq utilization tracking

schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks
are preempted by rt tasks and in this case util_avg reflects the remaining
capacity but not what cfs want to use. In such case, schedutil can select a
lower OPP whereas the CPU is overloaded. In order to have a more accurate
view of the utilization of the CPU, we track the utilization of rt tasks.
Only util_avg is correctly tracked but not load_avg and runnable_load_avg
which are useless for rt_rq.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: Ia6f2c07a305943137c88d30d41e6df93b3a7473b

FROMLIST: sched/pelt: Move pelt related code in a dedicated file

We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization to cfs tasks and make them lighter than they are.
As we want to use the same load tracking mecanism for both and prevent
useless dependency between cfs and rt code, pelt code is moved in a
dedicated file.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Change-Id: I9cc2a9259df95078684f8d8906fb87b6f11c9337

ANDROID: sched: Update max cpu capacity in case of max frequency constraints

Wakeup balancing uses cpu capacity awareness and needs to know the
system-wide maximum cpu capacity.

Patch "sched: Store system-wide maximum cpu capacity in root domain"
finds the system-wide maximum cpu capacity during scheduler domain
hierarchy setup. This is sufficient as long as maximum frequency
invariance is not enabled.

If it is enabled, the system-wide maximum cpu capacity can change
between scheduler domain hierarchy setups due to frequency capping.

The cpu capacity is changed in update_cpu_capacity() which is called in
load balance on the lowest scheduler domain hierarchy level. To be able
to know if a change in cpu capacity for a certain cpu also has an effect
on the system-wide maximum cpu capacity it is normally necessary to
iterate over all cpus. This would be way too costly. That's why this
patch follows a different approach.

The unsigned long max_cpu_capacity value in struct root_domain is
replaced with a struct max_cpu_capacity, containing value (the
max_cpu_capacity) and cpu (the cpu index of the cpu providing the
maximum cpu_capacity).

Changes to the system-wide maximum cpu capacity and the cpu index are
made if:

1 System-wide maximum cpu capacity < cpu capacity
2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu

There are no changes to the system-wide maximum cpu capacity in all
other cases.

Atomic read and write access to the pair (max_cpu_capacity.val,
max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock.

The access to max_cpu_capacity.val in task_fits_max() is still performed
without taking the max_cpu_capacity.lock.

The code to set max cpu capacity in build_sched_domains() has been
removed because the whole functionality is now provided by
update_cpu_capacity() instead.

This approach can introduce errors temporarily, e.g. in case the cpu
currently providing the max cpu capacity has its cpu capacity lowered
due to frequency capping and calls update_cpu_capacity() before any cpu
which might provide the max cpu now.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>*
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ - Fixed cherry-pick issues
- Fix CONFIG_SMP=n build ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Idaa7a16723001e222e476de34df332558e48dd13

ANDROID: arm: enable max frequency capping

Defines arch_scale_max_freq_capacity() to use the topology driver
scale function.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I79f444399ea3b2948364fde80ccee52a9ece5b9a

ANDROID: arm64: enable max frequency capping

Defines arch_scale_max_freq_capacity() to use the topology driver
scale function.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: If7565747ec862e42ac55196240522ef8d22ca67d

ANDROID: implement max frequency capping

Implements the Max Frequency Capping Engine (MFCE) getter function
topology_get_max_freq_scale() to provide the scheduler with a
maximum frequency scaling correction factor for more accurate cpu
capacity handling by being able to deal with max frequency capping.

This scaling factor describes the influence of running a cpu with a
current maximum frequency (policy) lower than the maximum possible
frequency (cpuinfo).

The factor is:

policy_max_freq(cpu) << SCHED_CAPACITY_SHIFT / cpuinfo_max_freq(cpu)

It also implements the MFCE setter function arch_set_max_freq_scale()
which is called from cpufreq_set_policy().

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[Trivial cherry-pick issue in cpufreq.c]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I59e52861ee260755ab0518fe1f7183a2e4e3d0fc

ANDROID: sched/fair: add arch scaling function for max frequency capping

To be able to scale the cpu capacity by this factor introduce a call to
the new arch scaling function arch_scale_max_freq_capacity() in
update_cpu_capacity() and provide a default implementation which returns
SCHED_CAPACITY_SCALE.

Another subsystem (e.g. cpufreq) or architectural or platform specific
code can overwrite this default implementation, exactly as for frequency
and cpu invariance. It has to be enabled by the arch by defining
arch_scale_max_freq_capacity to the actual implementation.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ - Fixed trivial cherry-pick issue
- Fix CONFIG_SMP=n build ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I770a8b1f4f7340e9e314f71c64a765bf880f4b4d

ANDROID: sched/fair: Bypass energy computation for prefer_idle tasks

If the only pre-selected candidate CPU in find_energy_efficient_cpu()
happens to be prev_cpu, there is not point in computing the system
energy since we have nothing to compare it against, so we currently bail
out early. The same logic can be extended when prefer_idle tasks are
routed in the energy-aware wake-up path: if the only candidate is idle
for a prefer_idle task, just select it no matter what the energy impact
is. That should help speeding-up wake-ups of prefer_idle tasks, at least
when find_best_target() is used for them.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Idd0e387e4a766061cc05d2584df3a31e4dabfd09

ANDROID: sched: fair: Bypass energy-aware wakeup for prefer-idle tasks

Use the upstream slow path to find an idle cpu for prefer-idle tasks.
This slow-path is actually faster than the EAS path we are currently
going through (compute_energy()) which is really slow.

No performance degradation is seen with this and it reduces the delta
quite a bit between upstream and out of tree code.

It's not clear yet if using the mainline slow path task placement when
a task has the schedtune attribute prefer_idle=1 is the right thing to
do for products. Put the option to disable this behind a sched feature
so we can try out both options.

Signed-off-by: Joel Fernandes <joelaf@google.com>
(refactored for 4.14 version)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit c0ff131c88f68e4985793663144b6f9cf77be9d3)
[ - Refactored for 4.17 version
- Adjusted the commit header to the new function names ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Icf762a101c92c0e3f9e61df0370247fa15455581

FROMLIST: sched/fair: Use wake_q length as a hint for wake_wide

This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.

The wake_q mechanism is one case where this information is available.

select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.

                               * * *

The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
finish faster, and then those CPUs pull over the tasks that are still
running:

     v CPU v           ->time->

                    -------------
   0  (big)         11111  /333
                    -------------
   1  (big)         22222   /444|
                    -------------
   2  (LITTLE)      333333/
                    -------------
   3  (LITTLE)      444444/
                    -------------

Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:

     v CPU v           ->time->

                    ------------------------
   0  (big)         11111  /333  31313\33333
                    ------------------------
   1  (big)         22222   /444|424\4444444
                    ------------------------
   2  (LITTLE)      333333/          \222222
                    ------------------------
   3  (LITTLE)      444444/            \1111
                    ------------------------
                                 ^^^
                           underutilization

So, I'm trying to get want_affine = 0 for these tasks.

I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.

However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.

It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.

IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?

                               * * *

Final note..

In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:

     v CPU v           ->time->

                    --------------
   0  (big)         11111  /333  3
                    --------------
   1  (big)         22222   /444|4
                    --------------
   2  (LITTLE)      333333/      2
                    --------------
   3  (LITTLE)      444444/          <- Task 1 should go here
                    --------------

If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
( - Applied from https://patchwork.kernel.org/patch/9895261/
  - Fixed trivial conflict in kernel/sched/core.c
  - Fixed select_task_rq_idle, now in kernel/sched/idle.c
  - Fixed trivial conflict in select_task_rq_fair )
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I3cfc4bf48c3d7feef969db4d22449f4fbb4f795d

ANDROID: sched: Unconditionally honor sync flag for energy-aware wakeups

Since we don't do energy-aware wakeups when we are overutilized, always
honoring sync wakeups in this state does not prevent wake-wide mechanics
overruling the flag as normal.

This patch is based upon previous work to build EAS for android products.

sync-hint code taken from commit 4a5e890ec60d
"sched/fair: add tunable to force selection at cpu granularity" written
by Juri Lelli <juri.lelli@arm.com>

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit f1ec666a62dec1083ed52fe1ddef093b84373aaf)
[ Moved the feature to find_energy_efficient_cpu() ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I4b3d79141fc8e53dc51cd63ac11096c2e3cb10f5

ANDROID: Add find_best_target to minimise energy calculation overhead

find_best_target started life as an optimised energy cpu selection
algorithm designed to be efficient and high performance and integrate
well with schedtune and userspace cpuset configuration. It has had many
many small tweaks over time, and the version added here is forward
ported from the version used in android-4.4 and android-4.9. The linkage
to the rest of EAS is slightly different, however.

This version is split into the three main use-cases and addresses
them in priority order:
A) latency sensitive tasks
B) non latency sensitive tasks on IDLE CPUs
C) non latency sensitive tasks on ACTIVE CPUs

Case A) Latency sensitive tasks

Unconditionally favoring tasks that prefer idle CPU to improve latency.

When we do enter here, we are looking for:
- an idle CPU, whatever its idle_state is, since the first CPUs we
   explore are more likely to be reserved for latency sensitive tasks.
- a non idle CPU where the task fits in its current capacity and has
   the maximum spare capacity.
- a non idle CPU with lower contention from other tasks and running at
   the lowest possible OPP.

The last two goals try to favor a non idle CPU where the task can run
as if it is "almost alone". A maximum spare capacity CPU is favored
since the task already fits into that CPU's capacity without waiting for
an OPP change.

For any case other than case A, we avoid CPUs which would become
overutilized if we placed the task there.

Case B) Non latency sensitive tasks on IDLE CPUs.

Find an optimal backup IDLE CPU for non latency sensitive tasks.

Here we are looking for:
- minimizing the capacity_orig,
   i.e. preferring LITTLE CPUs

If IDLE cpus are available, we prefer to choose one in order to spread
tasks and improve performance.

Case C) Non latency sensitive tasks on ACTIVE CPUs.

Pack tasks in the most energy efficient capacities.

This task packing strategy prefers more energy efficient CPUs (i.e.
pack on smaller maximum capacity CPUs) while also trying to spread
tasks to run them all at the lower OPP.

This assumes for example that it's more energy efficient to run two tasks
on two CPUs at a lower OPP than packing both on a single CPU but running
that CPU at an higher OPP.

This code has had many other contributors over the development listed
here as Cc.

Cc: Ke Wang <ke.wang@spreadtrum.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Srinath Sridharan <srinathsr@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit f240e44406558b17ff7765f252b0bcdcbc15126f)
[ - Removed tracepoints
  - Took capacity_curr_of() from "7f6fb82 ANDROID: sched: EAS: take
    cstate into account when selecting idle core"
  - Re-use sd_ea from find_energy_efficient_cpu() / removed start_cpu()
  - Mark candidates with a cpumask given by feec()
  - Squashed Ionela's tri-gear fbt fixes from android-4.14
  - Squashed Patrick's changes related to util_est from android-4.14 ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I9500d308c879dd53b08adeda8f988238e39cc392

ANDROID: sched/fair: Factor out CPU selection from find_energy_efficient_cpu

find_energy_efficient_cpu() is composed of two steps; we first look for
the CPU with the max spare capacity in each frequency domain, and then
the impact on energy of each candidate is estimated. In order to make it
easier to implement other CPU selection policies, let's factor the
candidate selection algorithm out of find_energy_efficient_cpu(), and
mark the candidates in a mask. This should result in no functional
difference.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I85a28880f01fcd11d7af28f9fbc1fe0cf4f197cf

ANDROID: sched: Introduce sysctl_sched_cstate_aware

Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, the scheduler can make use of the idle state
indexes in order to break the tie between potential CPU candidates.

This patch is based on 7f6fb825d6bc ("ANDROID: sched: EAS: take cstate
into account when selecting idle core") from android-4.14. All the
credits goes to the authors.

Change-Id: Ia076cf32faff91e90905291fa6f7924dc3dd6458
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

ANDROID: sched, cpuidle: Track cpuidle state index in the scheduler

The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ib3d1178512735b0e314881f73fb8ccff5a69319f
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit a732c97420e109956c20f34c70b91e6d06f5df31)
[ Fixed trivial cherry-pick conflict in kernel/sched/sched.h ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

ANDROID: sched: fair/tune: Add schedtune with cgroups interface

Schedtune is the framework we use in Android to allow userspace
task classification and provides a CGroup controller which has two
attributes per group.

* schedtune.boost
* schedtune.prefer_idle

Schedtune itself provides task and CPU utilization boosting. EAS in
the fair scheduler uses boosted utilization and prefer_idle status to
control the algorithm used for wakeup task placement.

Boosting:

The task utilization signal, which is derived from PELT signals and
properly scaled to be architecture and frequency invariant, is used by
EAS as an estimation of the task requirements in terms of CPU bandwidth.

Schedtune allows userspace to assign a percentage boost to each group
and this boost is used to calculate an additional utilization margin.
The margin added to the original utilization is:
1. computed based on the "boosting strategy" in use
2. proportional to boost value defined by the "taskgroup" value

The boosted signal is used by EAS for task placement, and boosted CPU
utilization (if boosted tasks are running) is given when schedutil
requests utilization.

Prefer_idle:

When this attribute is 1 for a group, this is used as a signal from
userspace that tasks in this group need to be serviced with the
minimum latency possible.

Previous versions of schedtune had much more functionality around
allowing a more tuneable tradeoff between performand and energy,
however this has not been used a lot up until now. If necessary,
we can easily resurrect it based upon old code.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit 159c14f0397790405b9e8435184366b31f2ed15b)
[ - Removed tracepoints (to be added in a separate patch)
- Integrated boosted_cpu_util() with cpu_util_cfs()
- Backported Patrick's util_est related fixes from android-4.14 ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ie2fd63d82f604f34bcbc7e1ca9b5af1bdcc037e0

FROMLIST: sched/fair: Fix util_avg of new tasks for asymmetric systems

When a new task wakes-up for the first time, its initial utilization
is set to half of the spare capacity of its CPU. The current
implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
directly as a capacity reference. As a result, on a big.LITTLE system, a
new task waking up on an idle little CPU will be given ~512 of util_avg,
even if the CPU's capacity is significantly less than that.

Fix this by computing the spare capacity with arch_scale_cpu_capacity().

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I1018d8710d62789f9c02558729557b924a44ceec

ANDROID: cpufreq: arm_big_little: Register an Energy Model

The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scpi-cpufreq driver which
can rely on the power estimation helper provided by PM_OPP.

Todo: Check if driver can handle -EPROBE_DEFER and if the call to
dev_pm_opp_get_opp_count() id realy necessary.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ Removed the dependency on dev_pm_opp_of_estimate_power() ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ia808262ef6c9f2cc7819a83e8eb2f602454edfa3

ANDROID: cpufreq: scpi: Register an Energy Model

The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scpi-cpufreq driver which can
estimate power using the P = C * V^2 * f equation where C, V, and f
respectively are the capacitance of the CPU and the voltage and frequency
of the OPP.

The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding, and the voltage and frequency values are obtained from PM_OPP.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ib4f97223bdedb6d8141d79060a530f754d380023

FROMLIST: cpufreq: dt: Register an Energy Model

The Energy Model framework provides an API to register the active power
of CPUs. Call this API from the cpufreq-dt driver with an estimation
of the power as P = C * V^2 * f with C, V, and f respectively the
capacitance of the CPU and the voltage and frequency of the OPP.

The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding (originally introduced for thermal/IPA), and the voltage and
frequency values from PM_OPP.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ic264eb07941a1a16db909a79ff7c07a3d1e62a12

FROMLIST: arch_topology: Start Energy Aware Scheduling

Energy Aware Scheduling (EAS) starts when the scheduling domains are
built if the Energy Model (EM) is present. However, in the typical case
of Arm/Arm64 systems, the EM is provided after the scheduling domains
are first built at boot time, which results in EAS staying disabled.

Fix this issue by re-building the scheduling domain from the arch
topology driver, once CPUfreq is up and running and when the CPU
capacities have been updated to their final value.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Id286b3f4a6049269cfbfb0e391f63fd3da76f71c

ANDROID: arm: dts: vexpress-v2p-ca15_a7: Add dynamic-power-coefficient properties

The values are computed from measuring energy over a 10 secs sysbench
workload running at each frequency and affine to 1 or 2 A15's as well as
1 or 2 or 3 A7's. The Power values for individual cpus are calculated
from the Energy values divided by workload runtime by taking the
difference of the energy values between n+1 and n cpus.

P [mW] = C * freq [Mhz] * power [mV] * power [mV] / 1000000000

C = P [mW] / freq [Mhz] * power [mV] * power [mV] * 1000000000

The actual C (dynamic-power-coefficient) value is the mean value out of
all the C values of the OPP's.

A15:
     freq       power  voltage  dyn_pwr_coef
     [MhZ]      [mW]      [mV]
0   500.0   534.57550      900   1319.939506
1   600.0   547.15468      900   1125.832675
2   700.0   572.22060      900   1009.207407
3   800.0   607.76592      900    937.910370
4   900.0   648.50552      900    889.582332
5  1000.0   693.86776      900    856.626864
6  1100.0   916.51314      975    876.469442
7  1200.0  1198.57566     1050    905.952880

                            mean: 990

A7:

     freq      power  voltage  dyn_pwr_coef
     [MhZ]      [mW]     [mV]
0   350.0   40.17430      900    141.708289
1   400.0   42.68700      900    131.750000
2   500.0   54.25716      900    133.968296
3   600.0   64.09914      900    131.891235
4   700.0   74.09736      900    130.683175
5   800.0   82.69694      900    127.618735
6   900.0  113.71386      975    132.911225
7  1000.0  144.94124     1050    131.465977

                           mean: 133

The ratio between A15 and A7 is 990/113 = 7.44

This value (7.44) is very close to mean ratio between the power value of
A15 an A7 of the per sched-domain Energy Model (7.96):

  mV  Mhz  MhZ old EM      ratio
       A7  A15 core power
900  350  500 6997/1024   6.83
900  400  600 5177/761    6.80
900  500  700 3846/549    7.01
900  600  800 3524/447    7.88
900  700  900 3125/407    7.68
900  800 1000 2756/334    8.25
975  900 1100 2312/275    8.40
1050 1000 1200 2021/187   10.80

                    mean:  7.96

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I93ef375d05ff769481a07f2c74f061e307cb14d4

ANDROID: arm64: dts: juno-r2: Add dynamic-power-coefficient properties

Taken from commit cadf54148974 "arm64: dts: Add IPA parameters to soc
thermal zone" wich also sets up SoC thermal zones and bind them to
cpufreq cooling devices. We don't want this functionality right now.

The commit is for example part of:

git.linaro.org/landing-teams/working/arm/kernel-release.git
lt_arm/ack-4.9-armlt-18.01

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Id1a44fb7d222d59f7d44b5f55797d407513eb7e7

ANDROID: arm64: dts: juno: Add dynamic-power-coefficient properties

Taken from commit cadf54148974 "arm64: dts: Add IPA parameters to soc
thermal zone" wich also sets up SoC thermal zones and bind them to
cpufreq cooling devices. We don't want this functionality right now.

The commit is for example part of:

git.linaro.org/landing-teams/working/arm/kernel-release.git
lt_arm/ack-4.9-armlt-18.01

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I7c23a58fa49b281ed5df2f60db0514a9b3b50c7b

ANDROID: arm, arm64: Enable kernel config options required for EAS

arm and arm64:

  Add    Cgroups support
  Add    Energy Model
  Add    CpuFreq governors and make schedutil default

for arm:

  Add    Cpuset support
  Add    Scheduler autogroups
  Add    DIE sched domain level

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Ib52a0bd27702c3f2c3d692e49c9c8e2fbbea2cf7

ANDROID: arm64: Enable dynamic sched_domain flag setting

The patch lets the arch_topology driver take over setting of
sched_domain flags that should be detected dynamically based on the
actual system topology.

cc: Catalin Marinas <catalin.marinas@arm.com>
cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I56b496517fe5d3a67f8650a661ea0a046ab81b59

ANDROID: arm: Enable dynamic sched_domain flag setting

The patch lets the arch_topology driver take over setting of
sched_domain flags that should be detected dynamically based on the
actual system topology.

cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ie40c390057ab4cc1b73d8d322e3253ac2443df48

ANDROID: drivers/base/arch_topology: Dynamic sched_domain flag detection

This patch add support for dynamic sched_domain flag detection. Flags
like SD_ASYM_CPUCAPACITY are not guaranteed to be set at the same level
for all systems. Let the arch_topology driver do the detection of where
those flags should be set instead. This patch adds initial support for
setting the SD_ASYM_CPUCAPACITY flag.

cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I0e729b8005f4132926b22e85e83f58d691938350

ANDROID: sched: Enable idle balance to pull single task towards cpu with higher capacity

We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu. Add an
extra criteria to need_active_balance() to kick off active load balance
if the source cpu is over-utilized and has lower capacity than the
destination cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Iea3b42b2a0f8d8a4252e42ba67cc33381a4a1075

ANDROID: sched: Prevent unnecessary active balance of single task in sched group

Scenarios with the busiest group having just one task and the local
being idle on topologies with sched groups with different numbers of
cpus manage to dodge all load-balance bailout conditions resulting the
nr_balance_failed counter to be incremented. This eventually causes a
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().

A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I28f69c72febe0211decbe77b7bc3e48839d3d7b3

FROMLIST: sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains

The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with cpu
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity cpus even if high capacity cpus aren't overutilized might
give access to more cache but the cpu will be slower and possibly lead
to worse overall throughput.

To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I8400c9a26d57da10e45b64c6f93bc1b99eb5c3c8

FROMLIST: sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry

When hotplugging cpus out or creating exclusive cpusets (disabling
sched_load_balance) systems which were asymmetric at boot might become
symmetric. In this case leaving the flag set might lead to suboptimal
scheduling decisions.

The arch-code proving the flag doesn't have visibility of the cpuset
configuration so it must either be told by passing a cpumask or the
generic topology code has to verify if the flag should still be set
when taking the actual sched_domain_span() into account. This patch
implements the latter approach.

We need to detect capacity based on calling arch_scale_cpu_capacity()
directly as rq->cpu_capacity_orig hasn't been set yet early in the boot
process.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I95dadee225f2217787713a61bd7b07e11e028fb0

FROMLIST: sched/fair: Don't move tasks to lower capacity cpus unless necessary

When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I66c47bc93f6533ae8f36f026c033ad5de80f3275

FROMLIST: sched/fair: Set rq->rd->overload when misfit

Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks
_ signifies idling

LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
  big_0  | c c c c a a
  big_1  | d d d d b b
  ^
  ^
    Tasks end on the big CPUs, idle balance happens
    and the misfit tasks are pulled straight away

This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.

As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Change-Id: I86d5aab67bd2c91a0b193f21844a4634a492f331

FROMLIST: sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE

This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I2b35baaebb9a55afa7877b049d2af1c255567601