OSDN Git Service

sagit-ice-cold/kernel_xiaomi_msm8998.git
4 years ago  proc: Don't let Google Camera and Settings run in the background
Sultan Alsawaf [Sat, 20 Jul 2019 17:56:17 +0000 (10:56 -0700)]
proc: Don't let Google Camera and Settings run in the background

Google Camera and Settings both burn through CPU in the background doing
nothing useful. In the case of Google Camera, it keeps polling sensors
in the background while it is doing nothing for the user.

Meanwhile, in Settings, when leaving the Adaptive brightness activity
via any means other than using the back button (e.g., the home button),
the GIF in the Adaptive brightness activity will continue playing in the
background. This bug applies to all of the dumb GIFs in Settings.

Kill both of these apps when they reach the background to stop them from
burning through battery.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  ARM: dts: msm8998: Increase UFS CPU latency requirement to 100 us
Sultan Alsawaf [Tue, 7 May 2019 08:51:50 +0000 (01:51 -0700)]
ARM: dts: msm8998: Increase UFS CPU latency requirement to 100 us

Voting for longer than 70 us provides access to another CPU idle state.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  {chiron,sagit}_defconfig: Use a timer frequency of 100 Hz
Sultan Alsawaf [Tue, 7 May 2019 01:10:37 +0000 (18:10 -0700)]
{chiron,sagit}_defconfig: Use a timer frequency of 100 Hz

The use of high CPU frequencies combined with a large number of cores
makes latency decent with a 100 Hz scheduler tick. Reducing the tick
rate from 300 Hz to 100 Hz improves throughput and significantly reduces
the number of interrupts firing off per second, reducing power
consumption.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  qcacld-3.0: Nuke as much debug bloat as possible
Sultan Alsawaf [Sun, 5 May 2019 21:54:15 +0000 (14:54 -0700)]
qcacld-3.0: Nuke as much debug bloat as possible

The overhead from all the debugging in this monstrosity of a driver is
measurably significant. Chop it all out.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  msm: kgsl: Relax CPU latency requirements to save power
Sultan Alsawaf [Sun, 5 May 2019 04:29:14 +0000 (21:29 -0700)]
msm: kgsl: Relax CPU latency requirements to save power

Relaxing the CPU latency requirement by about 500 us won't significantly
hurt graphics performance. On the flip side, most SoCs have many idle
levels just below 1000 us in latency, with deeper idle levels having
latencies in excess of 2000 us. Changing the latency requirement to
1000 us allows most SoCs to use their deepest sub-1000-us idle state
while the GPU is active.

Additionally, since the lpm driver has been updated to allow power
levels with latencies equal to target latencies, change the wakeup
latency from 101 to 100 for clarity.
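
For reference, a minimal sketch of how a kgsl-style driver can vote for CPU
wakeup latency through the legacy PM QoS API; the function and request names
below are illustrative, not taken from this commit:

#include <linux/pm_qos.h>

static struct pm_qos_request gpu_pm_qos_req;

/* Register a CPU DMA latency vote at init, with no constraint yet. */
static void gpu_qos_init(void)
{
	pm_qos_add_request(&gpu_pm_qos_req, PM_QOS_CPU_DMA_LATENCY,
			   PM_QOS_DEFAULT_VALUE);
}

/* While the GPU is active, cap CPU wakeup latency at 1000 us so the
 * deepest sub-1000-us idle state remains usable. */
static void gpu_active(void)
{
	pm_qos_update_request(&gpu_pm_qos_req, 1000);
}

/* When the GPU goes idle, drop the vote entirely. */
static void gpu_idle(void)
{
	pm_qos_update_request(&gpu_pm_qos_req, PM_QOS_DEFAULT_VALUE);
}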

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  sched/tune: Hard-code top-app's stune boost to 1
Sultan Alsawaf [Mon, 22 Apr 2019 09:07:57 +0000 (02:07 -0700)]
sched/tune: Hard-code top-app's stune boost to 1

Hard-code top-app's stune boost to 1 so that top-app processes are still
preferred to run on big cluster CPUs without significantly affecting the
CPU governor's frequency selection.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  cpuidle: lpm-levels: Allow exit latencies equal to target latencies
Sultan Alsawaf [Thu, 2 May 2019 00:54:26 +0000 (17:54 -0700)]
cpuidle: lpm-levels: Allow exit latencies equal to target latencies

This allows a pm_qos vote of, say, 100 us to select power
levels with exit latencies equal to 100 us. The extra microsecond of
exit latency doesn't hurt.
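
A minimal sketch of the comparison this change implies, with illustrative
names (the real check lives in the lpm-levels driver):

/* Before, a level was usable only if its exit latency was strictly below
 * the pm_qos target; now equality is accepted too. */
static bool lpm_level_usable(u32 exit_latency_us, u32 qos_target_us)
{
	return exit_latency_us <= qos_target_us;	/* was: < */
}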

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  {chiron,sagit}_defconfig: Don't print pid and CPU in dmesg
Sultan Alsawaf [Thu, 18 Apr 2019 06:55:45 +0000 (23:55 -0700)]
{chiron,sagit}_defconfig: Don't print pid and CPU in dmesg

This cruft is annoying.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  {chiron,sagit}_defconfig: enable devfreq boost driver
0ranko0P [Mon, 23 Dec 2019 10:12:01 +0000 (18:12 +0800)]
{chiron,sagit}_defconfig: enable devfreq boost driver

4 years ago  msm: mdss: Mark display-wake kthread as performance critical
Sultan Alsawaf [Mon, 15 Apr 2019 05:06:29 +0000 (22:06 -0700)]
msm: mdss: Mark display-wake kthread as performance critical

This kthread is responsible for powering on the display, so it needs to
run as soon as possible to minimize lag when turning the display on.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  msm: mdss: Power on display asynchronously as early as possible
Sultan Alsawaf [Mon, 15 Apr 2019 05:04:53 +0000 (22:04 -0700)]
msm: mdss: Power on display asynchronously as early as possible

Currently, mdss powers on the display a long time after an unblank is
requested and gets completely blocked while powering on the display, so
if the display takes a very long time to turn on, then mdss will be
stuck unable to do anything else in the meantime. This results in a long
delay between trying to wake the device and the display actually
powering on.

In order to make the display turn on faster when waking the device from
sleep, start powering on the display as soon as the framebuffer unblank
event is received. This allows mdss to continue resuming while the
display takes its time powering on. A high-priority kthread is used here
to ensure the display powers on as quickly as possible.

In the event that the framebuffer unblank notifier is not used (such as
for AOD), the display will be powered on at the time that it is
requested via the MDSS_EVENT_LINK_READY event.

To make this work, kickoffs need to be blocked when they attempt to
power on the display, so that a kickoff won't continue while the display
is still powered off.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  msm: mdss: Boost DDR bus when committing a new frame
Sultan Alsawaf [Sat, 20 Jul 2019 17:25:50 +0000 (10:25 -0700)]
msm: mdss: Boost DDR bus when committing a new frame

In order to reduce jank, request a DDR bus boost whenever a new frame is
ready to be rendered to the display. The boost should be sufficient
to render 60 FPS without any dropped frames when there is no
significant external source of load.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  devfreq_boost: Mark boost kthreads as performance critical
Sultan Alsawaf [Thu, 18 Apr 2019 04:39:16 +0000 (21:39 -0700)]
devfreq_boost: Mark boost kthreads as performance critical

The boost kthreads are performance critical for obvious reasons.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  devfreq_boost: Introduce devfreq boost driver
Sultan Alsawaf [Fri, 19 Apr 2019 05:07:57 +0000 (22:07 -0700)]
devfreq_boost: Introduce devfreq boost driver

This driver boosts enumerated devfreq devices upon input, and allows for
boosting specific devfreq devices on other custom events. The boost
frequencies for this driver should be set so that frame drops are
near-zero at the boosted frequencies and power consumption is minimized
at said frequencies. The goal of this driver is to provide an interface
to achieve optimal device performance by requesting boosts on key
events, such as when a frame is ready to be rendered to the display.

Currently, support is only present for boosting the cpubw devfreq
device, but the driver is structured in a way that makes it easy to add
support for new boostable devfreq devices in the future.
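
A hedged sketch of the kind of interface such a driver exposes; the names
below are illustrative rather than copied from the driver:

/* Enumerated boostable devfreq devices. */
enum df_device {
	DEVFREQ_CPUBW,	/* the cpubw device mentioned above */
	DEVFREQ_MAX
};

/* Kick a short boost to the configured boost frequency, e.g. on input. */
void devfreq_boost_kick(enum df_device device);

/* Hold the boost for a caller-specified duration, for custom events such
 * as a frame being ready to render. */
void devfreq_boost_kick_duration(enum df_device device, unsigned int ms);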

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  cpufreq: Kill userspace CPU boosting entirely
Sultan Alsawaf [Sat, 16 Jun 2018 22:51:07 +0000 (15:51 -0700)]
cpufreq: Kill userspace CPU boosting entirely

Kernel-based CPU boosting is used now, so stop userspace from messing with
it by turning scaling_min_freq into a no-op. Note that this is done instead
of making scaling_min_freq read-only so that userspace doesn't spit out
error messages when it can't do its boosting.
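
A minimal sketch of what the no-op looks like in the cpufreq sysfs store
path, assuming the conventional store_scaling_min_freq() handler:

static ssize_t store_scaling_min_freq(struct cpufreq_policy *policy,
				      const char *buf, size_t count)
{
	/* Accept and discard the write so userspace sees no error,
	 * leaving policy->min under kernel control. */
	return count;
}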

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  fs: exec: block nfs injector from launching
Yaroslav Furman [Wed, 11 Dec 2019 15:43:56 +0000 (18:43 +0300)]
fs: exec: block nfs injector from launching

Another optimizer, duh...

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  {chiron,sagit}_defconfig: enable unwanted apps blocker
0ranko0P [Fri, 20 Dec 2019 13:58:06 +0000 (21:58 +0800)]
{chiron,sagit}_defconfig: enable unwanted apps blocker

4 years ago  fs: introduce unwanted apps blocker
Yaroslav Furman [Tue, 10 Dec 2019 00:47:55 +0000 (03:47 +0300)]
fs: introduce unwanted apps blocker

This is a POC commit which targets various Android "optimizers"
that use 2011-era tweaks and scripts to
"improve battery life and performance"
and completely mess up all the kernel settings, thus leading
to poor UX and complaints from users.

Currently blocking: L Speed, FDE.AI

Based on [1] and [2]:
[1] https://github.com/RaphielGang/spins_kernel_xiaomi_sdm845/commit/75804821e68be3ece795402e175e36ebf7206540
[2] https://github.com/kerneltoast/android_kernel_google_bluecross/commit/18f25c985ce1d23ca98f765249671e7252886e4d

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  zram: Do not allow compression algorithm to be changed
Nathan Chancellor [Sun, 28 Oct 2018 17:09:32 +0000 (10:09 -0700)]
zram: Do not allow compression algorithm to be changed

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
4 years ago  zram: Move default compression algorithm choice to Kconfig
Nathan Chancellor [Sun, 28 Oct 2018 17:14:01 +0000 (10:14 -0700)]
zram: Move default compression algorithm choice to Kconfig

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  kernel: bpf: move syscall allocations to stack
Yaroslav Furman [Wed, 10 Jul 2019 23:36:23 +0000 (02:36 +0300)]
kernel: bpf: move syscall allocations to stack

These allocations are really small, freed in the same function, and very
frequent. Allocating them on the stack will improve performance.
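
A hedged sketch of the pattern, with illustrative names and sizes: a small
buffer that was kmalloc'd and kfree'd within one function moves onto the
stack.

static int copy_small_attr(const void *src, size_t size)
{
	char buf[64];	/* was: char *buf = kmalloc(64, GFP_KERNEL); */

	if (size > sizeof(buf))
		return -EINVAL;
	memcpy(buf, src, size);
	/* ... use buf ... */
	return 0;	/* no kfree() needed anymore */
}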

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  kernel/printk: use on-stack allocations for kernel log
Yaroslav Furman [Sat, 13 Jul 2019 12:18:13 +0000 (15:18 +0300)]
kernel/printk: use on-stack allocations for kernel log

These allocations are just 1 KB in size; using kmalloc is not
worth it for them. This should speed up printing of the kernel log
when uptime gets very long.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  msm: gsi: disable debug driver
Yaroslav Furman [Thu, 5 Dec 2019 23:52:36 +0000 (02:52 +0300)]
msm: gsi: disable debug driver

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  cpuidle: lpm-levels: Remove debug event logging
Danny Lin [Tue, 7 May 2019 05:47:02 +0000 (22:47 -0700)]
cpuidle: lpm-levels: Remove debug event logging

A measurably significant amount of CPU time is spent on logging events
for debugging purposes in lpm_cpuidle_enter. Kill the useless logging to
reduce overhead.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  ipa_v3: fix some maybe-uninitialised warnings
Yaroslav Furman [Thu, 6 Jun 2019 07:32:46 +0000 (10:32 +0300)]
ipa_v3: fix some maybe-uninitialised warnings

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years ago  mm/slab_common: Align all caches' objects to hardware cachelines
Sultan Alsawaf [Thu, 27 Jun 2019 00:47:30 +0000 (17:47 -0700)]
mm/slab_common: Align all caches' objects to hardware cachelines

This only increases the memory used by all caches by about 10%, which is
a small cost for the performance benefit of cacheline
alignment.
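
A minimal sketch of the idea, assuming the calculate_alignment() helper in
mm/slab_common.c:

static unsigned long calculate_alignment(unsigned long flags,
					 unsigned long align,
					 unsigned long size)
{
	/* Sketch: raise the minimum alignment to the hardware cacheline
	 * size for every cache, not only for caches created with
	 * SLAB_HWCACHE_ALIGN. */
	align = max_t(unsigned long, align, cache_line_size());
	return ALIGN(align, sizeof(void *));
}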

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  ASoC: msm: qdsp6v2: Make version checking no-op
Yaroslav Furman [Sat, 1 Sep 2018 19:57:42 +0000 (22:57 +0300)]
ASoC: msm: qdsp6v2: Make version checking no-op

After the Pie tag was released, CAF added functions
for checking the firmware version that are not supported
by our DSP.

The kernel tells us about it by spamming:

[10186.137518] q6core_get_service_version: Failed to get service size for service id 7 with error -95
[10186.141517] q6core_get_service_version: Failed to get service size for service id 8 with error -95
[10186.151816] q6core_get_service_version: Failed to get service size for service id 7 with error -95
[10254.278514] q6core_get_service_version: Failed to get service size for service id 7 with error -95
[10254.282274] q6core_get_service_version: Failed to get service size for service id 8 with error -95
[10254.292154] q6core_get_service_version: Failed to get service size for service id 7 with error -95
[10294.549313] q6core_get_service_version: Failed to get service size for service id 7 with error -95
[10294.553506] q6core_get_service_version: Failed to get service size for service id 8 with error -95
[10294.563891] q6core_get_service_version: Failed to get service size for service id 7 with error -95

This results in certain audio apps misbehaving
after the system suspends and then comes back online.

Change-Id: I09dfa1ee3adad8df62f79bc79a88a74f60d73b23
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
4 years agoRevert "{chiron,sagit}_defconfig: enable bfq"
0ranko0P [Thu, 12 Dec 2019 12:49:25 +0000 (20:49 +0800)]
Revert "{chiron,sagit}_defconfig: enable bfq"

This reverts commit e34df45d2cab26ba0dbc4fdc106548f3e13eb395.

4 years agoRevert "BACKPORT: zsmalloc: introduce zs_huge_class_size()"
0ranko0P [Sat, 7 Dec 2019 12:43:59 +0000 (20:43 +0800)]
Revert "BACKPORT: zsmalloc: introduce zs_huge_class_size()"

This reverts commit 4d1ddb8d3b84e9e162217bf55e8aad6fa796b836.

4 years ago  ARM: dts: msm8998: Tune mincpubw configs
Julian Liu [Tue, 5 Nov 2019 04:38:35 +0000 (12:38 +0800)]
ARM: dts: msm8998: Tune mincpubw configs

4 years ago  f2fs: avoid kernel panic on corruption test
Jaegeuk Kim [Fri, 1 Nov 2019 16:34:21 +0000 (09:34 -0700)]
f2fs: avoid kernel panic on corruption test

xfstests/generic/475 complains of a kernel warn/panic while testing a corrupted disk.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: Check write pointer consistency of non-open zones
Shin'ichiro Kawasaki [Mon, 28 Oct 2019 06:58:01 +0000 (15:58 +0900)]
f2fs: Check write pointer consistency of non-open zones

To catch f2fs bugs in write pointer handling code for zoned block
devices, check write pointers of non-open zones that current segments do
not point to. Do this check at mount time, after the fsync data recovery
and the current segments' write pointer consistency fix. Check two items,
comparing write pointers with valid block maps in the SIT.

The first check is for zones with no valid blocks. When there are no
valid blocks in a zone, the write pointer should be at the start of the
zone. If not, the next write operation to the zone will cause an
unaligned write error. If the write pointer is not at the zone start,
make mount fail and ask users to run fsck.

The second check compares the write pointer position with the last
valid block in the zone. It is unexpected for the last valid block
position to be beyond the write pointer. In such a case, report it as a
bug. No fix is required for such a zone, because the zone is not selected
for the next write operation until it gets discarded.

Also move the constant F2FS_REPORT_ZONE from super.c to f2fs.h so it can
be used in segment.c as well.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: Check write pointer consistency of open zones
Shin'ichiro Kawasaki [Mon, 28 Oct 2019 06:58:00 +0000 (15:58 +0900)]
f2fs: Check write pointer consistency of open zones

On sudden f2fs shutdown, write pointers of zoned block devices can go
further, but f2fs metadata keeps the current segments at positions before
the write operations. After remounting f2fs, this inconsistency causes
write operations that are not at the write pointers, and an "Unaligned
write command" error is reported.

To avoid the error, compare the current segments with the write pointers
of the open zones they point to during the mount operation. If the write
pointer position is not aligned with the current segment position, assign
a new zone to the current segments. Also check that the newly assigned
zone has its write pointer at the zone start. If not, make mount fail and
ask users to run fsck.

Perform the consistency check twice: once during fsync recovery (to avoid
losing the fsync data, do the check after the fsync data gets restored and
before the checkpoint commit, which flushes data at the current segment
positions), and a second time at the end of f2fs_fill_super(), to ensure
write pointer consistency regardless of whether fsync data recovery runs.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: fix wrong description in document
Chao Yu [Tue, 22 Oct 2019 09:26:11 +0000 (17:26 +0800)]
f2fs: fix wrong description in document

As reported in bugzilla, the default value of DEF_RAM_THRESHOLD was fixed
by commit 29710bcf9426 ("f2fs: fix wrong percentage"); however, the wrong
description was left in the document. Fix it.

https://bugzilla.kernel.org/show_bug.cgi?id=205203

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: cache global IPU bio
Chao Yu [Mon, 30 Sep 2019 10:53:25 +0000 (18:53 +0800)]
f2fs: cache global IPU bio

In commit 8648de2c581e ("f2fs: add bio cache for IPU"), we added
f2fs_submit_ipu_bio() in __write_data_page() as below:

__write_data_page()

	if (!S_ISDIR(inode->i_mode) && !IS_NOQUOTA(inode)) {
		f2fs_submit_ipu_bio(sbi, bio, page);
		....
	}

in order to avoid below deadlock:

Thread A                                Thread B
- __write_data_page (inode x, page y)
 - f2fs_do_write_data_page
  - set_page_writeback        ---- set writeback flag in page y
  - f2fs_inplace_write_data
                                        - f2fs_balance_fs
                                         - lock gc_mutex
 - lock gc_mutex
                                          - f2fs_gc
                                           - do_garbage_collect
                                            - gc_data_segment
                                             - move_data_page
                                              - f2fs_wait_on_page_writeback
                                               - wait_on_page_writeback  --- wait writeback of page y

However, the bio submission breaks the merging of IPU IOs.

So in this patch let's add a global bio cache for merged IPU pages;
then f2fs_wait_on_page_writeback() is able to submit the bio if a
page under writeback is cached in the global bio cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: fix to avoid memory leakage in f2fs_listxattr
Randall Huang [Fri, 18 Oct 2019 06:56:22 +0000 (14:56 +0800)]
f2fs: fix to avoid memory leakage in f2fs_listxattr

In f2fs_listxattr, there is no boundary check before
memcpy copies e_name into the buffer.
If e_name_len is corrupted,
unexpected memory contents may be returned in the buffer.

Signed-off-by: Randall Huang <huangrandall@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: check total_segments from devices in raw_super
Qiuyang Sun [Mon, 23 Sep 2019 04:22:35 +0000 (12:22 +0800)]
f2fs: check total_segments from devices in raw_super

For multi-device F2FS, we should check if the sum of total_segments from
all devices matches segment_count.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: update multi-dev metadata in resize_fs
Qiuyang Sun [Mon, 23 Sep 2019 04:21:39 +0000 (12:21 +0800)]
f2fs: update multi-dev metadata in resize_fs

Multi-device metadata should be updated in resize_fs as well.

Also, we check that the new FS size still reaches the last device.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: mark recovery flag correctly in read_raw_super_block()
Chengguang Xu via Linux-f2fs-devel [Fri, 27 Sep 2019 01:35:48 +0000 (09:35 +0800)]
f2fs: mark recovery flag correctly in read_raw_super_block()

If the first attempt fails and the second succeeds,
we will miss marking the recovery flag because currently
we reuse the err variable in the loop.

Signed-off-by: Chengguang Xu <cgxu519@zoho.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: fix to update time in lazytime mode
Chao Yu [Fri, 27 Sep 2019 10:01:35 +0000 (18:01 +0800)]
f2fs: fix to update time in lazytime mode

generic/018 reports an inconsistent status of atime; the
testcase is as below:
- open file with O_SYNC
- write file to construct fragmented space
- calc md5 of file
- record {a,c,m}time
- defrag file --- do nothing
- umount & mount
- check {a,c,m}time

The root cause is that, as f2fs enables lazytime by default, an atime
update will dirty the VFS inode rather than the f2fs inode (by setting
FI_DIRTY_INODE), so a later f2fs_write_inode() called from the VFS will
fail to update the inode page due to our skip:

f2fs_write_inode()
	if (is_inode_flag_set(inode, FI_DIRTY_INODE))
		return 0;

So eventually, after evict(), we lose the last atime forever.

To fix this issue, we need to check whether {a,c,m,cr}time is
consistent between the inode cache and the inode page, and only skip
f2fs_update_inode() if f2fs inode is not dirty and time is
consistent as well.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  simple_lmk: Change Kconfig defaults
Sultan Alsawaf [Tue, 5 Nov 2019 16:31:57 +0000 (08:31 -0800)]
simple_lmk: Change Kconfig defaults

Commit "simple_lmk: Make reclaim deterministic" changed Simple LMK's
behavior, so the default parameters must be updated as well to
compensate.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  simple_lmk: Clean up some code style nitpicks
Sultan Alsawaf [Mon, 4 Nov 2019 19:17:16 +0000 (11:17 -0800)]
simple_lmk: Clean up some code style nitpicks

Using a parameter to pass around an unmodified pointer to a global
variable is crufty; just use the `victims` variable directly instead.
Also, compress the code in simple_lmk_init_set() a bit to make it look
cleaner.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
4 years ago  simple_lmk: Make reclaim deterministic
Sultan Alsawaf [Mon, 4 Nov 2019 19:00:41 +0000 (11:00 -0800)]
simple_lmk: Make reclaim deterministic

The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.

This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
4 years ago  simple_lmk: Fix broken multicopy atomicity for victims_to_kill
Sultan Alsawaf [Mon, 4 Nov 2019 18:49:21 +0000 (10:49 -0800)]
simple_lmk: Fix broken multicopy atomicity for victims_to_kill

When the reclaim thread writes to victims_to_kill on one CPU, it expects
the updated value to be immediately reflected on all CPUs in order for
simple_lmk_mm_freed() to work correctly. Due to the lack of memory
barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be
given a victim's mm without knowing the correct victims_to_kill value,
which can cause the reclaim thread to remain stuck waiting forever for
all victims to be freed. This scenario, despite being rare, has been
observed.

Fix this by using proper atomic helpers with memory barriers.
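
A hedged sketch of the fix, assuming an atomic victims_to_kill counter; the
helpers are from the generic atomics API, the surrounding logic is
illustrative:

static atomic_t victims_to_kill = ATOMIC_INIT(0);

/* Reclaim thread: publish the count only after the victim array is fully
 * written, with release semantics. */
static void publish_victims(int nr)
{
	atomic_set_release(&victims_to_kill, nr);
}

/* simple_lmk_mm_freed() side: pair with an acquire read so the victim data
 * written before the release is guaranteed to be visible here. */
static int read_victims(void)
{
	return atomic_read_acquire(&victims_to_kill);
}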

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
4 years ago  msm: mdss: Silence debug logs
Julian Liu [Mon, 4 Nov 2019 09:13:45 +0000 (17:13 +0800)]
msm: mdss: Silence debug logs

4 years ago  simple_lmk: Don't give victims privileges
Julian Liu [Fri, 1 Nov 2019 09:33:05 +0000 (17:33 +0800)]
simple_lmk: Don't give victims privileges

* When Simple LMK is triggered, there are usually heavy tasks running.
  Increasing the victim's priority may result in unnecessary preemption and lag

4 years ago  defconfig: Disable CC_STACKPROTECTOR_STRONG
Julian Liu [Thu, 31 Oct 2019 23:46:19 +0000 (07:46 +0800)]
defconfig: Disable CC_STACKPROTECTOR_STRONG

* Saves a bit of size; don't care about other configs.

4 years ago  Kbuild: don't pass "-C" to preprocessor when processing linker scripts
Linus Torvalds [Thu, 2 Nov 2017 21:10:37 +0000 (14:10 -0700)]
Kbuild: don't pass "-C" to preprocessor when processing linker scripts

For some odd historical reason, we preprocessed the linker scripts with
"-C", which keeps comments around.  That makes no sense, since the
comments are not meaningful for the build anyway.

And it actually breaks things, since linker scripts can't have C++ style
"//" comments in them, so keeping comments after preprocessing now
limits us in odd and surprising ways in our header files for no good
reason.

The -C option goes back to pre-git and pre-bitkeeper times, but seems to
have been historically used (along with "-traditional") for some
odd-ball architectures (ia64, MIPS and SH).  It probably didn't matter
back then either, but might possibly have been used to minimize the
difference between the original file and the pre-processed result.

The reason for this may be lost in time, but let's not perpetuate it
only because we can't remember why we did this crazy thing.

This was triggered by the recent addition of SPDX lines to the source
tree, where people apparently were confused about why header files
couldn't use the C++ comment format.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  qcacld-3.0: Disable open-source flag
Julian Liu [Thu, 31 Oct 2019 03:54:32 +0000 (11:54 +0800)]
qcacld-3.0: Disable open-source flag

* It disables some debug code and makes the driver more similar to the OEM's.

4 years ago  qpnp-fg-gen3: Don't ratelimit interaction related props
Julian Liu [Wed, 30 Oct 2019 07:10:05 +0000 (15:10 +0800)]
qpnp-fg-gen3: Don't ratelimit interaction related props

* The state changes corresponding to these props should be fed back to userspace immediately.

4 years ago  loop: avoid EAGAIN, if offset or block_size are changed
Jaegeuk Kim [Fri, 17 May 2019 23:37:50 +0000 (16:37 -0700)]
loop: avoid EAGAIN, if offset or block_size are changed

This patch tries to avoid EAGAIN due to the nrpages != 0 check, which was
originally intended to drop stale pages that would result in wrong data access.

Report: https://bugs.chromium.org/p/chromium/issues/detail?id=938958#c38

Cc: <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Fixes: 5db470e229e2 ("loop: drop caches if offset or block_size are changed")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Reported-by: grygorii tertychnyi <gtertych@cisco.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
4 years ago  scatterlist: Speed up for_each_sg() loop macro
Sultan Alsawaf [Fri, 25 Oct 2019 02:00:35 +0000 (19:00 -0700)]
scatterlist: Speed up for_each_sg() loop macro

Scatterlists are chained in predictable arrays of up to
SG_MAX_SINGLE_ALLOC sg structs in length. Using this knowledge, speed up
for_each_sg() by using constant operations to determine when to simply
increment the sg pointer by one or get the next sg array in the chain.

Rudimentary measurements with a trivial loop body show that this yields
roughly a 2x performance gain.

The following simple test module proves the correctness of the new loop
definition by testing all the different edge cases of sg chains:
#include <linux/module.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int __init test_for_each_sg(void)
{
        static const gfp_t gfp_flags = GFP_KERNEL | __GFP_NOFAIL;
        struct scatterlist *sg;
        struct sg_table *table;
        long old = 0, new = 0;
        unsigned int i, nents;

        table = kmalloc(sizeof(*table), gfp_flags);
        for (nents = 1; nents <= 3 * SG_MAX_SINGLE_ALLOC; nents++) {
                BUG_ON(sg_alloc_table(table, nents, gfp_flags));
                for (sg = table->sgl; sg; sg = sg_next(sg))
                        old ^= (long)sg;
                for_each_sg(table->sgl, sg, nents, i)
                        new ^= (long)sg;
                sg_free_table(table);
        }

        BUG_ON(old != new);
        kfree(table);
        return 0;
}
module_init(test_for_each_sg);

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  mm/shmem.c: fix unlikely() test of info->seals to test only for WRITE and GROW
Steven Rostedt (VMware) [Fri, 24 Feb 2017 22:59:10 +0000 (14:59 -0800)]
mm/shmem.c: fix unlikely() test of info->seals to test only for WRITE and GROW

Running my likely/unlikely profiler, I discovered that the test in
shmem_write_begin() that tests for info->seals as unlikely, is always
incorrect.  This is because shmem_get_inode() sets info->seals to have
F_SEAL_SEAL set by default, and it is unlikely to be cleared when
shmem_write_begin() is called.  Thus, the if statement is very likely.

But as the if statement block only cares about F_SEAL_WRITE and
F_SEAL_GROW, change the test to only test those two bits.
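
In code terms, the change described is of roughly this shape (a sketch, not
the verbatim diff):

-	if (unlikely(info->seals)) {
+	if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE))) {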

Link: http://lkml.kernel.org/r/20170203105656.7aec6237@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  sched/core: Remove unlikely() annotation from sched_move_task()
Steven Rostedt (VMware) [Mon, 6 Feb 2017 16:04:26 +0000 (11:04 -0500)]
sched/core: Remove unlikely() annotation from sched_move_task()

The check for 'running' in sched_move_task() has an unlikely() around it. That
is, it is unlikely that the task being moved is running. That used to be
true. But with a couple of recent updates, it is now likely that the task
will be running.

The first change came from ea86cb4b7621 ("sched/cgroup: Fix
cpu_cgroup_fork() handling") that moved around the use case of
sched_move_task() in do_fork() where the call is now done after the task is
woken (hence it is running).

The second change came from 8e5bfa8c1f84 ("sched/autogroup: Do not use
autogroup->tg in zombie threads") where sched_move_task() is called by the
exit path, by the task that is exiting. Hence it too is running.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years ago  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
Steven Rostedt (VMware) [Thu, 19 Jan 2017 16:32:34 +0000 (11:32 -0500)]
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()

Running my likely/unlikely profiler for 3 weeks on two production
machines, I discovered that the unlikely() test in
__rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
100% of the time, making it a very likely case.

The reason is, on a vanilla kernel, the majority case of calling
rt_mutex() is from the futex code. This code is always called as
TASK_INTERRUPTIBLE. In the -rt patch, this code is commonly called when
PREEMPT_RT is enabled with TASK_UNINTERRUPTIBLE. But that's not the
likely scenario.

The rt_mutex() code should be optimized for the common vanilla case,
and that is from a futex, with TASK_INTERRUPTIBLE as the state.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years ago  locking/rtmutex: Only warn once on a trylock from bad context
Sebastian Andrzej Siewior [Fri, 27 May 2016 13:47:18 +0000 (15:47 +0200)]
locking/rtmutex: Only warn once on a trylock from bad context

One warning should be enough to get one motivated to fix this. It is
possible that this happens more than once and starts flooding the
output. Later prints will be suppressed, so we only get half of it;
depending on the console system used, that might not be helpful.
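
A sketch of the likely shape of the change, switching to the one-shot
variant of the warning macro:

-	if (WARN_ON(in_irq() || in_interrupt()))
+	if (WARN_ON_ONCE(in_irq() || in_interrupt()))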

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years ago  rtmutex: Make wait_lock irq safe
Thomas Gleixner [Wed, 13 Jan 2016 10:25:38 +0000 (11:25 +0100)]
rtmutex: Make wait_lock irq safe

Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0                                    CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
                                        local_irq_disable()
                                        spin_lock(&timer->it_lock)
                                        spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1

     rcu_read_lock()
     x = lookup();
     if (x)
      spin_lock_irqsave(&x->it_lock);
     rcu_read_unlock();
     return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock.
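
A sketch of the resulting locking pattern in the rtmutex slow paths
(illustrative, not the full diff):

	unsigned long flags;

-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	...
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);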

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
4 years ago  ASoC: tas2559: use power efficient workqueues
Julian Liu [Mon, 28 Oct 2019 17:33:55 +0000 (01:33 +0800)]
ASoC: tas2559: use power efficient workqueues

4 years ago  power: supply: use power efficient workqueues
Julian Liu [Sun, 27 Oct 2019 14:25:31 +0000 (22:25 +0800)]
power: supply: use power efficient workqueues

* YEEEEEEE

4 years ago  ASoC: wcd9335: use power efficient workqueues
Julian Liu [Sun, 27 Oct 2019 14:21:36 +0000 (22:21 +0800)]
ASoC: wcd9335: use power efficient workqueues

4 years ago  mm, vmstat: Add likelihood labels to quiet_vmstat conditions
Danny Lin [Fri, 10 May 2019 05:33:48 +0000 (22:33 -0700)]
mm, vmstat: Add likelihood labels to quiet_vmstat conditions

These labels are based on observations from a running system as well as
from inspecting the code:

!delayed_work_pending:
  true  = 3509732
  false = 7495535

!need_update:
  true  = 6656251
  false = 840000
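
Read against quiet_vmstat(), the counts above suggest annotations of this
shape (a sketch; the exact conditions in the function may differ):

	/* false roughly twice as often as true in the data above */
	if (unlikely(!delayed_work_pending(this_cpu_ptr(&vmstat_work))))
		return;

	/* true in the large majority of samples */
	if (likely(!need_update(smp_processor_id())))
		return;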

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
4 years ago  mm: vmstat: use power efficient workqueues
Julian Liu [Sun, 27 Oct 2019 14:11:01 +0000 (22:11 +0800)]
mm: vmstat: use power efficient workqueues

4 years ago  platform: ipa: use power efficient workqueues
Julian Liu [Sun, 27 Oct 2019 14:09:20 +0000 (22:09 +0800)]
platform: ipa: use power efficient workqueues

4 years agoRevert "f2fs: Tune issue discard commands up to 128MB"
Julian Liu [Sun, 27 Oct 2019 13:27:16 +0000 (21:27 +0800)]
Revert "f2fs: Tune issue discard commands up to 128MB"

* We have a good GC strategy; we don't need this.
This reverts commit 8439a3a642a927cfb1432b5c50c44eaaa76a4686.

4 years ago  msm: kgsl: Use for_each_sg() macro where possible
Sultan Alsawaf [Fri, 25 Oct 2019 17:52:33 +0000 (10:52 -0700)]
msm: kgsl: Use for_each_sg() macro where possible

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  qseecom: Use for_each_sg() macro where possible
Sultan Alsawaf [Fri, 25 Oct 2019 17:52:01 +0000 (10:52 -0700)]
qseecom: Use for_each_sg() macro where possible

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  crypto: msm: Use for_each_sg() macro where possible
Sultan Alsawaf [Fri, 25 Oct 2019 17:51:18 +0000 (10:51 -0700)]
crypto: msm: Use for_each_sg() macro where possible

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  kernel: sched: Mitigate non-boosted tasks preempting boosted tasks
Miguel de Dios [Mon, 29 Apr 2019 23:09:33 +0000 (16:09 -0700)]
kernel: sched: Mitigate non-boosted tasks preempting boosted tasks

Currently, when a boosted task is scheduled, we use prefer_idle to try to
get it to an idle core. Once it's scheduled, there is a possibility that
a non-boosted task will be scheduled on the same core the boosted task
is running on. This change aims to mitigate that possibility by checking
whether the core we're targeting has a boosted task and, if so, using the
next best idle core instead.

Bug: 131626264
Change-Id: I3d321e1c71f96526f55f7f3a56e32db411311aa2
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
4 years ago  f2fs: gc_wake is unlikely as true
Julian Liu [Fri, 25 Oct 2019 13:26:04 +0000 (21:26 +0800)]
f2fs: gc_wake is unlikely as true

4 years ago  defconfig: Disable UNMAP_KERNEL_AT_EL0
Julian Liu [Fri, 25 Oct 2019 12:26:21 +0000 (20:26 +0800)]
defconfig: Disable UNMAP_KERNEL_AT_EL0

4 years ago  f2fs: Tune background GC for Android
Julian Liu [Sun, 20 Oct 2019 17:29:25 +0000 (01:29 +0800)]
f2fs: Tune background GC for Android

4 years ago  f2fs: Tune issue discard commands up to 128MB
Julian Liu [Thu, 24 Oct 2019 22:02:31 +0000 (06:02 +0800)]
f2fs: Tune issue discard commands up to 128MB

https://android.googlesource.com/device/google/coral/+/refs/tags/android-10.0.0_r9/init.hardware.rc#550

4 years ago  page-writeback: Hardcode dirty_background_ratio to 10
Julian Liu [Thu, 24 Oct 2019 21:10:49 +0000 (05:10 +0800)]
page-writeback: Hardcode dirty_background_ratio to 10

4 years ago  cpufreq: schedutil: tune default rate limit from Pixel
Julian Liu [Thu, 24 Oct 2019 21:04:58 +0000 (05:04 +0800)]
cpufreq: schedutil: tune default rate limit from Pixel

4 years ago  arm64: Select ARCH_HAS_FAST_MULTIPLIER
Robin Murphy [Tue, 24 Apr 2018 15:25:47 +0000 (16:25 +0100)]
arm64: Select ARCH_HAS_FAST_MULTIPLIER

It is probably safe to assume that all Armv8-A implementations have a
multiplier whose efficiency is comparable or better than a sequence of
three or so register-dependent arithmetic instructions. Select
ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the
few dusty old corners which care.

In a contrived benchmark calling hweight64() in a loop, this does indeed
turn out to be a small win overall, with no measurable impact on
Cortex-A57 but about 5% performance improvement on Cortex-A53.

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
4 years ago  sdcardfs: fix wrong ENOENT when creating a file
Jaegeuk Kim [Tue, 26 Jun 2018 06:00:24 +0000 (23:00 -0700)]
sdcardfs: fix wrong ENOENT when creating a file

There is a subtle race condition where lower_dentry is NULL. If we retry
the lookup again, we should get the correct dentry.

Bug: 110585947
Bug: 110464178
Bug: 80587794
Bug: 37231161
Bug: 110199687
Change-Id: I39b95de4649b034287776f5c8a5d197b6ebd9ada
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
4 years ago  ARM: dts: msm8998: Recalibrate busy costs using better data
Sultan Alsawaf [Mon, 22 Apr 2019 09:03:46 +0000 (02:03 -0700)]
ARM: dts: msm8998: Recalibrate busy costs using better data

Power and performance were measured in-kernel on a production wahoo device
in a high-precision manner. Use the resulting data to create a
better energy model that reflects the actual behavior of production
hardware.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
wloot: Adapt to 98897c5c ("ARM: dts: msm8998: Remove few bottom frequencies of low energy efficiency")

4 years ago  arm64/neon: Disable -Wincompatible-pointer-types when building with Clang
Nathan Chancellor [Fri, 15 Feb 2019 01:39:59 +0000 (18:39 -0700)]
arm64/neon: Disable -Wincompatible-pointer-types when building with Clang

After commit cc9f8349cb33 ("arm64: crypto: add NEON accelerated XOR
implementation"), Clang builds for arm64 started failing with the
following error message.

arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
assigning to 'const unsigned long *' from 'uint64_t *' (aka 'unsigned
long long *') [-Werror,-Wincompatible-pointer-types]
                v3 = veorq_u64(vld1q_u64(dp1 +  6), vld1q_u64(dp2 + 6));
                                         ^~~~~~~~
/usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
expanded from macro 'vld1q_u64'
  __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                              ^~~~

There has been quite a bit of debate and triage that has gone into
figuring out what the proper fix is, viewable at the link below, which
is still ongoing. Ard suggested disabling this warning with Clang with a
pragma so no neon code will have this type of error. While this is not
at all an ideal solution, this build error is the only thing preventing
KernelCI from having successful arm64 defconfig and allmodconfig builds
on linux-next. Getting continuous integration running is more important
so new warnings/errors or boot failures can be caught and fixed quickly.

Link: https://github.com/ClangBuiltLinux/linux/issues/283
Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
4 years ago  arm64/neon: add workaround for ambiguous C99 stdint.h types
Jackie Liu [Tue, 4 Dec 2018 01:43:22 +0000 (09:43 +0800)]
arm64/neon: add workaround for ambiguous C99 stdint.h types

In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround
for ambiguous C99 stdint.h types"), this patch redefines the macros that
are used in stdint.h so its definitions of uint64_t and int64_t are
compatible with those of the kernel.

This patch comes from: https://patchwork.kernel.org/patch/3540001/
Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

We mark this file as private so we don't have to override asm/types.h.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
4 years ago  ion: fix a possible memory leak in ion_cma_allocate
Suren Baghdasaryan [Thu, 18 Apr 2019 19:42:44 +0000 (12:42 -0700)]
ion: fix a possible memory leak in ion_cma_allocate

The memory leak occurs when kmalloc() for info->table fails: info is freed,
but the info->cpu_addr allocation is left behind.
Fixes: eeeb940746de ("ion: add snapshot of ion support for MSM")

Bug: 130817249
Test: builds and boots
Change-Id: I7faf5be5129a46b2f874f4e3803470e5f5130a21
Reported-by: Mikael Magnusson <Mikael.Magnusson@sony.com>
Suggested-by: Mikael Magnusson <Mikael.Magnusson@sony.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
4 years ago  mm/page-writeback.c: place "not" inside of unlikely() statement in wb_domain_writeout_inc()
Steven Rostedt (VMware) [Fri, 24 Feb 2017 22:59:24 +0000 (14:59 -0800)]
mm/page-writeback.c: place "not" inside of unlikely() statement in wb_domain_writeout_inc()

The likely/unlikely profiler noticed that the unlikely statement in
wb_domain_writeout_inc() is constantly wrong.  This is due to the "not"
(!) being outside the unlikely statement.  It is likely that
dom->period_time will be set, but unlikely that it won't be.  Move the
not into the unlikely statement.
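
That is, the change is of this shape (sketch):

-	if (!unlikely(dom->period_time))
+	if (unlikely(!dom->period_time))
		return;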

Link: http://lkml.kernel.org/r/20170206120035.3c2e2b91@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  list: introduce list_for_each_entry_from_reverse helper
Jiri Pirko [Fri, 3 Feb 2017 09:29:05 +0000 (10:29 +0100)]
list: introduce list_for_each_entry_from_reverse helper

Similar to list_for_each_entry_continue and its reverse variant
list_for_each_entry_continue_reverse, introduce reverse helper for
list_for_each_entry_from.
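
The helper presumably takes the usual shape of these macros; a sketch based
on its continue/reverse siblings in include/linux/list.h:

#define list_for_each_entry_from_reverse(pos, head, member)		\
	for (; &pos->member != (head);					\
	     pos = list_prev_entry(pos, member))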

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago  mm/vmalloc: rework vmap_area_lock
Uladzislau Rezki (Sony) [Tue, 22 Oct 2019 15:58:00 +0000 (17:58 +0200)]
mm/vmalloc: rework vmap_area_lock

With the new allocation approach introduced in the 5.2 kernel, it
becomes possible to get rid of one global spinlock. By doing that
we can further improve KVA performance.

Basically we can have two independent locks, one for allocation
part and another one for deallocation, because of two different
entities: "free data structures" and "busy data structures".

As a result, allocation/deallocation operations can still interfere
with each other when running simultaneously on different
CPUs; there is still a dependency, but with two locks it
becomes lower.

Summarizing:
  - it reduces the high lock contention
  - it allows operations on the "free" and "busy"
    trees to be performed in parallel on different CPUs. Please note it
    does not solve the scalability issue.
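
A sketch of the split described above, following the existing
vmap_area_lock naming convention:

/* Protects the "busy" tree and vmap_area_list. */
static DEFINE_SPINLOCK(vmap_area_lock);
/* Protects the "free" tree and free_vmap_area_list. */
static DEFINE_SPINLOCK(free_vmap_area_lock);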

Test results:
In order to evaluate this patch, we can run "vmalloc test driver"
to see how many CPU cycles it takes to complete all test cases
running sequentially. All online CPUs run it so it will cause
a high lock contention.

HiKey 960, ARM64, 8xCPUs, big.LITTLE:

<snip>
    sudo ./test_vmalloc.sh sequential_test_order=1
<snip>

<default>
[  390.950557] All test took CPU0=457126382 cycles
[  391.046690] All test took CPU1=454763452 cycles
[  391.128586] All test took CPU2=454539334 cycles
[  391.222669] All test took CPU3=455649517 cycles
[  391.313946] All test took CPU4=388272196 cycles
[  391.410425] All test took CPU5=384036264 cycles
[  391.492219] All test took CPU6=387432964 cycles
[  391.578433] All test took CPU7=387201996 cycles
<default>

<patched>
[  304.721224] All test took CPU0=391521310 cycles
[  304.821219] All test took CPU1=393533002 cycles
[  304.917120] All test took CPU2=392243032 cycles
[  305.008986] All test took CPU3=392353853 cycles
[  305.108944] All test took CPU4=297630721 cycles
[  305.196406] All test took CPU5=297548736 cycles
[  305.288602] All test took CPU6=297092392 cycles
[  305.381088] All test took CPU7=297293597 cycles
<patched>

~14%-23% patched variant is better.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years ago  mm/vmalloc: add more comments to the adjust_va_to_fit_type()
Uladzislau Rezki (Sony) [Wed, 16 Oct 2019 09:54:38 +0000 (11:54 +0200)]
mm/vmalloc: add more comments to the adjust_va_to_fit_type()

When the fit type is NE_FIT_TYPE, one extra object is needed.
Usually the "ne_fit_preload_node" per-CPU variable has it and
there is no need for a GFP_NOWAIT allocation, but there are exceptions.

This commit just adds more explanations, as a result giving
answers to questions like when it can occur, how often, under
which conditions, and what happens if GFP_NOWAIT fails.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
4 years ago  mm/vmalloc: respect passed gfp_mask when do preloading
Uladzislau Rezki (Sony) [Wed, 16 Oct 2019 09:54:37 +0000 (11:54 +0200)]
mm/vmalloc: respect passed gfp_mask when do preloading

alloc_vmap_area() is given a gfp_mask for the page allocator.
Let's respect that mask and consider it even in the case when
doing regular CPU preloading, i.e. where a context can sleep.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 years ago  mm/vmalloc: remove preempt_disable/enable when do preloading
Uladzislau Rezki (Sony) [Wed, 16 Oct 2019 09:54:36 +0000 (11:54 +0200)]
mm/vmalloc: remove preempt_disable/enable when do preloading

Some background: preemption was previously disabled to guarantee
that a preloaded object is available for the CPU it was stored for.

The aim was to not allocate in atomic context when the spinlock
is taken later, for regular vmap allocations. But that approach
conflicts with the CONFIG_PREEMPT_RT philosophy: calling
spin_lock() with preemption disabled is forbidden
in a CONFIG_PREEMPT_RT kernel.

Therefore, get rid of preempt_disable() and preempt_enable() when
the preload is done for splitting purposes. As a result we no longer
guarantee that a CPU is preloaded; instead, with this change, we
minimize the case when it is not.

For example, I ran a special test case that follows the preload
pattern and path. 20 "unbind" threads run it and each does
1000000 allocations. Only 3.5 times per 1000000 allocations a CPU was
not preloaded. So it can happen, but the number is negligible.

V2 - > V3:
    - update the commit message

V1 -> V2:
  - move __this_cpu_cmpxchg check when spin_lock is taken,
    as proposed by Andrew Morton
  - add more explanation in regard of preloading
  - adjust and move some comments

Fixes: 82dd23e84be3 ("mm/vmalloc.c: preload a CPU with one object for split purpose")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
4 years ago  mm: vmalloc: show number of vmalloc pages in /proc/meminfo
Roman Gushchin [Fri, 12 Jul 2019 04:00:13 +0000 (21:00 -0700)]
mm: vmalloc: show number of vmalloc pages in /proc/meminfo

Vmalloc() is getting more and more used these days (kernel stacks, bpf and
percpu allocator are new top users), and the total % of memory consumed by
vmalloc() can be pretty significant and changes dynamically.

/proc/meminfo is the best place to display this information: its top goal
is to show top consumers of the memory.

Since the VmallocUsed field in /proc/meminfo has not been in use for quite a
long time (it has been defined to 0 by a5ad88ce8c7f ("mm: get rid of
'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
actual physical memory consumption of vmalloc().

Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc.c: move 'area->pages' after if statement
Austin Kim [Mon, 23 Sep 2019 22:36:42 +0000 (15:36 -0700)]
mm/vmalloc.c: move 'area->pages' after if statement

If the !area->pages condition is true, i.e. the memory allocation failed,
area is freed.

In this case 'area->pages = pages' should not be executed.  So move
'area->pages = pages' after the if statement.

[akpm@linux-foundation.org: give area->pages the same treatment]
Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc: modify struct vmap_area to reduce its size
Pengfei Li [Mon, 23 Sep 2019 22:36:39 +0000 (15:36 -0700)]
mm/vmalloc: modify struct vmap_area to reduce its size

Objective
---------

The current implementation of struct vmap_area wasted space.

After applying this commit, sizeof(struct vmap_area) has been
reduced from 11 words to 8 words.

Description
-----------

1) Pack "subtree_max_size", "vm" and "purge_list".  This is no problem
   because

A) "subtree_max_size" is only used when vmap_area is in "free" tree

B) "vm" is only used when vmap_area is in "busy" tree

C) "purge_list" is only used when vmap_area is in vmap_purge_list

2) Eliminate "flags".

Since only one flag, VM_VM_AREA, is being used, and the same thing can be
done by checking whether "vm" is NULL, the "flags" field can be eliminated.
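
A sketch of the packed layout this describes, using an anonymous union over
the three mutually exclusive members:

struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;

	struct rb_node rb_node;		/* address-sorted rbtree node */
	struct list_head list;		/* address-sorted list */

	union {
		unsigned long subtree_max_size;	/* in "free" tree */
		struct vm_struct *vm;		/* in "busy" tree */
		struct llist_node purge_list;	/* in vmap_purge_list */
	};
};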

Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
Arnd Bergmann [Fri, 28 Jun 2019 19:07:09 +0000 (12:07 -0700)]
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning

gcc gets confused in pcpu_get_vm_areas() because there are too many
branches that affect whether 'lva' was initialized before it gets used:

  mm/vmalloc.c: In function 'pcpu_get_vm_areas':
  mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      insert_vmap_area_augment(lva, &va->rb_node,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       &free_vmap_area_root, &free_vmap_area_list);
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/vmalloc.c:916:20: note: 'lva' was declared here
    struct vmap_area *lva;
                      ^~~

Add an initialization to NULL, and check whether this has changed before
the first use.
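
In other words, a change of this shape (sketch):

	struct vmap_area *lva = NULL;
	...
	if (lva)	/* type == NE_FIT_TYPE, lva was allocated above */
		insert_vmap_area_augment(lva, &va->rb_node,
					 &free_vmap_area_root,
					 &free_vmap_area_list);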

[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc.c: fix percpu free VM area search criteria
Kuppuswamy Sathyanarayanan [Tue, 13 Aug 2019 22:37:31 +0000 (15:37 -0700)]
mm/vmalloc.c: fix percpu free VM area search criteria

Recent changes to the vmalloc code by commit 68ad4a330433
("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
cause spurious percpu allocation failures.  These, in turn, can result
in panic()s in the slub code.  One such possible panic was reported by
Dave Hansen in the following link: https://lkml.org/lkml/2019/6/19/939.
Another related panic observed is,

 RIP: 0033:0x7f46f7441b9b
 Call Trace:
  dump_stack+0x61/0x80
  pcpu_alloc.cold.30+0x22/0x4f
  mem_cgroup_css_alloc+0x110/0x650
  cgroup_apply_control_enable+0x133/0x330
  cgroup_mkdir+0x41b/0x500
  kernfs_iop_mkdir+0x5a/0x90
  vfs_mkdir+0x102/0x1b0
  do_mkdirat+0x7d/0xf0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
uses two lists (vmap_area_list & free_vmap_area_list) to track the used
and free VM areas in VMALLOC space.  The pcpu_get_vm_areas(offsets[],
sizes[], nr_vms, align) function is used for allocating congruent VM
areas for the percpu memory allocator.  In order to not conflict with
VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
VMALLOC space.  So the search for free vm_area for the given requirement
starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a330433, the search for a free vm_area in
pcpu_get_vm_areas() involved the following two main steps.

Step 1:
    Find an aligned "base" address near VMALLOC_END.
    va = free vm area near VMALLOC_END
Step 2:
    Loop through the number of requested vm_areas and check:
        Step 2.1:
           if (base < VMALLOC_START)
              1. fail with error
        Step 2.2:
           // end is offsets[area] + sizes[area]
           if (base + end > va->vm_end)
               1. Move the base downwards and repeat Step 2
        Step 2.3:
           if (base + start < va->vm_start)
              1. Move to the previous free vm_area node, find an
                 aligned base address and repeat Step 2

But commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as
below:

        Step 2.3:
           if (base + start < va->vm_start || base + end > va->vm_end)
              1. Move to the previous free vm_area node, find an
                 aligned base address and repeat Step 2

The above change is the root cause of the spurious percpu memory
allocation failures.  For example, consider a case where a relatively
large vm_area (~30 TB) was ignored in the free vm_area search because
it did not satisfy the base + end <= va->vm_end boundary check.
Ignoring such large free vm_areas leads to not finding a free vm_area
within the boundaries of VMALLOC_START to VMALLOC_END, which in turn
leads to allocation failures.

So modify the search algorithm to include Step 2.2 again, as sketched
below.
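A hedged sketch of the re-added check (field and helper names assumed
from the post-68ad4a330433 code, so details may differ from the final
patch):

        if (base + end > va->va_end) {
                /*
                 * This free area is too small at the current base:
                 * move the base down within the same area and restart
                 * the per-area scan instead of skipping the area.
                 */
                base = pvm_determine_end_from_reverse(&va, align) - end;
                term_area = area;
                continue;
        }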

Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc: do not keep unpurged areas in the busy tree
Uladzislau Rezki (Sony) [Tue, 16 Jul 2019 12:05:17 +0000 (14:05 +0200)]
mm/vmalloc: do not keep unpurged areas in the busy tree

The busy tree can be quite big: even after an area is freed or
unmapped, it stays in the tree until the "purge" logic removes it.

1) Optimize and reduce the size of the "busy" tree by removing a node
from it right away as soon as a user triggers the free path (see the
sketch after this list).  It is possible to do so because the
allocation is done using another augmented tree.

The vmalloc test driver shows the difference; for example, the
"fix_size_alloc_test" is ~11% better compared with the default
configuration:

sudo ./test_vmalloc.sh performance

<default>
Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
<default>

<this patch>
Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
<this patch>

2) Since the busy tree now contains allocated areas only and does not
interfere with lazily freed nodes, introduce the new function
show_purge_info() that dumps "unpurged" areas; its output is exposed
through "/proc/vmallocinfo".

3) Eliminate the VM_LAZY_FREE flag.
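A minimal sketch of point 1 (simplified; queue_for_lazy_purge() is a
hypothetical stand-in for the existing lazy-purge bookkeeping):

  static void free_vmap_area_noflush(struct vmap_area *va)
  {
          /*
           * Detach from the busy tree/list right away instead of
           * leaving that to the purge logic.
           */
          unlink_va(va, &vmap_area_root);

          /* Hand the area over to the lazy purge machinery as before. */
          queue_for_lazy_purge(va);
  }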

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years agomm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:11 +0000 (14:04 +0200)]
mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()

Trigger a warning if an object that is about to be freed is detached.
We used to have a BUG_ON() there, but even though this is considered
faulty behaviour, it is not a good reason to bring a system down.
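The shape of the change, as a sketch inside unlink_va() (the exact
context may differ):

  -	BUG_ON(RB_EMPTY_NODE(&va->rb_node));
  +	if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
  +		return;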

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years agomm/vmalloc.c: get rid of one single unlink_va() when merge
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:10 +0000 (14:04 +0200)]
mm/vmalloc.c: get rid of one single unlink_va() when merge

It does not make sense to try to "unlink" a node that is definitely
not linked into a list or tree.  On the first merge step, VA just
points to the previously disconnected busy area.

On the second step, check whether the node has been merged and do the
"unlink" only if so, because by then it points to an object that must
be linked.
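A sketch of the second step (simplified; the "merged" flag is
illustrative rather than the exact variable in the patch):

        if (merged)
                /*
                 * The node now refers to a linked object, so it is
                 * safe (and necessary) to detach it exactly once.
                 */
                unlink_va(va, root);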

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
4 years agomm/vmalloc.c: preload a CPU with one object for split purpose
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:09 +0000 (14:04 +0200)]
mm/vmalloc.c: preload a CPU with one object for split purpose

Refactor the NE_FIT_TYPE split case when it comes to the allocation of
one extra object.  We need the extra object in order to build the
remaining space.  The preload is done per CPU in non-atomic context
with GFP_KERNEL flags.
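A hedged sketch of the per-CPU preload (locking and error handling of
the real patch are omitted):

  static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

  static void preload_this_cpu(void)
  {
          struct vmap_area *va;

          /* GFP_KERNEL is fine here: we are in non-atomic context. */
          va = kmem_cache_alloc(vmap_area_cachep, GFP_KERNEL);
          if (!va)
                  return;

          /* Keep at most one spare object per CPU for the split path. */
          if (cmpxchg(this_cpu_ptr(&ne_fit_preload_node), NULL, va))
                  kmem_cache_free(vmap_area_cachep, va);
  }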

More permissive parameters can be beneficial for systems which suffer
from high memory pressure or low-memory conditions.  For example, on my
KVM system (4 CPUs, no swap, 256 MB RAM) I can simulate a failure of
page allocation with GFP_NOWAIT flags.  Using the "stress-ng" tool and
starting N workers spinning on fork() and exit(), I can trigger the
trace below:

<snip>
[  179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[  179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003
[  179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  179.815171] Call Trace:
[  179.815178]  dump_stack+0x5c/0x7b
[  179.815182]  warn_alloc+0x108/0x190
[  179.815187]  __alloc_pages_slowpath+0xdc7/0xdf0
[  179.815191]  __alloc_pages_nodemask+0x2de/0x330
[  179.815194]  cache_grow_begin+0x77/0x420
[  179.815197]  fallback_alloc+0x161/0x200
[  179.815200]  kmem_cache_alloc+0x1c9/0x570
[  179.815202]  alloc_vmap_area+0x32c/0x990
[  179.815206]  __get_vm_area_node+0xb0/0x170
[  179.815208]  __vmalloc_node_range+0x6d/0x230
[  179.815211]  ? _do_fork+0xce/0x3d0
[  179.815213]  copy_process.part.46+0x850/0x1b90
[  179.815215]  ? _do_fork+0xce/0x3d0
[  179.815219]  _do_fork+0xce/0x3d0
[  179.815226]  ? __do_page_fault+0x2bf/0x4e0
[  179.815229]  do_syscall_64+0x55/0x130
[  179.815231]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  179.815234] RIP: 0033:0x7fedec4c738b
...
[  179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b
[  179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0
[  179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000
[  179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  179.815245] Mem-Info:
[  179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0
                active_file:502 inactive_file:61 isolated_file:70
                unevictable:2 dirty:0 writeback:0 unstable:0
                slab_reclaimable:2380 slab_unreclaimable:7520
                mapped:15069 shmem:14813 pagetables:10833 bounce:0
                free:1922 free_pcp:229 free_cma:0
<snip>

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years agomm/vmalloc.c: remove "node" argument
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:08 +0000 (14:04 +0200)]
mm/vmalloc.c: remove "node" argument

Remove the unused "node" argument from the __alloc_vmap_area() function.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
4 years agomm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t
Uladzislau Rezki (Sony) [Tue, 14 May 2019 22:41:25 +0000 (15:41 -0700)]
mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t

The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer
on both 32-bit and 64-bit systems.  lazy_max_pages() deals with
"unsigned long", which is 8 bytes on 64-bit systems, so vmap_lazy_nr
should be 8 bytes on 64-bit as well.
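The shape of the conversion, as a sketch:

  -static atomic_t vmap_lazy_nr = ATOMIC_INIT(0);
  +static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);

with the accessors updated to match, e.g. atomic_add()/atomic_read()
becoming atomic_long_add()/atomic_long_read().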

Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
Uladzislau Rezki (Sony) [Fri, 17 May 2019 21:31:37 +0000 (14:31 -0700)]
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro

This macro adds some debug code to check that vmap allocations happen
in ascending order.

By default this option is set to 0 and is not active.  Activating it
requires recompiling the kernel: set the macro to 1 and rebuild.
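A sketch of the compile-time switch (the checker name is assumed; the
check body is elided):

  #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK	0

  #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
          /*
           * Linearly walk the free tree and WARN if the augmented
           * lookup did not return the lowest suitable address.
           */
          find_vmap_lowest_match_check(size);
  #endif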

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
Uladzislau Rezki (Sony) [Fri, 17 May 2019 21:31:34 +0000 (14:31 -0700)]
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro

This macro adds some debug code to check that the augment tree is
maintained correctly, meaning that every node contains a valid
subtree_max_size value.

By default this option is set to 0 and is not active.  Activating it
requires recompiling the kernel: set the macro to 1 and rebuild.
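A hedged sketch of the kind of check this enables (the recursive
helper below is illustrative, not the exact patch):

  #if DEBUG_AUGMENT_PROPAGATE_CHECK
  static void augment_tree_propagate_check(struct rb_node *n)
  {
          struct vmap_area *va;

          if (!n)
                  return;

          va = rb_entry(n, struct vmap_area, rb_node);
          WARN_ON(va->subtree_max_size != compute_subtree_max_size(va));

          augment_tree_propagate_check(n->rb_left);
          augment_tree_propagate_check(n->rb_right);
  }
  #endif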

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm: vmalloc: pass proper vm_start into debugobjects
Chintan Pandya [Fri, 8 Jun 2018 00:06:53 +0000 (17:06 -0700)]
mm: vmalloc: pass proper vm_start into debugobjects

A client can call vunmap() with some intermediate 'addr' which may not
be the start of the VM area.  The entire unmap code works with
vm->vm_start, which is proper, but the debug object API is called with
'addr'.  This could be a problem within debug objects.

Pass the proper start address into the debug object API.
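One plausible shape of the fix (a sketch; the real patch may touch
more call sites):

  -	debug_check_no_locks_freed(addr, get_vm_area_size(area));
  +	debug_check_no_locks_freed(area->addr, get_vm_area_size(area));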

[akpm@linux-foundation.org: fix warning]
Link: http://lkml.kernel.org/r/1523961828-9485-3-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>