OSDN Git Service

sagit-ice-cold/kernel_xiaomi_msm8998.git
4 years ago  Revert "{chiron,sagit}_defconfig: enable bfq"
0ranko0P [Thu, 12 Dec 2019 12:49:25 +0000 (20:49 +0800)]
Revert "{chiron,sagit}_defconfig: enable bfq"

This reverts commit e34df45d2cab26ba0dbc4fdc106548f3e13eb395.

4 years ago  Revert "BACKPORT: zsmalloc: introduce zs_huge_class_size()"
0ranko0P [Sat, 7 Dec 2019 12:43:59 +0000 (20:43 +0800)]
Revert "BACKPORT: zsmalloc: introduce zs_huge_class_size()"

This reverts commit 4d1ddb8d3b84e9e162217bf55e8aad6fa796b836.

4 years ago  ARM: dts: msm8998: Tune mincpubw configs
Julian Liu [Tue, 5 Nov 2019 04:38:35 +0000 (12:38 +0800)]
ARM: dts: msm8998: Tune mincpubw configs

4 years ago  f2fs: avoid kernel panic on corruption test
Jaegeuk Kim [Fri, 1 Nov 2019 16:34:21 +0000 (09:34 -0700)]
f2fs: avoid kernel panic on corruption test

xfstests/generic/475 complains about a kernel warn/panic while testing a corrupted disk.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: Check write pointer consistency of non-open zones
Shin'ichiro Kawasaki [Mon, 28 Oct 2019 06:58:01 +0000 (15:58 +0900)]
f2fs: Check write pointer consistency of non-open zones

To catch f2fs bugs in the write pointer handling code for zoned block
devices, check the write pointers of non-open zones, that is, zones the
current segments do not point to. Do this check at mount time, after the
fsync data recovery and the current segments' write pointer consistency
fix. Two checks compare the write pointers with the valid block maps in
the SIT.

The first check covers zones with no valid blocks. When a zone has no
valid blocks, the write pointer should be at the start of the zone;
otherwise, the next write to the zone will cause an unaligned write
error. If the write pointer is not at the zone start, make the mount fail
and ask users to run fsck.

The second check compares the write pointer position with the last
valid block in the zone. The last valid block position is never expected
to be beyond the write pointer; if it is, report it as a bug. No fix is
required for such a zone, because the zone is not selected for the next
write operation until it gets discarded.

Also move the constant F2FS_REPORT_ZONE from super.c to f2fs.h so that it
can be used in segment.c as well.
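
As a rough, compilable userspace sketch of the two checks described above
(names and structures below are illustrative, not the actual f2fs code):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of one zone's state (not the kernel structures). */
struct zone_state {
        unsigned long long start;       /* first block of the zone */
        unsigned long long wp;          /* current write pointer */
        long long last_valid_blk;       /* highest valid block, -1 if none */
};

/* Returns true when the mount should fail and fsck is required. */
static bool zone_needs_fsck(const struct zone_state *z)
{
        /* Check 1: no valid blocks -> write pointer must sit at zone start,
         * otherwise the next write to this zone would be unaligned. */
        if (z->last_valid_blk < 0)
                return z->wp != z->start;

        /* Check 2: the last valid block must not be beyond the write pointer.
         * If it is, report it as a bug, but no repair is needed because the
         * zone is not reused until it gets discarded. */
        if ((unsigned long long)z->last_valid_blk >= z->wp)
                fprintf(stderr, "inconsistent zone: valid block beyond wp\n");
        return false;
}

int main(void)
{
        struct zone_state z = { .start = 0, .wp = 8, .last_valid_blk = -1 };

        printf("needs fsck: %d\n", zone_needs_fsck(&z));        /* prints 1 */
        return 0;
}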

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: Check write pointer consistency of open zones
Shin'ichiro Kawasaki [Mon, 28 Oct 2019 06:58:00 +0000 (15:58 +0900)]
f2fs: Check write pointer consistency of open zones

On a sudden f2fs shutdown, the write pointers of zoned block devices can
advance further, while the f2fs metadata keeps the current segments at
positions before those writes. After remounting the f2fs, this
inconsistency causes writes that do not land on the write pointers, and an
"Unaligned write command" error is reported.

To avoid the error, compare the current segments with the write pointers
of the open zones they point to during the mount operation. If a write
pointer position is not aligned with the current segment position, assign
a new zone to that current segment. Also check that the newly assigned
zone has its write pointer at the zone start; if not, make the mount fail
and ask users to run fsck.

Perform the consistency check twice. The first check runs during fsync
recovery: to avoid losing fsync data, it is done after the fsync data is
restored and before the checkpoint commit, which flushes data at the
current segment positions. The second check runs at the end of
f2fs_fill_super() to ensure write pointer consistency regardless of
whether fsync data recovery was executed.
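
A minimal userspace model of the mount-time decision described above
(hypothetical structure and helper names, not the actual f2fs code):

#include <stdio.h>

/* Illustrative only; the real code compares CURSEG_* segments against
 * blk_zone write pointers during f2fs_fill_super(). */
struct curseg_model { unsigned long long next_write; };
struct zone_model   { unsigned long long start, wp; };

enum fix_action { KEEP_SEGMENT, MOVE_TO_NEW_ZONE, FAIL_MOUNT_RUN_FSCK };

static enum fix_action check_open_zone(const struct curseg_model *cs,
                                       const struct zone_model *zone,
                                       const struct zone_model *new_zone)
{
        /* Aligned: metadata and device agree, nothing to do. */
        if (cs->next_write == zone->wp)
                return KEEP_SEGMENT;

        /* Misaligned after a sudden shutdown: pick a fresh zone, but that
         * zone must itself have its write pointer at the zone start. */
        if (new_zone->wp != new_zone->start)
                return FAIL_MOUNT_RUN_FSCK;
        return MOVE_TO_NEW_ZONE;
}

int main(void)
{
        struct curseg_model cs = { .next_write = 100 };
        struct zone_model z = { .start = 96, .wp = 104 };  /* device wrote further */
        struct zone_model fresh = { .start = 128, .wp = 128 };

        printf("action: %d\n", check_open_zone(&cs, &z, &fresh)); /* MOVE_TO_NEW_ZONE */
        return 0;
}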

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: fix wrong description in document
Chao Yu [Tue, 22 Oct 2019 09:26:11 +0000 (17:26 +0800)]
f2fs: fix wrong description in document

As reported in bugzilla, the default value of DEF_RAM_THRESHOLD was fixed
by commit 29710bcf9426 ("f2fs: fix wrong percentage"), but the wrong
description was left in the documentation; fix it.

https://bugzilla.kernel.org/show_bug.cgi?id=205203

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: cache global IPU bio
Chao Yu [Mon, 30 Sep 2019 10:53:25 +0000 (18:53 +0800)]
f2fs: cache global IPU bio

In commit 8648de2c581e ("f2fs: add bio cache for IPU"), we added
f2fs_submit_ipu_bio() in __write_data_page() as below:

__write_data_page()

        if (!S_ISDIR(inode->i_mode) && !IS_NOQUOTA(inode)) {
                f2fs_submit_ipu_bio(sbi, bio, page);
                ....
        }

in order to avoid below deadlock:

Thread A                                Thread B
- __write_data_page (inode x, page y)
 - f2fs_do_write_data_page
  - set_page_writeback        ---- set writeback flag in page y
  - f2fs_inplace_write_data
 - f2fs_balance_fs
  - lock gc_mutex
                                        - lock gc_mutex
                                         - f2fs_gc
                                          - do_garbage_collect
                                           - gc_data_segment
                                            - move_data_page
                                             - f2fs_wait_on_page_writeback
                                              - wait_on_page_writeback  --- wait writeback of page y

However, submitting the bio at that point breaks the merging of IPU IOs.

So in this patch, let's add a global bio cache for merged IPU pages;
f2fs_wait_on_page_writeback() is then able to submit the bio if a
page under writeback is cached in the global bio cache.
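
A toy userspace model of the idea, assuming a single global cached bio
protected by a lock (names and types are made up for illustration; the
real implementation works on struct bio inside f2fs):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_bio { int pages[16]; int nr; };

static pthread_mutex_t ipu_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_bio *cached_ipu_bio;

static bool bio_contains(struct toy_bio *bio, int page)
{
        for (int i = 0; i < bio->nr; i++)
                if (bio->pages[i] == page)
                        return true;
        return false;
}

static void submit_toy_bio(struct toy_bio *bio) { bio->nr = 0; /* I/O issued */ }

/* Called by anyone about to wait for a page's writeback: if the page sits
 * in the cached IPU bio, submit that bio instead of waiting forever. */
void wait_on_page_writeback_model(int page)
{
        pthread_mutex_lock(&ipu_lock);
        if (cached_ipu_bio && bio_contains(cached_ipu_bio, page)) {
                submit_toy_bio(cached_ipu_bio);
                cached_ipu_bio = NULL;
        }
        pthread_mutex_unlock(&ipu_lock);
        /* ...then wait for the writeback bit to clear as usual. */
}

int main(void)
{
        struct toy_bio bio = { .pages = { 3, 7 }, .nr = 2 };

        cached_ipu_bio = &bio;
        wait_on_page_writeback_model(7);        /* flushes the cached bio */
        return 0;
}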

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: fix to avoid memory leakage in f2fs_listxattr
Randall Huang [Fri, 18 Oct 2019 06:56:22 +0000 (14:56 +0800)]
f2fs: fix to avoid memory leakage in f2fs_listxattr

In f2fs_listxattr, there is no boundary check before
e_name is copied with memcpy into the output buffer.
If e_name_len is corrupted,
unexpected memory contents may be returned to the buffer.
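
A small standalone sketch of the kind of boundary check the fix adds
(illustrative only, not the actual f2fs_listxattr() code):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Never copy an entry name past the end of the caller's buffer. */
static int list_one_xattr(char *dst, size_t dst_left,
                          const char *e_name, size_t e_name_len)
{
        /* A corrupted e_name_len must not let us copy past the buffer. */
        if (e_name_len + 1 > dst_left)
                return -ERANGE;

        memcpy(dst, e_name, e_name_len);
        dst[e_name_len] = '\0';
        return (int)(e_name_len + 1);
}

int main(void)
{
        char buf[8];

        printf("%d\n", list_one_xattr(buf, sizeof(buf), "user.x", 6));     /* 7 */
        printf("%d\n", list_one_xattr(buf, sizeof(buf), "user.xyz", 200)); /* -ERANGE */
        return 0;
}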

Signed-off-by: Randall Huang <huangrandall@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: check total_segments from devices in raw_super
Qiuyang Sun [Mon, 23 Sep 2019 04:22:35 +0000 (12:22 +0800)]
f2fs: check total_segments from devices in raw_super

For multi-device F2FS, we should check if the sum of total_segments from
all devices matches segment_count.
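
A compilable sketch of the check, with hypothetical structures standing in
for the f2fs superblock and device info:

#include <stdbool.h>
#include <stdio.h>

struct dev_info { unsigned int total_segments; };

/* The per-device segment counts must add up to the superblock's
 * segment_count for a multi-device filesystem to be consistent. */
static bool devices_match_super(const struct dev_info *devs, int ndev,
                                unsigned int segment_count)
{
        unsigned long long sum = 0;

        for (int i = 0; i < ndev; i++)
                sum += devs[i].total_segments;
        return sum == segment_count;
}

int main(void)
{
        struct dev_info devs[] = { { 1024 }, { 2048 } };

        printf("%s\n", devices_match_super(devs, 2, 3072) ? "ok" : "corrupted");
        return 0;
}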

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: update multi-dev metadata in resize_fs
Qiuyang Sun [Mon, 23 Sep 2019 04:21:39 +0000 (12:21 +0800)]
f2fs: update multi-dev metadata in resize_fs

Multi-device metadata should be updated in resize_fs as well.

Also, we check that the new FS size still reaches the last device.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: mark recovery flag correctly in read_raw_super_block()
Chengguang Xu via Linux-f2fs-devel [Fri, 27 Sep 2019 01:35:48 +0000 (09:35 +0800)]
f2fs: mark recovery flag correctly in read_raw_super_block()

When the first read fails and the second one succeeds, we will
miss marking the recovery flag, because currently we reuse the
err variable across loop iterations.
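
A simplified, standalone illustration of the bug and the fix
(read_super_copy() below is a hypothetical stand-in for reading one
superblock copy):

#include <stdio.h>

/* Hypothetical stand-in: copy 0 is bad, copy 1 is good. */
static int read_super_copy(int block) { return block == 0 ? -1 : 0; }

/* The bug: err is reused across iterations, so a failing first copy
 * followed by a good second copy leaves err == 0 and the recovery flag
 * is never set. The fix records the failure per iteration instead of
 * inferring it from err afterwards. */
int main(void)
{
        int err = 0, recovery = 0;

        for (int block = 0; block < 2; block++) {
                int ret = read_super_copy(block);

                if (ret) {
                        err = ret;
                        recovery = 1;   /* fixed: remember the bad copy here */
                        continue;
                }
                err = 0;                /* buggy pattern: this wiped the evidence */
        }
        printf("err=%d recovery=%d\n", err, recovery);  /* err=0 recovery=1 */
        return 0;
}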

Signed-off-by: Chengguang Xu <cgxu519@zoho.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  f2fs: fix to update time in lazytime mode
Chao Yu [Fri, 27 Sep 2019 10:01:35 +0000 (18:01 +0800)]
f2fs: fix to update time in lazytime mode

generic/018 reports an inconsistent atime status; the
testcase is as below:
- open file with O_SYNC
- write file to construct fraged space
- calc md5 of file
- record {a,c,m}time
- defrag file --- do nothing
- umount & mount
- check {a,c,m}time

The root cause is that, since f2fs enables lazytime by default, an atime
update dirties the VFS inode rather than the f2fs inode (by setting
FI_DIRTY_INODE), so a later f2fs_write_inode() called from the VFS will
fail to update the inode page due to our skip:

f2fs_write_inode()
        if (is_inode_flag_set(inode, FI_DIRTY_INODE))
                return 0;

So eventually, after evict(), we lose the last atime forever.

To fix this issue, we need to check whether {a,c,m,cr}time is
consistent between the inode cache and the inode page, and only skip
f2fs_update_inode() if the f2fs inode is not dirty and the times are
consistent as well.
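
A standalone sketch of the consistency test described above, using
hypothetical fields in place of the real inode structures:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct times { long atime, ctime, mtime, crtime; };

struct inode_model {
        bool dirty;             /* FI_DIRTY_INODE equivalent */
        struct times mem;       /* in-memory (VFS) inode times */
        struct times disk;      /* times recorded in the inode page */
};

/* Only skip writing the inode page when it is not dirty AND the cached
 * times still match the on-disk times. */
static bool inode_write_can_be_skipped(const struct inode_model *inode)
{
        return !inode->dirty &&
               memcmp(&inode->mem, &inode->disk, sizeof(struct times)) == 0;
}

int main(void)
{
        struct inode_model i = { .dirty = false,
                                 .mem  = { .atime = 200 },
                                 .disk = { .atime = 100 } };  /* lazytime bumped atime */

        printf("skip write: %d\n", inode_write_can_be_skipped(&i));  /* 0 */
        return 0;
}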

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 years ago  simple_lmk: Change Kconfig defaults
Sultan Alsawaf [Tue, 5 Nov 2019 16:31:57 +0000 (08:31 -0800)]
simple_lmk: Change Kconfig defaults

Commit "simple_lmk: Make reclaim deterministic" changed Simple LMK's
behavior, so the default parameters must be updated as well to
compensate.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  simple_lmk: Clean up some code style nitpicks
Sultan Alsawaf [Mon, 4 Nov 2019 19:17:16 +0000 (11:17 -0800)]
simple_lmk: Clean up some code style nitpicks

Using a parameter to pass around an unmodified pointer to a global
variable is crufty; just use the `victims` variable directly instead.
Also, compress the code in simple_lmk_init_set() a bit to make it look
cleaner.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
4 years ago  simple_lmk: Make reclaim deterministic
Sultan Alsawaf [Mon, 4 Nov 2019 19:00:41 +0000 (11:00 -0800)]
simple_lmk: Make reclaim deterministic

The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.

This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
4 years ago  simple_lmk: Fix broken multicopy atomicity for victims_to_kill
Sultan Alsawaf [Mon, 4 Nov 2019 18:49:21 +0000 (10:49 -0800)]
simple_lmk: Fix broken multicopy atomicity for victims_to_kill

When the reclaim thread writes to victims_to_kill on one CPU, it expects
the updated value to be immediately reflected on all CPUs in order for
simple_lmk_mm_freed() to work correctly. Due to the lack of memory
barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be
given a victim's mm without knowing the correct victims_to_kill value,
which can cause the reclaim thread to remain stuck waiting forever for
all victims to be freed. This scenario, despite being rare, has been
observed.

Fix this by using proper atomic helpers with memory barriers.
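
A generic C11 illustration of the barrier pairing involved (not the
driver's actual code): the writer publishes the count with release
semantics only after the victim data is visible, and the reader uses
acquire semantics so that seeing the count implies seeing the data.

#include <stdatomic.h>
#include <stdio.h>

static void *victim_mm[8];
static atomic_int victims_to_kill;

void publish_victims(void *mms[], int n)
{
        for (int i = 0; i < n; i++)
                victim_mm[i] = mms[i];
        /* Release: all stores above are visible before the count is. */
        atomic_store_explicit(&victims_to_kill, n, memory_order_release);
}

int read_victims(void)
{
        /* Acquire: pairs with the release store above. */
        int n = atomic_load_explicit(&victims_to_kill, memory_order_acquire);

        /* Safe to inspect victim_mm[0..n-1] here. */
        return n;
}

int main(void)
{
        int dummy;
        void *mms[] = { &dummy };

        publish_victims(mms, 1);
        printf("%d\n", read_victims());
        return 0;
}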

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
4 years ago  msm: mdss: Silence debug logs
Julian Liu [Mon, 4 Nov 2019 09:13:45 +0000 (17:13 +0800)]
msm: mdss: Silence debug logs

4 years ago  simple_lmk: Don't give victims privileges
Julian Liu [Fri, 1 Nov 2019 09:33:05 +0000 (17:33 +0800)]
simple_lmk: Don't give victims privileges

* When slmk is triggered, there are usually heavy tasks running.
  Increasing the victim's priority may result in unnecessary preemption and lag

4 years ago  defconfig: Disable CC_STACKPROTECTOR_STRONG
Julian Liu [Thu, 31 Oct 2019 23:46:19 +0000 (07:46 +0800)]
defconfig: Disable CC_STACKPROTECTOR_STRONG

* Saves a bit of size; the other effects don't matter here.

4 years ago  Kbuild: don't pass "-C" to preprocessor when processing linker scripts
Linus Torvalds [Thu, 2 Nov 2017 21:10:37 +0000 (14:10 -0700)]
Kbuild: don't pass "-C" to preprocessor when processing linker scripts

For some odd historical reason, we preprocessed the linker scripts with
"-C", which keeps comments around.  That makes no sense, since the
comments are not meaningful for the build anyway.

And it actually breaks things, since linker scripts can't have C++ style
"//" comments in them, so keeping comments after preprocessing now
limits us in odd and surprising ways in our header files for no good
reason.

The -C option goes back to pre-git and pre-bitkeeper times, but seems to
have been historically used (along with "-traditional") for some
odd-ball architectures (ia64, MIPS and SH).  It probably didn't matter
back then either, but might possibly have been used to minimize the
difference between the original file and the pre-processed result.

The reason for this may be lost in time, but let's not perpetuate it
only because we can't remember why we did this crazy thing.

This was triggered by the recent addition of SPDX lines to the source
tree, where people apparently were confused about why header files
couldn't use the C++ comment format.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  qcacld-3.0: Disable open-source flag
Julian Liu [Thu, 31 Oct 2019 03:54:32 +0000 (11:54 +0800)]
qcacld-3.0: Disable open-source flag

* This disables some debug code and makes the driver behave more like the OEM's.

4 years ago  qpnp-fg-gen3: Don't ratelimit interaction related props
Julian Liu [Wed, 30 Oct 2019 07:10:05 +0000 (15:10 +0800)]
qpnp-fg-gen3: Don't ratelimit interaction related props

* The state changes corresponding to these props should be fed back to userspace immediately.

4 years ago  loop: avoid EAGAIN, if offset or block_size are changed
Jaegeuk Kim [Fri, 17 May 2019 23:37:50 +0000 (16:37 -0700)]
loop: avoid EAGAIN, if offset or block_size are changed

This patch tries to avoid EAGAIN caused by the nrpages != 0 check, which was
originally added to drop stale pages that could result in wrong data access.

Report: https://bugs.chromium.org/p/chromium/issues/detail?id=938958#c38

Cc: <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Fixes: 5db470e229e2 ("loop: drop caches if offset or block_size are changed")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Reported-by: grygorii tertychnyi <gtertych@cisco.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
4 years ago  scatterlist: Speed up for_each_sg() loop macro
Sultan Alsawaf [Fri, 25 Oct 2019 02:00:35 +0000 (19:00 -0700)]
scatterlist: Speed up for_each_sg() loop macro

Scatterlists are chained in predictable arrays of up to
SG_MAX_SINGLE_ALLOC sg structs in length. Using this knowledge, speed up
for_each_sg() by using constant operations to determine when to simply
increment the sg pointer by one or get the next sg array in the chain.

Rudimentary measurements with a trivial loop body show that this yields
roughly a 2x performance gain.

The following simple test module proves the correctness of the new loop
definition by testing all the different edge cases of sg chains:
#include <linux/module.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int __init test_for_each_sg(void)
{
        static const gfp_t gfp_flags = GFP_KERNEL | __GFP_NOFAIL;
        struct scatterlist *sg;
        struct sg_table *table;
        long old = 0, new = 0;
        unsigned int i, nents;

        table = kmalloc(sizeof(*table), gfp_flags);
        for (nents = 1; nents <= 3 * SG_MAX_SINGLE_ALLOC; nents++) {
                BUG_ON(sg_alloc_table(table, nents, gfp_flags));
                for (sg = table->sgl; sg; sg = sg_next(sg))
                        old ^= (long)sg;
                for_each_sg(table->sgl, sg, nents, i)
                        new ^= (long)sg;
                sg_free_table(table);
        }

        BUG_ON(old != new);
        kfree(table);
        return 0;
}
module_init(test_for_each_sg);

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  mm/shmem.c: fix unlikely() test of info->seals to test only for WRITE and GROW
Steven Rostedt (VMware) [Fri, 24 Feb 2017 22:59:10 +0000 (14:59 -0800)]
mm/shmem.c: fix unlikely() test of info->seals to test only for WRITE and GROW

Running my likely/unlikely profiler, I discovered that the test in
shmem_write_begin() that tests for info->seals as unlikely, is always
incorrect.  This is because shmem_get_inode() sets info->seals to have
F_SEAL_SEAL set by default, and it is unlikely to be cleared when
shmem_write_begin() is called.  Thus, the if statement is very likely.

But as the if statement block only cares about F_SEAL_WRITE and
F_SEAL_GROW, change the test to only test those two bits.
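
A standalone sketch of the resulting test; the F_SEAL_* values below
mirror the uapi definitions, and the helper is purely illustrative:

#include <stdio.h>

#define F_SEAL_SEAL     0x0001
#define F_SEAL_SHRINK   0x0002
#define F_SEAL_GROW     0x0004
#define F_SEAL_WRITE    0x0008

/* Only F_SEAL_WRITE and F_SEAL_GROW matter for the write path, so
 * F_SEAL_SEAL alone (the default) no longer takes the "sealed" branch. */
static int write_is_blocked(unsigned int seals)
{
        return (seals & (F_SEAL_WRITE | F_SEAL_GROW)) != 0;
}

int main(void)
{
        printf("%d\n", write_is_blocked(F_SEAL_SEAL));                  /* 0 */
        printf("%d\n", write_is_blocked(F_SEAL_SEAL | F_SEAL_WRITE));   /* 1 */
        return 0;
}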

Link: http://lkml.kernel.org/r/20170203105656.7aec6237@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  sched/core: Remove unlikely() annotation from sched_move_task()
Steven Rostedt (VMware) [Mon, 6 Feb 2017 16:04:26 +0000 (11:04 -0500)]
sched/core: Remove unlikely() annotation from sched_move_task()

The check for 'running' in sched_move_task() has an unlikely() around it. That
is, it is unlikely that the task being moved is running. That used to be
true. But with a couple of recent updates, it is now likely that the task
will be running.

The first change came from ea86cb4b7621 ("sched/cgroup: Fix
cpu_cgroup_fork() handling") that moved around the use case of
sched_move_task() in do_fork() where the call is now done after the task is
woken (hence it is running).

The second change came from 8e5bfa8c1f84 ("sched/autogroup: Do not use
autogroup->tg in zombie threads") where sched_move_task() is called by the
exit path, by the task that is exiting. Hence it too is running.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years ago  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
Steven Rostedt (VMware) [Thu, 19 Jan 2017 16:32:34 +0000 (11:32 -0500)]
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()

Running my likely/unlikely profiler for 3 weeks on two production
machines, I discovered that the unlikely() test in
__rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
100% of the time, making it a very likely case.

The reason is, on a vanilla kernel, the majority case of calling
rt_mutex() is from the futex code. This code is always called as
TASK_INTERRUPTIBLE. In the -rt patch, this code is commonly called when
PREEMPT_RT is enabled with TASK_UNINTERRUPTIBLE. But that's not the
likely scenario.

The rt_mutex() code should be optimized for the common vanilla case,
and that is from a futex, with TASK_INTERRUPTIBLE as the state.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years ago  locking/rtmutex: Only warn once on a trylock from bad context
Sebastian Andrzej Siewior [Fri, 27 May 2016 13:47:18 +0000 (15:47 +0200)]
locking/rtmutex: Only warn once on a trylock from bad context

One warning should be enough to get one motivated to fix this. It is
possible that this happens more than once and that starts flooding the
output. Later the prints will be suppressed so we only get half of it.
Depending on the console system used it might not be helpful.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years ago  rtmutex: Make wait_lock irq safe
Thomas Gleixner [Wed, 13 Jan 2016 10:25:38 +0000 (11:25 +0100)]
rtmutex: Make wait_lock irq safe

Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0                                    CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
                                        local_irq_disable()
                                        spin_lock(&timer->it_lock)
                                        spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1

     rcu_read_lock()
     x = lookup();
     if (x)
      spin_lock_irqsave(&x->it_lock);
     rcu_read_unlock();
     return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
4 years ago  ASoC: tas2559: use power efficient workingqueues
Julian Liu [Mon, 28 Oct 2019 17:33:55 +0000 (01:33 +0800)]
ASoC: tas2559: use power efficient workingqueues

4 years ago  power: supply: use power efficient workingqueues
Julian Liu [Sun, 27 Oct 2019 14:25:31 +0000 (22:25 +0800)]
power: supply: use power efficient workingqueues

* YEEEEEEE

4 years ago  ASoC: wcd9335: use power efficient workingqueues
Julian Liu [Sun, 27 Oct 2019 14:21:36 +0000 (22:21 +0800)]
ASoC: wcd9335: use power efficient workingqueues

4 years ago  mm, vmstat: Add likelihood labels to quiet_vmstat conditions
Danny Lin [Fri, 10 May 2019 05:33:48 +0000 (22:33 -0700)]
mm, vmstat: Add likelihood labels to quiet_vmstat conditions

These labels are based on observations from a running system as well as
from inspecting the code:

!delayed_work_pending:
  true  = 3509732
  false = 7495535

!need_update:
  true  = 6656251
  false = 840000

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
4 years ago  mm: vmstat: use power efficient workingqueues
Julian Liu [Sun, 27 Oct 2019 14:11:01 +0000 (22:11 +0800)]
mm: vmstat: use power efficient workingqueues

4 years ago  platform: ipa: use power efficient workingqueues
Julian Liu [Sun, 27 Oct 2019 14:09:20 +0000 (22:09 +0800)]
platform: ipa: use power efficient workingqueues

4 years ago  Revert "f2fs: Tune issue discard commands up to 128MB"
Julian Liu [Sun, 27 Oct 2019 13:27:16 +0000 (21:27 +0800)]
Revert "f2fs: Tune issue discard commands up to 128MB"

* We have a good gc strategy, don't need this
This reverts commit 8439a3a642a927cfb1432b5c50c44eaaa76a4686.

4 years ago  msm: kgsl: Use for_each_sg() macro where possible
Sultan Alsawaf [Fri, 25 Oct 2019 17:52:33 +0000 (10:52 -0700)]
msm: kgsl: Use for_each_sg() macro where possible

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  qseecom: Use for_each_sg() macro where possible
Sultan Alsawaf [Fri, 25 Oct 2019 17:52:01 +0000 (10:52 -0700)]
qseecom: Use for_each_sg() macro where possible

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  crypto: msm: Use for_each_sg() macro where possible
Sultan Alsawaf [Fri, 25 Oct 2019 17:51:18 +0000 (10:51 -0700)]
crypto: msm: Use for_each_sg() macro where possible

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years ago  kernel: sched: Mitigate non-boosted tasks preempting boosted tasks
Miguel de Dios [Mon, 29 Apr 2019 23:09:33 +0000 (16:09 -0700)]
kernel: sched: Mitigate non-boosted tasks preempting boosted tasks

Currently when a boosted task is scheduled we use prefer_idle to try and
get it to an idle core. Once it's scheduled, there is a possibility we
can schedule a non-boosted task on the same core where the boosted task
is running on. This change aims to mitigate that possibility by checking
if the core we're targeting has a boosted task and if so, use the next
best idle core instead.

Bug: 131626264
Change-Id: I3d321e1c71f96526f55f7f3a56e32db411311aa2
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
4 years ago  f2fs: gc_wake is unlikely as true
Julian Liu [Fri, 25 Oct 2019 13:26:04 +0000 (21:26 +0800)]
f2fs: gc_wake is unlikely as true

4 years ago  defconfig: Disable UNMAP_KERNEL_AT_EL0
Julian Liu [Fri, 25 Oct 2019 12:26:21 +0000 (20:26 +0800)]
defconfig: Disable UNMAP_KERNEL_AT_EL0

4 years ago  f2fs: Tune background GC for Android
Julian Liu [Sun, 20 Oct 2019 17:29:25 +0000 (01:29 +0800)]
f2fs: Tune background GC for Android

4 years ago  f2fs: Tune issue discard commands up to 128MB
Julian Liu [Thu, 24 Oct 2019 22:02:31 +0000 (06:02 +0800)]
f2fs: Tune issue discard commands up to 128MB

https://android.googlesource.com/device/google/coral/+/refs/tags/android-10.0.0_r9/init.hardware.rc#550

4 years ago  page-writeback: Hardcode dirty_background_ratio to 10
Julian Liu [Thu, 24 Oct 2019 21:10:49 +0000 (05:10 +0800)]
page-writeback: Hardcode dirty_background_ratio to 10

4 years ago  cpufreq: schedutil: tune default rate limit from Pixel
Julian Liu [Thu, 24 Oct 2019 21:04:58 +0000 (05:04 +0800)]
cpufreq: schedutil: tune default rate limit from Pixel

4 years ago  arm64: Select ARCH_HAS_FAST_MULTIPLIER
Robin Murphy [Tue, 24 Apr 2018 15:25:47 +0000 (16:25 +0100)]
arm64: Select ARCH_HAS_FAST_MULTIPLIER

It is probably safe to assume that all Armv8-A implementations have a
multiplier whose efficiency is comparable or better than a sequence of
three or so register-dependent arithmetic instructions. Select
ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the
few dusty old corners which care.

In a contrived benchmark calling hweight64() in a loop, this does indeed
turn out to be a small win overall, with no measurable impact on
Cortex-A57 but about 5% performance improvement on Cortex-A53.

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
4 years ago  sdcardfs: fix wrong ENOENT when creating a file
Jaegeuk Kim [Tue, 26 Jun 2018 06:00:24 +0000 (23:00 -0700)]
sdcardfs: fix wrong ENOENT when creating a file

There is a subtle race condition where lower_dentry is NULL. If we retry
the lookup, we should get the correct dentry.

Bug: 110585947
Bug: 110464178
Bug: 80587794
Bug: 37231161
Bug: 110199687
Change-Id: I39b95de4649b034287776f5c8a5d197b6ebd9ada
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
4 years ago  ARM: dts: msm8998: Recalibrate busy costs using better data
Sultan Alsawaf [Mon, 22 Apr 2019 09:03:46 +0000 (02:03 -0700)]
ARM: dts: msm8998: Recalibrate busy costs using better data

Power and performance were measured in-kernel with high precision on a
production wahoo device. Use the resulting data to create a better
energy model that reflects the actual behavior of production hardware.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
wloot: Adapt to 98897c5c ("ARM: dts: msm8998: Remove few bottom frequencies of low energy efficiency")

4 years ago  arm64/neon: Disable -Wincompatible-pointer-types when building with Clang
Nathan Chancellor [Fri, 15 Feb 2019 01:39:59 +0000 (18:39 -0700)]
arm64/neon: Disable -Wincompatible-pointer-types when building with Clang

After commit cc9f8349cb33 ("arm64: crypto: add NEON accelerated XOR
implementation"), Clang builds for arm64 started failing with the
following error message.

arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
assigning to 'const unsigned long *' from 'uint64_t *' (aka 'unsigned
long long *') [-Werror,-Wincompatible-pointer-types]
                v3 = veorq_u64(vld1q_u64(dp1 +  6), vld1q_u64(dp2 + 6));
                                         ^~~~~~~~
/usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
expanded from macro 'vld1q_u64'
  __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                              ^~~~

There has been quite a bit of debate and triage that has gone into
figuring out what the proper fix is, viewable at the link below, which
is still ongoing. Ard suggested disabling this warning with Clang with a
pragma so no neon code will have this type of error. While this is not
at all an ideal solution, this build error is the only thing preventing
KernelCI from having successful arm64 defconfig and allmodconfig builds
on linux-next. Getting continuous integration running is more important
so new warnings/errors or boot failures can be caught and fixed quickly.

Link: https://github.com/ClangBuiltLinux/linux/issues/283
Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
4 years ago  arm64/neon: add workaround for ambiguous C99 stdint.h types
Jackie Liu [Tue, 4 Dec 2018 01:43:22 +0000 (09:43 +0800)]
arm64/neon: add workaround for ambiguous C99 stdint.h types

In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround
for ambiguous C99 stdint.h types"), this patch redefines the macros that
are used in stdint.h so its definitions of uint64_t and int64_t are
compatible with those of the kernel.

This patch comes from: https://patchwork.kernel.org/patch/3540001/
Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

We mark this file as private so we don't have to override asm/types.h.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
4 years ago  ion: fix a possible memory leak in ion_cma_allocate
Suren Baghdasaryan [Thu, 18 Apr 2019 19:42:44 +0000 (12:42 -0700)]
ion: fix a possible memory leak in ion_cma_allocate

The memory leak occurs when the kmalloc() for info->table fails: info is
freed, but the info->cpu_addr allocation is left behind.
Fixes: eeeb940746de ("ion: add snapshot of ion support for MSM")
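
A generic, compilable sketch of the leak pattern and its fix, with plain
malloc()/free() standing in for the CMA and sg_table allocations
(hypothetical names, not the actual ion code):

#include <stdio.h>
#include <stdlib.h>

struct buf { void *cpu_addr; void *table; };

/* When the later allocation fails, every earlier allocation must be
 * unwound before returning. */
static int buf_alloc(struct buf *b, size_t len)
{
        b->cpu_addr = malloc(len);      /* stands in for the CMA allocation */
        if (!b->cpu_addr)
                return -1;

        b->table = malloc(64);          /* stands in for the sg_table kmalloc */
        if (!b->table)
                goto free_cpu_addr;     /* the step missing in the buggy version */

        return 0;

free_cpu_addr:
        free(b->cpu_addr);
        b->cpu_addr = NULL;
        return -1;
}

int main(void)
{
        struct buf b;

        printf("%d\n", buf_alloc(&b, 4096));
        return 0;
}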

Bug: 130817249
Test: builds and boots
Change-Id: I7faf5be5129a46b2f874f4e3803470e5f5130a21
Reported-by: Mikael Magnusson <Mikael.Magnusson@sony.com>
Suggested-by: Mikael Magnusson <Mikael.Magnusson@sony.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
4 years ago  mm/page-writeback.c: place "not" inside of unlikely() statement in wb_domain_writeout...
Steven Rostedt (VMware) [Fri, 24 Feb 2017 22:59:24 +0000 (14:59 -0800)]
mm/page-writeback.c: place "not" inside of unlikely() statement in wb_domain_writeout_inc()

The likely/unlikely profiler noticed that the unlikely statement in
wb_domain_writeout_inc() is constantly wrong.  This is due to the "not"
(!) being outside the unlikely statement.  It is likely that
dom->period_time will be set, but unlikely that it won't be.  Move the
not into the unlikely statement.

Link: http://lkml.kernel.org/r/20170206120035.3c2e2b91@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  list: introduce list_for_each_entry_from_reverse helper
Jiri Pirko [Fri, 3 Feb 2017 09:29:05 +0000 (10:29 +0100)]
list: introduce list_for_each_entry_from_reverse helper

Similar to list_for_each_entry_continue and its reverse variant
list_for_each_entry_continue_reverse, introduce reverse helper for
list_for_each_entry_from.
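
A self-contained userspace re-creation of the relevant list.h pieces; the
new kernel helper is expected to look essentially like the
list_for_each_entry_from_reverse() defined below:

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member)   container_of(ptr, type, member)
#define list_prev_entry(pos, member) \
        list_entry((pos)->member.prev, typeof(*(pos)), member)

/* Iterate backwards over the remaining entries, starting at pos itself. */
#define list_for_each_entry_from_reverse(pos, head, member)     \
        for (; &pos->member != (head);                           \
             pos = list_prev_entry(pos, member))

struct item { int val; struct list_head node; };

int main(void)
{
        struct list_head head = { &head, &head };
        struct item items[3];

        for (int i = 0; i < 3; i++) {           /* open-coded list_add_tail() */
                items[i].val = i;
                items[i].node.prev = head.prev;
                items[i].node.next = &head;
                head.prev->next = &items[i].node;
                head.prev = &items[i].node;
        }

        struct item *pos = &items[1];           /* start in the middle */
        list_for_each_entry_from_reverse(pos, &head, node)
                printf("%d ", pos->val);        /* prints: 1 0 */
        printf("\n");
        return 0;
}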

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years ago  mm/vmalloc: rework vmap_area_lock
Uladzislau Rezki (Sony) [Tue, 22 Oct 2019 15:58:00 +0000 (17:58 +0200)]
mm/vmalloc: rework vmap_area_lock

With the new allocation approach introduced in the 5.2 kernel, it
becomes possible to get rid of one global spinlock. By doing that
we can further improve the KVA from the performance point of view.

Basically we can have two independent locks, one for allocation
part and another one for deallocation, because of two different
entities: "free data structures" and "busy data structures".

As a result, allocation and deallocation operations can still interfere
with each other when running simultaneously on different CPUs; some
dependency remains, but with two locks it becomes lower.

Summarizing:
  - it reduces the high lock contention
  - it allows operations on the "free" and "busy" trees to be
    performed in parallel on different CPUs. Please note it
    does not solve the scalability issue.

Test results:
In order to evaluate this patch, we can run "vmalloc test driver"
to see how many CPU cycles it takes to complete all test cases
running sequentially. All online CPUs run it so it will cause
a high lock contention.

HiKey 960, ARM64, 8xCPUs, big.LITTLE:

<snip>
    sudo ./test_vmalloc.sh sequential_test_order=1
<snip>

<default>
[  390.950557] All test took CPU0=457126382 cycles
[  391.046690] All test took CPU1=454763452 cycles
[  391.128586] All test took CPU2=454539334 cycles
[  391.222669] All test took CPU3=455649517 cycles
[  391.313946] All test took CPU4=388272196 cycles
[  391.410425] All test took CPU5=384036264 cycles
[  391.492219] All test took CPU6=387432964 cycles
[  391.578433] All test took CPU7=387201996 cycles
<default>

<patched>
[  304.721224] All test took CPU0=391521310 cycles
[  304.821219] All test took CPU1=393533002 cycles
[  304.917120] All test took CPU2=392243032 cycles
[  305.008986] All test took CPU3=392353853 cycles
[  305.108944] All test took CPU4=297630721 cycles
[  305.196406] All test took CPU5=297548736 cycles
[  305.288602] All test took CPU6=297092392 cycles
[  305.381088] All test took CPU7=297293597 cycles
<patched>

The patched variant is ~14%-23% better.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years ago  mm/vmalloc: add more comments to the adjust_va_to_fit_type()
Uladzislau Rezki (Sony) [Wed, 16 Oct 2019 09:54:38 +0000 (11:54 +0200)]
mm/vmalloc: add more comments to the adjust_va_to_fit_type()

When the fit type is NE_FIT_TYPE, one extra object is needed.
Usually the "ne_fit_preload_node" per-CPU variable has it, so
there is no need for a GFP_NOWAIT allocation, but there are exceptions.

This commit just adds more explanation, giving answers to questions
such as when this can occur, how often, under which conditions, and
what happens if the GFP_NOWAIT allocation fails.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
4 years ago  mm/vmalloc: respect passed gfp_mask when do preloading
Uladzislau Rezki (Sony) [Wed, 16 Oct 2019 09:54:37 +0000 (11:54 +0200)]
mm/vmalloc: respect passed gfp_mask when do preloading

alloc_vmap_area() is given a gfp_mask for the page allocator.
Let's respect that mask and consider it even in the case when
doing regular CPU preloading, i.e. where a context can sleep.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 years ago  mm/vmalloc: remove preempt_disable/enable when do preloading
Uladzislau Rezki (Sony) [Wed, 16 Oct 2019 09:54:36 +0000 (11:54 +0200)]
mm/vmalloc: remove preempt_disable/enable when do preloading

Some background: preemption was previously disabled to guarantee
that a preloaded object stays available to the CPU it was stored for.

The aim was to avoid allocating in atomic context when the spinlock
is taken later, for regular vmap allocations. But that approach
conflicts with the CONFIG_PREEMPT_RT philosophy: calling spin_lock()
with preemption disabled is forbidden in a CONFIG_PREEMPT_RT kernel.

Therefore, get rid of preempt_disable() and preempt_enable() when
the preload is done for splitting purposes. As a result we no longer
guarantee that a CPU is preloaded; instead, this change minimizes
the cases where it is not.

For example, I ran a special test case that follows the preload
pattern and path: 20 "unbind" threads run it and each does
1,000,000 allocations. Only about 3.5 times per 1,000,000 allocations
was a CPU not preloaded, so it can happen but the number is negligible.

V2 - > V3:
    - update the commit message

V1 -> V2:
  - move __this_cpu_cmpxchg check when spin_lock is taken,
    as proposed by Andrew Morton
  - add more explanation in regard of preloading
  - adjust and move some comments

Fixes: 82dd23e84be3 ("mm/vmalloc.c: preload a CPU with one object for split purpose")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
4 years ago  mm: vmalloc: show number of vmalloc pages in /proc/meminfo
Roman Gushchin [Fri, 12 Jul 2019 04:00:13 +0000 (21:00 -0700)]
mm: vmalloc: show number of vmalloc pages in /proc/meminfo

Vmalloc() is getting more and more used these days (kernel stacks, bpf and
percpu allocator are new top users), and the total % of memory consumed by
vmalloc() can be pretty significant and changes dynamically.

/proc/meminfo is the best place to display this information: its top goal
is to show top consumers of the memory.

Since the VmallocUsed field in /proc/meminfo is not in use for quite a
long time (it has been defined to 0 by a5ad88ce8c7f ("mm: get rid of
'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
actual physical memory consumption of vmalloc().

Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc.c: move 'area->pages' after if statement
Austin Kim [Mon, 23 Sep 2019 22:36:42 +0000 (15:36 -0700)]
mm/vmalloc.c: move 'area->pages' after if statement

If the !area->pages statement is true, i.e. the memory allocation failed,
area is freed.

In this case 'area->pages = pages' should not be executed, so move
'area->pages = pages' after the if statement.
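
A simplified, compilable sketch of the ordering issue (not the real
vmalloc code): the pages pointer is only published into the area after
the allocation check succeeds.

#include <stdio.h>
#include <stdlib.h>

struct area_model { void **pages; unsigned int nr_pages; };

static struct area_model *setup_area(unsigned int nr)
{
        struct area_model *area = calloc(1, sizeof(*area));
        if (!area)
                return NULL;

        void **pages = calloc(nr, sizeof(*pages));
        if (!pages) {           /* allocation failed: free area and bail out */
                free(area);
                return NULL;
        }

        area->pages = pages;    /* moved after the check, as in the fix */
        area->nr_pages = nr;
        return area;
}

int main(void)
{
        struct area_model *a = setup_area(4);

        printf("%p\n", (void *)a);
        if (a) { free(a->pages); free(a); }
        return 0;
}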

[akpm@linux-foundation.org: give area->pages the same treatment]
Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc: modify struct vmap_area to reduce its size
Pengfei Li [Mon, 23 Sep 2019 22:36:39 +0000 (15:36 -0700)]
mm/vmalloc: modify struct vmap_area to reduce its size

Objective
---------

The current implementation of struct vmap_area wasted space.

After applying this commit, sizeof(struct vmap_area) has been
reduced from 11 words to 8 words.

Description
-----------

1) Pack "subtree_max_size", "vm" and "purge_list".  This is no problem
   because

A) "subtree_max_size" is only used when vmap_area is in "free" tree

B) "vm" is only used when vmap_area is in "busy" tree

C) "purge_list" is only used when vmap_area is in vmap_purge_list

2) Eliminate "flags".

Since only one flag, VM_VM_AREA, is being used, and the same thing can be
determined by checking whether "vm" is NULL, the "flags" field can be
eliminated.
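
A standalone model of the packing idea; plain pointers stand in for the
kernel's rb_node/list_head/llist_node members:

#include <stdio.h>

/* The three fields are never needed at the same time, so they can share
 * storage in a union. */
struct vmap_area_model {
        unsigned long va_start;
        unsigned long va_end;
        void *rb_link;                  /* tree linkage */
        void *list_link;                /* list linkage */
        union {
                unsigned long subtree_max_size; /* while in the "free" tree */
                void *vm;                       /* while in the "busy" tree */
                void *purge_list;               /* while on vmap_purge_list */
        };
};

int main(void)
{
        /* Three overlapping members cost one word instead of three. */
        printf("sizeof = %zu words\n",
               sizeof(struct vmap_area_model) / sizeof(long));
        return 0;
}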

Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
Arnd Bergmann [Fri, 28 Jun 2019 19:07:09 +0000 (12:07 -0700)]
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning

gcc gets confused in pcpu_get_vm_areas() because there are too many
branches that affect whether 'lva' was initialized before it gets used:

  mm/vmalloc.c: In function 'pcpu_get_vm_areas':
  mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      insert_vmap_area_augment(lva, &va->rb_node,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       &free_vmap_area_root, &free_vmap_area_list);
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/vmalloc.c:916:20: note: 'lva' was declared here
    struct vmap_area *lva;
                      ^~~

Add an initialization to NULL, and check whether this has changed before
the first use.

[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc.c: fix percpu free VM area search criteria
Kuppuswamy Sathyanarayanan [Tue, 13 Aug 2019 22:37:31 +0000 (15:37 -0700)]
mm/vmalloc.c: fix percpu free VM area search criteria

Recent changes to the vmalloc code by commit 68ad4a330433
("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
cause spurious percpu allocation failures.  These, in turn, can result
in panic()s in the slub code.  One such possible panic was reported by
Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939.
Another related panic observed is,

 RIP: 0033:0x7f46f7441b9b
 Call Trace:
  dump_stack+0x61/0x80
  pcpu_alloc.cold.30+0x22/0x4f
  mem_cgroup_css_alloc+0x110/0x650
  cgroup_apply_control_enable+0x133/0x330
  cgroup_mkdir+0x41b/0x500
  kernfs_iop_mkdir+0x5a/0x90
  vfs_mkdir+0x102/0x1b0
  do_mkdirat+0x7d/0xf0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
uses two lists (vmap_area_list & free_vmap_area_list) to track the used
and free VM areas in VMALLOC space.  And pcpu_get_vm_areas(offsets[],
sizes[], nr_vms, align) function is used for allocating congruent VM
areas for percpu memory allocator.  In order to not conflict with
VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
VMALLOC space.  So the search for free vm_area for the given requirement
starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a330433, the search for free vm_area in
pcpu_get_vm_areas() involves following two main steps.

Step 1:
    Find an aligned "base" address near VMALLOC_END.
    va = free vm area near VMALLOC_END
Step 2:
    Loop through number of requested vm_areas and check,
        Step 2.1:
           if (base < VMALLOC_START)
              1. fail with error
        Step 2.2:
           // end is offsets[area] + sizes[area]
           if (base + end > va->vm_end)
               1. Move the base downwards and repeat Step 2
        Step 2.3:
           if (base + start < va->vm_start)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

        Step 2.3:
           if (base + start < va->vm_start || base + end > va->vm_end)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

Above change is the root cause of spurious percpu memory allocation
failures.  For example, consider a case where a relatively large vm_area
(~ 30 TB) was ignored in free vm_area search because it did not pass the
base + end < vm->vm_end boundary check.  Ignoring such large free
vm_area's would lead to not finding free vm_area within boundary of
VMALLOC_start to VMALLOC_END which in turn leads to allocation failures.

So modify the search algorithm to include Step 2.2.
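
A simplified model of the boundary checks from the steps above
(hypothetical values and types, not kernel data structures):

#include <stdio.h>

struct free_area { unsigned long va_start, va_end; };

enum verdict { FITS, OUT_OF_SPACE, MOVE_BASE_DOWN, TRY_PREVIOUS_AREA };

static enum verdict check_candidate(unsigned long base,
                                    unsigned long start, unsigned long end,
                                    const struct free_area *va,
                                    unsigned long vmalloc_start)
{
        if (base < vmalloc_start)               /* Step 2.1 */
                return OUT_OF_SPACE;
        if (base + end > va->va_end)            /* Step 2.2 (re-added by this fix) */
                return MOVE_BASE_DOWN;
        if (base + start < va->va_start)        /* Step 2.3 */
                return TRY_PREVIOUS_AREA;
        return FITS;
}

int main(void)
{
        struct free_area va = { .va_start = 1000, .va_end = 2000 };

        printf("%d\n", check_candidate(1900, 0, 300, &va, 0)); /* MOVE_BASE_DOWN */
        return 0;
}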

Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmalloc: do not keep unpurged areas in the busy tree
Uladzislau Rezki (Sony) [Tue, 16 Jul 2019 12:05:17 +0000 (14:05 +0200)]
mm/vmalloc: do not keep unpurged areas in the busy tree

The busy tree can be quite big: even though an area is freed
or unmapped, it still stays in the tree until the "purge" logic
removes it.

1) Optimize and reduce the size of "busy" tree by removing a
node from it right away as soon as user triggers free paths.
It is possible to do so, because the allocation is done using
another augmented tree.

The vmalloc test driver shows the difference, for example the
"fix_size_alloc_test" is ~11% better comparing with default
configuration:

sudo ./test_vmalloc.sh performance

<default>
Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
<default>

<this patch>
Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
<this patch>

2) Since the busy tree now contains only allocated areas and does
not interfere with lazily-freed nodes, introduce the new function
show_purge_info() that dumps "unpurged" areas; the information is
exposed through "/proc/vmallocinfo".

3) Eliminate VM_LAZY_FREE flag.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years ago  mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:11 +0000 (14:04 +0200)]
mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()

Trigger a warning if an object that is about to be freed is detached.
We used to have a BUG_ON(), but even though it is considered as faulty
behaviour that is not a good reason to break a system.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years ago  mm/vmalloc.c: get rid of one single unlink_va() when merge
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:10 +0000 (14:04 +0200)]
mm/vmalloc.c: get rid of one single unlink_va() when merge

It does not make sense to try to "unlink" the node that is definitely not
linked with a list nor tree.  On the first merge step VA just points to
the previously disconnected busy area.

On the second step, check if the node has been merged and do "unlink" if
so, because now it points to an object that must be linked.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
4 years ago  mm/vmalloc.c: preload a CPU with one object for split purpose
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:09 +0000 (14:04 +0200)]
mm/vmalloc.c: preload a CPU with one object for split purpose

Refactor the NE_FIT_TYPE split case when it comes to an allocation of one
extra object. We need it in order to build a remaining space. The preload
is done per CPU in non-atomic context with GFP_KERNEL flags.

More permissive parameters can be beneficial for systems which suffer
from high memory pressure or low-memory conditions. For example, on my
KVM system (4xCPUs, no swap, 256MB RAM) I can simulate page allocation
failure with GFP_NOWAIT flags. Using the "stress-ng" tool and starting N
workers spinning on fork() and exit(), I can trigger the trace below:

<snip>
[  179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[  179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003
[  179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  179.815171] Call Trace:
[  179.815178]  dump_stack+0x5c/0x7b
[  179.815182]  warn_alloc+0x108/0x190
[  179.815187]  __alloc_pages_slowpath+0xdc7/0xdf0
[  179.815191]  __alloc_pages_nodemask+0x2de/0x330
[  179.815194]  cache_grow_begin+0x77/0x420
[  179.815197]  fallback_alloc+0x161/0x200
[  179.815200]  kmem_cache_alloc+0x1c9/0x570
[  179.815202]  alloc_vmap_area+0x32c/0x990
[  179.815206]  __get_vm_area_node+0xb0/0x170
[  179.815208]  __vmalloc_node_range+0x6d/0x230
[  179.815211]  ? _do_fork+0xce/0x3d0
[  179.815213]  copy_process.part.46+0x850/0x1b90
[  179.815215]  ? _do_fork+0xce/0x3d0
[  179.815219]  _do_fork+0xce/0x3d0
[  179.815226]  ? __do_page_fault+0x2bf/0x4e0
[  179.815229]  do_syscall_64+0x55/0x130
[  179.815231]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  179.815234] RIP: 0033:0x7fedec4c738b
...
[  179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b
[  179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0
[  179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000
[  179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  179.815245] Mem-Info:
[  179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0
                active_file:502 inactive_file:61 isolated_file:70
                unevictable:2 dirty:0 writeback:0 unstable:0
                slab_reclaimable:2380 slab_unreclaimable:7520
                mapped:15069 shmem:14813 pagetables:10833 bounce:0
                free:1922 free_pcp:229 free_cma:0
<snip>

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years ago  mm/vmalloc.c: remove "node" argument
Uladzislau Rezki (Sony) [Thu, 6 Jun 2019 12:04:08 +0000 (14:04 +0200)]
mm/vmalloc.c: remove "node" argument

Remove unused argument from the __alloc_vmap_area() function.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
4 years ago  mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t
Uladzislau Rezki (Sony) [Tue, 14 May 2019 22:41:25 +0000 (15:41 -0700)]
mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t

The vmap_lazy_nr variable has type atomic_t, a 4-byte integer on both
32-bit and 64-bit systems.  lazy_max_pages() deals with "unsigned long",
which is 8 bytes on a 64-bit system, thus vmap_lazy_nr should be 8 bytes
on 64-bit as well.

Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
Uladzislau Rezki (Sony) [Fri, 17 May 2019 21:31:37 +0000 (14:31 -0700)]
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro

This macro adds some debug code to check that vmap allocations
happen in ascending order.

By default this option is set to 0 and not active.  It requires
recompilation of the kernel to activate it.  Set to 1, compile the
kernel.

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
Uladzislau Rezki (Sony) [Fri, 17 May 2019 21:31:34 +0000 (14:31 -0700)]
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro

This macro adds some debug code to check that the augment tree is
maintained correctly, meaning that every node contains valid
subtree_max_size value.

By default this option is set to 0 and not active.  It requires
recompilation of the kernel to activate it.  Set to 1, compile the
kernel.

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm: vmalloc: pass proper vm_start into debugobjects
Chintan Pandya [Fri, 8 Jun 2018 00:06:53 +0000 (17:06 -0700)]
mm: vmalloc: pass proper vm_start into debugobjects

Client can call vunmap with some intermediate 'addr' which may not be
the start of the VM area.  Entire unmap code works with vm->vm_start
which is proper but debug object API is called with 'addr'.  This could
be a problem within debug objects.

Pass proper start address into debug object API.

[akpm@linux-foundation.org: fix warning]
Link: http://lkml.kernel.org/r/1523961828-9485-3-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago  mm: vmalloc: clean up vunmap to avoid pgtable ops twice
Chintan Pandya [Fri, 8 Jun 2018 00:06:46 +0000 (17:06 -0700)]
mm: vmalloc: clean up vunmap to avoid pgtable ops twice

vunmap does page table clear operations twice in the case when
DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled.

So, clean up the code as that is unintended.

As a perf gain, we save a few microseconds.  The ftrace data below was
obtained while doing 1 MB of vmalloc/vfree on an ARM64-based SoC *without*
this patch applied.  After this patch, we can save ~3 us (on 1 extra
vunmap_page_range).

  CPU  DURATION                  FUNCTION CALLS
  |     |   |                     |   |   |   |
 6)               |  __vunmap() {
 6)               |    vmap_debug_free_range() {
 6)   3.281 us    |      vunmap_page_range();
 6) + 45.468 us   |    }
 6)   2.760 us    |    vunmap_page_range();
 6) ! 505.105 us  |  }

[cpandya@codeaurora.org: v3]
Link: http://lkml.kernel.org/r/1525176960-18408-1-git-send-email-cpandya@codeaurora.org
Link: http://lkml.kernel.org/r/1523876342-10545-1-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc.c: use rb_entry_safe
Geliang Tang [Wed, 22 Feb 2017 23:41:54 +0000 (15:41 -0800)]
mm/vmalloc.c: use rb_entry_safe

Use rb_entry_safe() instead of open-coding it.
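
For illustration, with a vmap_area-style node (variable names are
illustrative):

  /* open-coded */
  va = n ? rb_entry(n, struct vmap_area, rb_node) : NULL;

  /* with the helper */
  va = rb_entry_safe(n, struct vmap_area, rb_node);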

Link: http://lkml.kernel.org/r/81bb9820e5b9e4a1c596b3e76f88abf8c4a76cb0.1482221947.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc.c: simplify /proc/vmallocinfo implementation
zijun_hu [Tue, 13 Dec 2016 00:42:17 +0000 (16:42 -0800)]
mm/vmalloc.c: simplify /proc/vmallocinfo implementation

Many seq_file helpers exist to simplify the implementation of virtual
files, especially /proc nodes.  However, the helpers for iterating over
a list_head, although available, are currently not used to implement
/proc/vmallocinfo.

Simplify the /proc/vmallocinfo implementation by using the existing
seq_file helpers.
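
A minimal sketch of what the iterator can look like with the
seq_list_start()/seq_list_next() helpers over vmap_area_list (locking
shown only in outline; the exact implementation may differ):

  static void *s_start(struct seq_file *m, loff_t *pos)
  {
          spin_lock(&vmap_area_lock);
          return seq_list_start(&vmap_area_list, *pos);
  }

  static void *s_next(struct seq_file *m, void *p, loff_t *pos)
  {
          return seq_list_next(p, &vmap_area_list, pos);
  }

  static void s_stop(struct seq_file *m, void *p)
  {
          spin_unlock(&vmap_area_lock);
  }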

Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmap: keep track of free blocks for vmap allocation
Uladzislau Rezki (Sony) [Sat, 6 Apr 2019 18:35:06 +0000 (20:35 +0200)]
mm/vmap: keep track of free blocks for vmap allocation

Currently an allocation of a new vmap area is done by iterating over
the busy list (complexity O(n)) until a suitable hole is found between
two busy areas.  Each new allocation therefore grows the list.  With a
heavily fragmented list and restrictive parameters, an allocation can
take a long time; on embedded devices it can reach milliseconds.

This patch organizes the KVA memory layout into free areas spanning
the 1..ULONG_MAX range.  It uses an augmented red-black tree that keeps
blocks sorted by their offsets, paired with a linked list that keeps
the free space in order of increasing addresses.

Each node is augmented with the size of the maximum available free
block in its left or right sub-tree.  This makes it possible to decide
at each node which way to traverse toward the block that fits and has
the lowest start address, i.e. allocation is sequential.

Allocation: to allocate a new block, the tree is searched for the
lowest (left-most) block large enough to satisfy the requested size,
alignment and vstart point.  If the block is bigger than the requested
size, it is split.

De-allocation: when a busy vmap area is freed it is either merged or
inserted into the tree.  The red-black tree allows a spot to be found
efficiently, while the linked list provides constant-time access to
the previous and next blocks to check whether merging can be done.
When a de-allocated chunk is merged, a large coalesced area is created.

Complexity: ~O(log(N))
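
A simplified sketch of the lowest-address best-fit descent (names are
illustrative; the alignment and vstart constraints are omitted):

  struct free_node {
          unsigned long start, size;
          unsigned long subtree_max;        /* max free-block size in this subtree */
          struct free_node *left, *right;   /* children, keyed by start address */
  };

  static struct free_node *find_lowest_fit(struct free_node *n, unsigned long len)
  {
          while (n) {
                  if (n->left && n->left->subtree_max >= len)
                          n = n->left;      /* a fit exists at a lower address */
                  else if (n->size >= len)
                          return n;         /* this node is the lowest fit */
                  else
                          n = n->right;     /* only higher addresses can fit */
          }
          return NULL;
  }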

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years agovmalloc: show lazy-purged vma info in vmallocinfo
Yisheng Xie [Mon, 10 Jul 2017 22:48:09 +0000 (15:48 -0700)]
vmalloc: show lazy-purged vma info in vmallocinfo

When ioremapping a 67112960-byte vm_area while vmallocinfo shows:
 [..]
 0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap

we get the result:
 0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap

Since the alignment for ioremap is clamped by '1 << IOREMAP_MAX_ORDER':

  if (flags & VM_IOREMAP)
          align = 1ul << clamp_t(int, get_count_order_long(size),
                                 PAGE_SHIFT, IOREMAP_MAX_ORDER);

it was a little puzzling why the vm_area jumped from
0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, leaving
0xed000000-0xf1000000 as a big hole.

This patch shows all vm_areas, including vmas which are being freed but
are still on the vmap_area_list, to make it clearer why we get
0xf1000000-0xf5001000 in the above case.  With it, we get a
vmallocinfo like:

 [..]
 0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
 [..]
 0xece7c000-0xece7e000    8192 unpurged vm_area
 0xece7e000-0xece83000   20480 vm_map_ram
 0xf0099000-0xf00aa000   69632 vm_map_ram

after this patch.

Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: zijun_hu <zijun_hu@htc.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
Wei Yang [Wed, 6 Sep 2017 23:24:09 +0000 (16:24 -0700)]
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()

pcpu_get_vm_areas() checks that the requested ranges do not overlap.  To
verify this, only (N^2)/2 comparisons are necessary, while the current
code performs N^2 of them.  By starting each inner loop from the next
range, the same goal is achieved and the 'continue' can be removed (see
the sketch below).

Also,

 - the overlap check of two ranges can be done with a single clause;

 - one typo in a comment is fixed.
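
Roughly, the verification loop becomes the following sketch (variable
names follow the function's offsets[]/sizes[] parameters; surrounding
code is omitted):

  for (area = 0; area < nr_vms; area++) {
          start = offsets[area];
          end = start + sizes[area];

          for (area2 = area + 1; area2 < nr_vms; area2++) {
                  unsigned long start2 = offsets[area2];
                  unsigned long end2 = start2 + sizes[area2];

                  /* two ranges overlap iff each one starts before the other ends */
                  BUG_ON(start2 < end && start < end2);
          }
  }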

Link: http://lkml.kernel.org/r/20170803063822.48702-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoRevert "mm: Update is_vmalloc_addr to account for vmalloc savings"
Artem Labazov [Tue, 2 Apr 2019 17:32:11 +0000 (20:32 +0300)]
Revert "mm: Update is_vmalloc_addr to account for vmalloc savings"

This reverts commit 8db21e1d697814324681f0046f19ccf116b467e9.

Signed-off-by: Artem Labazov <123321artyom@gmail.com>
4 years agomm: add priority threshold to __purge_vmap_area_lazy()
Uladzislau Rezki (Sony) [Thu, 24 Jan 2019 11:56:48 +0000 (12:56 +0100)]
mm: add priority threshold to __purge_vmap_area_lazy()

commit 763b218ddfaf ("mm: add preempt points into
__purge_vmap_area_lazy()")

introduced some preempt points, one of which prioritizes an
allocation over the lazy freeing of vmap areas.

Prioritizing allocation over freeing does not work well all the
time; it should rather be a compromise:

1) The number of lazy pages directly influences the busy list length
and thus operations like allocation, lookup, unmap, remove, etc.

2) Under heavy stress of the vmalloc subsystem I ran into a situation
where memory usage kept increasing until hitting an out_of_memory ->
panic state, because the logic that frees vmap areas in
__purge_vmap_area_lazy() was completely blocked.

Establish a threshold past which freeing is prioritized back over
allocation, creating a balance between the two.
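
An illustrative sketch of the idea (nr_lazy_pages and resched_threshold
are hypothetical names): yield to allocations only while the lazy
backlog stays below the threshold; above it, keep purging.

  if (nr_lazy_pages < resched_threshold)
          cond_resched_lock(&vmap_area_lock);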

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
4 years agomm/vmalloc.c: fix align value calculation error
zijun_hu [Fri, 7 Oct 2016 23:57:26 +0000 (16:57 -0700)]
mm/vmalloc.c: fix align value calculation error

The current code causes a doubled alignment requirement for
__get_vm_area_node() if the size parameter is a power of 2 and
VM_IOREMAP is set in the flags parameter, for example
size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000.

get_count_order_long() is implemented and can be used instead of
fls_long() to fix the bug, for example size=0x10000 ->
get_count_order_long(0x10000)=16 -> align=0x10000.
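
A userspace demonstration of the difference (these are stand-in
implementations, not the kernel's):

  #include <stdio.h>

  static int fls_long_demo(unsigned long x)       /* 1-based index of highest set bit */
  {
          return x ? (int)(8 * sizeof(x)) - __builtin_clzl(x) : 0;
  }

  static int get_count_order_long_demo(unsigned long x)   /* ceil(log2(x)) */
  {
          return x <= 1 ? 0 : fls_long_demo(x - 1);
  }

  int main(void)
  {
          unsigned long size = 0x10000;

          printf("fls_long(0x%lx) = %d -> align = 0x%lx\n",
                 size, fls_long_demo(size), 1UL << fls_long_demo(size));
          printf("get_count_order_long(0x%lx) = %d -> align = 0x%lx\n",
                 size, get_count_order_long_demo(size),
                 1UL << get_count_order_long_demo(size));
          return 0;
  }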

[akpm@linux-foundation.org: s/get_order_long()/get_count_order_long()/]
[zijun_hu@zoho.com: fixes]
Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
[akpm@linux-foundation.org: locate get_count_order_long() next to get_count_order()]
[akpm@linux-foundation.org: move get_count_order[_long] definitions to pick up fls_long()]
[zijun_hu@htc.com: move out get_count_order[_long]() from __KERNEL__ scope]
Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm: fix overflow in vm_map_ram()
Guillermo Julián Moreno [Fri, 3 Jun 2016 21:55:33 +0000 (14:55 -0700)]
mm: fix overflow in vm_map_ram()

When remapping pages that account for 4G or more of memory space, the
operation 'count << PAGE_SHIFT' overflows because it is performed on an
integer.  Solution: cast before doing the bit shift.
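
A userspace demonstration of the overflow on an LP64 system
(PAGE_SHIFT hard-coded to 12 for illustration):

  #include <stdio.h>

  #define PAGE_SHIFT 12

  int main(void)
  {
          unsigned int count = 0x100001;    /* a bit more than 4G worth of 4K pages */

          unsigned long bad  = count << PAGE_SHIFT;                 /* 32-bit shift wraps */
          unsigned long good = (unsigned long)count << PAGE_SHIFT;  /* cast first, then shift */

          printf("without cast: 0x%lx\nwith cast:    0x%lx\n", bad, good);
          return 0;
  }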

[akpm@linux-foundation.org: fix vm_unmap_ram() also]
[akpm@linux-foundation.org: fix vmap() as well, per Guillermo]
Link: http://lkml.kernel.org/r/etPan.57175fb3.7a271c6b.2bd@naudit.es
Signed-off-by: Guillermo Julián Moreno <guillermo.julian@naudit.es>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment
Shawn Lin [Thu, 17 Mar 2016 21:20:37 +0000 (14:20 -0700)]
mm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment

We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED()
for checking the PAGE_SIZE-aligned case.
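
In sketch form (the BUG_ON call site is illustrative; PAGE_ALIGNED() is
IS_ALIGNED() with the PAGE_SIZE argument and pointer cast folded in):

  /* before */
  BUG_ON(!IS_ALIGNED((unsigned long)addr, PAGE_SIZE));

  /* after */
  BUG_ON(!PAGE_ALIGNED(addr));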

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc: query dynamic DEBUG_PAGEALLOC setting
Joonsoo Kim [Thu, 17 Mar 2016 21:17:49 +0000 (14:17 -0700)]
mm/vmalloc: query dynamic DEBUG_PAGEALLOC setting

As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters,
we can optimize some cases by checking the enablement state.

This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:

  https://lkml.org/lkml/2016/1/27/194

Remaining work is to make sparc aware of this, but that does not look
easy to me, so I skip it in this series.

This patch (of 5):

We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
whether it is enabled or not at runtime.
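
A sketch of the resulting pattern in the vmalloc debug path (the exact
call sites may differ):

  static void vmap_debug_free_range(unsigned long start, unsigned long end)
  {
          if (debug_pagealloc_enabled()) {
                  vunmap_page_range(start, end);
                  flush_tlb_kernel_range(start, end);
          }
  }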

[akpm@linux-foundation.org: update comment, per David.  Adjust comment to use 80 cols]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc.c: use macro IS_ALIGNED to judge the aligment
Wang Xiaoqiang [Sat, 16 Jan 2016 00:57:19 +0000 (16:57 -0800)]
mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment

Just cleanup, no functional change.

Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm, vmalloc: remove VM_VPAGES
David Rientjes [Thu, 14 Jan 2016 23:19:35 +0000 (15:19 -0800)]
mm, vmalloc: remove VM_VPAGES

VM_VPAGES is unnecessary; it is easier to check is_vmalloc_addr() when
reading /proc/vmallocinfo.
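
In sketch form, the /proc/vmallocinfo check becomes (illustrative; the
exact site may differ):

  /* before */
  if (v->flags & VM_VPAGES)
          seq_puts(m, " vpages");

  /* after */
  if (is_vmalloc_addr(v->pages))
          seq_puts(m, " vpages");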

[akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/vmalloc.c: use list_{next,first}_entry
Geliang Tang [Thu, 14 Jan 2016 23:19:08 +0000 (15:19 -0800)]
mm/vmalloc.c: use list_{next,first}_entry

To make the intention clearer, use list_{next,first}_entry instead of
list_entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agovmalloc: allow to account vmalloc to memcg
Vladimir Davydov [Thu, 14 Jan 2016 23:18:18 +0000 (15:18 -0800)]
vmalloc: allow to account vmalloc to memcg

Make the vmalloc family of functions allocate vmalloc area pages with
alloc_kmem_pages so that, if __GFP_ACCOUNT is set, they will be accounted
to memcg.  This is needed, at least, to account alloc_fdmem allocations.
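
A hedged usage-side sketch (assuming the 4.4-era three-argument
__vmalloc()): with this change, the pages backing the area are charged
to the caller's memcg when __GFP_ACCOUNT is passed.

  void *buf = __vmalloc(size, GFP_KERNEL | __GFP_ACCOUNT, PAGE_KERNEL);
  if (buf) {
          /* ... use the accounted buffer ... */
          vfree(buf);
  }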

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agosched/idle: Optimize the generic idle loop
Gaurav Jindal (Gaurav Jindal) [Thu, 12 May 2016 10:13:33 +0000 (10:13 +0000)]
sched/idle: Optimize the generic idle loop

Currently, smp_processor_id() is used to fetch the current CPU in
cpu_idle_loop(). Every time the idle thread runs, it fetches the
current CPU using smp_processor_id().

Since the idle thread is per CPU, the current CPU is constant, so we
can lift the load out of the loop, saving execution cycles/time in the
loop.

x86-64:

Before patch (execution in loop):
148:    0f ae e8                lfence
14b:    65 8b 04 25 00 00 00 00 mov %gs:0x0,%eax
152:    00
153:    89 c0                   mov %eax,%eax
155:    49 0f a3 04 24          bt %rax,(%r12)

After patch (execution in loop):
150:    0f ae e8                lfence
153:    4d 0f a3 34 24          bt %r14,(%r12)

ARM64:

Before patch (execution in loop):
168:    d5033d9f        dsb     ld
16c:    b9405661        ldr     w1,[x19,#84]
170:    1100fc20        add     w0,w1,#0x3f
174:    6b1f003f        cmp     w1,wzr
178:    1a81b000        csel    w0,w0,w1,lt
17c:    130c7000        asr     w0,w0,#6
180:    937d7c00        sbfiz   x0,x0,#3,#32
184:    f8606aa0        ldr     x0,[x21,x0]
188:    9ac12401        lsr     x1,x0,x1
18c:    36000e61        tbz     w1,#0,358

After patch (execution in loop):
1a8:    d50339df        dsb     ld
1ac:    f8776ac0        ldr     x0,[x22,x23]
1b0:    ea18001f        tst     x0,x24
1b4:    54000ea0        b.eq    388

Further observation on ARM64 for 4 seconds shows that cpu_idle_loop is
called 8672 times.  Hoisting the code saves instructions executed in
the loop and, ultimately, time as well.
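
In sketch form, the hoist looks like this (loop body elided):

  static void cpu_idle_loop(void)
  {
          int cpu = smp_processor_id();   /* idle thread is per-CPU: read once */

          while (1) {
                  /* ... */
                  if (cpu_is_offline(cpu))        /* was: cpu_is_offline(smp_processor_id()) */
                          arch_cpu_idle_dead();
                  /* ... */
          }
  }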

Signed-off-by: Gaurav Jindal <gaurav.jindal@spreadtrum.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sanjeev Yadav <sanjeev.yadav@spreadtrum.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160512101330.GA488@gauravjindalubtnb.del.spreadtrum.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
4 years agosched/fair: honor sync only if CPU is about to goto idle
Rick Yiu [Thu, 17 May 2018 13:40:41 +0000 (21:40 +0800)]
sched/fair: honor sync only if CPU is about to goto idle

sync is causing excessive latencies during binder replies, as it causes
migration of important tasks to a busy CPU.  In case the CPU has a lot
of tasks running, prevent sync from happening.
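
Illustrative sketch only (the actual condition in this tree may
differ): honor the sync hint just when the waking CPU has nothing else
runnable, i.e. it is about to go idle once the waker sleeps.

  if (sync && cpu_rq(this_cpu)->nr_running == 1)
          return this_cpu;        /* waker will idle; keep the wakee local */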

bug: 78790904
Change-Id: I8e4ef0d331a92b86111882bfe1b68b93a8b5a687
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
4 years agozram: show lzo compressor only be enabled
Julian Liu [Tue, 22 Oct 2019 13:03:47 +0000 (21:03 +0800)]
zram: show lzo compressor only be enabled

Fixes: 85a88878 ("zram: Set default compressor to lz4")

4 years agoiommu: Sync with suntan
Julian Liu [Sun, 20 Oct 2019 19:22:13 +0000 (03:22 +0800)]
iommu: Sync with suntan

* fix possible deadlock

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
4 years agof2fs: Revert rapid GC
Julian Liu [Sun, 20 Oct 2019 23:06:22 +0000 (07:06 +0800)]
f2fs: Revert rapid GC

4 years agozram: fix re-do compression
Julian Liu [Fri, 18 Oct 2019 16:52:15 +0000 (00:52 +0800)]
zram: fix re-do compression

4 years agodefconfig: Increase LMK reclaim frequency
wloot [Thu, 6 Jun 2019 14:11:31 +0000 (22:11 +0800)]
defconfig: Increase LMK reclaim frequency

4 years agoANDROID: sched/fair: initialise util_est values to 0 on fork
Chris Redpath [Tue, 23 Oct 2018 16:43:34 +0000 (17:43 +0100)]
ANDROID: sched/fair: initialise util_est values to 0 on fork

Since "sched/fair: Align PELT windows between cfs_rq and its se" the
upstream kernel has initialised the whole content of sched_avg to zero
on fork. When util_est was backported, we missed this and so ended up
with util_est values copied from the parent task.

Add the zero initialisation which is present upstream and ensure that
util_est values always start from a known point.
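
A sketch of the zero initialisation, mirroring the upstream approach
(surrounding initialisation elided; exact placement may differ):

  void init_entity_runnable_average(struct sched_entity *se)
  {
          struct sched_avg *sa = &se->avg;

          memset(sa, 0, sizeof(*sa));     /* util_est now starts from zero on fork */
          /* ... the rest of the initialisation follows ... */
  }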

Fixes: 700f1172f7a7 ("BACKPORT: sched/fair: Add util_est on top of PELT")
Reported-by: Puja Gupta <pujag@quicinc.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Saravana Kannan <skannan@codeaurora.org>
Change-Id: I06995e4320d606a52761d0e773baf28fcd1e2680
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
4 years ago{chiron,sagit}_defconfig: Re-use Simple LMK
Julian Liu [Mon, 14 Oct 2019 10:41:10 +0000 (18:41 +0800)]
{chiron,sagit}_defconfig: Re-use Simple LMK

This reverts commit 522a29d7e10203da7d9f826660b0e94fb1a03bb0.

4 years agoANDROID: arm64: vdso: unconditionally set -Wl,--hash-style=sysv
Nick Desaulniers [Tue, 5 Mar 2019 21:36:43 +0000 (13:36 -0800)]
ANDROID: arm64: vdso: unconditionally set -Wl,--hash-style=sysv

The conditional check through cc-ldoption conflicts with
cfi-clang-flags once set. This is causing failures in
VtsKernelLinuxKselftest#vDSO_kselftest_vdso_test_64bit due to GOLD
defaulting to gnu rather than sysv as the hash style.

Bug: 122343936
Bug: 122902928
Test: m kselftest_vdso_test && \
  adb push $OUT/data/nativetest64/linux-kselftest/vDSO/kselftest_vdso_test /data/local/tmp/ && \
  adb shell /data/local/tmp/kselftest_vdso_test
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: Ice6ad2f99baa0eba2cc0813ed18486d1d9f4d612

4 years agodevfreq: msm_adreno_tz: Decrease busy time ceiling
Julian Liu [Thu, 10 Oct 2019 18:31:07 +0000 (02:31 +0800)]
devfreq: msm_adreno_tz: Decrease busy time ceiling