OSDN Git Service

uclinux-h8/linux.git
5 years agoperf intel-pt: Insert callchain context into synthesized callchains
Adrian Hunter [Wed, 31 Oct 2018 09:10:42 +0000 (11:10 +0200)]
perf intel-pt: Insert callchain context into synthesized callchains

In the absence of a fallback, callchains must encode also the callchain
context. Do that now there is no fallback.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: stable@vger.kernel.org # 4.19
Link: http://lkml.kernel.org/r/100ea2ec-ed14-b56d-d810-e0a6d2f4b069@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf tools: Don't clone maps from parent when synthesizing forks
David Miller [Wed, 31 Oct 2018 05:24:04 +0000 (22:24 -0700)]
perf tools: Don't clone maps from parent when synthesizing forks

When synthesizing FORK events, we are trying to create thread objects
for the already running tasks on the machine.

Normally, for a kernel FORK event, we want to clone the parent's maps
because that is what the kernel just did.

But when synthesizing, this should not be done.  If we do, we end up
with overlapping maps as we process the sythesized MMAP2 events that
get delivered shortly thereafter.

Use the FORK event misc flags in an internal way to signal this
situation, so we can elide the map clone when appropriate.

Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Link: http://lkml.kernel.org/r/20181030.222404.2085088822877051075.davem@davemloft.net
[ Added comment about flag use in machine__process_fork_event(),
  use ternary op in thread__clone_map_groups() as suggested by Jiri ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf top: Start display thread earlier
David Miller [Wed, 31 Oct 2018 05:30:03 +0000 (22:30 -0700)]
perf top: Start display thread earlier

If events are coming in at a rate such that the event processing thread
can barely keep up, our initial run of the event ring will almost never
terminate and this delays the starting of the display thread.

The screen basically stays black until the event thread can get out of
it's endless loop.

Therefore, start the display thread before we start processing the ring
buffer.

This also make sure that we always have the user requested real time
setting engaged when processing the ring.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20181030.223003.2242527041807905962.davem@davemloft.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools headers uapi: Update linux/if_link.h header copy
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 20:06:57 +0000 (17:06 -0300)]
tools headers uapi: Update linux/if_link.h header copy

To pick the changes from:

  9163a0fc1f0c ("net: bridge: add support for per-port vlan stats")

And silence this build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from latest version at 'include/uapi/linux/if_link.h'

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Leblond <eric@regit.org>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Link: https://lkml.kernel.org/n/tip-7p53ghippywz7fqkwo3nkzet@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools headers uapi: Update linux/netlink.h header copy
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 20:04:47 +0000 (17:04 -0300)]
tools headers uapi: Update linux/netlink.h header copy

Picking the changes from:

  89d35528d17d ("netlink: Add new socket option to enable strict checking on dumps")

To silence this build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/netlink.h' differs from latest version at 'include/uapi/linux/netlink.h'

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Leblond <eric@regit.org>
Link: https://lkml.kernel.org/n/tip-1xymkfjpmhxfzrs46t8z8mjw@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools headers: Sync the various kvm.h header copies
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 20:01:46 +0000 (17:01 -0300)]
tools headers: Sync the various kvm.h header copies

For powerpc, s390, x86 and the main uapi linux/kvm.h header, none of
them entail changes in tooling.

Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/n/tip-avn7iy8f4tcm2y40sbsdk31m@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools include uapi: Update linux/mmap.h copy
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 19:50:08 +0000 (16:50 -0300)]
tools include uapi: Update linux/mmap.h copy

To pick up the changes from:

  20916d4636a9 ("mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes")

That do not entail changes in in tools, this just shows that we have to
consider bits [26:31] of flags to beautify that in tools like 'perf
trace'

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/mman.h' differs from latest version at 'include/uapi/linux/mman.h'
  diff -u tools/include/uapi/linux/mman.h include/uapi/linux/mman.h

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/n/tip-3rvc39lon93kgt5pl31d8g4x@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf trace beauty: Use the mmap flags table generated from headers
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 19:30:38 +0000 (16:30 -0300)]
perf trace beauty: Use the mmap flags table generated from headers

Instead of requiring us to go on and edit sources to add new flag.

  # perf trace -e *mmap sleep 0.1
     0.025 ( 0.005 ms): sleep/29876 mmap(len: 163746, prot: READ, flags: PRIVATE, fd: 3) = 0x7faa68ad1000
     0.059 ( 0.004 ms): sleep/29876 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7faa68acf000
     0.069 ( 0.006 ms): sleep/29876 mmap(len: 3889792, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7faa6851f000
     0.086 ( 0.009 ms): sleep/29876 mmap(addr: 0x7faa688cb000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 1753088) = 0x7faa688cb000
     0.101 ( 0.005 ms): sleep/29876 mmap(addr: 0x7faa688d1000, len: 14976, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7faa688d1000
     0.348 ( 0.005 ms): sleep/29876 mmap(len: 111950656, prot: READ, flags: PRIVATE, fd: 3) = 0x7faa61a5b000
  #

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-ggmoy6vxoygh5yim890ht0kf@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf beauty: Wire up the mmap flags table generator to the Makefile
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 19:11:59 +0000 (16:11 -0300)]
perf beauty: Wire up the mmap flags table generator to the Makefile

Now when we run 'make -C tools/perf O=/tmp/build/perf' we end up with:

  $ cat /tmp/build/perf/trace/beauty/generated/mmap_flags_array.c
  static const char *mmap_flags[] = {
[ilog2(0x40) + 1] = "32BIT",
[ilog2(0x01) + 1] = "SHARED",
[ilog2(0x02) + 1] = "PRIVATE",
[ilog2(0x10) + 1] = "FIXED",
[ilog2(0x20) + 1] = "ANONYMOUS",
[ilog2(0x100000) + 1] = "FIXED_NOREPLACE",
[ilog2(0x0100) + 1] = "GROWSDOWN",
[ilog2(0x0800) + 1] = "DENYWRITE",
[ilog2(0x1000) + 1] = "EXECUTABLE",
[ilog2(0x2000) + 1] = "LOCKED",
[ilog2(0x4000) + 1] = "NORESERVE",
[ilog2(0x8000) + 1] = "POPULATE",
[ilog2(0x10000) + 1] = "NONBLOCK",
[ilog2(0x20000) + 1] = "STACK",
[ilog2(0x40000) + 1] = "HUGETLB",
[ilog2(0x80000) + 1] = "SYNC",
  };
  $

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-t3fn7u3tjsupio6e6vkufx9m@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf beauty: Add a generator for MAP_ mmap's flag constants
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 18:26:47 +0000 (15:26 -0300)]
perf beauty: Add a generator for MAP_ mmap's flag constants

It'll use tools/{arch}/*,include copies of mman.h to generate a table to
be used by tools, initially by the 'mmap' beautifiers in 'perf trace',
but that could also be used to translate from a string constant to the
integer value to be used in a eBPF or tracefs tracepoint filter.

Tested for all archs using:

$ for arch in `ls tools/arch/` ; \
do echo $arch ; tools/perf/trace/beauty/mmap_flags.sh $arch ; \
   done | less

Example for alpha, an oddball, doesn't include any header, defines all
its stuff:

  $ tools/perf/trace/beauty/mmap_flags.sh alpha
  static const char *mmap_flags[] = {
[ilog2(0x10) + 1] = "ANONYMOUS",
[ilog2(0x02000) + 1] = "DENYWRITE",
[ilog2(0x04000) + 1] = "EXECUTABLE",
[ilog2(0x100) + 1] = "FIXED",
[ilog2(0x01000) + 1] = "GROWSDOWN",
[ilog2(0x100000) + 1] = "HUGETLB",
[ilog2(0x08000) + 1] = "LOCKED",
[ilog2(0x40000) + 1] = "NONBLOCK",
[ilog2(0x10000) + 1] = "NORESERVE",
[ilog2(0x20000) + 1] = "POPULATE",
[ilog2(0x02) + 1] = "PRIVATE",
[ilog2(0x01) + 1] = "SHARED",
[ilog2(0x80000) + 1] = "STACK",
  };
  $

Common case, my workstation, defines one entry (MAP_32BIT), then
includes mman.h, which gets it to include mman-common.h too:

  $ tools/perf/trace/beauty/mmap_flags.sh
  static const char *mmap_flags[] = {
[ilog2(0x40) + 1] = "32BIT",
[ilog2(0x01) + 1] = "SHARED",
[ilog2(0x02) + 1] = "PRIVATE",
[ilog2(0x10) + 1] = "FIXED",
[ilog2(0x20) + 1] = "ANONYMOUS",
[ilog2(0x100000) + 1] = "FIXED_NOREPLACE",
[ilog2(0x0100) + 1] = "GROWSDOWN",
[ilog2(0x0800) + 1] = "DENYWRITE",
[ilog2(0x1000) + 1] = "EXECUTABLE",
[ilog2(0x2000) + 1] = "LOCKED",
[ilog2(0x4000) + 1] = "NORESERVE",
[ilog2(0x8000) + 1] = "POPULATE",
[ilog2(0x10000) + 1] = "NONBLOCK",
[ilog2(0x20000) + 1] = "STACK",
[ilog2(0x40000) + 1] = "HUGETLB",
[ilog2(0x80000) + 1] = "SYNC",
  };
  $ uname -m
  x86_64
  $

Sparc, that defines a bunch then includes just mman-common.h:

  $ tools/perf/trace/beauty/mmap_flags.sh sparc
  static const char *mmap_flags[] = {
[ilog2(0x0800) + 1] = "DENYWRITE",
[ilog2(0x1000) + 1] = "EXECUTABLE",
[ilog2(0x0200) + 1] = "GROWSDOWN",
[ilog2(0x40000) + 1] = "HUGETLB",
[ilog2(0x100) + 1] = "LOCKED",
[ilog2(0x10000) + 1] = "NONBLOCK",
[ilog2(0x40) + 1] = "NORESERVE",
[ilog2(0x8000) + 1] = "POPULATE",
[ilog2(0x20000) + 1] = "STACK",
[ilog2(0x01) + 1] = "SHARED",
[ilog2(0x02) + 1] = "PRIVATE",
[ilog2(0x10) + 1] = "FIXED",
[ilog2(0x20) + 1] = "ANONYMOUS",
[ilog2(0x100000) + 1] = "FIXED_NOREPLACE",
  };
  [acme@jouet perf]$

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-xydeh491z8fkgglcmqnl5thj@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools include uapi: Update asound.h copy
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 17:20:07 +0000 (14:20 -0300)]
tools include uapi: Update asound.h copy

To silence this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/sound/asound.h' differs from latest version at 'include/uapi/sound/asound.h'
  diff -u tools/include/uapi/sound/asound.h include/uapi/sound/asound.h

Due to this cset:

  a98401518def ("ALSA: timer: fix wrong comment to refer to 'SNDRV_TIMER_PSFLG_*'")

Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Cc: Takashi Iwai <tiwai@suse.de>
Link: https://lkml.kernel.org/n/tip-76gsvs0w2g0x723ivqa2xua3@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 16:10:50 +0000 (13:10 -0300)]
tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies

To get the changes in:

  82b355d161c9 ("y2038: Remove newstat family from default syscall set")

Which will make the syscall table used by 'perf trace' for arm64 to be
updated from the changes in that patch.

This silences these perf build warnings:

  Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/unistd.h' differs from latest version at 'arch/arm64/include/uapi/asm/unistd.h'
  diff -u tools/arch/arm64/include/uapi/asm/unistd.h arch/arm64/include/uapi/asm/unistd.h
  Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
  diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h

Cc: Kim Phillips <kim.phillips@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/n/tip-3euy7c4yy5mvnp5bm16t9vqg@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools include uapi: Update linux/fs.h copy
Arnaldo Carvalho de Melo [Tue, 30 Oct 2018 15:27:37 +0000 (12:27 -0300)]
tools include uapi: Update linux/fs.h copy

To silence this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/fs.h' differs from latest version at 'include/uapi/linux/fs.h'
  diff -u tools/include/uapi/linux/fs.h include/uapi/linux/fs.h

Due to just two comments added by:

  Fixes: 578bdaabd015 ("crypto: speck - remove Speck")

So nothing that entails changes in tools/, that so far uses fs.h to
generate the mount and umount syscalls 'flags' argument integer->string
tables with:

  $ tools/perf/trace/beauty/mount_flags.sh
  static const char *mount_flags[] = {
[4096 ? (ilog2(4096) + 1) : 0] = "BIND",
<SNIP>
[30 + 1] = "ACTIVE",
[31 + 1] = "NOUSER",
  };
  $
  # trace -e mount,umount mount --bind /proc /mnt
     1.228 ( 2.581 ms): mount/1068 mount(dev_name: /mnt, dir_name: 0x55f011c354a0, type: 0x55f011c38170, flags: BIND) = 0
  # trace -e mount,umount umount /proc /mnt
  umount: /proc: target is busy.
     1.587 ( 0.010 ms): umount/1070 umount2(name: /proc) = -1 EBUSY Device or resource busy
     1.799 (12.660 ms): umount/1070 umount2(name: /mnt) = 0
  #

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lkml.kernel.org/n/tip-c00bqzclscgah26z2g5zxm73@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
David S. Miller [Tue, 30 Oct 2018 15:12:26 +0000 (12:12 -0300)]
perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}

When processing using 'perf report -g caller', which is the default, we
ended up reverting the callchain entries received from the kernel, but
simply reverting throws away the information that tells that from a
point onwards the addresses are for userspace, kernel, guest kernel,
guest user, hypervisor.

The idea is that if we are walking backwards, for each cluster of
non-cpumode entries we have to first scan backwards for the next one and
use that for the cluster.

This seems silly and more expensive than it needs to be but it is enough
for a initial fix.

The code here is really complicated because it is intimately intertwined
with the lbr and branch handling, as well as this callchain order,
further fixes will be needed to properly take into account the cpumode
in those cases.

Another problem with ORDER_CALLER is that the NULL "0" IP that is at the
end of most callchains shows up at the top of the histogram because
every callchain contains it and with ORDER_CALLER it is the first entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Souvik Banerjee <souvik1997@gmail.com>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: stable@vger.kernel.org # 4.19
Link: https://lkml.kernel.org/n/tip-2wt3ayp6j2y2f2xowixa8y6y@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf cs-etm: Correct CPU mode for samples
Leo Yan [Tue, 30 Oct 2018 07:18:28 +0000 (15:18 +0800)]
perf cs-etm: Correct CPU mode for samples

Since commit edeb0c90df35 ("perf tools: Stop fallbacking to kallsyms for
vdso symbols lookup"), the kernel address cannot be properly parsed to
kernel symbol with command 'perf script -k vmlinux'.  The reason is
CoreSight samples is always to set CPU mode as PERF_RECORD_MISC_USER,
thus it fails to find corresponding map/dso in below flows:

  process_sample_event()
    `-> machine__resolve()
  `-> thread__find_map(thread, sample->cpumode, sample->ip, al);

In this flow it needs to pass argument 'sample->cpumode' to tell what's
the CPU mode, before it always passed PERF_RECORD_MISC_USER but without
any failure until the commit edeb0c90df35 ("perf tools: Stop fallbacking
to kallsyms for vdso symbols lookup") has been merged.  The reason is
even with the wrong CPU mode the function thread__find_map() firstly
fails to find map but it will rollback to find kernel map for vdso
symbols lookup.  In the latest code it has removed the fallback code,
thus if CPU mode is PERF_RECORD_MISC_USER then it cannot find map
anymore with kernel address.

This patch is to correct samples CPU mode setting, it creates a new
helper function cs_etm__cpu_mode() to tell what's the CPU mode based on
the address with the info from machine structure; this patch has a bit
extension to check not only kernel and user mode, but also check for
host/guest and hypervisor mode.  Finally this patch uses the function in
instruction and branch samples and also apply in cs_etm__mem_access()
for a minor polishing.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@kernel.org # v4.19
Link: http://lkml.kernel.org/r/1540883908-17018-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf unwind: Take pgoff into account when reporting elf to libdwfl
Milian Wolff [Mon, 29 Oct 2018 14:16:44 +0000 (15:16 +0100)]
perf unwind: Take pgoff into account when reporting elf to libdwfl

libdwfl parses an ELF file itself and creates mappings for the
individual sections. perf on the other hand sees raw mmap events which
represent individual sections. When we encounter an address pointing
into a mapping with pgoff != 0, we must take that into account and
report the file at the non-offset base address.

This fixes unwinding with libdwfl in some cases. E.g. for a file like:

```

using namespace std;

mutex g_mutex;

double worker()
{
    lock_guard<mutex> guard(g_mutex);
    uniform_real_distribution<double> uniform(-1E5, 1E5);
    default_random_engine engine;
    double s = 0;
    for (int i = 0; i < 1000; ++i) {
        s += norm(complex<double>(uniform(engine), uniform(engine)));
    }
    cout << s << endl;
    return s;
}

int main()
{
    vector<std::future<double>> results;
    for (int i = 0; i < 10000; ++i) {
        results.push_back(async(launch::async, worker));
    }
    return 0;
}
```

Compile it with `g++ -g -O2 -lpthread cpp-locking.cpp  -o cpp-locking`,
then record it with `perf record --call-graph dwarf -e
sched:sched_switch`.

When you analyze it with `perf script` and libunwind, you should see:

```
cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120
        ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux)
            7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so)
            7f38e4252d0b arena_get2.part.4+0x2fb (/usr/lib/libc-2.28.so)
            7f38e4255b1c tcache_init.part.6+0xec (/usr/lib/libc-2.28.so)
            7f38e42569e5 __GI___libc_malloc+0x115 (inlined)
            7f38e4241790 __GI__IO_file_doallocate+0x90 (inlined)
            7f38e424fbbf __GI__IO_doallocbuf+0x4f (inlined)
            7f38e424ee47 __GI__IO_file_overflow+0x197 (inlined)
            7f38e424df36 _IO_new_file_xsputn+0x116 (inlined)
            7f38e4242bfb __GI__IO_fwrite+0xdb (inlined)
            7f38e463fa6d std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, long)+0x1cd (inlined)
            7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> >::_M_put(char const*, long)+0x1cd (inlined)
            7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::__write<char>(std::ostreambuf_iterator<char, std::char_traits<char> >, char const*, int)+0x1cd (inlined)
            7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<c>
            7f38e464bd70 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const+0x90 (inl>
            7f38e464bd70 std::ostream& std::ostream::_M_insert<double>(double)+0x90 (/usr/lib/libstdc++.so.6.0.25)
            563b9cb502f7 std::ostream::operator<<(double)+0xb7 (inlined)
            563b9cb502f7 worker()+0xb7 (/ssd/milian/projects/kdab/rnd/hotspot/build/tests/test-clients/cpp-locking/cpp-locking)
            563b9cb506fb double std::__invoke_impl<double, double (*)()>(std::__invoke_other, double (*&&)())+0x2b (inlined)
            563b9cb506fb std::__invoke_result<double (*)()>::type std::__invoke<double (*)()>(double (*&&)())+0x2b (inlined)
            563b9cb506fb decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<double (*)()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x2b (inlined)
            563b9cb506fb std::thread::_Invoker<std::tuple<double (*)()> >::operator()()+0x2b (inlined)
            563b9cb506fb std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<double>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<double (*)()> >, dou>
            563b9cb506fb std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_>
            563b9cb507e8 std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const+0x28 (inlined)
            563b9cb507e8 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)+0x28 (/ssd/milian/>
            7f38e46d24fe __pthread_once_slow+0xbe (/usr/lib/libpthread-2.28.so)
            563b9cb51149 __gthread_once+0xe9 (inlined)
            563b9cb51149 void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>
            563b9cb51149 std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)+0xe9 (inlined)
            563b9cb51149 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double (*)()> >&&)::{lambda()#1}::op>
            563b9cb51149 void std::__invoke_impl<void, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double>
            563b9cb51149 std::__invoke_result<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double (*)()> >>
            563b9cb51149 decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_>
            563b9cb51149 std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<dou>
            563b9cb51149 std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread>
            7f38e45f0062 execute_native_thread_routine+0x12 (/usr/lib/libstdc++.so.6.0.25)
            7f38e46caa9c start_thread+0xfc (/usr/lib/libpthread-2.28.so)
            7f38e42ccb22 __GI___clone+0x42 (inlined)
```

Before this patch, using libdwfl, you would see:

```
cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120
        ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux)
            7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so)
        a041161e77950c5c [unknown] ([unknown])
```

With this patch applied, we get a bit further in unwinding:

```
cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120
        ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux)
        ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux)
            7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so)
            7f38e4252d0b arena_get2.part.4+0x2fb (/usr/lib/libc-2.28.so)
            7f38e4255b1c tcache_init.part.6+0xec (/usr/lib/libc-2.28.so)
            7f38e42569e5 __GI___libc_malloc+0x115 (inlined)
            7f38e4241790 __GI__IO_file_doallocate+0x90 (inlined)
            7f38e424fbbf __GI__IO_doallocbuf+0x4f (inlined)
            7f38e424ee47 __GI__IO_file_overflow+0x197 (inlined)
            7f38e424df36 _IO_new_file_xsputn+0x116 (inlined)
            7f38e4242bfb __GI__IO_fwrite+0xdb (inlined)
            7f38e463fa6d std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, long)+0x1cd (inlined)
            7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> >::_M_put(char const*, long)+0x1cd (inlined)
            7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::__write<char>(std::ostreambuf_iterator<char, std::char_traits<char> >, char const*, int)+0x1cd (inlined)
            7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<c>
            7f38e464bd70 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const+0x90 (inl>
            7f38e464bd70 std::ostream& std::ostream::_M_insert<double>(double)+0x90 (/usr/lib/libstdc++.so.6.0.25)
            563b9cb502f7 std::ostream::operator<<(double)+0xb7 (inlined)
            563b9cb502f7 worker()+0xb7 (/ssd/milian/projects/kdab/rnd/hotspot/build/tests/test-clients/cpp-locking/cpp-locking)
        6eab825c1ee3e4ff [unknown] ([unknown])
```

Note that the backtrace is still stopping too early, when compared to
the nice results obtained via libunwind. It's unclear so far what the
reason for that is.

Committer note:

Further comment by Milian on the thread started on the Link: tag below:

 ---
The remaining issue is due to a bug in elfutils:

https://sourceware.org/ml/elfutils-devel/2018-q4/msg00089.html

With both patches applied, libunwind and elfutils produce the same output for
the above scenario.
 ---

Signed-off-by: Milian Wolff <milian.wolff@kdab.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20181029141644.3907-1-milian.wolff@kdab.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf top: Do not use overwrite mode by default
Arnaldo Carvalho de Melo [Mon, 29 Oct 2018 12:47:00 +0000 (09:47 -0300)]
perf top: Do not use overwrite mode by default

Enabling --overwrite mode allows us to to use just the most recent
records, which helps in high core count machines such as Knights
Landing/Mill, but right now is being disabled by default as the pausing
used in this technique is leading to loss of metadata events such as
PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples,
leading to lots of unknown samples appearing on the UI.

Enabling this may be useful if you are in such machines and profiling a
workload that doesn't creates short lived threads and/or doesn't uses
many executable mmap operations.

Work is being planed to solve this situation, till then, this will
remain disabled by default.

Reported-by: David Miller <davem@davemloft.net>
Acked-by: Kan Liang <kan.liang@intel.com>
Link: https://lkml.kernel.org/r/4f84468f-37d9-cf1b-12c1-514ef74b6a48@linux.intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode")
Link: https://lkml.kernel.org/n/tip-ehvf77vi1si9409r7p4wx788@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf top: Allow disabling the overwrite mode
Arnaldo Carvalho de Melo [Fri, 26 Oct 2018 18:55:23 +0000 (15:55 -0300)]
perf top: Allow disabling the overwrite mode

In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
forgot to leave a way to disable that new default, add a --overwrite
option that can be disabled using --no-overwrite, since the code already
in such a way that we can readily disable this mode.

This is useful when investigating bugs with this mode like the recent
report from David Miller where lots of unknown symbols appear due to
disabling the events while processing them which disables all record
types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
maps when we lose PERF_RECORD_MMAP records.

This can be easily seen while building a kernel, when there are lots of
short lived processes.

Reported-by: David Miller <davem@davemloft.net>
Acked-by: Kan Liang <kan.liang@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode")
Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf trace: Beautify mount's first pathname arg
Arnaldo Carvalho de Melo [Fri, 26 Oct 2018 16:51:45 +0000 (13:51 -0300)]
perf trace: Beautify mount's first pathname arg

The pathname beautifiers so far support just one augmented pathname per
syscall, so do it just for mount's first arg, later this will get fixed.

With:

  # perf probe -l
  probe:vfs_getname    (on getname_flags:73@acme/git/linux/fs/namei.c with pathname)
  #

Later this will get added to augmented_syscalls.c (eBPF):

In one xterm:

  # perf trace -e mount,umount
  2687.331 ( 3.544 ms): mount/8892 mount(dev_name: /mnt, dir_name: 0x561f9ac184a0, type: 0x561f9ac1b170, flags: BIND) = 0
  3912.126 ( 8.807 ms): umount/8895 umount2(name: /mnt) = 0
  ^C#

In the other:

  $ sudo mount --bind /proc /mnt
  $ sudo umount /mnt

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-qsvhrm2es635cl4zicqjeth2@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf trace: Beautify the umount's 'name' argument
Arnaldo Carvalho de Melo [Fri, 26 Oct 2018 16:23:25 +0000 (13:23 -0300)]
perf trace: Beautify the umount's 'name' argument

By using the SCA_FILENAME beautifier, that works when either the
probe:vfs_getname probe is in place or with the eBPF program
tools/perf/examples/bpf/augmented_syscalls.c:

  # perf probe -l
  probe:vfs_getname (on getname_flags:73@acme/git/linux/fs/namei.c with pathname)
  # perf trace -e umount
  9630.332 ( 9.521 ms): umount/8082 umount2(name: /mnt) = 0
  #

The augmented syscalls one will be done in the next patch.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-hegbzlpd2nrn584l5jxn7sy2@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf trace: Consider syscall aliases too
Arnaldo Carvalho de Melo [Thu, 25 Oct 2018 20:24:45 +0000 (17:24 -0300)]
perf trace: Consider syscall aliases too

When trying to trace the 'umount' syscall on x86_64 I noticed that it
was failing:

  # trace -e umount umount /mnt
  event syntax error: 'umount'
                       \___ parser error
  Run 'perf list' for a list of valid events

   Usage: perf trace [<options>] [<command>]
      or: perf trace [<options>] -- <command> [<options>]
      or: perf trace record [<options>] [<command>]
      or: perf trace record [<options>] -- <command> [<options>]

      -e, --event <event>   event/syscall selector. use 'perf list' to list available events
  #

This is because in the x86-64 we have it just as 'umount2':

  $ grep umount arch/x86/entry/syscalls/syscall_64.tbl
  166 common umount2 __x64_sys_umount
  $

So if the syscall name fails, try fallbacking to looking at the aliases
we have in the syscall_fmts table to then re-lookup, now:

  # trace -e umount umount -f /mnt
  umount: /mnt: not mounted.
     1.759 ( 0.004 ms): umount/18365 umount2(name: 0x55fbfcbc4480, flags: 1) = -1 EINVAL Invalid argument
  #

Time to beautify the flags arg :-)

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-ukweodgzbmjd25lfkgryeft1@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf trace beauty: Beautify mount/umount's 'flags' argument
Arnaldo Carvalho de Melo [Thu, 25 Oct 2018 18:18:06 +0000 (15:18 -0300)]
perf trace beauty: Beautify mount/umount's 'flags' argument

  # trace -e mount mount -o ro -t debugfs nodev /mnt
     0.000 ( 1.040 ms): mount/27235 mount(dev_name: 0x5601cc8c64e0, dir_name: 0x5601cc8c6500, type: 0x5601cc8c6480, flags: RDONLY) = 0
  # trace -e mount mount -o remount,relatime -t debugfs nodev /mnt
     0.000 ( 2.946 ms): mount/27262 mount(dev_name: 0x55f4a73d64e0, dir_name: 0x55f4a73d6500, type: 0x55f4a73d6480, flags: REMOUNT|RELATIME) = 0
  # trace -e mount mount -o remount,strictatime -t debugfs nodev /mnt
     0.000 ( 2.934 ms): mount/27265 mount(dev_name: 0x5617f71d94e0, dir_name: 0x5617f71d9500, type: 0x5617f71d9480, flags: REMOUNT|STRICTATIME) = 0
  # trace -e mount mount -o remount,suid,silent -t debugfs nodev /mnt
     0.000 ( 0.049 ms): mount/27273 mount(dev_name: 0x55ad65df24e0, dir_name: 0x55ad65df2500, type: 0x55ad65df2480, flags: REMOUNT|SILENT) = 0
  # trace -e mount mount -o remount,rw,sync,lazytime -t debugfs nodev /mnt
     0.000 ( 2.684 ms): mount/27281 mount(dev_name: 0x561216055530, dir_name: 0x561216055550, type: 0x561216055510, flags: SYNCHRONOUS|REMOUNT|LAZYTIME) = 0
  # trace -e mount mount -o remount,dirsync -t debugfs nodev /mnt
     0.000 ( 3.512 ms): mount/27314 mount(dev_name: 0x55c4e7188480, dir_name: 0x55c4e7188530, type: 0x55c4e71884a0, flags: REMOUNT|DIRSYNC, data: 0x55c4e71884e0) = 0
  #

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-i5ncao73c0bd02qprgrq6wb9@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf trace beauty: Allow syscalls to mask an argument before considering it
Arnaldo Carvalho de Melo [Thu, 25 Oct 2018 19:09:47 +0000 (16:09 -0300)]
perf trace beauty: Allow syscalls to mask an argument before considering it

Take mount's 'flags' arg, to cope with this semantic, as defined in do_mount in fs/namespace.c:

  /*
   * Pre-0.97 versions of mount() didn't have a flags word.  When the
   * flags word was introduced its top half was required to have the
   * magic value 0xC0ED, and this remained so until 2.4.0-test9.
   * Therefore, if this magic number is present, it carries no
   * information and must be discarded.
   */

We need to mask this arg, and then see if it is zero, when we simply
don't print the arg name and value.

The next patch will use this for mount's 'flag' arg.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-btue14k5jemayuykfrwsnh85@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf beauty: Introduce strarray__scnprintf_flags()
Arnaldo Carvalho de Melo [Thu, 25 Oct 2018 17:21:31 +0000 (14:21 -0300)]
perf beauty: Introduce strarray__scnprintf_flags()

Generalizing pkey_alloc__scnprintf_access_rights(), so that we can use
it with other flags-like arguments, such as mount's mountflags argument.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-o3ymi3104m8moaz9865g09w9@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf beauty: Switch from GPL v2.0 to LGPL v2.1
Arnaldo Carvalho de Melo [Wed, 24 Oct 2018 18:54:23 +0000 (15:54 -0300)]
perf beauty: Switch from GPL v2.0 to LGPL v2.1

The intention is to have this as a library, since it is not perf
specific at all.

I did the switch for the files where I'm the only contributor, with the
exception of a few lines changed by Jiri Olsa.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-a04q6chdyjknm1hr305ulx8h@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf beauty: Add a generator for MS_ mount/umount's flag constants
Arnaldo Carvalho de Melo [Wed, 24 Oct 2018 17:54:55 +0000 (14:54 -0300)]
perf beauty: Add a generator for MS_ mount/umount's flag constants

It'll use tools/include copy of linux/fs.h to generate a table to be
used by tools, initially by the 'mount' and 'umount' beautifiers in
'perf trace', but that could also be used to translate from a string
constant to the integer value to be used in a eBPF or tracefs tracepoint
filter.

When used without any args it produces:

  $ tools/perf/trace/beauty/mount_flags.sh
  static const char *mount_flags[] = {
[1 ? (ilog2(1) + 1) : 0] = "RDONLY",
[2 ? (ilog2(2) + 1) : 0] = "NOSUID",
[4 ? (ilog2(4) + 1) : 0] = "NODEV",
[8 ? (ilog2(8) + 1) : 0] = "NOEXEC",
[16 ? (ilog2(16) + 1) : 0] = "SYNCHRONOUS",
[32 ? (ilog2(32) + 1) : 0] = "REMOUNT",
[64 ? (ilog2(64) + 1) : 0] = "MANDLOCK",
[128 ? (ilog2(128) + 1) : 0] = "DIRSYNC",
[1024 ? (ilog2(1024) + 1) : 0] = "NOATIME",
[2048 ? (ilog2(2048) + 1) : 0] = "NODIRATIME",
[4096 ? (ilog2(4096) + 1) : 0] = "BIND",
[8192 ? (ilog2(8192) + 1) : 0] = "MOVE",
[16384 ? (ilog2(16384) + 1) : 0] = "REC",
[32768 ? (ilog2(32768) + 1) : 0] = "SILENT",
[16 + 1] = "POSIXACL",
[17 + 1] = "UNBINDABLE",
[18 + 1] = "PRIVATE",
[19 + 1] = "SLAVE",
[20 + 1] = "SHARED",
[21 + 1] = "RELATIME",
[22 + 1] = "KERNMOUNT",
[23 + 1] = "I_VERSION",
[24 + 1] = "STRICTATIME",
[25 + 1] = "LAZYTIME",
[26 + 1] = "SUBMOUNT",
[27 + 1] = "NOREMOTELOCK",
[28 + 1] = "NOSEC",
[29 + 1] = "BORN",
[30 + 1] = "ACTIVE",
[31 + 1] = "NOUSER",
  };
  $

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-mgutbbkmip9gfnmd28ikg7xt@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agotools include uapi: Grab a copy of linux/fs.h
Arnaldo Carvalho de Melo [Wed, 24 Oct 2018 17:31:12 +0000 (14:31 -0300)]
tools include uapi: Grab a copy of linux/fs.h

We'll use it to create tables for the 'flags' argument to the 'mount'
and 'umount' syscalls.

Add it to check_headers.sh so that when a new protocol gets added we get
a notification during the build process.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Benjamin Peterson <benjamin@python.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-yacf9jvkwfwg2g95r2us3xb3@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
5 years agoperf/core: Clean up inconsisent indentation
Colin Ian King [Mon, 29 Oct 2018 23:32:11 +0000 (23:32 +0000)]
perf/core: Clean up inconsisent indentation

Replace a bunch of spaces with tab, cleans up indentation

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20181029233211.21475-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
5 years agoMerge branch 'linus' into perf/urgent, to pick up fixes
Ingo Molnar [Mon, 29 Oct 2018 06:20:52 +0000 (07:20 +0100)]
Merge branch 'linus' into perf/urgent, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
5 years agoi2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array
Linus Torvalds [Sat, 27 Oct 2018 16:10:48 +0000 (09:10 -0700)]
i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array

Commit 9ee3e06610fd ("HID: i2c-hid: override HID descriptors for certain
devices") added a new dmi_system_id quirk table to override certain HID
report descriptors for some systems that lack them.

But the table wasn't properly terminated, causing the dmi matching to
walk off into la-la-land, and starting to treat random data as dmi
descriptor pointers, causing boot-time oopses if you were at all
unlucky.

Terminate the array.

We really should have some way to just statically check that arrays that
should be terminated by an empty entry actually are so.  But the HID
people really should have caught this themselves, rather than have me
deal with an oops during the merge window.  Tssk, tssk.

Cc: Julian Sax <jsbc@gmx.de>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sat, 27 Oct 2018 02:33:41 +0000 (19:33 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  hugetlbfs: dirty pages as they are added to pagecache
  mm: export add_swap_extent()
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
  mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
  mm/gup: cache dev_pagemap while pinning pages
  Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  mm: return zero_resv_unavail optimization
  mm: zero remaining unavailable struct pages
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
  mm/gup_benchmark.c: add additional pinning methods
  mm/gup_benchmark.c: time put_page()
  mm: don't raise MEMCG_OOM event due to failed high-order allocation
  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
  ...

5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Sat, 27 Oct 2018 02:25:07 +0000 (19:25 -0700)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:
 "What better way to start off a weekend than with some networking bug
  fixes:

  1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed
     by David Ahern and Bjørn Mork.

  2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE
     properly in UDP, from Sean Tranchetti.

  3) Remove TCA_OPTIONS from policy validation, it turns out we don't
     consistently use nested attributes for this across all packet
     schedulers. From David Ahern.

  4) Fix SKB corruption in cadence driver, from Tristram Ha.

  5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit.

  6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
  net/neigh: fix NULL deref in pneigh_dump_table()
  net: allow traceroute with a specified interface in a vrf
  bridge: do not add port to router list when receives query with source 0.0.0.0
  net/smc: fix smc_buf_unuse to use the lgr pointer
  ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
  net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
  lan743x: Remove SPI dependency from Microchip group.
  drivers: net: remove <net/busy_poll.h> inclusion when not needed
  net: phy: genphy_10g_driver: Avoid NULL pointer dereference
  r8169: fix broken Wake-on-LAN from S5 (poweroff)
  octeontx2-af: Use GFP_ATOMIC under spin lock
  net: ethernet: cadence: fix socket buffer corruption problem
  net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
  net: sched: Remove TCA_OPTIONS from policy
  ice: Poll for link status change
  ice: Allocate VF interrupts and set queue map
  ice: Introduce ice_dev_onetime_setup
  net: hns3: Fix for warning uninitialized symbol hw_err_lst3
  octeontx2-af: Copy the right amount of memory
  net: udp: fix handling of CHECKSUM_COMPLETE packets
  ...

5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Linus Torvalds [Sat, 27 Oct 2018 00:15:49 +0000 (17:15 -0700)]
Merge git://git./linux/kernel/git/davem/sparc

Pull sparc fixes from David Miller:
 "Some more sparc fixups, mostly aimed at getting the allmodconfig build
  up and clean again"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Rework xchg() definition to avoid warnings.
  sparc64: Export __node_distance.
  sparc64: Make corrupted user stacks more debuggable.

5 years agohugetlbfs: dirty pages as they are added to pagecache
Mike Kravetz [Fri, 26 Oct 2018 22:10:58 +0000 (15:10 -0700)]
hugetlbfs: dirty pages as they are added to pagecache

Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
explicit code removes the pages, the appropriate accounting is not
performed.

This can be recreated as follows:
 fallocate -l 2M /dev/hugepages/foo
 echo 1 > /proc/sys/vm/drop_caches
 fallocate -l 2M /dev/hugepages/foo
 grep -i huge /proc/meminfo
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:    2048
   HugePages_Free:     2047
   HugePages_Rsvd:    18446744073709551615
   HugePages_Surp:        0
   Hugepagesize:       2048 kB
   Hugetlb:         4194304 kB
 ls -lsh /dev/hugepages/foo
   4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

To address this issue, dirty pages as they are added to pagecache.  This
can easily be reproduced with fallocate as shown above.  Read faulted
pages will eventually end up being marked dirty.  But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.

Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Mihcla Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: export add_swap_extent()
Omar Sandoval [Fri, 26 Oct 2018 22:10:55 +0000 (15:10 -0700)]
mm: export add_swap_extent()

Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices.  See 35054394c4b3
("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").

However, the swap code has a mechanism for the filesystem to manually add
swap extents using add_swap_extent() from the ->swap_activate() aop.
iomap has done this since 67482129cdab ("iomap: add a swapfile activation
function").  Btrfs will do the same in a later patch, so export
add_swap_extent().

Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Sterba <dsterba@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
Omar Sandoval [Fri, 26 Oct 2018 22:10:51 +0000 (15:10 -0700)]
mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
through the filesystem, and to make swapoff() call ->swap_deactivate().
For Btrfs, we want the latter but not the former, so split this flag into
two.  This makes us always call ->swap_deactivate() if ->swap_activate()
succeeded, not just if it didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
Michael Ellerman [Fri, 26 Oct 2018 22:10:48 +0000 (15:10 -0700)]
tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE

Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann
Horn.  This would have caught the overlap bug reported by Daniel Micay.

I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but
instead of writing a selftest I spent my time bike-shedding whether it
should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW ..  mea culpa.

Link: http://lkml.kernel.org/r/20181013133929.28653-1-mpe@ellerman.id.au
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
Andrea Arcangeli [Fri, 26 Oct 2018 22:10:43 +0000 (15:10 -0700)]
mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()

There should be no cache left by the time we overwrite the old transhuge
pmd with the new one.  It's already too late to flush through the virtual
address because we already copied the page data to the new physical
address.

So flush the cache before the data copy.

Also delete the "end" variable to shutoff a "unused variable" warning on
x86 where flush_cache_range() is a noop.

Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
Andrea Arcangeli [Fri, 26 Oct 2018 22:10:40 +0000 (15:10 -0700)]
mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()

change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
right away.  do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
runs some CPU could still access the page through the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start().  So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the time
migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
invalidated in the secondary MMUs.  It has to or if a secondary MMU can
still write to the page, the migrate_page_copy() would lose data.

However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing the
primary MMU pagetables and only implementing ->invalidate_range:

CPU0 CPU1 GPU sharing linux pagetables using
                                only ->invalidate_range
----------- ------------ ---------
GPU secondary MMU writes to the page
mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
do_huge_pmd_numa_page()
CPU TLB flush globally (1)
CPU cannot write to page
migrate_misplaced_transhuge_page()
GPU writes to the page...
migrate_page_copy()
...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_stop() noop
->invalidate_range()
GPU secondary MMU is invalidated
and cannot write to the page anymore
(too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
too late, we also need a mmu_notifier_invalidate_range() before calling
migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
(3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing the
TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end() into
a mmu_notifier_invalidate_range_end() would not have been enough to fix
this, because it run after migrate_page_copy().

After the hugepage data copy is done migrate_misplaced_transhuge_page()
can proceed and call set_pmd_at without having to flush the TLB nor any
secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
flush, has to happen before the migrate_page_copy() is called or it would
be a bug in the first place (and it was for drivers using
->invalidate_range()).

KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
and all mmu notifiers there before copying the page.

Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
Andrea Arcangeli [Fri, 26 Oct 2018 22:10:36 +0000 (15:10 -0700)]
mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition

Patch series "migrate_misplaced_transhuge_page race conditions".

Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, that was apparently left unfixed.

While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.

These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non zero after MADV_DONTNEED.

I did only minor testing, and I don't expect to be able to reproduce this
(especially the lack of ->invalidate_range before migrate_page_copy,
requires the latest iommu hardware or infiniband to reproduce).  The last
patch is noop for x86 and it needs further review from maintainers of
archs that implement flush_cache_range() (not in CC yet).

To avoid confusion, it's not the first patch that introduces the bug fixed
in the second patch, even before removing the
pmdp_huge_clear_flush_notify, that _notify suffix was called after
migrate_page_copy already run.

This patch (of 3):

This is a corollary of ced108037c2aa ("thp: fix MADV_DONTNEED vs.  numa
balancing race"), 58ceeb6bec8 ("thp: fix MADV_DONTNEED vs.  MADV_FREE
race") and 5b7abeae3af8c ("thp: fix MADV_DONTNEED vs clear soft dirty
race).

When the above three fixes where posted Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.

The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
in a54a407fbf7 ("mm: Close races between THP migration and PMD numa
clearing").

The important part of such commit is only the part where the page lock is
not released until the first do_huge_pmd_numa_page() finished disarming
the pagenuma/protnone.

The addition of pmdp_clear_flush() wasn't beneficial to such commit and
there's no commentary about such an addition either.

I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition found
by Aaron.

At that point in time nobody thought of such kind of MADV_DONTNEED race
conditions yet (they were fixed later) so the code may have looked more
robust by adding the pmdp_clear_flush().

This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be zeroed
out.

This also optimizes the code and removes a superfluous TLB flush.

[akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
Clark Williams [Fri, 26 Oct 2018 22:10:32 +0000 (15:10 -0700)]
mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t

The static lock quarantine_lock is used in quarantine.c to protect the
quarantine queue datastructures.  It is taken inside quarantine queue
manipulation routines (quarantine_put(), quarantine_reduce() and
quarantine_remove_cache()), with IRQs disabled.  This is not a problem on
a stock kernel but is problematic on an RT kernel where spin locks are
sleeping spinlocks, which can sleep and can not be acquired with disabled
interrupts.

Convert the quarantine_lock to a raw spinlock_t.  The usage of
quarantine_lock is confined to quarantine.c and the work performed while
the lock is held is used for debug purpose.

[bigeasy@linutronix.de: slightly altered the commit message]
Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.de
Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/gup: cache dev_pagemap while pinning pages
Keith Busch [Fri, 26 Oct 2018 22:10:28 +0000 (15:10 -0700)]
mm/gup: cache dev_pagemap while pinning pages

Getting pages from ZONE_DEVICE memory needs to check the backing device's
live-ness, which is tracked in the device's dev_pagemap metadata.  This
metadata is stored in a radix tree and looking it up adds measurable
software overhead.

This patch avoids repeating this relatively costly operation when
dev_pagemap is used by caching the last dev_pagemap while getting user
pages.  The gup_benchmark kernel self test reports this reduces time to
get user pages to as low as 1/3 of the previous time.

Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoRevert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
Masayoshi Mizuma [Fri, 26 Oct 2018 22:10:24 +0000 (15:10 -0700)]
Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"

commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") breaks movable_node kernel option because it changed
the memory gap range to reserved memblock.  So, the node is marked as
Normal zone even if the SRAT has Hot pluggable affinity.

    =====================================================================
    kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
    kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
    ...
    kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
    ...
    kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
    kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
    ...
    kernel: Movable zone start for each node
    kernel:  Node 3: 0x00001c0000000000
    kernel: Early memory node ranges
    ...
    =====================================================================

The original issue is fixed by the former patches, so let's revert commit
124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved").

Link: http://lkml.kernel.org/r/20181002143821.5112-4-msys.mizuma@gmail.com
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: return zero_resv_unavail optimization
Pavel Tatashin [Fri, 26 Oct 2018 22:10:21 +0000 (15:10 -0700)]
mm: return zero_resv_unavail optimization

When checking for valid pfns in zero_resv_unavail(), it is not necessary
to verify that pfns within pageblock_nr_pages ranges are valid, only the
first one needs to be checked.  This is because memory for pages are
allocated in contiguous chunks that contain pageblock_nr_pages struct
pages.

Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: zero remaining unavailable struct pages
Naoya Horiguchi [Fri, 26 Oct 2018 22:10:15 +0000 (15:10 -0700)]
mm: zero remaining unavailable struct pages

Patch series "mm: Fix for movable_node boot option", v3.

This patch series contains a fix for the movable_node boot option issue
which was introduced by commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved").

The commit breaks the option because it changed the memory gap range to
reserved memblock.  So, the node is marked as Normal zone even if the SRAT
has Hot pluggable affinity.

First and second patch fix the original issue which the commit tried to
fix, then revert the commit.

This patch (of 3):

There is a kernel panic that is triggered when reading /proc/kpageflags on
the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

  BUG: unable to handle kernel paging request at fffffffffffffffe
  PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
  Oops: 0000 [#1] SMP PTI
  CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
  RIP: 0010:stable_page_flags+0x27/0x3c0
  Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
  RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
  RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
  RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
  R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
  R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
  FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
  Call Trace:
   kpageflags_read+0xc7/0x120
   proc_reg_read+0x3c/0x60
   __vfs_read+0x36/0x170
   vfs_read+0x89/0x130
   ksys_pread64+0x71/0x90
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7efc42e75e23
  Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 which changes how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone.  Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
default (no memmap= given) memblock layout is like below:

  MEMBLOCK configuration:
   memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x4
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
   memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

  MEMBLOCK configuration:
   memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x3
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

This causes shrinking node 0's pfn range because it is calculated by the
address range of memblock.memory.  So some of struct pages in the gap
range are left uninitialized.

We have a function zero_resv_unavail() which does zeroing the struct pages
outside memblock.memory, but currently it covers only the reserved
unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
patch extends it to cover all unavailable range, which fixes the reported
issue.

Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
Keith Busch [Fri, 26 Oct 2018 22:10:12 +0000 (15:10 -0700)]
tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option

Add a new option '-H' to the gup benchmark to help understand how hugetlb
mapping pages compare with the default.

Link: http://lkml.kernel.org/r/20181010195605.10689-6-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
Keith Busch [Fri, 26 Oct 2018 22:10:08 +0000 (15:10 -0700)]
tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option

Add a new benchmark option, -S, to request MAP_SHARED.  This can be used
to compare with MAP_PRIVATE, or for files that require this option, like
dax.

Link: http://lkml.kernel.org/r/20181010195605.10689-5-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/vm/gup_benchmark.c: allow user specified file
Keith Busch [Fri, 26 Oct 2018 22:10:02 +0000 (15:10 -0700)]
tools/testing/selftests/vm/gup_benchmark.c: allow user specified file

Allow a user to specify a file to map by adding a new option, '-f',
providing a means to test various file backings.

If not specified, the benchmark will use a private mapping of /dev/zero,
which produces an anonymous mapping as before.

[akpm@linux-foundation.org: avoid using comma operator]
Link: http://lkml.kernel.org/r/20181010195605.10689-4-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
Keith Busch [Fri, 26 Oct 2018 22:09:59 +0000 (15:09 -0700)]
tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage

If the '-w' parameter was provided, the benchmark would exit due to a
mssing 'break'.

Link: http://lkml.kernel.org/r/20181010195605.10689-3-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/gup_benchmark.c: add additional pinning methods
Keith Busch [Fri, 26 Oct 2018 22:09:56 +0000 (15:09 -0700)]
mm/gup_benchmark.c: add additional pinning methods

Provide new gup benchmark ioctl commands to run different user page
pinning methods, get_user_pages_longterm() and get_user_pages(), in
addition to the existing get_user_pages_fast().

Link: http://lkml.kernel.org/r/20181010195605.10689-2-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/gup_benchmark.c: time put_page()
Keith Busch [Fri, 26 Oct 2018 22:09:52 +0000 (15:09 -0700)]
mm/gup_benchmark.c: time put_page()

We'd like to measure time to unpin user pages, so this adds a second
benchmark timer on put_page, separate from get_page.

Adding the field breaks this ioctl ABI, but should be okay since this an
in-tree kernel selftest.

[akpm@linux-foundation.org: add expansion to struct gup_benchmark for future use]
Link: http://lkml.kernel.org/r/20181010195605.10689-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: don't raise MEMCG_OOM event due to failed high-order allocation
Roman Gushchin [Fri, 26 Oct 2018 22:09:48 +0000 (15:09 -0700)]
mm: don't raise MEMCG_OOM event due to failed high-order allocation

It was reported that on some of our machines containers were restarted
with OOM symptoms without an obvious reason.  Despite there were almost no
memory pressure and plenty of page cache, MEMCG_OOM event was raised
occasionally, causing the container management software to think, that OOM
has happened.  However, no tasks have been killed.

The following investigation showed that the problem is caused by a failing
attempt to charge a high-order page.  In such case, the OOM killer is
never invoked.  As shown below, it can happen under conditions, which are
very far from a real OOM: e.g.  there is plenty of clean page cache and no
memory pressure.

There is no sense in raising an OOM event in this case, as it might
confuse a user and lead to wrong and excessive actions (e.g.  restart the
workload, as in my case).

Let's look at the charging path in try_charge().  If the memory usage is
about memory.max, which is absolutely natural for most memory cgroups, we
try to reclaim some pages.  Even if we were able to reclaim enough memory
for the allocation, the following check can fail due to a race with
another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering
the OOM:

   if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
       goto retry;

But for high-order allocation this condition will intentionally fail.  The
reason behind is that we'll likely fall to regular pages anyway, so it's
ok and even preferred to return ENOMEM.

In this case the idea of raising MEMCG_OOM looks dubious.

Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
order check, so that the event won't be raised for high order allocations.
This change doesn't affect regular pages allocation and charging.

Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
Dave Chinner [Fri, 26 Oct 2018 22:09:45 +0000 (15:09 -0700)]
mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback Process 1 Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0
    ....
    find page 2 dirty
    lock Page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
write()
locks page 1
dirties page 1
locks page 2
dirties page 1
fsync()
....
xfs_vm_writepages
write_cache_pages
  start index 0
  find page 1 towrite
  lock Page 1
  ->writepage
    page 1 writeback
    page 1 clean
    page 1 added to bio
  find page 2 towrite
  lock Page 2
  page 2 is writeback
  <blocks>
write()
locks page 1
dirties page 1
fsync()
....
xfs_vm_writepages
write_cache_pages
  start index 0

    !done && !cycled
      sets index to 0, restarts lookup
    find page 1 dirty
  find page 1 towrite
  lock Page 1
  page 1 is writeback
  <blocks>

    lock Page 1
    <blocks>

DEADLOCK because:

- process 1 needs page 2 writeback to complete to make
  enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2
  can progress and unlock the page it is blocked on, then it
  can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO
  for page 1

The underlying cause of the problem here is that range_cyclic writeback is
processing pages in descending index order as we hold higher index pages
in a structure controlled from above write_cache_pages().  The
write_cache_pages() caller needs to be able to submit these pages for IO
before write_cache_pages restarts writeback at mapping index 0 to avoid
wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private
context held across write_cache_pages() - filesystems using this
infrastructure always submit pages in ->writepage immediately and so there
is no problem with range_cyclic going back to mapping index 0.

However:
mpage_writepages() has a private bio context,
exofs_writepages() has page_collect
fuse_writepages() has fuse_fill_wb_data
nfs_writepages() has nfs_pageio_descriptor
xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback
in their private structures until write_cache_pages() returns, and hence
they are all susceptible to this deadlock.

Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
at the code long enough to understand that it has a similar retry loop for
range_cyclic writeback reaching the end of the file and then promptly ran
away before my eyes bled too much.  I'll leave it for the ext4 developers
to determine if their code is actually has this deadlock and how to fix it
if it has.

There's a few ways I can see avoid this deadlock.  There's probably more,
but these are the first I've though of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently
dirtied lower file offset from starving background writeback of
rarely touched higher file offsets.

#2 is simple, and I don't think it will have any impact on
performance as going back to the start of the file implies an
immediate seek. We'll have exactly the same number of seeks if we
switch writeback to another inode, and then come back to this one
later and restart from index 0.

#2a is pretty much "status quo without the deadlock". Moving the
retry loop up into the wcp caller means we can issue IO on the
pending pages before calling wcp again, and so avoid locking or
waiting on pages in the wrong order. I'm not convinced we need to do
this given that we get the same thing from #2 on the next writeback
call from the writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait
inversion problem, just prevents it from becoming a deadlock
situation. I'd prefer we fix the inversion, not sweep it under the
carpet like this.

#3a is really an optimisation that just so happens to include the
band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement
solution #2

Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: move mirrored memory specific code outside of memmap_init_zone
Pavel Tatashin [Fri, 26 Oct 2018 22:09:40 +0000 (15:09 -0700)]
mm: move mirrored memory specific code outside of memmap_init_zone

memmap_init_zone, is getting complex, because it is called from different
contexts: hotplug, and during boot, and also because it must handle some
architecture quirks.  One of them is mirrored memory.

Move the code that decides whether to skip mirrored memory outside of
memmap_init_zone, into a separate function.

[pasha.tatashin@oracle.com: uninline overlap_memmap_init()]
Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: calculate deferred pages after skipping mirrored memory
Pavel Tatashin [Fri, 26 Oct 2018 22:09:37 +0000 (15:09 -0700)]
mm: calculate deferred pages after skipping mirrored memory

update_defer_init() should be called only when struct page is about to be
initialized. Because it counts number of initialized struct pages, but
there we may skip struct pages if there is some mirrored memory.

So move, update_defer_init() after checking for mirrored memory.

Also, rename update_defer_init() to defer_init() and reverse the return
boolean to emphasize that this is a boolean function, that tells that the
reset of memmap initialization should be deferred.

Make this function self-contained: do not pass number of already
initialized pages in this zone by using static counters.

I found this bug by reading the code.  The effect is that fewer than
expected struct pages are initialized early in boot, and it is possible
that in some corner cases we may fail to boot when mirrored pages are
used.  The deferred on demand code should somewhat mitigate this.  But
this still brings some inconsistencies compared to when booting without
mirrored pages, so it is better to fix.

[pasha.tatashin@oracle.com: add comment about defer_init's lack of locking]
Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: make defer_init non-inline, __meminit]
Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: make memmap_init a proper function
Pavel Tatashin [Fri, 26 Oct 2018 22:09:32 +0000 (15:09 -0700)]
mm: make memmap_init a proper function

memmap_init is sometimes a macro sometimes a function based on
__HAVE_ARCH_MEMMAP_INIT.  It is only a function on ia64.  Make memmap_init
a weak function instead, and let ia64 redefine it.

Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type
Kirill Tkhai [Fri, 26 Oct 2018 22:09:28 +0000 (15:09 -0700)]
mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type

This will allow to use generic refcount_t interfaces to check counters
overflow instead of currently existing VM_BUG_ON().  The only difference
after the patch is VM_BUG_ON() may cause BUG(), while refcount_t fires
with WARN().  But this seems not to be significant here, since such the
problems are usually caught by syzbot with panic-on-warn enabled.

Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/page_alloc.c: initialize num_movable in move_freepages()
David Rientjes [Fri, 26 Oct 2018 22:09:24 +0000 (15:09 -0700)]
mm/page_alloc.c: initialize num_movable in move_freepages()

If move_freepages_block() returns 0 because !zone_spans_pfn(),
*num_movable can hold the value from the stack because it does not get
initialized in move_freepages().

Move the initialization to move_freepages_block() to guarantee the value
actually makes sense.

This currently doesn't affect its only caller where num_movable != NULL,
so no bug fix, but just more robust.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1810051355490.212229@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/zsmalloc.c: fix fall-through annotation
Gustavo A. R. Silva [Fri, 26 Oct 2018 22:09:20 +0000 (15:09 -0700)]
mm/zsmalloc.c: fix fall-through annotation

Replace "fallthru" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough

Link: http://lkml.kernel.org/r/20181003105114.GA24423@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agouserfaultfd: selftest: recycle lock threads first
Peter Xu [Fri, 26 Oct 2018 22:09:17 +0000 (15:09 -0700)]
userfaultfd: selftest: recycle lock threads first

Now we recycle the uffd servicing threads earlier than the lock threads.
It might happen that when the lock thread is still blocked at a pthread
mutex lock while the servicing thread has already quitted for the cpu so
the lock thread will be blocked forever and hang the test program.  To fix
the possible race, recycle the lock threads first.

This never happens with current missing-only tests, but when I start to
run the write-protection tests (the feature is not yet posted upstream) it
happens every time of the run possibly because in that new test we'll need
to service two page faults for each lock operation.

Link: http://lkml.kernel.org/r/20180930074259.18229-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@fb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agouserfaultfd: selftest: generalize read and poll
Peter Xu [Fri, 26 Oct 2018 22:09:13 +0000 (15:09 -0700)]
userfaultfd: selftest: generalize read and poll

We do very similar things in read and poll modes, but we're copying the
codes around.  Share the codes properly on reading the message and
handling the page fault to make the code cleaner.  Meanwhile this solves
previous mismatch of behaviors between the two modes on that the old code:

- did not check EAGAIN case in read() mode
- ignored BOUNCE_VERIFY check in read() mode

Link: http://lkml.kernel.org/r/20180930074259.18229-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@fb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agouserfaultfd: selftest: cleanup help messages
Peter Xu [Fri, 26 Oct 2018 22:09:09 +0000 (15:09 -0700)]
userfaultfd: selftest: cleanup help messages

Firstly, the help in the comment region is obsolete, now we support
three parameters.  Since at it, change it and move it into the help
message of the program.

Also, the help messages dumped here and there is obsolete too.  Use a
single usage() helper.

Link: http://lkml.kernel.org/r/20180930074259.18229-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@fb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmstat.c: assert that vmstat_text is in sync with stat_items_size
Jann Horn [Fri, 26 Oct 2018 22:09:05 +0000 (15:09 -0700)]
mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size

Having two gigantic arrays that must manually be kept in sync, including
ifdefs, isn't exactly robust.  To make it easier to catch such issues in
the future, add a BUILD_BUG_ON().

Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory.c: recheck page table entry with page table lock held
Aneesh Kumar K.V [Fri, 26 Oct 2018 22:09:01 +0000 (15:09 -0700)]
mm/memory.c: recheck page table entry with page table lock held

We clear the pte temporarily during read/modify/write update of the pte.
If we take a page fault while the pte is cleared, the application can get
SIGBUS.  One such case is with remap_pfn_range without a backing
vm_ops->fault callback.  do_fault will return SIGBUS in that case.

cpu 0   cpu1
mprotect()
ptep_modify_prot_start()/pte cleared.
.
. page fault.
.
.
prep_modify_prot_commit()

Fix this by taking page table lock and rechecking for pte_none.

[aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: dax: add comment for PFN_SPECIAL
Yang Shi [Fri, 26 Oct 2018 22:08:57 +0000 (15:08 -0700)]
mm: dax: add comment for PFN_SPECIAL

The comment for PFN_SPECIAL is missed in pfn_t.h. Add comment to get
consistent with other pfn flags.

Link: http://lkml.kernel.org/r/1538086549-100536-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: brk: downgrade mmap_sem to read when shrinking
Yang Shi [Fri, 26 Oct 2018 22:08:54 +0000 (15:08 -0700)]
mm: brk: downgrade mmap_sem to read when shrinking

brk might be used to shrink memory mapping too other than munmap().  So,
it may hold write mmap_sem for long time when shrinking large mapping, as
what commit ("mm: mmap: zap pages with read mmap_sem in munmap")
described.

The brk() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case.  But, it may set mm->brk after __do_munmap(),
which needs hold write mmap_sem.

However, a simple trick can workaround this by setting mm->brk before
__do_munmap().  Then restore the original value if __do_munmap() fails.
With this trick, it is safe to downgrade to read mmap_sem.

So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.

The period of holding exclusive mmap_sem for shrinking large mapping would
be reduced significantly with this optimization.

[akpm@linux-foundation.org: tweak comment]
[yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
Link: http://lkml.kernel.org/r/1538687672-17795-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1538067582-60038-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: mremap: downgrade mmap_sem to read when shrinking
Yang Shi [Fri, 26 Oct 2018 22:08:50 +0000 (15:08 -0700)]
mm: mremap: downgrade mmap_sem to read when shrinking

Other than munmap, mremap might be used to shrink memory mapping too.
So, it may hold write mmap_sem for long time when shrinking large
mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in
munmap") described.

The mremap() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case, so it is safe to downgrade to read mmap_sem.

So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.

The period of holding exclusive mmap_sem for shrinking large mapping
would be reduced significantly with this optimization.

MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adopt this
optimization since they need manipulate vmas after do_munmap(),
downgrading mmap_sem may create race window.

Simple mapping shrink is the low hanging fruit, and it may cover the
most cases of unmap with munmap together.

[akpm@linux-foundation.org: tweak comment]
[yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/filemap.c: use vmf_error()
Souptick Joarder [Fri, 26 Oct 2018 22:08:47 +0000 (15:08 -0700)]
mm/filemap.c: use vmf_error()

These codes can be replaced with new inline vmf_error().

Link: http://lkml.kernel.org/r/20180927171411.GA23331@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_ptep_get
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:43 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_ptep_get

ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use the same
version of huge_ptep_get, so move this generic implementation into
asm-generic/hugetlb.h.

[arnd@arndb.de: fix ARM 3level page tables]
Link: http://lkml.kernel.org/r/20181005161722.904274-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20180920060358.16606-12-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_ptep_set_access_flags()
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:39 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_ptep_set_access_flags()

arm, ia64, sh, x86 architectures use the same version
of huge_ptep_set_access_flags, so move this generic implementation
into asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-11-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_ptep_set_wrprotect()
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:35 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_ptep_set_wrprotect()

arm, ia64, mips, powerpc, sh, x86 architectures use the same version of
huge_ptep_set_wrprotect, so move this generic implementation into
asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-10-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of prepare_hugepage_range
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:31 +0000 (15:08 -0700)]
hugetlb: introduce generic version of prepare_hugepage_range

arm, arm64, powerpc, sparc, x86 architectures use the same version of
prepare_hugepage_range, so move this generic implementation into
asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-9-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_pte_wrprotect
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:26 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_pte_wrprotect

arm, arm64, ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use
the same version of huge_pte_wrprotect, so move this generic
implementation into asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-8-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_pte_none()
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:22 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_pte_none()

arm, arm64, ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use
the same version of huge_pte_none, so move this generic implementation
into asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-7-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_ptep_clear_flush
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:17 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_ptep_clear_flush

arm, x86 architectures use the same version of huge_ptep_clear_flush, so
move this generic implementation into asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-6-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of huge_ptep_get_and_clear()
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:12 +0000 (15:08 -0700)]
hugetlb: introduce generic version of huge_ptep_get_and_clear()

arm, ia64, sh, x86 architectures use the same version of
huge_ptep_get_and_clear, so move this generic implementation into
asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of set_huge_pte_at()
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:07 +0000 (15:08 -0700)]
hugetlb: introduce generic version of set_huge_pte_at()

arm, ia64, mips, powerpc, sh, x86 architectures use the same version of
set_huge_pte_at, so move this generic implementation into
asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-4-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: introduce generic version of hugetlb_free_pgd_range
Alexandre Ghiti [Fri, 26 Oct 2018 22:08:03 +0000 (15:08 -0700)]
hugetlb: introduce generic version of hugetlb_free_pgd_range

arm, arm64, mips, parisc, sh, x86 architectures use the same version of
hugetlb_free_pgd_range, so move this generic implementation into
asm-generic/hugetlb.h.

Link: http://lkml.kernel.org/r/20180920060358.16606-3-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlb: harmonize hugetlb.h arch specific defines with pgtable.h
Alexandre Ghiti [Fri, 26 Oct 2018 22:07:59 +0000 (15:07 -0700)]
hugetlb: harmonize hugetlb.h arch specific defines with pgtable.h

In order to reduce copy/paste of functions across architectures and then
make riscv hugetlb port (and future ports) simpler and smaller, this
patchset intends to factorize the numerous hugetlb primitives that are
defined across all the architectures.

Except for prepare_hugepage_range, this patchset moves the versions that
are just pass-through to standard pte primitives into
asm-generic/hugetlb.h by using the same #ifdef semantic that can be found
in asm-generic/pgtable.h, i.e.  __HAVE_ARCH_***.

s390 architecture has not been tackled in this serie since it does not use
asm-generic/hugetlb.h at all.

This patchset has been compiled on all addressed architectures with
success (except for parisc, but the problem does not come from this
series).

This patch (of 11):

asm-generic/hugetlb.h proposes generic implementations of hugetlb related
functions: use __HAVE_ARCH_HUGE* defines in order to make arch specific
implementations of hugetlb functions consistent with pgtable.h scheme.

Link: http://lkml.kernel.org/r/20180920060358.16606-2-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org> [x86]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: remove unnecessary local variable addr in __get_user_pages_fast()
Wei Yang [Fri, 26 Oct 2018 22:07:55 +0000 (15:07 -0700)]
mm: remove unnecessary local variable addr in __get_user_pages_fast()

The local variable `addr' in __get_user_pages_fast() is just a shadow of
`start'.  Since `start' never changes after assignment to `addr', it is
fine to replace `start' with it.

Also the meaning of [start, end] is more obvious than [addr, end] when
passed to gup_pgd_range().

Link: http://lkml.kernel.org/r/20180925021448.20265-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: defer ZONE_DEVICE page initialization to the point where we init pgmap
Alexander Duyck [Fri, 26 Oct 2018 22:07:52 +0000 (15:07 -0700)]
mm: defer ZONE_DEVICE page initialization to the point where we init pgmap

The ZONE_DEVICE pages were being initialized in two locations.  One was
with the memory_hotplug lock held and another was outside of that lock.
The problem with this is that it was nearly doubling the memory
initialization time.  Instead of doing this twice, once while holding a
global lock and once without, I am opting to defer the initialization to
the one outside of the lock.  This allows us to avoid serializing the
overhead for memory init and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmmem_pages_create were initializing only the pgmap field the same
way.  One wasn't initializing hmm_data, and the other was initializing it
to a poison value.  Since this is something that is exposed to the driver
in the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

[alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages]
Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: create non-atomic version of SetPageReserved for init use
Alexander Duyck [Fri, 26 Oct 2018 22:07:48 +0000 (15:07 -0700)]
mm: create non-atomic version of SetPageReserved for init use

It doesn't make much sense to use the atomic SetPageReserved at init time
when we are using memset to clear the memory and manipulating the page
flags via simple "&=" and "|=" operations in __init_single_page.

This patch adds a non-atomic version __SetPageReserved that can be used
during page init and shows about a 10% improvement in initialization times
on the systems I have available for testing.  On those systems I saw
initialization times drop from around 35 seconds to around 32 seconds to
initialize a 3TB block of persistent memory.  I believe the main advantage
of this is that it allows for more compiler optimization as the __set_bit
operation can be reordered whereas the atomic version cannot.

I tried adding a bit of documentation based on f1dd2cd13c4 ("mm,
memory_hotplug: do not associate hotadded memory to zones until online").

Ideally the reserved flag should be set earlier since there is a brief
window where the page is initialization via __init_single_page and we have
not set the PG_Reserved flag.  I'm leaving that for a future patch set as
that will require a more significant refactor.

Link: http://lkml.kernel.org/r/20180925202018.3576.11607.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: provide kernel parameter to allow disabling page init poisoning
Alexander Duyck [Fri, 26 Oct 2018 22:07:45 +0000 (15:07 -0700)]
mm: provide kernel parameter to allow disabling page init poisoning

Patch series "Address issues slowing persistent memory initialization", v5.

The main thing this patch set achieves is that it allows us to initialize
each node worth of persistent memory independently.  As a result we reduce
page init time by about 2 minutes because instead of taking 30 to 40
seconds per node and going through each node one at a time, we process all
4 nodes in parallel in the case of a 12TB persistent memory setup spread
evenly over 4 nodes.

This patch (of 3):

On systems with a large amount of memory it can take a significant amount
of time to initialize all of the page structs with the PAGE_POISON_PATTERN
value.  I have seen it take over 2 minutes to initialize a system with
over 12TB of RAM.

In order to work around the issue I had to disable CONFIG_DEBUG_VM and
then the boot time returned to something much more reasonable as the
arch_add_memory call completed in milliseconds versus seconds.  However in
doing that I had to disable all of the other VM debugging on the system.

In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
on a system that has a large amount of memory I have added a new kernel
parameter named "vm_debug" that can be set to "-" in order to disable it.

Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomemcg: remove memcg_kmem_skip_account
Shakeel Butt [Fri, 26 Oct 2018 22:07:41 +0000 (15:07 -0700)]
memcg: remove memcg_kmem_skip_account

The flag memcg_kmem_skip_account was added during the era of opt-out kmem
accounting.  There is no need for such flag in the opt-in world as there
aren't any __GFP_ACCOUNT allocations within memcg_create_cache_enqueue().

Link: http://lkml.kernel.org/r/20180919004501.178023-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug.c: clean up node_states_check_changes_offline()
Oscar Salvador [Fri, 26 Oct 2018 22:07:38 +0000 (15:07 -0700)]
mm/memory_hotplug.c: clean up node_states_check_changes_offline()

This patch, as the previous one, gets rid of the wrong if statements.
While at it, I realized that the comments are sometimes very confusing,
to say the least, and wrong.
For example:

___
zone_last = ZONE_MOVABLE;

/*
 * check whether node_states[N_HIGH_MEMORY] will be changed
 * If we try to offline the last present @nr_pages from the node,
 * we can determind we will need to clear the node from
 * node_states[N_HIGH_MEMORY].
 */

for (; zt <= zone_last; zt++)
        present_pages += pgdat->node_zones[zt].present_pages;
if (nr_pages >= present_pages)
        arg->status_change_nid = zone_to_nid(zone);
else
        arg->status_change_nid = -1;
___

In case the node gets empry, it must be removed from N_MEMORY.  We already
check N_HIGH_MEMORY a bit above within the CONFIG_HIGHMEM ifdef code.  Not
to say that status_change_nid is for N_MEMORY, and not for N_HIGH_MEMORY.

So I re-wrote some of the comments to what I think is better.

[osalvador@suse.de: address feedback from Pavel]
Link: http://lkml.kernel.org/r/20180921132634.10103-5-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180919100819.25518-6-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug.c: simplify node_states_check_changes_online
Oscar Salvador [Fri, 26 Oct 2018 22:07:34 +0000 (15:07 -0700)]
mm/memory_hotplug.c: simplify node_states_check_changes_online

While looking at node_states_check_changes_online, I stumbled upon some
confusing things.

Right after entering the function, we find this:

if (N_MEMORY == N_NORMAL_MEMORY)
        zone_last = ZONE_MOVABLE;

This is wrong.
N_MEMORY cannot really be equal to N_NORMAL_MEMORY.
My guess is that this wanted to be something like:

if (N_NORMAL_MEMORY == N_HIGH_MEMORY)

to check if we have CONFIG_HIGHMEM.

Later on, in the CONFIG_HIGHMEM block, we have:

if (N_MEMORY == N_HIGH_MEMORY)
        zone_last = ZONE_MOVABLE;

Again, this is wrong, and will never be evaluated to true.

Besides removing these wrong if statements, I simplified the function a
bit.

[osalvador@suse.de: address feedback from Pavel]
Link: http://lkml.kernel.org/r/20180921132634.10103-4-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180919100819.25518-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug.c: tidy up node_states_clear_node()
Oscar Salvador [Fri, 26 Oct 2018 22:07:28 +0000 (15:07 -0700)]
mm/memory_hotplug.c: tidy up node_states_clear_node()

node_states_clear has the following if statements:

if ((N_MEMORY != N_NORMAL_MEMORY) &&
    (arg->status_change_nid_high >= 0))
        ...

if ((N_MEMORY != N_HIGH_MEMORY) &&
    (arg->status_change_nid >= 0))
        ...

N_MEMORY can never be equal to neither N_NORMAL_MEMORY nor
N_HIGH_MEMORY.

Similar problem was found in [1].
Since this is wrong, let us get rid of it.

[1] https://patchwork.kernel.org/patch/10579155/

Link: http://lkml.kernel.org/r/20180919100819.25518-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug.c: spare unnecessary calls to node_set_state
Oscar Salvador [Fri, 26 Oct 2018 22:07:25 +0000 (15:07 -0700)]
mm/memory_hotplug.c: spare unnecessary calls to node_set_state

In node_states_check_changes_online, we check if the node will have to be
set for any of the N_*_MEMORY states after the pages have been onlined.

Later on, we perform the activation in node_states_set_node.  Currently,
in node_states_set_node we set the node to N_MEMORY unconditionally.

This means that we call node_set_state for N_MEMORY every time pages go
online, but we only need to do it if the node has not yet been set for
N_MEMORY.

Fix this by checking status_change_nid.

Link: http://lkml.kernel.org/r/20180919100819.25518-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/filemap.c: Use existing variable
haiqing.shq [Fri, 26 Oct 2018 22:07:22 +0000 (15:07 -0700)]
mm/filemap.c: Use existing variable

Use the variable write_len instead of ov_iter_count(from).

Link: http://lkml.kernel.org/r/1537375855-2088-1-git-send-email-leviathan0992@gmail.com
Signed-off-by: haiqing.shq <leviathan0992@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: unmap VM_PFNMAP mappings with optimized path
Yang Shi [Fri, 26 Oct 2018 22:07:18 +0000 (15:07 -0700)]
mm: unmap VM_PFNMAP mappings with optimized path

When unmapping VM_PFNMAP mappings, vm flags need to be updated.  Since the
vmas have been detached, so it sounds safe to update vm flags with read
mmap_sem.

Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: unmap VM_HUGETLB mappings with optimized path
Yang Shi [Fri, 26 Oct 2018 22:07:15 +0000 (15:07 -0700)]
mm: unmap VM_HUGETLB mappings with optimized path

When unmapping VM_HUGETLB mappings, vm flags need to be updated.  Since
the vmas have been detached, so it sounds safe to update vm flags with
read mmap_sem.

Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: mmap: zap pages with read mmap_sem in munmap
Yang Shi [Fri, 26 Oct 2018 22:07:11 +0000 (15:07 -0700)]
mm: mmap: zap pages with read mmap_sem in munmap

Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.

Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot great
feedback and suggestions.

Then this topic was discussed on LSFMM summit 2018.  In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach.  Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)

Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.

But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
  * The unexpected state from PF if it wins the race in the middle of munmap.
    It may return zero page, instead of the content or SIGSEGV.
  * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
    is a showstopper from akpm

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings.  They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.  So, uprobe unmap will not be handled by the regular
path.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle.  When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does.  Then re-acquire write mmap_sem to
cleanup vmas.

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings.  They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.

munmap_test-15002 [008]   594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us
|    unmap_region();
munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us
|  }

Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agovfree: add debug might_sleep()
Andrey Ryabinin [Fri, 26 Oct 2018 22:07:07 +0000 (15:07 -0700)]
vfree: add debug might_sleep()

Add might_sleep() call to vfree() to catch potential sleep-in-atomic bugs
earlier.

[aryabinin@virtuozzo.com: drop might_sleep_if() from kvfree()]
Link: http://lkml.kernel.org/r/7e19e4df-b1a6-29bd-9ae7-0266d50bef1d@virtuozzo.com
Link: http://lkml.kernel.org/r/20180914130512.10394-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmalloc.c: improve vfree() kerneldoc
Andrey Ryabinin [Fri, 26 Oct 2018 22:07:03 +0000 (15:07 -0700)]
mm/vmalloc.c: improve vfree() kerneldoc

vfree() might sleep if called not in interrupt context.  Explain that in
the comment.

Link: http://lkml.kernel.org/r/20180914130512.10394-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokvfree(): fix misleading comment
Andrey Ryabinin [Fri, 26 Oct 2018 22:07:00 +0000 (15:07 -0700)]
kvfree(): fix misleading comment

vfree() might sleep if called not in interrupt context.  So does kvfree()
too.  Fix misleading kvfree()'s comment about allowed context.

Link: http://lkml.kernel.org/r/20180914130512.10394-1-aryabinin@virtuozzo.com
Fixes: 04b8e946075d ("mm/util.c: improve kvfree() kerneldoc")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/mempolicy.c: use match_string() helper to simplify the code
zhong jiang [Fri, 26 Oct 2018 22:06:57 +0000 (15:06 -0700)]
mm/mempolicy.c: use match_string() helper to simplify the code

match_string() returns the index of an array for a matching string, which
can be used intead of open coded implementation.

Link: http://lkml.kernel.org/r/1536988365-50310-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/swap.c: remove duplicated include
YueHaibing [Fri, 26 Oct 2018 22:06:53 +0000 (15:06 -0700)]
mm/swap.c: remove duplicated include

Remove duplicated include linux/memremap.h

Link: http://lkml.kernel.org/r/20180917131308.16420-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, page_alloc: drop should_suppress_show_mem
Michal Hocko [Fri, 26 Oct 2018 22:06:49 +0000 (15:06 -0700)]
mm, page_alloc: drop should_suppress_show_mem

should_suppress_show_mem() was introduced to reduce the overhead of
show_mem on large NUMA systems.  Things have changed since then though.
Namely c78e93630d15 ("mm: do not walk all of system memory during
show_mem") has reduced the overhead considerably.

Moreover warn_alloc_show_mem clears SHOW_MEM_FILTER_NODES when called from
the IRQ context already so we are not printing per node stats.

Remove should_suppress_show_mem because we are losing potentially
interesting information about allocation failures.  We have seen a bug
report where system gets unresponsive under memory pressure and there is
only

kernel: [2032243.696888] qlge 0000:8b:00.1 ql1: Could not get a page chunk, i=8, clean_idx =200 .
kernel: [2032243.710725] swapper/7: page allocation failure: order:1, mode:0x1084120(GFP_ATOMIC|__GFP_COLD|__GFP_COMP)

without an additional information for debugging.  It would be great to see
the state of the page allocator at the moment.

Link: http://lkml.kernel.org/r/20180907114334.7088-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memcontrol.c: fix memory.stat item ordering
Johannes Weiner [Fri, 26 Oct 2018 22:06:45 +0000 (15:06 -0700)]
mm/memcontrol.c: fix memory.stat item ordering

The refault stats go better with the page fault stats, and are of
higher interest than the stats on LRU operations. In fact they used to
be grouped together; when the LRU operation stats were added later on,
they were wedged in between.

Move them back together. Documentation/admin-guide/cgroup-v2.rst
already lists them in the right order.

Link: http://lkml.kernel.org/r/20181010140239.GA2527@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: zero-seek shrinkers
Johannes Weiner [Fri, 26 Oct 2018 22:06:42 +0000 (15:06 -0700)]
mm: zero-seek shrinkers

The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.

Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.

This is a bogus setting.  A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.

In most cases, the behavior in practice is still fine.  Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems.  But there are scenarios where
that's not true.

Our database workloads suffer from two of those.  For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it.  The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting.  This is a huge waste of memory.

These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.

This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches.  This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.

It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.

Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>