OSDN Git Service

tomoyo/tomoyo-test1.git
4 years agoMerge branch 'bpf_enable_stats'
Alexei Starovoitov [Fri, 1 May 2020 17:36:32 +0000 (10:36 -0700)]
Merge branch 'bpf_enable_stats'

Song Liu says:

====================
run_time_ns is a useful stats for BPF programs. However, it is gated by
sysctl kernel.bpf_stats_enabled. When multiple user space tools are
toggling kernl.bpf_stats_enabled at the same time, they may confuse each
other.

Solve this problem with a new BPF command BPF_ENABLE_STATS.

Changes v8 => v9:
  1. Clean up in selftest (Andrii).
  2. Not using static variable in test program (Andrii).

Changes v7 => v8:
  1. Change name BPF_STATS_RUNTIME_CNT => BPF_STATS_RUN_TIME (Alexei).
  2. Add CHECK_ATTR to bpf_enable_stats() (Alexei).
  3. Rebase (Andrii).
  4. Simplfy the selftest (Alexei).

Changes v6 => v7:
  1. Add test to verify run_cnt matches count measured by the program.

Changes v5 => v6:
  1. Simplify test program (Yonghong).
  2. Rebase (with some conflicts).

Changes v4 => v5:
  1. Use memset to zero bpf_attr in bpf_enable_stats() (Andrii).

Changes v3 => v4:
  1. Add libbpf support and selftest;
  2. Avoid cleaning trailing space.

Changes v2 => v3:
  1. Rename the command to BPF_ENABLE_STATS, and make it extendible.
  2. fix commit log;
  3. remove unnecessary headers.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agobpf: Add selftest for BPF_ENABLE_STATS
Song Liu [Thu, 30 Apr 2020 07:15:06 +0000 (00:15 -0700)]
bpf: Add selftest for BPF_ENABLE_STATS

Add test for BPF_ENABLE_STATS, which should enable run_time_ns stats.

~/selftests/bpf# ./test_progs -t enable_stats  -v
test_enable_stats:PASS:skel_open_and_load 0 nsec
test_enable_stats:PASS:get_stats_fd 0 nsec
test_enable_stats:PASS:attach_raw_tp 0 nsec
test_enable_stats:PASS:get_prog_info 0 nsec
test_enable_stats:PASS:check_stats_enabled 0 nsec
test_enable_stats:PASS:check_run_cnt_valid 0 nsec
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200430071506.1408910-4-songliubraving@fb.com
4 years agolibbpf: Add support for command BPF_ENABLE_STATS
Song Liu [Thu, 30 Apr 2020 07:15:05 +0000 (00:15 -0700)]
libbpf: Add support for command BPF_ENABLE_STATS

bpf_enable_stats() is added to enable given stats.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200430071506.1408910-3-songliubraving@fb.com
4 years agobpf: Sharing bpf runtime stats with BPF_ENABLE_STATS
Song Liu [Thu, 30 Apr 2020 07:15:04 +0000 (00:15 -0700)]
bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS

Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
Typical userspace tools use kernel.bpf_stats_enabled as follows:

  1. Enable kernel.bpf_stats_enabled;
  2. Check program run_time_ns;
  3. Sleep for the monitoring period;
  4. Check program run_time_ns again, calculate the difference;
  5. Disable kernel.bpf_stats_enabled.

The problem with this approach is that only one userspace tool can toggle
this sysctl. If multiple tools toggle the sysctl at the same time, the
measurement may be inaccurate.

To fix this problem while keep backward compatibility, introduce a new
bpf command BPF_ENABLE_STATS. On success, this command enables stats and
returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
command to support other types of stats in the future.

With BPF_ENABLE_STATS, user space tool would have the following flow:

  1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
  2. Check program run_time_ns;
  3. Sleep for the monitoring period;
  4. Check program run_time_ns again, calculate the difference;
  5. Close the fd.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
4 years agoselftests/bpf: Test allowed maps for bpf_sk_select_reuseport
Jakub Sitnicki [Thu, 30 Apr 2020 10:47:38 +0000 (12:47 +0200)]
selftests/bpf: Test allowed maps for bpf_sk_select_reuseport

Check that verifier allows passing a map of type:

 BPF_MAP_TYPE_REUSEPORT_SOCKARRARY, or
 BPF_MAP_TYPE_SOCKMAP, or
 BPF_MAP_TYPE_SOCKHASH

... to bpf_sk_select_reuseport helper.

Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200430104738.494180-1-jakub@cloudflare.com
4 years agolibbpf: Fix false uninitialized variable warning
Andrii Nakryiko [Thu, 30 Apr 2020 02:14:36 +0000 (19:14 -0700)]
libbpf: Fix false uninitialized variable warning

Some versions of GCC falsely detect that vi might not be initialized. That's
not true, but let's silence it with NULL initialization.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200430021436.1522502-1-andriin@fb.com
4 years agobpf, riscv: Fix stack layout of JITed code on RV32
Luke Nelson [Thu, 30 Apr 2020 00:51:27 +0000 (17:51 -0700)]
bpf, riscv: Fix stack layout of JITed code on RV32

This patch fixes issues with stackframe unwinding and alignment in the
current stack layout for BPF programs on RV32.

In the current layout, RV32 fp points to the JIT scratch registers, rather
than to the callee-saved registers. This breaks stackframe unwinding,
which expects fp to point just above the saved ra and fp registers.

This patch fixes the issue by moving the callee-saved registers to be
stored on the top of the stack, pointed to by fp. This satisfies the
assumptions of stackframe unwinding.

This patch also fixes an issue with the old layout that the stack was
not aligned to 16 bytes.

Stacktrace from JITed code using the old stack layout:

  [   12.196249 ] [<c0402200>] walk_stackframe+0x0/0x96

Stacktrace using the new stack layout:

  [   13.062888 ] [<c0402200>] walk_stackframe+0x0/0x96
  [   13.063028 ] [<c04023c6>] show_stack+0x28/0x32
  [   13.063253 ] [<a403e778>] bpf_prog_82b916b2dfa00464+0x80/0x908
  [   13.063417 ] [<c09270b2>] bpf_test_run+0x124/0x39a
  [   13.063553 ] [<c09276c0>] bpf_prog_test_run_skb+0x234/0x448
  [   13.063704 ] [<c048510e>] __do_sys_bpf+0x766/0x13b4
  [   13.063840 ] [<c0485d82>] sys_bpf+0xc/0x14
  [   13.063961 ] [<c04010f0>] ret_from_syscall+0x0/0x2

The new code is also simpler to understand and includes an ASCII diagram
of the stack layout.

Tested on riscv32 QEMU virt machine.

Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Xi Wang <xi.wang@gmail.com>
Link: https://lore.kernel.org/bpf/20200430005127.2205-1-luke.r.nels@gmail.com
4 years agobpf: Fix unused variable warning
Arnd Bergmann [Wed, 29 Apr 2020 13:21:58 +0000 (15:21 +0200)]
bpf: Fix unused variable warning

Hiding the only using of bpf_link_type_strs[] in an #ifdef causes
an unused-variable warning:

kernel/bpf/syscall.c:2280:20: error: 'bpf_link_type_strs' defined but not used [-Werror=unused-variable]
 2280 | static const char *bpf_link_type_strs[] = {

Move the definition into the same #ifdef.

Fixes: f2e10bff16a0 ("bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200429132217.1294289-1-arnd@arndb.de
4 years agoselftests/bpf: Use SOCKMAP for server sockets in bpf_sk_assign test
Jakub Sitnicki [Wed, 29 Apr 2020 18:11:54 +0000 (20:11 +0200)]
selftests/bpf: Use SOCKMAP for server sockets in bpf_sk_assign test

Update bpf_sk_assign test to fetch the server socket from SOCKMAP, now that
map lookup from BPF in SOCKMAP is enabled. This way the test TC BPF program
doesn't need to know what address server socket is bound to.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429181154.479310-4-jakub@cloudflare.com
4 years agoselftests/bpf: Test that lookup on SOCKMAP/SOCKHASH is allowed
Jakub Sitnicki [Wed, 29 Apr 2020 18:11:53 +0000 (20:11 +0200)]
selftests/bpf: Test that lookup on SOCKMAP/SOCKHASH is allowed

Now that bpf_map_lookup_elem() is white-listed for SOCKMAP/SOCKHASH,
replace the tests which check that verifier prevents lookup on these map
types with ones that ensure that lookup operation is permitted, but only
with a release of acquired socket reference.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429181154.479310-3-jakub@cloudflare.com
4 years agobpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH
Jakub Sitnicki [Wed, 29 Apr 2020 18:11:52 +0000 (20:11 +0200)]
bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH

White-list map lookup for SOCKMAP/SOCKHASH from BPF. Lookup returns a
pointer to a full socket and acquires a reference if necessary.

To support it we need to extend the verifier to know that:

 (1) register storing the lookup result holds a pointer to socket, if
     lookup was done on SOCKMAP/SOCKHASH, and that

 (2) map lookup on SOCKMAP/SOCKHASH is a reference acquiring operation,
     which needs a corresponding reference release with bpf_sk_release.

On sock_map side, lookup handlers exposed via bpf_map_ops now bump
sk_refcnt if socket is reference counted. In turn, bpf_sk_select_reuseport,
the only in-kernel user of SOCKMAP/SOCKHASH ops->map_lookup_elem, was
updated to release the reference.

Sockets fetched from a map can be used in the same way as ones returned by
BPF socket lookup helpers, such as bpf_sk_lookup_tcp. In particular, they
can be used with bpf_sk_assign to direct packets toward a socket on TC
ingress path.

Suggested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com
4 years agotools: bpftool: Make libcap dependency optional
Quentin Monnet [Wed, 29 Apr 2020 14:45:06 +0000 (15:45 +0100)]
tools: bpftool: Make libcap dependency optional

The new libcap dependency is not used for an essential feature of
bpftool, and we could imagine building the tool without checks on
CAP_SYS_ADMIN by disabling probing features as an unprivileged users.

Make it so, in order to avoid a hard dependency on libcap, and to ease
packaging/embedding of bpftool.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429144506.8999-4-quentin@isovalent.com
4 years agotools: bpftool: Allow unprivileged users to probe features
Quentin Monnet [Wed, 29 Apr 2020 14:45:05 +0000 (15:45 +0100)]
tools: bpftool: Allow unprivileged users to probe features

There is demand for a way to identify what BPF helper functions are
available to unprivileged users. To do so, allow unprivileged users to
run "bpftool feature probe" to list BPF-related features. This will only
show features accessible to those users, and may not reflect the full
list of features available (to administrators) on the system.

To avoid the case where bpftool is inadvertently run as non-root and
would list only a subset of the features supported by the system when it
would be expected to list all of them, running as unprivileged is gated
behind the "unprivileged" keyword passed to the command line. When used
by a privileged user, this keyword allows to drop the CAP_SYS_ADMIN and
to list the features available to unprivileged users. Note that this
addsd a dependency on libpcap for compiling bpftool.

Note that there is no particular reason why the probes were restricted
to root, other than the fact I did not need them for unprivileged and
did not bother with the additional checks at the time probes were added.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429144506.8999-3-quentin@isovalent.com
4 years agotools: bpftool: For "feature probe" define "full_mode" bool as global
Quentin Monnet [Wed, 29 Apr 2020 14:45:04 +0000 (15:45 +0100)]
tools: bpftool: For "feature probe" define "full_mode" bool as global

The "full_mode" variable used for switching between full or partial
feature probing (i.e. with or without probing helpers that will log
warnings in kernel logs) was piped from the main do_probe() function
down to probe_helpers_for_progtype(), where it is needed.

Define it as a global variable: the calls will be more readable, and if
other similar flags were to be used in the future, we could use global
variables as well instead of extending again the list of arguments with
new flags.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429144506.8999-2-quentin@isovalent.com
4 years agoMerge branch 'test_progs-asan'
Alexei Starovoitov [Wed, 29 Apr 2020 02:48:06 +0000 (19:48 -0700)]
Merge branch 'test_progs-asan'

Andrii Nakryiko says:

====================
Add necessary infra to build selftests with ASAN (or any other sanitizer). Fix
a bunch of found memory leaks and other memory access issues.

v1->v2:
  - don't add ASAN flavor, but allow extra flags for build (Alexei);
  - fix few more found issues, which somehow were missed first time.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agoselftests/bpf: Add runqslower binary to .gitignore
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:11 +0000 (18:21 -0700)]
selftests/bpf: Add runqslower binary to .gitignore

With recent changes, runqslower is being copied into selftests/bpf root
directory. So add it into .gitignore.

Fixes: b26d1e2b6028 ("selftests/bpf: Copy runqslower to OUTPUT directory")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Veronika Kabatova <vkabatov@redhat.com>
Link: https://lore.kernel.org/bpf/20200429012111.277390-12-andriin@fb.com
4 years agoselftests/bpf: Fix bpf_link leak in ns_current_pid_tgid selftest
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:10 +0000 (18:21 -0700)]
selftests/bpf: Fix bpf_link leak in ns_current_pid_tgid selftest

If condition is inverted, but it's also just not necessary.

Fixes: 1c1052e0140a ("tools/testing/selftests/bpf: Add self-tests for new helper bpf_get_ns_current_pid_tgid.")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Carlos Neira <cneirabustos@gmail.com>
Link: https://lore.kernel.org/bpf/20200429012111.277390-11-andriin@fb.com
4 years agoselftests/bpf: Disable ASAN instrumentation for mmap()'ed memory read
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:09 +0000 (18:21 -0700)]
selftests/bpf: Disable ASAN instrumentation for mmap()'ed memory read

AddressSanitizer assumes that all memory dereferences are done against memory
allocated by sanitizer's malloc()/free() code and not touched by anyone else.
Seems like this doesn't hold for perf buffer memory. Disable instrumentation
on perf buffer callback function.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-10-andriin@fb.com
4 years agolibbpf: Fix huge memory leak in libbpf_find_vmlinux_btf_id()
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:08 +0000 (18:21 -0700)]
libbpf: Fix huge memory leak in libbpf_find_vmlinux_btf_id()

BTF object wasn't freed.

Fixes: a6ed02cac690 ("libbpf: Load btf_vmlinux only once per object.")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200429012111.277390-9-andriin@fb.com
4 years agoselftests/bpf: Fix invalid memory reads in core_relo selftest
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:07 +0000 (18:21 -0700)]
selftests/bpf: Fix invalid memory reads in core_relo selftest

Another one found by AddressSanitizer. input_len is bigger than actually
initialized data size.

Fixes: c7566a69695c ("selftests/bpf: Add field existence CO-RE relocs tests")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-8-andriin@fb.com
4 years agoselftests/bpf: Fix memory leak in extract_build_id()
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:06 +0000 (18:21 -0700)]
selftests/bpf: Fix memory leak in extract_build_id()

getline() allocates string, which has to be freed.

Fixes: 81f77fd0deeb ("bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200429012111.277390-7-andriin@fb.com
4 years agoselftests/bpf: Fix memory leak in test selector
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:05 +0000 (18:21 -0700)]
selftests/bpf: Fix memory leak in test selector

Free test selector substrings, which were strdup()'ed.

Fixes: b65053cd94f4 ("selftests/bpf: Add whitelist/blacklist of test names to test_progs")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-6-andriin@fb.com
4 years agolibbpf: Fix memory leak and possible double-free in hashmap__clear
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:04 +0000 (18:21 -0700)]
libbpf: Fix memory leak and possible double-free in hashmap__clear

Fix memory leak in hashmap_clear() not freeing hashmap_entry structs for each
of the remaining entries. Also NULL-out bucket list to prevent possible
double-free between hashmap__clear() and hashmap__free().

Running test_progs-asan flavor clearly showed this problem.

Reported-by: Alston Tang <alston64@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-5-andriin@fb.com
4 years agoselftests/bpf: Convert test_hashmap into test_progs test
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:03 +0000 (18:21 -0700)]
selftests/bpf: Convert test_hashmap into test_progs test

Fold stand-alone test_hashmap test into test_progs.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-4-andriin@fb.com
4 years agoselftests/bpf: Add SAN_CFLAGS param to selftests build to allow sanitizers
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:02 +0000 (18:21 -0700)]
selftests/bpf: Add SAN_CFLAGS param to selftests build to allow sanitizers

Add ability to specify extra compiler flags with SAN_CFLAGS for compilation of
all user-space C files.  This allows to build all of selftest programs with,
e.g., custom sanitizer flags, without requiring support for such sanitizers
from anyone compiling selftest/bpf.

As an example, to compile everything with AddressSanitizer, one would do:

  $ make clean && make SAN_CFLAGS="-fsanitize=address"

For AddressSanitizer to work, one needs appropriate libasan shared library
installed in the system, with version of libasan matching what GCC links
against. E.g., GCC8 needs libasan5, while GCC7 uses libasan4.

For CentOS 7, to build everything successfully one would need to:
  $ sudo yum install devtoolset-8-gcc devtoolset-libasan-devel
  $ scl enable devtoolset-8 bash # set up environment

For Arch Linux to run selftests, one would need to install gcc-libs package to
get libasan.so.5:
  $ sudo pacman -S gcc-libs

N.B. EXTRA_CFLAGS name wasn't used, because it's also used by libbpf's
Makefile and this causes few issues:
1. default "-g -Wall" flags are overriden;
2. compiling shared library with AddressSanitizer generates a bunch of symbols
   like: "_GLOBAL__sub_D_00099_0_btf_dump.c", "_GLOBAL__sub_D_00099_0_bpf.c",
   etc, which screws up versioned symbols check.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Julia Kartseva <hex@fb.com>
Link: https://lore.kernel.org/bpf/20200429012111.277390-3-andriin@fb.com
4 years agoselftests/bpf: Ensure test flavors use correct skeletons
Andrii Nakryiko [Wed, 29 Apr 2020 01:21:01 +0000 (18:21 -0700)]
selftests/bpf: Ensure test flavors use correct skeletons

Ensure that test runner flavors include their own skeletons from <flavor>/
directory. Previously, skeletons generated for no-flavor test_progs were used.
Apart from fixing correctness, this also makes it possible to compile only
flavors individually:

  $ make clean && make test_progs-no_alu32
  ... now succeeds ...

Fixes: 74b5a5968fe8 ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429012111.277390-2-andriin@fb.com
4 years agoMerge branch 'BTF-map-in-map'
Alexei Starovoitov [Wed, 29 Apr 2020 00:35:03 +0000 (17:35 -0700)]
Merge branch 'BTF-map-in-map'

Andrii Nakryiko says:

====================
This patch set teaches libbpf how to declare and initialize ARRAY_OF_MAPS and
HASH_OF_MAPS maps. See patch #3 for all the details.

Patch #1 refactors parsing BTF definition of map to re-use it cleanly for
inner map definition parsing.

Patch #2 refactors map creation and destruction logic for reuse. It also fixes
existing bug with not closing successfully created maps when bpf_object map
creation overall fails.

Patch #3 adds support for an extension of BTF-defined map syntax, as well as
parsing, recording, and use of relocations to allow declaratively initialize
outer maps with references to inner maps.

v1->v2:
  - rename __inner to __array (Alexei).
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agolibbpf: Add BTF-defined map-in-map support
Andrii Nakryiko [Wed, 29 Apr 2020 00:27:39 +0000 (17:27 -0700)]
libbpf: Add BTF-defined map-in-map support

As discussed at LPC 2019 ([0]), this patch brings (a quite belated) support
for declarative BTF-defined map-in-map support in libbpf. It allows to define
ARRAY_OF_MAPS and HASH_OF_MAPS BPF maps without any user-space initialization
code involved.

Additionally, it allows to initialize outer map's slots with references to
respective inner maps at load time, also completely declaratively.

Despite a weak type system of C, the way BTF-defined map-in-map definition
works, it's actually quite hard to accidentally initialize outer map with
incompatible inner maps. This being C, of course, it's still possible, but
even that would be caught at load time and error returned with helpful debug
log pointing exactly to the slot that failed to be initialized.

As an example, here's a rather advanced HASH_OF_MAPS declaration and
initialization example, filling slots #0 and #4 with two inner maps:

  #include <bpf/bpf_helpers.h>

  struct inner_map {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, int);
  } inner_map1 SEC(".maps"),
    inner_map2 SEC(".maps");

  struct outer_hash {
          __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
          __uint(max_entries, 5);
          __uint(key_size, sizeof(int));
          __array(values, struct inner_map);
  } outer_hash SEC(".maps") = {
          .values = {
                  [0] = &inner_map2,
                  [4] = &inner_map1,
          },
  };

Here's the relevant part of libbpf debug log showing pretty clearly of what's
going on with map-in-map initialization:

  libbpf: .maps relo #0: for 6 value 0 rel.r_offset 96 name 260 ('inner_map1')
  libbpf: .maps relo #0: map 'outer_arr' slot [0] points to map 'inner_map1'
  libbpf: .maps relo #1: for 7 value 32 rel.r_offset 112 name 249 ('inner_map2')
  libbpf: .maps relo #1: map 'outer_arr' slot [2] points to map 'inner_map2'
  libbpf: .maps relo #2: for 7 value 32 rel.r_offset 144 name 249 ('inner_map2')
  libbpf: .maps relo #2: map 'outer_hash' slot [0] points to map 'inner_map2'
  libbpf: .maps relo #3: for 6 value 0 rel.r_offset 176 name 260 ('inner_map1')
  libbpf: .maps relo #3: map 'outer_hash' slot [4] points to map 'inner_map1'
  libbpf: map 'inner_map1': created successfully, fd=4
  libbpf: map 'inner_map2': created successfully, fd=5
  libbpf: map 'outer_hash': created successfully, fd=7
  libbpf: map 'outer_hash': slot [0] set to map 'inner_map2' fd=5
  libbpf: map 'outer_hash': slot [4] set to map 'inner_map1' fd=4

Notice from the log above that fd=6 (not logged explicitly) is used for inner
"prototype" map, necessary for creation of outer map. It is destroyed
immediately after outer map is created.

See also included selftest with some extra comments explaining extra details
of usage. Additionally, similar initialization syntax and libbpf functionality
can be used to do initialization of BPF_PROG_ARRAY with references to BPF
sub-programs. This can be done in follow up patches, if there will be a demand
for this.

  [0] https://linuxplumbersconf.org/event/4/contributions/448/

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200429002739.48006-4-andriin@fb.com
4 years agolibbpf: Refactor map creation logic and fix cleanup leak
Andrii Nakryiko [Wed, 29 Apr 2020 00:27:38 +0000 (17:27 -0700)]
libbpf: Refactor map creation logic and fix cleanup leak

Factor out map creation and destruction logic to simplify code and especially
error handling. Also fix map FD leak in case of partially successful map
creation during bpf_object load operation.

Fixes: 57a00f41644f ("libbpf: Add auto-pinning of maps when loading BPF objects")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200429002739.48006-3-andriin@fb.com
4 years agolibbpf: Refactor BTF-defined map definition parsing logic
Andrii Nakryiko [Wed, 29 Apr 2020 00:27:37 +0000 (17:27 -0700)]
libbpf: Refactor BTF-defined map definition parsing logic

Factor out BTF map definition logic into stand-alone routine for easier reuse
for map-in-map case.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429002739.48006-2-andriin@fb.com
4 years agoMerge branch 'bpf_link-observability'
Alexei Starovoitov [Wed, 29 Apr 2020 00:27:08 +0000 (17:27 -0700)]
Merge branch 'bpf_link-observability'

Andrii Nakryiko says:

====================
This patch series adds various observability APIs to bpf_link:
  - each bpf_link now gets ID, similar to bpf_map and bpf_prog, by which
    user-space can iterate over all existing bpf_links and create limited FD
    from ID;
  - allows to get extra object information with bpf_link general and
    type-specific information;
  - implements `bpf link show` command which lists all active bpf_links in the
    system;
  - implements `bpf link pin` allowing to pin bpf_link by ID or from other
    pinned path.

v2->v3:
  - improve spin locking around bpf_link ID (Alexei);
  - simplify bpf_link_info handling and fix compilation error on sh arch;
v1->v2:
  - simplified `bpftool link show` implementation (Quentin);
  - fixed formatting of bpftool-link.rst (Quentin);
  - fixed attach type printing logic (Quentin);
rfc->v1:
  - dropped read-only bpf_links (Alexei);
  - fixed bug in bpf_link_cleanup() not removing ID;
  - fixed bpftool link pinning search logic;
  - added bash-completion and man page.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agobpftool: Add link bash completions
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:14 +0000 (17:16 -0700)]
bpftool: Add link bash completions

Extend bpftool's bash-completion script to handle new link command and its
sub-commands.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200429001614.1544-11-andriin@fb.com
4 years agobpftool: Add bpftool-link manpage
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:13 +0000 (17:16 -0700)]
bpftool: Add bpftool-link manpage

Add bpftool-link manpage with information and examples of link-related
commands.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200429001614.1544-10-andriin@fb.com
4 years agobpftool: Add bpf_link show and pin support
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:12 +0000 (17:16 -0700)]
bpftool: Add bpf_link show and pin support

Add `bpftool link show` and `bpftool link pin` commands.

Example plain output for `link show` (with showing pinned paths):

[vmuser@archvm bpf]$ sudo ~/local/linux/tools/bpf/bpftool/bpftool -f link
1: tracing  prog 12
        prog_type tracing  attach_type fentry
        pinned /sys/fs/bpf/my_test_link
        pinned /sys/fs/bpf/my_test_link2
2: tracing  prog 13
        prog_type tracing  attach_type fentry
3: tracing  prog 14
        prog_type tracing  attach_type fentry
4: tracing  prog 15
        prog_type tracing  attach_type fentry
5: tracing  prog 16
        prog_type tracing  attach_type fentry
6: tracing  prog 17
        prog_type tracing  attach_type fentry
7: raw_tracepoint  prog 21
        tp 'sys_enter'
8: cgroup  prog 25
        cgroup_id 584  attach_type egress
9: cgroup  prog 25
        cgroup_id 599  attach_type egress
10: cgroup  prog 25
        cgroup_id 614  attach_type egress
11: cgroup  prog 25
        cgroup_id 629  attach_type egress

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200429001614.1544-9-andriin@fb.com
4 years agobpftool: Expose attach_type-to-string array to non-cgroup code
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:11 +0000 (17:16 -0700)]
bpftool: Expose attach_type-to-string array to non-cgroup code

Move attach_type_strings into main.h for access in non-cgroup code.
bpf_attach_type is used for non-cgroup attach types quite widely now. So also
complete missing string translations for non-cgroup attach types.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200429001614.1544-8-andriin@fb.com
4 years agoselftests/bpf: Test bpf_link's get_next_id, get_fd_by_id, and get_obj_info
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:10 +0000 (17:16 -0700)]
selftests/bpf: Test bpf_link's get_next_id, get_fd_by_id, and get_obj_info

Extend bpf_obj_id selftest to verify bpf_link's observability APIs.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-7-andriin@fb.com
4 years agolibbpf: Add low-level APIs for new bpf_link commands
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:09 +0000 (17:16 -0700)]
libbpf: Add low-level APIs for new bpf_link commands

Add low-level API calls for bpf_link_get_next_id() and
bpf_link_get_fd_by_id().

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-6-andriin@fb.com
4 years agobpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:08 +0000 (17:16 -0700)]
bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link

Add ability to fetch bpf_link details through BPF_OBJ_GET_INFO_BY_FD command.
Also enhance show_fdinfo to potentially include bpf_link type-specific
information (similarly to obj_info).

Also introduce enum bpf_link_type stored in bpf_link itself and expose it in
UAPI. bpf_link_tracing also now will store and return bpf_attach_type.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-5-andriin@fb.com
4 years agobpf: Support GET_FD_BY_ID and GET_NEXT_ID for bpf_link
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:07 +0000 (17:16 -0700)]
bpf: Support GET_FD_BY_ID and GET_NEXT_ID for bpf_link

Add support to look up bpf_link by ID and iterate over all existing bpf_links
in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
done as the very last step in finalizing bpf_link, together with installing
FD. This approach allows users of bpf_link in kernel code to not worry about
races between user-space and kernel code that hasn't finished attaching and
initializing bpf_link.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-4-andriin@fb.com
4 years agobpf: Allocate ID for bpf_link
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:06 +0000 (17:16 -0700)]
bpf: Allocate ID for bpf_link

Generate ID for each bpf_link using IDR, similarly to bpf_map and bpf_prog.
bpf_link creation, initialization, attachment, and exposing to user-space
through FD and ID is a complicated multi-step process, abstract it away
through bpf_link_primer and bpf_link_prime(), bpf_link_settle(), and
bpf_link_cleanup() internal API. They guarantee that until bpf_link is
properly attached, user-space won't be able to access partially-initialized
bpf_link either from FD or ID. All this allows to simplify bpf_link attachment
and error handling code.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-3-andriin@fb.com
4 years agobpf: Refactor bpf_link update handling
Andrii Nakryiko [Wed, 29 Apr 2020 00:16:05 +0000 (17:16 -0700)]
bpf: Refactor bpf_link update handling

Make bpf_link update support more generic by making it into another
bpf_link_ops methods. This allows generic syscall handling code to be agnostic
to various conditionally compiled features (e.g., the case of
CONFIG_CGROUP_BPF). This also allows to keep link type-specific code to remain
static within respective code base. Refactor existing bpf_cgroup_link code and
take advantage of this.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com
4 years agoselftests/bpf: fix test_sysctl_prog with alu32
Alexei Starovoitov [Tue, 28 Apr 2020 22:30:48 +0000 (15:30 -0700)]
selftests/bpf: fix test_sysctl_prog with alu32

Similar to commit b7a0d65d80a0 ("bpf, testing: Workaround a verifier failure for test_progs")
fix test_sysctl_prog.c as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agolibbpf: Remove unneeded semicolon in btf_dump_emit_type
Zou Wei [Tue, 28 Apr 2020 09:07:09 +0000 (17:07 +0800)]
libbpf: Remove unneeded semicolon in btf_dump_emit_type

Fixes the following coccicheck warning:

 tools/lib/bpf/btf_dump.c:661:4-5: Unneeded semicolon

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1588064829-70613-1-git-send-email-zou_wei@huawei.com
4 years agoselftests/bpf: Copy runqslower to OUTPUT directory
Veronika Kabatova [Tue, 28 Apr 2020 17:37:42 +0000 (19:37 +0200)]
selftests/bpf: Copy runqslower to OUTPUT directory

$(OUTPUT)/runqslower makefile target doesn't actually create runqslower
binary in the $(OUTPUT) directory. As lib.mk expects all
TEST_GEN_PROGS_EXTENDED (which runqslower is a part of) to be present in
the OUTPUT directory, this results in an error when running e.g. `make
install`:

rsync: link_stat "tools/testing/selftests/bpf/runqslower" failed: No
       such file or directory (2)

Copy the binary into the OUTPUT directory after building it to fix the
error.

Fixes: 3a0d3092a4ed ("selftests/bpf: Build runqslower from selftests")
Signed-off-by: Veronika Kabatova <vkabatov@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200428173742.2988395-1-vkabatov@redhat.com
4 years agoMerge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git...
Daniel Borkmann [Tue, 28 Apr 2020 19:20:20 +0000 (21:20 +0200)]
Merge branch 'work.sysctl' of ssh://gitolite./linux/kernel/git/viro/vfs

Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
4 years agobpf, cgroup: Remove unused exports
Christoph Hellwig [Fri, 24 Apr 2020 06:43:34 +0000 (08:43 +0200)]
bpf, cgroup: Remove unused exports

Except for a few of the networking hooks called from modular ipv4
or ipv6 code, all of hooks are just called from guaranteed to be
built-in code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20200424064338.538313-2-hch@lst.de
4 years agolibbpf: Return err if bpf_object__load failed
Mao Wenan [Sun, 26 Apr 2020 06:36:35 +0000 (14:36 +0800)]
libbpf: Return err if bpf_object__load failed

bpf_object__load() has various return code, when it failed to load
object, it must return err instead of -EINVAL.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200426063635.130680-3-maowenan@huawei.com
4 years agosysctl: pass kernel pointers to ->proc_handler
Christoph Hellwig [Fri, 24 Apr 2020 06:43:38 +0000 (08:43 +0200)]
sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4 years agosysctl: avoid forward declarations
Christoph Hellwig [Fri, 24 Apr 2020 06:43:37 +0000 (08:43 +0200)]
sysctl: avoid forward declarations

Move the sysctl tables to the end of the file to avoid lots of pointless
forward declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4 years agosysctl: remove all extern declaration from sysctl.c
Christoph Hellwig [Fri, 24 Apr 2020 06:43:36 +0000 (08:43 +0200)]
sysctl: remove all extern declaration from sysctl.c

Extern declarations in .c files are a bad style and can lead to
mismatches.  Use existing definitions in headers where they exist,
and otherwise move the external declarations to suitable header
files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4 years agomm: remove watermark_boost_factor_sysctl_handler
Christoph Hellwig [Fri, 24 Apr 2020 06:43:35 +0000 (08:43 +0200)]
mm: remove watermark_boost_factor_sysctl_handler

watermark_boost_factor_sysctl_handler is just a pointless wrapper for
proc_dointvec_minmax, so remove it and use proc_dointvec_minmax
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4 years agoMerge branch 'cloudflare-prog'
Alexei Starovoitov [Sun, 26 Apr 2020 17:00:37 +0000 (10:00 -0700)]
Merge branch 'cloudflare-prog'

Lorenz Bauer says:

====================
We've been developing an in-house L4 load balancer based on XDP
and TC for a while. Following Alexei's call for more up-to-date examples of
production BPF in the kernel tree [1], Cloudflare is making this available
under dual GPL-2.0 or BSD 3-clause terms.

The code requires at least v5.3 to function correctly.

1: https://lore.kernel.org/bpf/20200326210719.den5isqxntnoqhmv@ast-mbp/
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agoselftests/bpf: Add cls_redirect classifier
Lorenz Bauer [Fri, 24 Apr 2020 18:55:55 +0000 (19:55 +0100)]
selftests/bpf: Add cls_redirect classifier

cls_redirect is a TC clsact based replacement for the glb-redirect iptables
module available at [1]. It enables what GitHub calls "second chance"
flows [2], similarly proposed by the Beamer paper [3]. In contrast to
glb-redirect, it also supports migrating UDP flows as long as connected
sockets are used. cls_redirect is in production at Cloudflare, as part of
our own L4 load balancer.

We have modified the encapsulation format slightly from glb-redirect:
glbgue_chained_routing.private_data_type has been repurposed to form a
version field and several flags. Both have been arranged in a way that
a private_data_type value of zero matches the current glb-redirect
behaviour. This means that cls_redirect will understand packets in
glb-redirect format, but not vice versa.

The test suite only covers basic features. For example, cls_redirect will
correctly forward path MTU discovery packets, but this is not exercised.
It is also possible to switch the encapsulation format to GRE on the last
hop, which is also not tested.

There are two major distinctions from glb-redirect: first, cls_redirect
relies on receiving encapsulated packets directly from a router. This is
because we don't have access to the neighbour tables from BPF, yet. See
forward_to_next_hop for details. Second, cls_redirect performs decapsulation
instead of using separate ipip and sit tunnel devices. This
avoids issues with the sit tunnel [4] and makes deploying the classifier
easier: decapsulated packets appear on the same interface, so existing
firewall rules continue to work as expected.

The code base started it's life on v4.19, so there are most likely still
hold overs from old workarounds. In no particular order:

- The function buf_off is required to defeat a clang optimization
  that leads to the verifier rejecting the program due to pointer
  arithmetic in the wrong order.

- The function pkt_parse_ipv6 is force inlined, because it would
  otherwise be rejected due to returning a pointer to stack memory.

- The functions fill_tuple and classify_tcp contain kludges, because
  we've run out of function arguments.

- The logic in general is rather nested, due to verifier restrictions.
  I think this is either because the verifier loses track of constants
  on the stack, or because it can't track enum like variables.

1: https://github.com/github/glb-director/tree/master/src/glb-redirect
2: https://github.com/github/glb-director/blob/master/docs/development/second-chance-design.md
3: https://www.usenix.org/conference/nsdi18/presentation/olteanu
4: https://github.com/github/glb-director/issues/64

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200424185556.7358-2-lmb@cloudflare.com
4 years agobpf: Make verifier log more relevant by default
Andrii Nakryiko [Thu, 23 Apr 2020 19:58:50 +0000 (12:58 -0700)]
bpf: Make verifier log more relevant by default

To make BPF verifier verbose log more releavant and easier to use to debug
verification failures, "pop" parts of log that were successfully verified.
This has effect of leaving only verifier logs that correspond to code branches
that lead to verification failure, which in practice should result in much
shorter and more relevant verifier log dumps. This behavior is made the
default behavior and can be overriden to do exhaustive logging by specifying
BPF_LOG_LEVEL2 log level.

Using BPF_LOG_LEVEL2 to disable this behavior is not ideal, because in some
cases it's good to have BPF_LOG_LEVEL2 per-instruction register dump
verbosity, but still have only relevant verifier branches logged. But for this
patch, I didn't want to add any new flags. It might be worth-while to just
rethink how BPF verifier logging is performed and requested and streamline it
a bit. But this trimming of successfully verified branches seems to be useful
and a good default behavior.

To test this, I modified runqslower slightly to introduce read of
uninitialized stack variable. Log (**truncated in the middle** to save many
lines out of this commit message) BEFORE this change:

; int handle__sched_switch(u64 *ctx)
0: (bf) r6 = r1
; struct task_struct *prev = (struct task_struct *)ctx[1];
1: (79) r1 = *(u64 *)(r6 +8)
func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
2: (b7) r2 = 0
; struct event event = {};
3: (7b) *(u64 *)(r10 -24) = r2
last_idx 3 first_idx 0
regs=4 stack=0 before 2: (b7) r2 = 0
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
; if (prev->state == TASK_RUNNING)

[ ... instruction dump from insn #7 through #50 are cut out ... ]

51: (b7) r2 = 16
52: (85) call bpf_get_current_comm#16
last_idx 52 first_idx 42
regs=4 stack=0 before 51: (b7) r2 = 16
; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
53: (bf) r1 = r6
54: (18) r2 = 0xffff8881f3868800
56: (18) r3 = 0xffffffff
58: (bf) r4 = r7
59: (b7) r5 = 32
60: (85) call bpf_perf_event_output#25
last_idx 60 first_idx 53
regs=20 stack=0 before 59: (b7) r5 = 32
61: (bf) r2 = r10
; event.pid = pid;
62: (07) r2 += -16
; bpf_map_delete_elem(&start, &pid);
63: (18) r1 = 0xffff8881f3868000
65: (85) call bpf_map_delete_elem#3
; }
66: (b7) r0 = 0
67: (95) exit

from 44 to 66: safe

from 34 to 66: safe

from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; bpf_map_update_elem(&start, &pid, &ts, 0);
28: (bf) r2 = r10
;
29: (07) r2 += -16
; tsp = bpf_map_lookup_elem(&start, &pid);
30: (18) r1 = 0xffff8881f3868000
32: (85) call bpf_map_lookup_elem#1
invalid indirect read from stack off -16+0 size 4
processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4

Notice how there is a successful code path from instruction 0 through 67, few
successfully verified jumps (44->66, 34->66), and only after that 11->28 jump
plus error on instruction #32.

AFTER this change (full verifier log, **no truncation**):

; int handle__sched_switch(u64 *ctx)
0: (bf) r6 = r1
; struct task_struct *prev = (struct task_struct *)ctx[1];
1: (79) r1 = *(u64 *)(r6 +8)
func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
2: (b7) r2 = 0
; struct event event = {};
3: (7b) *(u64 *)(r10 -24) = r2
last_idx 3 first_idx 0
regs=4 stack=0 before 2: (b7) r2 = 0
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
; if (prev->state == TASK_RUNNING)
7: (79) r2 = *(u64 *)(r1 +16)
; if (prev->state == TASK_RUNNING)
8: (55) if r2 != 0x0 goto pc+19
 R1_w=ptr_task_struct(id=0,off=0,imm=0) R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; trace_enqueue(prev->tgid, prev->pid);
9: (61) r1 = *(u32 *)(r1 +1184)
10: (63) *(u32 *)(r10 -4) = r1
; if (!pid || (targ_pid && targ_pid != pid))
11: (15) if r1 == 0x0 goto pc+16

from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; bpf_map_update_elem(&start, &pid, &ts, 0);
28: (bf) r2 = r10
;
29: (07) r2 += -16
; tsp = bpf_map_lookup_elem(&start, &pid);
30: (18) r1 = 0xffff8881db3ce800
32: (85) call bpf_map_lookup_elem#1
invalid indirect read from stack off -16+0 size 4
processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4

Notice how in this case, there are 0-11 instructions + jump from 11 to
28 is recorded + 28-32 instructions with error on insn #32.

test_verifier test runner was updated to specify BPF_LOG_LEVEL2 for
VERBOSE_ACCEPT expected result due to potentially "incomplete" success verbose
log at BPF_LOG_LEVEL1.

On success, verbose log will only have a summary of number of processed
instructions, etc, but no branch tracing log. Having just a last succesful
branch tracing seemed weird and confusing. Having small and clean summary log
in success case seems quite logical and nice, though.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200423195850.1259827-1-andriin@fb.com
4 years agobpf: add bpf_ktime_get_boot_ns()
Maciej Żenczykowski [Sun, 26 Apr 2020 16:15:25 +0000 (09:15 -0700)]
bpf: add bpf_ktime_get_boot_ns()

On a device like a cellphone which is constantly suspending
and resuming CLOCK_MONOTONIC is not particularly useful for
keeping track of or reacting to external network events.
Instead you want to use CLOCK_BOOTTIME.

Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agoxsk: Fix typo in xsk_umem_consume_tx and xsk_generic_xmit comments
Tobias Klauser [Tue, 21 Apr 2020 23:29:27 +0000 (01:29 +0200)]
xsk: Fix typo in xsk_umem_consume_tx and xsk_generic_xmit comments

s/backpreassure/backpressure/

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20200421232927.21082-1-tklauser@distanz.ch
4 years agonet: bpf: Make bpf_ktime_get_ns() available to non GPL programs
Maciej Żenczykowski [Mon, 20 Apr 2020 18:47:50 +0000 (11:47 -0700)]
net: bpf: Make bpf_ktime_get_ns() available to non GPL programs

The entire implementation is in kernel/bpf/helpers.c:

BPF_CALL_0(bpf_ktime_get_ns) {
       /* NMI safe access to clock monotonic */
       return ktime_get_mono_fast_ns();
}

const struct bpf_func_proto bpf_ktime_get_ns_proto = {
       .func           = bpf_ktime_get_ns,
       .gpl_only       = false,
       .ret_type       = RET_INTEGER,
};

and this was presumably marked GPL due to kernel/time/timekeeping.c:
  EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);

and while that may make sense for kernel modules (although even that
is doubtful), there is currently AFAICT no other source of time
available to ebpf.

Furthermore this is really just equivalent to clock_gettime(CLOCK_MONOTONIC)
which is exposed to userspace (via vdso even to make it performant)...

As such, I see no reason to keep the GPL restriction.
(In the future I'd like to have access to time from Apache licensed ebpf code)

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agonet: bpf: Allow TC programs to call BPF_FUNC_skb_change_head
Lorenzo Colitti [Mon, 20 Apr 2020 18:34:08 +0000 (11:34 -0700)]
net: bpf: Allow TC programs to call BPF_FUNC_skb_change_head

This allows TC eBPF programs to modify and forward (redirect) packets
from interfaces without ethernet headers (for example cellular)
to interfaces with (for example ethernet/wifi).

The lack of this appears to simply be an oversight.

Tested:
  in active use in Android R on 4.14+ devices for ipv6
  cellular to wifi tethering offload.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agobpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n
Stanislav Fomichev [Fri, 24 Apr 2020 23:59:41 +0000 (16:59 -0700)]
bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n

linux-next build bot reported compile issue [1] with one of its
configs. It looks like when we have CONFIG_NET=n and
CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto
definition (from net/core/filter.c) in cgroup_base_func_proto.

I'm reshuffling the code a bit to make it work. The common helpers
are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is
exported from there.
Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing
bpf_user_rnd_u32.

[1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72

Fixes: 0456ea170cd6 ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
4 years agobpf, riscv: Fix tail call count off by one in RV32 BPF JIT
Luke Nelson [Tue, 21 Apr 2020 00:28:04 +0000 (17:28 -0700)]
bpf, riscv: Fix tail call count off by one in RV32 BPF JIT

This patch fixes an off by one error in the RV32 JIT handling for BPF
tail call. Currently, the code decrements TCC before checking if it
is less than zero. This limits the maximum number of tail calls to 32
instead of 33 as in other JITs. The fix is to instead check the old
value of TCC before decrementing.

Fixes: 5f316b65e99f ("riscv, bpf: Add RV32G eBPF JIT")
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Xi Wang <xi.wang@gmail.com>
Link: https://lore.kernel.org/bpf/20200421002804.5118-1-luke.r.nels@gmail.com
4 years agobpf_helpers.h: Add note for building with vmlinux.h or linux/types.h
Yoshiki Komachi [Tue, 21 Apr 2020 00:05:27 +0000 (09:05 +0900)]
bpf_helpers.h: Add note for building with vmlinux.h or linux/types.h

The following error was shown when a bpf program was compiled without
vmlinux.h auto-generated from BTF:

 # clang -I./linux/tools/lib/ -I/lib/modules/$(uname -r)/build/include/ \
   -O2 -Wall -target bpf -emit-llvm -c bpf_prog.c -o bpf_prog.bc
 ...
 In file included from linux/tools/lib/bpf/bpf_helpers.h:5:
 linux/tools/lib/bpf/bpf_helper_defs.h:56:82: error: unknown type name '__u64'
 ...

It seems that bpf programs are intended for being built together with
the vmlinux.h (which will have all the __u64 and other typedefs). But
users may mistakenly think "include <linux/types.h>" is missing
because the vmlinux.h is not common for non-bpf developers. IMO, an
explicit comment therefore should be added to bpf_helpers.h as this
patch shows.

Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1587427527-29399-1-git-send-email-komachi.yoshiki@gmail.com
4 years agobpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}
Stanislav Fomichev [Mon, 20 Apr 2020 17:46:10 +0000 (10:46 -0700)]
bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}

Currently the following prog types don't fall back to bpf_base_func_proto()
(instead they have cgroup_base_func_proto which has a limited set of
helpers from bpf_base_func_proto):
* BPF_PROG_TYPE_CGROUP_DEVICE
* BPF_PROG_TYPE_CGROUP_SYSCTL
* BPF_PROG_TYPE_CGROUP_SOCKOPT

I don't see any specific reason why we shouldn't use bpf_base_func_proto(),
every other type of program (except bpf-lirc and, understandably, tracing)
use it, so let's fall back to bpf_base_func_proto for those prog types
as well.

This basically boils down to adding access to the following helpers:
* BPF_FUNC_get_prandom_u32
* BPF_FUNC_get_smp_processor_id
* BPF_FUNC_get_numa_node_id
* BPF_FUNC_tail_call
* BPF_FUNC_ktime_get_ns
* BPF_FUNC_spin_lock (CAP_SYS_ADMIN)
* BPF_FUNC_spin_unlock (CAP_SYS_ADMIN)
* BPF_FUNC_jiffies64 (CAP_SYS_ADMIN)

I've also added bpf_perf_event_output() because it's really handy for
logging and debugging.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com
4 years agotools/bpf/bpftool: Remove duplicate headers
Jagadeesh Pagadala [Sun, 19 Apr 2020 05:39:17 +0000 (11:09 +0530)]
tools/bpf/bpftool: Remove duplicate headers

Code cleanup: Remove duplicate headers which are included twice.

Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/1587274757-14101-1-git-send-email-jagdsh.linux@gmail.com
4 years agobpf: Remove set but not used variable 'dst_known'
Mao Wenan [Sat, 18 Apr 2020 01:37:35 +0000 (09:37 +0800)]
bpf: Remove set but not used variable 'dst_known'

Fixes gcc '-Wunused-but-set-variable' warning:

kernel/bpf/verifier.c:5603:18: warning: variable â€˜dst_known’
set but not used [-Wunused-but-set-variable], delete this
variable.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200418013735.67882-1-maowenan@huawei.com
4 years agonet: hns3: remove an unnecessary check in hclge_set_umv_space()
Huazhong Tan [Sun, 26 Apr 2020 02:13:48 +0000 (10:13 +0800)]
net: hns3: remove an unnecessary check in hclge_set_umv_space()

Since hclge_set_umv_space() is only called by hclge_init_umv_space(),
parameter 'allocated_size' will not be NULL.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: openvswitch: use div_u64() for 64-by-32 divisions
Tonghao Zhang [Sat, 25 Apr 2020 03:39:48 +0000 (11:39 +0800)]
net: openvswitch: use div_u64() for 64-by-32 divisions

Compile the kernel for arm 32 platform, the build warning found.
To fix that, should use div_u64() for divisions.
| net/openvswitch/meter.c:396: undefined reference to `__udivdi3'

[add more commit msg, change reported tag, and use div_u64 instead
of do_div by Tonghao]

Fixes: e57358873bb5d6ca ("net: openvswitch: use u64 for meter bucket")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: openvswitch: suitable access to the dp_meters
Tonghao Zhang [Sat, 25 Apr 2020 03:39:47 +0000 (11:39 +0800)]
net: openvswitch: suitable access to the dp_meters

To fix the following sparse warning:
| net/openvswitch/meter.c:109:38: sparse: sparse: incorrect type
| in assignment (different address spaces) ...
| net/openvswitch/meter.c:720:45: sparse: sparse: incorrect type
| in argument 1 (different address spaces) ...

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'hinic-add-SR-IOV-support'
David S. Miller [Sun, 26 Apr 2020 03:46:28 +0000 (20:46 -0700)]
Merge branch 'hinic-add-SR-IOV-support'

Luo bin says:

====================
hinic: add SR-IOV support

patch #1 adds mailbox channel support and vf can
communicate with pf or hw through it.
patch #2 adds support for enabling vf and tx/rx
capabilities based on vf.
patch #3 adds support for vf's basic configurations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agohinic: add net_device_ops associated with vf
Luo bin [Sat, 25 Apr 2020 01:21:11 +0000 (01:21 +0000)]
hinic: add net_device_ops associated with vf

adds ndo_set_vf_mac/ndo_set_vf_vlan/ndo_get_vf_config and
ndo_set_vf_trust to configure netdev of virtual function

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agohinic: add sriov feature support
Luo bin [Sat, 25 Apr 2020 01:21:10 +0000 (01:21 +0000)]
hinic: add sriov feature support

adds support of basic sriov feature including initialization and
tx/rx capabilities of virtual function

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agohinic: add mailbox function support
Luo bin [Sat, 25 Apr 2020 01:21:09 +0000 (01:21 +0000)]
hinic: add mailbox function support

virtual function and physical function can communicate with each
other through mailbox channel supported by hw

Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/mlx4_core: Add missing iounmap() in error path
Zou Wei [Fri, 24 Apr 2020 13:53:14 +0000 (21:53 +0800)]
net/mlx4_core: Add missing iounmap() in error path

This fixes the following coccicheck warning:

drivers/net/ethernet/mellanox/mlx4/crdump.c:200:2-8: ERROR: missing iounmap;
ioremap on line 190 and execution via conditional on line 198

Fixes: 7ef19d3b1d5e ("devlink: report error once U32_MAX snapshot ids have been used")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agodccp: remove unused inline function dccp_set_seqno
YueHaibing [Fri, 24 Apr 2020 13:13:34 +0000 (21:13 +0800)]
dccp: remove unused inline function dccp_set_seqno

There's no callers in-tree since commit 792b48780e8b ("dccp: Implement
both feature-local and feature-remote Sequence Window feature")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoqlcnic: remove unused inline function qlcnic_hw_write_wx_2M
YueHaibing [Fri, 24 Apr 2020 13:12:56 +0000 (21:12 +0800)]
qlcnic: remove unused inline function qlcnic_hw_write_wx_2M

There's no callers in-tree anymore.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoliquidio: remove unused inline functions
YueHaibing [Fri, 24 Apr 2020 13:11:34 +0000 (21:11 +0800)]
liquidio: remove unused inline functions

commit b6334be64d6f ("net/liquidio: Delete driver version assignment")
left behind this, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoptp: clockmatrix: remove unnecessary comparison
Yang Yingliang [Fri, 24 Apr 2020 12:52:26 +0000 (20:52 +0800)]
ptp: clockmatrix: remove unnecessary comparison

The type of loaddr is u8 which is always '<=' 0xff, so the
loaddr <= 0xff is always true, we can remove this comparison.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Vincent Cheng <vincent.cheng.xh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agohsr: remove unnecessary code in hsr_dev_change_mtu()
Taehee Yoo [Fri, 24 Apr 2020 12:43:09 +0000 (12:43 +0000)]
hsr: remove unnecessary code in hsr_dev_change_mtu()

In the hsr_dev_change_mtu(), the 'dev' and 'master->dev' pointer are
same. So, the 'master' variable and some code are unnecessary.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotcp: mptcp: use mptcp receive buffer space to select rcv window
Florian Westphal [Fri, 24 Apr 2020 10:31:50 +0000 (12:31 +0200)]
tcp: mptcp: use mptcp receive buffer space to select rcv window

In MPTCP, the receive window is shared across all subflows, because it
refers to the mptcp-level sequence space.

MPTCP receivers already place incoming packets on the mptcp socket
receive queue and will charge it to the mptcp socket rcvbuf until
userspace consumes the data.

Update __tcp_select_window to use the occupancy of the parent/mptcp
socket instead of the subflow socket in case the tcp socket is part
of a logical mptcp connection.

This commit doesn't change choice of initial window for passive or active
connections.
While it would be possible to change those as well, this adds complexity
(especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
of a TCP segment at the connection level if it does not also carry a DSS
option with a Data ACK field.'

SYN/SYNACK packets do not carry a DSS option with a Data ACK field.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agodpaa2-eth: add channel stat to debugfs
Ioana Ciornei [Fri, 24 Apr 2020 09:33:18 +0000 (12:33 +0300)]
dpaa2-eth: add channel stat to debugfs

Compute the average number of frames processed for each CDAN (Channel
Data Availability Notification) and export it to debugfs detailed
channel stats.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoocteontx2-pf: Remove unneeded semicolon
Zheng Bin [Fri, 24 Apr 2020 09:13:35 +0000 (17:13 +0800)]
octeontx2-pf: Remove unneeded semicolon

Fixes coccicheck warning:

drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h:312:2-3: Unneeded semicolon

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: dp83867: Remove unneeded semicolon
Zheng Bin [Fri, 24 Apr 2020 09:08:50 +0000 (17:08 +0800)]
net: phy: dp83867: Remove unneeded semicolon

Fixes coccicheck warning:

drivers/net/phy/dp83867.c:368:2-3: Unneeded semicolon
drivers/net/phy/dp83867.c:403:2-3: Unneeded semicolon

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-hns3-refactor-for-MAC-table'
David S. Miller [Sun, 26 Apr 2020 03:29:44 +0000 (20:29 -0700)]
Merge branch 'net-hns3-refactor-for-MAC-table'

Huazhong Tan says:

====================
net: hns3: refactor for MAC table

This patchset refactors the MAC table management, configure
the MAC address asynchronously, instead of synchronously.
Base on this change, it also refines the handle of promisc
mode and filter table entries restoring after reset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: optimize the filter table entries handling when resetting
Jian Shen [Fri, 24 Apr 2020 02:23:13 +0000 (10:23 +0800)]
net: hns3: optimize the filter table entries handling when resetting

Currently, the PF driver removes all (including its VFs') MAC/VLAN
flow director table entries when resetting, and restores them after
reset completed.

In fact, the hardware will clear all table entries only in IMP
reset and global reset. So driver only needs to restore the table
entries in these cases, and needs do nothing when PF reset, FLR
or other function level reset.

This patch optimizes it by removing unnecessary table entries clear
and restoring handling in the reset flow, and doing the restoring
after reset completed.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: use mutex vport_lock instead of mutex umv_lock
Jian Shen [Fri, 24 Apr 2020 02:23:12 +0000 (10:23 +0800)]
net: hns3: use mutex vport_lock instead of mutex umv_lock

Currently, the driver use mutex umv_lock to protect the variable
vport->share_umv_size. And there is already a mutex vport_lock
being defined in the driver, which is designed to protect the
resource of vport. So we can use vport_lock instead of umv_lock.

Furthermore, there is a time window for protect share_umv_size
between checking UMV space and doing MAC configuration in the
lin function hclge_add_uc_addr_common(). It should be extended.

This patch uses mutex vport_lock intead of spin lock umv_lock to
protect share_umv_size, and adjusts the mutex's range.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: refactor the promisc mode setting
Jian Shen [Fri, 24 Apr 2020 02:23:11 +0000 (10:23 +0800)]
net: hns3: refactor the promisc mode setting

As the HNS3 driver doesn't update the MAC address directly in
function hns3_set_rx_mode() now, it can't know whether the
MAC table is full from __dev_uc_sync() and __dev_mc_sync(),
so it's senseless to handle the overflow promisc here.

This patch removes the handle of overflow promisc from function
hns3_set_rx_mode(), and updates the promisc mode in the service
task.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: add support for dumping UC and MC MAC list
Jian Shen [Fri, 24 Apr 2020 02:23:10 +0000 (10:23 +0800)]
net: hns3: add support for dumping UC and MC MAC list

This patch adds support for dumping entries of UC and MC MAC list,
which help checking whether a MAC address being added into hardware
or not.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: refactor the MAC address configure
Jian Shen [Fri, 24 Apr 2020 02:23:09 +0000 (10:23 +0800)]
net: hns3: refactor the MAC address configure

Currently, the HNS3 driver sync and unsync MAC address in function
hns3_set_rx_mode(). For PF, it adds and deletes MAC address directly
in the path of dev_set_rx_mode(). If failed, it won't retry until
next calling of hns3_set_rx_mode(). On the other hand, if request
add and remove a same address many times at a short interval, each
request must be done one by one, can't be merged. For VF, it sends
mailbox messages to PF to request adding or deleting MAC address in
the path of function hns3_set_rx_mode(), no matter the address is
configured success.

This patch refines it by recording the MAC address in function
hns3_set_rx_mode(), and updating MAC address in the service task.
If failed, it will retry by the next calling of periodical service
task. It also uses some state to mark the state of each MAC address
in the MAC list, which can help merge configure request for a same
address. With these changes, when global reset or IMP reset occurs,
we can restore the MAC table with the MAC list.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: replace num_req_vfs with num_alloc_vport in hclge_reset_umv_space()
Jian Shen [Fri, 24 Apr 2020 02:23:08 +0000 (10:23 +0800)]
net: hns3: replace num_req_vfs with num_alloc_vport in hclge_reset_umv_space()

Like the calculation elsewhere, replaces num_req_vfs with
num_alloc_vport in hclge_reset_umv_space().

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: remove unnecessary parameter 'is_alloc' in hclge_set_umv_space()
Jian Shen [Fri, 24 Apr 2020 02:23:07 +0000 (10:23 +0800)]
net: hns3: remove unnecessary parameter 'is_alloc' in hclge_set_umv_space()

Since hclge_set_umv_space() is only called by hclge_init_umv_space(),
so parameter 'is_alloc' is redundant.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: refine for unicast MAC VLAN space management
Jian Shen [Fri, 24 Apr 2020 02:23:06 +0000 (10:23 +0800)]
net: hns3: refine for unicast MAC VLAN space management

Currently, firmware helps manage the unicast MAC VLAN table
space for each PF. PF just needs to tell firmware its wanted
space when initializing, and unnecessary to free it when
un-intializing. So this patch removes the umv space free handle,
and removes the forward statement of hclge_set_umv_space()
by defining hclge_init_umv_space() after it.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
David S. Miller [Sun, 26 Apr 2020 02:24:42 +0000 (19:24 -0700)]
Merge git://git./linux/kernel/git/netdev/net

Simple overlapping changes to linux/vermagic.h

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm...
Linus Torvalds [Sat, 25 Apr 2020 19:25:32 +0000 (12:25 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ebiederm/user-namespace

Pull pid leak fix from Eric Biederman:
 "Oleg noticed that put_pid(thread_pid) was not getting called when proc
  was not compiled in.

  Let's get that fixed before 5.7 is released and causes problems for
  anyone"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Put thread_pid in release_task not proc_flush_pid

4 years agoMerge tag 'timers-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 25 Apr 2020 19:16:48 +0000 (12:16 -0700)]
Merge tag 'timers-urgent-2020-04-25' of git://git./linux/kernel/git/tip/tip

Pull timer fixlet from Ingo Molnar:
 "A single fix for a comment that may show up in DocBook output"

* tag 'timers-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vdso/datapage: Use correct clock mode name in comment

4 years agoMerge tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 25 Apr 2020 19:11:47 +0000 (12:11 -0700)]
Merge tag 'sched-urgent-2020-04-25' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Misc fixes:

   - an uclamp accounting fix

   - three frequency invariance fixes and a readability improvement"

* tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix reset-on-fork from RT with uclamp
  x86, sched: Move check for CPU type to caller function
  x86, sched: Don't enable static key when starting secondary CPUs
  x86, sched: Account for CPUs with less than 4 cores in freq. invariance
  x86, sched: Bail out of frequency invariance if base frequency is unknown

4 years agoMerge tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 25 Apr 2020 19:08:24 +0000 (12:08 -0700)]
Merge tag 'perf-urgent-2020-04-25' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Two changes:

   - fix exit event records

   - extend x86 PMU driver enumeration to add Intel Jasper Lake CPU
     support"

* tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: fix parent pid/tid in task exit events
  perf/x86/cstate: Add Jasper Lake CPU support

4 years agoMerge tag 'objtool-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 25 Apr 2020 18:52:02 +0000 (11:52 -0700)]
Merge tag 'objtool-urgent-2020-04-25' of git://git./linux/kernel/git/tip/tip

Pull objtool fixes from Ingo Molnar:
 "Two fixes: fix an off-by-one bug, and fix 32-bit builds on 64-bit
  systems"

* tag 'objtool-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix off-by-one in symbol_by_offset()
  objtool: Fix 32bit cross builds

4 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Sat, 25 Apr 2020 02:17:30 +0000 (19:17 -0700)]
Merge git://git./linux/kernel/git/netdev/net

Pull networking fixes from David Miller:

 1) Fix memory leak in netfilter flowtable, from Roi Dayan.

 2) Ref-count leaks in netrom and tipc, from Xiyu Yang.

 3) Fix warning when mptcp socket is never accepted before close, from
    Florian Westphal.

 4) Missed locking in ovs_ct_exit(), from Tonghao Zhang.

 5) Fix large delays during PTP synchornization in cxgb4, from Rahul
    Lakkireddy.

 6) team_mode_get() can hang, from Taehee Yoo.

 7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver,
    from Niklas Schnelle.

 8) Fix handling of bpf XADD on BTF memory, from Jann Horn.

 9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson.

10) Missing queue memory release in iwlwifi pcie code, from Johannes
    Berg.

11) Fix NULL deref in macvlan device event, from Taehee Yoo.

12) Initialize lan87xx phy correctly, from Yuiko Oshino.

13) Fix looping between VRF and XFRM lookups, from David Ahern.

14) etf packet scheduler assumes all sockets are full sockets, which is
    not necessarily true. From Eric Dumazet.

15) Fix mptcp data_fin handling in RX path, from Paolo Abeni.

16) fib_select_default() needs to handle nexthop objects, from David
    Ahern.

17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun.

18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca.

19) Correct rx/tx stats in bcmgenet driver, from Doug Berger.

20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix
    from Luke Nelson.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits)
  selftests/bpf: Fix a couple of broken test_btf cases
  tools/runqslower: Ensure own vmlinux.h is picked up first
  bpf: Make bpf_link_fops static
  bpftool: Respect the -d option in struct_ops cmd
  selftests/bpf: Add test for freplace program with expected_attach_type
  bpf: Propagate expected_attach_type when verifying freplace programs
  bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
  bpf, x86_32: Fix logic error in BPF_LDX zero-extension
  bpf, x86_32: Fix clobbering of dst for BPF_JSET
  bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
  bpf: Fix reStructuredText markup
  net: systemport: suppress warnings on failed Rx SKB allocations
  net: bcmgenet: suppress warnings on failed Rx SKB allocations
  macsec: avoid to set wrong mtu
  mac80211: sta_info: Add lockdep condition for RCU list usage
  mac80211: populate debugfs only after cfg80211 init
  net: bcmgenet: correct per TX/RX ring statistics
  net: meth: remove spurious copyright text
  net: phy: bcm84881: clear settings on link down
  chcr: Fix CPU hard lockup
  ...

4 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Sat, 25 Apr 2020 01:26:14 +0000 (18:26 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2020-04-24

The following pull-request contains BPF updates for your *net* tree.

We've added 17 non-merge commits during the last 5 day(s) which contain
a total of 19 files changed, 203 insertions(+), 85 deletions(-).

The main changes are:

1) link_update fix, from Andrii.

2) libbpf get_xdp_id fix, from David.

3) xadd verifier fix, from Jann.

4) x86-32 JIT fixes, from Luke and Wang.

5) test_btf fix, from Stanislav.

6) freplace verifier fix, from Toke.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoselftests/bpf: Fix a couple of broken test_btf cases
Stanislav Fomichev [Wed, 22 Apr 2020 00:37:53 +0000 (17:37 -0700)]
selftests/bpf: Fix a couple of broken test_btf cases

Commit 51c39bb1d5d1 ("bpf: Introduce function-by-function verification")
introduced function linkage flag and changed the error message from
"vlen != 0" to "Invalid func linkage" and broke some fake BPF programs.

Adjust the test accordingly.

AFACT, the programs don't really need any arguments and only look
at BTF for maps, so let's drop the args altogether.

Before:
BTF raw test[103] (func (Non zero vlen)): do_test_raw:3703:FAIL expected
err_str:vlen != 0
magic: 0xeb9f
version: 1
flags: 0x0
hdr_len: 24
type_off: 0
type_len: 72
str_off: 72
str_len: 10
btf_total_size: 106
[1] INT (anon) size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[2] INT (anon) size=4 bits_offset=0 nr_bits=32 encoding=(none)
[3] FUNC_PROTO (anon) return=0 args=(1 a, 2 b)
[4] FUNC func type_id=3 Invalid func linkage

BTF libbpf test[1] (test_btf_haskv.o): libbpf: load bpf program failed:
Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Validating test_long_fname_2() func#1...
Arg#0 type PTR in test_long_fname_2() is not supported yet.
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0

libbpf: -- END LOG --
libbpf: failed to load program 'dummy_tracepoint'
libbpf: failed to load object 'test_btf_haskv.o'
do_test_file:4201:FAIL bpf_object__load: -4007
BTF libbpf test[2] (test_btf_newkv.o): libbpf: load bpf program failed:
Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Validating test_long_fname_2() func#1...
Arg#0 type PTR in test_long_fname_2() is not supported yet.
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0

libbpf: -- END LOG --
libbpf: failed to load program 'dummy_tracepoint'
libbpf: failed to load object 'test_btf_newkv.o'
do_test_file:4201:FAIL bpf_object__load: -4007
BTF libbpf test[3] (test_btf_nokv.o): libbpf: load bpf program failed:
Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Validating test_long_fname_2() func#1...
Arg#0 type PTR in test_long_fname_2() is not supported yet.
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0

libbpf: -- END LOG --
libbpf: failed to load program 'dummy_tracepoint'
libbpf: failed to load object 'test_btf_nokv.o'
do_test_file:4201:FAIL bpf_object__load: -4007

Fixes: 51c39bb1d5d1 ("bpf: Introduce function-by-function verification")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200422003753.124921-1-sdf@google.com
4 years agotools/runqslower: Ensure own vmlinux.h is picked up first
Andrii Nakryiko [Wed, 22 Apr 2020 01:24:07 +0000 (18:24 -0700)]
tools/runqslower: Ensure own vmlinux.h is picked up first

Reorder include paths to ensure that runqslower sources are picking up
vmlinux.h, generated by runqslower's own Makefile. When runqslower is built
from selftests/bpf, due to current -I$(BPF_INCLUDE) -I$(OUTPUT) ordering, it
might pick up not-yet-complete vmlinux.h, generated by selftests Makefile,
which could lead to compilation errors like [0]. So ensure that -I$(OUTPUT)
goes first and rely on runqslower's Makefile own dependency chain to ensure
vmlinux.h is properly completed before source code relying on it is compiled.

  [0] https://travis-ci.org/github/libbpf/libbpf/jobs/677905925

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200422012407.176303-1-andriin@fb.com