git.osdn.net Git - uclinux-h8/linux.git/log

samples: bpf: Fix uninitialized variable in xdp_redirect_cpu

While at it, also improve help output when CPU number is greater than
possible.

Fixes: e531a220cc59 ("samples: bpf: Convert xdp_redirect_cpu to XDP samples helper")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210826120910.454081-1-memxor@gmail.com

selftests/bpf: Reduce more flakyness in sockmap_listen

This patch adds similar retry logic to more places where read() is used, to
reduce flakyness in slow CI environment.

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825184745.2680830-1-fallentree@fb.com

bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS

This commit fixes linker errors along the lines of:

s390-linux-ld: task_iter.c:(.init.text+0xa4): undefined reference to `btf_task_struct_ids'`

Fix by defining btf_task_struct_ids unconditionally in kernel/bpf/btf.c
since there exists code that unconditionally uses btf_task_struct_ids.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/05d94748d9f4b3eecedc4fddd6875418a396e23c.1629942444.git.dxu@dxuuu.xyz

Merge branch 'bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt'

Martin KaFai says:

====================

This set allows the bpf-tcp-cc to call bpf_setsockopt.  One use
case is to allow a bpf-tcp-cc switching to another cc during init().
For example, when the tcp flow is not ecn ready, the bpf_dctcp
can switch to another cc by calling setsockopt(TCP_CONGESTION).

bpf_getsockopt() is also added to have a symmetrical API, so
less usage surprise.

v2:
- Not allow switching to kernel's tcp_cdg because it is the only
  kernel tcp-cc that stores a pointer to icsk_ca_priv.
  Please see the commit log in patch 1 for details.
  Test is added in patch 4 to check switching to tcp_cdg.
- Refactor the logic finding the offset of a func ptr
  in the "struct tcp_congestion_ops" to prog_ops_moff()
  in patch 1.
- bpf_setsockopt() has been disabled in release() since v1 (please
  see commit log in patch 1 for reason).  bpf_getsockopt() is
  also disabled together in release() in v2 to avoid usage surprise
  because both of them are usually expected to be available together.
  bpf-tcp-cc can already use PTR_TO_BTF_ID to read from tcp_sock.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Add dctcp fallback test

This patch makes the bpf_dctcp test to fallback to cubic by
using setsockopt(TCP_CONGESTION) when the tcp flow is not
ecn ready.

It also checks setsockopt() is not available to release().

The settimeo() from the network_helpers.h is used, so the local
one is removed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173026.3979130-1-kafai@fb.com

bpf: selftests: Add connect_to_fd_opts to network_helpers

The next test requires to setsockopt(TCP_CONGESTION) before
connect(), so a new arg is needed for the connect_to_fd() to specify
the cc's name.

This patch adds a new "struct network_helper_opts" for the future
option needs. It starts with the "cc" and "timeout_ms" option.
A new helper connect_to_fd_opts() is added to take the new
"const struct network_helper_opts *opts" as an arg.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173019.3977910-1-kafai@fb.com

bpf: selftests: Add sk_state to bpf_tcp_helpers.h

Add sk_state define to bpf_tcp_helpers.h. Rename the existing
global variable "sk_state" in the kfunc_call test to "sk_state_res".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173013.3977316-1-kafai@fb.com

bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt

This patch allows the bpf-tcp-cc to call bpf_setsockopt.  One use
case is to allow a bpf-tcp-cc switching to another cc during init().
For example, when the tcp flow is not ecn ready, the bpf_dctcp
can switch to another cc by calling setsockopt(TCP_CONGESTION).

During setsockopt(TCP_CONGESTION), the new tcp-cc's init() will be
called and this could cause a recursion but it is stopped by the
current trampoline's logic (in the prog->active counter).

While retiring a bpf-tcp-cc (e.g. in tcp_v[46]_destroy_sock()),
the tcp stack calls bpf-tcp-cc's release().  To avoid the retiring
bpf-tcp-cc making further changes to the sk, bpf_setsockopt is not
available to the bpf-tcp-cc's release().  This will avoid release()
making setsockopt() call that will potentially allocate new resources.

Although the bpf-tcp-cc already has a more powerful way to read tcp_sock
from the PTR_TO_BTF_ID, it is usually expected that bpf_getsockopt and
bpf_setsockopt are available together.  Thus, bpf_getsockopt() is also
added to all tcp_congestion_ops except release().

When the old bpf-tcp-cc is calling setsockopt(TCP_CONGESTION)
to switch to a new cc, the old bpf-tcp-cc will be released by
bpf_struct_ops_put().  Thus, this patch also puts the bpf_struct_ops_map
after a rcu grace period because the trampoline's image cannot be freed
while the old bpf-tcp-cc is still running.

bpf-tcp-cc can only access icsk_ca_priv as SCALAR.  All kernel's
tcp-cc is also accessing the icsk_ca_priv as SCALAR.   The size
of icsk_ca_priv has already been raised a few times to avoid
extra kmalloc and memory referencing.  The only exception is the
kernel's tcp_cdg.c that stores a kmalloc()-ed pointer in icsk_ca_priv.
To avoid the old bpf-tcp-cc accidentally overriding this tcp_cdg's pointer
value stored in icsk_ca_priv after switching and without over-complicating
the bpf's verifier for this one exception in tcp_cdg, this patch does not
allow switching to tcp_cdg.  If there is a need, bpf_tcp_cdg can be
implemented and then use the bpf_sk_storage as the extended storage.

bpf_sk_setsockopt proto has only been recently added and used
in bpf-sockopt and bpf-iter-tcp, so impose the tcp_cdg limitation in the
same proto instead of adding a new proto specifically for bpf-tcp-cc.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173007.3976921-1-kafai@fb.com

Merge branch 'selftests: xsk: various simplifications'

Magnus Karlsson says:

====================

This patch set mainly contains various simplifications to the xsk
selftests. The only exception is the introduction of packet streams
that describes what the Tx process should send and what the Rx process
should receive. If it receives anything else, the test fails. This
mechanism can be used to produce tests were all packets are not
received by the Rx thread or modified in some way. An example of this
is if an XDP program does XDP_PASS on some of the packets.

This patch set will be followed by another patch set that implements a
new structure that will facilitate adding new tests. A couple of new
tests will also be included in that patch set.

v2 -> v3:

* Reworked patch 12 so that it now has functions for creating and
  destroying ifobjects. Simplifies the code. [Maciej]
* The packet stream now allocates the supplied buffer array length,
  instead of the default one. [Maciej]
* pkt_stream_get_pkt() now returns NULL when indexing a non-existing
  packet. [Maciej]
* pkt_validate() is now is_pkt_valid(). [Maciej]
* Slowed down packet sending speed even more in patch 11 so that slow
  systems do not silenty drop packets in skb mode.

v1 -> v2:

* Dropped the patch with per process limit changes as it is not needed
  [Yonghong]
* Improved the commit message of patch 1 [Yonghong]
* Fixed a spelling error in patch 9

Thanks: Magnus
====================

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests: xsk: Preface options with opt

Preface all options with opt_ and make them booleans.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-17-magnus.karlsson@gmail.com

selftests: xsk: Make enums lower case

Make enums lower case as that is the standard. Also drop the
unnecessary TEST_MODE_UNCONFIGURED mode.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-16-magnus.karlsson@gmail.com

selftests: xsk: Generate packets from specification

Generate packets from a specification instead of something hard
coded. The idea is that a test generates one or more packet
specifications and provides it/them to both Tx and Rx. The Tx thread
will generate from this specification and Rx will validate that it
receives what is in the specification. The specification can be the
same on both ends, meaning that everything that was sent should be
received, or different which means that Rx will only receive part of
the sent packets.

Currently, the packet specification is the same for both Rx and Tx and
the same for each test. This will change in later work as features
and tests are added.

The data path functions are also renamed to better reflect what
actions they are performing after introducing this feature.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-15-magnus.karlsson@gmail.com

selftests: xsk: Generate packet directly in umem

Generate the packet directly in the umem instead of in a temporary
buffer that is copied out. Simplifies the code and improves
performance.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-14-magnus.karlsson@gmail.com

selftests: xsk: Simplify cleanup of ifobjects

Simpify the cleanup of ifobjects right before the program exits by
introducing functions for creating and destroying these objects.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-13-magnus.karlsson@gmail.com

selftests: xsk: Decrease sending speed

Decrease sending speed to avoid potentially overflowing some buffers
in the skb case that leads to dropped packets we cannot control (and
thus the tests may generate false negatives). Decrease batch size and
introduce a usleep in the transmit thread to not overflow the
receiver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-12-magnus.karlsson@gmail.com

selftests: xsk: Validate tx stats on tx thread

Validate the tx stats on the Tx thread instead of the Rx
thread. Depending on your settings, you might not be allowed to query
the statistics of a socket you do not own, so better to do this on the
correct thread to start with.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-11-magnus.karlsson@gmail.com

selftests: xsk: Simplify packet validation in xsk tests

Simplify packet validation in the xsk selftests by performing it at
once for every packet. The current code performed this per batch and
did this on copied packet data. Make it simpler and faster by
validating it at once and on the umem packet data thus skipping the
copy and the memory allocation for the temprary buffer.

The optional packet dump feature is also simplified in the same
manner. Memory allocation and copying is removed and the dump is
performed directly on the umem data.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-10-magnus.karlsson@gmail.com

selftests: xsk: Rename worker_* functions that are not thread entry points

Rename worker_* functions that are not thread entry points to
something else. This was confusing. Now only thread entry points are
worker_something.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-9-magnus.karlsson@gmail.com

selftests: xsk: Disassociate umem size with packets sent

Disassociate the number of packets sent with the number of buffers in
the umem. This so we can loop over the umem to test more things. Set
the size of the umem to be a multiple of 2M. A requirement for huge
pages that are needed in unaligned mode.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-8-magnus.karlsson@gmail.com

selftests: xsk: Remove end-of-test packet

Get rid of the end-of-test packet and just count the number of packets
received and quit when the expected number as been
received. Simplifies the code.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-7-magnus.karlsson@gmail.com

selftests: xsk: Simplify the retry code

Simplify the retry code and make it more efficient by waiting first,
instead of trying immediately which always fails due to the
asynchronous nature of xsk socket close. Also decrease the wait time
to significantly lower the run-time of the test suite.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-6-magnus.karlsson@gmail.com

selftests: xsk: Return correct error codes

Return the correct error codes so they can be printed correctly.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-5-magnus.karlsson@gmail.com

selftests: xsk: Remove unused variables

Remove unused variables and typedefs. The *_npkts variables are
incremented but never used.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-4-magnus.karlsson@gmail.com

selftests: xsk: Remove the num_tx_packets option

Remove the number of tx packet option as this should be decided by the
test itself. Also change the number of packets to be sent to 4096
speeding up the execution.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-3-magnus.karlsson@gmail.com

selftests: xsk: Remove color mode

Remove color mode since it does not add any value and having less code
means less maintenance which is a good thing.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210825093722.10219-2-magnus.karlsson@gmail.com

Merge branch 'bpf: Add bpf_task_pt_regs() helper'

Daniel Xu says:

====================

The motivation behind this helper is to access userspace pt_regs in a
kprobe handler.

uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
kprobe handler. The final case (kernelspace pt_regs in uprobe) is
pretty rare (usermode helper) so I think that can be solved later if
necessary.

More concretely, this helper is useful in doing BPF-based DWARF stack
unwinding. Currently the kernel can only do framepointer based stack
unwinds for userspace code. This is because the DWARF state machines are
too fragile to be computed in kernelspace [0]. The idea behind
DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
stack (while in prog context) and send it up to userspace for unwinding
(probably with libunwind) [1]. This would effectively enable profiling
applications with -fomit-frame-pointer using kprobes and uprobes.

[0]: https://lkml.org/lkml/2012/2/10/356
[1]: https://github.com/danobi/bpf-dwarf-walk

Changes from v1:
- Conwolidate BTF_ID decls for task_struct
- Enable bpf_get_current_task_btf() for all prog types
- Enable bpf_task_pt_regs() for all prog types
- Use ASSERT_* macros instead of CHECK
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: selftests: Add bpf_task_pt_regs() selftest

This test retrieves the uprobe's pt_regs in two different ways and
compares the contents in an arch-agnostic way.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/5581eb8800f6625ec8813fe21e9dce1fbdef4937.1629772842.git.dxu@dxuuu.xyz

bpf: Add bpf_task_pt_regs() helper

The motivation behind this helper is to access userspace pt_regs in a
kprobe handler.

uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
kprobe handler. The final case (kernelspace pt_regs in uprobe) is
pretty rare (usermode helper) so I think that can be solved later if
necessary.

More concretely, this helper is useful in doing BPF-based DWARF stack
unwinding. Currently the kernel can only do framepointer based stack
unwinds for userspace code. This is because the DWARF state machines are
too fragile to be computed in kernelspace [0]. The idea behind
DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
stack (while in prog context) and send it up to userspace for unwinding
(probably with libunwind) [1]. This would effectively enable profiling
applications with -fomit-frame-pointer using kprobes and uprobes.

[0]: https://lkml.org/lkml/2012/2/10/356
[1]: https://github.com/danobi/bpf-dwarf-walk

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/e2718ced2d51ef4268590ab8562962438ab82815.1629772842.git.dxu@dxuuu.xyz

bpf: Extend bpf_base_func_proto helpers with bpf_get_current_task_btf()

bpf_get_current_task() is already supported so it's natural to also
include the _btf() variant for btf-powered helpers.

This is required for non-tracing progs to use bpf_task_pt_regs() in the
next commit.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/f99870ed5f834c9803d73b3476f8272b1bb987c0.1629772842.git.dxu@dxuuu.xyz

bpf: Consolidate task_struct BTF_ID declarations

No need to have it defined 5 times. Once is enough.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/6dcefa5bed26fe1226f26683f36819bb53ec19a2.1629772842.git.dxu@dxuuu.xyz

bpf: Add BTF_ID_LIST_GLOBAL_SINGLE macro

Same as BTF_ID_LIST_SINGLE macro except defines a global ID.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/a867a97517df42fd3953eeb5454402b57e74538f.1629772842.git.dxu@dxuuu.xyz

Merge branch 'Improve XDP samples usability and output'

Kumar Kartikeya says:

====================

This set revamps XDP samples related to redirection to show better output and
implement missing features consolidating all their differences and giving them a
consistent look and feel, by implementing common features and command line
options.  Some of the TODO items like reporting redirect error numbers
(ENETDOWN, EINVAL, ENOSPC, etc.) have also been implemented.

Some of the features are:
* Received packet statistics
* xdp_redirect/xdp_redirect_map tracepoint statistics
* xdp_redirect_err/xdp_redirect_map_err tracepoint statistics (with support for
  showing exact errno)
* xdp_cpumap_enqueue/xdp_cpumap_kthread tracepoint statistics
* xdp_devmap_xmit tracepoint statistics
* xdp_exception tracepoint statistics
* Per ifindex pair devmap_xmit stats shown dynamically (for xdp_monitor) to
  decompose the total.
* Use of BPF skeleton and BPF static linking to share BPF programs.
* Use of vmlinux.h and tp_btf for raw_tracepoint support.
* Removal of redundant -N/--native-mode option (enforced by default now)
* ... and massive cleanups all over the place.

All tracepoints also use raw_tp now, and tracepoints like xdp_redirect
are only enabled when requested explicitly to capture successful redirection
statistics.

The set of programs converted as part of this series are:
* xdp_redirect_cpu
* xdp_redirect_map_multi
* xdp_redirect_map
* xdp_redirect
* xdp_monitor

Explanation of the output:

There is now a concise output mode by default that shows primarily four fields:
  rx/s        Number of packets received per second
  redir/s     Number of packets successfully redirected per second
  err,drop/s  Aggregated count of errors per second (including dropped packets)
  xmit/s      Number of packets transmitted on the output device per second

Some examples:
; sudo ./xdp_redirect_map veth0 veth1 -s
Redirecting from veth0 (ifindex 15; driver veth) to veth1 (ifindex 14; driver veth)
veth0->veth1                    0 rx/s                  0 redir/s 0 err,drop/s               0 xmit/s
veth0->veth1            9,998,660 rx/s          9,998,658 redir/s 0 err,drop/s       9,998,654 xmit/s
...

There is also a verbose mode, that can also be enabled by default using -v (--verbose).
The output mode can be switched dynamically at runtime using Ctrl + \ (SIGQUIT).

To make the concise output more useful, the errors that occur are expanded inline
(as if verbose mode was enabled) to let the user pin down the source of the
problem without having to clutter output (or possibly miss it) or always use verbose mode.

For instance, let's consider a case where the output device link state is set to
down while redirection is happening:

[...]
veth0->veth1           24,503,376 rx/s                  0 err,drop/s      24,503,372 xmit/s
veth0->veth1           25,044,775 rx/s                  0 err,drop/s      25,044,783 xmit/s
veth0->veth1           25,263,046 rx/s                  4 err,drop/s      25,263,028 xmit/s
  redirect_err                  4 error/s
    ENETDOWN                    4 error/s
[...]

The same holds for xdp_exception actions.

An example of how a complete xdp_redirect_map session would look:

; sudo ./xdp_redirect_map veth0 veth1
Redirecting from veth0 (ifindex 5; driver veth) to veth1 (ifindex 4; driver veth)
veth0->veth1            7,411,506 rx/s                  0 err,drop/s    7,411,470 xmit/s
veth0->veth1            8,931,770 rx/s                  0 err,drop/s    8,931,771 xmit/s
^\
veth0->veth1            8,787,295 rx/s                  0 err,drop/s    8,787,325 xmit/s
  receive total         8,787,295 pkt/s                 0 drop/s                0 error/s
    cpu:7               8,787,295 pkt/s                 0 drop/s                0 error/s
  redirect_err                  0 error/s
  xdp_exception                 0 hit/s
  xmit veth0->veth1     8,787,325 xmit/s                0 drop/s                0 drv_err/s          2.00 bulk-avg
     cpu:7              8,787,325 xmit/s                0 drop/s                0 drv_err/s          2.00 bulk-avg

veth0->veth1            8,842,610 rx/s                  0 err,drop/s    8,842,606 xmit/s
  receive total         8,842,610 pkt/s                 0 drop/s                0 error/s
    cpu:7               8,842,610 pkt/s                 0 drop/s                0 error/s
  redirect_err                  0 error/s
  xdp_exception                 0 hit/s
  xmit veth0->veth1     8,842,606 xmit/s                0 drop/s                0 drv_err/s          2.00 bulk-avg
     cpu:7              8,842,606 xmit/s                0 drop/s                0 drv_err/s          2.00 bulk-avg

^C
  Packets received    : 33,973,181
  Average packets/s   : 4,246,648
  Packets transmitted : 33,973,172
  Average transmit/s  : 4,246,647

The xdp_redirect tracepoint (for success stats) needs to be enabled explicitly
using --stats/-s. Documentation for entire output and options is provided when
user specifies --help/-h with a sample.

Changelog:
----------
v3 -> v4:
v3: https://lore.kernel.org/bpf/20210728165552.435050-1-memxor@gmail.com

* Address all feedback from Daniel
  * Use READ_ONCE/WRITE_ONCE from linux/compiler.h (cannot directly include
    due to conflicts with vmlinux.h)
  * Fix MAX_CPUS hardcoding by switching to mmapable array maps, that are
    resized based on the value of libbpf_num_possible_cpus
  * s/ELEMENTS_OF/ARRAY_SIZE/g
  * Use tools/include/linux/hashtable.h
  * Coding style fixes
  * Remove hyperlinks for tracepoints
  * Split into smaller reviewable changes
* Restore support for specifying custom xdp_redirect_cpu cpumap prog with some
   enhancements, including built-in programs for common actions (pass, drop,
   redirect). By default, cpumap prog is now disabled.
* Misc bug fixes all over the place

  The printing stuff is a lot more basic without hyperlink support, hence it
  has not been exported into a more general facility.

v2 -> v3
v2: https://lore.kernel.org/bpf/20210721212833.701342-1-memxor@gmail.com

* Address all feedback from Andrii
  * Replace usage of libbpf hashmap (internal API) with custom one
  * Rename ATOMIC_* macros to NO_TEAR_* to better reflect their use
  * Use size_t as a portable word sized data type
  * Set libbpf_set_strict_mode
  * Invert conditions in BPF programs to exit early and reduce nesting
  * Use canonical SEC("xdp") naming for all XDP BPF progams
* Add missing help description for cpumap enqueue and kthread tracepoints
* Move private struct declarations from xdp_sample_user.h to .c file
* Improve help output for cpumap enqueue and cpumap kthread tracepoints
* Fix a bug where keys array for BPF_MAP_LOOKUP_BATCH is overallocated
* Fix some conditions for printing stats (earlier only checked pps, now pps,
   drop, err and print if any is greater than zero)
* Fix alloc_stats_record to properly return and cleanup allocated memory on
   allocation failure instead of calling exit(3)
* Bump bpf_map_lookup_batch count to 32 to reduce lookup time with multiple
   devices in map
* Fix a bug where devmap_xmit_multi stats are not printed when previous record
   is missing (i.e. when the first time stats are printed), by simply using a
   dummy record that is zeroed out
* Also print per-CPU counts for devmap_xmit_multi which we collect already
* Change mac_map to be BPF_MAP_TYPE_HASH instead of array to prevent resizing
   to a large size when max_ifindex is high, in xdp_redirect_map_multi
* Fix instance of strerror(errno) in sample_install_xdp to use saved errno
* Provide a usage function from samples helper
* Provide a fix where incorrect stats are shown for parallel sessions of
   xdp_redirect_* samples by introducing matching support for input device(s),
   output device(s) and cpumap map id for enqueue and kthread stats.
   Only xdp_monitor doesn't filter stats, all others do.

RFC (v1) -> v2
RFC (v1): https://lore.kernel.org/bpf/20210528235250.2635167-1-memxor@gmail.com

* Address all feedback from Andrii
   * Use BPF static linking
   * Use vmlinux.h
   * Use BPF_PROG macro
   * Use global variables instead of maps
* Use of tp_btf for raw_tracepoint progs
* Switch to timerfd for polling
* Use libbpf hashmap for maintaing device sets for per ifindex pair
   devmap_xmit stats
* Fix Makefile to specify object dependencies properly
* Use in-tree bpftool
* ... misc fixes and cleanups all over the place
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

samples: bpf: Convert xdp_redirect_map_multi to XDP samples helper

Use the libbpf skeleton facility and other utilities provided by XDP
samples helper. Also adapt to change of type of mac address map, so that
no resizing is required.

Add a new flag for sample mask that skips priting the
from_device->to_device heading for each line, as xdp_redirect_map_multi
may have two devices but the flow of data may be bidirectional, so the
output would be confusing.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-23-memxor@gmail.com

samples: bpf: Convert xdp_redirect_map_multi_kern.o to XDP samples helper

One of the notable changes is using a BPF_MAP_TYPE_HASH instead of array
map to store mac addresses of devices, as the resizing behavior was
based on max_ifindex, which unecessarily maximized the capacity of map
beyond what was needed.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-22-memxor@gmail.com

samples: bpf: Convert xdp_redirect_map to XDP samples helper

Use the libbpf skeleton facility and other utilities provided by XDP
samples helper.

Since get_mac_addr is already provided by XDP samples helper, we drop
it. Also convert to XDP samples helper similar to prior samples to
minimize duplication of code.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-21-memxor@gmail.com

samples: bpf: Convert xdp_redirect_map_kern.o to XDP samples helper

Also update it to use consistent SEC("xdp") and SEC("xdp_devmap")
naming, and use global variable instead of BPF map for copying the mac
address.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-20-memxor@gmail.com

samples: bpf: Convert xdp_redirect_cpu to XDP samples helper

Use the libbpf skeleton facility and other utilities provided by XDP
samples helper.

Similar to xdp_monitor, xdp_redirect_cpu was quite featureful except a
few minor omissions (e.g. redirect errno reporting). All of these have
been moved to XDP samples helper, hence drop the unneeded code and
convert to usage of helpers provided by it.

One of the important changes here is dropping of mprog-disable option,
as we make that the default. Also, we support built-in programs for some
common actions on the packet when it reaches kthread (pass, drop,
redirect to device). If the user still needs to install a custom
program, they can still supply a BPF object, however the program should
be suitably tagged with SEC("xdp_cpumap") annotation so that the
expected attach type is correct when updating our cpumap map element.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-19-memxor@gmail.com

samples: bpf: Convert xdp_redirect_cpu_kern.o to XDP samples helper

Similar to xdp_monitor_kern, a lot of these BPF programs have been
reimplemented properly consolidating missing features from other XDP
samples. Hence, drop the unneeded code and rename to .bpf.c suffix.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-18-memxor@gmail.com

samples: bpf: Convert xdp_redirect to XDP samples helper

Use the libbpf skeleton facility and other utilities provided by XDP
samples helper.

One important note:
The XDP samples helper handles ownership of installed XDP programs on
devices, including responding to SIGINT and SIGTERM, so drop the code
here and use the helpers we provide going forward for all xdp_redirect*
conversions.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-17-memxor@gmail.com

samples: bpf: Convert xdp_redirect_kern.o to XDP samples helper

We moved swap_src_dst_mac to xdp_sample.bpf.h to be shared with other
potential users, so drop it while moving code to the new file.
Also, consistently use SEC("xdp") naming instead.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-16-memxor@gmail.com

samples: bpf: Convert xdp_monitor to XDP samples helper

Use the libbpf skeleton facility and other utilities provided by XDP
samples helper.

A lot of the code in xdp_monitor and xdp_redirect_cpu has been moved to
the xdp_sample_user.o helper, so we remove the duplicate functions here
that are no longer needed.

Thanks to BPF skeleton, we no longer depend on order of tracepoints to
uninstall them on startup. Instead, the sample mask is used to install
the needed tracepoints.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-15-memxor@gmail.com

samples: bpf: Convert xdp_monitor_kern.o to XDP samples helper

We already moved all the functionality it provided in XDP samples helper
userspace and kernel BPF object, so just delete the unneeded code.

We also add generation of BPF skeleton and compilation using clang
-target bpf for files ending with .bpf.c suffix (to denote that they use
vmlinux.h).

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-14-memxor@gmail.com

samples: bpf: Add vmlinux.h generation support

Also, take this opportunity to depend on in-tree bpftool, so that we can
use static linking support in subsequent commits for XDP samples BPF
helper object.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-13-memxor@gmail.com

samples: bpf: Add devmap_xmit tracepoint statistics support

This adds support for retrieval and printing for devmap_xmit total and
mutli mode tracepoint. For multi mode, we keep a hash map entry for each
redirection stream, such that we can dynamically add and remove entries
on output.

The from_match and to_match will be set by individual samples when
setting up the XDP program on these devices.

The multi mode tracepoint is also handy for xdp_redirect_map_multi,
where up to 32 devices can be specified.

Also add samples_init_pre_load macro to finally set up the resized maps
and mmap them in place for low overhead stats retrieval.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-12-memxor@gmail.com

samples: bpf: Add BPF support for devmap_xmit tracepoint

This adds support for the devmap_xmit tracepoint, and its multi device
variant that can be used to obtain streams for each individual
net_device to net_device redirection. This is useful for decomposing
total xmit stats in xdp_monitor.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-11-memxor@gmail.com

samples: bpf: Add cpumap tracepoint statistics support

This consolidates retrieval and printing into the XDP sample helper. For
the kthread stats, it expands xdp_stats separately with its own per-CPU
stats. For cpumap enqueue, we display FROM->TO stats also with its
per-CPU stats.

The help out explains in detail the various aspects of the output.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-10-memxor@gmail.com

samples: bpf: Add BPF support for cpumap tracepoints

These are invoked in two places, when the XDP frame or SKB (for generic
XDP) enqueued to the ptr_ring (cpumap_enqueue) and when kthread processes
the frame after invoking the CPUMAP program for it (returning stats for
the batch).

We use cpumap_map_id to filter on the map_id as a way to avoid printing
incorrect stats for parallel sessions of xdp_redirect_cpu.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-9-memxor@gmail.com

samples: bpf: Add xdp_exception tracepoint statistics support

This implements the retrieval and printing, as well the help output.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-8-memxor@gmail.com

samples: bpf: Add BPF support for xdp_exception tracepoint

This would allow us to store stats for each XDP action, including their
per-CPU counts. Consolidating this here allows all redirect samples to
detect xdp_exception events.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-7-memxor@gmail.com

samples: bpf: Add redirect tracepoint statistics support

This implements per-errno reporting (for the ones we explicitly
recognize), adds some help output, and implements the stats retrieval
and printing functions.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-6-memxor@gmail.com

samples: bpf: Add BPF support for redirect tracepoint

This adds the shared BPF file that will be used going forward for
sharing tracepoint programs among XDP redirect samples.

Since vmlinux.h conflicts with tools/include for READ_ONCE/WRITE_ONCE
and ARRAY_SIZE, they are copied in to xdp_sample.bpf.h along with other
helpers that will be required.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-5-memxor@gmail.com

samples: bpf: Add basic infrastructure for XDP samples

This file implements some common helpers to consolidate differences in
features and functionality between the various XDP samples and give them
a consistent look, feel, and reporting capabilities.

This commit only adds support for receive statistics, which does not
rely on any tracepoint, but on the XDP program installed on the device
by each XDP redirect sample.

Some of the key features are:
* A concise output format accompanied by helpful text explaining its
   fields.
* An elaborate output format building upon the concise one, and folding
   out details in case of errors and staying out of view otherwise.
* Printing driver names for devices redirecting packets.
* Getting mac address for interface.
* Printing summarized total statistics for the entire session.
* Ability to dynamically switch between concise and verbose mode, using
   SIGQUIT (Ctrl + \).

In later patches, the support will be extended for each tracepoint with
its own custom output in concise and verbose mode.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-4-memxor@gmail.com

tools: include: Add ethtool_drvinfo definition to UAPI header

Instead of copying the whole header in, just add the struct definitions
we need for now. In the future it can be synced as a copy of in-tree
header if required.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-3-memxor@gmail.com

samples: bpf: Fix a couple of warnings

cookie_uid_helper_example.c: In function ‘main’:
cookie_uid_helper_example.c:178:69: warning: ‘ -j ACCEPT’ directive
writing 10 bytes into a region of size between 8 and 58
[-Wformat-overflow=]
  178 |  sprintf(rules, "iptables -A OUTPUT -m bpf --object-pinned %s -j ACCEPT",
      |        ^~~~~~~~~~
/home/kkd/src/linux/samples/bpf/cookie_uid_helper_example.c:178:9: note:
‘sprintf’ output between 53 and 103 bytes into a destination of size 100
  178 |  sprintf(rules, "iptables -A OUTPUT -m bpf --object-pinned %s -j ACCEPT",
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  179 |         file);
      |         ~~~~~

Fix by using snprintf and a sufficiently sized buffer.

tracex4_user.c:35:15: warning: ‘write’ reading 12 bytes from a region of
size 11 [-Wstringop-overread]
   35 |         key = write(1, "\e[1;1H\e[2J", 12); /* clear screen */
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use size as 11.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210821002010.845777-2-memxor@gmail.com

bpf: Fix possible out of bound write in narrow load handling

Fix a verifier bug found by smatch static checker in [0].

This problem has never been seen in prod to my best knowledge. Fixing it
still seems to be a good idea since it's hard to say for sure whether
it's possible or not to have a scenario where a combination of
convert_ctx_access() and a narrow load would lead to an out of bound
write.

When narrow load is handled, one or two new instructions are added to
insn_buf array, but before it was only checked that

cnt >= ARRAY_SIZE(insn_buf)

And it's safe to add a new instruction to insn_buf[cnt++] only once. The
second try will lead to out of bound write. And this is what can happen
if `shift` is set.

Fix it by making sure that if the BPF_RSH instruction has to be added in
addition to BPF_AND then there is enough space for two more instructions
in insn_buf.

The full report [0] is below:

kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array

kernel/bpf/verifier.c
    12282
    12283 insn->off = off & ~(size_default - 1);
    12284 insn->code = BPF_LDX | BPF_MEM | size_code;
    12285 }
    12286
    12287 target_size = 0;
    12288 cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
    12289 &target_size);
    12290 if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bounds check.

    12291     (ctx_field_size && !target_size)) {
    12292 verbose(env, "bpf verifier is misconfigured\n");
    12293 return -EINVAL;
    12294 }
    12295
    12296 if (is_narrower_load && size < target_size) {
    12297 u8 shift = bpf_ctx_narrow_access_offset(
    12298 off, size, size_default) * 8;
    12299 if (ctx_field_size <= 4) {
    12300 if (shift)
    12301 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
                                                         ^^^^^
increment beyond end of array

    12302 insn->dst_reg,
    12303 shift);
--> 12304 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
                                                 ^^^^^
out of bounds write

    12305 (1 << size * 8) - 1);
    12306 } else {
    12307 if (shift)
    12308 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
    12309 insn->dst_reg,
    12310 shift);
    12311 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
                                        ^^^^^^^^^^^^^^^
Same.

    12312 (1ULL << size * 8) - 1);
    12313 }
    12314 }
    12315
    12316 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
    12317 if (!new_prog)
    12318 return -ENOMEM;
    12319
    12320 delta += cnt - 1;
    12321
    12322 /* keep walking new program and skip insns we just inserted */
    12323 env->prog = new_prog;
    12324 insn      = new_prog->insnsi + i + delta;
    12325 }
    12326
    12327 return 0;
    12328 }

[0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/

v1->v2:
- clarify that problem was only seen by static checker but not in prod;

Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com

Merge branch 'bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_SK_MSG'

Xu Liu says:

====================

We'd like to be able to identify netns from sk_msg hooks
to accelerate local process communication form different netns.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Test for get_netns_cookie

Add test to use get_netns_cookie() from BPF_PROG_TYPE_SK_MSG.

Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820071712.52852-3-liuxu623@gmail.com

bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_SK_MSG

We'd like to be able to identify netns from sk_msg hooks
to accelerate local process communication form different netns.

Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820071712.52852-2-liuxu623@gmail.com

Merge branch 'selftests/bpf: minor fixups'

Li Zhijian says:

====================
Fix a few issues reported by 0Day/LKP during runing selftests/bpf.

Changelog:
V2:
- folded previous similar standalone patch to [1/5], and add acked tag
from Song Liu
- add acked tag to [2/5], [3/5] from Song Liu
- [4/5]: move test_bpftool.py to TEST_PROGS_EXTENDED, files in TEST_GEN_PROGS_EXTENDED
are generated by make. Otherwise, it will break out-of-tree install:
'make O=/kselftest-build SKIP_TARGETS= V=1 -C tools/testing/selftests install INSTALL_PATH=/kselftest-install'
- [5/5]: new patch
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Exit with KSFT_SKIP if no Makefile found

This would happend when we run the tests after install kselftests
root@lkp-skl-d01 ~# /kselftests/run_kselftest.sh -t bpf:test_doc_build.sh
TAP version 13
1..1
# selftests: bpf: test_doc_build.sh
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
         LANGUAGE = (unset),
         LC_ALL = (unset),
         LC_ADDRESS = "en_US.UTF-8",
         LC_NAME = "en_US.UTF-8",
         LC_MONETARY = "en_US.UTF-8",
         LC_PAPER = "en_US.UTF-8",
         LC_IDENTIFICATION = "en_US.UTF-8",
         LC_TELEPHONE = "en_US.UTF-8",
         LC_MEASUREMENT = "en_US.UTF-8",
         LC_TIME = "en_US.UTF-8",
         LC_NUMERIC = "en_US.UTF-8",
         LANG = "en_US.UTF-8"
     are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
# skip:    bpftool files not found!
#
ok 1 selftests: bpf: test_doc_build.sh # SKIP

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820025549.28325-1-lizhijian@cn.fujitsu.com

selftests/bpf: Add missing files required by test_bpftool.sh for installing

test_bpftool.sh relies on bpftool and test_bpftool.py.

'make install' will install bpftool to INSTALL_PATH/bpf/bpftool, and
export it to PATH so that it can be used after installing.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820015556.23276-5-lizhijian@cn.fujitsu.com

selftests/bpf: Add default bpftool built by selftests to PATH

For 'make run_tests':
selftests will build bpftool into tools/testing/selftests/bpf/tools/sbin/bpftool
by default.

==================
root@lkp-skl-d01 /opt/rootfs/v5.14-rc4# make -C tools/testing/selftests/bpf run_tests
make: Entering directory '/opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf'
  MKDIR    include
  MKDIR    libbpf
  MKDIR    bpftool
[...]
  GEN     /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/tools/build/bpftool/profiler.skel.h
  CC      /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/tools/build/bpftool/prog.o
  GEN     /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/tools/build/bpftool/pid_iter.skel.h
  CC      /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/tools/build/bpftool/pids.o
  LINK    /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/tools/build/bpftool/bpftool
  INSTALL bpftool
  GEN      vmlinux.h
[...]
# test_feature_dev_json (test_bpftool.TestBpftool) ... ERROR
# test_feature_kernel (test_bpftool.TestBpftool) ... ERROR
# test_feature_kernel_full (test_bpftool.TestBpftool) ... ERROR
# test_feature_kernel_full_vs_not_full (test_bpftool.TestBpftool) ... ERROR
# test_feature_macros (test_bpftool.TestBpftool) ... Error: bug: failed to retrieve CAP_BPF status: Invalid argument
# ERROR
#
# ======================================================================
# ERROR: test_feature_dev_json (test_bpftool.TestBpftool)
# ----------------------------------------------------------------------
# Traceback (most recent call last):
#   File "/opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/test_bpftool.py", line 57, in wrapper
#     return f(*args, iface, **kwargs)
#   File "/opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/test_bpftool.py", line 82, in test_feature_dev_json
#     res = bpftool_json(["feature", "probe", "dev", iface])
#   File "/opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/test_bpftool.py", line 42, in bpftool_json
#     res = _bpftool(args)
#   File "/opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/test_bpftool.py", line 34, in _bpftool
#     return subprocess.check_output(_args)
#   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
#     **kwargs).stdout
#   File "/usr/lib/python3.7/subprocess.py", line 487, in run
#     output=stdout, stderr=stderr)
# subprocess.CalledProcessError: Command '['bpftool', '-j', 'feature', 'probe', 'dev', 'dummy0']' returned non-zero exit status 255.
#
==================

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210820015556.23276-4-lizhijian@cn.fujitsu.com

selftests/bpf: Make test_doc_build.sh work from script directory

Previously, it fails as below:
-------------
root@lkp-skl-d01 /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf# ./test_doc_build.sh
++ realpath --relative-to=/opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf ./test_doc_build.sh
+ SCRIPT_REL_PATH=test_doc_build.sh
++ dirname test_doc_build.sh
+ SCRIPT_REL_DIR=.
++ realpath /opt/rootfs/v5.14-rc4/tools/testing/selftests/bpf/./../../../../
+ KDIR_ROOT_DIR=/opt/rootfs/v5.14-rc4
+ cd /opt/rootfs/v5.14-rc4
+ for tgt in docs docs-clean
+ make -s -C /opt/rootfs/v5.14-rc4/. docs
make: *** No rule to make target 'docs'. Stop.
+ for tgt in docs docs-clean
+ make -s -C /opt/rootfs/v5.14-rc4/. docs-clean
make: *** No rule to make target 'docs-clean'. Stop.
-----------

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210820015556.23276-3-lizhijian@cn.fujitsu.com

selftests/bpf: Enlarge select() timeout for test_maps

0Day robot observed that it's easily timeout on a heavy load host.
-------------------
# selftests: bpf: test_maps
# Fork 1024 tasks to 'test_update_delete'
# Fork 1024 tasks to 'test_update_delete'
# Fork 100 tasks to 'test_hashmap'
# Fork 100 tasks to 'test_hashmap_percpu'
# Fork 100 tasks to 'test_hashmap_sizes'
# Fork 100 tasks to 'test_hashmap_walk'
# Fork 100 tasks to 'test_arraymap'
# Fork 100 tasks to 'test_arraymap_percpu'
# Failed sockmap unexpected timeout
not ok 3 selftests: bpf: test_maps # exit=1
# selftests: bpf: test_lru_map
# nr_cpus:8
-------------------
Since this test will be scheduled by 0Day to a random host that could have
only a few cpus(2-8), enlarge the timeout to avoid a false NG report.

In practice, i tried to pin it to only one cpu by 'taskset 0x01 ./test_maps',
and knew 10S is likely enough, but i still perfer to a larger value 30.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210820015556.23276-2-lizhijian@cn.fujitsu.com

selftests/bpf: Reduce flakyness in timer_mim

This patch extends wait time in timer_mim. As observed in slow CI environment,
it is possible to have interrupt/preemption long enough to cause the test to
fail, almost 1 failure in 5 runs.

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210823213629.3519641-1-fallentree@fb.com

Merge branch 'Refactor cgroup_bpf internals to use more specific attach_type'

Dave Marchevsky says:

====================

The cgroup_bpf struct has a few arrays (effective, progs, and flags) of
size MAX_BPF_ATTACH_TYPE. These are meant to separate progs by their
attach type, currently represented by the bpf_attach_type enum.

There are some bpf_attach_type values which are not valid attach types
for cgroup bpf programs. Programs with these attach types will never be
handled by cgroup_bpf_{attach,detach} and thus will never be held in
cgroup_bpf structs. Even if such programs did make it into their
reserved slot in those arrays, they would never be executed.

Accordingly we can migrate to a new internal cgroup_bpf-specific enum
for these arrays, saving some bytes per cgroup and making it more
obvious which BPF programs belong there. netns_bpf_attach_type is an
existing example of this pattern, let's do similar for cgroup_bpf.

v1->v2: Address Daniel's comments
* Reverse xmas tree ordering for def changes
* Helper macro to reduce to_cgroup_bpf_attach_type boilerplate
* checkpatch.pl complains: "ERROR: Macros with complex values should
be enclosed in parentheses". Found some existing macros (do 'git grep
"define case"') which get same complaint. Think it's fine to keep
as-is since it's immediately undef'd.
* Remove CG_BPF_ prefix from cgroup_bpf_attach_type
* Although I agree that the prefix is redundant, the de-prefixed
names feel a bit too 'general' given the internal use of the enum.
e.g. when someone sees CGROUP_INET6_BIND it's not obvious that it
should only be used in certain ways internally.
* Don't feel strongly about this, just my thoughts as a noob to the
internals.
* Rebase onto latest bpf-next/master
* No significant conflicts, some small boilerplate adjustments
needed to catch up to Andrii's "bpf: Refactor BPF_PROG_RUN_ARRAY
family of macros into functions" change
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum

Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
attach types and a function to map bpf_attach_type values to the new
enum. Inspired by netns_bpf_attach_type.

Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
possible. Functionality is unchanged as attach_type_to_prog_type
switches in bpf/syscall.c were preventing non-cgroup programs from
making use of the invalid cgroup_bpf array slots.

As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
arrays were sized using MAX_BPF_ATTACH_TYPE.

bpf_cgroup_storage is notably not migrated as struct
bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
member which is not meant to be opaque. Similarly, bpf_cgroup_link
continues to report its bpf_attach_type member to userspace via fdinfo
and bpf_link_info.

To ease disambiguation, bpf_attach_type variables are renamed from
'type' to 'atype' when changed to cgroup_bpf_attach_type.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com

af_unix: Fix NULL pointer bug in unix_shutdown

Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
introduced a bug for af_unix SEQPACKET type. In unix_shutdown, the
unhash function will call prot->unhash(), which is NULL for SEQPACKET.
And kernel will panic. On ARM32, it will show following messages: (it
likely affects x86 too).

Fix the bug by checking the prot->unhash is NULL or not first.

Kernel log:
<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address
00000000
pgd = 2fba1ffb
*pgd=00000000
Internal error: Oops: 80000005 [#1] PREEMPT SMP THUMB2
Modules linked in:
CPU: 1 PID: 1999 Comm: falkon Tainted: G        W
5.14.0-rc5-01175-g94531cfcbe79-dirty #9240
Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
PC is at 0x0
LR is at unix_shutdown+0x81/0x1a8
pc : [<00000000>]    lr : [<c08f3311>]    psr: 600f0013
sp : e45aff70  ip : e463a3c0  fp : beb54f04
r10: 00000125  r9 : e45ae000  r8 : c4a56664
r7 : 00000001  r6 : c4a56464  r5 : 00000001  r4 : c4a56400
r3 : 00000000  r2 : c5a6b180  r1 : 00000000  r0 : c4a56400
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 50c5387d  Table: 05aa804a  DAC: 00000051
Register r0 information: slab PING start c4a56400 pointer offset 0
Register r1 information: NULL pointer
Register r2 information: slab task_struct start c5a6b180 pointer offset 0
Register r3 information: NULL pointer
Register r4 information: slab PING start c4a56400 pointer offset 0
Register r5 information: non-paged memory
Register r6 information: slab PING start c4a56400 pointer offset 100
Register r7 information: non-paged memory
Register r8 information: slab PING start c4a56400 pointer offset 612
Register r9 information: non-slab/vmalloc memory
Register r10 information: non-paged memory
Register r11 information: non-paged memory
Register r12 information: slab filp start e463a3c0 pointer offset 0
Process falkon (pid: 1999, stack limit = 0x9ec48895)
Stack: (0xe45aff70 to 0xe45b0000)
ff60:                                     e45ae000 c5f26a00 00000000 00000125
ff80: c0100264 c07f7fa3 beb54f04 fffffff7 00000001 e6f3fc0e b5e5e9ec beb54ec4
ffa0: b5da0ccc c010024b b5e5e9ec beb54ec4 0000000f 00000000 00000000 beb54ebc
ffc0: b5e5e9ec beb54ec4 b5da0ccc 00000125 beb54f58 00785238 beb5529c beb54f04
ffe0: b5da1e24 beb54eac b301385c b62b6ee8 600f0030 0000000f 00000000 00000000
[<c08f3311>] (unix_shutdown) from [<c07f7fa3>] (__sys_shutdown+0x2f/0x50)
[<c07f7fa3>] (__sys_shutdown) from [<c010024b>]
(__sys_trace_return+0x1/0x16)
Exception stack(0xe45affa8 to 0xe45afff0)

Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Reported-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/bpf/20210821180738.1151155-1-jiang.wang@bytedance.com

selftests/bpf: Add tests for {set|get} socket option from setsockopt BPF

Adding selftests for the newly added functionality to call bpf_setsockopt()
and bpf_getsockopt() from setsockopt BPF programs.

Test Details:

1. BPF Program

   Checks for changes in IPV6_TCLASS(SOL_IPV6) via setsockopt
   If the cca for the socket is not cubic do nothing
   If the newly set value for IPV6_TCLASS is 45 (0x2d) (as per our use-case)
   then change the cc from cubic to reno

2. User Space Program

   Creates an AF_INET6 socket and set the cca for that to be "cubic"
   Attach the program and set the IPV6_TCLASS to 0x2d using setsockopt
   Verify the cca for the socket changed to reno

Signed-off-by: Prankur Gupta <prankgup@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210817224221.3257826-3-prankgup@fb.com

bpf: Add support for {set|get} socket options from setsockopt BPF

Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF
programs. An example use case is when the user sets the IPV6_TCLASS socket
option, we would also like to change the tcp-cc for that socket.

We don't have any use case for calling bpf_setsockopt() from supposedly read-
only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only
at this point.

Signed-off-by: Prankur Gupta <prankgup@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210817224221.3257826-2-prankgup@fb.com

bpf: Use kvmalloc for map keys in syscalls

Same as previous patch but for the keys. memdup_bpfptr is renamed
to kvmemdup_bpfptr (and converted to kvmalloc).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-2-sdf@google.com

bpf: Use kvmalloc for map values in syscall

Use kvmalloc/kvfree for temporary value when manipulating a map via
syscall. kmalloc might not be sufficient for percpu maps where the value
is big (and further multiplied by hundreds of CPUs).

Can be reproduced with netcnt test on qemu with "-smp 255".

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com

selftests/bpf: Adding delay in socketmap_listen to reduce flakyness

This patch adds a 1ms delay to reduce flakyness of the test.

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210819163609.2583758-1-fallentree@fb.com

bpf: Fix NULL event->prog pointer access in bpf_overflow_handler

Andrii reported that libbpf CI hit the following oops when
running selftest send_signal:
  [ 1243.160719] BUG: kernel NULL pointer dereference, address: 0000000000000030
  [ 1243.161066] #PF: supervisor read access in kernel mode
  [ 1243.161066] #PF: error_code(0x0000) - not-present page
  [ 1243.161066] PGD 0 P4D 0
  [ 1243.161066] Oops: 0000 [#1] PREEMPT SMP NOPTI
  [ 1243.161066] CPU: 1 PID: 882 Comm: new_name Tainted: G           O      5.14.0-rc5 #1
  [ 1243.161066] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  [ 1243.161066] RIP: 0010:bpf_overflow_handler+0x9a/0x1e0
  [ 1243.161066] Code: 5a 84 c0 0f 84 06 01 00 00 be 66 02 00 00 48 c7 c7 6d 96 07 82 48 8b ab 18 05 00 00 e8 df 55 eb ff 66 90 48 8d 75 48 48 89 e7 <ff> 55 30 41 89 c4 e8 fb c1 f0 ff 84 c0 0f 84 94 00 00 00 e8 6e 0f
  [ 1243.161066] RSP: 0018:ffffc900000c0d80 EFLAGS: 00000046
  [ 1243.161066] RAX: 0000000000000002 RBX: ffff8881002e0dd0 RCX: 00000000b4b47cf8
  [ 1243.161066] RDX: ffffffff811dcb06 RSI: 0000000000000048 RDI: ffffc900000c0d80
  [ 1243.161066] RBP: 0000000000000000 R08: 0000000000000000 R09: 1a9d56bb00000000
  [ 1243.161066] R10: 0000000000000001 R11: 0000000000080000 R12: 0000000000000000
  [ 1243.161066] R13: ffffc900000c0e00 R14: ffffc900001c3c68 R15: 0000000000000082
  [ 1243.161066] FS:  00007fc0be2d3380(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
  [ 1243.161066] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1243.161066] CR2: 0000000000000030 CR3: 0000000104f8e000 CR4: 00000000000006e0
  [ 1243.161066] Call Trace:
  [ 1243.161066]  <IRQ>
  [ 1243.161066]  __perf_event_overflow+0x4f/0xf0
  [ 1243.161066]  perf_swevent_hrtimer+0x116/0x130
  [ 1243.161066]  ? __lock_acquire+0x378/0x2730
  [ 1243.161066]  ? __lock_acquire+0x372/0x2730
  [ 1243.161066]  ? lock_is_held_type+0xd5/0x130
  [ 1243.161066]  ? find_held_lock+0x2b/0x80
  [ 1243.161066]  ? lock_is_held_type+0xd5/0x130
  [ 1243.161066]  ? perf_event_groups_first+0x80/0x80
  [ 1243.161066]  ? perf_event_groups_first+0x80/0x80
  [ 1243.161066]  __hrtimer_run_queues+0x1a3/0x460
  [ 1243.161066]  hrtimer_interrupt+0x110/0x220
  [ 1243.161066]  __sysvec_apic_timer_interrupt+0x8a/0x260
  [ 1243.161066]  sysvec_apic_timer_interrupt+0x89/0xc0
  [ 1243.161066]  </IRQ>
  [ 1243.161066]  asm_sysvec_apic_timer_interrupt+0x12/0x20
  [ 1243.161066] RIP: 0010:finish_task_switch+0xaf/0x250
  [ 1243.161066] Code: 31 f6 68 90 2a 09 81 49 8d 7c 24 18 e8 aa d6 03 00 4c 89 e7 e8 12 ff ff ff 4c 89 e7 e8 ca 9c 80 00 e8 35 af 0d 00 fb 4d 85 f6 <58> 74 1d 65 48 8b 04 25 c0 6d 01 00 4c 3b b0 a0 04 00 00 74 37 f0
  [ 1243.161066] RSP: 0018:ffffc900001c3d18 EFLAGS: 00000282
  [ 1243.161066] RAX: 000000000000031f RBX: ffff888104cf4980 RCX: 0000000000000000
  [ 1243.161066] RDX: 0000000000000000 RSI: ffffffff82095460 RDI: ffffffff820adc4e
  [ 1243.161066] RBP: ffffc900001c3d58 R08: 0000000000000001 R09: 0000000000000001
  [ 1243.161066] R10: 0000000000000001 R11: 0000000000080000 R12: ffff88813bd2bc80
  [ 1243.161066] R13: ffff8881002e8000 R14: ffff88810022ad80 R15: 0000000000000000
  [ 1243.161066]  ? finish_task_switch+0xab/0x250
  [ 1243.161066]  ? finish_task_switch+0x70/0x250
  [ 1243.161066]  __schedule+0x36b/0xbb0
  [ 1243.161066]  ? _raw_spin_unlock_irqrestore+0x2d/0x50
  [ 1243.161066]  ? lockdep_hardirqs_on+0x79/0x100
  [ 1243.161066]  schedule+0x43/0xe0
  [ 1243.161066]  pipe_read+0x30b/0x450
  [ 1243.161066]  ? wait_woken+0x80/0x80
  [ 1243.161066]  new_sync_read+0x164/0x170
  [ 1243.161066]  vfs_read+0x122/0x1b0
  [ 1243.161066]  ksys_read+0x93/0xd0
  [ 1243.161066]  do_syscall_64+0x35/0x80
  [ 1243.161066]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The oops can also be reproduced with the following steps:
  ./vmtest.sh -s
  # at qemu shell
  cd /root/bpf && while true; do ./test_progs -t send_signal

Further analysis showed that the failure is introduced with
commit b89fbfbb854c ("bpf: Implement minimal BPF perf link").
With the above commit, the following scenario becomes possible:
    cpu1                        cpu2
                                hrtimer_interrupt -> bpf_overflow_handler
    (due to closing link_fd)
    bpf_perf_link_release ->
    perf_event_free_bpf_prog ->
    perf_event_free_bpf_handler ->
      WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler)
      event->prog = NULL
                                bpf_prog_run(event->prog, &ctx)

In the above case, the event->prog is NULL for bpf_prog_run, hence
causing oops.

To fix the issue, check whether event->prog is NULL or not. If it
is, do not call bpf_prog_run. This seems working as the above
reproducible step runs more than one hour and I didn't see any
failures.

Fixes: b89fbfbb854c ("bpf: Implement minimal BPF perf link")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210819155209.1927994-1-yhs@fb.com

bpf: Undo off-by-one in interpreter tail call count limit

The BPF interpreter as well as x86-64 BPF JIT were both in line by allowing
up to 33 tail calls (however odd that number may be!). Recently, this was
changed for the interpreter to reduce it down to 32 with the assumption that
this should have been the actual limit "which is in line with the behavior of
the x86 JITs" according to b61a28cf11d61 ("bpf: Fix off-by-one in tail call
count limiting").

Paul recently reported:

  I'm a bit surprised by this because I had previously tested the tail call
  limit of several JIT compilers and found it to be 33 (i.e., allowing chains
  of up to 34 programs). I've just extended a test program I had to validate
  this again on the x86-64 JIT, and found a limit of 33 tail calls again [1].

  Also note we had previously changed the RISC-V and MIPS JITs to allow up to
  33 tail calls [2, 3], for consistency with other JITs and with the interpreter.
  We had decided to increase these two to 33 rather than decrease the other
  JITs to 32 for backward compatibility, though that probably doesn't matter
  much as I'd expect few people to actually use 33 tail calls.

  [1] https://github.com/pchaigno/tail-call-bench/commit/ae7887482985b4b1745c9b2ef7ff9ae506c82886
  [2] 96bc4432f5ad ("bpf, riscv: Limit to 33 tail calls")
  [3] e49e6f6db04e ("bpf, mips: Limit to 33 tail calls")

Therefore, revert b61a28cf11d61 to re-align interpreter to limit a maximum of
33 tail calls. While it is unlikely to hit the limit for the vast majority,
programs in the wild could one way or another depend on this, so lets rather
be a bit more conservative, and lets align the small remainder of JITs to 33.
If needed in future, this limit could be slightly increased, but not decreased.

Fixes: b61a28cf11d61 ("bpf: Fix off-by-one in tail call count limiting")
Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/CAO5pjwTWrC0_dzTbTHFPSqDwA56aVH+4KFGVqdq8=ASs0MqZGQ@mail.gmail.com

selftests/bpf: Test for get_netns_cookie

Add test to use get_netns_cookie() from BPF_PROG_TYPE_SOCK_OPS.

Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818105820.91894-3-liuxu623@gmail.com

bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_SOCK_OPS

We'd like to be able to identify netns from sockops hooks to
accelerate local process communication form different netns.

Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818105820.91894-2-liuxu623@gmail.com

libbpf: Rename libbpf documentation index file

This patch renames a documentation libbpf.rst to index.rst. In order
for readthedocs.org to pick this file up and properly build the
documentation site.

It also changes the title type of the ABI subsection in the
naming convention doc. This is so that readthedocs.org doesn't treat this
section as a separate document.

Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210818151313.49992-1-grantseltzer@gmail.com

bpf: Remove redundant initialization of variable allow

The variable allow is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817170842.495440-1-colin.king@canonical.com

Merge branch 'selftests/bpf: fix flaky send_signal test'

Yonghong Song says:

====================

The bpf selftest send_signal() is flaky for its subtests trying to
send signals in softirq/nmi context. To reduce flakiness, the
signal-targetted process priority is boosted, which should minimize
preemption of that process and improve the possibility that
the underlying task in softirq/nmi context is the bpf_send_signal()
wanted task.

Patch #1 did a refactoring to use ASSERT_* instead of old CHECK macros.
Patch #2 did actual change of boosting priority.

Changelog:
  v1 -> v2:
    remove skip logic where the underlying task in interrupt context
    is not the intended one.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Fix flaky send_signal test

libbpf CI has reported send_signal test is flaky although
I am not able to reproduce it in my local environment.
But I am able to reproduce with on-demand libbpf CI ([1]).

Through code analysis, the following is possible reason.
The failed subtest runs bpf program in softirq environment.
Since bpf_send_signal() only sends to a fork of "test_progs"
process. If the underlying current task is
not "test_progs", bpf_send_signal() will not be triggered
and the subtest will fail.

To reduce the chances where the underlying process is not
the intended one, this patch boosted scheduling priority to
-20 (highest allowed by setpriority() call). And I did
10 runs with on-demand libbpf CI with this patch and I
didn't observe any failures.

[1] https://github.com/libbpf/libbpf/actions/workflows/ondemand.yml

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817190923.3186725-1-yhs@fb.com

selftests/bpf: Replace CHECK with ASSERT_* macros in send_signal.c

Replace CHECK in send_signal.c with ASSERT_* macros as
ASSERT_* macros are generally preferred. There is no
funcitonality change.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817190918.3186400-1-yhs@fb.com

Merge branch 'selftests/bpf: Improve the usability of test_progs'

Yucong Sun says:

====================

This short series adds two new switches to test_progs, "-a" and "-d",
adding support for both exact string matching, as well as '*' wildcards.
It also cleans up the output to make it possible to generate
allowlist/denylist using common cli tools.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Support glob matching for test selector.

This patch adds '-a' and '-d' arguments supporting both exact string match as
well as using '*' wildcard in test/subtests selection. '-a' and '-t' can
co-exists, same as '-d' and '-b', in which case they just add to the list of
allowed or denied test selectors.

Caveat: Same as the current substring matching mechanism, test and subtest
selector applies independently, 'a*/b*' will execute all tests matching "a*",
and with subtest name matching "b*", but tests matching "a*" that has no
subtests will also be executed.

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-5-fallentree@fb.com

selftests/bpf: Also print test name in subtest status message

This patch add test name in subtest status message line, making it possible to
grep ':OK' in the output to generate a list of passed test+subtest names, which
can be processed to generate argument list to be used with "-a", "-d" exact
string matching.

Example:

#1/1 align/mov:OK
..
#1/12 align/pointer variable subtraction:OK
#1 align:OK

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-4-fallentree@fb.com

selftests/bpf: Correctly display subtest skip status

In skip_account(), test->skip_cnt is set to 0 at the end, this makes next print
statement never display SKIP status for the subtest. This patch moves the
accounting logic after the print statement, fixing the issue.

This patch also added SKIP status display for normal tests.

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-3-fallentree@fb.com

selftests/bpf: Skip loading bpf_testmod when using -l to list tests.

When using "-l", test_progs often is executed as non-root user,
load_bpf_testmod() will fail and output errors. This patch skips loading bpf
testmod when "-l" is specified, making output cleaner.

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-2-fallentree@fb.com

selftests/bpf: Add exponential backoff to map_delete_retriable in test_maps

Using a fixed delay of 1 microsecond has proven flaky in slow CPU environment,
e.g. Github Actions CI system. This patch adds exponential backoff with a cap
of 50ms to reduce the flakiness of the test. Initial delay is chosen at random
in the range [0ms, 5ms).

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817045713.3307985-1-fallentree@fb.com

selftests/bpf: Add exponential backoff to map_update_retriable in test_maps

Using a fixed delay of 1 microsecond has proven flaky in slow CPU environment,
e.g. Github Actions CI system. This patch adds exponential backoff with a cap
of 50ms to reduce the flakiness of the test. Initial delay is chosen at random
in the range [0ms, 5ms).

Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210816175250.296110-1-fallentree@fb.com

Merge branch 'sockmap: add sockmap support for unix stream socket'

Jiang Wang says:

====================

This patch series add support for unix stream type
for sockmap. Sockmap already supports TCP, UDP,
unix dgram types. The unix stream support is similar
to unix dgram.

Also add selftests for unix stream type in sockmap tests.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftest/bpf: Add new tests in sockmap for unix stream to tcp.

Add two new test cases in sockmap tests, where unix stream is
redirected to tcp and vice versa.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-6-jiang.wang@bytedance.com

selftest/bpf: Change udp to inet in some function names

This is to prepare for adding new unix stream tests.
Mostly renames, also pass the socket types as an argument.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-5-jiang.wang@bytedance.com

selftest/bpf: Add tests for sockmap with unix stream type.

Add two tests for unix stream to unix stream redirection
in sockmap tests.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-4-jiang.wang@bytedance.com

af_unix: Add unix_stream_proto for sockmap

Previously, sockmap for AF_UNIX protocol only supports
dgram type. This patch add unix stream type support, which
is similar to unix_dgram_proto. To support sockmap, dgram
and stream cannot share the same unix_proto anymore, because
they have different implementations, such as unhash for stream
type (which will remove closed or disconnected sockets from the map),
so rename unix_proto to unix_dgram_proto and add a new
unix_stream_proto.

Also implement stream related sockmap functions.
And add dgram key words to those dgram specific functions.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-3-jiang.wang@bytedance.com

af_unix: Add read_sock for stream socket types

To support sockmap for af_unix stream type, implement
read_sock, which is similar to the read_sock for unix
dgram sockets.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-2-jiang.wang@bytedance.com

selftests/bpf: Test btf__load_vmlinux_btf/btf__load_module_btf APIs

Add test for btf__load_vmlinux_btf/btf__load_module_btf APIs. The test
loads bpf_testmod module BTF and check existence of a symbol which is
known to exist.

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210815081035.205879-1-hengqi.chen@gmail.com

bpf: Reconfigure libbpf docs to remove unversioned API

This removes the libbpf_api.rst file from the kernel documentation.
The intention for this file was to pull documentation from comments
above API functions in libbpf. However, due to limitations of the
kernel documentation system, this API documentation could not be
versioned, which is counterintuative to how users expect to use it.
There is also currently no doc comments, making this a blank page.

Once the kernel comment documentation is actually contributed, it
will still exist in the kernel repository, just in the code itself.

A seperate site is being spun up to generate documentaiton from those
comments in a way in which it can be versioned properly.

This also reconfigures the bpf documentation index page to make it
easier to sync to the previously mentioned documentaiton site.

Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810020508.280639-1-grantseltzer@gmail.com

Merge branch 'bpf-perf-link'

Andrii Nakryiko says:

====================
This patch set implements an ability for users to specify custom black box u64
value for each BPF program attachment, bpf_cookie, which is available to BPF
program at runtime. This is a feature that's critically missing for cases when
some sort of generic processing needs to be done by the common BPF program
logic (or even exactly the same BPF program) across multiple BPF hooks (e.g.,
many uniformly handled kprobes) and it's important to be able to distinguish
between each BPF hook at runtime (e.g., for additional configuration lookup).

The choice of restricting this to a fixed-size 8-byte u64 value is an explicit
design decision. Making this configurable by users adds unnecessary complexity
(extra memory allocations, extra complications on the verifier side to validate
accesses to variable-sized data area) while not really opening up new
possibilities. If user's use case requires storing more data per attachment,
it's possible to use either global array, or ARRAY/HASHMAP BPF maps, where
bpf_cookie would be used as an index into respective storage, populated by
user-space code before creating BPF link. This gives user all the flexibility
and control while keeping BPF verifier and BPF helper API simple.

Currently, similar functionality can only be achieved through:

  - code-generation and BPF program cloning, which is very complicated and
    unmaintainable;
  - on-the-fly C code generation and further runtime compilation, which is
    what BCC uses and allows to do pretty simply. The big downside is a very
    heavy-weight Clang/LLVM dependency and inefficient memory usage (due to
    many BPF program clones and the compilation process itself);
  - in some cases (kprobes and sometimes uprobes) it's possible to do function
    IP lookup to get function-specific configuration. This doesn't work for
    all the cases (e.g., when attaching uprobes to shared libraries) and has
    higher runtime overhead and additional programming complexity due to
    BPF_MAP_TYPE_HASHMAP lookups. Up until recently, before bpf_get_func_ip()
    BPF helper was added, it was also very complicated and unstable (API-wise)
    to get traced function's IP from fentry/fexit and kretprobe.

With libbpf and BPF CO-RE, runtime compilation is not an option, so to be able
to build generic tracing tooling simply and efficiently, ability to provide
additional bpf_cookie value for each *attachment* (as opposed to each BPF
program) is extremely important. Two immediate users of this functionality are
going to be libbpf-based USDT library (currently in development) and retsnoop
([0]), but I'm sure more applications will come once users get this feature in
their kernels.

To achieve above described, all perf_event-based BPF hooks are made available
through a new BPF_LINK_TYPE_PERF_EVENT BPF link, which allows to use common
LINK_CREATE command for program attachments and generally brings
perf_event-based attachments into a common BPF link infrastructure.

With that, LINK_CREATE gets ability to pass throught bpf_cookie value during
link creation (BPF program attachment) time. bpf_get_attach_cookie() BPF
helper is added to allow fetching this value at runtime from BPF program side.
BPF cookie is stored either on struct perf_event itself and fetched from the
BPF program context, or is passed through ambient BPF run context, added in
c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current").

On the libbpf side of things, BPF perf link is utilized whenever is supported
by the kernel instead of using PERF_EVENT_IOC_SET_BPF ioctl on perf_event FD.
All the tracing attach APIs are extended with OPTS and bpf_cookie is passed
through corresponding opts structs.

Last part of the patch set adds few self-tests utilizing new APIs.

There are also a few refactorings along the way to make things cleaner and
easier to work with, both in kernel (BPF_PROG_RUN and BPF_PROG_RUN_ARRAY), and
throughout libbpf and selftests.

Follow-up patches will extend bpf_cookie to fentry/fexit programs.

While adding uprobe_opts, also extend it with ref_ctr_offset for specifying
USDT semaphore (reference counter) offset. Update attach_probe selftests to
validate its functionality. This is another feature (along with bpf_cookie)
required for implementing libbpf-based USDT solution.

  [0] https://github.com/anakryiko/retsnoop

v4->v5:
  - rebase on latest bpf-next to resolve merge conflict;
  - add ref_ctr_offset to uprobe_opts and corresponding selftest;
v3->v4:
  - get rid of BPF_PROG_RUN macro in favor of bpf_prog_run() (Daniel);
  - move #ifdef CONFIG_BPF_SYSCALL check into bpf_set_run_ctx (Daniel);
v2->v3:
  - user_ctx -> bpf_cookie, bpf_get_user_ctx -> bpf_get_attach_cookie (Peter);
  - fix BPF_LINK_TYPE_PERF_EVENT value fix (Jiri);
  - use bpf_prog_run() from bpf_prog_run_pin_on_cpu() (Yonghong);
v1->v2:
  - fix build failures on non-x86 arches by gating on CONFIG_PERF_EVENTS.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Add ref_ctr_offset selftests

Extend attach_probe selftests to specify ref_ctr_offset for uprobe/uretprobe
and validate that its value is incremented from zero.

Turns out that once uprobe is attached with ref_ctr_offset, uretprobe for the
same location/function *has* to use ref_ctr_offset as well, otherwise
perf_event_open() fails with -EINVAL. So this test uses ref_ctr_offset for
both uprobe and uretprobe, even though for the purpose of test uprobe would be
enough.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-17-andrii@kernel.org

libbpf: Add uprobe ref counter offset support for USDT semaphores

When attaching to uprobes through perf subsystem, it's possible to specify
offset of a so-called USDT semaphore, which is just a reference counted u16,
used by kernel to keep track of how many tracers are attached to a given
location. Support for this feature was added in [0], so just wire this through
uprobe_opts. This is important to enable implementing USDT attachment and
tracing through libbpf's bpf_program__attach_uprobe_opts() API.

[0] a6ca88b241d5 ("trace_uprobe: support reference counter in fd-based uprobe")

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-16-andrii@kernel.org