git.osdn.net Git - tomoyo/tomoyo-test1.git/log

bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY

This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision.  In this case, decide which
SO_REUSEPORT sk to serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information.  This connection info contact/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]

During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
ensured in "reuseport_array_update_check()".

A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map).  SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.

When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also.  It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.

The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]".  Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall
side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).

The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update.  Supporting different
value_size between lookup and update seems unintuitive also.

We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4.  The syscall's lookup
will return ENOSPC on value_size=4.  It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: Add ID (if needed) to sock_reuseport and expose reuseport_lock

A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
is removed from reuse->socks[], it also needs to be removed from
the bpf map.  Also, when adding a sk to a bpf map, the bpf
map needs to ensure it is indeed in a reuse->socks[].
Hence, reuseport_lock is needed by the bpf map to ensure its
map_update_elem() and map_delete_elem() operations are in-sync with
the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
acquire the reuseport_lock after ensuring the adding sk is already
in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
will be lockless.

This patch also adds an ID to sock_reuseport.  A later patch
will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
a bpf prog to select a sk from a bpf map.  It is inflexible to
statically enforce a bpf map can only contain the sk belonging to
a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
verification time. For example, think about the the map-in-map situation
where the inner map can be dynamically changed in runtime and the outer
map may have inner maps belonging to different reuseport groups.
Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
type) selects a sk,  this selected sk has to be checked to ensure it
belongs to the requesting reuseport group (i.e. the group serving
that IP:PORT).

The "sk->sk_reuseport_cb" pointer cannot be used for this checking
purpose because the pointer value will change after reuseport_grow().
Instead of saving all checking conditions like the ones
preced calling "reuseport_add_sock()" and compare them everytime a
bpf_prog is run, a 32bits ID is introduced to survive the
reuseport_grow().  The ID is only acquired if any of the
reuse->socks[] is added to the newly introduced
"BPF_MAP_TYPE_REUSEPORT_ARRAY" map.

If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used,  the changes in this
patch is a no-op.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket

Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()".  The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").

The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case.  However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.

When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie.  For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
   was updated.
2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
   and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
   There are other ordering that will trigger the similar situation
   below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
   "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
   all syncookies sent by sk1 will be handled (and rejected)
   by sk2 while sk1 is still alive.

The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).

The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map.  Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.

A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[]  OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group.  However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.

This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.

"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group.  Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.

A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed.  Before and
after the change is ~9Mpps in IPv4.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-btf-for-htab-lru'

Yonghong Song says:

====================
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.

The following example shows the pretty-print result of a pinned
hashmap. Without this patch set, user will get an error instead.

    struct map_value {
            int count_a;
            int count_b;
    };

    cat /sys/fs/bpf/pinned_hash_map:

    87907: {87907,87908}
    57354: {37354,57355}
    76625: {76625,76626}
    ...

Patch #1 fixed a bug in bpffs map_seq_next() function so that
all elements in the hash table will be traversed.
Patch #2 implemented map_seq_show_elem() and map_check_btf()
callback functions for hash and lru hash maps.
Patch #3 enhanced tools/testing/selftests/bpf/test_btf.c to
test bpffs hash and lru hash map pretty print.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

tools/bpf: add bpffs pretty print btf test for hash/lru_hash maps

Pretty print tests for hash/lru_hash maps are added in test_btf.c.
The btf type blob is the same as pretty print array map test.
The test result:
  $ mount -t bpf bpf /sys/fs/bpf
  $ ./test_btf -p
    BTF pretty print array......OK
    BTF pretty print hash......OK
    BTF pretty print lru hash......OK
    PASS:3 SKIP:0 FAIL:0

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: btf: add pretty print for hash/lru_hash maps

Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.
The following example shows the pretty-print result of
a pinned hashmap:

    struct map_value {
            int count_a;
            int count_b;
    };

    cat /sys/fs/bpf/pinned_hash_map:

    87907: {87907,87908}
    57354: {37354,57355}
    76625: {76625,76626}
    ...

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: fix bpffs non-array map seq_show issue

In function map_seq_next() of kernel/bpf/inode.c,
the first key will be the "0" regardless of the map type.
This works for array. But for hash type, if it happens
key "0" is in the map, the bpffs map show will miss
some items if the key "0" is not the first element of
the first bucket.

This patch fixed the issue by guaranteeing to get
the first element, if the seq_show is just started,
by passing NULL pointer key to map_get_next_key() callback.
This way, no missing elements will occur for
bpffs hash table show even if key "0" is in the map.

Fixes: a26ca7c982cb5 ("bpf: btf: Add pretty print support to the basic arraymap")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-veth-xdp-support'

Toshiaki Makita says:

====================
This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

  NIC -----------> veth===veth
(XDP) (redirect)        (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.

Envisioned use-cases
--------------------

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.

Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
- patch 1: Export a function needed for veth XDP.
- patch 2-3: Basic implementation of veth XDP.
- patch 4-6: Add ndo_xdp_xmit.
- patch 7-9: Add XDP_TX and XDP_REDIRECT.
- patch 10: Performance optimization for multi-queue env.

Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 of Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM but was not so great performance due
to lack of tx/rx checksum offload and TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows   Gbps
==============================
external->veth        1   20.8
external->veth        2   23.5
external->veth      100   23.6
veth->external        1    9.0
veth->external        2   17.8
veth->external      100   22.9

Also tested doing ifup/down or load/unload a XDP program repeatedly
during processing XDP packets in order to check if enabling/disabling
NAPI is working as expected, and found no problems.

v8:
- Don't use xdp_frame pointer address to calculate skb->head, headroom,
  and xdp_buff.data_hard_start.

v7:
- Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
  use it instead of memset().

v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
  events.
- Fix sparse warning around EXPORT_SYMBOL.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).
====================

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

veth: Support per queue XDP ring

Move XDP and napi related fields from veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not less than the number of peer txqs, but leave the possibility to
achieve bulk skb xmit in the future because txq lock would make it
possible to remove rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

veth: Add XDP TX and REDIRECT

This allows further redirection of xdp_frames like

NIC   -> veth--veth -> veth--veth
(XDP)          (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.

v8:
- Don't use xdp_frame pointer address for data_hard_start of xdp_buff.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xdp: Helpers for disabling napi_direct of xdp_return_frame

We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.

A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Make redirect_info accessible from modules

We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

veth: Add ndo_xdp_xmit

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought as calling another XDP program by XDP program using
REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

veth: Handle xdp_frames in xdp napi ring

This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.

v8:
- Don't use xdp_frame pointer address to calculate skb->head and
  headroom.

v7:
- Use xdp_scrub_frame() instead of memset().

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame feilds in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xdp: Helper function to clear kernel pointers in xdp_frame

xdp_frame has kernel pointers which should not be readable from bpf
programs. When we want to reuse xdp_frame region but it may be read by
bpf programs later, we can use this helper to clear kernel pointers.
This is more efficient than calling memset() for the entire struct.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

veth: Avoid drops by oversized packets when XDP is enabled

Oversized packets including GSO packets can be dropped if XDP is
enabled on receiver side, so don't send such packets from peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

veth: Add driver XDP

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP is now heavily
relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
enqueues packets to the ring and peer NAPI handler drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale on multiqueue env. This can be resolved by allocating rings
on the per-queue basis later.

Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: Export skb_headers_offset_update

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-sample-cpumap-lb'

Jesper Dangaard Brouer says:

====================
Background: cpumap moves the SKB allocation out of the driver code,
and instead allocate it on the remote CPU, and invokes the regular
kernel network stack with the newly allocated SKB.

The idea behind the XDP CPU redirect feature, is to use XDP as a
load-balancer step in-front of regular kernel network stack.  But the
current sample code does not provide a good example of this.  Part of
the reason is that, I have implemented this as part of Suricata XDP
load-balancer.

Given this is the most frequent feature request I get.  This patchset
implement the same XDP load-balancing as Suricata does, which is a
symmetric hash based on the IP-pairs + L4-protocol.

The expected setup for the use-case is to reduce the number of NIC RX
queues via ethtool (as XDP can handle more per core), and via
smp_affinity assign these RX queues to a set of CPUs, which will be
handling RX packets.  The CPUs that runs the regular network stack is
supplied to the sample xdp_redirect_cpu tool by specifying
the --cpu option multiple times on the cmdline.

I do note that cpumap SKB creation is not feature complete yet, and
more work is coming.  E.g. given GRO is not implemented yet, do expect
TCP workloads to be slower.  My measurements do indicate UDP workloads
are faster.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

samples/bpf: xdp_redirect_cpu load balance like Suricata

This implement XDP CPU redirection load-balancing across available
CPUs, based on the hashing IP-pairs + L4-protocol.  This equivalent to
xdp-cpu-redirect feature in Suricata, which is inspired by the
Suricata 'ippair' hashing code.

An important property is that the hashing is flow symmetric, meaning
that if the source and destination gets swapped then the selected CPU
will remain the same.  This is helps locality by placing both directions
of a flows on the same CPU, in a forwarding/routing scenario.

The hashing INITVAL (15485863 the 10^6th prime number) was fairly
arbitrary choosen, but experiments with kernel tree pktgen scripts
(pktgen_sample04_many_flows.sh +pktgen_sample05_flow_per_thread.sh)
showed this improved the distribution.

This patch also change the default loaded XDP program to be this
load-balancer.  As based on different user feedback, this seems to be
the expected behavior of the sample xdp_redirect_cpu.

Link: https://github.com/OISF/suricata/commit/796ec08dd7a63
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

samples/bpf: add Paul Hsieh's (LGPL 2.1) hash function SuperFastHash

Adjusted function call API to take an initval. This allow the API
user to set the initial value, as a seed. This could also be used for
inputting the previous hash.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Revert "xdp: add NULL pointer check in __xdp_return()"

This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.

The reverted commit adds a WARN to check against NULL entries in the
mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
driver) fast path is required to make a paired
xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
addition, a driver using a different allocation scheme than the
default MEM_TYPE_PAGE_SHARED is required to additionally call
xdp_rxq_info_reg_mem_model.

For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
that the mem_id_ht rhashtable has a properly inserted allocator id. If
not, this would be a driver bug. A NULL pointer kernel OOPS is
preferred to the WARN.

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge ra./pub/scm/linux/kernel/git/davem/net

Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst
adding some tracepoints.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Add-support-for-XGMAC2-in-stmmac'

Jose Abreu says:

====================
Add support for XGMAC2 in stmmac

This series adds support for 10Gigabit IP in stmmac. The IP is called XGMAC2
and has many similarities with GMAC4. Due to this, its relatively easy to
incorporate this new IP into stmmac driver by adding a new block and
filling the necessary callbacks.

The functionality added by this series is still reduced but its only a
starting point which will later be expanded.

I splitted the patches into funcionality and to ease the review. Only the
patch 8/9 really enables the XGMAC2 block by adding a new compatible string.

Version 4 addresses review comments of Florian Fainelli and Rob Herring.

NOTE: Although the IP supports 10G, for now it was only possible to test it
at 1G speed due to 10G PHY HW shipping problems. Here follows iperf3
results at 1G:

Connecting to host 192.168.0.10, port 5201
[  4] local 192.168.0.3 port 39178 connected to 192.168.0.10 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   110 MBytes   920 Mbits/sec    0    482 KBytes
[  4]   1.00-2.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
[  4]   2.00-3.00   sec   112 MBytes   937 Mbits/sec    0    482 KBytes
[  4]   3.00-4.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
[  4]   4.00-5.00   sec   112 MBytes   935 Mbits/sec    0    482 KBytes
[  4]   5.00-6.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
[  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    482 KBytes
[  4]   7.00-8.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
[  4]   8.00-9.00   sec   112 MBytes   937 Mbits/sec    0    482 KBytes
[  4]   9.00-10.00  sec   113 MBytes   946 Mbits/sec    0    482 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.09 GBytes   940 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec                  receiver
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: stmmac: Add the bindings documentation for XGMAC2.

Adds the documentation for XGMAC2 DT bindings.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: devicetree@vger.kernel.org
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add the bindings parsing for XGMAC2

Add the bindings parsing for XGMAC2 IP block.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Integrate XGMAC into main driver flow

Now that we have all the XGMAC related callbacks, lets start integrating
this IP block into main driver.

Also, we corrected the initialization flow to only start DMA after
setting descriptors length.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add PTP support for XGMAC2

XGMAC2 uses the same engine of timestamping as GMAC4. Let's use the same
callbacks.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add MDIO related functions for XGMAC2

Add the MDIO related funcionalities for the new IP block XGMAC2.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add descriptor related callbacks for XGMAC2

Add the descriptor related callbacks for the new IP block XGMAC2.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add DMA related callbacks for XGMAC2

Add the DMA related callbacks for the new IP block XGMAC2.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add MAC related callbacks for XGMAC2

Add the MAC related callbacks for the new IP block XGMAC2.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add XGMAC 2.10 HWIF entry

Add a new entry to HWIF table for XGMAC 2.10. For now we fill it with
empty callbacks which will be added in posterior patches.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'More-complete-PHYLINK-support-for-mv88e6xxx'

Andrew Lunn says:

====================
More complete PHYLINK support for mv88e6xxx

Previous patches added sufficient PHYLINK support to the mv88e6xxx
that it did not break existing use cases, basically fixed-link phys.

This patchset builds out the support so that SFP modules, up to
2.5Gbps can be supported, on mv88e6390X, on ports 9 and 10. It also
provides a framework which can be extended to support SFPs on ports
2-8 of mv88e6390X, 10Gbps PHYs, and SFP support on the 6352 family.

Russell King did much of the initial work, implementing the validate
and mac_link_state calls. However, there is an important TODO in the
commit message:

needs to call phylink_mac_change() when the port link comes up/goes down.

The remaining patches implement this, by adding more support for the
SERDES interfaces, in particular, interrupt support so we get notified
when the SERDES gains/looses sync.

This has been tested on the ZII devel C, using a Clearfog as peer
device.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Re-setup interrupts on CMODE change.

When a port changes CMODE, the SERDES interface being used can change.
Disable interrupts for the old SERDES interface, and enable interrupts
on the new.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add SERDES phydev_mac_change up for 6390

phylink wants to know when the MAC layers notices a change in the
link. For the 6390 family, this is a change in the SERDES state.

Add interrupt support for the SERDES interface used to implement
SGMII/1000Base-X/2500Base-X. This is currently limited to ports 9 and
10. Support for the 10G SERDES and other ports will be added later,
building on this basic framework.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: link mv88e6xxx_port to mv88e6xxx_chip

An up coming change will register interrupts for individual switch
ports, using the mv88e6xxx_port as the interrupt context information.
Add members to the mv88e6xxx_port structure so we can link it back to
the mv88e6xxx_chip member the port belongs to and the port number of
the port.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Power on/off SERDES on cmode change

The 6390 family has a number of SERDES interfaces per port. When the
cmode changes, eg 1000Base-X to XAUI, the SERDES interface in use will
also change. Power down the old SERDES interface and power up the new
SERDES interface.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Cache the port cmode

The ports CMODE indicates the type of link between the MAC and the
PHY. It is used often in the SERDES code. Rather than read it each
time, cache its value.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: 2500Base-X uses the 1000Base-X SERDES

The 6390 has three different SERDES interface types. 2500Base-X is
implemented by the SGMII/1000Base-X SERDES. So power on/off the
correct SERDES.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add serdes register read/write helper

Add a helper for accessing SERDES registers of the 6390 family.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Rename sgmii/10g power functions

There is a need to add more functions manipulating the SERDES
interfaces. Cleanup the namespace.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: 6390 vs 6390X SERDES support

The 6390 has two SERDES interfaces, used by ports 9 and 10. The 6390X
has eight SERDES interfaces. These allow ports 9 and 10 to do 10G. Or
if lower speeds are used, some of the SERDES interfaces can be used by
ports 2-8 for 1000Base-X.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor SERDES lane code

The 6390 family has 8 SERDES lanes. What ports use these lanes depends
on how ports 9 and 10 are configured. If 9 and 10 does not make use of
a line, one of the lower ports can use it.

Add a function to return the lane a port is using, if any, and simplify
the code to power up/down the lane.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add phylink support

Add rudimentary phylink support to mv88e6xxx.

TODO:
- needs to call phylink_mac_change() when the port link comes up/goes down.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: add helper for configuring 2500BaseX modes

Add a helper for MAC drivers to use in their validate callback to deal
with 2500BaseX vs 1000BaseX modes, where the hardware supports both
but it is not possible to automatically select between them.

This helper defaults to 1000BaseX, as that is the 802.3 standard, and
will allow users to select 2500BaseX either by forcing the speed if
AN is disabled, or by changing the advertising mask if AN is enabled.
Disabling AN is not recommended as it is only the speed that we're
interested in controlling, not the duplex or pause mode parameters.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add support to enabling pause

The 6185 can enable/disable 802.3z pause be setting the MyPause bit in
the port status register. Add an op to support this.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Various-updates'

Ido Schimmel says:

====================
mlxsw: Various updates

Patches 1-3 update the driver to use a new firmware version. Due to a
recently discovered issue, the version (and future ones) does not
support matching on VLAN ID at egress. This is enforced in the driver
and reported back to the user via extack.

Patch 4 adds a new selftest for the recently introduced algorithmic
TCAM.

Patch 5 converts the driver to use SPDX identifiers.

Patches 6-7 fix a bug in ethtool stats reporting and expose counters for
all 16 TCs, following recent MC-aware changes that utilize TCs 8-15.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Expose counter for all 16 TCs

Before MC-aware mode was enabled in commit 7b8195306694 ("mlxsw:
spectrum: Configure MC-aware mode on mlxsw ports"), only 8 traffic
classes were used. Under MC-aware regime, however, besides using TCs
0-7 for UC traffic, it additionally uses TCs 8-15 for BUM traffic. It
is therefore desirable to show counters for these TCs as well.

Update ethtool stats pool length, mlxsw_sp_port_get_strings() and
mlxsw_sp_port_get_stats() to include artifacts for all 16 TCs. For
consistency and simplicity, expose tc_no_buffer_discard_uc_tc for BUM
TCs as well, even though it ought to stay at 0 all the time.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Include RFC-2819 counters in stats length

The function mlxsw_sp_port_get_sset_count() is supposed to return the
total number of ethtool strings that mlxsw supports. Specifically for
names of statistic counters (the only string type that mlxsw supports
as of now), that number is stored in MLXSW_SP_PORT_ETHTOOL_STATS_LEN.
However, when adding RFC-2891 counters, that define wasn't updated to
include the new counters. As a result, ethtool snips out the counters
towards the end of the list, which contains per-TC counters, and only
the first three traffic classes end up being reported.

Fix by adding MLXSW_SP_PORT_HW_RFC_2819_STATS_LEN as appropriate.

Fixes: 1222d15a01c7 ("mlxsw: spectrum: Expose counters for various packet sizes")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Replace license text with SPDX identifiers and adjust copyrights

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Add TC flower test for Spectrum-2

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Reset FW after flash

Recent FW fixes a bug and allows to load newly flashed FW image after
reset. So make sure the reset happens after flash. Indicate the need
down to PCI layer by -EAGAIN.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Update the supported firmware to version 13.1702.6

This new firmware contains:
        - Support for new types of cables
        - Support for flashing future firmware without reboot
        - Support for Router ARP BC and UC traps

Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_flower: Disallow usage of vlan_id key on egress

As recent spectrum FW imposes a limitation on using vlan_id key for
egress ACL, disallow the usage of that key accordingly and return a
proper extack message.

Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
"This fixes a performance regression in arm64 NEON crypto as well as a
  crash in x86 aegis/morus on unsupported CPUs"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: x86/aegis,morus - Fix and simplify CPUID checks
  crypto: arm64 - revert NEON yield for fast AEAD implementations

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) The real fix for the ipv6 route metric leak Sabrina was seeing, from
    Cong Wang.

2) Fix syzbot triggers AF_PACKET v3 ring buffer insufficient room
    conditions, from Willem de Bruijn.

3) vsock can reinitialize active work struct, fix from Cong Wang.

4) RXRPC keepalive generator can wedge a cpu, fix from David Howells.

5) Fix locking in AF_SMC ioctl, from Ursula Braun.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  dsa: slave: eee: Allow ports to use phylink
  net/smc: move sock lock in smc_ioctl()
  net/smc: allow sysctl rmem and wmem defaults for servers
  net/smc: no shutdown in state SMC_LISTEN
  net: aquantia: Fix IFF_ALLMULTI flag functionality
  rxrpc: Fix the keepalive generator [ver #2]
  net/mlx5e: Cleanup of dcbnl related fields
  net/mlx5e: Properly check if hairpin is possible between two functions
  vhost: reset metadata cache when initializing new IOTLB
  llc: use refcount_inc_not_zero() for llc_sap_find()
  dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()
  tipc: fix an interrupt unsafe locking scenario
  vsock: split dwork to avoid reinitializations
  net: thunderx: check for failed allocation lmac->dmacs
  cxgb4: mk_act_open_req() buggers ->{local, peer}_ip on big-endian hosts
  packet: refine ring v3 block size test to hold one frame
  ip6_tunnel: use the right value for ipv4 min mtu check in ip6_tnl_xmit
  ipv6: fix double refcount of fib6_metrics

Merge branch 'mlx5-next'

Saeed Mahameed says:

====================
Mellanox, mlx5 next updates 2018-08-09

This series includes mlx5 core driver updates and mostly simple
cleanups.

From Denis: Use max #EQs reported by firmware to request MSIX vectors.

From Eli: Trivial cleanups, unused arguments/functions and reduce
command polling interval when command interface is in polling mode.

From Eran: Rename vport state enums, to better reflect their actual
usage.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Reduce command polling interval

Use cond_resched() instead of usleep_range() to decrease the time
between polling attempts thus reducing overall driver load time.

Below is a comparison before and after the change, of loading eight
virtual functions.

Before:
real    0m8.785s
user    0m0.093s
sys     0m0.090s

After:
real    0m5.730s
user    0m0.097s
sys     0m0.087s

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Unexport functions that need not be exported

mlx5_query_vport_state() and mlx5_modify_vport_admin_state() are used
only from within mlx5_core - unexport them.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Remove unused mlx5_query_vport_admin_state

mlx5_query_vport_admin_state() is not used anywhere. Remove it.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: E-Switch, Remove unused argument when creating legacy FDB

Remove unused nvports argument.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Rename modify/query_vport state related enums

Modify and query vport state commands share the same admin_state and
op_mod values, rename the enums to fit them both.

In addition, remove the esw prefix from the admin state enum as this
also applied for vnic.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Use max_num_eqs for calculation of required MSIX vectors

New firmware has defined new HCA capability field called "max_num_eqs",
that is the number of available EQs after subtracting reserved FW EQs.

Before this capability the FW reported the EQ number in "log_max_eqs",
the reported value also contained FW reserved EQs, but the driver might
be failing to load on 320 cpus systems due to the fact that FW
reserved EQs were not available to the driver.

Now the driver has to obtain max_num_eqs value from new FW to get real
number of EQs available.

Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dsa: slave: eee: Allow ports to use phylink

For a port to be able to use EEE, both the MAC and the PHY must
support EEE. A phy can be provided by both a phydev or phylink. Verify
at least one of these exist, not just phydev.

Fixes: aab9c4067d23 ("net: dsa: Plug in PHYLINK support")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'smc-fixes'

Ursula Braun says:

====================
net/smc: fixes 2018-08-08

here are small fixes for SMC: The first patch makes sure, shutdown code
is not executed for sockets in state SMC_LISTEN. The second patch resets
send and receive buffer values for accepted sockets, since TCP buffer size
optimizations for the internal CLC socket should not be forwarded to the
outer SMC socket. The third patch solves a race between connect and ioctl
reported by syzbot.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: move sock lock in smc_ioctl()

When an SMC socket is connecting it is decided whether fallback to
TCP is needed. To avoid races between connect and ioctl move the
sock lock before the use_fallback check.

Reported-by: syzbot+5b2cece1a8ecb2ca77d8@syzkaller.appspotmail.com
Reported-by: syzbot+19557374321ca3710990@syzkaller.appspotmail.com
Fixes: 1992d99882af ("net/smc: take sock lock in smc_ioctl()")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: allow sysctl rmem and wmem defaults for servers

Without setsockopt SO_SNDBUF and SO_RCVBUF settings, the sysctl
defaults net.ipv4.tcp_wmem and net.ipv4.tcp_rmem should be the base
for the sizes of the SMC sndbuf and rcvbuf. Any TCP buffer size
optimizations for servers should be ignored.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: no shutdown in state SMC_LISTEN

Invoking shutdown for a socket in state SMC_LISTEN does not make
sense. Nevertheless programs like syzbot fuzzing the kernel may
try to do this. For SMC this means a socket refcounting problem.
This patch makes sure a shutdown call for an SMC socket in state
SMC_LISTEN simply returns with -ENOTCONN.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: Fix IFF_ALLMULTI flag functionality

It was noticed that NIC always pass all multicast traffic to the host
regardless of IFF_ALLMULTI flag on the interface.
The rule in MC Filter Table in NIC, that is configured to accept any
multicast packets, is turning on if IFF_MULTICAST flag is set on the
interface. It leads to passing all multicast traffic to the host.
This fix changes the condition to turn on that rule by checking
IFF_ALLMULTI flag as it should.

Fixes: b21f502f84be ("net:ethernet:aquantia: Fix for multicast filter handling.")
Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rxrpc: Fix the keepalive generator [ver #2]

AF_RXRPC has a keepalive message generator that generates a message for a
peer ~20s after the last transmission to that peer to keep firewall ports
open.  The implementation is incorrect in the following ways:

(1) It mixes up ktime_t and time64_t types.

(2) It uses ktime_get_real(), the output of which may jump forward or
     backward due to adjustments to the time of day.

(3) If the current time jumps forward too much or jumps backwards, the
     generator function will crank the base of the time ring round one slot
     at a time (ie. a 1s period) until it catches up, spewing out VERSION
     packets as it goes.

Fix the problem by:

(1) Only using time64_t.  There's no need for sub-second resolution.

(2) Use ktime_get_seconds() rather than ktime_get_real() so that time
     isn't perceived to go backwards.

(3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
     parts:

     (a) The "worker" function that manages the buckets and the timer.

     (b) The "dispatch" function that takes the pending peers and
      potentially transmits a keepalive packet before putting them back
      in the ring into the slot appropriate to the revised last-Tx time.

(4) Taking everything that's pending out of the ring and splicing it into
     a temporary collector list for processing.

     In the case that there's been a significant jump forward, the ring
     gets entirely emptied and then the time base can be warped forward
     before the peers are processed.

     The warping can't happen if the ring isn't empty because the slot a
     peer is in is keepalive-time dependent, relative to the base time.

(5) Limit the number of iterations of the bucket array when scanning it.

(6) Set the timer to skip any empty slots as there's no point waking up if
     there's nothing to do yet.

This can be triggered by an incoming call from a server after a reboot with
AF_RXRPC and AFS built into the kernel causing a peer record to be set up
before userspace is started.  The system clock is then adjusted by
userspace, thereby potentially causing the keepalive generator to have a
meltdown - which leads to a message like:

watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
...
Workqueue: krxrpcd rxrpc_peer_keepalive_worker
EIP: lock_acquire+0x69/0x80
...
Call Trace:
? rxrpc_peer_keepalive_worker+0x5e/0x350
? _raw_spin_lock_bh+0x29/0x60
? rxrpc_peer_keepalive_worker+0x5e/0x350
? rxrpc_peer_keepalive_worker+0x5e/0x350
? __lock_acquire+0x3d3/0x870
? process_one_work+0x110/0x340
? process_one_work+0x166/0x340
? process_one_work+0x110/0x340
? worker_thread+0x39/0x3c0
? kthread+0xdb/0x110
? cancel_delayed_work+0x90/0x90
? kthread_stop+0x70/0x70
? ret_from_fork+0x19/0x24

Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx5-fixes'

Saeed Mahameed says:

====================
Mellanox, mlx5e fixes 2018-08-07

I know it is late into 4.18 release, and this is why I am submitting
only two mlx5e ethernet fixes.

The first one from Or, is needed for -stable and it fixes hairpin
for "same device" check.

The second fix is a non risk fix from Huy which cleans up and improves
error return value reporting for dcbnl_ieee_setapp.

For -stable v4.16
- net/mlx5e: Properly check if hairpin is possible between two functions
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Cleanup of dcbnl related fields

Remove unused netdev_registered_init/remove in en.h
Return ENOSUPPORT if the check MLX5_DSCP_SUPPORTED fails.
Remove extra white space

Fixes: 2a5e7a1344f4 ("net/mlx5e: Add dcbnl dscp to priority support")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Cc: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Properly check if hairpin is possible between two functions

The current check relies on function BDF addresses and can get
us wrong e.g when two VFs are assigned into a VM and the PCI
v-address is set by the hypervisor.

Fixes: 5c65c564c962 ('net/mlx5e: Support offloading TC NIC hairpin flows')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Alaa Hleihel <alaa@mellanox.com>
Tested-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ieee802154: hwsim: fix missing unlock on error in hwsim_add_one()

Add the missing unlock before return from function hwsim_add_one()
in the error handling case.

Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ieee802154: hwsim: fix copy-paste error in hwsim_set_edge_lqi()

The return value from kzalloc() is not checked correctly. The
test is done against a wrong variable. This patch fix it.

Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ieee802154: hwsim: fix rcu handling

This patch adds missing rcu_assign_pointer()/rcu_dereference() to used rcu
pointers. There was already a previous commit c5d99d2b35da ("ieee802154:
hwsim: fix rcu address annotation"), but there was more which was
pointed out on my side by using newest sparse version.

Cc: Stefan Schmidt <stefan@datenfreihafen.org>
Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb")
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

parisc: Define mb() and add memory barriers to assembler unlock sequences

For years I thought all parisc machines executed loads and stores in
order. However, Jeff Law recently indicated on gcc-patches that this is
not correct. There are various degrees of out-of-order execution all the
way back to the PA7xxx processor series (hit-under-miss). The PA8xxx
series has full out-of-order execution for both integer operations, and
loads and stores.

This is described in the following article:
http://web.archive.org/web/20040214092531/http://www.cpus.hp.com/technical_references/advperf.shtml

For this reason, we need to define mb() and to insert a memory barrier
before the store unlocking spinlocks. This ensures that all memory
accesses are complete prior to unlocking. The ldcw instruction performs
the same function on entry.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>

parisc: Enable CONFIG_MLONGCALLS by default

Enable the -mlong-calls compiler option by default, because otherwise in most
cases linking the vmlinux binary fails due to truncations of R_PARISC_PCREL22F
relocations. This fixes building the 64-bit defconfig.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>

net-next: hinic: fix a problem in free_tx_poll()

This patch fixes the problem below. The problem can be reproduced by the
following steps:
1) Connecting all HiNIC interfaces
2) On server side
    # sudo ifconfig eth0 192.168.100.1 up #Using MLX CX4 card
    # iperf -s
3) On client side
    # sudo ifconfig eth0 192.168.100.2 up #Using our HiNIC card
    # iperf -c 192.168.101.1 -P 10 -t 100000

after hours of testing, we will see errors:

    hinic 0000:05:00.0: No MGMT msg handler, mod = 0
    hinic 0000:05:00.0: No MGMT msg handler, mod = 0
    hinic 0000:05:00.0: No MGMT msg handler, mod = 0
    hinic 0000:05:00.0: No MGMT msg handler, mod = 0

The errors are caused by the following problem.
1) The hinic_get_wqe() checks the "wq->delta" to allocate new WQEs:

if (atomic_sub_return(num_wqebbs, &wq->delta) <= 0) {
atomic_add(num_wqebbs, &wq->delta);
return ERR_PTR(-EBUSY);
}

If the WQE occupies multiple pages, the shadow WQE will be used. Then the
hinic_xmit_frame() fills the WQE.

2) While in parallel with 1), the free_tx_poll() checks the "wq->delta"
to free old WQEs:

if ((atomic_read(&wq->delta) + num_wqebbs) > wq->q_depth)
return ERR_PTR(-EBUSY);

There is a probability that the shadow WQE which hinic_xmit_frame() is
using will be damaged by copy_wqe_to_shadow():

if (curr_pg != end_pg) {
void *shadow_addr = &wq->shadow_wqe[curr_pg * wq->max_wqe_size];

copy_wqe_to_shadow(wq, shadow_addr, num_wqebbs, *cons_idx);
return shadow_addr;
}

This can cause WQE data error and you will see the above error messages.
This patch fixes the problem.

Signed-off-by: Zhao Chen <zhaochen6@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost: reset metadata cache when initializing new IOTLB

We need to reset metadata cache during new IOTLB initialization,
otherwise the stale pointers to previous IOTLB may be still accessed
which will lead a use after free.

Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:mod: remove unneeded variable 'ret' in init_p9

The ret is modified after initalization, so just remove it and
return 0.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:af_iucv: get rid of the unneeded variable 'err' in afiucv_pm_freeze

We will not use the variable 'err' after initalization, So remove it and
return 0.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: nixge: Get rid of unused struct member 'last_link'

Get rid of unused struct member 'last_link'

Signed-off-by: Moritz Fischer <mdf@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-ethernet-Mark-expected-switch-fall-throughs'

Gustavo A. R. Silva says:

====================
net: ethernet: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, this patchset aims
to add some annotations in order to mark switch cases where we are
expecting to fall through.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpts: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114813 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tlan: Mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 141440 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sfc: falcon: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1384500 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: sxgbe: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1357414 ("Missing break in switch")
Addresses-Coverity-ID: 1357415 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114811 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: Mark expected switch fall-througs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1410181 ("Missing break in switch")
Addresses-Coverity-ID: 1410184 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: qede_fp: Mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1384501 ("Missing break in switch")
Addresses-Coverity-ID: 1398869 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netxen_nic: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1410182 ("Missing break in switch")
Addresses-Coverity-ID: 1410183 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: qed_dev: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in this particular case, I replaced the code comments with
a proper "fall through" annotation, which is what GCC is expecting
to find.

Addresses-Coverity-ID: 114809 ("Missing break in switch")
Addresses-Coverity-ID: 114810 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114808 ("Missing break in switch")
Addresses-Coverity-ID: 114802 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxge: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114796 ("Missing break in switch")
Addresses-Coverity-ID: 114804 ("Missing break in switch")
Addresses-Coverity-ID: 114806 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igbvf: netdev: Mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114801 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: e1000_phy: Mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114800 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: e1000_82575: Mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114799 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb_main: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 200521 ("Missing break in switch")
Addresses-Coverity-ID: 114797 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>