OSDN Git Service
Steen Hegelund [Thu, 24 Jun 2021 07:07:56 +0000 (09:07 +0200)]
net: sparx5: add calendar bandwidth allocation support
This configures the Sparx5 calendars according to the bandwidth
requested in the Device Tree nodes.
It also checks if the total requested bandwidth is within the
specs of the detected Sparx5 models limits.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:55 +0000 (09:07 +0200)]
net: sparx5: add switching support
This adds SwitchDev support by hardware offloading the
software bridge.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:54 +0000 (09:07 +0200)]
net: sparx5: add vlan support
This adds Sparx5 VLAN support.
Sparx5 has more VLAN features than provided here, but these will be added
in later series. For now we only add the basic L2 features.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:53 +0000 (09:07 +0200)]
net: sparx5: add mactable support
This adds the Sparx5 MAC tables: listening for MAC table updates and
updating on request.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:52 +0000 (09:07 +0200)]
net: sparx5: add port module support
This add configuration of the Sparx5 port module instances.
Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33
physical SerDes connections (S0 to S32). The 65th port (D64) is fixed
allocated to SerDes0 (S0). The remaining 64 ports can in various
multiplexing scenarios be connected to the remaining 32 SerDes using
QSGMII, or USGMII or USXGMII extenders. 32 of the ports can have a 1:1
mapping to the 32 SerDes.
Some additional ports (D65 to D69) are internal to the device and do not
connect to port modules or SerDes macros. For example, internal ports are
used for frame injection and extraction to the CPU queues.
The 65 logical ports are split up into the following blocks.
- 13 x 5G ports (D0-D11, D64)
- 32 x 2G5 ports (D16-D47)
- 12 x 10G ports (D12-D15, D48-D55)
- 8 x 25G ports (D56-D63)
Each logical port supports different line speeds, and depending on the
speeds supported, different port modules (MAC+PCS) are needed. A port
supporting 5 Gbps, 10 Gbps, or 25 Gbps as maximum line speed, will have a
DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps (incl 5
Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. As well as, it
will have a shadow DEV2G5 port module to support the lower speeds
(10/100/1000/2500Mbps). When a port needs to operate at lower speed and the
shadow DEV2G5 needs to be connected to its corresponding SerDes
Not all interface modes are supported in this series, but will be added at
a later stage.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:51 +0000 (09:07 +0200)]
net: sparx5: add hostmode with phylink support
This patch adds netdevs and phylink support for the ports in the switch.
It also adds register based injection and extraction for these ports.
Frame DMA support for injection and extraction will be added in a later
series.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:50 +0000 (09:07 +0200)]
net: sparx5: add the basic sparx5 driver
This adds the Sparx5 basic SwitchDev driver framework with IO range
mapping, switch device detection and core clock configuration.
Support for ports, phylink, netdev, mactable etc. are in the following
patches.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steen Hegelund [Thu, 24 Jun 2021 07:07:49 +0000 (09:07 +0200)]
dt-bindings: net: sparx5: Add sparx5-switch bindings
Document the Sparx5 switch device driver bindings
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcin Wojtas [Thu, 24 Jun 2021 00:51:51 +0000 (02:51 +0200)]
net: mdiobus: fix fwnode_mdbiobus_register() fallback case
The fallback case of fwnode_mdbiobus_register()
(relevant for !CONFIG_FWNODE_MDIO) was defined with wrong
argument name, causing a compilation error. Fix that.
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 23 Jun 2021 21:44:38 +0000 (14:44 -0700)]
net: ip: avoid OOM kills with large UDP sends over loopback
Dave observed number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has MTU of 64k kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse if
the message length is <32k the allocation may trigger an
OOM killer.
This is entirely avoidable, we can use an skb with page frags.
af_unix solves a similar problem by limiting the head
length to SKB_MAX_ALLOC. This seems like a good and simple
approach. It means that UDP messages > 16kB will now
use fragments if underlying device supports SG, if extra
allocator pressure causes regressions in real workloads
we can switch to trying the large allocation first and
falling back.
v4: pre-calculate all the additions to alloclen so
we can be sure it won't go over order-2
Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenz Bauer [Wed, 23 Jun 2021 13:56:46 +0000 (15:56 +0200)]
tools/testing: add a selftest for SO_NETNS_COOKIE
Make sure that SO_NETNS_COOKIE returns a non-zero value, and
that sockets from different namespaces have a distinct cookie
value.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martynas Pumputis [Wed, 23 Jun 2021 13:56:45 +0000 (15:56 +0200)]
net: retrieve netns cookie via getsocketopt
It's getting more common to run nested container environments for
testing cloud software. One of such examples is Kind [1] which runs a
Kubernetes cluster in Docker containers on a single host. Each container
acts as a Kubernetes node, and thus can run any Pod (aka container)
inside the former. This approach simplifies testing a lot, as it
eliminates complicated VM setups.
Unfortunately, such a setup breaks some functionality when cgroupv2 BPF
programs are used for load-balancing. The load-balancer BPF program
needs to detect whether a request originates from the host netns or a
container netns in order to allow some access, e.g. to a service via a
loopback IP address. Typically, the programs detect this by comparing
netns cookies with the one of the init ns via a call to
bpf_get_netns_cookie(NULL). However, in nested environments the latter
cannot be used given the Kubernetes node's netns is outside the init ns.
To fix this, we need to pass the Kubernetes node netns cookie to the
program in a different way: by extending getsockopt() with a
SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes
node netns can retrieve the cookie and pass it to the program instead.
Thus, this is following up on Eric's commit
3d368ab87cf6 ("net:
initialize net->net_cookie at netns setup") to allow retrieval via
SO_NETNS_COOKIE. This is also in line in how we retrieve socket cookie
via SO_COOKIE.
[1] https://kind.sigs.k8s.io/
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 Jun 2021 22:46:25 +0000 (15:46 -0700)]
Merge branch 'devlink-rate-limit-fixes'
Dmytro Linkin says:
====================
Fixes for devlink rate objects API
Patch #1 fixes not decreased refcount of parent node for destroyed leaf
object.
Patch #2 fixes incorect eswitch mode check.
Patch #3 protects list traversing with a lock.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 23 Jun 2021 13:43:15 +0000 (16:43 +0300)]
devlink: Protect rate list with lock while switching modes
Devlink eswitch set command doesn't hold devlink->lock, which makes
possible race condition between rate list traversing and others devlink
rate KAPI calls, like devlink_rate_nodes_destroy().
Hold devlink lock while traversing the list.
Fixes:
a8ecb93ef03d ("devlink: Introduce rate nodes")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 23 Jun 2021 13:43:14 +0000 (16:43 +0300)]
devlink: Remove eswitch mode check for mode set call
When eswitch is disabled, querying its current mode results in error.
Due to this when trying to set the eswitch mode for mlx5 devices, it
fails to set the eswitch switchdev mode.
Hence remove such check.
Fixes:
a8ecb93ef03d ("devlink: Introduce rate nodes")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 23 Jun 2021 13:43:13 +0000 (16:43 +0300)]
devlink: Decrease refcnt of parent rate object on leaf destroy
Port functions, like SFs, can be deleted by the user when its leaf rate
object has parent node. In such case node refcnt won't be decreased
which blocks the node from deletion later.
Do simple refcnt decrease, since driver in cleanup stage. This:
1) assumes that driver took proper internal parent unset action;
2) allows to avoid nested callbacks call and deadlock.
Fixes:
d75559845078 ("devlink: Allow setting parent node of rate objects")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xianting Tian [Wed, 23 Jun 2021 15:16:22 +0000 (11:16 -0400)]
virtio_net: Use virtio_find_vqs_ctx() helper
virtio_find_vqs_ctx() is defined but never be called currently,
it is the right place to use it.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Wed, 23 Jun 2021 06:06:34 +0000 (15:06 +0900)]
net/tls: Remove the __TLS_DEC_STATS() macro.
The commit
d26b698dd3cd ("net/tls: add skeleton of MIB statistics")
introduced __TLS_DEC_STATS(), but it is not used and __SNMP_DEC_STATS() is
not defined also. Let's remove it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima [Tue, 22 Jun 2021 23:35:29 +0000 (08:35 +0900)]
tcp: Add stats for socket migration.
This commit adds two stats for the socket migration feature to evaluate the
effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
If the migration fails because of the own_req race in receiving ACK and
sending SYN+ACK paths, we do not increment the failure stat. Then another
CPU is responsible for the req.
Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Wilder [Tue, 22 Jun 2021 21:52:15 +0000 (14:52 -0700)]
ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM.
TCP checksums on received packets may be set to NULL by the sender if CSO
is enabled. The hypervisor flags these packets as check-sum-ok and the
skb is then flagged CHECKSUM_UNNECESSARY. If these packets are then
forwarded the sender will not request CSO due to the CHECKSUM_UNNECESSARY
flag. The result is a TCP packet sent with a bad checksum. This change
sets up CHECKSUM_PARTIAL on these packets causing the sender to correctly
request CSUM offload.
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Tested-by: Cristobal Forno <cforno12@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 Jun 2021 19:48:07 +0000 (12:48 -0700)]
Merge tag 'mlx5-net-next-2021-06-22' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-net-next-2021-06-22
1) Various minor cleanups and fixes from net-next branch
2) Optimize mlx5 feature check on tx and
a fix to allow Vxlan with Ipsec offloads
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 Jun 2021 19:31:28 +0000 (12:31 -0700)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
from Phil Sutter.
2) Simplify TCP option sanity check for TCP packets, also from Phil.
3) Add a new expression to store when the rule has been used last time.
4) Pass the hook state object to log function, from Florian Westphal.
5) Document the new sysctl knobs to tune the flowtable timeouts,
from Oz Shlomo.
6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
from Dan Carpenter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrea Righi [Tue, 22 Jun 2021 07:46:48 +0000 (09:46 +0200)]
selftests: icmp_redirect: support expected failures
According to a comment in commit
99513cfa16c6 ("selftest: Fixes for
icmp_redirect test") the test "IPv6: mtu exception plus redirect" is
expected to fail, because of a bug in the IPv6 logic that hasn't been
fixed yet apparently.
We should probably consider this failure as an "expected failure",
therefore change the script to return XFAIL for that particular test and
also report the total amount of expected failures at the end of the run.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 Jun 2021 19:17:35 +0000 (12:17 -0700)]
Merge branch 'lockless-qdisc-opts'
Yunsheng Lin says:
====================
Some optimization for lockless qdisc
Patch 1: remove unnecessary seqcount operation.
Patch 2: implement TCQ_F_CAN_BYPASS.
Patch 3: remove qdisc->empty.
Performance data for pktgen in queue_xmit mode + dummy netdev
with pfifo_fast:
threads unpatched patched delta
1 2.60Mpps 3.21Mpps +23%
2 3.84Mpps 5.56Mpps +44%
4 5.52Mpps 5.58Mpps +1%
8 2.77Mpps 2.76Mpps -0.3%
16 2.24Mpps 2.23Mpps -0.4%
Performance for IP forward testing: 1.05Mpps increases to
1.16Mpps, about 10% improvement.
V3: Add 'Acked-by' from Jakub and 'Tested-by' from Vladimir,
and resend based on latest net-next.
V2: Adjust the comment and commit log according to discussion
in V1.
V1: Drop RFC tag, add nolock_qdisc_is_empty() and do the qdisc
empty checking without the protection of qdisc->seqlock to
aviod doing unnecessary spin_trylock() for contention case.
RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate non-empty
qdisc, and add patch 1 and 3.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 22 Jun 2021 06:49:57 +0000 (14:49 +0800)]
net: sched: remove qdisc->empty for lockless qdisc
As MISSED and DRAINING state are used to indicate a non-empty
qdisc, qdisc->empty is not longer needed, so remove it.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 22 Jun 2021 06:49:56 +0000 (14:49 +0800)]
net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flag set, but queue discipline by-pass does not work for lockless
qdisc because skb is always enqueued to qdisc even when the qdisc
is empty, see __dev_xmit_skb().
This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for empty lockless qdisc, which aviod enqueuing
and dequeuing operation.
As qdisc->empty is not reliable to indicate a empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit
a90c57f2cedd ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicate there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid packet out of order
problem.
In order to make MISSED state reliable to indicate a empty qdisc,
we need to ensure that testing and clearing of MISSED state is
within the protection of qdisc->seqlock, only setting MISSED state
can be done without the protection of qdisc->seqlock. A MISSED
state testing is added without the protection of qdisc->seqlock to
aviod doing unnecessary spin_trylock() for contention case.
As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:
thread1 thread2 thread3
qdisc_run_begin() # true
qdisc_run_begin(q)
set(MISSED)
pfifo_fast_dequeue
clear(MISSED)
# recheck the queue
qdisc_run_end()
enqueue skb1
qdisc empty # true
qdisc_run_begin() # true
sch_direct_xmit() # skb2
qdisc_run_begin()
set(MISSED)
When above happens, skb1 enqueued by thread2 is transmited after
skb2 is transmited by thread3 because MISSED state setting and
enqueuing is not under the qdisc->seqlock. If qdisc bypass is
disabled, skb1 has better chance to be transmited quicker than
skb2.
This patch does not take care of the above data race, because we
view this as similar as below:
Even at the same time CPU1 and CPU2 write the skb to two socket
which both heading to the same qdisc, there is no guarantee that
which skb will hit the qdisc first, because there is a lot of
factor like interrupt/softirq/cache miss/scheduling afffecting
that.
There are below cases that need special handling:
1. When MISSED state is cleared before another round of dequeuing
in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
dequeue all skb in one round and call __netif_schedule(), which
might result in a non-empty qdisc without MISSED set. In order
to avoid this, the MISSED state is set for lockless qdisc and
__netif_schedule() will be called at the end of qdisc_run_end.
2. The MISSED state also need to be set for lockless qdisc instead
of calling __netif_schedule() directly when requeuing a skb for
a similar reason.
3. For netdev queue stopped case, the MISSED case need clearing
while the netdev queue is stopped, otherwise there may be
unnecessary __netif_schedule() calling. So a new DRAINING state
is added to indicate this case, which also indicate a non-empty
qdisc.
4. As there is already netif_xmit_frozen_or_stopped() checking in
dequeue_skb() and sch_direct_xmit(), which are both within the
protection of qdisc->seqlock, but the same checking in
__dev_xmit_skb() is without the protection, which might cause
empty indication of a lockless qdisc to be not reliable. So
remove the checking in __dev_xmit_skb(), and the checking in
the protection of qdisc->seqlock seems enough to avoid the cpu
consumption problem for netdev queue stopped case.
1. https://lkml.org/lkml/2021/5/29/215
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 22 Jun 2021 06:49:55 +0000 (14:49 +0800)]
net: sched: avoid unnecessary seqcount operation for lockless qdisc
qdisc->running seqcount operation is mainly used to do heuristic
locking on q->busylock for locked qdisc, see qdisc_is_running()
and __dev_xmit_skb().
So avoid doing seqcount operation for qdisc with TCQ_F_NOLOCK
flag.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huy Nguyen [Mon, 14 Jun 2021 14:33:49 +0000 (17:33 +0300)]
net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload
The packet is VXLAN packet over IPsec transport mode tunnel
which has the following format: [IP1 | ESP | UDP | VXLAN | IP2 | TCP]
NVIDIA ConnectX card cannot do checksum offload for two L4 headers.
The solution is using the checksum partial offload similar to
VXLAN | TCP packet. Hardware calculates IP1, IP2 and TCP checksums and
software calculates UDP checksum. However, unlike VXLAN | TCP case,
IPsec's mlx5 driver cannot access the inner plaintext IP protocol type.
Therefore, inner_ipproto is added in the sec_path structure
to provide this information. Also, utilize the skb's csum_start to
program L4 inner checksum offset.
While at it, remove the call to mlx5e_set_eseg_swp and setup software parser
fields directly in mlx5e_ipsec_set_swp. mlx5e_set_eseg_swp is not
needed as the two features (GENEVE and IPsec) are different and adding
this sharing layer creates unnecessary complexity and affect
performance.
For the case VXLAN packet over IPsec tunnel mode tunnel, checksum offload
is disabled because the hardware does not support checksum offload for
three L3 (IP) headers.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Huy Nguyen [Mon, 14 Jun 2021 14:33:48 +0000 (17:33 +0300)]
net/xfrm: Add inner_ipproto into sec_path
The inner_ipproto saves the inner IP protocol of the plain
text packet. This allows vendor's IPsec feature making offload
decision at skb's features_check and configuring hardware at
ndo_start_xmit.
For example, ConnectX6-DX IPsec device needs the plaintext's
IP protocol to support partial checksum offload on
VXLAN/GENEVE packet over IPsec transport mode tunnel.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Huy Nguyen [Mon, 14 Jun 2021 14:33:47 +0000 (17:33 +0300)]
net/mlx5: Optimize mlx5e_feature_checks for non IPsec packet
mlx5e_ipsec_feature_check belongs to mlx5e_tunnel_features_check.
Also, IPsec is not the default configuration so it should be
checked at the end instead of the beginning of mlx5e_features_check.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
caihuoqing [Thu, 17 Jun 2021 02:32:15 +0000 (10:32 +0800)]
net/mlx5: remove "default n" from Kconfig
remove "default n" and "No" is default
Signed-off-by: caihuoqing <caihuoqing@baidu.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Colin Ian King [Wed, 16 Jun 2021 14:19:50 +0000 (15:19 +0100)]
net/mlx5: Fix spelling mistake "enught" -> "enough"
There is a spelling mistake in a mlx5_core_err error message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Nathan Chancellor [Fri, 18 Jun 2021 00:03:59 +0000 (17:03 -0700)]
net/mlx5: Use cpumask_available() in mlx5_eq_create_generic()
When CONFIG_CPUMASK_OFFSTACK is unset, cpumask_var_t is not a pointer
but a single element array, meaning its address in a structure cannot be
NULL as long as it is not the first element, which it is not. This
results in a clang warning:
drivers/net/ethernet/mellanox/mlx5/core/eq.c:715:14: warning: address of
array 'param->affinity' will always evaluate to 'true'
[-Wpointer-bool-conversion]
if (!param->affinity)
~~~~~~~~^~~~~~~~
1 warning generated.
The helper cpumask_available was added in commit
f7e30f01a9e2 ("cpumask:
Add helper cpumask_available()") to handle situations like this so use
it to keep the meaning of the code the same while resolving the warning.
Fixes:
e4e3f24b822f ("net/mlx5: Provide cpumask at EQ creation phase")
Link: https://github.com/ClangBuiltLinux/linux/issues/1400
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Jiapeng Chong [Tue, 22 Jun 2021 06:09:21 +0000 (14:09 +0800)]
net/mlx5: Fix missing error code in mlx5_init_fs()
The error code is missing in this code scenario, add the error code
'-ENOMEM' to the return value 'err'.
Eliminate the follow smatch warning:
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c:2973 mlx5_init_fs()
warn: missing error code 'err'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes:
4a98544d1827 ("net/mlx5: Move chains ft pool to be used by all firmware steering").
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
David S. Miller [Tue, 22 Jun 2021 21:36:01 +0000 (14:36 -0700)]
Merge branch 'mptcp-C-flag-and-fixes'
Mat Martineau says:
====================
mptcp: Connection-time 'C' flag and two fixes
Here are six more patches from the MPTCP tree.
Most of them add support for the 'C' flag in the MPTCP connection-time
option headers. This flag affects how the initial address and port are
treated by each peer. Normally one peer may send MP_JOIN requests to the
remote address and port that were used when initiating the MPTCP
connection. The 'C' bit indicates that MP_JOINs should only be sent to
remote addresses that have been advertised with ADD_ADDR.
The other two patches are unrelated improvements.
Patches 1-4: Add the 'C' flag feature, a sysctl to optionally enable it,
and a selftest.
Patch 5: Adjust rp_filter settings in a selftest.
Patch 6: Improve rbuf cleanup for MPTCP sockets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 22 Jun 2021 19:25:23 +0000 (12:25 -0700)]
mptcp: refine mptcp_cleanup_rbuf
The current cleanup rbuf tries a bit too hard to avoid acquiring
the subflow socket lock. We may end-up delaying the needed ack,
or skip acking a blocked subflow.
Address the above extending the conditions used to trigger the cleanup
to reflect more closely what TCP does and invoking tcp_cleanup_rbuf()
on all the active subflows.
Note that we can't replicate the exact tests implemented in
tcp_cleanup_rbuf(), as MPTCP lacks some of the required info - e.g.
ping-pong mode.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonglong Li [Tue, 22 Jun 2021 19:25:22 +0000 (12:25 -0700)]
selftests: mptcp: turn rp_filter off on each NIC
To turn rp_filter off we should:
echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
and
echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
before NIC created.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Tue, 22 Jun 2021 19:25:21 +0000 (12:25 -0700)]
selftests: mptcp: add deny_join_id0 testcases
This patch added a new argument '-d' for mptcp_join.sh script, to invoke
the testcases for the MP_CAPABLE 'C' flag.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Tue, 22 Jun 2021 19:25:20 +0000 (12:25 -0700)]
mptcp: add deny_join_id0 in mptcp_options_received
This patch added a new flag named deny_join_id0 in struct
mptcp_options_received. Set it when MP_CAPABLE with the flag
MPTCP_CAP_DENYJOIN_ID0 is received.
Also add a new flag remote_deny_join_id0 in struct mptcp_pm_data. When the
flag deny_join_id0 is set, set this remote_deny_join_id0 flag.
In mptcp_pm_create_subflow_or_signal_addr, if the remote_deny_join_id0 flag
is set, and the remote address id is zero, stop this connection.
Suggested-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Tue, 22 Jun 2021 19:25:19 +0000 (12:25 -0700)]
mptcp: add allow_join_id0 in mptcp_out_options
This patch defined a new flag MPTCP_CAP_DENY_JOIN_ID0 for the third bit,
labeled "C" of the MP_CAPABLE option.
Add a new flag allow_join_id0 in struct mptcp_out_options. If this flag is
set, send out the MP_CAPABLE option with the flag MPTCP_CAP_DENY_JOIN_ID0.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Tue, 22 Jun 2021 19:25:18 +0000 (12:25 -0700)]
mptcp: add sysctl allow_join_initial_addr_port
This patch added a new sysctl, named allow_join_initial_addr_port, to
control whether allow peers to send join requests to the IP address and
port number used by the initial subflow.
Suggested-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Jun 2021 18:28:52 +0000 (11:28 -0700)]
Merge branch 'sctp-packetization-path-MTU'
Xin Long says:
====================
sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP transport
Overview(From RFC8899):
In contrast to PMTUD, Packetization Layer Path MTU Discovery
(PLPMTUD) [RFC4821] introduces a method that does not rely upon
reception and validation of PTB messages. It is therefore more
robust than Classical PMTUD. This has become the recommended
approach for implementing discovery of the PMTU [BCP145].
It uses a general strategy in which the PL sends probe packets to
search for the largest size of unfragmented datagram that can be sent
over a network path. Probe packets are sent to explore using a
larger packet size. If a probe packet is successfully delivered (as
determined by the PL), then the PLPMTU is raised to the size of the
successful probe. If a black hole is detected (e.g., where packets
of size PLPMTU are consistently not received), the method reduces the
PLPMTU.
SCTP Probe Packets:
As the RFC suggested, the probe packets consist of an SCTP common header
followed by a HEARTBEAT chunk and a PAD chunk. The PAD chunk is used to
control the length of the probe packet. The HEARTBEAT chunk is used to
trigger the sending of a HEARTBEAT ACK chunk to confirm this probe on
the HEARTBEAT sender.
The HEARTBEAT chunk also carries a Heartbeat Information parameter that
includes the probe size to help an implementation associate a HEARTBEAT
ACK with the size of probe that was sent. The sender use the nonce and
the probe size to verify the information returned.
Detailed Implementation on SCTP:
+------+
+------->| Base |-----------------+ Connectivity
| +------+ | or BASE_PLPMTU
| | | confirmation failed
| | v
| | Connectivity +-------+
| | and BASE_PLPMTU | Error |
| | confirmed +-------+
| | | Consistent
| v | connectivity
Black Hole | +--------+ | and BASE_PLPMTU
detected | | Search |<---------------+ confirmed
| +--------+
| ^ |
| | |
| Raise | | Search
| timer | | algorithm
| expired | | completed
| | |
| | v
| +-----------------+
+---| Search Complete |
+-----------------+
When PLPMTUD is enabled, it's in Base state, and starts to probe with
BASE_PLPMTU (1200). If this probe succeeds, it goes to Search state;
If this probe fails, it goes to Error state under which pl.pmtu goes
down to MIN_PLPMTU (512) and keeps probing with BASE_PLPMTU until it
succeeds and goes to Search state.
During the Search state, the probe size is growing by a Big step (32)
every time when the last probe succeeds at the beginning. Once a probe
(such as 1420) fails after trying MAX_PROBES (3) times, the probe_size
goes back to the last one (1420 - 32 = 1388), meanwhile 'probe_high'
is set to 1420 and the growing step becomes a Small one (4). Then the
probe is continuing with a Small step grown each round. Until it gets
the optimal size (such as 1400) when probe with its next probe size
(1404) fails, it sync this size to pathmtu and goes to Complete state.
In Complete state, it will only does a probe check for the pathmtu just
set, if it fails, which means a Black Hole is detected and it goes back
to Base state. If it succeeds, it goes back to Search state again, and
probe is continuing with growing a Small step (1400 + 4). If this probe
fails, probe_high is set and goes back to 1388 and then Complete state,
which is kind of a loop normally. However if the env's pathmtu changes
to a big size somehow, this probe will succeed and then probe continues
with growing a Big step (1400 + 32) each round until another probe fails.
PTB Messages Process:
PLPMTUD doesn't rely on these package to find the pmtu, and shouldn't
trust it either. When processing them, it only changes the probe_size
to PL_PTB_SIZE(info - hlen) if 'pl.pmtu < PL_PTB_SIZE < the current
probe_size' druing Search state. As this could help probe_size to get
to the optimal size faster, for exmaple:
pl.pmtu = 1388, probe_size = 1420, while the env's pathmtu = 1400.
When probe_size is 1420, a Toobig packet with 1400 comes back. If probe
size changes to use 1400, it will save quite a few rounds to get there.
But of course after having this value, PLPMTUD will still verify it on
its own before using it.
Patches:
- Patch 1-6: introduce some new constants/variables from the RFC, systcl
and members in transport, APIs for the following patches, chunks and
a timer for the probe sending and some codes for the probe receiving.
- Patch 7-9: implement the state transition on the tx path, rx path and
toobig ICMP packet processing. This is the main algorithm part.
- Patch 10: activate this feature
- Patch 11-14: improve the process for ICMP packets for SCTP over UDP,
so that it can also be covered by this feature.
Tests:
- do sysctl and setsockopt tests for this feature's enabling and disabling.
- get these pr_debug points for this feature by
# cat /sys/kernel/debug/dynamic_debug/control | grep PLP
and enable them on kernel dynamic debug, then play with the pathmtu and
check if the state transition and plpmtu change match the RFC.
- do the above tests for SCTP over IPv4/IPv6 and SCTP over UDP.
v1->v2:
- See Patch 06/14.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:05:00 +0000 (14:05 -0400)]
sctp: process sctp over udp icmp err on sctp side
Previously, sctp over udp was using udp tunnel's icmp err process, which
only does sk lookup on sctp side. However for sctp's icmp error process,
there are more things to do, like syncing assoc pmtu/retransmit packets
for toobig type err, and starting proto_unreach_timer for unreach type
err etc.
Now after adding PLPMTUD, which also requires to process toobig type err
on sctp side. This patch is to process icmp err on sctp side by parsing
the type/code/info in .encap_err_lookup and call sctp's icmp processing
functions. Note as the 'redirect' err process needs to know the outer
ip(v6) header's, we have to leave it to udp(v6)_err to handle it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:59 +0000 (14:04 -0400)]
sctp: extract sctp_v4_err_handle function from sctp_v4_err
This patch is to extract sctp_v4_err_handle() from sctp_v4_err() to
only handle the icmp err after the sock lookup, and it also makes
the code clearer.
sctp_v4_err_handle() will be used in sctp over udp's err handling
in the following patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:58 +0000 (14:04 -0400)]
sctp: extract sctp_v6_err_handle function from sctp_v6_err
This patch is to extract sctp_v6_err_handle() from sctp_v6_err() to
only handle the icmp err after the sock lookup, and it also makes
the code clearer.
sctp_v6_err_handle() will be used in sctp over udp's err handling
in the following patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:57 +0000 (14:04 -0400)]
sctp: remove the unessessary hold for idev in sctp_v6_err
Same as in tcp_v6_err() and __udp6_lib_err(), there's no need to
hold idev in sctp_v6_err(), so just call __in6_dev_get() instead.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:56 +0000 (14:04 -0400)]
sctp: enable PLPMTUD when the transport is ready
sctp_transport_pl_reset() is called whenever any of these 3 members in
transport is changed:
- probe_interval
- param_flags & SPP_PMTUD_ENABLE
- state == ACTIVE
If all are true, start the PLPMTUD when it's not yet started. If any of
these is false, stop the PLPMTUD when it's already running.
sctp_transport_pl_update() is called when the transport dst has changed.
It will restart the PLPMTUD probe. Again, the pathmtu won't change but
use the dst's mtu until the Search phase is done.
Note that after using PLPMTUD, the pathmtu is only initialized with the
dst mtu when the transport dst changes. At other time it is updated by
pl.pmtu. So sctp_transport_pmtu_check() will be called only when PLPMTUD
is disabled in sctp_packet_config().
After this patch, the PLPMTUD feature from RFC8899 will be activated
and can be used by users.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:55 +0000 (14:04 -0400)]
sctp: do state transition when receiving an icmp TOOBIG packet
PLPMTUD will short-circuit the old process for icmp TOOBIG packets.
This part is described in rfc8899#section-4.6.2 (PL_PTB_SIZE =
PTB_SIZE - other_headers_len). Note that from rfc8899#section-5.2
State Machine, each case below is for some specific states only:
a) PL_PTB_SIZE < MIN_PLPMTU || PL_PTB_SIZE >= PROBED_SIZE,
discard it, for any state
b) MIN_PLPMTU < PL_PTB_SIZE < BASE_PLPMTU,
Base -> Error, for Base state
c) BASE_PLPMTU <= PL_PTB_SIZE < PLPMTU,
Search -> Base or Complete -> Base, for Search and Complete states.
d) PLPMTU < PL_PTB_SIZE < PROBED_SIZE,
set pl.probe_size to PL_PTB_SIZE then verify it, for Search state.
The most important one is case d), which will help find the optimal
fast during searching. Like when pathmtu = 1392 for SCTP over IPv4,
the search will be (20 is iphdr_len):
1. probe with 1200 - 20
2. probe with 1232 - 20
3. probe with 1264 - 20
...
7. probe with 1388 - 20
8. probe with 1420 - 20
When sending the probe with 1420 - 20, TOOBIG may come with PL_PTB_SIZE =
1392 - 20. Then it matches case d), and saves some rounds to try with the
1392 - 20 probe. But of course, PLPMTUD doesn't trust TOOBIG packets, and
it will go back to the common searching once the probe with the new size
can't be verified.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:54 +0000 (14:04 -0400)]
sctp: do state transition when a probe succeeds on HB ACK recv path
As described in rfc8899#section-5.2, when a probe succeeds, there might
be the following state transitions:
- Base -> Search, occurs when probe succeeds with BASE_PLPMTU,
pl.pmtu is not changing,
pl.probe_size increases by SCTP_PL_BIG_STEP,
- Error -> Search, occurs when probe succeeds with BASE_PLPMTU,
pl.pmtu is changed from SCTP_MIN_PLPMTU to SCTP_BASE_PLPMTU,
pl.probe_size increases by SCTP_PL_BIG_STEP.
- Search -> Search Complete, occurs when probe succeeds with the probe
size SCTP_MAX_PLPMTU less than pl.probe_high,
pl.pmtu is not changing, but update *pathmtu* with it,
pl.probe_size is set back to pl.pmtu to double check it.
- Search Complete -> Search, occurs when probe succeeds with the probe
size equal to pl.pmtu,
pl.pmtu is not changing,
pl.probe_size increases by SCTP_PL_MIN_STEP.
So search process can be described as:
1. When it just enters 'Search' state, *pathmtu* is not updated with
pl.pmtu, and probe_size increases by a big step (SCTP_PL_BIG_STEP)
each round.
2. Until pl.probe_high is set when a probe fails, and probe_size
decreases back to pl.pmtu, as described in the last patch.
3. When the probe with the new size succeeds, probe_size changes to
increase by a small step (SCTP_PL_MIN_STEP) due to pl.probe_high
is set.
4. Until probe_size is next to pl.probe_high, the searching finishes and
it goes to 'Complete' state and updates *pathmtu* with pl.pmtu, and
then probe_size is set to pl.pmtu to confirm by once more probe.
5. This probe occurs after "30 * probe_inteval", a much longer time than
that in Search state. Once it is done it goes to 'Search' state again
with probe_size increased by SCTP_PL_MIN_STEP.
As we can see above, during the searching, pl.pmtu changes while *pathmtu*
doesn't. *pathmtu* is only updated when the search finishes by which it
gets an optimal value for it. A big step is used at the beginning until
it gets close to the optimal value, then it changes to a small step until
it has this optimal value.
The small step is also used in 'Complete' until it goes to 'Search' state
again and the probe with 'pmtu + the small step' succeeds, which means a
higher size could be used. Then probe_size changes to increase by a big
step again until it gets close to the next optimal value.
Note that anytime when black hole is detected, it goes directly to 'Base'
state with pl.pmtu set to SCTP_BASE_PLPMTU, as described in the last patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:53 +0000 (14:04 -0400)]
sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path
The state transition is described in rfc8899#section-5.2,
PROBE_COUNT == MAX_PROBES means the probe fails for MAX times, and the
state transition includes:
- Base -> Error, occurs when BASE_PLPMTU Confirmation Fails,
pl.pmtu is set to SCTP_MIN_PLPMTU,
probe_size is still SCTP_BASE_PLPMTU;
- Search -> Base, occurs when Black Hole Detected,
pl.pmtu is set to SCTP_BASE_PLPMTU,
probe_size is set back to SCTP_BASE_PLPMTU;
- Search Complete -> Base, occurs when Black Hole Detected
pl.pmtu is set to SCTP_BASE_PLPMTU,
probe_size is set back to SCTP_BASE_PLPMTU;
Note a black hole is encountered when a sender is unaware that packets
are not being delivered to the destination endpoint. So it includes the
probe failures with equal probe_size to pl.pmtu, and definitely not
include that with greater probe_size than pl.pmtu. The later one is the
normal probe failure where probe_size should decrease back to pl.pmtu
and pl.probe_high is set. pl.probe_high would be used on HB ACK recv
path in the next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:52 +0000 (14:04 -0400)]
sctp: do the basic send and recv for PLPMTUD probe
This patch does exactly what rfc8899#section-6.2.1.2 says:
The SCTP sender needs to be able to determine the total size of a
probe packet. The HEARTBEAT chunk could carry a Heartbeat
Information parameter that includes, besides the information
suggested in [RFC4960], the probe size to help an implementation
associate a HEARTBEAT ACK with the size of probe that was sent. The
sender could also use other methods, such as sending a nonce and
verifying the information returned also contains the corresponding
nonce. The length of the PAD chunk is computed by reducing the
probing size by the size of the SCTP common header and the HEARTBEAT
chunk.
Note that HB ACK chunk will carry back whatever HB chunk carried, including
the probe_size we put it in; We also check hbinfo->probe_size in the HB ACK
against link->pl.probe_size to validate this HB ACK chunk.
v1->v2:
- Remove the unused 'sp' and add static for sctp_packet_bundle_pad().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:51 +0000 (14:04 -0400)]
sctp: add the probe timer in transport for PLPMTUD
There are 3 timers described in rfc8899#section-5.1.1:
PROBE_TIMER, PMTU_RAISE_TIMER, CONFIRMATION_TIMER
This patches adds a 'probe_timer' in transport, and it works as either
PROBE_TIMER or PMTU_RAISE_TIMER. At most time, it works as PROBE_TIMER
and expires every a 'probe_interval' time to send the HB probe packet.
When transport pl enters COMPLETE state, it works as PMTU_RAISE_TIMER
and expires in 'probe_interval * 30' time to go back to SEARCH state
and do searching again.
SCTP HB is an acknowledged packet, CONFIRMATION_TIMER is not needed.
The timer will start when transport pl enters BASE state and stop
when it enters DISABLED state.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:50 +0000 (14:04 -0400)]
sctp: add the constants/variables and states and some APIs for transport
These are 4 constants described in rfc8899#section-5.1.2:
MAX_PROBES, MIN_PLPMTU, MAX_PLPMTU, BASE_PLPMTU;
And 2 variables described in rfc8899#section-5.1.3:
PROBED_SIZE, PROBE_COUNT;
And 5 states described in rfc8899#section-5.2:
DISABLED, BASE, SEARCH, SEARCH_COMPLETE, ERROR;
And these 4 APIs are used to reset/update PLPMTUD, check if PLPMTUD is
enabled, and calculate the additional headers length for a transport.
Note the member 'probe_high' in transport will be set to the probe
size when a probe fails with this probe size in the next patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:49 +0000 (14:04 -0400)]
sctp: add SCTP_PLPMTUD_PROBE_INTERVAL sockopt for sock/asoc/transport
With this socket option, users can change probe_interval for
a transport, asoc or sock after it's created.
Note that if the change is for an asoc, also apply the change
to each transport in this asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:48 +0000 (14:04 -0400)]
sctp: add probe_interval in sysctl and sock/asoc/transport
PLPMTUD can be enabled by doing 'sysctl -w net.sctp.probe_interval=n'.
'n' is the interval for PLPMTUD probe timer in milliseconds, and it
can't be less than 5000 if it's not 0.
All asoc/transport's PLPMTUD in a new socket will be enabled by default.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 22 Jun 2021 18:04:47 +0000 (14:04 -0400)]
sctp: add pad chunk and its make function and event table
This chunk is defined in rfc4820#section-3, and used to pad an
SCTP packet. The receiver must discard this chunk and continue
processing the rest of the chunks in the packet.
Add it now, as it will be bundled with a heartbeat chunk to probe
pmtu in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Tue, 22 Jun 2021 17:18:31 +0000 (19:18 +0200)]
net: marvell: return csum computation result from mvneta_rx_csum/mvpp2_rx_csum
This is a preliminary patch to add hw csum hint support to
mvneta/mvpp2 xdp implementation
Tested-by: Matteo Croce <mcroce@linux.microsoft.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Jun 2021 17:52:40 +0000 (10:52 -0700)]
Merge branch 'tc-testing-dnat-tuple-collision'
Marcelo Ricardo Leitner says:
====================
tc-testing: add test for ct DNAT tuple collision
That was fixed in
13c62f5371e3 ("net/sched: act_ct: handle DNAT tuple
collision").
For that, it requires that tdc is able to send diverse packets with
scapy, which is then done on the 2nd patch of this series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Tue, 22 Jun 2021 15:05:02 +0000 (12:05 -0300)]
tc-testing: add test for ct DNAT tuple collision
When this test fails, /proc/net/nf_conntrack gets only 1 entry:
ipv4 2 tcp 6 119 SYN_SENT src=10.0.0.10 dst=10.0.0.10 sport=5000 dport=10 [UNREPLIED] src=20.0.0.1 dst=10.0.0.10 sport=10 dport=5000 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
When it works, it gets 2 entries:
ipv4 2 tcp 6 119 SYN_SENT src=10.0.0.10 dst=10.0.0.20 sport=5000 dport=10 [UNREPLIED] src=20.0.0.1 dst=10.0.0.10 sport=10 dport=58203 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4 2 tcp 6 119 SYN_SENT src=10.0.0.10 dst=10.0.0.10 sport=5000 dport=10 [UNREPLIED] src=20.0.0.1 dst=10.0.0.10 sport=10 dport=5000 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
The missing entry is because the 2nd packet hits a tuple collusion and the
conntrack entry doesn't get allocated.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Tue, 22 Jun 2021 15:05:01 +0000 (12:05 -0300)]
tc-testing: add support for sending various scapy packets
It can be worth sending different scapy packets on a given test, as in the
last patch of this series. For that, lets listify the scapy attribute and
simply iterate over it.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Tue, 22 Jun 2021 15:05:00 +0000 (12:05 -0300)]
tc-testing: fix list handling
python lists don't have an 'add' method, but 'append'.
Fixes:
14e5175e9e04 ("tc-testing: introduce scapyPlugin for basic traffic")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Loic Poulain [Tue, 22 Jun 2021 14:21:40 +0000 (16:21 +0200)]
MAINTAINERS: network: add entry for WWAN
This patch adds maintainer info for drivers/net/wwan subdir, including
WWAN core and drivers. Adding Sergey and myself as maintainers and
Johannes as reviewer.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Aaron Conole [Tue, 22 Jun 2021 14:02:33 +0000 (10:02 -0400)]
openvswitch: add trace points
This makes openvswitch module use the event tracing framework
to log the upcall interface and action execution pipeline. When
using openvswitch as the packet forwarding engine, some types of
debugging are made possible simply by using the ovs-vswitchd's
ofproto/trace command. However, such a command has some
limitations:
1. When trying to trace packets that go through the CT action,
the state of the packet can't be determined, and probably
would be potentially wrong.
2. Deducing problem packets can sometimes be difficult as well
even if many of the flows are known
3. It's possible to use the openvswitch module even without
the ovs-vswitchd (although, not common use).
Introduce the event tracing points here to make it possible for
working through these problems in kernel space. The style is
copied from the mac80211 driver-trace / trace code for
consistency - this creates some checkpatch splats, but the
official 'guide' for adding tracepoints, as well as the existing
examples all add the same splats so it seems acceptable.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Tue, 22 Jun 2021 11:51:43 +0000 (14:51 +0300)]
stmmac: dwmac-loongson: fix uninitialized variable in loongson_dwmac_probe()
The "mdio" variable is never set to false. Also it should be a bool
type instead of int.
Fixes:
30bba69d7db4 ("stmmac: pci: Add dwmac support for Loongson")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Jun 2021 17:40:55 +0000 (10:40 -0700)]
Merge branch 'ethtool-eeprom'
Ido Schimmel says:
====================
ethtool: Module EEPROM API improvements
This patchset contains various improvements to recently introduced
module EEPROM netlink API. Noticed these while adding module EEPROM
write support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:52 +0000 (09:50 +0300)]
ethtool: Validate module EEPROM offset as part of policy
Validate the offset to read from module EEPROM as part of the netlink
policy and remove the corresponding check from the code.
This also makes it possible to query the offset range from user space:
$ genl ctrl policy name ethtool
...
ID: 0x14 policy[32]:attr[2]: type=U32 range:[0,255]
...
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:51 +0000 (09:50 +0300)]
ethtool: Validate module EEPROM length as part of policy
Validate the number of bytes to read from the module EEPROM as part of
the netlink policy and remove the corresponding check from the code.
This also makes it possible to query the length range from user space:
$ genl ctrl policy name ethtool
...
ID: 0x14 policy[32]:attr[3]: type=U32 range:[1,128]
...
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:50 +0000 (09:50 +0300)]
ethtool: Use kernel data types for internal EEPROM struct
The struct is not visible to user space and therefore should not use the
user visible data types.
Instead, use internal data types like other structures in the file.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:49 +0000 (09:50 +0300)]
ethtool: Document behavior when module EEPROM bank attribute is omitted
The kernel assumes bank 0 when 'ETHTOOL_MSG_MODULE_EEPROM_GET' is sent
without 'ETHTOOL_A_MODULE_EEPROM_BANK'.
Document it as part of the interface documentation.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:48 +0000 (09:50 +0300)]
ethtool: Decrease size of module EEPROM get policy array
The 'ETHTOOL_A_MODULE_EEPROM_DATA' attribute is not part of the get
request.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:47 +0000 (09:50 +0300)]
ethtool: Document correct attribute type
'ETHTOOL_A_MODULE_EEPROM_DATA' is a binary attribute, not a nested one.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 22 Jun 2021 06:50:46 +0000 (09:50 +0300)]
ethtool: Use correct command name in title
The command is called 'ETHTOOL_MSG_MODULE_EEPROM_GET', not
'ETHTOOL_MSG_MODULE_EEPROM'.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gushengxian [Tue, 22 Jun 2021 06:05:19 +0000 (23:05 -0700)]
bridge: cfm: remove redundant return
Return statements are not needed in Void function.
Signed-off-by: gushengxian <gushengxian@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Mon, 21 Jun 2021 22:21:12 +0000 (15:21 -0700)]
hv_netvsc: Avoid field-overflowing memcpy()
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
Add flexible array to represent start of buf_info, improving readability
and avoid future warning where memcpy() thinks it is writing past the
end of the structure.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 21 Jun 2021 22:10:55 +0000 (15:10 -0700)]
net: dsa: b53: Create default VLAN entry explicitly
In case CONFIG_VLAN_8021Q is not set, there will be no call down to the
b53 driver to ensure that the default PVID VLAN entry will be configured
with the appropriate untagged attribute towards the CPU port. We were
implicitly relying on dsa_slave_vlan_rx_add_vid() to do that for us,
instead make it explicit.
Reported-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Mon, 21 Jun 2021 21:54:19 +0000 (14:54 -0700)]
octeontx2-af: Avoid field-overflowing memcpy()
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
To avoid having memcpy() think a u64 "prof" is being written beyond,
adjust the prof member type by adding struct nix_bandprof_s to the union
to match the other structs. This silences the following future warning:
In file included from ./include/linux/string.h:253,
from ./include/linux/bitmap.h:10,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/cpumask.h:5,
from ./arch/x86/include/asm/msr.h:11,
from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/timex.h:5,
from ./include/linux/timex.h:65,
from ./include/linux/time32.h:13,
from ./include/linux/time.h:60,
from ./include/linux/stat.h:19,
from ./include/linux/module.h:13,
from drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c:11:
In function '__fortify_memcpy_chk',
inlined from '__fortify_memcpy' at ./include/linux/fortify-string.h:310:2,
inlined from 'rvu_nix_blk_aq_enq_inst' at drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c:910:5:
./include/linux/fortify-string.h:268:4: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); please use struct_group() [-Wattribute-warning]
268 | __write_overflow_field();
| ^~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c:
...
else if (req->ctype == NIX_AQ_CTYPE_BANDPROF)
memcpy(&rsp->prof, ctx,
sizeof(struct nix_bandprof_s));
...
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Subbaraya Sundeep<sbhatta@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Jun 2021 17:01:17 +0000 (10:01 -0700)]
Merge branch 'wwan-link-creation-improvements'
Sergey Ryazanov says:
====================
net: WWAN link creation improvements
This series is intended to make the WWAN network links management easier
for WWAN device drivers.
The series begins with adding support for network links creation to the
WWAN HW simulator to facilitate code testing. Then there are a couple of
changes that prepe the WWAN core code for further modifications. The
following patches (4-6) simplify driver unregistering procedures by
performing the created links cleanup in the WWAN core. 7th patch is to
avoid the odd hold of a driver module. Next patches (8th and 9th) make
it easier for drivers to create a network interface for a default data
channel. Finally, 10th patch adds support for reporting of data link
(aka channel aka context) id to make user aware which network
interface is bound to which WWAN device data channel.
All core changes have been tested with the HW simulator. The MHI and
IOSM drivers were only compile tested as I have no access to this
hardware. So the coresponding patches require ACK from the driver
authors.
Changelog:
v1 -> v2:
* rebased on top of latest net-next
* patch that reworks the creation of mhi_net default netdev was
dropped; as Loic explained, this network device has different
purpose depending on a driver mode; Loic has a plan to rework the
mhi_net driver, so we will defer the default netdev creation
reworkings
* add a new patch that creates a default network interface for IOSM
modems
* 7th, 8th, 10th patches have a minor updates (see the patches for
details)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:51:00 +0000 (01:51 +0300)]
wwan: core: add WWAN common private data for netdev
The WWAN core not only multiplex the netdev configuration data, but
process it too, and needs some space to store its private data
associated with the netdev. Add a structure to keep common WWAN core
data. The structure will be stored inside the netdev private data before
WWAN driver private data and have a field to make it easier to access
the driver data. Also add a helper function that simplifies drivers
access to their data.
At the moment we use the common WWAN private data to store the WWAN data
link (channel) id at the time the link is created, and report it back to
user using the .fill_info() RTNL callback. This should help the user to
be aware which network interface is bound to which WWAN device data
channel.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
CC: M Chetan Kumar <m.chetan.kumar@intel.com>
CC: Intel Corporation <linuxwwan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:59 +0000 (01:50 +0300)]
net: iosm: create default link via WWAN core
Utilize the just introduced WWAN core feature to create a default netdev
for the default data (IP MUX) channel.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
CC: M Chetan Kumar <m.chetan.kumar@intel.com>
CC: Intel Corporation <linuxwwan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:58 +0000 (01:50 +0300)]
wwan: core: support default netdev creation
Most, if not each WWAN device driver will create a netdev for the
default data channel. Therefore, add an option for the WWAN netdev ops
registration function to create a default netdev for the WWAN device.
A WWAN device driver should pass a default data channel link id to the
ops registering function to request the creation of a default netdev, or
a special value WWAN_NO_DEFAULT_LINK to inform the WWAN core that the
default netdev should not be created.
For now, only wwan_hwsim utilize the default link creation option. Other
drivers will be reworked next.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
CC: M Chetan Kumar <m.chetan.kumar@intel.com>
CC: Intel Corporation <linuxwwan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:57 +0000 (01:50 +0300)]
wwan: core: no more hold netdev ops owning module
The WWAN netdev ops owner holding was used to protect from the
unexpected memory disappear. This approach causes a dependency cycle
(driver -> core -> driver) and effectively prevents a WWAN driver
unloading. E.g. WWAN hwsim could not be unloaded until all simulated
devices are removed:
~# modprobe wwan_hwsim devices=2
~# lsmod | grep wwan
wwan_hwsim 16384 2
wwan 20480 1 wwan_hwsim
~# rmmod wwan_hwsim
rmmod: ERROR: Module wwan_hwsim is in use
~# echo > /sys/kernel/debug/wwan_hwsim/hwsim0/destroy
~# echo > /sys/kernel/debug/wwan_hwsim/hwsim1/destroy
~# lsmod | grep wwan
wwan_hwsim 16384 0
wwan 20480 1 wwan_hwsim
~# rmmod wwan_hwsim
For a real device driver this will cause an inability to unload module
until a served device is physically detached.
Since the last commit we are removing all child netdev(s) when a driver
unregister the netdev ops. This allows us to permit the driver
unloading, since any sane driver will call ops unregistering on a device
deinitialization. So, remove the holding of an ops owner to make it
easier to unload a driver module. The owner field has also beed removed
from the ops structure as there are no more users of this field.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:56 +0000 (01:50 +0300)]
net: iosm: drop custom netdev(s) removing
Since the last commit, the WWAN core will remove all our network
interfaces for us at the time of the WWAN netdev ops unregistering.
Therefore, we can safely drop the custom code that cleans the list of
created netdevs. Anyway it no longer removes any netdev, since all
netdevs were removed earlier in the wwan_unregister_ops() call.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: M Chetan Kumar <m.chetan.kumar@intel.com>
CC: M Chetan Kumar <m.chetan.kumar@intel.com>
CC: Intel Corporation <linuxwwan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:55 +0000 (01:50 +0300)]
wwan: core: remove all netdevs on ops unregistering
We use the ops owner module hold to protect against ops memory
disappearing. But this approach does not protect us from a driver that
unregisters ops but forgets to remove netdev(s) that were created using
this ops. In such case, we are left with netdev(s), which can not be
removed since ops is gone. Moreover, batch netdevs removing on
deinitialization is a desireable option for WWAN drivers as it is a
quite common task.
Implement deletion of all created links on WWAN netdev ops unregistering
in the same way that RTNL removes all links on RTNL ops unregistering.
Simply remove all child netdevs of a device whose WWAN netdev ops is
unregistering. This way we protecting the kernel from buggy drivers and
make it easier to write a driver deinitialization code.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:54 +0000 (01:50 +0300)]
wwan: core: multiple netdevs deletion support
Use unregister_netdevice_queue() instead of simple
unregister_netdevice() if the WWAN netdev ops does not provide a dellink
callback. This will help to accelerate deletion of multiple netdevs.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:53 +0000 (01:50 +0300)]
wwan: core: require WWAN netdev setup callback existence
The setup callback will be unconditionally passed to the
alloc_netdev_mqs(), where the NULL pointer dereference will cause the
kernel panic. So refuse to register WWAN netdev ops with warning
generation if the setup callback is not provided.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:52 +0000 (01:50 +0300)]
wwan: core: relocate ops registering code
It is unlikely that RTNL callbacks will call WWAN ops (un-)register
functions, but it is highly likely that the ops (un-)register functions
will use RTNL link create/destroy handlers. So move the WWAN network
interface ops (un-)register functions below the RTNL callbacks to be
able to call them without forward declarations.
No functional changes, just code relocation.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Ryazanov [Mon, 21 Jun 2021 22:50:51 +0000 (01:50 +0300)]
wwan_hwsim: support network interface creation
Add support for networking interface creation via the WWAN core by
registering the WWAN netdev creation ops for each simulated WWAN device.
Implemented minimalistic netdev support where the xmit callback just
consumes all egress skbs.
This should help with WWAN network interfaces creation testing.
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Jun 2021 16:57:45 +0000 (09:57 -0700)]
Merge branch 'mptcp-optimizations'
Mat Martineau says:
====================
mptcp: A few optimizations
Here is a set of patches that we've accumulated and tested in the MPTCP
tree.
Patch 1 removes the MPTCP-level tx skb cache that added complexity but
did not provide a meaningful benefit.
Patch 2 uses the fast socket lock in more places.
Patch 3 improves handling of a data-ready flag.
Patch 4 deletes an unnecessary and racy connection state check.
Patch 5 adds a MIB counter for one type of invalid MPTCP header.
Patch 6 improves self test failure output.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Matthieu Baerts [Mon, 21 Jun 2021 22:54:38 +0000 (15:54 -0700)]
selftests: mptcp: display proper reason to abort tests
Without this modification, we were often displaying this error messages:
FAIL: Could not even run loopback test
But $ret could have been set to a non 0 value in many different cases:
- net.mptcp.enabled=0 is not working as expected
- setsockopt(..., TCP_ULP, "mptcp", ...) is allowed
- ping between each netns are failing
- tests between ns1 as a receiver and ns>1 are failing
- other tests not involving ns1 as a receiver are failing
So not only for the loopback test.
Now a clearer message, including the time it took to run all tests, is
displayed.
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 21 Jun 2021 22:54:37 +0000 (15:54 -0700)]
mptcp: add MIB counter for invalid mapping
Account this exceptional events for better introspection.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 21 Jun 2021 22:54:36 +0000 (15:54 -0700)]
mptcp: drop redundant test in move_skbs_to_msk()
Currently we check the msk state to avoid enqueuing new
skbs at msk shutdown time.
Such test is racy - as we can't acquire the msk socket lock -
and useless, as the caller already checked the subflow
field 'disposable', covering the same scenario in a race
free manner - read and updated under the ssk socket lock.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 21 Jun 2021 22:54:35 +0000 (15:54 -0700)]
mptcp: don't clear MPTCP_DATA_READY in sk_wait_event()
If we don't flush entirely the receive queue, we need set
again such bit later. We can simply avoid clearing it.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 21 Jun 2021 22:54:34 +0000 (15:54 -0700)]
mptcp: use fast lock for subflows when possible
There are a bunch of callsite where the ssk socket
lock is acquired using the full-blown version eligible for
the fast variant. Let's move to the latter.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 21 Jun 2021 22:54:33 +0000 (15:54 -0700)]
mptcp: drop tx skb cache
The mentioned cache was introduced to reduce the number of skb
allocation in atomic context, but the required complexity is
excessive.
This change remove the mentioned cache.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 22 Jun 2021 16:54:55 +0000 (09:54 -0700)]
Merge branch 'marvell-mdio-ACPI'
Marcin Wojtas says:
====================
ACPI MDIO support for Marvell controllers
The third version of the patchset main change is
dropping a clock handling optimisation patch
for mvmdio driver. Other than that it sets
explicit dependency on FWNODE_MDIO for CONFIG_FSL_XGMAC_MDIO
and applies minor cosmetic improvements (please see the
'Changelog' below).
The firmware ACPI description is exposed in the public github branch:
https://github.com/semihalf-wojtas-marcin/edk2-platforms/commits/acpi-mdio-r20210613
There is also MacchiatoBin firmware binary available for testing:
https://drive.google.com/file/d/1eigP_aeM4wYQpEaLAlQzs3IN_w1-kQr0
I'm looking forward to the comments or remarks.
Best regards,
Marcin
Changelog:
v2->v3
* Rebase on top of net-next/master.
* Drop "net: mvmdio: simplify clock handling" patch.
* 1/6 - fix code block comments.
* 2/6 - unchanged
* 3/6 - add "depends on FWNODE_MDIO" for CONFIG_FSL_XGMAC_MDIO
* 4/6 - drop mention about the clocks from the commit message.
* 5/6 - unchanged
* 6/6 - add Andrew's RB.
v1->v2
* 1/7 - new patch
* 2/7 - new patch
* 3/7 - new patch
* 4/7 - new patch
* 5/7 - remove unnecessary `if (has_acpi_companion())` and rebase onto
the new clock handling
* 6/7 - remove deprecated comment
* 7/7 - no changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcin Wojtas [Mon, 21 Jun 2021 17:30:28 +0000 (19:30 +0200)]
net: mvpp2: remove unused 'has_phy' field
The 'has_phy' field from struct mvpp2_port is no longer used.
Remove it.
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcin Wojtas [Mon, 21 Jun 2021 17:30:27 +0000 (19:30 +0200)]
net: mvpp2: enable using phylink with ACPI
Now that the MDIO and phylink are supported in the ACPI
world, enable to use them in the mvpp2 driver. Ensure a backward
compatibility with the firmware whose ACPI description does
not contain the necessary elements for the proper phy handling
and fall back to relying on the link interrupts instead.
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcin Wojtas [Mon, 21 Jun 2021 17:30:26 +0000 (19:30 +0200)]
net: mvmdio: add ACPI support
This patch introducing ACPI support for the mvmdio driver by adding
acpi_match_table with two entries:
* "MRVL0100" for the SMI operation
* "MRVL0101" for the XSMI mode
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcin Wojtas [Mon, 21 Jun 2021 17:30:25 +0000 (19:30 +0200)]
net/fsl: switch to fwnode_mdiobus_register
Utilize the newly added helper routine
for registering the MDIO bus via fwnode_
interface.
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcin Wojtas [Mon, 21 Jun 2021 17:30:24 +0000 (19:30 +0200)]
net: mdiobus: Introduce fwnode_mdbiobus_register()
This patch introduces a new helper function that
wraps acpi_/of_ mdiobus_register() and allows its
usage via common fwnode_ interface.
Fall back to raw mdiobus_register() in case CONFIG_FWNODE_MDIO
is not enabled, in order to satisfy compatibility
in all future user drivers.
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>