git.osdn.net Git - uclinux-h8/linux.git/log

igc: Add Host Good Packets Transmitted Count

This counter counts the number of good (non-erred) packets
transmitted sent by the host.
A good transmit packet is considered one that is 64 or more bytes
in length (from <Destination Address> through <CRC>,
inclusively) in length

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

igc: Remove MULR mask define

Multiple Tx Data Read Requests is hardware pipeline feature and
is not controlled by software

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

igc: Remove igc_set_fw_version comment

i225 device not supported and do not plan to support
configuration of fw version string for ethtool

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

igc: Clean up nvm_operations structure

valid_led_default function pointer not in use and can
be removed from nvm_operations structure.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

net: fec: Silence M5272 build warnings

If CONFIG_M5272=y:

    drivers/net/ethernet/freescale/fec_main.c: In function ‘fec_restart’:
    drivers/net/ethernet/freescale/fec_main.c:948:6: warning: unused variable ‘val’ [-Wunused-variable]
      948 |  u32 val;
  |      ^~~
    drivers/net/ethernet/freescale/fec_main.c: In function ‘fec_get_mac’:
    drivers/net/ethernet/freescale/fec_main.c:1667:28: warning: unused variable ‘pdata’ [-Wunused-variable]
     1667 |  struct fec_platform_data *pdata = dev_get_platdata(&fep->pdev->dev);
  |                            ^~~~~

Fix this by moving the variable declarations inside the existing #ifdef
blocks.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210202130650.865023-1-geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: do not export inet_gro_{receive|complete}

inet_gro_receive() and inet_gro_complete() are part
of GRO engine which can not be modular.

Similarly, inet_gso_segment() does not need to be exported,
being part of GSO stack.

In other words, net/ipv6/ip6_offload.o is part of vmlinux,
regardless of CONFIG_IPV6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20210202154145.1568451-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mac80211-next-for-net-next-2021-02-02' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time, only RTNL locking reduction fallout.
- cfg80211_dev_rename() requires RTNL
- cfg80211_change_iface() and cfg80211_set_encryption()
   require wiphy mutex (was missing in wireless extensions)
- cfg80211_destroy_ifaces() requires wiphy mutex
- netdev registration can fail due to notifiers, and then
   notifiers are "unrolled", need to handle this properly

* tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
  cfg80211: fix netdev registration deadlock
  cfg80211: call cfg80211_destroy_ifaces() with wiphy lock held
  wext: call cfg80211_set_encryption() with wiphy lock held
  wext: call cfg80211_change_iface() with wiphy lock held
  nl80211: call cfg80211_dev_rename() under RTNL
====================

Link: https://lore.kernel.org/r/20210202144106.38207-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mlx5-updates-2021-02-01' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-02-01

mlx5 netdev updates:

1) Trivial refactoring ahead of the upcoming uplink representor series.
2) Increased RSS table size to 256, for better results
3) Misc. Cleanup and very trivial improvements

* tag 'mlx5-updates-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Avoid unnecessary csum recalculation on supporting devices
  net/mlx5e: CT: remove useless conversion to PTR_ERR then ERR_PTR
  net/mlx5e: accel, remove redundant space
  net/mlx5e: kTLS, Improve TLS RX workqueue scope
  net/mlx5e: remove h from printk format specifier
  net/mlx5e: Increase indirection RQ table size to 256
  net/mlx5e: Enable napi in channel's activation stage
  net/mlx5e: Move representor neigh init into profile enable
  net/mlx5e: Avoid false lock depenency warning on tc_ht
  net/mlx5e: Move set vxlan nic info to profile init
  net/mlx5e: Move netif_carrier_off() out of mlx5e_priv_init()
  net/mlx5e: Refactor mlx5e_netdev_init/cleanup to mlx5e_priv_init/cleanup
  net/mxl5e: Add change profile method
  net/mlx5e: Separate between netdev objects and mlx5e profiles initialization
====================

Link: https://lore.kernel.org/r/20210202065457.613312-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-add_addr-enhancements'

Mat Martineau says:

====================
mptcp: ADD_ADDR enhancements

This patch series from the MPTCP tree contains enhancements and
associated tests for the ADD_ADDR ("add address") MPTCP option. This
option allows already-connected MPTCP peers to share additional IP
addresses with each other, which can then be used to create additional
subflows within those MPTCP connections.

Patches 1 & 2 remove duplicated data in the per-connection path manager
structure.

Patches 3-6 initiate additional subflows when an address is added using
the netlink path manager interface and improve ADD_ADDR signaling
reliability, subject to configured limits. Self tests are also updated.

Patches 7-15 add new support for optional port numbers in ADD_ADDR. This
includes creating an additional in-kernel TCP listening socket for the
requested port number, validating the port number when processing
incoming subflow connections, including the port number in netlink
interfaces, and adding some new MIBs. New self test cases are added for
subflows connecting with alternate port numbers.
====================

Link: https://lore.kernel.org/r/20210201230920.66027-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add testcases for ADD_ADDR with port

This patch adds testcases for ADD_ADDR with port and the related MIB
counters check in chk_add_nr. The output looks like this:

24 signal address with port           syn[ ok ] - synack[ ok ] - ack[ ok ]
                                       add[ ok ] - echo  [ ok ] - pt [ ok ]
                                       syn[ ok ] - synack[ ok ] - ack[ ok ]
                                       syn[ ok ] - ack   [ ok ]
25 subflow and signal with port       syn[ ok ] - synack[ ok ] - ack[ ok ]
                                       add[ ok ] - echo  [ ok ] - pt [ ok ]
                                       syn[ ok ] - synack[ ok ] - ack[ ok ]
                                       syn[ ok ] - ack   [ ok ]
26 remove single address with port    syn[ ok ] - synack[ ok ] - ack[ ok ]
                                       add[ ok ] - echo  [ ok ] - pt [ ok ]
                                       syn[ ok ] - synack[ ok ] - ack[ ok ]
                                       syn[ ok ] - ack   [ ok ]
                                       rm [ ok ] - sf    [ ok ]

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: add the mibs for ADD_ADDR with port

This patch adds the mibs for ADD_ADDR with port:

MPTCP_MIB_PORTADD for received ADD_ADDR suboption with a port number.

MPTCP_MIB_PORTSYNRX, MPTCP_MIB_PORTSYNACKRX, MPTCP_MIB_PORTACKRX, for
received MP_JOIN's SYN or SYN/ACK or ACK with a port number which is
different from the msk's port number.

MPTCP_MIB_MISMATCHPORTSYNRX and MPTCP_MIB_MISMATCHPORTACKRX, for
received SYN or ACK MP_JOIN with a mismatched port-number.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add port argument for pm_nl_ctl

This patch adds a new argument for pm_nl_ctl tool. We can use it like
this:

# pm_nl_ctl add 10.0.2.1 flags signal port 10100
# pm_nl_ctl dump
id 1 flags signal 10.0.2.1 10100

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: deal with MPTCP_PM_ADDR_ATTR_PORT in PM netlink

This patch adds MPTCP_PM_ADDR_ATTR_PORT filling and parsing in PM
netlink.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: enable use_port when invoke addresses_equal

When dealing with the addresses list local_addr_list or anno_list, we
should enable the function addresses_equal's parameter use_port. And
enable it in address_zero too.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: add port number check for MP_JOIN

This patch adds two new helpers, subflow_use_different_sport and
subflow_use_different_dport, to check whether the subflow's source or
destination port number is different from the msk's port number. When
receiving the MP_JOIN's SYN/SYNACK/ACK, we do these port number checks
and print out the different port numbers.

And furthermore, when receiving the MP_JOIN's SYN/ACK, we also use a new
helper mptcp_pm_sport_in_anno_list to check whether this port number is
announced. If it isn't, we need to abort this connection.

This patch also populates the local address's port field in
local_address.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: add a new helper subflow_req_create_thmac

This patch adds a new helper named subflow_req_create_thmac, which is
extracted from subflow_token_join_request. It initializes subflow_req's
local_nonce and thmac fields, those are the more expensive to populate.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: drop unused skb in subflow_token_join_request

This patch drops the unused parameter skb in subflow_token_join_request.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: create the listening socket for new port

This patch creates a listening socket when an address with a port-number
is added by PM netlink. Then binds the new port to the socket, and
listens for new connections.

When the address is removed or the addresses are flushed by PM netlink,
release the listening socket.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add testcases for newly added addresses

This patch adds testcases to create subflows or signal addresses for the
newly added IPv4 or IPv6 addresses.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: use minus values for removing address numbers

This patch changes the removing addresses numbers to minus values, left
the plus values for the adding addresses numbers.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: send ack for every add_addr

This patch changes the sending ACK conditions for the ADD_ADDR, send an
ACK packet for any ADD_ADDR, not just when ipv6 addresses or port
numbers are included.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/139
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: create subflow or signal addr for newly added address

Currently, when a new MPTCP endpoint is added, the existing MPTCP
sockets are not affected.

This patch implements a new function mptcp_nl_add_subflow_or_signal_addr,
invoked when an address is added from PM netlink. This function traverses
the MPTCP sockets list and invokes mptcp_pm_create_subflow_or_signal_addr
to try to create a subflow or signal an address for the newly added
address, if local constraint allows that.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/19
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: drop *_max fields in mptcp_pm_data

This patch drops the per-msk values add_addr_signal_max,
add_addr_accept_max, local_addr_max and subflows_max fields in struct
mptcp_pm_data, uses the pernet *_max values instead. And adds four new
helpers to get the pernet *_max values separately.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: use WRITE_ONCE for the pernet *_max

This patch uses WRITE_ONCE() for all the pernet add_addr_signal_max,
add_addr_accept_max, local_addr_max and subflows_max fields in struct
pm_nl_pernet to avoid concurrency issues.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: Add support for another RTL8168FP

According to the vendor driver, the new chip with XID 0x54b is
essentially the same as the one with XID 0x54a, but it doesn't need the
firmware.

So add support accordingly.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/20210202044813.1304266-1-kai.heng.feng@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-notifications-when-route-hardware-flags-change'

Ido Schimmel says:

====================
Add notifications when route hardware flags change

Routes installed to the kernel can be programmed to capable devices, in
which case they are marked with one of two flags. RTM_F_OFFLOAD for
routes that offload traffic from the kernel and RTM_F_TRAP for routes
that trap packets to the kernel for processing (e.g., host routes).

These flags are of interest to routing daemons since they would like to
delay advertisement of routes until they are installed in hardware. This
allows them to avoid packet loss or misrouted packets. Currently,
routing daemons do not receive any notifications when these flags are
changed, requiring them to poll the kernel tables for changes which is
inefficient.

This series addresses the issue by having the kernel emit RTM_NEWROUTE
notifications whenever these flags change. The behavior is controlled by
two sysctls (net.ipv4.fib_notify_on_flag_change and
net.ipv6.fib_notify_on_flag_change) that default to 0 (no
notifications).

Note that even if route installation in hardware is improved to be more
synchronous, these notifications are still of interest. For example, a
multipath route can change from RTM_F_OFFLOAD to RTM_F_TRAP if its
neighbours become invalid. A routing daemon can choose to withdraw /
replace the route in that case. In addition, the deletion of a route
from the kernel can prompt the installation of an identical route
(already in kernel, with an higher metric) to hardware.

For testing purposes, netdevsim is aligned to simulate a "real" driver
that programs routes to hardware.

Series overview:

Patches #1-#2 align netdevsim to perform route programming in a
non-atomic context

Patches #3-#5 add sysctl to control IPv4 notifications

Patches #6-#8 add sysctl to control IPv6 notifications

Patch #9 extends existing fib tests to set sysctls before running tests

Patch #10 adds test for fib notifications over netdevsim
====================

Link: https://lore.kernel.org/r/20210201194757.3463461-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: netdevsim: Add fib_notifications test

Add test to check fib notifications behavior.

The test checks route addition, route deletion and route replacement for
both IPv4 and IPv6.

When fib_notify_on_flag_change=0, expect single notification for route
addition/deletion/replacement.

When fib_notify_on_flag_change=1, expect:
- two notification for route addition/replacement, first without RTM_F_TRAP
  and second with RTM_F_TRAP.
- single notification for route deletion.

$ ./fib_notifications.sh
TEST: IPv4 route addition                                           [ OK ]
TEST: IPv4 route deletion                                           [ OK ]
TEST: IPv4 route replacement                                        [ OK ]
TEST: IPv6 route addition                                           [ OK ]
TEST: IPv6 route deletion                                           [ OK ]
TEST: IPv6 route replacement                                        [ OK ]

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: Extend fib tests to run with and without flags notifications

Run the test cases with both `fib_notify_on_flag_change` sysctls set to
'1', and then with both sysctls set to '0' to verify there are no
regressions in the test when notifications are added.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv6: Emit notification when fib hardware flags are changed

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.

The asynchronous nature of route installation in hardware can lead
to a routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.

Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.

Introduce a sysctl that controls this behavior.

Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.

Move fib6_info_hw_flags_set() to C file because it is no longer a short
function.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Do not call fib6_info_hw_flags_set() when IPv6 is disabled

With the next patch mlxsw and netdevsim will fail in compilation if
CONFIG_IPV6 is disabled.

Do not call fib6_info_hw_flags_set() when IPv6 is disabled.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Pass 'net' struct as first argument to fib6_info_hw_flags_set()

The next patch will emit notification when hardware flags are changed,
in case that fib_notify_on_flag_change sysctl is set to 1.

To know sysctl values, net struct is needed.
This change is consistent with the IPv4 version, which gets 'net' struct
as its first argument.

Currently, the only callers of this function are mlxsw and netdevsim.
Patch the callers to pass net.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: Emit notification when fib hardware flags are changed

After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.

The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.

It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.

Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.

Introduce a sysctl that controls this behavior.

Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notification per-route might confuse existing
routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: Publish fib_nlmsg_size()

Publish fib_nlmsg_size() to allow it to be used later on from
fib_alias_hw_flags_set().

Remove the inline keyword since it shouldn't be used inside C files.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: Pass fib_rt_info as const to fib_dump_info()

fib_dump_info() does not change 'fri', so pass it as 'const'.
It will later allow us to invoke fib_dump_info() from
fib_alias_hw_flags_set().

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: fib: Perform the route programming in a non-atomic context

Currently, netdevsim implements dummy FIB offload and marks notified
routes with RTM_F_TRAP flag. netdevsim does not defer route notifications
to a work queue because it does not need to program any hardware.

Given that netdevsim's purpose is to both give an example implementation
and allow developers to test their code, align netdevsim to a "real"
hardware device driver like mlxsw and have it also perform the route
"programming" in a non-atomic context.

It will be used to test route flags notifications which will be added in
the next patches.

The following changes are needed when route handling is performed in WQ:
- Handle the accounting in the main context, to be able to return an
  error for adding route when all the routes are used.
  For FIB_EVENT_ENTRY_REPLACE increase the counter before scheduling
  the delayed work, and in case that this event replaces an existing route,
  decrease the counter as part of the delayed work.

- For IPv6, cannot use fen6_info->rt->fib6_siblings list because it
  might be changed during handling the delayed work.
  Save an array with the nexthops as part of fib6_event struct, and take
  a reference for each nexthop to prevent them from being freed while
  event is queued.

- Change GFP_ATOMIC allocations to GFP_KERNEL.

- Use single work item that is handling a list of ordered routes.
  Handling routes must be processed in the order they were submitted to
  avoid logical errors that could lead to unexpected failures.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: fib: Convert the current occupancy to an atomic variable

When route is added/deleted, the appropriate counter is increased/decreased
to maintain number of routes.

User can limit the number of routes and then according to the appropriate
counter, adding more routes than the limitation is forbidden.

Currently, there is one lock which protects hashtable, list and accounting.

Handling the counters will be performed from both atomic context and
non-atomic context, while the hashtable and the list will be used only from
non-atomic context and therefore will be protected by a separate lock.

Protect accounting by using an atomic variable, so lock is not needed.

v2:
* Use atomic64_sub() in nsim_nexthop_account()'s error path

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-don-t-disable-napi-in-suspend'

Alex Elder says:

====================
net: ipa: don't disable NAPI in suspend

This is version 2 of a series that reworks the order in which things
happen during channel stop and suspend (and start and resume), in
order to address a hang that has been observed during suspend.
The introductory message on the first version of the series gave
some history which is omitted here.

The end result of this series is that we only enable NAPI and the
I/O completion interrupt on a channel when we start the channel for
the first time.  And we only disable them when stopping the channel
"for good."  In other words, NAPI and the completion interrupt
remain enabled while a channel is stopped for suspend.

One comment on version 1 of the series suggested *not* returning
early on success in a function, instead having both success and
error paths return from the same point at the end of the function
block.  This has been addressed in this version.

In addition, this version consolidates things a little bit, but the
net result of the series is exactly the same as version 1 (with the
exception of the return fix mentioned above).

First, patch 6 in the first version was a small step to make patch 7
easier to understand.  The two have been combined now.

Second, previous version moved (and for suspend/resume, eliminated)
I/O completion interrupt and NAPI disable/enable control in separate
steps (patches).  Now both are moved around together in patch 5 and
6, which eliminates the need for the final (NAPI-only) patch.

I won't repeat the patch summaries provided in v1:
  https://lore.kernel.org/netdev/20210129202019.2099259-1-elder@linaro.org/

Many thanks to Willem de Bruijn for his thoughtful input.
====================

Link: https://lore.kernel.org/r/20210201172850.2221624-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: expand last transaction check

Transactions to send data for a network device can be allocated at
any time up until the point the TX queue is stopped. It is possible
for ipa_start_xmit() to be called in one context just before a
the transmit queue is stopped in another.

Update gsi_channel_trans_last() so that for TX channels the
allocated and pending transaction lists are checked--in addition
to the completed and polled lists--to determine the "last"
transaction. This means any transaction that has been allocated
before the TX queue is stopped will be allowed to complete before
we conclude the channel is quiesced.

Rework the function a bit to use a list pointer and gotos.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: don't disable interrupt on suspend

No completion interrupts will occur while an endpoint is suspended,
nor when a channel has been stopped for suspend. So there's no need
to disable the interrupt during suspend and re-enable it when
resuming. Without any interrupts occurring, there is no need to
disable/re-enable NAPI for channel suspend/resume either.

We'll only enable NAPI and the interrupt when we first start the
channel, and disable it again only when it's "really" stopped.

To accomplish this, move the enable/disable calls out of
__gsi_channel_start() and __gsi_channel_stop(), and into
gsi_channel_start() and gsi_channel_stop() instead.

Add a call to napi_synchronize() to gsi_channel_suspend(), to ensure
NAPI polling is done before moving on.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: disable interrupt and NAPI after channel stop

Disable both the I/O completion interrupt and NAPI polling on a
channel *after* we successfully stop it rather than before. This
ensures a completion occurring just before the channel is stopped
gets processed.

Enable NAPI polling and the interrupt *before* starting a channel
rather than after, to be symmetric. A stopped channel won't
generate any completion interrupts anyway.

Enable NAPI before the interrupt and disable it afterward.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw()

Open-code gsi_channel_freeze() and gsi_channel_thaw() in all callers
and get rid of these two functions. This is part of reworking the
sequence of things done during channel suspend/resume and start/stop.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: introduce __gsi_channel_start()

Create a new function that does most of the work of starting a
channel.  What's different is that it takes a flag indicating
whether the channel should really be started or not.  Create
another new function __gsi_channel_stop() that behaves similarly.

IPA v3.5.1 implements suspend using a special SUSPEND endpoint
setting.  If the endpoint is suspended when an I/O completes on the
underlying GSI channel, a SUSPEND interrupt is generated.

Newer versions of IPA do not implement the SUSPEND endpoint mode.
Instead, endpoint suspend is implemented by simply stopping the
underlying GSI channel.  In this case, a completing I/O on a
*stopped* channel causes the SUSPEND interrupt condition.

These new functions put all activity related to starting or stopping
a channel (including "thawing/freezing" the channel) in one place,
whether or not the channel is actually started or stopped.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: introduce gsi_channel_stop_retry()

Create a new helper function that encapsulates issuing a set of
channel stop commands, retrying if appropriate, with a short delay
between attempts.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: don't thaw channel if error starting

If an error occurs starting a channel, don't "thaw" it.
We should assume the channel remains in a non-started state.

Update the comment in gsi_channel_stop(); calls to this function
are no longer retried.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fix up truesize of cloned skb in skb_prepare_for_shift()

Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
cloning an skb, save and restore truesize after pskb_expand_head(). This
can occur if the allocator decides to service an allocation of the same
size differently (e.g. use a different size class, or pass the
allocation on to KFENCE).

Because truesize is used for bookkeeping (such as sk_wmem_queued), a
modified truesize of a cloned skb may result in corrupt bookkeeping and
relevant warnings (such as in sk_stream_kill_queues()).

Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com
Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix length of MP_PRIO suboption

With version 0 of the protocol it was legal to encode the 'Subflow Id' in
the MP_PRIO suboption, to specify which subflow would change its 'Backup'
flag. This has been removed from v1 specification: thus, according to RFC
8684 §3.3.8, the resulting 'Length' for MP_PRIO changed from 4 to 3 byte.

Current Linux generates / parses MP_PRIO according to the old spec, using
'Length' equal to 4, and hardcoding 1 as 'Subflow Id'; RFC compliance can
improve if we change 'Length' in other to become 3, leaving a 'Nop' after
the MP_PRIO suboption. In this way the kernel will emit and accept *only*
MP_PRIO suboptions that are compliant to version 1 of the MPTCP protocol.

unpatched 5.11-rc kernel:
[root@bottarga ~]# tcpdump -tnnr unpatched.pcap | grep prio
reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
IP 10.0.3.2.48433 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 4032325513 ecr 1876514270,mptcp prio non-backup id 1,mptcp dss ack 14084896651682217737], length 0

patched 5.11-rc kernel:
[root@bottarga ~]# tcpdump -tnnr patched.pcap | grep prio
reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
IP 10.0.3.2.49735 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 1276737699 ecr 2686399734,mptcp prio non-backup,nop,mptcp dss ack 18433038869082491686], length 0

Changes since v2:
- when accounting for option space, don't increment 'TCPOLEN_MPTCP_PRIO'
and use 'TCPOLEN_MPTCP_PRIO_ALIGN' instead, thanks to Matthieu Baerts.
Changes since v1:
- refactor patch to avoid using 'TCPOLEN_MPTCP_PRIO' with its old value,
thanks to Geliang Tang.

Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matteo Croce <mcroce@linux.microsoft.com>
Link: https://lore.kernel.org/r/846cdd41e6ad6ec88ef23fee1552ab39c2f5a3d1.1612184361.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'drivers-net-update-tasklet_init-callers'

Emil Renner Berthing says:

====================
drivers: net: update tasklet_init callers

This updates the remaining callers of tasklet_init() in drivers/net
to the new API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

All changes are done by coccinelle using the following semantic patch.
Coccinelle needs a little help parsing drivers/net/arcnet/arcnet.c

@ match @
type T;
T *container;
identifier tasklet;
identifier callback;
@@
tasklet_init(&container->tasklet, callback, (unsigned long)container);

@ patch1 depends on match @
type match.T;
identifier match.tasklet;
identifier match.callback;
identifier data;
identifier container;
@@
-void callback(unsigned long data)
+void callback(struct tasklet_struct *t)
{
...
- T *container = (T *)data;
+ T *container = from_tasklet(container, t, tasklet);
...
}

@ patch2 depends on match @
type match.T;
identifier match.tasklet;
identifier match.callback;
identifier data;
identifier container;
@@
-void callback(unsigned long data)
+void callback(struct tasklet_struct *t)
{
...
- T *container;
+ T *container = from_tasklet(container, t, tasklet);
...
- container = (T *)data;
...
}

@ depends on (patch1 || patch2) @
match.T *container;
identifier match.tasklet;
identifier match.callback;
@@
- tasklet_init(&container->tasklet, callback, (unsigned long)container);
+ tasklet_setup(&container->tasklet, callback);
====================

Link: https://lore.kernel.org/r/20210130234730.26565-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: rtl8150: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: r8152: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: pegasus: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: lan78xx: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: hso: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ppp: use new tasklet API

This converts the async and synctty drivers to use the new tasklet API n
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ifb: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

caif_virtio: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

arcnet: use new tasklet API

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'clang-format-for-linux-v5.11-rc7' of git://github.com/ojeda/linux

Pull clang-format update from Miguel Ojeda:
"Update with the latest for_each macro list"

* tag 'clang-format-for-linux-v5.11-rc7' of git://github.com/ojeda/linux:
clang-format: Update with the latest for_each macro list

Merge tag 'dma-mapping-5.11-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:
"Fix a kernel crash in the new dma-mapping benchmark test (Barry Song)"

* tag 'dma-mapping-5.11-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: benchmark: fix kernel crash when dma_map_single fails

Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull vdpa fix from Michael Tsirkin:
"A single mlx bugfix"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix memory key MTT population

Merge tag 'net-5.11-rc7' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
  trees.

  Current release - regressions:

   - ip_tunnel: fix mtu calculation

   - mlx5: fix function calculation for page trees

  Previous releases - regressions:

   - vsock: fix the race conditions in multi-transport support

   - neighbour: prevent a dead entry from updating gc_list

   - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add

  Previous releases - always broken:

   - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
     cgroup getsockopt infra when user space is trying to race against
     optlen, from Loris Reiff.

   - bpf: add missing fput() in BPF inode storage map update helper

   - udp: ipv4: manipulate network header of NATed UDP GRO fraglist

   - mac80211: fix station rate table updates on assoc

   - r8169: work around RTL8125 UDP HW bug

   - igc: report speed and duplex as unknown when device is runtime
     suspended

   - rxrpc: fix deadlock around release of dst cached on udp tunnel"

* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
  net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
  net: ipa: fix two format specifier errors
  net: ipa: use the right accessor in ipa_endpoint_status_skip()
  net: ipa: be explicit about endianness
  net: ipa: add a missing __iomem attribute
  net: ipa: pass correct dma_handle to dma_free_coherent()
  r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
  net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
  net: mvpp2: TCAM entry enable should be written after SRAM data
  net: lapb: Copy the skb before sending a packet
  net/mlx5e: Release skb in case of failure in tc update skb
  net/mlx5e: Update max_opened_tc also when channels are closed
  net/mlx5: Fix leak upon failure of rule creation
  net/mlx5: Fix function calculation for page trees
  docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
  udp: ipv4: manipulate network header of NATed UDP GRO fraglist
  net: ip_tunnel: fix mtu calculation
  vsock: fix the race conditions in multi-transport support
  net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
  ibmvnic: device remove has higher precedence over reset
  ...

net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary

sup_multicast_addr is passed to ether_addr_equal for address comparison
which casts the address inputs to u16 leading to an unaligned access.
Aligning the sup_multicast_addr to u16 boundary fixes the issue.

Signed-off-by: Andreas Oetken <andreas.oetken@siemens.com>
Link: https://lore.kernel.org/r/20210202090304.2740471-1-ennoerlangen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mlx5-fixes-2021-02-01' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 fixes 2021-02-01

Please note the first patch in this series
("Fix function calculation for page trees") is fixing a regression
due to previous fix in net which you didn't include in your previous
rc pr. So I hope this series will make it into your next rc pr,
so mlx5 won't be broken in the next rc.

* tag 'mlx5-fixes-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Release skb in case of failure in tc update skb
  net/mlx5e: Update max_opened_tc also when channels are closed
  net/mlx5: Fix leak upon failure of rule creation
  net/mlx5: Fix function calculation for page trees
====================

Link: https://lore.kernel.org/r/20210202070703.617251-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-a-few-bug-fixes'

Alex Elder says:

====================
net: ipa: a few bug fixes

This series fixes four minor bugs. The first two are things that
sparse points out. All four are very simple and each patch should
explain itself pretty well.
====================

Link: https://lore.kernel.org/r/20210201232609.3524451-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: fix two format specifier errors

Fix two format specifiers that used %lu for a size_t in "ipa_mem.c".

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: use the right accessor in ipa_endpoint_status_skip()

When extracting the destination endpoint ID from the status in
ipa_endpoint_status_skip(), u32_get_bits() is used. This happens to
work, but it's wrong: the structure field is only 8 bits wide
instead of 32.

Fix this by using u8_get_bits() to get the destination endpoint ID.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: be explicit about endianness

Sparse warns that the assignment of the metadata mask for a QMAP
endpoint in ipa_endpoint_init_hdr_metadata_mask() is a bad
assignment. We know we want the mask value to be big endian, even
though the value we write is in host byte order. Use a __force
tag to indicate we really mean it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: add a missing __iomem attribute

The virt local variable in gsi_channel_state() does not have an
__iomem attribute but should. Fix this.

Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Amy Parker <enbyamy@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: pass correct dma_handle to dma_free_coherent()

The "ring->addr = addr;" assignment is done a few lines later so we
can't use "ring->addr" yet. The correct dma_handle is "addr".

Fixes: 650d1603825d ("soc: qcom: ipa: the generic software interface")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/YBjpTU2oejkNIULT@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set

So far phy_disconnect() is called before free_irq(). If CONFIG_DEBUG_SHIRQ
is set and interrupt is shared, then free_irq() creates an "artificial"
interrupt by calling the interrupt handler. The "link change" flag is set
in the interrupt status register, causing phylib to eventually call
phy_suspend(). Because the net_device is detached from the PHY already,
the PHY driver can't recognize that WoL is configured and powers down the
PHY.

Fixes: f1e911d5d0df ("r8169: add basic phylib support")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/fe732c2c-a473-9088-3974-df83cfbd6efd@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS

syzbot found WARNING in rds_rdma_extra_size [1] when RDS_CMSG_RDMA_ARGS
control message is passed with user-controlled
0x40001 bytes of args->nr_local, causing order >= MAX_ORDER condition.

The exact value 0x40001 can be checked with UIO_MAXIOV which is 0x400.
So for kcalloc() 0x400 iovecs with sizeof(struct rds_iovec) = 0x10
is the closest limit, with 0x10 leftover.

Same condition is currently done in rds_cmsg_rdma_args().

[1] WARNING: mm/page_alloc.c:5011
[..]
Call Trace:
alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
alloc_pages include/linux/gfp.h:547 [inline]
kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
kmalloc_array include/linux/slab.h:592 [inline]
kcalloc include/linux/slab.h:621 [inline]
rds_rdma_extra_size+0xb2/0x3b0 net/rds/rdma.c:568
rds_rm_size net/rds/send.c:928 [inline]

Reported-by: syzbot+1bd2b07f93745fa38425@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Link: https://lore.kernel.org/r/20210201203233.1324704-1-snovitoll@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvpp2: TCAM entry enable should be written after SRAM data

Last TCAM data contains TCAM enable bit.
It should be written after SRAM data before entry enabled.

Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Stefan Chulski <stefanc@marvell.com>
Link: https://lore.kernel.org/r/1612172139-28343-1-git-send-email-stefanc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lapb: Copy the skb before sending a packet

When sending a packet, we will prepend it with an LAPB header.
This modifies the shared parts of a cloned skb, so we should copy the
skb rather than just clone it, before we prepend the header.

In "Documentation/networking/driver.rst" (the 2nd point), it states
that drivers shouldn't modify the shared parts of a cloned skb when
transmitting.

The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called
when an skb is being sent, clones the skb and sents the clone to
AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte
pseudo-header before handing over the skb to us, if we don't copy the
skb before prepending the LAPB header, the first byte of the packets
received on AF_PACKET sockets can be corrupted.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20210201055706.415842-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mac80211-for-net-2021-02-02' of git://git./linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two fixes:
- station rate tables were not updated correctly
   after association, leading to bad configuration
- rtl8723bs (staging) was initializing data incorrectly
   after the previous fix and needed to move the init
   later

* tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
  staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
  mac80211: fix station rate table updates on assoc
====================

Link: https://lore.kernel.org/r/20210202143505.37610-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Release skb in case of failure in tc update skb

In case of failure in tc update skb the packet is dropped
without freeing the skb.

Fixed by freeing the skb in case failure in tc update skb.

Fixes: d6d27782864f ("net/mlx5: E-Switch, Restore chain id on miss")
Fixes: c75690972228 ("net/mlx5e: Add tc chains offload support for nic flows")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Update max_opened_tc also when channels are closed

max_opened_tc is used for stats, so that potentially non-zero stats
won't disappear when num_tc decreases. However, mlx5e_setup_tc_mqprio
fails to update it in the flow where channels are closed.

This commit fixes it. The new value of priv->channels.params.num_tc is
always checked on exit. In case of errors it will just be the old value,
and in case of success it will be the updated value.

Fixes: 05909babce53 ("net/mlx5e: Avoid reset netdev stats on configuration changes")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Fix leak upon failure of rule creation

When creation of a new rule that requires allocation of an FTE fails,
need to call to tree_put_node on the FTE in order to release its'
resource.

Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Alaa Hleihel <alaa@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Fix function calculation for page trees

The function calculation always results in a value of 0. This works
generally, but when the release all pages feature is enabled it will
result in crashes.

Fixes: 0aa128475d33 ("net/mlx5: Maintain separate page trees for ECPF and PF functions")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Avoid unnecessary csum recalculation on supporting devices

If as part of the actions the TTL of the packet is modified, the packet's
checksum needs to be recalculated. Connect-X6DX can handle this csum
recalculation natively. Older devices require this additional recalculation.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: CT: remove useless conversion to PTR_ERR then ERR_PTR

Just return the ptr directly.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: accel, remove redundant space

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: kTLS, Improve TLS RX workqueue scope

The TLS RX workqueue is needed only when kTLS RX device offload
is supported.

Move its creation from the general TLS init function to the
kTLS RX init.
Create it once at init time if supported, avoid creation/destroy
everytime the feature bit is toggled.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: remove h from printk format specifier

This change fixes the checkpatch warning described in this commit
commit cbacb5ab0aa0 ("docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi]")

Standard integer promotion is already done and %hx and %hhx is useless
so do not encourage the use of %hh[xudi] or %h[xudi].

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Increase indirection RQ table size to 256

Increasing the indirection RQ table size from 128 to 256 improves the
packet distribution over the NIC HW queues for various cases.

Let's take a look at the following scenario:
Assuming RSS result distributed uniformly and indirection table is filled
with queues in a cyclic manner.
Let N be the number of queues on a given setup.
If 256%N = 128%N = 0, then all queues have the same probability to be
chosen for a given RSS result.
This case doesn't improves nor degrade by this change.

If 256%N != 0 and 128%N != 0, there is a remainder which will favor some
queues. Increasing the indirection RQ table size to 256 reduce the ratio
between the favored queues probability to be selected to the rest of the
queues and improves the distribution.

For example, let's assume the number of queues is 56.
For a table size of 128, we have 128%56=16 queues which will have a 3/128
probability to be chosen and 2/128 for the rest 40.
16 queues have 1.5 times the probability to be chosen over the other 40.

For a table size of 256, we have 256%56=32 queues which will have a 5/256
probability to be chosen and 4/256 probability for the rest 24 queues.
Here 32 queues have 1.25 more probability to be chosen over the other 24.

This shows that the larger indirection table size would more likely cause
an even distribution.

This change also aligns our mlx5 driver's indirection table size with
other vendors.

Signed-off-by: Noam Stolero <noams@nvidia.com>
Reviewed-by: Tal Gilboa <talgi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Enable napi in channel's activation stage

The channel's napi is first needed upon activation, not creation.
Minimize its enabled scope by moving it from the channel's open/close
stage into the activate/deactivate stage.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Move representor neigh init into profile enable

Also cleanup neigh in profile disable.
This is for logical separation.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Avoid false lock depenency warning on tc_ht

To avoid false lock dependency warning set the tc_ht lock
class different than the lock class of the ht being used when deleting
last flow from a group and then deleting a group, we get into del_sw_flow_group()
which call rhashtable_destroy on fg->ftes_hash which will take ht->mutex but
it's different than the ht->mutex here.

======================================================
WARNING: possible circular locking dependency detected
5.11.0-rc4_net_next_mlx5_949fdcc #1 Not tainted
------------------------------------------------------
modprobe/12950 is trying to acquire lock:
ffff88816510f910 (&node->lock){++++}-{3:3}, at: mlx5_del_flow_rules+0x2a/0x210 [mlx5_core]

but task is already holding lock:
ffff88815834e3e8 (&ht->mutex){+.+.}-{3:3}, at: rhashtable_free_and_destroy+0x37/0x340

which lock already depends on the new lock.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Move set vxlan nic info to profile init

Since its profile dependent let's init the vxlan info
as part of profile initialization.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Move netif_carrier_off() out of mlx5e_priv_init()

It's not part of priv initialization.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Refactor mlx5e_netdev_init/cleanup to mlx5e_priv_init/cleanup

We actually initialize priv and not netdev. The only call to
set netdev carrier will be moved in the following commit.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mxl5e: Add change profile method

Port nic netdevice will be used as uplink representor in downstream
patches. Add change profile method to allow changing a mlx5e netdevice
profile dynamically.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>

net/mlx5e: Separate between netdev objects and mlx5e profiles initialization

1) Initialize netdevice features and structures on netdevice allocation
   and outside of the mlx5e profile.

2) As now mlx5e netdevice private params will be setup on profile init only
   after netdevice features are already set, we add  a call to
   netde_update_features() to resolve any conflict.
   This is nice since we reuse the fix_features ndo code if a profile
   wants different default features, instead of duplicating features
   conflict resolution code on profile initialization.

3) With this we achieve total separation between mlx5e profiles and
   netdevices, and will allow replacing mlx5e profiles on the fly to reuse
   the same netdevice for multiple profiles.
   e.g. for uplink representor profile as shown in the following patch

4) Profile callbacks are not allowed to touch netdev->features directly
   anymore, since in downstream patch we will detach/attach netdev
   dynamically to profile, hence we move the code dealing with
   netdev->features from profile->init() to fix_features ndo, and we
   will call netdev_update_features() on
   mlx5e_attach_netdev(profile, netdev);

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>

Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-02-01

This series contains updates to igc and i40e drivers.

Kai-Heng Feng fixes igc to report unknown speed and duplex during suspend
as an attempted read will cause errors.

Kevin Lo sets the default value to -IGC_ERR_NVM instead of success for
writing shadow RAM as this could miss a timeout. Also propagates the return
value for Flow Control configuration to properly pass on errors for igc.

Aleksandr reverts commit 2ad1274fa35a ("i40e: don't report link up for a VF
who hasn't enabled queues") as this can cause link flapping.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  i40e: Revert "i40e: don't report link up for a VF who hasn't enabled queues"
  igc: check return value of ret_val in igc_config_fc_after_link_up
  igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
  igc: Report speed and duplex as unknown when device is runtime suspended
====================

Link: https://lore.kernel.org/r/20210201214618.852831-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rework-the-memory-barrier-for-scrq-entry'

Lijun Pan says:

====================
rework the memory barrier for SCRQ entry

This series rework the memory barrier for SCRQ (Sub-Command-Response
Queue) entry.
====================

Link: https://lore.kernel.org/r/20210130011905.1485-1-ljp@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: remove unnecessary rmb() inside ibmvnic_poll

rmb() can be removed since:
1. pending_scrq() has dma_rmb() at the function end;
2. dma_rmb(), though weaker, is enough here.

Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Acked-by: Dwip Banerjee <dnbanerg@us.ibm.com>
Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: rework to ensure SCRQ entry reads are properly ordered

Move the dma_rmb() between pending_scrq() and ibmvnic_next_scrq()
into the end of pending_scrq() to save the duplicated code since
this dma_rmb will be used 3 times.

Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc

Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Link: https://lore.kernel.org/r/20210130190518.854806-1-vincent@bernat.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp: ipv4: manipulate network header of NATed UDP GRO fraglist

UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.

A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.

Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.

Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ip_tunnel: fix mtu calculation

dev->hard_header_len for tunnel interface is set only when header_ops
are set too and already contains full overhead of any tunnel encapsulation.
That's why there is not need to use this overhead twice in mtu calc.

Fixes: fdafed459998 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly")
Reported-by: Slava Bacherikov <mail@slava.cc>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: fix the race conditions in multi-transport support

There are multiple similar bugs implicitly introduced by the
commit c0cfa2d8a788fcf4 ("vsock: add multi-transports support") and
commit 6a2c0962105ae8ce ("vsock: prevent transport modules unloading").

The bug pattern:
[1] vsock_sock.transport pointer is copied to a local variable,
[2] lock_sock() is called,
[3] the local variable is used.
VSOCK multi-transport support introduced the race condition:
vsock_sock.transport value may change between [1] and [2].

Let's copy vsock_sock.transport pointer to local variables after
the lock_sock() call.

Fixes: c0cfa2d8a788fcf4 ("vsock: add multi-transports support")
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Link: https://lore.kernel.org/r/20210201084719.2257066-1-alex.popov@linux.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>