OSDN Git Service

uclinux-h8/linux.git
2 years ago net: dpaa2-mac: populate supported_interfaces member
Russell King [Wed, 17 Nov 2021 17:24:02 +0000 (17:24 +0000)]
net: dpaa2-mac: populate supported_interfaces member

Populate the phy interface mode bitmap for the Freescale DPAA2 driver
with the interface modes supported by the MAC.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge branch 'ag71xx-phylink'
David S. Miller [Thu, 18 Nov 2021 11:36:48 +0000 (11:36 +0000)]
Merge branch 'ag71xx-phylink'

Russell King says:

====================
net: ag71xx: phylink validate implementation updates

This series converts ag71xx to fill in the supported_interfaces member
of phylink_config, cleans up the validate() implementation, and then
converts to phylink_generic_validate().

The question over the port linkmode restriction has been answered by
Oleksij - there is no reason for this restriction, so we can go the
whole hog with this conversion. Thanks!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ag71xx: use phylink_generic_validate()
Russell King (Oracle) [Wed, 17 Nov 2021 16:46:31 +0000 (16:46 +0000)]
net: ag71xx: use phylink_generic_validate()

ag71xx apparently only supports MII port type, which makes it different
from other implementations. However, Oleksij says there is no special
reason for this.

Convert the driver to use phylink_generic_validate(), which will allow
all ethtool port linkmodes instead of only MII, giving the driver
consistent behaviour with other drivers.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ag71xx: remove interface checks in ag71xx_mac_validate()
Russell King (Oracle) [Wed, 17 Nov 2021 16:46:25 +0000 (16:46 +0000)]
net: ag71xx: remove interface checks in ag71xx_mac_validate()

As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode, nor handle
PHY_INTERFACE_MODE_NA in the validation function. Remove these to
simplify the implementation.
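
The division of labour described above can be illustrated with a minimal user-space sketch. It is not the phylink code: `enum if_mode`, `struct mac_config`, `mac_set_supported()` and `framework_mode_supported()` are hypothetical names, and a single 32-bit word stands in for the kernel's interface-mode bitmap. The driver fills in the bitmap once, and the framework rejects unsupported modes before the driver's validation function is ever called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced set of interface modes (the kernel defines many more). */
enum if_mode {
	IF_MODE_NA,
	IF_MODE_MII,
	IF_MODE_RMII,
	IF_MODE_RGMII,
	IF_MODE_SGMII,
	IF_MODE_MAX
};

/* Hypothetical stand-in for a config structure carrying supported_interfaces. */
struct mac_config {
	uint32_t supported_interfaces; /* one bit per enum if_mode value */
};

/* Called by the driver at setup time for each mode the MAC supports. */
static void mac_set_supported(struct mac_config *cfg, enum if_mode mode)
{
	assert(mode < IF_MODE_MAX);
	cfg->supported_interfaces |= UINT32_C(1) << mode;
}

/* Check made once by the framework, so a driver's validate() need not repeat it. */
static bool framework_mode_supported(const struct mac_config *cfg, enum if_mode mode)
{
	assert(mode < IF_MODE_MAX);
	return (cfg->supported_interfaces >> mode) & 1u;
}
```

With the check centralised like this, a per-driver validation function only has to describe link modes, which is what makes the later switch to a generic helper possible.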

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ag71xx: populate supported_interfaces member
Russell King [Wed, 17 Nov 2021 16:46:20 +0000 (16:46 +0000)]
net: ag71xx: populate supported_interfaces member

Populate the phy_interface_t bitmap for the Atheros ag71xx driver with
the interface modes supported by the MAC.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: stmmac: dwmac-qcom-ethqos: add platform level clocks management
Bhupesh Sharma [Wed, 17 Nov 2021 11:05:38 +0000 (16:35 +0530)]
net: stmmac: dwmac-qcom-ethqos: add platform level clocks management

Split clocks settings from init callback into clks_config callback,
which could support platform level clock management.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago ipv4/raw: support binding to nonlocal addresses
Riccardo Paolo Bestetti [Wed, 17 Nov 2021 09:00:11 +0000 (10:00 +0100)]
ipv4/raw: support binding to nonlocal addresses

Add support to inet v4 raw sockets for binding to nonlocal addresses
through the IP_FREEBIND and IP_TRANSPARENT socket options, as well as
the ipv4.ip_nonlocal_bind kernel parameter.

Add a helper function to inet_sock.h to check for bind address validity on
the basis of the address type and whether nonlocal addresses are enabled
for the socket via any of the sockopts/sysctl, deduplicating checks in
ipv4/ping.c, ipv4/af_inet.c, ipv6/af_inet6.c (for mapped v4->v6
addresses), and ipv4/raw.c.
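
A rough user-space sketch of the kind of check such a helper performs is shown below. All names here (`struct sock_opts`, `enum addr_type`, `can_nonlocal_bind()`, `bind_addr_valid()`) are hypothetical and the address classes are simplified; the sketch only illustrates the rule that a bind address is accepted when nonlocal binding is enabled by any of the three knobs, or when the address is the wildcard, local, multicast or broadcast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address classes, loosely modelled on route types. */
enum addr_type {
	ADDR_UNICAST_NONLOCAL,
	ADDR_LOCAL,
	ADDR_MULTICAST,
	ADDR_BROADCAST
};

/* Hypothetical per-socket and system-wide knobs. */
struct sock_opts {
	bool freebind;             /* IP_FREEBIND set on the socket */
	bool transparent;          /* IP_TRANSPARENT set on the socket */
	bool sysctl_nonlocal_bind; /* ipv4.ip_nonlocal_bind enabled */
};

/* True if any sockopt or the sysctl allows binding to nonlocal addresses. */
static bool can_nonlocal_bind(const struct sock_opts *o)
{
	assert(o);
	return o->sysctl_nonlocal_bind || o->freebind || o->transparent;
}

/* addr is an IPv4 address in any fixed byte order; 0 stands for the wildcard. */
static bool bind_addr_valid(const struct sock_opts *o, uint32_t addr,
			    enum addr_type type)
{
	return can_nonlocal_bind(o) ||
	       addr == 0 ||
	       type == ADDR_LOCAL ||
	       type == ADDR_MULTICAST ||
	       type == ADDR_BROADCAST;
}
```

Keeping the rule in one function means every socket family that shares it (raw, ping, inet, mapped v4-in-v6) applies the same policy.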

Add test cases with IP[V6]_FREEBIND verifying that both v4 and v6 raw
sockets support binding to nonlocal addresses after the change. Add
necessary support for the test cases to nettest.

Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211117090010.125393-1-pbl@bestov.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago net: add missing include in include/net/gro.h
Eric Dumazet [Wed, 17 Nov 2021 10:01:30 +0000 (02:01 -0800)]
net: add missing include in include/net/gro.h

This is needed for some arches, as reported by Geert Uytterhoeven,
Randy Dunlap and Stephen Rothwell.

Fixes: 4721031c3559 ("net: move gro definitions to include/net/gro.h")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20211117100130.2368319-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago stmmac: fix build due to brainos in trans_start changes
Alexander Lobakin [Wed, 17 Nov 2021 15:29:17 +0000 (16:29 +0100)]
stmmac: fix build due to brainos in trans_start changes

txq_trans_cond_update() takes netdev_tx_queue *nq,
not nq->trans_start.

Fixes: 5337824f4dc4 ("net: annotate accesses to queue->trans_start")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211117152917.3739-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago Merge branch 'dev_watchdog-less-intrusive'
David S. Miller [Wed, 17 Nov 2021 14:56:16 +0000 (14:56 +0000)]
Merge branch 'dev_watchdog-less-intrusive'

Eric Dumazet says:

====================
net: make dev_watchdog() less intrusive

dev_watchdog() is used on many NICs to periodically monitor TX queues
to detect hangs.

The problem is that it stops all queues, then checks them, then 'unfreezes' them.

Not only does this stop feeding the NIC, it also migrates all qdiscs
to be serviced on the cpu calling netif_tx_unlock(), causing
a potential latency artifact.

With many TX queues, this is becoming more visible.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: no longer stop all TX queues in dev_watchdog()
Eric Dumazet [Wed, 17 Nov 2021 03:29:24 +0000 (19:29 -0800)]
net: no longer stop all TX queues in dev_watchdog()

There is no reason to stop all TX queues from dev_watchdog().

Not only does this stop feeding the NIC, it also migrates all qdiscs
to be serviced on the cpu calling netif_tx_unlock(), causing
a potential latency artifact.
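
As an illustration only, a watchdog pass that inspects queues without freezing them might look like the user-space sketch below. `struct wd_txq` and `watchdog_scan()` are hypothetical, C11 relaxed atomics stand in for the kernel's lockless accessors, and the sketch deliberately omits whatever locking the real dev_watchdog() still takes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified TX queue; not the kernel's queue structure. */
struct wd_txq {
	atomic_ulong trans_start; /* tick of the last transmit */
	atomic_bool stopped;      /* true while the queue is stopped */
};

/*
 * Return the index of the first stopped queue whose last transmit is more
 * than `timeout` ticks old, or -1 if there is none. Queues keep running
 * while they are inspected; nothing is stopped or frozen here.
 */
static int watchdog_scan(struct wd_txq *queues, size_t n,
			 unsigned long now, unsigned long timeout)
{
	assert(queues || n == 0);
	for (size_t i = 0; i < n; i++) {
		unsigned long start = atomic_load_explicit(&queues[i].trans_start,
							   memory_order_relaxed);
		bool stopped = atomic_load_explicit(&queues[i].stopped,
						    memory_order_relaxed);

		if (stopped && now - start > timeout)
			return (int)i;
	}
	return -1;
}
```

Because the scan only reads per-queue state, transmitting CPUs are never blocked by it, which is the latency benefit the series is after.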

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: do not inline netif_tx_lock()/netif_tx_unlock()
Eric Dumazet [Wed, 17 Nov 2021 03:29:23 +0000 (19:29 -0800)]
net: do not inline netif_tx_lock()/netif_tx_unlock()

These are not in the fast path, so there is no point in inlining them.

Also provide netif_freeze_queues()/netif_unfreeze_queues()
so that we can use them from dev_watchdog() in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: annotate accesses to queue->trans_start
Eric Dumazet [Wed, 17 Nov 2021 03:29:22 +0000 (19:29 -0800)]
net: annotate accesses to queue->trans_start

In following patches, dev_watchdog() will no longer stop all queues.
It will read queue->trans_start locklessly.
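
A user-space analogue of annotated lockless access is sketched below, assuming C11 relaxed atomics as a stand-in for the kernel's READ_ONCE()/WRITE_ONCE() annotations (they are similar in intent, not identical). `struct txq_stamp`, `txq_stamp_update()` and `txq_stamp_read()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical holder for a per-queue "last transmit" timestamp. */
struct txq_stamp {
	atomic_ulong trans_start;
};

/* Writer side: store only when the value changes, so an unchanged
 * timestamp does not dirty the cache line again. */
static void txq_stamp_update(struct txq_stamp *q, unsigned long now)
{
	assert(q);
	if (atomic_load_explicit(&q->trans_start, memory_order_relaxed) != now)
		atomic_store_explicit(&q->trans_start, now, memory_order_relaxed);
}

/* Reader side: a single untorn load, with no lock held. */
static unsigned long txq_stamp_read(struct txq_stamp *q)
{
	assert(q);
	return atomic_load_explicit(&q->trans_start, memory_order_relaxed);
}
```

Note that the update function takes the queue object itself rather than the timestamp value, so the accessor controls how the field is read and written.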

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: use an atomic_long_t for queue->trans_timeout
Eric Dumazet [Wed, 17 Nov 2021 03:29:21 +0000 (19:29 -0800)]
net: use an atomic_long_t for queue->trans_timeout

tx_timeout_show() assumed dev_watchdog() would stop all
the queues, to fetch queue->trans_timeout under protection
of the queue->_xmit_lock.

As we no longer want to disrupt transmits, we use an
atomic_long_t instead.
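
The idea can be sketched in user space with a C11 `atomic_long`, assumed here as an analogue of the kernel's atomic_long_t. `struct txq_stats`, `txq_note_timeout()` and `txq_timeout_count()` are hypothetical names: the watchdog increments the counter and a statistics reader fetches it, with neither side taking the queue lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-queue statistics holder. */
struct txq_stats {
	atomic_long trans_timeout; /* number of TX timeouts seen on this queue */
};

/* Watchdog side: record one more timeout without any lock. */
static void txq_note_timeout(struct txq_stats *s)
{
	assert(s);
	atomic_fetch_add_explicit(&s->trans_timeout, 1, memory_order_relaxed);
}

/* Reader side (for example a statistics file): fetch the current count. */
static long txq_timeout_count(struct txq_stats *s)
{
	assert(s);
	return atomic_load_explicit(&s->trans_timeout, memory_order_relaxed);
}
```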

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: david decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge tag 'for-net-next-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Wed, 17 Nov 2021 14:52:44 +0000 (14:52 +0000)]
Merge tag 'for-net-next-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - Add support for AOSP Bluetooth Quality Report
 - Enables AOSP extension for Mediatek Chip (MT7921 & MT7922)
 - Rework of HCI command execution serialization
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ethernet: ti: cpsw: Enable PHY timestamping
Kurt Kanzenbach [Tue, 16 Nov 2021 08:03:25 +0000 (09:03 +0100)]
net: ethernet: ti: cpsw: Enable PHY timestamping

If the PHYs in use also support hardware timestamping, all configuration requests
should be forwarded to the PHYs instead of being processed by the MAC driver
itself.

This enables PHY timestamping in combination with the cpsw driver.

Tested with an am335x based board with two DP83640 PHYs connected to the cpsw
switch.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Documentation: networking: net_failover: Fix documentation
Vasudev Kamath [Tue, 16 Nov 2021 07:21:48 +0000 (12:51 +0530)]
Documentation: networking: net_failover: Fix documentation

Update net_failover documentation with missing and incomplete
details to get a proper working setup.

Signed-off-by: Vasudev Kamath <vasudev@copyninja.info>
Reviewed-by: Krishna Kumar <krikku@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge branch 'ocelot_net-phylink'
David S. Miller [Wed, 17 Nov 2021 11:25:45 +0000 (11:25 +0000)]
Merge branch 'ocelot_net-phylink'

Russell King says:

====================
net: ocelot_net: phylink validate implementation updates

This series converts ocelot_net to fill in the supported_interfaces
member of phylink_config, cleans up the validate() implementation,
and then converts to phylink_generic_validate().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ocelot_net: use phylink_generic_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 10:09:41 +0000 (10:09 +0000)]
net: ocelot_net: use phylink_generic_validate()

ocelot_net has no special behaviour in its validation implementation, so
can be switched to phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ocelot_net: remove interface checks in macb_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 10:09:36 +0000 (10:09 +0000)]
net: ocelot_net: remove interface checks in macb_validate()

As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode in the
validation function. Remove this to simplify it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: ocelot_net: populate supported_interfaces member
Russell King (Oracle) [Tue, 16 Nov 2021 10:09:31 +0000 (10:09 +0000)]
net: ocelot_net: populate supported_interfaces member

Populate the phy interface mode bitmap for the MSCC Ocelot driver with
the interface modes supported by the MAC.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge branch 'mtk_eth_soc-phylink'
David S. Miller [Wed, 17 Nov 2021 11:23:39 +0000 (11:23 +0000)]
Merge branch 'mtk_eth_soc-phylink'

Russell King says:

====================
net: mtk_eth_soc: phylink validate implementation updates

This series converts mtk_eth_soc to fill in the supported_interfaces
member of phylink_config, cleans up the validate() implementation, and
then converts to phylink_generic_validate().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: mtk_eth_soc: use phylink_generic_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 10:06:58 +0000 (10:06 +0000)]
net: mtk_eth_soc: use phylink_generic_validate()

mtk_eth_soc has no special behaviour in its validation implementation,
so can be switched to phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: mtk_eth_soc: drop use of phylink_helper_basex_speed()
Russell King (Oracle) [Tue, 16 Nov 2021 10:06:53 +0000 (10:06 +0000)]
net: mtk_eth_soc: drop use of phylink_helper_basex_speed()

Now that we have a better method to select SFP interface modes, we
no longer need to use phylink_helper_basex_speed() in a driver's
validation function, and we can also get rid of our hack to indicate
both 1000base-X and 2500base-X if the comphy is present to make that
work. Remove this hack and use of phylink_helper_basex_speed().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: mtk_eth_soc: remove interface checks in mtk_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 10:06:48 +0000 (10:06 +0000)]
net: mtk_eth_soc: remove interface checks in mtk_validate()

As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode, nor handle
PHY_INTERFACE_MODE_NA in the validation function. Remove these to
simplify the implementation.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: mtk_eth_soc: populate supported_interfaces member
Russell King (Oracle) [Tue, 16 Nov 2021 10:06:43 +0000 (10:06 +0000)]
net: mtk_eth_soc: populate supported_interfaces member

Populate the phy interface mode bitmap for the Mediatek driver with
the interface modes supported by the MAC.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge branch 'sparx5-phylink'
David S. Miller [Wed, 17 Nov 2021 11:21:42 +0000 (11:21 +0000)]
Merge branch 'sparx5-phylink'

Russell King says:

====================
net: sparx5: phylink validate implementation updates

This series converts sparx5 to fill in the supported_interfaces member
of phylink_config, cleans up the validate() implementation, and then
converts to phylink_generic_validate().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: sparx5: use phylink_generic_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 10:02:11 +0000 (10:02 +0000)]
net: sparx5: use phylink_generic_validate()

Sparx5 has no special behaviour in its validation implementation, so can
be switched to phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: sparx5: clean up sparx5_phylink_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 10:02:06 +0000 (10:02 +0000)]
net: sparx5: clean up sparx5_phylink_validate()

sparx5_phylink_validate() no longer needs to check for
PHY_INTERFACE_MODE_NA as phylink will walk the supported interface
types to discover the link mode capabilities. Neither is it necessary
to check the device capabilities as we will not be called for
unsupported interface modes. Remove these checks.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: sparx5: populate supported_interfaces member
Russell King (Oracle) [Tue, 16 Nov 2021 10:02:01 +0000 (10:02 +0000)]
net: sparx5: populate supported_interfaces member

Populate the phy_interface_t bitmap for the Microchip Sparx5 driver
with the interface modes supported by the MAC.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge branch 'enetc-phylink'
David S. Miller [Wed, 17 Nov 2021 11:19:28 +0000 (11:19 +0000)]
Merge branch 'enetc-phylink'

Russell King says:

====================
net: enetc: phylink validate implementation updates

This series converts enetc to fill in the supported_interfaces member
of phylink_config, cleans up the validate() implementation, and then
converts to phylink_generic_validate().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: enetc: use phylink_generic_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 09:59:08 +0000 (09:59 +0000)]
net: enetc: use phylink_generic_validate()

enetc has no special behaviour in its validation implementation, so can
be switched to phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: enetc: remove interface checks in enetc_pl_mac_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 09:59:03 +0000 (09:59 +0000)]
net: enetc: remove interface checks in enetc_pl_mac_validate()

As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode in the
validation function. Remove this to simplify it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: enetc: populate supported_interfaces member
Russell King (Oracle) [Tue, 16 Nov 2021 09:58:58 +0000 (09:58 +0000)]
net: enetc: populate supported_interfaces member

Populate the phy_interface_t bitmap for the Freescale enetc driver with
the interface modes supported by the MAC.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge branch 'xilinx-phylink'
David S. Miller [Wed, 17 Nov 2021 11:17:44 +0000 (11:17 +0000)]
Merge branch 'xilinx-phylink'

Russell King says:

====================
net: xilinx: phylink validate implementation updates

This series converts axienet to fill in the supported_interfaces member
of phylink_config, cleans up the validate() implementation, and then
converts to phylink_generic_validate().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: axienet: use phylink_generic_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 09:55:32 +0000 (09:55 +0000)]
net: axienet: use phylink_generic_validate()

axienet has no special behaviour in its validation implementation, so
can be switched to phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: axienet: remove interface checks in axienet_validate()
Russell King (Oracle) [Tue, 16 Nov 2021 09:55:27 +0000 (09:55 +0000)]
net: axienet: remove interface checks in axienet_validate()

As phylink checks the interface mode against the supported_interfaces
bitmap, we no longer need to validate the interface mode in the
validation function. Remove this to simplify it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net: axienet: populate supported_interfaces member
Russell King (Oracle) [Tue, 16 Nov 2021 09:55:22 +0000 (09:55 +0000)]
net: axienet: populate supported_interfaces member

Populate the phy_interface_t bitmap for the Xilinx axienet driver with
the interface modes supported by the MAC.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago Merge tag 'mlx5-updates-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Wed, 17 Nov 2021 11:03:43 +0000 (11:03 +0000)]
Merge tag 'mlx5-updates-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-11-16

Updates for mlx5 driver:

1) Support ethtool cq mode
2) Static allocation of mod header object for the common case
3) TC support for when local and remote VTEPs are in the same host
4) Create E-Switch QoS objects on demand to save on resources
5) Minor code improvements
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years ago net/mlx5: E-switch, Create QoS on demand
Dmytro Linkin [Tue, 21 Sep 2021 16:08:38 +0000 (19:08 +0300)]
net/mlx5: E-switch, Create QoS on demand

Don't create eswitch QoS (root TSAR) on switch mode change. Create it on
first child TSAR object creation - vport or rate group. Keep track of
root TSAR references and release the root TSAR on last object deletion.
There is no need to check whether QoS is enabled when installing a tc
matchall filter. Remove the related helper function as it has no users.
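
The create-on-first-use, release-on-last-use pattern described above can be sketched as follows. This is a generic illustration, not the mlx5 code: `struct qos_root`, `qos_root_get()` and `qos_root_put()` are hypothetical names, and a boolean stands in for the hardware object.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the on-demand root scheduling object. */
struct qos_root {
	int refcnt;   /* number of child objects currently using the root */
	bool created; /* stands in for "the root object exists in hardware" */
};

/* Called when a child object is created; the first caller creates the root. */
static void qos_root_get(struct qos_root *r)
{
	assert(r && r->refcnt >= 0);
	if (r->refcnt++ == 0)
		r->created = true;
}

/* Called when a child object is deleted; the last caller releases the root. */
static void qos_root_put(struct qos_root *r)
{
	assert(r && r->refcnt > 0);
	if (--r->refcnt == 0)
		r->created = false;
}
```

Callers then never need to ask whether QoS is enabled: taking a reference is what enables it.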

Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: E-switch, Enable vport QoS on demand
Dmytro Linkin [Tue, 21 Sep 2021 15:45:42 +0000 (18:45 +0300)]
net/mlx5: E-switch, Enable vport QoS on demand

Vports' QoS is not commonly used but consumes SW/HW resources, which
becomes an issue on BlueField SoC systems.
Don't enable QoS on vports by default on eswitch mode change and enable
when it's going to be used by one of the top level users:
- configuring TC matchall filter with police action;
- setting rate with legacy NDO API;
- calling devlink ops->rate_leaf_*() callbacks.

Disable vport QoS on vport cleanup.

Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: E-switch, move offloads mode callbacks to offloads file
Parav Pandit [Thu, 21 Oct 2021 15:21:30 +0000 (18:21 +0300)]
net/mlx5: E-switch, move offloads mode callbacks to offloads file

eswitch.c is mainly for common code between legacy and offloads mode.
MAC address get and set via devlink is applicable only in offloads mode.

Hence, move it to eswitch_offloads.c file.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: E-switch, Reuse mlx5_eswitch_set_vport_mac
Parav Pandit [Thu, 21 Oct 2021 15:17:52 +0000 (18:17 +0300)]
net/mlx5: E-switch, Reuse mlx5_eswitch_set_vport_mac

mlx5_eswitch_set_vport_mac() routine already does necessary checks which
are duplicated in implementation of
mlx5_devlink_port_function_hw_addr_set().

Hence, reuse mlx5_eswitch_set_vport_mac() and cut down the code.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: E-switch, Remove vport enabled check
Parav Pandit [Wed, 20 Oct 2021 04:56:01 +0000 (07:56 +0300)]
net/mlx5: E-switch, Remove vport enabled check

An eswitch vport of the devlink port is always enabled before a
devlink port is registered. And an eswitch vport is always disabled
after a devlink port is unregistered.
Hence avoid the vport enabled check in the devlink callback routine.
Such check is only applicable in the legacy SR-IOV callbacks.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Sunil Sudhakar Rani <sunrani@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5e: Specify out ifindex when looking up decap route
Chris Mi [Tue, 26 Oct 2021 09:08:24 +0000 (17:08 +0800)]
net/mlx5e: Specify out ifindex when looking up decap route

There is a use case in which the local and remote VTEPs are in the same
host. Currently, the out ifindex is not specified when looking up the
decap route for offloads. So in this case, a local route is returned
and the route dev is lo.

The actual tunnel interface can be created with a "dev" parameter [1],
which specifies the physical device to use for tunnel endpoint
communication. Pass this parameter to the driver when looking up the
decap route for offloads, so that a unicast route is returned.

[1] ip link add name vxlan1 type vxlan id 100 dev enp4s0f0 remote 1.1.1.1 dstport 4789

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5e: TC, Move comment about mod header flag to correct place
Roi Dayan [Wed, 10 Nov 2021 14:19:41 +0000 (16:19 +0200)]
net/mlx5e: TC, Move comment about mod header flag to correct place

Move the comment to the place where the driver actually removes the
flag, rather than the check for whether pedit actions may exist.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5e: TC, Move kfree() calls after destroying all resources
Roi Dayan [Mon, 1 Nov 2021 16:13:02 +0000 (18:13 +0200)]
net/mlx5e: TC, Move kfree() calls after destroying all resources

When deleting fdb/nic flow rules, first release all resources
and then make the kfree() calls, instead of scattering them around
the function.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5e: TC, Destroy nic flow counter if exists
Roi Dayan [Mon, 1 Nov 2021 16:02:00 +0000 (18:02 +0200)]
net/mlx5e: TC, Destroy nic flow counter if exists

The counter is only added if the counter flag exists.
So check that the counter flag exists before deleting the counter.
This is the same as in add/del fdb flow.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: TC, using swap() instead of tmp variable
Yihao Han [Wed, 3 Nov 2021 06:21:09 +0000 (23:21 -0700)]
net/mlx5: TC, using swap() instead of tmp variable

Use swap() instead of a tmp variable to swap values.

Signed-off-by: Yihao Han <hanyihao@vivo.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: CT: Allow static allocation of mod headers
Paul Blakey [Wed, 25 Aug 2021 13:46:41 +0000 (16:46 +0300)]
net/mlx5: CT: Allow static allocation of mod headers

As each CT rule uses at least 4 modify header actions, each rule
causes at least 3 reallocations by the mod header actions api.

Allow initial static allocation of the mod acts array, and use it for
CT rules. If the static allocation is exceeded go back to dynamic
allocation.
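
The general technique, an inline buffer with a heap fallback, can be sketched as below. This is not the mlx5 mod header API: `struct mod_acts`, the `mod_acts_*()` functions, the `int` action type and the capacity of 4 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define STATIC_ACTS 4 /* hypothetical inline capacity */

/* Action list that starts in an inline array and moves to the heap only
 * when the inline capacity is exceeded. Must not be copied by value,
 * because `acts` may point into the structure itself. */
struct mod_acts {
	int inline_buf[STATIC_ACTS];
	int *acts;  /* points at inline_buf or at a heap buffer */
	size_t num; /* actions stored */
	size_t cap; /* actions that fit without growing */
};

static void mod_acts_init(struct mod_acts *m)
{
	assert(m);
	m->acts = m->inline_buf;
	m->num = 0;
	m->cap = STATIC_ACTS;
}

static bool mod_acts_is_dynamic(const struct mod_acts *m)
{
	return m->acts != m->inline_buf;
}

/* Append one action; returns false only if a needed allocation fails. */
static bool mod_acts_push(struct mod_acts *m, int act)
{
	if (m->num == m->cap) {
		size_t new_cap = m->cap * 2;
		int *p;

		if (mod_acts_is_dynamic(m)) {
			p = realloc(m->acts, new_cap * sizeof(*p));
			if (!p)
				return false;
		} else {
			/* First overflow: move the inline contents to the heap. */
			p = malloc(new_cap * sizeof(*p));
			if (!p)
				return false;
			memcpy(p, m->inline_buf, m->num * sizeof(*p));
		}
		m->acts = p;
		m->cap = new_cap;
	}
	m->acts[m->num++] = act;
	return true;
}

/* Release any heap buffer and return to the empty inline state. */
static void mod_acts_free(struct mod_acts *m)
{
	if (mod_acts_is_dynamic(m))
		free(m->acts);
	mod_acts_init(m);
}
```

In the common case of at most `STATIC_ACTS` actions, no allocation happens at all, which is the saving the commit describes for CT rules.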

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
2 years ago net/mlx5e: Refactor mod header management API
Paul Blakey [Mon, 5 Jul 2021 08:31:47 +0000 (11:31 +0300)]
net/mlx5e: Refactor mod header management API

For all mod hdr related functions to reside in a single self contained
component (mod_hdr.c), refactor alloc() and add get_id() so that user
won't rely on internal implementation, and move both to mod_hdr
component.

Rename the prefix to mlx5e_mod_hdr_*, matching the other mod hdr functions.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: Avoid printing health buffer when firmware is unavailable
Aya Levin [Tue, 9 Nov 2021 13:44:58 +0000 (15:44 +0200)]
net/mlx5: Avoid printing health buffer when firmware is unavailable

Use the firmware version field as an indication of the health buffer's sanity.
When firmware version is 0xFFFFFFFF, deduce that firmware is unavailable
and avoid printing the health buffer to dmesg as it doesn't provide
debug info.
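
A minimal sketch of such a guard, under the assumption that the version and a syndrome value have already been read into plain integers, is shown below. `FW_VER_UNAVAILABLE` and `report_health()` are hypothetical names, and printf() stands in for a kernel log call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sentinel version value taken to mean "firmware is unavailable". */
#define FW_VER_UNAVAILABLE UINT32_C(0xFFFFFFFF)
static_assert(FW_VER_UNAVAILABLE == UINT32_MAX, "sentinel is all ones");

/* Print the health information only when the version looks sane.
 * Returns true if something was printed. */
static bool report_health(uint32_t fw_ver, uint32_t synd)
{
	if (fw_ver == FW_VER_UNAVAILABLE)
		return false; /* buffer contents would not help debugging */

	printf("fw_ver 0x%08x synd 0x%x\n", (unsigned int)fw_ver, (unsigned int)synd);
	return true;
}
```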

Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2 years ago net/mlx5: Fix format-security build warnings
Saeed Mahameed [Wed, 3 Nov 2021 21:01:05 +0000 (14:01 -0700)]
net/mlx5: Fix format-security build warnings

Treat the string as an argument to avoid this.

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c:482:5:
error: format string is not a string literal (potentially insecure)
                         name);
                         ^~~~
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:2079:4:
error: format string is not a string literal (potentially insecure)
                        ptp_ch_stats_desc[i].format);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
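
The fix pattern for this class of warning can be shown in user space with snprintf(). `format_name()` is a hypothetical helper: the variable string is passed as an argument to a literal "%s" format rather than being used as the format itself, so any '%' characters in it are copied verbatim instead of being interpreted.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Copy a variable string into buf. Writing snprintf(buf, size, name)
 * instead would treat name as a format string, which is what
 * -Wformat-security warns about.
 */
static int format_name(char *buf, size_t size, const char *name)
{
	assert(buf && name && size > 0);
	return snprintf(buf, size, "%s", name);
}
```
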
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
2 years ago net/mlx5e: Support ethtool cq mode
Saeed Mahameed [Wed, 15 Sep 2021 06:26:17 +0000 (23:26 -0700)]
net/mlx5e: Support ethtool cq mode

Add support for ethtool coalesce cq mode set and get.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
2 years ago net: document SMII and correct phylink's new validation mechanism
Russell King (Oracle) [Mon, 15 Nov 2021 17:11:17 +0000 (17:11 +0000)]
net: document SMII and correct phylink's new validation mechanism

SMII has not been documented in the kernel, but information on this PHY
interface mode has been recently found. Document it, and correct the
recently introduced phylink handling for this interface mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1mmfVl-0075nP-14@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago Merge branch 'r8169-disable-detection-of-further-chip-versions-that-didn-t-make-it...
Jakub Kicinski [Wed, 17 Nov 2021 03:10:34 +0000 (19:10 -0800)]
Merge branch 'r8169-disable-detection-of-further-chip-versions-that-didn-t-make-it-to-the-mass-market'

Heiner Kallweit says:

====================
r8169: disable detection of further chip versions that didn't make it to the mass market

There's no sign of life from further chip versions. Seems they didn't
make it to the mass market. Let's disable detection and if nobody
complains remove support a few kernel versions later.
====================

Link: https://lore.kernel.org/r/7708d13a-4a2b-090d-fadf-ecdd0fff5d2e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago r8169: disable detection of chip version 41
Heiner Kallweit [Mon, 15 Nov 2021 20:52:35 +0000 (21:52 +0100)]
r8169: disable detection of chip version 41

It seems this chip version never made it to the wild. Therefore
disable detection and if nobody complains remove support completely
later.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago r8169: disable detection of chip version 45
Heiner Kallweit [Mon, 15 Nov 2021 20:51:52 +0000 (21:51 +0100)]
r8169: disable detection of chip version 45

It seems this chip version never made it to the wild. Therefore
disable detection and if nobody complains remove support completely
later.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago r8169: disable detection of chip versions 49 and 50
Heiner Kallweit [Mon, 15 Nov 2021 20:51:14 +0000 (21:51 +0100)]
r8169: disable detection of chip versions 49 and 50

It seems these chip versions never made it to the wild. Therefore
disable detection and if nobody complains remove support completely
later.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years ago r8169: enable ASPM L1/L1.1 from RTL8168h
Heiner Kallweit [Mon, 15 Nov 2021 20:17:56 +0000 (21:17 +0100)]
r8169: enable ASPM L1/L1.1 from RTL8168h

With newer chip versions ASPM-related issues seem to occur only if
L1.2 is enabled. I have a test system with RTL8168h that gives a
number of rx_missed errors when running iperf and L1.2 is enabled.
With L1.2 disabled (and L1 + L1.1 active) everything is fine.
See also [0]. Can't test this, but L1 + L1.1 being active should be
sufficient to reach higher package power saving states.

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942830

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/36feb8c4-a0b6-422a-899c-e61f2e869dfe@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMerge branch 'net-better-packing-of-global-vars'
Jakub Kicinski [Wed, 17 Nov 2021 03:07:57 +0000 (19:07 -0800)]
Merge branch 'net-better-packing-of-global-vars'

Eric Dumazet says:

====================
net: better packing of global vars

First two patches avoid holes in data section,
and last patch makes sure some siphash keys are contained
in a single cache line.
====================

Link: https://lore.kernel.org/r/20211115172303.3732746-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agonet: align static siphash keys
Eric Dumazet [Mon, 15 Nov 2021 17:23:03 +0000 (09:23 -0800)]
net: align static siphash keys

siphash keys use 16 bytes.

Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.
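
A minimal userspace sketch of the alignment idea, assuming 64-byte cache lines. The `demo_` names are hypothetical stand-ins, not the kernel's definitions. Because 64 is a multiple of 16, a 16-byte object that starts on a 16-byte boundary can never straddle two cache lines.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */

/* Stand-in for the kernel's siphash key type: two 64-bit words. */
typedef struct {
    uint64_t key[2];
} demo_siphash_key_t;

static_assert(sizeof(demo_siphash_key_t) == 16, "siphash keys use 16 bytes");

/* Hypothetical userspace analogue of siphash_aligned_key_t. */
#define demo_siphash_aligned_key_t alignas(16) demo_siphash_key_t

static demo_siphash_aligned_key_t demo_key;

/* Returns 1 if the byte range [addr, addr + len) stays inside one cache line. */
static int fits_in_one_cache_line(uintptr_t addr, size_t len)
{
    assert(len > 0 && len <= CACHE_LINE_SIZE);
    return addr / CACHE_LINE_SIZE == (addr + len - 1) / CACHE_LINE_SIZE;
}
```

An unaligned 16-byte key starting at offset 56 of a line would cover bytes 56..71 and cross the boundary; the aligned key cannot.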

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agonet: use .data.once section in netdev_level_once()
Eric Dumazet [Mon, 15 Nov 2021 17:23:02 +0000 (09:23 -0800)]
net: use .data.once section in netdev_level_once()

Same rationale as the prior patch: using the dedicated
section avoids holes and packs all these bool values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoonce: use __section(".data.once")
Eric Dumazet [Mon, 15 Nov 2021 17:23:01 +0000 (09:23 -0800)]
once: use __section(".data.once")

.data.once contains nicely packed bool variables.
It is used already by DO_ONCE_LITE().

Using it also in DO_ONCE() removes holes in .data section.
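
A rough userspace sketch of the packing idea, assuming GCC or Clang on an ELF target. `DEMO_RUN_ONCE` is a hypothetical macro, not the kernel's DO_ONCE(). Each "done" flag is a one-byte bool; placing all of them in one dedicated section keeps them adjacent instead of scattering them between larger, more strictly aligned variables in .data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical once-helper: the flag lives in a dedicated section so that
 * all such bools are packed next to each other by the linker. */
#define DEMO_RUN_ONCE(stmt)                                             \
    do {                                                                \
        static bool demo_done __attribute__((section(".data.once")));  \
        if (!demo_done) {                                               \
            demo_done = true;                                           \
            stmt;                                                       \
        }                                                               \
    } while (0)

static int demo_init_count;
static int demo_other_count;

/* Each call site gets its own flag, so each body runs at most once. */
static void demo_init(void)
{
    DEMO_RUN_ONCE(demo_init_count++);
    assert(demo_init_count == 1);
}

static void demo_other(void)
{
    DEMO_RUN_ONCE(demo_other_count++);
    assert(demo_other_count == 1);
}
```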

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoBluetooth: btusb: enable Mediatek to support AOSP extension
mark-yw.chen [Thu, 4 Nov 2021 18:26:05 +0000 (02:26 +0800)]
Bluetooth: btusb: enable Mediatek to support AOSP extension

This patch enables AOSP extension for Mediatek Chip (MT7921 & MT7922).

Signed-off-by: mark-yw.chen <mark-yw.chen@mediatek.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoBluetooth: Attempt to clear HCI_LE_ADV on adv set terminated error event
Archie Pusaka [Thu, 11 Nov 2021 05:20:54 +0000 (13:20 +0800)]
Bluetooth: Attempt to clear HCI_LE_ADV on adv set terminated error event

We should clear the flag if the adv instance removed due to receiving
this error status is the last one we have.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoBluetooth: Ignore HCI_ERROR_CANCELLED_BY_HOST on adv set terminated event
Archie Pusaka [Thu, 11 Nov 2021 05:20:53 +0000 (13:20 +0800)]
Bluetooth: Ignore HCI_ERROR_CANCELLED_BY_HOST on adv set terminated event

This event is received when the controller stops advertising,
specifically for these three reasons:
(a) Connection is successfully created (success).
(b) Timeout is reached (error).
(c) Number of advertising events is reached (error).
(*) This event is NOT generated when the host stops the advertisement.
Refer to the BT spec ver 5.3 vol 4 part E sec 7.7.65.18. Note that the
section was revised from BT spec ver 5.0 vol 2 part E sec 7.7.65.18
which was ambiguous about (*).

Some chips (e.g. RTL8822CE) send this event when the host stops the
advertisement with status = HCI_ERROR_CANCELLED_BY_HOST (due to (*)
above). This is treated as an error and the advertisement will be
removed and userspace will be informed via MGMT event.

On suspend, we are supposed to temporarily disable advertisements,
and continue advertising on resume. However, due to the behavior
above, the advertisements are removed instead.

This patch returns early if HCI_ERROR_CANCELLED_BY_HOST is received.

Btmon snippet of the unexpected behavior:
@ MGMT Command: Remove Advertising (0x003f) plen 1
        Instance: 1
< HCI Command: LE Set Extended Advertising Enable (0x08|0x0039) plen 6
        Extended advertising: Disabled (0x00)
        Number of sets: 1 (0x01)
        Entry 0
          Handle: 0x01
          Duration: 0 ms (0x00)
          Max ext adv events: 0
> HCI Event: LE Meta Event (0x3e) plen 6
      LE Advertising Set Terminated (0x12)
        Status: Operation Cancelled by Host (0x44)
        Handle: 1
        Connection handle: 0
        Number of completed extended advertising events: 5
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Extended Advertising Enable (0x08|0x0039) ncmd 2
        Status: Success (0x00)
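
The early return can be sketched as follows. This is a simplified userspace model, not the actual handler: the struct, the function, and the `DEMO_` names are hypothetical, and only the 0x44 status value is taken from the trace above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HCI_SUCCESS                 0x00
#define DEMO_HCI_ERROR_CANCELLED_BY_HOST 0x44  /* value shown in the btmon trace */

struct demo_adv_state {
    int  instances;       /* advertising instances still registered */
    bool le_adv_enabled;  /* stand-in for the HCI_LE_ADV flag */
};

/* Models the "advertising set terminated" event.
 * Returns true if an advertising instance was removed. */
static bool demo_adv_set_terminated(struct demo_adv_state *st, uint8_t status)
{
    assert(st != NULL);

    /* The host stopped advertising itself (e.g. on suspend): keep the
     * instance so advertising can be re-enabled later. */
    if (status == DEMO_HCI_ERROR_CANCELLED_BY_HOST)
        return false;

    if (status != DEMO_HCI_SUCCESS) {
        /* Timeout or event-count limit: drop the instance, and clear the
         * flag when it was the last one. */
        if (st->instances > 0)
            st->instances--;
        if (st->instances == 0)
            st->le_adv_enabled = false;
        return true;
    }

    /* Success means a connection was created; not modelled here. */
    return false;
}
```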

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoBluetooth: hci_request: Remove bg_scan_update work
Luiz Augusto von Dentz [Fri, 12 Nov 2021 00:48:44 +0000 (16:48 -0800)]
Bluetooth: hci_request: Remove bg_scan_update work

This work is no longer necessary since all the code using it has been
converted to use hci_passive_scan/hci_passive_scan_sync.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoBluetooth: hci_sync: Convert MGMT_OP_SET_CONNECTABLE to use cmd_sync
Luiz Augusto von Dentz [Fri, 12 Nov 2021 00:48:43 +0000 (16:48 -0800)]
Bluetooth: hci_sync: Convert MGMT_OP_SET_CONNECTABLE to use cmd_sync

This makes MGMT_OP_SET_CONNECTABLE use hci_cmd_sync_queue instead of
using a dedicated connectable_update work.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoBluetooth: hci_sync: Convert MGMT_OP_SET_DISCOVERABLE to use cmd_sync
Luiz Augusto von Dentz [Fri, 12 Nov 2021 00:48:42 +0000 (16:48 -0800)]
Bluetooth: hci_sync: Convert MGMT_OP_SET_DISCOVERABLE to use cmd_sync

This makes MGMT_OP_SET_DISCOVERABLE use hci_cmd_sync_queue instead of
using a dedicated discoverable_update work.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoBluetooth: btmrvl_main: repair a non-kernel-doc comment
Randy Dunlap [Mon, 15 Nov 2021 03:05:17 +0000 (19:05 -0800)]
Bluetooth: btmrvl_main: repair a non-kernel-doc comment

Do not use "/**" to begin a non-kernel-doc comment.
Fixes this build warning:

drivers/bluetooth/btmrvl_main.c:2: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: linux-bluetooth@vger.kernel.org
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2 years agoMerge branch 'inuse-cleanups'
David S. Miller [Tue, 16 Nov 2021 13:20:45 +0000 (13:20 +0000)]
Merge branch 'inuse-cleanups'

Eric Dumazet says:

====================
net: prot_inuse and sock_inuse cleanups

Small series cleaning and optimizing sock_prot_inuse_add()
and sock_inuse_add().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: drop nopreempt requirement on sock_prot_inuse_add()
Eric Dumazet [Mon, 15 Nov 2021 17:11:50 +0000 (09:11 -0800)]
net: drop nopreempt requirement on sock_prot_inuse_add()

This is distracting really, let's make this simpler,
because many callers had to take care of this
by themselves, even if on x86 this adds more
code than really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: merge net->core.prot_inuse and net->core.sock_inuse
Eric Dumazet [Mon, 15 Nov 2021 17:11:49 +0000 (09:11 -0800)]
net: merge net->core.prot_inuse and net->core.sock_inuse

net->core.sock_inuse is a per cpu variable (int),
while net->core.prot_inuse is another per cpu variable
of 64 integers.

The per-cpu allocator tends to place them in very different places.

Grouping them together makes sense, since it makes
updates potentially faster, if hitting the same
cache line.
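
A sketch of what such a grouping could look like, with plain structs standing in for per-cpu data. The struct and function names are hypothetical; only the count of 64 per-protocol integers comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PROTO_INUSE_NR 64   /* "64 integers" from the description */

/* Hypothetical combined record: the socket total sits directly in front
 * of the per-protocol counters, so both updates can touch nearby memory. */
struct demo_prot_inuse {
    int all;                        /* formerly the separate sock_inuse counter */
    int val[DEMO_PROTO_INUSE_NR];   /* per-protocol counters */
};

static void demo_sock_inuse_add(struct demo_prot_inuse *pcpu, int val)
{
    assert(pcpu != NULL);
    pcpu->all += val;
}

static void demo_sock_prot_inuse_add(struct demo_prot_inuse *pcpu, int idx, int val)
{
    assert(pcpu != NULL && idx >= 0 && idx < DEMO_PROTO_INUSE_NR);
    pcpu->val[idx] += val;
}
```

With this layout the total and the first fifteen per-protocol counters share the first 64 bytes of the record, assuming 4-byte ints.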

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: make sock_inuse_add() available
Eric Dumazet [Mon, 15 Nov 2021 17:11:48 +0000 (09:11 -0800)]
net: make sock_inuse_add() available

MPTCP hard codes it, let us instead provide this helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: inline sock_prot_inuse_add()
Eric Dumazet [Mon, 15 Nov 2021 17:11:47 +0000 (09:11 -0800)]
net: inline sock_prot_inuse_add()

sock_prot_inuse_add() is very small, we can inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agoMerge branch 'gro-out-of-core-files'
David S. Miller [Tue, 16 Nov 2021 13:16:54 +0000 (13:16 +0000)]
Merge branch 'gro-out-of-core-files'

Eric Dumazet says:

====================
gro: get out of core files

Move GRO related content into net/core/gro.c
and include/net/gro.h.

This reduces GRO scope to where it is really needed,
and shrinks too big files (include/linux/netdevice.h
and net/core/dev.c)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: gro: populate net/core/gro.c
Eric Dumazet [Mon, 15 Nov 2021 17:05:54 +0000 (09:05 -0800)]
net: gro: populate net/core/gro.c

Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.

gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: gro: move skb_gro_receive into net/core/gro.c
Eric Dumazet [Mon, 15 Nov 2021 17:05:53 +0000 (09:05 -0800)]
net: gro: move skb_gro_receive into net/core/gro.c

net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: gro: move skb_gro_receive_list to udp_offload.c
Eric Dumazet [Mon, 15 Nov 2021 17:05:52 +0000 (09:05 -0800)]
net: gro: move skb_gro_receive_list to udp_offload.c

This helper is used once, no need to keep it in fat net/core/skbuff.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: move gro definitions to include/net/gro.h
Eric Dumazet [Mon, 15 Nov 2021 17:05:51 +0000 (09:05 -0800)]
net: move gro definitions to include/net/gro.h

include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agoMerge branch 'tcp-optimizations'
David S. Miller [Tue, 16 Nov 2021 13:10:35 +0000 (13:10 +0000)]
Merge branch 'tcp-optimizations'

Eric Dumazet says:

====================
tcp: optimizations for linux-5.17

Mostly small improvements in this series.

The notable change is in "defer skb freeing after
socket lock is released" in recvmsg() (and RX zerocopy)

The idea is to try to leave skb freeing to the BH handler,
whenever possible, or at least perform the freeing
outside of the socket lock section, for much improved
performance. This idea can probably be extended
to other protocols.

 Tests on a 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs.

 MTU : 1500  (1428 bytes of TCP payload per MSS)
 Before: 55 Gbit
 After:  66 Gbit

 MTU : 4096+ (4096 bytes of TCP payload, plus TCP/IPv6 headers)
 Before: 82 Gbit
 After:  95 Gbit
====================

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: move early demux fields close to sk_refcnt
Eric Dumazet [Mon, 15 Nov 2021 19:02:49 +0000 (11:02 -0800)]
net: move early demux fields close to sk_refcnt

sk_rx_dst/sk_rx_dst_ifindex/sk_rx_dst_cookie are read in early demux,
and currently span two cache lines.

Moving them close to sk_refcnt makes more sense, as only one cache
line is needed.

New layout for this hot cache line is :

struct sock {
        struct sock_common         __sk_common;          /*     0  0x88 */
        /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
        struct dst_entry *         sk_rx_dst;            /*  0x88   0x8 */
        int                        sk_rx_dst_ifindex;    /*  0x90   0x4 */
        u32                        sk_rx_dst_cookie;     /*  0x94   0x4 */
        socket_lock_t              sk_lock;              /*  0x98  0x20 */
        atomic_t                   sk_drops;             /*  0xb8   0x4 */
        int                        sk_rcvlowat;          /*  0xbc   0x4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
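
The offsets in this listing can be cross-checked with a small userspace model. This is only a sketch: it assumes an LP64 target and 64-byte cache lines, and it replaces struct sock_common and socket_lock_t with opaque byte arrays of the listed sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit (LP64) target");

#define DEMO_CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hypothetical model of the start of struct sock, sizes from the listing. */
struct demo_sock_head {
    unsigned char common[0x88];    /* stand-in for struct sock_common */
    void         *rx_dst;          /* sk_rx_dst */
    int           rx_dst_ifindex;  /* sk_rx_dst_ifindex */
    uint32_t      rx_dst_cookie;   /* sk_rx_dst_cookie */
    unsigned char lock[0x20];      /* stand-in for socket_lock_t */
    int           drops;           /* sk_drops (atomic_t in the kernel) */
    int           rcvlowat;        /* sk_rcvlowat */
};

/* Index of the cache line that contains the given byte offset. */
static size_t demo_cache_line_of(size_t offset)
{
    return offset / DEMO_CACHE_LINE;
}
```

All three early demux fields fall in the cache line that starts at byte 128, and the modelled fields end exactly at the 192-byte boundary.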

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: do not call tcp_cleanup_rbuf() if we have a backlog
Eric Dumazet [Mon, 15 Nov 2021 19:02:48 +0000 (11:02 -0800)]
tcp: do not call tcp_cleanup_rbuf() if we have a backlog

Under pressure, tcp recvmsg() has logic to process the socket backlog,
but calls tcp_cleanup_rbuf() right before.

Avoiding sending ACK right before processing new segments makes
a lot of sense, as this decreases the number of ACK packets,
with no impact on effective ACK clocking.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: check local var (timeo) before socket fields in one test
Eric Dumazet [Mon, 15 Nov 2021 19:02:47 +0000 (11:02 -0800)]
tcp: check local var (timeo) before socket fields in one test

Testing timeo before sk_err/sk_state/sk_shutdown makes more sense.

Modern applications use non-blocking IO, while a socket is terminated
only once during its life time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: defer skb freeing after socket lock is released
Eric Dumazet [Mon, 15 Nov 2021 19:02:46 +0000 (11:02 -0800)]
tcp: defer skb freeing after socket lock is released

tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.

A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.

Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.

This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.

This patch adds logic to defer these frees until after the socket
lock is released, or to perform them directly from BH handler if possible.

Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free asymmetry,
when BH handler and user thread do not run on same cpu or
NUMA node.

One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
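
The deferral pattern can be sketched in userspace as below. This is a simplified single-threaded model with hypothetical names, not the patch itself: while the lock would be held, consumed buffers are only linked onto a list (constant time, no allocator work), and the expensive frees happen in a separate flush step once the lock has been dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_buf {
    struct demo_buf *next;
};

struct demo_rx_sock {
    struct demo_buf *defer_list;   /* buffers consumed but not yet freed */
};

/* Called while the socket lock would be held: just link the buffer. */
static void demo_defer_free(struct demo_rx_sock *sk, struct demo_buf *buf)
{
    assert(sk != NULL && buf != NULL);
    buf->next = sk->defer_list;
    sk->defer_list = buf;
}

/* Called after the lock is released: detach the list and free it.
 * Returns the number of buffers freed. */
static int demo_flush_deferred(struct demo_rx_sock *sk)
{
    struct demo_buf *buf;
    int freed = 0;

    assert(sk != NULL);
    buf = sk->defer_list;
    sk->defer_list = NULL;
    while (buf) {
        struct demo_buf *next = buf->next;

        free(buf);
        buf = next;
        freed++;
    }
    return freed;
}
```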

Tested:
 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs

MTU : 1500
Before: 55 Gbit
After:  66 Gbit

MTU : 4096+(headers)
Before: 82 Gbit
After:  95 Gbit
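
As a quick check of the relative gains, (66 - 55) / 55 is 20%, and (95 - 82) / 82 is roughly 15.9%. The helper below is a hypothetical name used only for this arithmetic.

```c
#include <assert.h>

/* Relative throughput gain in percent. */
static double demo_gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```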

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: avoid indirect calls to sock_rfree
Eric Dumazet [Mon, 15 Nov 2021 19:02:45 +0000 (11:02 -0800)]
tcp: avoid indirect calls to sock_rfree

TCP uses sk_eat_skb() when skbs can be removed from receive queue.
However, the call to skb_orphan() from __kfree_skb() incurs
an indirect call to sock_rfree(), which is more expensive than
a direct call, especially for CONFIG_RETPOLINE=y.

Add tcp_eat_recv_skb() function to make the call before
__kfree_skb().
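
The general pattern can be sketched in userspace as follows. The types and functions are hypothetical stand-ins: when the destructor pointer is known to be the common one, call that function by name and clear the pointer, so the generic free path has no indirect call left to make.

```c
#include <assert.h>
#include <stddef.h>

struct demo_skb {
    void (*destructor)(struct demo_skb *skb);
    int  *rmem;       /* receive-memory counter this buffer is charged to */
    int   truesize;   /* amount charged */
};

/* Stand-in for the common destructor: uncharge receive memory. */
static void demo_rfree(struct demo_skb *skb)
{
    *skb->rmem -= skb->truesize;
}

/* Some other, rarely used destructor. */
static void demo_noop_destructor(struct demo_skb *skb)
{
    (void)skb;
}

/* Returns 1 if the direct-call fast path was taken, 0 otherwise. */
static int demo_eat_recv_skb(struct demo_skb *skb)
{
    assert(skb != NULL);
    if (skb->destructor == demo_rfree) {
        demo_rfree(skb);           /* direct call, no function-pointer jump */
        skb->destructor = NULL;    /* nothing left for the generic path */
        return 1;
    }
    if (skb->destructor) {
        skb->destructor(skb);      /* uncommon case: indirect call */
        skb->destructor = NULL;
    }
    return 0;
}
```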

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: tp->urg_data is unlikely to be set
Eric Dumazet [Mon, 15 Nov 2021 19:02:44 +0000 (11:02 -0800)]
tcp: tp->urg_data is unlikely to be set

Use some unlikely() hints in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: annotate races around tp->urg_data
Eric Dumazet [Mon, 15 Nov 2021 19:02:43 +0000 (11:02 -0800)]
tcp: annotate races around tp->urg_data

tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock
owned.

Also, it is faster to first check tp->urg_data in tcp_poll(),
then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
located in a different/cold cache line.
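
A simplified sketch of both points, assuming GCC or Clang (for `__typeof__`). The `DEMO_` macros are reduced stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() that only handle scalar fields, and the check is much simpler than the real tcp_poll() condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for READ_ONCE()/WRITE_ONCE(): force a single access. */
#define DEMO_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct demo_tp {
    uint16_t urg_data;     /* hot field, usually zero */
    uint32_t urg_seq;      /* assumed to live in a colder cache line */
    uint32_t copied_seq;
};

/* Lockless reader: test the hot field first so the cold one is only
 * touched when urgent data is actually flagged. */
static int demo_urgent_pending(struct demo_tp *tp)
{
    assert(tp != NULL);
    if (!DEMO_READ_ONCE(tp->urg_data))
        return 0;
    return DEMO_READ_ONCE(tp->urg_seq) == DEMO_READ_ONCE(tp->copied_seq);
}
```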

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: annotate data-races on tp->segs_in and tp->data_segs_in
Eric Dumazet [Mon, 15 Nov 2021 19:02:42 +0000 (11:02 -0800)]
tcp: annotate data-races on tp->segs_in and tp->data_segs_in

tcp_segs_in() can be called from BH while the socket spinlock
is held but the socket is owned by a user thread, which may be
reading these fields from tcp_get_info().

Found by code inspection, no need to backport this patch
to older kernels.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: add RETPOLINE mitigation to sk_backlog_rcv
Eric Dumazet [Mon, 15 Nov 2021 19:02:41 +0000 (11:02 -0800)]
tcp: add RETPOLINE mitigation to sk_backlog_rcv

Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: small optimization in tcp recvmsg()
Eric Dumazet [Mon, 15 Nov 2021 19:02:40 +0000 (11:02 -0800)]
tcp: small optimization in tcp recvmsg()

When reading large chunks of data, incoming packets might
be added to the backlog from BH.

tcp recvmsg() detects the backlog queue is not empty, and uses
a release_sock()/lock_sock() pair to process this backlog.

We now have __sk_flush_backlog() to perform this
a bit faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: cache align tcp_memory_allocated, tcp_sockets_allocated
Eric Dumazet [Mon, 15 Nov 2021 19:02:39 +0000 (11:02 -0800)]
net: cache align tcp_memory_allocated, tcp_sockets_allocated

tcp_memory_allocated and tcp_sockets_allocated often share
a common cache line, source of false sharing.

Also take care of udp_memory_allocated and mptcp_sockets_allocated.
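
A userspace sketch of the alignment fix, assuming 64-byte cache lines. The variable names are hypothetical stand-ins for the global counters: giving each one cache-line alignment guarantees the two never share a line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Two frequently written global counters, each on its own cache line. */
static alignas(DEMO_CACHE_LINE) long demo_memory_allocated;
static alignas(DEMO_CACHE_LINE) long demo_sockets_allocated;

/* Index of the cache line that holds the given address. */
static uintptr_t demo_line_index(const void *p)
{
    assert(p != NULL);
    return (uintptr_t)p / DEMO_CACHE_LINE;
}
```

Without the alignment the linker is free to place both 8-byte counters back to back, so writes to one from one CPU would invalidate the line holding the other on every other CPU.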

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: forward_alloc_get depends on CONFIG_MPTCP
Eric Dumazet [Mon, 15 Nov 2021 19:02:38 +0000 (11:02 -0800)]
net: forward_alloc_get depends on CONFIG_MPTCP

(struct proto)->forward_alloc_get is currently only used by MPTCP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: shrink struct sock by 8 bytes
Eric Dumazet [Mon, 15 Nov 2021 19:02:37 +0000 (11:02 -0800)]
net: shrink struct sock by 8 bytes

Move sk_bind_phc next to sk_peer_lock to fill a hole.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agoipv6: shrink struct ipcm6_cookie
Eric Dumazet [Mon, 15 Nov 2021 19:02:36 +0000 (11:02 -0800)]
ipv6: shrink struct ipcm6_cookie

gso_size can be moved after tclass, to use an existing hole.
(8 bytes saved on 64bit arches)
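
The effect of filling a hole can be reproduced with two hypothetical structs on an LP64 target. The field set below is only an illustration and is not claimed to match the real struct ipcm6_cookie.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit (LP64) target");

/* Before: the 16-bit field trails the pointer, leaving a 3-byte hole
 * ahead of the pointer and 6 bytes of tail padding. */
struct demo_cookie_before {
    int16_t  hlimit;
    int16_t  tclass;
    int8_t   dontfrag;
    void    *opt;
    uint16_t gso_size;
};

/* After: the 16-bit field is moved next to tclass and uses the hole. */
struct demo_cookie_after {
    int16_t  hlimit;
    int16_t  tclass;
    uint16_t gso_size;
    int8_t   dontfrag;
    void    *opt;
};
```

Under these assumptions the first layout occupies 24 bytes and the second 16 bytes.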

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: remove sk_route_nocaps
Eric Dumazet [Mon, 15 Nov 2021 19:02:35 +0000 (11:02 -0800)]
net: remove sk_route_nocaps

Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.
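
A sketch of the single-bit approach. The feature bits, struct, and function names here are all hypothetical: since the "no caps" mask only ever removed the GSO bits, one flag is enough to decide whether to strip them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_features_t;

#define DEMO_F_SG       (1ULL << 0)      /* arbitrary bit for the demo */
#define DEMO_F_GSO_MASK (0xffULL << 16)  /* arbitrary bit range for the demo */

struct demo_sk {
    demo_features_t route_caps;
    unsigned int    gso_disabled : 1;    /* replaces a full 64-bit mask */
};

/* Derive the socket's capabilities from the device features. */
static void demo_setup_caps(struct demo_sk *sk, demo_features_t dev_features)
{
    assert(sk != NULL);
    sk->route_caps = dev_features;
    if (sk->gso_disabled)
        sk->route_caps &= ~DEMO_F_GSO_MASK;
}
```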

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: remove sk_route_forced_caps
Eric Dumazet [Mon, 15 Nov 2021 19:02:34 +0000 (11:02 -0800)]
net: remove sk_route_forced_caps

We were only using one bit, and we can replace it by sk_is_tcp()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agonet: use sk_is_tcp() in more places
Eric Dumazet [Mon, 15 Nov 2021 19:02:33 +0000 (11:02 -0800)]
net: use sk_is_tcp() in more places

Move sk_is_tcp() to include/net/sock.h and use it where we can.
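
For illustration, such a predicate can be modelled in userspace as below. The struct is a hypothetical stand-in carrying only a socket type and protocol, and the check shown is an assumption about what "is TCP" means here, not a copy of the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

struct demo_sock_id {
    int type;       /* e.g. SOCK_STREAM */
    int protocol;   /* e.g. IPPROTO_TCP */
};

/* True only for stream sockets that speak TCP. */
static bool demo_sk_is_tcp(const struct demo_sock_id *sk)
{
    assert(sk != NULL);
    return sk->type == SOCK_STREAM && sk->protocol == IPPROTO_TCP;
}
```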

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agotcp: small optimization in tcp_v6_send_check()
Eric Dumazet [Mon, 15 Nov 2021 19:02:32 +0000 (11:02 -0800)]
tcp: small optimization in tcp_v6_send_check()

For TCP flows, inet6_sk(sk)->saddr has the same value
as sk->sk_v6_rcv_saddr.

Using sk->sk_v6_rcv_saddr increases data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>