OSDN Git Service

uclinux-h8/linux.git
Yevgeny Kliteynik [Wed, 28 Oct 2020 23:35:47 +0000 (01:35 +0200)]
net/mlx5: DR, Avoid unnecessary csum recalculation on supporting devices

If as part of the actions the TTL of the packet is modified, the packet's
checksum needs to be recalculated. Connect-X6DX can handle this csum
recalculation natively. Older devices require this additional recalculation.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Saeed Mahameed [Fri, 8 Jan 2021 04:49:57 +0000 (20:49 -0800)]
net/mlx5e: CT: remove useless conversion to PTR_ERR then ERR_PTR

Just return the ptr directly.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Saeed Mahameed [Fri, 8 Jan 2021 04:24:12 +0000 (20:24 -0800)]
net/mlx5e: accel, remove redundant space

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tariq Toukan [Sun, 3 Jan 2021 09:34:04 +0000 (11:34 +0200)]
net/mlx5e: kTLS, Improve TLS RX workqueue scope

The TLS RX workqueue is needed only when kTLS RX device offload
is supported.

Move its creation from the general TLS init function to the
kTLS RX init.
Create it once at init time if supported, and avoid creating/destroying it
every time the feature bit is toggled.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tom Rix [Wed, 23 Dec 2020 19:45:12 +0000 (11:45 -0800)]
net/mlx5e: remove h from printk format specifier

This change fixes the checkpatch warning described in
commit cbacb5ab0aa0 ("docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi]")

Standard integer promotion is already done, so %hx and %hhx are useless;
do not encourage the use of %hh[xudi] or %h[xudi].

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Noam Stolero [Tue, 8 Dec 2020 07:31:36 +0000 (09:31 +0200)]
net/mlx5e: Increase indirection RQ table size to 256

Increasing the indirection RQ table size from 128 to 256 improves the
packet distribution over the NIC HW queues for various cases.

Let's take a look at the following scenario:
Assume the RSS result is distributed uniformly and the indirection table is
filled with queues in a cyclic manner.
Let N be the number of queues on a given setup.
If 256%N = 128%N = 0, then all queues have the same probability of being
chosen for a given RSS result.
This case is neither improved nor degraded by this change.

If 256%N != 0 and 128%N != 0, there is a remainder which will favor some
queues. Increasing the indirection RQ table size to 256 reduces the ratio
between the selection probability of the favored queues and that of the
rest of the queues, and so improves the distribution.

For example, let's assume the number of queues is 56.
For a table size of 128, we have 128%56=16 queues which will have a 3/128
probability to be chosen and 2/128 for the remaining 40.
These 16 queues are 1.5 times as likely to be chosen as the other 40.

For a table size of 256, we have 256%56=32 queues which will have a 5/256
probability to be chosen and 4/256 probability for the remaining 24 queues.
Here the 32 queues are 1.25 times as likely to be chosen as the other 24.

This shows that the larger indirection table size is more likely to produce
an even distribution.
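The arithmetic above can be reproduced with a short sketch. It assumes, as the text does, a uniform RSS result and a table filled cyclically; the helper name favored_queues is made up for this example and is not part of the driver.

```python
from fractions import Fraction


def favored_queues(table_size: int, num_queues: int):
    """Return (number of favored queues, favored/regular probability ratio).

    With a cyclic fill, every queue gets `base` table entries and the first
    `extra` queues get one more.
    """
    base, extra = divmod(table_size, num_queues)
    if extra == 0:
        return 0, Fraction(1)  # perfectly even, nobody is favored
    return extra, Fraction(base + 1, base)


for size in (128, 256):
    count, ratio = favored_queues(size, 56)
    print(f"table={size}: {count} favored queues, ratio {float(ratio)}")
```

For 56 queues this gives 16 favored queues at a ratio of 1.5 for a table of 128, and 32 favored queues at a ratio of 1.25 for a table of 256, matching the numbers in the text.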

This change also aligns our mlx5 driver's indirection table size with
other vendors.

Signed-off-by: Noam Stolero <noams@nvidia.com>
Reviewed-by: Tal Gilboa <talgi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Tariq Toukan [Mon, 30 Dec 2019 12:41:53 +0000 (14:41 +0200)]
net/mlx5e: Enable napi in channel's activation stage

The channel's napi is first needed upon activation, not creation.
Minimize its enabled scope by moving it from the channel's open/close
stage into the activate/deactivate stage.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Wed, 16 Sep 2020 07:11:38 +0000 (10:11 +0300)]
net/mlx5e: Move representor neigh init into profile enable

Also cleanup neigh in profile disable.
This is for logical separation.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Mon, 25 Jan 2021 09:28:07 +0000 (11:28 +0200)]
net/mlx5e: Avoid false lock dependency warning on tc_ht

To avoid a false lock dependency warning, set the tc_ht lock class to be
different from the lock class of the ht being used. When deleting the last
flow from a group and then deleting the group, we get into
del_sw_flow_group(), which calls rhashtable_destroy() on fg->ftes_hash.
That takes ht->mutex, but it is a different ht->mutex than the one here.

======================================================
WARNING: possible circular locking dependency detected
5.11.0-rc4_net_next_mlx5_949fdcc #1 Not tainted
------------------------------------------------------
modprobe/12950 is trying to acquire lock:
ffff88816510f910 (&node->lock){++++}-{3:3}, at: mlx5_del_flow_rules+0x2a/0x210 [mlx5_core]

but task is already holding lock:
ffff88815834e3e8 (&ht->mutex){+.+.}-{3:3}, at: rhashtable_free_and_destroy+0x37/0x340

which lock already depends on the new lock.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Wed, 16 Sep 2020 07:11:12 +0000 (10:11 +0300)]
net/mlx5e: Move set vxlan nic info to profile init

Since it is profile dependent, let's init the vxlan info
as part of profile initialization.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Wed, 16 Sep 2020 07:10:42 +0000 (10:10 +0300)]
net/mlx5e: Move netif_carrier_off() out of mlx5e_priv_init()

It's not part of priv initialization.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Wed, 16 Sep 2020 07:10:30 +0000 (10:10 +0300)]
net/mlx5e: Refactor mlx5e_netdev_init/cleanup to mlx5e_priv_init/cleanup

We actually initialize priv and not netdev. The only call to
set netdev carrier will be moved in the following commit.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Saeed Mahameed [Mon, 23 Mar 2020 06:17:14 +0000 (23:17 -0700)]
net/mxl5e: Add change profile method

Port nic netdevice will be used as uplink representor in downstream
patches. Add change profile method to allow changing a mlx5e netdevice
profile dynamically.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Saeed Mahameed [Tue, 25 Feb 2020 22:37:39 +0000 (14:37 -0800)]
net/mlx5e: Separate between netdev objects and mlx5e profiles initialization

1) Initialize netdevice features and structures on netdevice allocation
   and outside of the mlx5e profile.

2) As mlx5e netdevice private params will now be set up on profile init only
   after netdevice features are already set, we add a call to
   netdev_update_features() to resolve any conflict.
   This is nice since we reuse the fix_features ndo code if a profile
   wants different default features, instead of duplicating features
   conflict resolution code on profile initialization.

3) With this we achieve total separation between mlx5e profiles and
   netdevices, and will allow replacing mlx5e profiles on the fly to reuse
   the same netdevice for multiple profiles.
   e.g. for uplink representor profile as shown in the following patch

4) Profile callbacks are not allowed to touch netdev->features directly
   anymore, since in downstream patch we will detach/attach netdev
   dynamically to profile, hence we move the code dealing with
   netdev->features from profile->init() to fix_features ndo, and we
   will call netdev_update_features() on
   mlx5e_attach_netdev(profile, netdev);

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Jakub Kicinski [Tue, 2 Feb 2021 04:21:14 +0000 (20:21 -0800)]
Merge branch 'rework-the-memory-barrier-for-scrq-entry'

Lijun Pan says:

====================
rework the memory barrier for SCRQ entry

This series reworks the memory barrier for SCRQ (Sub-Command-Response
Queue) entry.
====================

Link: https://lore.kernel.org/r/20210130011905.1485-1-ljp@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lijun Pan [Sat, 30 Jan 2021 01:19:05 +0000 (19:19 -0600)]
ibmvnic: remove unnecessary rmb() inside ibmvnic_poll

rmb() can be removed since:
1. pending_scrq() has dma_rmb() at the function end;
2. dma_rmb(), though weaker, is enough here.

Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Acked-by: Dwip Banerjee <dnbanerg@us.ibm.com>
Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lijun Pan [Sat, 30 Jan 2021 01:19:04 +0000 (19:19 -0600)]
ibmvnic: rework to ensure SCRQ entry reads are properly ordered

Move the dma_rmb() between pending_scrq() and ibmvnic_next_scrq()
into the end of pending_scrq() to save the duplicated code since
this dma_rmb will be used 3 times.

Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 2 Feb 2021 02:50:12 +0000 (18:50 -0800)]
Merge tag 'mlx5-dr-2021-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-dr-2021-01-29

Add support for Connect-X6DX Software steering

This series adds SW Steering support for Connect-X6DX.

Since the STE and action formats are different on this new HW,
we implemented the HW specific STEv1 layer on top of the infrastructure
introduced in the previous mlx5 DR patchset, to support the same
functionality as on previous devices.

Most of the code in this series is very low level and HW specific; we
implement the function pointers for the generic SW steering layer.

* tag 'mlx5-dr-2021-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Allow SW steering for sw_owner_v2 devices
  net/mlx5: DR, Copy all 64B whenever replacing STE in the head of miss-list
  net/mlx5: DR, Use HW specific logic API when writing STE
  net/mlx5: DR, Use the right size when writing partial STE into HW
  net/mlx5: DR, Add STEv1 modify header logic
  net/mlx5: DR, Add STEv1 action apply logic
  net/mlx5: DR, Add STEv1 setters and getters
  net/mlx5: DR, Allow native protocol support for HW STEv1
  net/mlx5: DR, Add HW STEv1 match logic
  net/mlx5: DR, Add match STEv1 structs to ifc
  net/mlx5: DR, Fix potential shift wrapping of 32-bit value
====================

Link: https://lore.kernel.org/r/20210130022618.317351-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 2 Feb 2021 02:28:35 +0000 (18:28 -0800)]
Merge branch 'net-dsa-hellcreek-report-tables-sizes'

Kurt Kanzenbach says:

====================
net: dsa: hellcreek: Report tables sizes

Florian, Andrew and Vladimir suggested at some point to use devlink for
reporting tables, features and debugging counters instead of using debugfs and
printk.

So, start by reporting the VLAN and FDB table sizes.
====================

Link: https://lore.kernel.org/r/20210130135934.22870-1-kurt@kmk-computers.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kurt Kanzenbach [Sat, 30 Jan 2021 13:59:34 +0000 (14:59 +0100)]
net: dsa: hellcreek: Report FDB table occupancy

Report the FDB table size and occupancy via devlink. The actual size depends on
the used Hellcreek version:

|root@tsn:~# devlink resource show platform/ff240000.switch
|platform/ff240000.switch:
|  name VLAN size 4096 occ 2 unit entry dpipe_tables none
|  name FDB size 256 occ 6 unit entry dpipe_tables none

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kurt Kanzenbach [Sat, 30 Jan 2021 13:59:33 +0000 (14:59 +0100)]
net: dsa: hellcreek: Report VLAN table occupancy

The VLAN membership configuration is cached in software already. So, it can be
reported via devlink. Add support for it:

|root@tsn:~# devlink resource show platform/ff240000.switch
|platform/ff240000.switch:
|  name VLAN size 4096 occ 4 unit entry dpipe_tables none

Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Neal Cardwell [Fri, 29 Jan 2021 18:54:38 +0000 (13:54 -0500)]
tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size

This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsigned 31 bit field.

This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent bug
fix commit 9d9b1ee0b2d1 ("tcp: fix TCP_USER_TIMEOUT with zero window").

This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0
or a positive MTU value returned by tcp_mss_to_mtu().
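The size saving can be sketched with Python's ctypes bit fields. This is a reduced model that covers only the two fields named above; the class names are made up for the example, and the real struct contains further members that are not shown here.

```python
import ctypes


class MtupBefore(ctypes.Structure):
    # Two full 32-bit members, as described for the old layout.
    _fields_ = [
        ("enabled", ctypes.c_int32),
        ("probe_size", ctypes.c_int32),
    ]


class MtupAfter(ctypes.Structure):
    # Both members packed into a single 32-bit word: 31 bits + 1 bit.
    _fields_ = [
        ("probe_size", ctypes.c_uint32, 31),
        ("enabled", ctypes.c_uint32, 1),
    ]


saved = ctypes.sizeof(MtupBefore) - ctypes.sizeof(MtupAfter)

m = MtupAfter()
m.enabled = 1
m.probe_size = 65535  # a large MTU still fits comfortably in 31 bits
print(saved, m.enabled, m.probe_size)
```

Since enabled only ever holds 0 or 1 and probe_size is never negative, neither field loses any value it could previously take.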

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210129185438.1813237-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 30 Jan 2021 05:33:04 +0000 (21:33 -0800)]
Merge branch 'net-bridge-drop-hosts-limit-sysfs-and-add-a-comment'

Nikolay Aleksandrov says:

====================
net: bridge: drop hosts limit sysfs and add a comment

As recently discussed[1] we should stop extending the bridge sysfs
support for new options and move to using netlink only, so patch 01
drops the recently added hosts limit sysfs support which is still in
net-next only and patch 02 adds comments in br_sysfs_br/if.c to warn
against adding new sysfs options.

[1] https://lore.kernel.org/netdev/20210128105201.7c6bed82@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/T/#mda7265b2e57b52bdab863f286efa85291cf83822
====================

Link: https://lore.kernel.org/r/20210129115142.188455-2-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nikolay Aleksandrov [Fri, 29 Jan 2021 11:51:42 +0000 (13:51 +0200)]
net: bridge: add warning comments to avoid extending sysfs

We're moving to netlink-only options, so add comments in the bridge's
sysfs files to warn against adding any new sysfs entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nikolay Aleksandrov [Fri, 29 Jan 2021 11:51:41 +0000 (13:51 +0200)]
net: bridge: mcast: drop hosts limit sysfs support

We decided to stop adding new sysfs bridge options and continue with
netlink only, so remove hosts limit sysfs support.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 30 Jan 2021 05:25:27 +0000 (21:25 -0800)]
Merge branch 'tag_8021q-for-ocelot-switches'

Vladimir Oltean says:

====================
tag_8021q for Ocelot switches

The Felix switch inside LS1028A has an issue. It has a 2.5G CPU port,
and the external ports, in the majority of use cases, run at 1G. This
means that, when the CPU injects traffic into the switch, it is very
easy to run into congestion. This is not to say that it is impossible to
enter congestion even with all ports running at the same speed, just
that the default configuration is already very prone to that by design.

Normally, the way to deal with that is using Ethernet flow control
(PAUSE frames).

However, this functionality is not working today with the ENETC - Felix
switch pair. The hardware issue is undergoing documentation right now as
an erratum within NXP, but several customers have been requesting a
reasonable workaround for it.

In truth, the LS1028A has 2 internal port pairs. The lack of flow control
is an issue only when NPI mode (Node Processor Interface, aka the mode
where the "CPU port module", which carries DSA-style tagged packets, is
connected to a regular Ethernet port) is used, and NPI mode is supported
by Felix on a single port.

In past BSPs, we have had setups where both internal port pairs were
enabled. We were advertising the following setup:

"data port"     "control port"
  (2.5G)            (1G)

   eno2             eno3
    ^                ^
    |                |
    | regular        | DSA-tagged
    | frames         | frames
    |                |
    v                v
   swp4             swp5

This works but is highly impractical, due to NXP shifting the task of
designing a functional system (choosing which port to use, depending on
type of traffic required) up to the end user. The swpN interfaces would
have to be bridged with swp4, in order for the eno2 "data port" to have
access to the outside network. And the swpN interfaces would still be
capable of IP networking. So running a DHCP client would give us two IP
interfaces from the same subnet, one assigned to eno2, and the other to
swpN (0, 1, 2, 3).

Also, the dual port design doesn't scale. When attaching another DSA
switch to a Felix port, the end result is that the "data port" cannot
carry any meaningful data to the external world, since it lacks the DSA
tags required to traverse the sja1105 switches below. All that traffic
needs to go through the "control port".

So in newer BSPs there was a desire to simplify that setup, and only
have one internal port pair:

   eno2            eno3
    ^
    |
    | DSA-tagged    x disabled
    | frames
    |
    v
   swp4            swp5

However, this setup only exacerbates the issue of not having flow
control on the NPI port, since that is the only port now. Also, there
are use cases that still require the "data port", such as IEEE 802.1CB
(TSN stream identification doesn't work over an NPI port), source
MAC address learning over NPI, etc.

Again, there is a desire to keep the simplicity of the single internal
port setup, while regaining the benefits of having a dedicated data port
as well. And this series attempts to deliver just that.

So the NPI functionality is disabled conditionally. Its purpose was:
- To ensure individually addressable ports on TX. This can be replaced
  by using some designated VLAN tags which are pushed by the DSA tagger
  code, then removed by the switch (so they are invisible to the outside
  world and to the user).
- To ensure source port identification on RX. Again, this can be
  replaced by using some designated VLAN tags to encapsulate all RX
  traffic (each VLAN uniquely identifies a source port). The DSA tagger
  determines which port it was based on the VLAN number, then removes
  that header.
- To deliver PTP timestamps. This cannot be obtained through VLAN
  headers, so we need to take a step back and see how else we can do
  that. The Microchip Ocelot-1 (VSC7514 MIPS) driver performs manual
  injection/extraction from the CPU port module using register-based
  MMIO, and not over Ethernet. We will need to do the same from DSA,
  which makes this tagger a sort of hybrid between DSA and pure
  switchdev.
====================

Link: https://lore.kernel.org/r/20210129010009.3959398-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:09 +0000 (03:00 +0200)]
net: dsa: felix: perform switch setup for tag_8021q

Unlike sja1105, the only other user of the software-defined tag_8021q.c
tagger format, the implementation we choose for the Felix DSA switch
driver preserves full functionality under a vlan_filtering bridge
(i.e. IP termination works through the DSA user ports under all
circumstances).

The tag_8021q protocol just wants:
- Identifying the ingress switch port based on the RX VLAN ID, as seen
  by the CPU. We achieve this by using the TCAM engines (which are also
  used for tc-flower offload) to push the RX VLAN as a second, outer
  tag, on egress towards the CPU port.
- Steering traffic injected into the switch from the network stack
  towards the correct front port based on the TX VLAN, and consuming
  (popping) that header on the switch's egress.
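The two requirements above can be modelled in a few lines. This is only a sketch: the VLAN ID bases and all four function names are hypothetical, chosen for the example, and do not reflect the real tag_8021q VLAN ID encoding or the driver code.

```python
# Hypothetical VID ranges, chosen only for this sketch.
RX_VID_BASE = 1024
TX_VID_BASE = 2048


def switch_egress_to_cpu(vlan_stack, ingress_port):
    """Switch side: push the per-port RX VLAN as a second, outer tag."""
    return [RX_VID_BASE + ingress_port] + vlan_stack


def tagger_rcv(vlan_stack):
    """CPU side: recover the source port from the outer tag, then strip it."""
    outer, inner = vlan_stack[0], vlan_stack[1:]
    return outer - RX_VID_BASE, inner


def tagger_xmit(vlan_stack, egress_port):
    """CPU side: push the per-port TX VLAN to steer the frame."""
    return [TX_VID_BASE + egress_port] + vlan_stack


def switch_ingress_from_cpu(vlan_stack):
    """Switch side: pick the front port from the TX VLAN and pop that tag."""
    outer, inner = vlan_stack[0], vlan_stack[1:]
    return outer - TX_VID_BASE, inner


# A frame already carrying VLAN 100 arrives on front port 2.
port, stack = tagger_rcv(switch_egress_to_cpu([100], ingress_port=2))
print(port, stack)
```

In both directions the extra tag exists only between the switch and the CPU, so the frame's own VLAN stack is returned unchanged.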

A tc-flower pseudocode of the static configuration done by the driver
would look like this:

$ tc qdisc add dev <cpu-port> clsact
$ for eth in swp0 swp1 swp2 swp3; do \
tc filter add dev <cpu-port> egress flower indev ${eth} \
action vlan push id <rxvlan> protocol 802.1ad; \
tc filter add dev <cpu-port> ingress protocol 802.1Q flower \
vlan_id <txvlan> action vlan pop \
action mirred egress redirect dev ${eth}; \
done

but of course since DSA does not register network interfaces for the CPU
port, this configuration would be impossible for the user to do. Also,
due to the same reason, it is impossible for the user to inadvertently
delete these rules using tc. These rules do not collide in any way with
tc-flower, they just consume some TCAM space, which is something we can
live with.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:08 +0000 (03:00 +0200)]
net: dsa: add a second tagger for Ocelot switches based on tag_8021q

There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.

Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
  source address learning is performed, no TSN stream classification,
  etc.
- Flow control is not functional over an NPI port (PAUSE frames are
  encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
  in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
  port is currently either disabled, or operated as a plain user port
  (albeit an internally-facing one). Having the ability to configure the
  two CPU ports symmetrically could pave the way for e.g. creating a LAG
  between them, to increase bandwidth seamlessly for the system.

So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:

echo ocelot-8021q > /sys/class/net/<dsa-master>/dsa/tagging

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:07 +0000 (03:00 +0200)]
net: dsa: felix: convert to the new .change_tag_protocol DSA API

In expectation of the new tag_ocelot_8021q tagger implementation, we
need to be able to do runtime switchover between one tagger and another.
So we must structure the existing code for the current NPI-based tagger
in a certain way.

We move the felix_npi_port_init function in expectation of the future
driver configuration necessary for tag_ocelot_8021q: we would like to
not have the NPI-related bits interspersed with the tag_8021q bits.

The conversion from this:

ocelot_write_rix(ocelot,
 ANA_PGID_PGID_PGID(GENMASK(ocelot->num_phys_ports, 0)),
 ANA_PGID_PGID, PGID_UC);

to this:

cpu_flood = ANA_PGID_PGID_PGID(BIT(ocelot->num_phys_ports));
ocelot_rmw_rix(ocelot, cpu_flood, cpu_flood, ANA_PGID_PGID, PGID_UC);

is perhaps non-trivial, but is nonetheless not a functional change. The PGID_UC
(replicator for unknown unicast) is already configured out of hardware
reset to flood to all ports except ocelot->num_phys_ports (the CPU port
module). All we change is that we use a read-modify-write to only add
the CPU port module to the unknown unicast replicator, as opposed to
doing a full write to the register.
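The equivalence can be checked with a small model of the register. The helpers below only mirror the semantics of GENMASK(), BIT() and a read-modify-write; the port count of 6 is an assumption made for the example, and the reset value is taken from the description above rather than from a datasheet.

```python
def genmask(high, low):
    """Bits high..low set, like the kernel's GENMASK(high, low)."""
    return ((1 << (high - low + 1)) - 1) << low


def bit(n):
    """Single bit n set, like the kernel's BIT(n)."""
    return 1 << n


def rmw(current, val, mask):
    """Read-modify-write: replace only the bits selected by mask."""
    return (current & ~mask) | (val & mask)


NUM_PHYS_PORTS = 6  # assumed port count, for illustration only

# Out of reset: flood to all ports except the CPU port module.
reset_value = genmask(NUM_PHYS_PORTS - 1, 0)

# Old code: full write of all front ports plus the CPU port module.
full_write = genmask(NUM_PHYS_PORTS, 0)

# New code: only add the CPU port module bit to whatever is there.
cpu_flood = bit(NUM_PHYS_PORTS)
after_rmw = rmw(reset_value, cpu_flood, cpu_flood)

print(hex(full_write), hex(after_rmw))
```

Starting from the reset value, both approaches leave the same bits set. They differ only if the register was changed beforehand, in which case the read-modify-write preserves the other bits.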

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:06 +0000 (03:00 +0200)]
net: dsa: allow changing the tag protocol via the "tagging" device attribute

Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot

which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").

It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging

This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.

In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.

Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
  the initial tagging protocol driver. Nonetheless, new drivers should
  report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
  protocol, no later than the .setup() method, as well as destroying
  resources used by the last tagger in use, no earlier than the
  .teardown() method.

For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
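The multi-switch failure mode described above can be sketched as follows. The Switch class and tree_change_tag_protocol() are hypothetical stand-ins written for this example, not the DSA core code; they only model a per-switch transactional change and the best-effort revert across the tree.

```python
class Switch:
    """Hypothetical switch whose change_tag_protocol is transactional."""

    def __init__(self, name, proto, unsupported=()):
        self.name = name
        self.proto = proto
        self.unsupported = set(unsupported)

    def change_tag_protocol(self, proto):
        if proto in self.unsupported:
            # On failure the switch keeps its old protocol.
            raise OSError(f"{self.name}: cannot switch to {proto}")
        self.proto = proto


def tree_change_tag_protocol(switches, new_proto):
    """Change every switch; on failure, try to revert those already changed."""
    old_proto = switches[0].proto
    changed = []
    try:
        for sw in switches:
            sw.change_tag_protocol(new_proto)
            changed.append(sw)
    except OSError:
        for sw in reversed(changed):
            # This revert may itself raise, which is the case that would
            # leave the tree in an inconsistent state.
            sw.change_tag_protocol(old_proto)
        return False
    return True


# The 5th out of 7 switches fails to change the tag protocol.
tree = [Switch(f"sw{i}", "ocelot") for i in range(7)]
tree[4].unsupported.add("ocelot-8021q")
ok = tree_change_tag_protocol(tree, "ocelot-8021q")
print(ok, [sw.proto for sw in tree])
```

Here the revert succeeds, so the whole tree ends up back on the original protocol.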

Testing:

$ insmod mscc_felix.ko
[   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[   97.261724] libphy: VSC9959 internal MDIO bus: probed
[   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[   98.527948] device eno2 entered promiscuous mode
[   98.544755] DSA: tree 0 setup

$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
--- 10.0.0.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms

$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
        #!/bin/bash

        ip link set swp0 down
        ip link set swp1 down
        ip link set swp2 down
        ip link set swp3 down
        ip link set swp5 down
        ip link set eno2 down
        echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
        ip link set eno2 up
        ip link set swp0 up
        ip link set swp1 up
        ip link set swp2 up
        ip link set swp3 up
        ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[  645.544426] mscc_felix 0000:00:00.5: Link is Down
[  645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:05 +0000 (03:00 +0200)]
net: dsa: keep a copy of the tagging protocol in the DSA switch tree

Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setup, DSA works without any special treatment
compared to a single switch - the switches just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.

So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs attribute installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.

It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).

This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?

One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.

Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
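
A minimal sketch of the conflict check that a single per-tree copy makes
possible. The types and the function name below are hypothetical
simplifications for illustration, not the kernel's struct dsa_switch_tree
or its API, and reference counting is left out:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for illustration only. */
struct tag_ops_sketch {
	const char *name;
};

struct tree_sketch {
	const struct tag_ops_sketch *tag_ops; /* single master copy per tree */
};

/*
 * Called once per CPU port found in the tree. Returns 0 on success and
 * -1 when the port asks for a different protocol than the tree holds.
 */
static inline int tree_adopt_tag_ops(struct tree_sketch *tree,
				     const struct tag_ops_sketch *ops)
{
	if (tree->tag_ops)
		return tree->tag_ops == ops ? 0 : -1;

	tree->tag_ops = ops;
	return 0;
}
```

The first CPU port decides the protocol; any later port either agrees or
is rejected up front.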

There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce an unnecessary performance penalty going through yet
another indirection, so keep those right where they are.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: dsa: document the existing switch tree notifiers and add a new one
Vladimir Oltean [Fri, 29 Jan 2021 01:00:04 +0000 (03:00 +0200)]
net: dsa: document the existing switch tree notifiers and add a new one

The existence of dsa_broadcast has generated some confusion in the past:
https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html

So let's document the existing dsa_port_notify and dsa_broadcast
functions and explain when each of them should be used.

Also, in fact, the in-between function has always been there but was
lacking a name, and is the main reason for this patch: dsa_tree_notify.
Refactor dsa_broadcast to use it.

This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
where it really belonged in the first place, but had no companion so it
stood with dsa_port_notify.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: mscc: ocelot: don't use NPI tag prefix for the CPU port module
Vladimir Oltean [Fri, 29 Jan 2021 01:00:03 +0000 (03:00 +0200)]
net: mscc: ocelot: don't use NPI tag prefix for the CPU port module

Context: Ocelot switches put the injection/extraction frame header in
front of the Ethernet header. When used in NPI mode, a DSA master would
see junk instead of the destination MAC address, and it would most
likely drop the packets. So the Ocelot frame header can have an optional
prefix, which is just "ff:ff:ff:ff:ff:fe > ff:ff:ff:ff:ff:ff" padding
put before the actual tag (still before the real Ethernet header) such
that the DSA master thinks it's looking at a broadcast frame with a
strange EtherType.

Unfortunately, a lesson learned in commit 69df578c5f4b ("net: mscc:
ocelot: eliminate confusion between CPU and NPI port") seems to have
been forgotten in the meantime.

The CPU port module and the NPI port have independent settings for the
length of the tag prefix. However, the driver is using the same variable
to program both of them.

There is no reason really to use any tag prefix with the CPU port
module, since that is not connected to any Ethernet port. So this patch
makes the inj_prefix and xtr_prefix variables apply only to the NPI
port (which the switchdev ocelot_vsc7514 driver does not use).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: mscc: ocelot: reapply bridge forwarding mask on bonding join/leave
Vladimir Oltean [Fri, 29 Jan 2021 01:00:02 +0000 (03:00 +0200)]
net: mscc: ocelot: reapply bridge forwarding mask on bonding join/leave

Applying the bridge forwarding mask is currently done only on STP
state changes for any port. But the mask depends on both STP state
changes and bonding interface state changes. Export the bit that
recalculates the forwarding mask so that it can be reused, and call it
when a port starts and stops offloading a bonding interface.

Now that the logic is split into a separate function, we can rename "p"
to "port", since the "port" variable was already taken in
ocelot_bridge_stp_state_set. Also, we can rename "i" to "lag", to make
it clearer what we're iterating through.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: mscc: ocelot: store a namespaced VCAP filter ID
Vladimir Oltean [Fri, 29 Jan 2021 01:00:01 +0000 (03:00 +0200)]
net: mscc: ocelot: store a namespaced VCAP filter ID

We will be adding some private VCAP filters that should not interfere in
any way with the filters added using tc-flower. So we need to allocate
some IDs which will not be used by tc.

Currently ocelot uses a u32 id derived from the flow cookie, which is
an unsigned long. This is already a problem, since on 64 bit systems,
sizeof(unsigned long)=8, so the driver is truncating these.

Create a struct ocelot_vcap_id which contains the full unsigned long
cookie from tc, as well as a boolean that separates the filters added
by tc from the ones that aren't.
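
To illustrate the truncation problem and the namespacing idea, here is a
rough sketch. The struct and function names are hypothetical and only
mirror the description above (full cookie plus a boolean); they are not
the driver's actual definitions, and uint64_t stands in for a 64-bit
unsigned long:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in: the full cookie plus a namespace flag. */
struct vcap_id_sketch {
	uint64_t cookie;   /* full 64-bit cookie, no truncation */
	bool tc_offload;   /* true: added by tc-flower; false: driver-private */
};

/* Old scheme: keep only the low 32 bits of the cookie. */
static inline uint32_t truncated_id(uint64_t cookie)
{
	return (uint32_t)cookie;
}

/* Two IDs match only if both the cookie and the namespace match. */
static inline bool vcap_id_equal(struct vcap_id_sketch a,
				 struct vcap_id_sketch b)
{
	return a.cookie == b.cookie && a.tc_offload == b.tc_offload;
}
```

Two distinct cookies that share their low 32 bits collide under
truncated_id() but stay distinct under vcap_id_equal(), and a private
filter never matches a tc filter even when the cookie values coincide.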

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: mscc: ocelot: export VCAP structures to include/soc/mscc
Vladimir Oltean [Fri, 29 Jan 2021 01:00:00 +0000 (03:00 +0200)]
net: mscc: ocelot: export VCAP structures to include/soc/mscc

The Felix driver will need to preinstall some VCAP filters for its
tag_8021q implementation (outside of the tc-flower offload logic), so
these need to be exported to the common includes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN
Vladimir Oltean [Fri, 29 Jan 2021 00:59:59 +0000 (02:59 +0200)]
net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN

The sja1105 implementation can be blind about this, but the felix driver
doesn't do exactly what it's being told, so it needs to know whether it
is a TX or an RX VLAN, so it can install the appropriate type of TCAM
rule.
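
As an illustration of what such helpers can look like, the sketch below
assumes that the direction is encoded in two bits of the VLAN ID. The
macro names, the bit positions (11:10) and the RX/TX values are
assumptions made for this example and should be checked against
net/dsa/tag_8021q.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding for illustration: direction lives in VID bits 11:10. */
#define SKETCH_DIR_SHIFT 10
#define SKETCH_DIR_MASK  (0x3u << SKETCH_DIR_SHIFT)
#define SKETCH_DIR_RX    (0x1u << SKETCH_DIR_SHIFT)
#define SKETCH_DIR_TX    (0x2u << SKETCH_DIR_SHIFT)

/* True if the VID carries the RX direction marker. */
static inline bool sketch_vid_is_rx(uint16_t vid)
{
	return (vid & SKETCH_DIR_MASK) == SKETCH_DIR_RX;
}

/* True if the VID carries the TX direction marker. */
static inline bool sketch_vid_is_tx(uint16_t vid)
{
	return (vid & SKETCH_DIR_MASK) == SKETCH_DIR_TX;
}
```

A driver could then branch on these predicates to pick the type of TCAM
rule to install for a given VLAN.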

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago vmxnet3: Remove buf_info from device accessible structures
Ronak Doshi [Thu, 28 Jan 2021 02:08:16 +0000 (18:08 -0800)]
vmxnet3: Remove buf_info from device accessible structures

buf_info structures in RX & TX queues are private driver data that
do not need to be visible to the device.  Although there is a physical
address and length in the queue descriptor that points to these
structures, their layout is not standardized, and the device never looks
at them.

So let's allocate these structures in non-DMA-able memory, and fill in
the physical address as all-ones and the length as zero in the queue
descriptor.

That should alleviate worries brought by Martin Radev in
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210104/022829.html
that a malicious vmxnet3 device could subvert SVM/TDX guarantees.

Signed-off-by: Petr Vandrovec <petr@vmware.com>
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: dsa: hellcreek: Add missing TAPRIO dependency
Kurt Kanzenbach [Thu, 28 Jan 2021 16:33:38 +0000 (17:33 +0100)]
net: dsa: hellcreek: Add missing TAPRIO dependency

Add missing dependency to TAPRIO to avoid build failures such as:

|ERROR: modpost: "taprio_offload_get" [drivers/net/dsa/hirschmann/hellcreek_sw.ko] undefined!
|ERROR: modpost: "taprio_offload_free" [drivers/net/dsa/hirschmann/hellcreek_sw.ko] undefined!

Fixes: 24dfc6eb39b2 ("net: dsa: hellcreek: Add TAPRIO offloading support")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20210128163338.22665-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: proc: speedup /proc/net/netstat
Eric Dumazet [Thu, 28 Jan 2021 16:21:45 +0000 (08:21 -0800)]
net: proc: speedup /proc/net/netstat

Use cache-friendly helpers to make better use of CPU caches
while reading /proc/net/netstat.

Tested on a platform with 256 threads (AMD Rome)

Before: 305 usec spent in netstat_seq_show()
After: 130 usec spent in netstat_seq_show()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: Remove redundant calls of sk_tx_queue_clear().
Kuniyuki Iwashima [Thu, 28 Jan 2021 15:02:17 +0000 (00:02 +0900)]
net: Remove redundant calls of sk_tx_queue_clear().

The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4a03f ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if sk_tx_queue_mapping were ever moved out of the no-copy area,
this would silently introduce a bug.

Therefore, this patch adds a compile-time check to take care of the order
of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it only does the allocation and its callers
initialize the fields.
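
The idea behind such a compile-time check can be sketched as follows.
The struct layout, field names and copy helper are hypothetical and much
simpler than struct sock; the point is only that the build fails if the
field ever leaves the region the copy skips:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified socket with a region that clones must skip. */
struct sock_sketch {
	int copied_before;               /* copied on clone */
	int first_nocopy;                /* start of the no-copy area */
	unsigned short tx_queue_mapping; /* must stay in the no-copy area */
	int last_nocopy;                 /* last field of the no-copy area */
	int copied_after;                /* copied on clone */
};

/* Build fails if tx_queue_mapping is moved outside the no-copy area. */
_Static_assert(offsetof(struct sock_sketch, tx_queue_mapping) >=
	       offsetof(struct sock_sketch, first_nocopy) &&
	       offsetof(struct sock_sketch, tx_queue_mapping) <
	       offsetof(struct sock_sketch, copied_after),
	       "tx_queue_mapping must stay inside the no-copy area");

/* Copy everything except the no-copy area, which dst keeps as-is. */
static inline void sock_copy_sketch(struct sock_sketch *dst,
				    const struct sock_sketch *src)
{
	size_t begin = offsetof(struct sock_sketch, first_nocopy);
	size_t end = offsetof(struct sock_sketch, copied_after);

	memcpy(dst, src, begin);
	memcpy((char *)dst + end, (const char *)src + end,
	       sizeof(*src) - end);
}
```

With the assertion in place, a value cleared in the destination before
the copy is guaranteed to survive it.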

CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'net-hns3-updates-for-next'
Jakub Kicinski [Sat, 30 Jan 2021 04:42:25 +0000 (20:42 -0800)]
Merge branch 'net-hns3-updates-for-next'

Huazhong Tan says:

====================
net: hns3: updates for -next

This patchset adds dumping of tm info for nodes, priority and qset to
debugfs. Three debugfs files, tm_nodes, tm_priority and tm_qset, are
created in a new tm directory, and the cat command dumps their info, for
example:

$ cat tm_nodes
       BASE_ID  MAX_NUM
PG         0         8
PRI        0         8
QSET       0         8
QUEUE      0      1024

$ cat tm_priority
ID    MODE  DWRR  C_IR_B  C_IR_U  C_IR_S  C_BS_B  C_BS_S  C_FLAG  C_RATE(Mbps)  P_IR_B  P_IR_U  P_IR_S  P_BS_B  P_BS_S  P_FLAG  P_RATE(Mbps)
0000  dwrr  100     0       0       0       5      20       0          0        150       7       0       5      20       0          0
0001    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0002    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0003    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0004    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0005    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0006    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0
0007    sp    0     0       0       0       0       0       0          0          0       0       0       0       0       0          0

$ cat tm_qset
ID    MAP_PRI  LINK_VLD  MODE  DWRR
0000     0        1      dwrr  100
0001     0        0        sp    0
0002     0        0        sp    0
0003     0        0        sp    0
0004     0        0        sp    0
0005     0        0        sp    0
0006     0        0        sp    0

change log:
V2: add readonly files for dumping all nodes, priority and qset info,
    as suggested by Jakub Kicinski.

previous version:
V1: https://patchwork.kernel.org/project/netdevbpf/patch/1610694569-43099-1-git-send-email-tanhuazhong@huawei.com/
====================

Link: https://lore.kernel.org/r/1611834696-56207-1-git-send-email-tanhuazhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: hns3: add debugfs support for tm nodes, priority and qset info
Guangbin Huang [Thu, 28 Jan 2021 11:51:36 +0000 (19:51 +0800)]
net: hns3: add debugfs support for tm nodes, priority and qset info

In order to query tm info of nodes, priority and qset
for debugging, add three debugfs files tm_nodes,
tm_priority and tm_qset in a newly created tm directory.

Unlike previous debugfs commands, these three files
support only read ops, so their info can only be dumped
with the cat command.

The new tm file style follows the suggestion from
Jakub Kicinski at https://lkml.org/lkml/2020/9/29/2101.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: hns3: add interfaces to query information of tm priority/qset
Guangbin Huang [Thu, 28 Jan 2021 11:51:35 +0000 (19:51 +0800)]
net: hns3: add interfaces to query information of tm priority/qset

Add some interfaces to get information of tm priority and qset,
then they can be used by debugfs.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'net-add-support-for-ip-generic-checksum-offload-for-gre'
Jakub Kicinski [Sat, 30 Jan 2021 04:39:16 +0000 (20:39 -0800)]
Merge branch 'net-add-support-for-ip-generic-checksum-offload-for-gre'

Xin Long says:

====================
net: add support for ip generic checksum offload for gre

This patchset is to add ip generic csum processing first in
skb_csum_hwoffload_help() in Patch 1/2 and then add csum
offload support for GRE header in Patch 2/2.
====================

Link: https://lore.kernel.org/r/cover.1611825446.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago ip_gre: add csum offload support for gre header
Xin Long [Thu, 28 Jan 2021 09:18:32 +0000 (17:18 +0800)]
ip_gre: add csum offload support for gre header

This patch is to add csum offload support for gre header:

On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.

On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().

Note that SCTP has to do the csum by itself for non GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.

v1->v2:
  - remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
    and it will always do checksum for SCTP in sctp_packet_pack()
    when it's not a GSO packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: support ip generic csum processing in skb_csum_hwoffload_help
Xin Long [Thu, 28 Jan 2021 09:18:31 +0000 (17:18 +0800)]
net: support ip generic csum processing in skb_csum_hwoffload_help

NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
for HW, which includes not only for TCP/UDP csum, but also for other
protocols' csum like GRE's.

However, in skb_csum_hwoffload_help() it only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features don't support NETIF_F_HW_CSUM, but support
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.

This patch is to support ip generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.

Note that we're using skb->csum_offset to check if it's a TCP/UDP
protocol, which might be fragile. However, as Alex said, for now we
only have a few L4 protocols that are requesting Tx csum offload,
so we'd better defer fixing this until a new protocol comes with the
same csum offset.
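
The decision described above can be sketched as follows. The flag values
and the function name are hypothetical (the real NETIF_F_* bits differ);
the checksum field offsets are the standard positions within the TCP and
UDP headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits for illustration only. */
#define F_IP_CSUM   (1u << 0)
#define F_IPV6_CSUM (1u << 1)
#define F_HW_CSUM   (1u << 2)

/* Byte offsets of the checksum field in the TCP and UDP headers. */
#define TCP_CSUM_OFFSET 16
#define UDP_CSUM_OFFSET 6

/*
 * Returns true when the device can be left to fill in the checksum,
 * false when software has to compute it before transmission.
 */
static inline bool hw_can_checksum(uint32_t features, uint16_t csum_offset)
{
	/* Generic offload covers every protocol. */
	if (features & F_HW_CSUM)
		return true;

	/* Protocol-specific offload covers only TCP and UDP. */
	if (features & (F_IP_CSUM | F_IPV6_CSUM))
		return csum_offset == TCP_CSUM_OFFSET ||
		       csum_offset == UDP_CSUM_OFFSET;

	return false;
}
```

A packet whose checksum sits at any other offset, such as a GRE header
checksum, falls back to software unless the generic flag is set.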

v1->v2:
  - not extend skb->csum_not_inet, but use skb->csum_offset to tell
    if it's an UDP/TCP csum packet.
v2->v3:
  - add a note in the changelog, as Willem suggested.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge tag 'linux-can-next-for-5.12-20210129' of git://git.kernel.org/pub/scm/linux...
Jakub Kicinski [Sat, 30 Jan 2021 03:45:21 +0000 (19:45 -0800)]
Merge tag 'linux-can-next-for-5.12-20210129' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
linux-can-next-for-5.12-20210129

All patches are by me and target the mcp251xfd driver. The first 4
patches update the information regarding the "85% of (FSYSCLK/2)"
errata. The other 4 are misc cleanups: unify error messages, add
missing postfix to a macro, simplify the return of a function, and
make use of dev_err_probe() in the mcp251xfd_probe() function.
====================

Link: https://lore.kernel.org/r/20210129084302.3040284-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: mhi: Get rid of local rx queue count
Loic Poulain [Mon, 11 Jan 2021 18:07:42 +0000 (19:07 +0100)]
net: mhi: Get rid of local rx queue count

Use the new mhi_get_free_desc_count helper to track queue usage
instead of relying on the locally maintained rx_queued count.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: mhi: Get RX queue size from MHI core
Loic Poulain [Thu, 28 Jan 2021 13:47:22 +0000 (14:47 +0100)]
net: mhi: Get RX queue size from MHI core

The RX queue size can be determined at runtime by retrieving the
number of available transfer descriptors.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'mhi-net-immutable' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Sat, 30 Jan 2021 03:40:25 +0000 (19:40 -0800)]
Merge branch 'mhi-net-immutable' of https://git./linux/kernel/git/mani/mhi

Needed by mhi-net patches.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago docs: networking: timestamping: fix section title markup
Jan Luebbe [Thu, 28 Jan 2021 11:19:30 +0000 (12:19 +0100)]
docs: networking: timestamping: fix section title markup

This section was missed during the conversion to ReST, so convert it in the
same style as the surrounding section titles.

Signed-off-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lore.kernel.org/r/20210128111930.29473-1-jlu@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net/ethernet: convert to use module_platform_driver in octeon_mgmt.c
dingsenjie [Thu, 28 Jan 2021 03:53:30 +0000 (11:53 +0800)]
net/ethernet: convert to use module_platform_driver in octeon_mgmt.c

Simplify the code by using module_platform_driver macro
for octeon_mgmt.

Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210128035330.17676-1-dingsenjie@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: atm: pppoatm: use new API for wakeup tasklet
Emil Renner Berthing [Wed, 27 Jan 2021 17:32:56 +0000 (18:32 +0100)]
net: atm: pppoatm: use new API for wakeup tasklet

This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-2-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: atm: pppoatm: use tasklet_init to initialize wakeup tasklet
Emil Renner Berthing [Wed, 27 Jan 2021 17:32:55 +0000 (18:32 +0100)]
net: atm: pppoatm: use tasklet_init to initialize wakeup tasklet

Previously a temporary tasklet structure was initialized on the stack
using DECLARE_TASKLET_OLD() and then copied over and modified. Nothing
else in the kernel seems to use this pattern, so let's just call
tasklet_init() like everyone else.

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net/mlx5: DR, Allow SW steering for sw_owner_v2 devices
Yevgeny Kliteynik [Mon, 25 Jan 2021 00:26:45 +0000 (02:26 +0200)]
net/mlx5: DR, Allow SW steering for sw_owner_v2 devices

Allow sw_owner_v2 based on sw_format_version.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Copy all 64B whenever replacing STE in the head of miss-list
Yevgeny Kliteynik [Tue, 22 Sep 2020 01:24:09 +0000 (04:24 +0300)]
net/mlx5: DR, Copy all 64B whenever replacing STE in the head of miss-list

Until now the code assumed that only a reduced size of the STE needs to
be copied, because the rest is the mask part, which shouldn't be changed.
This is not true for all types of HW (like STEv1).
Take all 64B from the new STE and write them in place of the replaced STE.
This change makes it easier to handle all STE HW types, because we have
all the data that is about to be written into HW.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Use HW specific logic API when writing STE
Yevgeny Kliteynik [Sun, 6 Dec 2020 16:13:01 +0000 (18:13 +0200)]
net/mlx5: DR, Use HW specific logic API when writing STE

STEv0 format and STEv1 HW format are different, each has a
different order:
STEv0: CTRL 32B, TAG 16B, BITMASK 16B
STEv1: CTRL 32B, BITMASK 16B, TAG 16B

To make this transparent to upper layers we introduce a
new ste_ctx function to format the STE prior to writing it.
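
As a rough illustration of the reordering, the sketch below converts a
64B entry from the CTRL/TAG/BITMASK order to the CTRL/BITMASK/TAG order
listed above. The function and macro names are hypothetical, and the
commit message does not say how the actual ste_ctx hook performs the
conversion:

```c
#include <stdint.h>
#include <string.h>

#define STE_SIZE      64
#define STE_CTRL_SIZE 32
#define STE_TAG_SIZE  16
#define STE_MASK_SIZE 16

/*
 * Convert CTRL | TAG | BITMASK (STEv0 order) into
 * CTRL | BITMASK | TAG (STEv1 order). in and out must not overlap.
 */
static inline void ste_v0_to_v1_layout(uint8_t out[STE_SIZE],
				       const uint8_t in[STE_SIZE])
{
	/* The control area keeps its position. */
	memcpy(out, in, STE_CTRL_SIZE);
	/* The bitmask moves forward to follow the control area. */
	memcpy(out + STE_CTRL_SIZE, in + STE_CTRL_SIZE + STE_TAG_SIZE,
	       STE_MASK_SIZE);
	/* The tag moves to the end. */
	memcpy(out + STE_CTRL_SIZE + STE_MASK_SIZE, in + STE_CTRL_SIZE,
	       STE_TAG_SIZE);
}
```

Keeping the conversion in one place lets upper layers build entries in a
single order regardless of the HW format.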

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Use the right size when writing partial STE into HW
Yevgeny Kliteynik [Tue, 22 Sep 2020 01:23:58 +0000 (04:23 +0300)]
net/mlx5: DR, Use the right size when writing partial STE into HW

In these cases we need to update only the ctrl area of the STE.
So it is better to write only the control 32B and avoid copying
the unneeded reduced 48B (control 32B + tag 16B).

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Add STEv1 modify header logic
Yevgeny Kliteynik [Tue, 22 Sep 2020 01:23:53 +0000 (04:23 +0300)]
net/mlx5: DR, Add STEv1 modify header logic

Add HW specific modify header fields and logic to the STEv1 file.
Since the STEv0 and STEv1 modify action values are different, each
version has its own implementation.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Add STEv1 action apply logic
Yevgeny Kliteynik [Tue, 22 Sep 2020 01:23:42 +0000 (04:23 +0300)]
net/mlx5: DR, Add STEv1 action apply logic

Add HW specific action apply logic to STEv1.
Since the STEv0 and STEv1 action formats are different, each
version has its own implementation.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Add STEv1 setters and getters
Yevgeny Kliteynik [Tue, 22 Sep 2020 01:23:31 +0000 (04:23 +0300)]
net/mlx5: DR, Add STEv1 setters and getters

Add HW specific setters and getters to the STEv1 file.
Since the STEv0 and STEv1 formats are different, each version
should implement different setters and getters.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Allow native protocol support for HW STEv1
Yevgeny Kliteynik [Tue, 22 Sep 2020 01:23:25 +0000 (04:23 +0300)]
net/mlx5: DR, Allow native protocol support for HW STEv1

Some flex parser protocols are native as part of STEv1.
The check for supported protocols was modified to allow this.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Add HW STEv1 match logic
Yevgeny Kliteynik [Tue, 1 Dec 2020 13:37:57 +0000 (15:37 +0200)]
net/mlx5: DR, Add HW STEv1 match logic

Add STEv1 match logic to a new file.
This file will be used for HW specific STEv1.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Add match STEv1 structs to ifc
Yevgeny Kliteynik [Tue, 1 Dec 2020 13:30:59 +0000 (15:30 +0200)]
net/mlx5: DR, Add match STEv1 structs to ifc

Add mlx5_ifc_dr_ste_v1.h - a new header with HW specific
STE structs for version 1.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago net/mlx5: DR, Fix potential shift wrapping of 32-bit value
Yevgeny Kliteynik [Wed, 20 Jan 2021 00:53:28 +0000 (02:53 +0200)]
net/mlx5: DR, Fix potential shift wrapping of 32-bit value

Fix 32-bit variable shift wrapping in dr_ste_v0_get_miss_addr.
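
The class of bug can be sketched as follows. The function names, the
two-field split and the shift amount are hypothetical; the commit message
only states that a 32-bit value was shifted in dr_ste_v0_get_miss_addr:

```c
#include <stdint.h>

/*
 * Buggy pattern: the shift is evaluated in 32-bit arithmetic, so bits
 * shifted past bit 31 are lost before the result is widened.
 */
static inline uint64_t combine_wrapping(uint32_t low, uint32_t high)
{
	return low | (uint32_t)(high << 26);
}

/* Fixed pattern: widen to 64 bits first, then shift. */
static inline uint64_t combine_widened(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 26);
}
```

For high = 0xFF, the first version yields 0xFC000000 while the second
yields 0x3FC000000, so the two upper bits survive only when the operand
is widened before the shift.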

Fixes: 6b93b400aa88 ("net/mlx5: DR, Move STEv0 setters and getters")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
3 years ago Merge branch 'net-sched-cls_flower-add-support-for-matching-on-ct_state-reply-flag'
Jakub Kicinski [Sat, 30 Jan 2021 02:05:32 +0000 (18:05 -0800)]
Merge branch 'net-sched-cls_flower-add-support-for-matching-on-ct_state-reply-flag'

Paul Blakey says:

====================
net/sched: cls_flower: Add support for matching on ct_state reply flag

This patchset adds software match support and offload of flower
match ct_state reply flag (+/-rpl).

The first patch adds the definition for the flag and match to flower.

The second patch gives the direction of the connection to the offloading
drivers via ct_metadata flow offload action.

The last patch offloads this new ct_state flag by using the supplied
connection's direction.
====================

Link: https://lore.kernel.org/r/1611757967-18236-1-git-send-email-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net/mlx5: CT: Add support for matching on ct_state reply flag
Paul Blakey [Wed, 27 Jan 2021 14:32:47 +0000 (16:32 +0200)]
net/mlx5: CT: Add support for matching on ct_state reply flag

Add support for matching on ct_state reply flag.

Example:
$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est+rpl \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est-rpl \
  action mirred egress redirect dev ens1f0_0

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: flow_offload: Add original direction flag to ct_metadata
Paul Blakey [Wed, 27 Jan 2021 14:32:46 +0000 (16:32 +0200)]
net: flow_offload: Add original direction flag to ct_metadata

Give offloading drivers the direction of the offloaded ct flow;
this will be used for matches on direction (ct_state +/-rpl).

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net/sched: cls_flower: Add match on the ct_state reply flag
Paul Blakey [Wed, 27 Jan 2021 14:32:45 +0000 (16:32 +0200)]
net/sched: cls_flower: Add match on the ct_state reply flag

Add match on the ct_state reply flag.

Example:
$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est+rpl \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est-rpl \
  action mirred egress redirect dev ens1f0_0

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'add-nci-suit-and-virtual-nci-device-driver'
Jakub Kicinski [Sat, 30 Jan 2021 02:03:36 +0000 (18:03 -0800)]
Merge branch 'add-nci-suit-and-virtual-nci-device-driver'

Bongsu Jeon says:

====================
Add nci suite and virtual nci device driver

1/2 is the Virtual NCI device driver.
2/2 is the NCI selftest suite
====================

Link: https://lore.kernel.org/r/20210127130829.4026-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago selftests: Add nci suite
Bongsu Jeon [Wed, 27 Jan 2021 13:08:29 +0000 (22:08 +0900)]
selftests: Add nci suite

This is the NCI test suite. It tests the NFC/NCI module using a virtual NCI
device. Test cases consist of turning the virtual NCI device on and off and
controlling the device's polling for the NCI1.0 and NCI2.0 versions.

Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nfc: Add a virtual nci device driver
Bongsu Jeon [Wed, 27 Jan 2021 13:08:28 +0000 (22:08 +0900)]
nfc: Add a virtual nci device driver

The NCI virtual device simulates an NCI device to the user. It can be used to
validate the NCI module and applications. This driver supports
communication between the virtual NCI device and NCI module.

Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net: packet: make pkt_sk() inline
Menglong Dong [Wed, 27 Jan 2021 12:33:02 +0000 (04:33 -0800)]
net: packet: make pkt_sk() inline

It's better to make 'pkt_sk()' inline here, as non-inline functions
shouldn't be defined in headers. Besides, this function is simple
enough to be inline.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210127123302.29842-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer
Andrea Parri (Microsoft) [Tue, 26 Jan 2021 16:29:07 +0000 (17:29 +0100)]
hv_netvsc: Copy packets sent by Hyper-V out of the receive buffer

Pointers to receive-buffer packets sent by Hyper-V are used within the
guest VM.  Hyper-V can send packets with erroneous values or modify
packet fields after they are processed by the guest.  To defend against
these scenarios, copy (sections of) the incoming packet after validating
their length and offset fields in netvsc_filter_receive().  In this way,
the packet can no longer be modified by the host.
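
The validate-then-copy idea can be modelled in a few lines. The sketch
below is not the netvsc code: copy_out(), its parameters and the sample
buffer are hypothetical names used only to show why a private copy taken
after validation is immune to later changes in the shared buffer.

```python
def copy_out(shared, offset, length, max_len):
    """Validate offset/length, then return a private snapshot of the section.

    `shared` stands in for the host-writable receive buffer (hypothetical).
    """
    if offset < 0 or length < 0 or length > max_len:
        raise ValueError("length out of bounds")
    if offset + length > len(shared):
        raise ValueError("section exceeds receive buffer")
    # bytes() makes an immutable copy that no longer aliases `shared`.
    return bytes(shared[offset:offset + length])


recv_buf = bytearray(b"\x00\x00HELLO-PACKET\x00\x00")
pkt = copy_out(recv_buf, offset=2, length=12, max_len=64)

# The other side rewrites the shared buffer after validation...
recv_buf[2:7] = b"EVIL!"

# ...but the guest-side copy still holds the validated contents.
print(pkt)
```

All later processing would then work on the copy rather than on pointers
into the shared buffer.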

Reported-by: Juan Vazquez <juvazq@microsoft.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20210126162907.21056-1-parri.andrea@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago can: mcp251xfd: mcp251xfd_probe(): use dev_err_probe() to simplify error handling
Marc Kleine-Budde [Fri, 16 Oct 2020 16:03:59 +0000 (18:03 +0200)]
can: mcp251xfd: mcp251xfd_probe(): use dev_err_probe() to simplify error handling

dev_err_probe() can reduce code size, unify error handling, and record the
reason for a deferred probe, so use it to simplify the code.

Link: https://lore.kernel.org/r/20210128104644.2982125-9-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: mcp251xfd_chip_clock_enable(): simplify return
Marc Kleine-Budde [Fri, 16 Oct 2020 18:49:30 +0000 (20:49 +0200)]
can: mcp251xfd: mcp251xfd_chip_clock_enable(): simplify return

This patch simplifies the return of the mcp251xfd_chip_clock_enable()
function by directly returning the error.

Link: https://lore.kernel.org/r/20210128104644.2982125-8-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: add missing _MASK postfix to MCP251XFD_OBJ_FLAGS_DLC
Marc Kleine-Budde [Tue, 20 Oct 2020 15:10:14 +0000 (17:10 +0200)]
can: mcp251xfd: add missing _MASK postfix to MCP251XFD_OBJ_FLAGS_DLC

As MCP251XFD_OBJ_FLAGS_DLC is a mask, add the missing _MASK postfix
that all other masks in the driver have.

Link: https://lore.kernel.org/r/20210128104644.2982125-7-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: unify error messages and commets
Marc Kleine-Budde [Thu, 22 Oct 2020 21:14:54 +0000 (23:14 +0200)]
can: mcp251xfd: unify error messages and commets

This patch unifies the error messages:
- have a "." at the end of each message
- write "controller" with a lowercase "c" if it is not the first word of an
  error message.

Link: https://lore.kernel.org/r/20210128104644.2982125-6-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: mcp251xfd_probe(): add imx6 to errata table
Marc Kleine-Budde [Thu, 15 Oct 2020 21:06:31 +0000 (23:06 +0200)]
can: mcp251xfd: mcp251xfd_probe(): add imx6 to errata table

This patch adds an imx6 as known good to the errata table.

Link: https://lore.kernel.org/r/20210128104644.2982125-5-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: mcp251xfd_probe(): remove known bad combinations from errata tabe
Marc Kleine-Budde [Thu, 15 Oct 2020 21:06:31 +0000 (23:06 +0200)]
can: mcp251xfd: mcp251xfd_probe(): remove known bad combinations from errata tabe

The published errata specify the maximum allowed SPI frequency to be
85% of (FSYSCLK/2). So there's no need to track known bad clock
settings in the driver. As the setup of known good values is a bit
tricky, keep them.
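
As a worked example of the 85% rule, the sketch below computes the bound
for two system clock rates. The function name max_spi_hz() and the sample
clock values are assumptions for illustration, not taken from the driver.

```python
def max_spi_hz(sysclk_hz):
    """Upper SPI clock bound: 85% of (FSYSCLK / 2), in integer Hz."""
    # 0.85 * (f / 2) == f * 85 / 200; integer math avoids float rounding.
    return sysclk_hz * 85 // 200


for sysclk in (20_000_000, 40_000_000):
    print(sysclk, max_spi_hz(sysclk))
```

For a 40 MHz system clock this gives 17 MHz, and for 20 MHz it gives
8.5 MHz.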

Link: https://lore.kernel.org/r/20210128104644.2982125-4-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: mcp251xfd_probe(): sort errata table alphabetically, fix indention
Marc Kleine-Budde [Thu, 15 Oct 2020 21:06:31 +0000 (23:06 +0200)]
can: mcp251xfd: mcp251xfd_probe(): sort errata table alphabetically, fix indention

This patch sorts the errata table alphabetically and fixes the
indentation.

Link: https://lore.kernel.org/r/20210128104644.2982125-3-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago can: mcp251xfd: mcp251xfd_probe(): fix errata reference
Marc Kleine-Budde [Thu, 15 Oct 2020 21:06:31 +0000 (23:06 +0200)]
can: mcp251xfd: mcp251xfd_probe(): fix errata reference

This patch fixes the reference to the errata for both the mcp2517fd
and the mcp2518fd.

Fixes: f5b84dedf7eb ("can: mcp25xxfd: mcp25xxfd_probe(): add SPI clk limit related errata information")
Link: https://lore.kernel.org/r/20210128104644.2982125-2-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
3 years ago octeontx2-af: Fix 'physical' typos
Bjorn Helgaas [Wed, 27 Jan 2021 18:13:59 +0000 (12:13 -0600)]
octeontx2-af: Fix 'physical' typos

Fix misspellings of "physical".

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210127181359.3008316-1-helgaas@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago linux/qed: fix spelling typo in qed_chain.h
dingsenjie [Wed, 27 Jan 2021 02:28:01 +0000 (10:28 +0800)]
linux/qed: fix spelling typo in qed_chain.h

allocted -> allocated

Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210127022801.8028-1-dingsenjie@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'nexthop-preparations-for-resilient-next-hop-groups'
Jakub Kicinski [Fri, 29 Jan 2021 04:49:57 +0000 (20:49 -0800)]
Merge branch 'nexthop-preparations-for-resilient-next-hop-groups'

Petr Machata says:

====================
nexthop: Preparations for resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
                under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
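
The effect can be reproduced with a small model. This is a sketch under
simplifying assumptions, not the kernel implementation: the hash space is
shrunk to 100 values, all next hops have equal weight, and the helper
names are made up for the example.

```python
from bisect import bisect_right

HASH_SPACE = 100  # hypothetical hash-space size, chosen for readability


def build_thresholds(next_hops):
    """Split [0, HASH_SPACE) into equal contiguous regions, one per next hop.

    Returns the exclusive upper bound of each region, in next-hop order.
    """
    n = len(next_hops)
    return [HASH_SPACE * (i + 1) // n for i in range(n)]


def threshold_select(next_hops, thresholds, flow_hash):
    """Pick the next hop whose region contains the flow hash."""
    return next_hops[bisect_right(thresholds, flow_hash % HASH_SPACE)]


before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]  # next hop 3 deleted
t_before = build_thresholds(before)
t_after = build_thresholds(after)

# Hashes whose next hop differs between the two configurations.
moved = [
    h for h in range(HASH_SPACE)
    if threshold_select(before, t_before, h) != threshold_select(after, t_after, h)
]
# Hashes that moved even though they never pointed at the deleted next hop.
collateral = [h for h in moved if threshold_select(before, t_before, h) != 3]

print(len(moved), len(collateral))
```

In this model 30 of the 100 hash values change next hop, and 10 of those
belonged to next hops 2 and 4, which were not removed.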

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes packets from a single flow to suddenly
arrive at a server that does not expect them, which may lead to a TCP
reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problem. A resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore, when a next hop is deleted,
the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
               under the resilient hashing algorithm.
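
The figure can be reproduced with a short model of the bucket table. This
is an illustrative sketch, not the planned kernel code: the table size,
the round-robin replacement policy and the function names are assumptions.

```python
def build_table(next_hops, buckets_per_hop):
    """Lay out the bucket table with an equal share of buckets per next hop."""
    return [nh for nh in next_hops for _ in range(buckets_per_hop)]


def bucket_select(table, flow_hash):
    """Modulo picks the bucket; the bucket names the next hop."""
    return table[flow_hash % len(table)]


def remove_next_hop(table, removed, survivors):
    """Reassign only the buckets that pointed at the removed next hop.

    Handing them out round-robin is an assumption made for this sketch.
    """
    new_table = list(table)
    orphaned = [i for i, nh in enumerate(table) if nh == removed]
    for k, i in enumerate(orphaned):
        new_table[i] = survivors[k % len(survivors)]
    return new_table


table_before = build_table([1, 2, 3, 4, 5], 4)
table_after = remove_next_hop(table_before, removed=3, survivors=[1, 2, 4, 5])

# Hashes whose next hop changed; all of them used to resolve to next hop 3.
changed = [
    h for h in range(1000)
    if bucket_select(table_before, h) != bucket_select(table_after, h)
]
print(table_after)
print(all(bucket_select(table_before, h) == 3 for h in changed))
```

Only flows that hashed to a bucket of the deleted next hop are moved;
every other bucket, and therefore every other flow, is left alone.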

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally keep
being forwarded to the same endpoints through the same paths as before
the next-hop group change.
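
One way to picture this is the sketch below, which moves only idle buckets
toward the new weights. It is a guess at the general shape, not the
algorithm of the planned implementation: the Bucket fields, the idle_after
threshold and the rebalance() helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    next_hop: int
    last_used: float  # time of the last forwarded packet (hypothetical field)


def rebalance(table, weights, now, idle_after):
    """Move idle buckets toward the new weights; never touch busy buckets."""
    total = sum(weights.values())
    # Target bucket count per next hop (floor division; a sketch-level choice).
    wanted = {nh: len(table) * w // total for nh, w in weights.items()}
    have = {nh: 0 for nh in weights}
    for b in table:
        have[b.next_hop] = have.get(b.next_hop, 0) + 1

    for b in table:
        idle = now - b.last_used >= idle_after
        if not idle or have[b.next_hop] <= wanted.get(b.next_hop, 0):
            continue  # busy, or its next hop is not over its share
        under = [nh for nh in weights if have[nh] < wanted[nh]]
        if not under:
            break  # new distribution already satisfied
        have[b.next_hop] -= 1
        b.next_hop = under[0]
        have[under[0]] += 1
    return table


now = 100.0
# Next hops 1 and 2 hold four buckets each; two buckets of hop 2 are busy.
buckets = [Bucket(1, 0.0) for _ in range(4)] + [
    Bucket(2, now), Bucket(2, now), Bucket(2, 0.0), Bucket(2, 0.0)
]
rebalance(buckets, weights={1: 3, 2: 1}, now=now, idle_after=10.0)
print([b.next_hop for b in buckets])
```

Here the two busy buckets of next hop 2 keep their assignment, while its
two idle buckets move to next hop 1 to reach the new 3:1 ratio.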

This patchset prepares the next-hop code for eventual introduction of
resilient hashing groups.

- Patches #1-#4 carry otherwise disjoint changes that just remove certain
  assumptions in the next-hop code.

- Patches #5-#6 extend the in-kernel next-hop notifiers to support more
  next-hop group types.

- Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
  will introduce a new logical object, a hash table bucket. It turns out
  that handling bucket-related messages is similar to how next-hop messages
  are handled. These patches extract the commonalities into reusable
  components.

The plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (this patchset)
3) Implementation of resilient next hop group
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================

Link: https://lore.kernel.org/r/cover.1611836479.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Extract a helper for validation of get/del RTNL requests
Petr Machata [Thu, 28 Jan 2021 12:49:24 +0000 (13:49 +0100)]
nexthop: Extract a helper for validation of get/del RTNL requests

Messages for get / del of a next hop are validated the same way as
messages for get of a resilient next hop group bucket will be. The
difference is that policy for resilient next hop group buckets is a
superset of that used for next-hop get.

It is therefore possible to reuse the code that validates the nhmsg fields,
extracts the next-hop ID, and validates that. To that end, extract from
nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just
that.

Make the nlh argument const so that the function can be called from the
dump context, which only has a const nlh. Propagate the constness to
nh_valid_get_del_req().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Add a callback parameter to rtm_dump_walk_nexthops()
Petr Machata [Thu, 28 Jan 2021 12:49:23 +0000 (13:49 +0100)]
nexthop: Add a callback parameter to rtm_dump_walk_nexthops()

In order to allow different handling for next-hop tree dumper and for
bucket dumper, parameterize the next-hop tree walker with a callback. Add
rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
dumping.
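
The shape of such a parameterized walker can be sketched as follows. This
is a simplified model and not the kernel code: DumpCtx, walk_nexthops()
and make_dump_cb() are hypothetical names, and a plain list stands in for
the next-hop tree.

```python
from dataclasses import dataclass


@dataclass
class DumpCtx:
    idx: int = 0  # position to resume from on the next invocation


def walk_nexthops(nexthops, ctx, cb):
    """Visit entries from ctx.idx on, handing each one to the callback.

    A non-zero return from cb stops the walk; the entry is retried next time.
    """
    for idx, nh in enumerate(nexthops):
        if idx < ctx.idx:
            continue  # already handled by an earlier invocation
        err = cb(nh)
        if err:
            ctx.idx = idx
            return err
    ctx.idx = len(nexthops)
    return 0


def make_dump_cb(out, capacity):
    """Callback that 'dumps' an entry until the output buffer is full."""
    def cb(nh):
        if len(out) >= capacity:
            return -1  # buffer full, ask to be called again
        out.append(nh)
        return 0
    return cb


ctx = DumpCtx()
first, second = [], []
walk_nexthops([10, 11, 12], ctx, make_dump_cb(first, 2))
walk_nexthops([10, 11, 12], ctx, make_dump_cb(second, 2))
print(first, second)
```

A second dumper could pass a different callback to the same walker and
reuse the resume logic unchanged.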

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Extract a helper for walking the next-hop tree
Petr Machata [Thu, 28 Jan 2021 12:49:22 +0000 (13:49 +0100)]
nexthop: Extract a helper for walking the next-hop tree

Extract from rtm_dump_nexthop() a helper to walk the next hop tree. A
separate function for this will be reusable from the bucket dumper.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Strongly-type context of rtm_dump_nexthop()
Petr Machata [Thu, 28 Jan 2021 12:49:21 +0000 (13:49 +0100)]
nexthop: Strongly-type context of rtm_dump_nexthop()

The dump operations need to keep state from one invocation to another. A
scratch area is dedicated for this purpose in the passed-in argument, cb,
namely via two aliased arrays, struct netlink_callback.args and .ctx.

Dumping of buckets will end up having to iterate over next hops as well,
and it would be nice to be able to reuse the iteration logic with the NH
dumper. The fact that the logic currently relies on a fixed index into the
.args array, and the indices would have to be coordinated between the two
dumpers, makes this somewhat awkward.

To make the access patterns clearer, introduce a helper struct with a NH
index, and instead of using the .args array directly, use it through this
structure.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Extract a common helper for parsing dump attributes
Petr Machata [Thu, 28 Jan 2021 12:49:20 +0000 (13:49 +0100)]
nexthop: Extract a common helper for parsing dump attributes

Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. However, they
have different policies. To allow reuse of this code, extract a
policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
function into a thin wrapper around it.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Extract dump filtering parameters into a single structure
Petr Machata [Thu, 28 Jan 2021 12:49:19 +0000 (13:49 +0100)]
nexthop: Extract dump filtering parameters into a single structure

Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. In order to make
reuse of this code simpler, convert the code to use a single structure with
filtering configuration instead of passing around the parameters one by
one.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Dispatch notifier init()/fini() by group type
Petr Machata [Thu, 28 Jan 2021 12:49:18 +0000 (13:49 +0100)]
nexthop: Dispatch notifier init()/fini() by group type

Once there are several next-hop group types, initialization and
finalization of notifier type need to reflect the actual type. Transform
nh_notifier_grp_info_init() and _fini() to make extending them easier.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Use enum to encode notification type
Ido Schimmel [Thu, 28 Jan 2021 12:49:17 +0000 (13:49 +0100)]
nexthop: Use enum to encode notification type

Currently there are only two types of in-kernel nexthop notification.
The two are distinguished by the 'is_grp' boolean field in 'struct
nh_notifier_info'.

As more notification types are introduced for more next-hop group types, a
boolean is not an easily extensible interface. Instead, convert it to an
enum.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Assert the invariant that a NH group is of only one type
Petr Machata [Thu, 28 Jan 2021 12:49:16 +0000 (13:49 +0100)]
nexthop: Assert the invariant that a NH group is of only one type

Most of the code that deals with nexthop groups relies on the fact that the
group is of exactly one well-known type. Currently there is only one type,
"mpath", but as more next-hop group types come, it becomes desirable to
have a central place where the setting is validated. Introduce such a place
in nexthop_create_group(), such that the check is done before the code
that relies on that invariant is invoked.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Introduce to struct nh_grp_entry a per-type union
Petr Machata [Thu, 28 Jan 2021 12:49:15 +0000 (13:49 +0100)]
nexthop: Introduce to struct nh_grp_entry a per-type union

The values that a next-hop group needs to keep track of depend on the group
type. Introduce a union to separate fields specific to the mpath groups
from fields specific to other group types.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Dispatch nexthop_select_path() by group type
Petr Machata [Thu, 28 Jan 2021 12:49:14 +0000 (13:49 +0100)]
nexthop: Dispatch nexthop_select_path() by group type

The logic for selecting a path depends on the next-hop group type. Adapt
nexthop_select_path() to dispatch according to the group type.
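
Dispatching on the group type might look roughly like the sketch below. It
is a model only: the GroupType values, the dictionary-based group layout
and the SELECTORS table are assumptions, and RESILIENT is a placeholder
for a type that this commit does not add.

```python
from enum import Enum, auto


class GroupType(Enum):
    MPATH = auto()
    RESILIENT = auto()  # hypothetical future type, not added by this commit


def select_mpath(group, flow_hash):
    """Hash-threshold: first entry whose upper bound covers the hash."""
    for upper_bound, next_hop in group["entries"]:
        if flow_hash <= upper_bound:
            return next_hop
    return None


# One selector per group type; a new type would register its own here.
SELECTORS = {GroupType.MPATH: select_mpath}


def select_path(group, flow_hash):
    """Look up the selector for the group's type and delegate to it."""
    selector = SELECTORS.get(group["type"])
    if selector is None:
        return None  # no selector registered for this type
    return selector(group, flow_hash)


group = {"type": GroupType.MPATH, "entries": [(49, "nh1"), (99, "nh2")]}
print(select_path(group, 10), select_path(group, 75))
```

Adding another group type then means adding one selector, without
touching the callers of select_path().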

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago nexthop: Rename nexthop_free_mpath
David Ahern [Thu, 28 Jan 2021 12:49:13 +0000 (13:49 +0100)]
nexthop: Rename nexthop_free_mpath

nexthop_free_mpath really should be nexthop_free_group. Rename it.

Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago Merge branch 'net-iucv-updates-2021-01-28'
Jakub Kicinski [Fri, 29 Jan 2021 04:36:24 +0000 (20:36 -0800)]
Merge branch 'net-iucv-updates-2021-01-28'

Julian Wiedmann says:

====================
net/iucv: updates 2021-01-28

This reworks & simplifies the TX notification path in af_iucv, so that we
can send out SG skbs over TRANS_HIPER sockets. Also remove a noisy
WARN_ONCE() in the RX path.
====================

Link: https://lore.kernel.org/r/20210128114108.39409-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years ago net/af_iucv: build SG skbs for TRANS_HIPER sockets
Julian Wiedmann [Thu, 28 Jan 2021 11:41:08 +0000 (12:41 +0100)]
net/af_iucv: build SG skbs for TRANS_HIPER sockets

The TX path no longer falls apart when some of its SG skbs are later
linearized by lower layers of the stack. So enable the use of SG skbs
in iucv_sock_sendmsg() again.

This effectively reverts
commit dc5367bcc556 ("net/af_iucv: don't use paged skbs for TX on HiperSockets").

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>