git.osdn.net Git - uclinux-h8/linux.git/log

mctp: Specify route types, require rtm_type in RTM_*ROUTE messages

This change adds a 'type' attribute to routes, which can be parsed from
a RTM_NEWROUTE message. This will help to distinguish local vs. peer
routes in a future change.

This means userspace will need to set a correct rtm_type in RTM_NEWROUTE
and RTM_DELROUTE messages; we currently only accept RTN_UNICAST.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20210810023834.2231088-1-jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: add support for triggering reset by ethtool

Currently, four reset types are supported for the HNS3 ethernet
driver: IMP reset, global reset, function reset, and FLR. Only
FLR can now be triggered by the user. To restore the device when
an exception occurs, add support for triggering reset by ethtool.

Run the "ethtool --reset DEVNAME mgmt | all | dedicated" to
trigger the IMP | global | function reset manually.

In addition, VF can only trigger function reset.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Link: https://lore.kernel.org/r/1628602128-15640-1-git-send-email-huangguangbin2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bonding-cleanup-header-file-and-error-msgs'

Jonathan Toppins says:

====================
bonding: cleanup header file and error msgs

Two small patches removing unreferenced symbols and unifying error
messages across netlink and printk.
====================

Link: https://lore.kernel.org/r/cover.1628650079.git.jtoppins@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: combine netlink and console error messages

There seems to be no reason to have different error messages between
netlink and printk. It also cleans up the function slightly.

Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: remove extraneous definitions from bonding.h

All of the symbols either only exist in bond_options.c or nowhere at
all. These symbols were verified to not exist in the code base by
using `git grep` and their removal was verified by compiling bonding.ko.

Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: Fix non-GPL export of regmap APIs

The ocelot driver makes use of regmap, wrapping it with driver specific
operations that are thin wrappers around the core regmap APIs. These are
exported with EXPORT_SYMBOL, dropping the _GPL from the core regmap
exports which is frowned upon. Add _GPL suffixes to at least the APIs that
are doing register I/O.

Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20210810123748.47871-1-broonie@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dsa-tagger-helpers'

Vladimir Oltean says:

====================
DSA tagger helpers

The goal of this series is to minimize the use of memmove and skb->data
in the DSA tagging protocol drivers. Unfiltered access to this level of
information is not very friendly to drive-by contributors, and sometimes
is also not the easiest to review.

For starters, I have converted the most common form of DSA tagging
protocols: the DSA headers which are placed where the EtherType is.

The helper functions introduced by this series are:
- dsa_alloc_etype_header
- dsa_strip_etype_header
- dsa_etype_header_pos_rx
- dsa_etype_header_pos_tx

This series is just a resend as non-RFC of v1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: create a helper for locating EtherType DSA headers on TX

Create a similar helper for locating the offset to the DSA header
relative to skb->data, and make the existing EtherType header taggers to
use it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: create a helper for locating EtherType DSA headers on RX

It seems that protocol tagging driver writers are always surprised about
the formula they use to reach their EtherType header on RX, which
becomes apparent from the fact that there are comments in multiple
drivers that mention the same information.

Create a helper that returns a void pointer to skb->data - 2, as well as
centralize the explanation why that is the case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: create a helper which allocates space for EtherType DSA headers

Hide away the memmove used by DSA EtherType header taggers to shift the
MAC SA and DA to the left to make room for the header, after they've
called skb_push(). The call to skb_push() is still left explicit in
drivers, to be symmetric with dsa_strip_etype_header, and because not
all callers can be refactored to do it (for example, brcm_tag_xmit_ll
has common code for a pre-Ethernet DSA tag and an EtherType DSA tag).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: create a helper that strips EtherType DSA headers on RX

All header taggers open-code a memmove that is fairly not all that
obvious, and we can hide the details behind a helper function, since the
only thing specific to the driver is the length of the header tag.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'devlink-aux-devices'

Parav Pandit says:

====================
devlink: Control auxiliary devices

Currently, for mlx5 multi-function device, a user is not able to control
which functionality to enable/disable. For example, each PCI
PF, VF, SF function by default has netdevice, RDMA and vdpa-net
devices always enabled.

Hence, enable user to control which device functionality to enable/disable.

This is achieved by using existing devlink params [1] to
enable/disable eth, rdma and vdpa net functionality control knob.

For example user interested in only vdpa device function: performs,

$ devlink dev param set pci/0000:06:00.0 name enable_rdma value false \
                   cmode driverinit
$ devlink dev param set pci/0000:06:00.0 name enable_eth value false \
                   cmode driverinit
$ devlink dev param set pci/0000:06:00.0 name enable_vnet value true \
                   cmode driverinit

$ devlink dev reload pci/0000:06:00.0

Reload command honors parameters set, initializes the device that user
has composed using devlink dev params and resources.
Devices before reload:

            mlx5_core.sf.4
         (subfunction device)
                  /\
                 /| \
                / |  \
               /  |   \
mlx5_core.eth.4  |  mlx5_core.rdma.4
(SF eth aux dev)  |  (SF rdma aux dev)
    |             |        |
    |             |        |
enp6s0f0s88      |      mlx5_0
(SF netdev)      |  (SF rdma device)
                  |
         mlx5_core.vnet.4
         (SF vnet aux dev)
                 |
                 |
        auxiliary/mlx5_core.sf.4
        (vdpa net mgmt device)

Above example reconfigures the device with only VDPA functionality.
Devices after reload:

            mlx5_core.sf.4
         (subfunction device)
                  /\
                 /  \
                /    \
               /      \
mlx5_core.vnet.4     no eth, no rdma aux devices
(SF vnet aux dev)

Above parameters enable user to compose the device as needed based
on the use case.

Since devlink params are done on the devlink instance, these
knobs are uniformly usable for PCI PF, VF and SF devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Support enable_vnet devlink dev param

Enable user to disable VDPA net auxiliary device so that when it is not
required, user can disable it.

For example,

$ devlink dev param set pci/0000:06:00.0 \
name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device
mlx5_core.vnet.2 for the VDPA net functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Support enable_rdma devlink dev param

Enable user to disable RDMA auxiliary device so that when it is not
required, user can disable it.

For example,

$ devlink dev param set pci/0000:06:00.0 \
name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device
mlx5_core.rdma.2 for the RDMA functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Support enable_eth devlink dev param

Enable user to disable Ethernet auxiliary device so that when it is not
required, user can disable it.

For example,

$ devlink dev param set pci/0000:06:00.0 \
name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create mlx5_core.eth.2 auxiliary
device for the Ethernet functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Fix unpublish devlink parameters

Cleanup routine missed to unpublish the parameters. Add it.

Fixes: e890acd5ff18 ("net/mlx5: Add devlink flow_steering_mode parameter")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add APIs to publish, unpublish individual parameter

Enable drivers to publish/unpublish individual parameter.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add API to register and unregister single parameter

Currently device configuration parameters can be registered as an array.
Due to this a constant array must be registered. A single driver
supporting multiple devices each with different device capabilities end
up registering all parameters even if it doesn't support it.

One possible workaround a driver can do is, it registers multiple single
entry arrays to overcome such limitation.

Better is to provide a API that enables driver to register/unregister a
single parameter. This also further helps in two ways.
(1) to reduce the memory of devlink_param_entry by avoiding in registering
parameters which are not supported by the device.
(2) avoid generating multiple parameter add, delete, publish, unpublish,
init value notifications for such unsupported parameters

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Create a helper function for one parameter registration

Create and use a helper function for one parameter registration.
Subsequent patch also will reuse this for driver facing routine to
register a single parameter.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add new "enable_vnet" generic device param

Add new device generic parameter to enable/disable creation of
VDPA net auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
VDPA net functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add new "enable_rdma" generic device param

Add new device generic parameter to enable/disable creation of
RDMA auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
RDMA functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add new "enable_eth" generic device param

Add new device generic parameter to enable/disable creation of
Ethernet auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
Ethernet functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bridge-global-mcast'

Nikolay Aleksandrov says:

====================
net: bridge: vlan: add global mcast options

This is the first follow-up set after the support for per-vlan multicast
contexts which extends global vlan options to support bridge's multicast
config per-vlan, it enables user-space to change and dump the already
existing bridge vlan multicast context options. The global option patches
(01 - 09 and 12-13) follow a similar pattern of changing current mcast
functions to take multicast context instead of a port/bridge directly.
Option equality checks have been added for dumping vlan range compression.
The last 2 patches extend the mcast router dump support so it can be
re-used when dumping vlan config.

patches 01 - 09: add support for various mcast options
patches 10 - 11: prepare for per-vlan querier control
patches 12 - 13: add support for querier control and router control
patches 14 - 15: add support for dumping per-vlan router ports

Next patch-sets:
- per-port/vlan router option config
- iproute2 support for all new vlan options
- selftests
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: use br_rports_fill_info() to export mcast router ports

Embed the standard multicast router port export by br_rports_fill_info()
into a new global vlan attribute BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS.
In order to have the same format for the global bridge mcast context and
the per-vlan mcast context we need a double-nesting:
- BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS
- MDBA_ROUTER

Currently we don't compare router lists, if any router port exists in
the bridge mcast contexts we consider their option sets as different and
export them separately.

In addition we export the router port vlan id when dumping similar to
the router port notification format.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: mcast: use the proper multicast context when dumping router ports

When we are dumping the router ports of a vlan mcast context we need to
use the bridge/vlan and port/vlan's multicast contexts to check if
IPv4/IPv6 router port is present and later to dump the vlan id.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast router global option

Add support to change and retrieve global vlan multicast router state
which is used for the bridge itself. We just need to pass multicast context
to br_multicast_set_router instead of bridge device and the rest of the
logic remains the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast querier global option

Add support to change and retrieve global vlan multicast querier state.
We just need to pass multicast context to br_multicast_set_querier
instead of bridge device and the rest of the logic remains the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: mcast: querier and query state affect only current context type

It is a minor optimization and better behaviour to make sure querier and
query sending routines affect only the matching multicast context
depending if vlan snooping is enabled (vlan ctx vs bridge ctx).
It also avoids sending unnecessary extra query packets.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: mcast: move querier state to the multicast context

We need to have the querier state per multicast context in order to have
per-vlan control, so remove the internal option bit and move it to the
multicast context. Also annotate the lockless reads of the new variable.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast startup query interval global option

Add support to change and retrieve global vlan multicast startup query
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast query response interval global option

Add support to change and retrieve global vlan multicast query response
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast query interval global option

Add support to change and retrieve global vlan multicast query interval
option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast querier interval global option

Add support to change and retrieve global vlan multicast querier interval
option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast membership interval global option

Add support to change and retrieve global vlan multicast membership
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast last member interval global option

Add support to change and retrieve global vlan multicast last member
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast startup query count global option

Add support to change and retrieve global vlan multicast startup query
count option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast last member count global option

Add support to change and retrieve global vlan multicast last member
count option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: vlan: add support for mcast igmp/mld version global options

Add support to change and retrieve global vlan IGMP/MLD versions.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipa-runtime-pm'

Alex Elder says:

====================
net: ipa: use runtime PM reference counting

This series does further rework of the IPA clock code so that we
rely on some of the core runtime power management code (including
its referencing counting) instead.

The first patch makes ipa_clock_get() act like pm_runtime_get_sync().

The second patch makes system suspend occur regardless of the
current reference count value, which is again more like how the
runtime PM core code behaves.

The third patch creates functions to encapsulate all hardware
suspend and resume activity.  The fourth uses those functions as
the ->runtime_suspend and ->runtime_resume power callbacks.  With
that in place, ipa_clock_get() and ipa_clock_put() are changed to
use runtime PM get and put functions when needed.

The fifth patch eliminates an extra clock reference previously used
to control system suspend.  The sixth eliminates the "IPA clock"
reference count and mutex.

The final patch replaces the one call to ipa_clock_get_additional()
with a call to pm_runtime_get_if_active(), making the former
unnecessary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: kill ipa_clock_get_additional()

Now that ipa_clock_get_additional() is a trivial wrapper around
pm_runtime_get_if_active(), just open-code it in its only caller
and delete the function.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: kill IPA clock reference count

The runtime power management core code maintains a usage count.  This
count mirrors the IPA clock reference count, and there's no need to
maintain both.  So get rid of the IPA clock reference count and just
rely on the runtime PM usage count to determine when the hardware
should be suspended or resumed.

Use pm_runtime_get_if_active() in ipa_clock_get_additional().  We
care whether power is active, regardless of whether it's in use, so
pass true for its ign_usage_count argument.

The IPA clock mutex is just used to make enabling/disabling the
clock and updating the reference count occur atomically.  Without
the reference count, there's no need for the mutex, so get rid of
that too.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: get rid of extra clock reference

Suspending the IPA hardware is now managed by the runtime PM core
code. The ->runtime_idle callback returns a non-zero value, so it
will never suspend except when forced. As a result, there's no need
to take an extra "do not suspend" clock reference.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: use runtime PM core

Use the runtime power management core to cause hardware suspend and
resume to occur. Enable it in ipa_clock_init() (without autosuspend),
and disable it in ipa_clock_exit().

Use ipa_runtime_suspend() as the ->runtime_suspend power operation,
and arrange for it to be called by having ipa_clock_get() call
pm_runtime_get_sync() when the first clock reference is taken.
Similarly, use ipa_runtime_resume() as the ->runtime_resume power
operation, and pm_runtime_put() when the last IPA clock reference
is dropped.

Introduce ipa_runtime_idle() as the ->runtime_idle power operation,
and have it return a non-zero value; this way suspend will never
occur except when forced.

Use pm_runtime_force_suspend() and pm_runtime_force_resume() as the
system suspend and resume callbacks, and remove ipa_suspend() and
ipa_resume().

Store a pointer to the device structure passed to ipa_clock_init(),
so it can be used by ipa_clock_exit() to disable runtime power
management.

For now we preserve IPA clock reference counting.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: resume in ipa_clock_get()

Introduce ipa_runtime_suspend() and ipa_runtime_resume(), which
encapsulate the activities necessary for suspending and resuming
the IPA hardware. Call these functions from ipa_clock_get() and
ipa_clock_put() when the first reference is taken or last one is
dropped.

When the very first clock reference is taken (for ipa_config()),
setup isn't complete yet, so (as before) only the core clock gets
enabled.

When the last clock reference is dropped (after ipa_deconfig()),
ipa_teardown() will have made the setup_complete flag false, so
there too, the core clock will be stopped without affecting GSI
or the endpoints.

Otherwise these new functions will perform the desired suspend and
resume actions once setup is complete.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: disable clock in suspend

Disable the IPA clock rather than dropping a reference to it in the
system suspend callback. This forces the suspend to occur without
affecting existing references.

Similarly, enable the clock rather than taking a reference in
ipa_resume(), forcing a resume without changing the reference count.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: have ipa_clock_get() return a value

We currently assume no errors occur when enabling or disabling the
IPA core clock and interconnects.  And although this commit exposes
errors that could occur, we generally assume this won't happen in
practice.

This commit changes ipa_clock_get() and ipa_clock_put() so each
returns a value.  The values returned are meant to mimic what the
runtime power management functions return, so we can set up error
handling here before we make the switch.  Have ipa_clock_get()
increment the reference count even if it returns an error, to match
the behavior of pm_runtime_get().

More details follow.

When taking a reference in ipa_clock_get(), return 0 for the first
reference, 1 for subsequent references, or a negative error code if
an error occurs.  Note that if ipa_clock_get() returns an error, we
must not touch hardware; in some cases such errors now cause entire
blocks of code to be skipped.

When dropping a reference in ipa_clock_put(), we return 0 or an
error code.  The error would come from ipa_clock_disable(), which
now returns what ipa_interconnect_disable() returns (either 0 or a
negative error code).  For now, callers ignore the return value;
if an error occurs, a message will have already been logged, and
little more can actually be done to improve the situation.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat.

2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout
fixup.

3) CLUSTERIP registers ARP hook on demand, from Florian.

4) Use clusterip_net to store pernet warning, also from Florian.

5) Remove struct netns_xt, from Florian Westphal.

6) Enable ebtables hooks in initns on demand, from Florian.

7) Allow to filter conntrack netlink dump per status bits,
from Florian Westphal.

8) Register x_tables hooks in initns on demand, from Florian.

9) Remove queue_handler from per-netns structure, again from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Support filtering interfaces on no master

Currently there's support for filtering neighbours/links for interfaces
which have a specific master device (using the IFLA_MASTER/NDA_MASTER
attributes).

This patch adds support for filtering interfaces/neighbours dump for
interfaces that *don't* have a master.

Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210810090658.2778960-1-lschlesinger@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: cls_api, reset flags on replay

tc_new_tfilter() can replay a request if it got EAGAIN. The cited commit
didn't account for this when it converted TC action ->init() API
to use flags instead of parameters. This can lead to passing stale flags
down the call chain which results in trying to lock rtnl when it's
already locked, deadlocking the entire system.

Fix by making sure to reset flags on each replay.

============================================
WARNING: possible recursive locking detected
5.14.0-rc3-custom-49011-g3d2bbb4f104d #447 Not tainted
--------------------------------------------
tc/37605 is trying to acquire lock:
ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_setup_cb_add+0x14b/0x4d0

but task is already holding lock:
ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_new_tfilter+0xb12/0x22e0

other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0
       ----
  lock(rtnl_mutex);
  lock(rtnl_mutex);

*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by tc/37605:
#0: ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_new_tfilter+0xb12/0x22e0

stack backtrace:
CPU: 0 PID: 37605 Comm: tc Not tainted 5.14.0-rc3-custom-49011-g3d2bbb4f104d #447
Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
Call Trace:
dump_stack_lvl+0x8b/0xb3
__lock_acquire.cold+0x175/0x3cb
lock_acquire+0x1a4/0x4f0
__mutex_lock+0x136/0x10d0
fl_hw_replace_filter+0x458/0x630 [cls_flower]
fl_change+0x25f2/0x4a64 [cls_flower]
tc_new_tfilter+0xa65/0x22e0
rtnetlink_rcv_msg+0x86c/0xc60
netlink_rcv_skb+0x14d/0x430
netlink_unicast+0x539/0x7e0
netlink_sendmsg+0x84d/0xd80
____sys_sendmsg+0x7ff/0x970
___sys_sendmsg+0xf8/0x170
__sys_sendmsg+0xea/0x1b0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7b93b6c0a7
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48>
RSP: 002b:00007ffe365b3818 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b93b6c0a7
RDX: 0000000000000000 RSI: 00007ffe365b3880 RDI: 0000000000000003
RBP: 00000000610a75f6 R08: 0000000000000001 R09: 0000000000000000
R10: fffffffffffff3a9 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007ffe365b7b58 R15: 00000000004822c0

Fixes: 695176bfe5de ("net_sched: refactor TC action init API")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20210810034305.63997-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
pull-request: mlx5-next 2020-08-9

This pulls mlx5-next branch which includes patches already reviewed on
net-next and rdma mailing lists.

1) mlx5 single E-Switch FDB for lag

2) IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq

3) Add DCS caps & fields support

[1] https://patchwork.kernel.org/project/netdevbpf/cover/20210803231959.26513-1-saeed@kernel.org/

[2] https://patchwork.kernel.org/project/netdevbpf/patch/0e3364dab7e0e4eea5423878b01aa42470be8d36.1626609184.git.leonro@nvidia.com/

[3] https://patchwork.kernel.org/project/netdevbpf/patch/55e1d69bef1fbfa5cf195c0bfcbe35c8019de35e.1624258894.git.leonro@nvidia.com/

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Lag, Create shared FDB when in switchdev mode
  net/mlx5: E-Switch, add logic to enable shared FDB
  net/mlx5: Lag, move lag destruction to a workqueue
  net/mlx5: Lag, properly lock eswitch if needed
  net/mlx5: Add send to vport rules on paired device
  net/mlx5: E-Switch, Add event callback for representors
  net/mlx5e: Use shared mappings for restoring from metadata
  net/mlx5e: Add an option to create a shared mapping
  net/mlx5: E-Switch, set flow source for send to uplink rule
  RDMA/mlx5: Add shared FDB support
  {net, RDMA}/mlx5: Extend send to vport rules
  RDMA/mlx5: Fill port info based on the relevant eswitch
  net/mlx5: Lag, add initial logic for shared FDB
  net/mlx5: Return mdev from eswitch
  IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq
  net/mlx5: Add DCS caps & fields support
====================

Link: https://lore.kernel.org/r/20210809202522.316930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: nf_queue: move hookfn registration out of struct net

This was done to detect when the pernet->init() function was not called
yet, by checking if net->nf.queue_handler is NULL.

Once the nfnetlink_queue module is active, all struct net pointers
contain the same address. So place this back in nf_queue.c.

Handle the 'netns error unwind' test by checking nfnl_queue_net for a
NULL pointer and add a comment for this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2021-08-10

We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).

1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.

2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
   32-bit MIPS JIT development, from Johan Almbladh.

3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.

4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.

5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.

6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.

7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
   and Muhammad Falak R Wani.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
  bpf, tests: Add tail call test suite
  bpf, tests: Add tests for BPF_CMPXCHG
  bpf, tests: Add tests for atomic operations
  bpf, tests: Add test for 32-bit context pointer argument passing
  bpf, tests: Add branch conversion JIT test
  bpf, tests: Add word-order tests for load/store of double words
  bpf, tests: Add tests for ALU operations implemented with function calls
  bpf, tests: Add more ALU64 BPF_MUL tests
  bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
  bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
  bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
  bpf, tests: Fix typos in test case descriptions
  bpf, tests: Add BPF_MOV tests for zero and sign extension
  bpf, tests: Add BPF_JMP32 test cases
  samples, bpf: Add an explict comment to handle nested vlan tagging.
  selftests/bpf: Add tests for XDP bonding
  selftests/bpf: Fix xdp_tx.c prog section name
  net, core: Allow netdev_lower_get_next_private_rcu in bh context
  bpf, devmap: Exclude XDP broadcast to master device
  net, bonding: Add XDP support to the bonding driver
  ...
====================

Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bpf, tests: Add tail call test suite

While BPF_CALL instructions were tested implicitly by the cBPF-to-eBPF
translation, there has not been any tests for BPF_TAIL_CALL instructions.
The new test suite includes tests for tail call chaining, tail call count
tracking and error paths. It is mainly intended for JIT development and
testing.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-15-johan.almbladh@anyfinetworks.com

bpf, tests: Add tests for BPF_CMPXCHG

Tests for BPF_CMPXCHG with both word and double word operands. As with
the tests for other atomic operations, these tests only check the result
of the arithmetic operation. The atomicity of the operations is not tested.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-14-johan.almbladh@anyfinetworks.com

bpf, tests: Add tests for atomic operations

Tests for each atomic arithmetic operation and BPF_XCHG, derived from
old BPF_XADD tests. The tests include BPF_W/DW and BPF_FETCH variants.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-13-johan.almbladh@anyfinetworks.com

bpf, tests: Add test for 32-bit context pointer argument passing

On a 32-bit architecture, the context pointer will occupy the low
half of R1, and the other half will be zero.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-12-johan.almbladh@anyfinetworks.com

bpf, tests: Add branch conversion JIT test

Some JITs may need to convert a conditional jump instruction to
to short PC-relative branch and a long unconditional jump, if the
PC-relative offset exceeds offset field width in the CPU instruction.
This test triggers such branch conversion on the 32-bit MIPS JIT.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-11-johan.almbladh@anyfinetworks.com

bpf, tests: Add word-order tests for load/store of double words

A double word (64-bit) load/store may be implemented as two successive
32-bit operations, one for each word. Check that the order of those
operations is consistent with the machine endianness.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-10-johan.almbladh@anyfinetworks.com

bpf, tests: Add tests for ALU operations implemented with function calls

32-bit JITs may implement complex ALU64 instructions using function calls.
The new tests check aspects related to this, such as register clobbering
and register argument re-ordering.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-9-johan.almbladh@anyfinetworks.com

bpf, tests: Add more ALU64 BPF_MUL tests

This patch adds BPF_MUL tests for 64x32 and 64x64 multiply. Mainly
testing 32-bit JITs that implement ALU64 operations with two 32-bit
CPU registers per operand.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-8-johan.almbladh@anyfinetworks.com

bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64

This patch adds a number of tests for BPF_LSH, BPF_RSH amd BPF_ARSH
ALU64 operations with values that may trigger different JIT code paths.
Mainly testing 32-bit JITs that implement ALU64 operations with two
32-bit CPU registers per operand.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-7-johan.almbladh@anyfinetworks.com

bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH

This patch adds more tests of ALU32 shift operations BPF_LSH and BPF_RSH,
including the special case of a zero immediate. Also add corresponding
BPF_ARSH tests which were missing for ALU32.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-6-johan.almbladh@anyfinetworks.com

bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations

This patch adds tests of BPF_AND, BPF_OR and BPF_XOR with different
magnitude of the immediate value. Mainly checking 32-bit JIT sub-word
handling and zero/sign extension.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-5-johan.almbladh@anyfinetworks.com

bpf, tests: Fix typos in test case descriptions

This patch corrects the test description in a number of cases where
the description differed from what was actually tested and expected.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-4-johan.almbladh@anyfinetworks.com

bpf, tests: Add BPF_MOV tests for zero and sign extension

Tests for ALU32 and ALU64 MOV with different sizes of the immediate
value. Depending on the immediate field width of the native CPU
instructions, a JIT may generate code differently depending on the
immediate value. Test that zero or sign extension is performed as
expected. Mainly for JIT testing.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-3-johan.almbladh@anyfinetworks.com

bpf, tests: Add BPF_JMP32 test cases

An eBPF JIT may implement JMP32 operations in a different way than JMP,
especially on 32-bit architectures. This patch adds a series of tests
for JMP32 operations, mainly for testing JITs.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-2-johan.almbladh@anyfinetworks.com

samples, bpf: Add an explict comment to handle nested vlan tagging.

A codeblock for handling nested vlan trips newbies into thinking it as
duplicate code. Explicitly add a comment to clarify.

Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809070046.32142-1-falakreyaz@gmail.com

Merge branch 'add-frag-page-support-in-page-pool'

Yunsheng Lin says:

====================
add frag page support in page pool

This patchset adds frag page support in page pool and
enable skb's page frag recycling based on page pool in
hns3 drvier.
====================

Link: https://lore.kernel.org/r/1628217982-53533-1-git-send-email-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: support skb's frag page recycling based on page pool

This patch adds skb's frag page recycling support based on
the frag page support in page pool.

The performance improves above 10~20% for single thread iperf
TCP flow with IOMMU disabled when iperf server and irq/NAPI
have a different CPU.

The performance improves about 135%(14Gbit to 33Gbit) for single
thread iperf TCP flow when IOMMU is in strict mode and iperf
server shares the same cpu with irq/NAPI.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

page_pool: add frag page recycling support in page pool

Currently page pool only support page recycling when there
is only one user of the page, and the split page reusing
implemented in the most driver can not use the page pool as
bing-pong way of reusing requires the multi user support in
page pool.

Those reusing or recycling has below limitations:
1. page from page pool can only be used be one user in order
   for the page recycling to happen.
2. Bing-pong way of reusing in most driver does not support
   multi desc using different part of the same page in order
   to save memory.

So add multi-users support and frag page recycling in page
pool to overcome the above limitation.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

page_pool: add interface to manipulate frag count in page pool

For 32 bit systems with 64 bit dma, dma_addr[1] is used to
store the upper 32 bit dma addr, those system should be rare
those days.

For normal system, the dma_addr[1] in 'struct page' is not
used, so we can reuse dma_addr[1] for storing frag count,
which means how many frags this page might be splited to.

In order to simplify the page frag support in the page pool,
the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
the 32 bit systems with 64 bit dma, and the page frag support
in page pool is disabled for such system.

The newly added page_pool_set_frag_count() is called to reserve
the maximum frag count before any page frag is passed to the
user. The page_pool_atomic_sub_frag_count_return() is called
when user is done with the page frag.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

page_pool: keep pp info as long as page pool owns the page

Currently, page->pp is cleared and set everytime the page
is recycled, which is unnecessary.

So only set the page->pp when the page is added to the page
pool and only clear it when the page is released from the
page pool.

This is also a preparation to support allocating frag page
in page pool.

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/bpf: Add tests for XDP bonding

Add a test suite to test XDP bonding implementation over a pair of
veth devices.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210731055738.16820-8-joamaki@gmail.com

selftests/bpf: Fix xdp_tx.c prog section name

The program type cannot be deduced from 'tx' which causes an invalid
argument error when trying to load xdp_tx.o using the skeleton.
Rename the section name to "xdp" so that libbpf can deduce the type.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210731055738.16820-7-joamaki@gmail.com

net, core: Allow netdev_lower_get_next_private_rcu in bh context

For the XDP bonding slave lookup to work in the NAPI poll context in which
the redudant rcu_read_lock() has been removed we have to follow the same
approach as in 694cea395fde ("bpf: Allow RCU-protected lookups to happen
from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held().

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com

bpf, devmap: Exclude XDP broadcast to master device

If the ingress device is bond slave, do not broadcast back through it or
the bond master.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-5-joamaki@gmail.com

net, bonding: Add XDP support to the bonding driver

XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.

Semantics of XDP_TX when attached to a bond are equivalent in such
setting to the case when a tc/BPF program would be attached to the
bond, meaning transmitting the packet out of the bond itself using one
of the bond's configured xmit methods to select a slave device (rather
than XDP_TX on the slave itself). Handling of XDP_TX to transmit
using the configured bonding mechanism is therefore implemented by
rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
performance impact this check is guarded by a static key, which is
incremented when a XDP program is loaded onto a bond device. This
approach was chosen to avoid changes to drivers implementing XDP. If
the slave device does not match the receive device, then XDP_REDIRECT
is transparently used to perform the redirection in order to have
the network driver release the packet from its RX ring. The bonding
driver hashing functions have been refactored to allow reuse with
xdp_buff's to avoid code duplication.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters. An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally the bonding round robin algorithm was modified
to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
of cache misses were caused by the shared rr_tx_counter (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].

-----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
-----------------------
with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
-----------------------
with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
--------------------------------------------------------------

RSS: Receive Side Scaling, e.g. the packets were sent to a range of
destination IPs.

  [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
  [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
  [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
  [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com

net, core: Add support for XDP redirection to slave device

This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.

The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com

net, bonding: Refactor bond_xmit_hash for use with xdp_buff

In preparation for adding XDP support to the bonding driver
refactor the packet hashing functions to be able to work with
any linear data buffer without an skb.

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com

devlink: Fix port_type_set function pointer check

Fix a typo when checking existence of port_type_set function pointer.

Fixes: 82564f6c706a ("devlink: Simplify devlink port API calls")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fec: fix build error for ARCH m68k

reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
make.cross ARCH=m68k  m5272c3_defconfig
make.cross ARCH=m68k

   drivers/net/ethernet/freescale/fec_main.c: In function 'fec_enet_eee_mode_set':
>> drivers/net/ethernet/freescale/fec_main.c:2758:33: error: 'FEC_LPI_SLEEP' undeclared (first use in this function); did you mean 'FEC_ECR_SLEEP'?
    2758 |  writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
         |                                 ^~~~~~~~~~~~~
   arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel'
      25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b))
         |                                                                  ^~~~
   drivers/net/ethernet/freescale/fec_main.c:2758:2: note: in expansion of macro 'writel'
    2758 |  writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
         |  ^~~~~~
   drivers/net/ethernet/freescale/fec_main.c:2758:33: note: each undeclared identifier is reported only once for each function it appears in
    2758 |  writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
         |                                 ^~~~~~~~~~~~~
   arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel'
      25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b))
         |                                                                  ^~~~
   drivers/net/ethernet/freescale/fec_main.c:2758:2: note: in expansion of macro 'writel'
    2758 |  writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
         |  ^~~~~~
>> drivers/net/ethernet/freescale/fec_main.c:2759:32: error: 'FEC_LPI_WAKE' undeclared (first use in this function)
    2759 |  writel(wake_cycle, fep->hwp + FEC_LPI_WAKE);
         |                                ^~~~~~~~~~~~
   arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel'
      25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b))
         |                                                                  ^~~~
   drivers/net/ethernet/freescale/fec_main.c:2759:2: note: in expansion of macro 'writel'
    2759 |  writel(wake_cycle, fep->hwp + FEC_LPI_WAKE);
         |  ^~~~~~

This patch adds register definition for M5272 platform to pass build.

Fixes: b82f8c3f1409 ("net: fec: add eee mode tx lpi support")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/smc: Allow SMC-D 1MB DMB allocations

Commit a3fe3d01bd0d7 ("net/smc: introduce sg-logic for RMBs") introduced
a restriction for RMB allocations as used by SMC-R. However, SMC-D does
not use scatter-gather lists to back its DMBs, yet it was limited by
this restriction, still.
This patch exempts SMC, but limits allocations to the maximum RMB/DMB
size respectively.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Set device as early as possible

All kernel devlink implementations call to devlink_alloc() during
initialization routine for specific device which is used later as
a parent device for devlink_register().

Such late device assignment causes to the situation which requires us to
call to device_register() before setting other parameters, but that call
opens devlink to the world and makes accessible for the netlink users.

Any attempt to move devlink_register() to be the last call generates the
following error due to access to the devlink->dev pointer.

[    8.758862]  devlink_nl_param_fill+0x2e8/0xe50
[    8.760305]  devlink_param_notify+0x6d/0x180
[    8.760435]  __devlink_params_register+0x2f1/0x670
[    8.760558]  devlink_params_register+0x1e/0x20

The simple change of API to set devlink device in the devlink_alloc()
instead of devlink_register() fixes all this above and ensures that
prior to call to devlink_register() everything already set.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

wwan: mhi: Fix missing spin_lock_init() in mhi_mbim_probe()

The driver allocates the spinlock but not initialize it.
Use spin_lock_init() on it to initialize it correctly.

Fixes: aa730a9905b7 ("net: wwan: Add MHI MBIM network driver")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'iucv-next'

Karsten Graul says:

====================
net/iucv: updates 2021-08-09

Please apply the following iucv patches to netdev's net-next tree.

Remove the usage of register asm statements and replace deprecated
CPU-hotplug functions with the current version.
Use use consume_skb() instead of kfree_skb() to avoid flooding
dropwatch with false-positives, and 2 patches with cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/iucv: Replace deprecated CPU-hotplug functions.

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/iucv: get rid of register asm usage

Using register asm statements has been proven to be very error prone,
especially when using code instrumentation where gcc may add function
calls, which clobbers register contents in an unexpected way.

Therefore get rid of register asm statements in iucv code, even though
there is currently nothing wrong with it. This way we know for sure
that the above mentioned bug class won't be introduced here.

Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/af_iucv: remove wrappers around iucv (de-)registration

These wrappers are just unnecessary obfuscation.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/af_iucv: clean up a try_then_request_module()

Use IS_ENABLED(CONFIG_IUCV) to determine whether the iucv_if symbol
is available, and let depmod deal with the module dependency.

This was introduced back with commit 6fcd61f7bf5d ("af_iucv: use
loadable iucv interface"). And to avoid sprinkling IS_ENABLED() over
all the code, we're keeping the indirection through pr_iucv->...().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/af_iucv: support drop monitoring

Change the good paths to use consume_skb() instead of kfree_skb(). This
avoids flooding dropwatch with false-positives.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-fast-ageing'

Vladimir Oltean says:

====================
DSA fast ageing fixes/improvements

These are 2 small improvements brought to the DSA fast ageing changes
merged earlier today.

Patch 1 restores the behavior for DSA drivers that don't implement the
.port_bridge_flags function (I don't think there is any breakage due
to the new behavior, but just to be sure). This came as a result of
Andrew's review.

Patch 2 reduces the number of fast ages of a port from 2 to 1 when it
leaves a bridge.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: avoid fast ageing twice when port leaves a bridge

Drivers that support both the toggling of address learning and dynamic
FDB flushing (mv88e6xxx, b53, sja1105) currently need to fast-age a port
twice when it leaves a bridge:

- once, when del_nbp() calls br_stp_disable_port() which puts the port
in the BLOCKING state
- twice, when dsa_port_switchdev_unsync_attrs() calls
dsa_port_clear_brport_flags() which disables address learning

The knee-jerk reaction might be to say "dsa_port_clear_brport_flags does
not need to fast-age the port at all", but the thing is, we still need
both code paths to flush the dynamic FDB entries in different situations.
When a DSA switch port leaves a bonding/team interface that is (still) a
bridge port, no del_nbp() will be called, so we rely on
dsa_port_clear_brport_flags() function to restore proper standalone port
functionality with address learning disabled.

So the solution is just to avoid double the work when both code paths
are called in series. Luckily, DSA already caches the STP port state, so
we can skip flushing the dynamic FDB when we disable address learning
and the STP state is one where no address learning takes place at all.
Under that condition, not flushing the FDB is safe because there is
supposed to not be any dynamic FDB entry at all (they were flushed
during the transition towards that state, and none were learned in the
meanwhile).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: still fast-age ports joining a bridge if they can't configure learning

Commit 39f32101543b ("net: dsa: don't fast age standalone ports")
assumed that all standalone ports disable address learning, but if the
switch driver implements .port_fast_age but not .port_bridge_flags (like
ksz9477, ksz8795, lantiq_gswip, lan9303), then that might not actually
be true.

So whereas before, the bridge temporarily walking us through the
BLOCKING STP state meant that the standalone ports had a checkpoint to
flush their baggage and start fresh when they join a bridge, after that
commit they no longer do.

Restore the old behavior for these drivers by checking if the switch can
toggle address learning. If it can't, disregard the "do_fast_age"
argument and unconditionally perform fast ageing on STP state changes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: x_tables: never register tables by default

For historical reasons x_tables still register tables by default in the
initial namespace.
Only newly created net namespaces add the hook on demand.

This means that the init_net always pays hook cost, even if no filtering
rules are added (e.g. only used inside a single netns).

Note that the hooks are added even when 'iptables -L' is called.
This is because there is no way to tell 'iptables -A' and 'iptables -L'
apart at kernel level.

The only solution would be to register the table, but delay hook
registration until the first rule gets added (or policy gets changed).

That however means that counters are not hooked either, so 'iptables -L'
would always show 0-counters even when traffic is flowing which might be
unexpected.

This keeps table and hook registration consistent with what is already done
in non-init netns: first iptables(-save) invocation registers both table
and hooks.

This applies the same solution adopted for ebtables.
All tables register a template that contains the l3 family, the name
and a constructor function that is called when the initial table has to
be added.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge branch 'sja1105-fast-ageing'

Vladimir Oltean says:

====================
Fast ageing support for SJA1105 DSA driver

While adding support for flushing dynamically learned FDB entries in the
sja1105 driver, I noticed a few things that could be improved in DSA.
Most notably, drivers could omit a fast age when address learning is
turned off, which might mean that ports leaving a bridge and becoming
standalone could still have FDB entries pointing towards them. Secondly,
when DSA fast ages a port after the 'learning' flag has been turned off,
the software bridge still has the dynamically learned 'master' FDB
entries installed, and those should be deleted too.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: sja1105: add FDB fast ageing support

Delete the dynamically learned FDB entries when the STP state changes
and when address learning is disabled.

On sja1105 there is no shorthand SPI command for this, so we need to
walk through the entire FDB to delete.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: sja1105: rely on DSA core tracking of port learning state

Now that DSA keeps track of the port learning state, it becomes
superfluous to keep an additional variable with this information in the
sja1105 driver. Remove it.

The DSA core's learning state is present in struct dsa_port *dp.
To avoid the antipattern where we iterate through a DSA switch's
ports and then call dsa_to_port to obtain the "dp" reference (which is
bad because dsa_to_port iterates through the DSA switch tree once
again), just iterate through the dst->ports and operate on those
directly.

The sja1105 had an extra use of priv->learn_ena on non-user ports. DSA
does not touch the learning state of those ports - drivers are free to
do what they wish on them. Mark that information with a comment in
struct dsa_port and let sja1105 set dp->learning for cascade ports.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: flush the dynamic FDB of the software bridge when fast ageing a port

Currently, when DSA performs fast ageing on a port, 'bridge fdb' shows
us that the 'self' entries (corresponding to the hardware bridge, as
printed by dsa_slave_fdb_dump) are deleted, but the 'master' entries
(corresponding to the software bridge) aren't.

Indeed, searching through the bridge driver, neither the
brport_attr_learning handler nor the IFLA_BRPORT_LEARNING handler call
br_fdb_delete_by_port. However, br_stp_disable_port does, which is one
of the paths which DSA uses to trigger a fast ageing process anyway.

There is, however, one other very promising caller of
br_fdb_delete_by_port, and that is the bridge driver's handler of the
SWITCHDEV_FDB_FLUSH_TO_BRIDGE atomic notifier. Currently the s390/qeth
HiperSockets card driver is the only user of this.

I can't say I understand that driver's architecture or interaction with
the bridge, but it appears to not be a switchdev driver in the traditional
sense of the word. Nonetheless, the mechanism it provides is a useful
way for DSA to express the fact that it performs fast ageing too, in a
way that does not change the existing behavior for other drivers.

Cc: Alexandra Winter <wintera@linux.ibm.com>
Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: don't fast age bridge ports with learning turned off

On topology changes, stations that were dynamically learned on ports
that are no longer part of the active topology must be flushed - this is
described by clause "17.11 Updating learned station location information"
of IEEE 802.1D-2004.

However, when address learning on the bridge port is turned off in the
first place, there is nothing to flush, so skip a potentially expensive
operation.

We can finally do this now since DSA is aware of the learning state of
its bridged ports.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: centralize fast ageing when address learning is turned off

Currently DSA leaves it down to device drivers to fast age the FDB on a
port when address learning is disabled on it. There are 2 reasons for
doing that in the first place:

- when address learning is disabled by user space, through
  IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user
  space typically wants to achieve is to operate in a mode with no
  dynamic FDB entry on that port. But if the port is already up, some
  addresses might have been already learned on it, and it seems silly to
  wait for 5 minutes for them to expire until something useful can be
  done.

- when a port leaves a bridge and becomes standalone, DSA turns off
  address learning on it. This also has the nice side effect of flushing
  the dynamically learned bridge FDB entries on it, which is a good idea
  because standalone ports should not have bridge FDB entries on them.

We let drivers manage fast ageing under this condition because if DSA
were to do it, it would need to track each port's learning state, and
act upon the transition, which it currently doesn't.

But there are 2 reasons why doing it is better after all:

- drivers might get it wrong and not do it (see b53_port_set_learning)

- we would like to flush the dynamic entries from the software bridge
  too, and letting drivers do that would be another pain point

So track the port learning state and trigger a fast age process
automatically within DSA.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>