git.osdn.net Git - tomoyo/tomoyo-test1.git/log

net/mlx5: DR, Keep track of hot ICM chunks in an array instead of list

When ICM chunk is freed, it might still be accessed by HW until we do
sync with HW. This sync is expensive operation, so we don't do it often.
Instead, when the chunk is freed, it is moved to the buddy's "hot memory"
list. Once sync is done, we traverse the hot list and finally free all
the chunks.

It appears that traversing a long list takes unusually long time due to cache
misses on many entries, which causes a big "hiccup" during rule insertion.

This patch deals with this issue the following way:
- Move hot chunks list from buddy to pool, so that the pool will
   keep track of all its hot memory.
- Replace the list with pre-allocated array on the memory pool struct,
   and store only the information that is needed to later free this
   chunk in its buddy allocator.
   This cost additional memory for the array that is dynamically
   allocated, but it allows not to save long list of hot chunks,
   so at peak times it actually saves memory due to the fact that
   each array entry is much smaller than the chunk struct.

This way an overhead of traversing the long list is virtually removed:
the loop of freeing hot chunks takes ~27 msec instead of ~70 msec, where
most of it are the actual freeing activities.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Lower sync threshold for ICM hot memory

Instead of hiding the math in the code, define a value that sets the
fraction of allowed hot memory of ICM pool.
Set the threshold for sync of ICM hot chunks to 1/4 of the pool
instead of 1/2 of the pool. Although we will have more syncs, each
sync will be shorter and will help with insertion rate stability.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Allocate htbl from its own slab allocator

SW steering allocates/frees lots of htbl structs. Create a
separate kmem_cache and allocate htbls from this allocator.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Allocate icm_chunks from their own slab allocator

SW steering allocates/frees lots of icm_chunk structs. To make this more
efficiently, create a separate kmem_cache and allocate these chunks from
this allocator.
By doing this we observe that the alloc/free "hiccups" frequency has
become much lower, which allows for a more steady rule insersion rate.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Manage STE send info objects in pool

Instead of allocating/freeing send info objects dynamically, manage them
in pool. The number of send info objects doesn't depend on rules, so after
pre-populating the pool with an initial batch of send info objects, the
pool is not expected to grow.
This way we save alloc/free during writing STEs to ICM, which can
sometimes take up to 40msec.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, In rehash write the line in the entry immediately

Don't wait for the whole table to be ready - write each row immediately.
This way we save allocations of the ste_send_info structure and improve
performance.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Handle domain memory resources init/uninit separately

Handle creation/destruction of all the domain's memory pools and other
memory-related fields in a separate init/uninit functions.
This simplifies error flow and allows cleaner addition of new pools.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Initialize chunk's ste_arrays at chunk creation

Rather than cleaning the corresponding chunk's section of ste_arrays on
chunk deletion, initialize these areas upon chunk creation.

Chunk destruction tend to come in large batches (during pool syncing).
To reduce the "hiccup" in such cases, moving ste_arrays init from chunk
destruction to initialization.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, For short chains of STEs, avoid allocating ste_arr dynamically

While creating rule, ste_arr is an array that is allocated at the start
of the function and freed at the end.
This memory allocation can sometimes lead to "hiccups" of up to 10ms.
However, the common use case is short chains of STEs. For such cases,
we can use a local buffer on stack instead.

Changes in v2:
Use small local array for short rules, allocate dynamically for long rules

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Remove unneeded argument from dr_icm_chunk_destroy

Remove an argument that can be extracted in the function.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Check device state when polling CQ

Calling fast teardown as part of the normal unloading caused
a problem with SW steering - SW steering still needs to clear
its tables, write to ICM and poll for completions.
When teardown has been done, SW steering keeps polling the CQ
forever, because nobody flushes it.

This patch fixes the issue by checking the device state in
cases where no CQE was returned.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Fix the SMFS sync_steering for fast teardown

If sync happens when the device is in fast teardown, just bail
and don't do anything, because the PCI device is not there any more.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, In destroy flow, free resources even if FW command failed

Otherwise resources will never be freed and refcount will not be decreased.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

skbuff: Proactively round up to kmalloc bucket size

Instead of discovering the kmalloc bucket size _after_ allocation, round
up proactively so the allocation is explicitly made for the full size,
allowing the compiler to correctly reason about the resulting size of
the buffer through the existing __alloc_size() hint.

This will allow for kernels built with CONFIG_UBSAN_BOUNDS or the
coming dynamic bounds checking under CONFIG_FORTIFY_SOURCE to gain
back the __alloc_size() hints that were temporarily reverted in commit
93dd04ab0b2b ("slab: remove __alloc_size attribute from __kmalloc_track_caller")

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20221021234713.you.031-kees@kernel.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221025223811.up.360-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-ipa-don-t-use-fixed-table-sizes'

Alex Elder says:

====================
net: ipa: don't use fixed table sizes

Currently, routing and filter tables are assumed to have a fixed
size for all platforms. In fact, these tables can support many more
entries than what has been assumed; the only limitation is the size
of the IPA-resident memory regions that contain them.

This series rearranges things so that the size of the table is
determined from the memory region size defined in configuration
data, rather than assuming it is fixed.

This will required for IPA versions 5.0+, where the number of
entries in a routing table is larger.
====================

Link: https://lore.kernel.org/r/20221025195143.255934-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: determine filter table size from memory region

Currently we assume that any filter table contains a fixed number
of entries.  Like routing tables, the number of entries in a filter
table is limited only by the size of the IPA-local memory region
used to hold the table.

Stop assuming that a filter table has exactly 14 entries.  Instead,
determine the number of entries in a routing table by dividing its
memory region size by the size of an entry.  (Note that the first
"entry" in a filter table contains an endpoint bitmap.)

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: don't assume 8 modem routing table entries

Currently all platforms are assumed allot 8 routing table entries
for use by the modem. Instead, add a new configuration data entry
that defines the number of modem routing table entries, and record
that in the IPA structure.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: determine route table size from memory region

Currently we assume that any routing table contains a fixed number
of entries. The number of entries in a routing table can actually
vary, depending only on the size of the IPA-local memory region used
to hold the table.

Stop assuming that a routing table has exactly 15 entries. Instead,
determine the number of entries in a routing table by dividing its
memory region size by the size of an entry.

The number of entries is computed early, when ipa_table_mem_valid()
is called by ipa_table_init().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: record the route table size in the IPA structure

The non-hashed routing tables for IPv4 and IPv6 will be the same
size. And if supported, the hashed routing tables will be the same
size as the non-hashed tables.

Record the size (number of entries) of all routing tables in the IPA
structure. For now, initialize this field using IPA_ROUTE_TABLE_MAX,
and just do so when the first route table is validated.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth: fealnx: delete the driver for Myson MTD-800

The git history for this driver seems to be completely
automated / tree wide changes. I can't find any boards
or systems which would use this chip. Google search
shows pictures of towel warmers and no networking products.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20221025184254.1717982-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ice: Add support Flex RXD

Add new VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC flag, opcode
VIRTCHNL_OP_GET_SUPPORTED_RXDIDS and add member rxdid
in struct virtchnl_rxq_info to support AVF Flex RXD
extension.

Add support to allow VF to query flexible descriptor RXDIDs supported
by DDP package and configure Rx queues with selected RXDID for IAVF.

Add code to allow VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message to be
processed. Add necessary macros for registers.

Signed-off-by: Leyi Rong <leyi.rong@intel.com>
Signed-off-by: Xu Ting <ting.xu@intel.com>
Signed-off-by: Michal Jaron <michalx.jaron@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20221025161252.1952939-1-jacob.e.keller@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: broadcom: bcm4908_enet: use build_skb()

RX code can be more efficient with the build_skb(). Allocating actual
SKB around eth packet buffer - right before passing it up - results in
a better cache usage.

Without RPS (echo 0 > rps_cpus) BCM4908 NAT masq performance "jumps"
between two speeds: ~900 Mbps and 940 Mbps (it's a 4 CPUs SoC). This
change bumps the lower speed from 905 Mb/s to 918 Mb/s (tested using
single stream iperf 2.0.5 traffic).

There are more optimizations to consider. One obvious to try is GRO
however as BCM4908 doesn't do hw csum is may actually lower performance.
Sometimes. Some early testing:

┌─────────────────────────────────┬─────────────────────┬────────────────────┐
│                                 │ netif_receive_skb() │ napi_gro_receive() │
├─────────────────────────────────┼─────────────────────┼────────────────────┤
│ netdev_alloc_skb()              │            905 Mb/s │           892 Mb/s │
│ napi_alloc_frag() + build_skb() │            918 Mb/s │           917 Mb/s │
└─────────────────────────────────┴─────────────────────┴────────────────────┘

Another ideas:
1. napi_build_skb()
2. skb_copy_from_linear_data() for small packets

Those need proper testing first though. That can be done later.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Link: https://lore.kernel.org/r/20221025132245.22871-1-zajec5@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dp83822: Print the SOR1 strap status

During the bring-up of the Ethernet PHY, it is very useful to
see the bootstrap status information, as it can help identifying
hardware bootstrap mistakes.

Allow printing the SOR1 register, which contains the strap status
to ease the bring-up.

Signed-off-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20221025120109.779337-1-festevam@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bond: Disable TLS features indication

Bond agnostically interacts with TLS device-offload requests via the
.ndo_sk_get_lower_dev operation. Return value is true iff bond
guarantees fixed mapping between the TLS connection and a lower netdev.

Due to this nature, the bond TLS device offload features are not
explicitly controllable in the bond layer. As of today, these are
read-only values based on the evaluation of bond_sk_check(). However,
this indication might be incorrect and misleading, when the feature bits
are "fixed" by some dependency features. For example,
NETIF_F_HW_TLS_TX/RX are forcefully cleared in case the corresponding
checksum offload is disabled. But in fact the bond ability to still
offload TLS connections to the lower device is not hurt.

This means that these bits can not be trusted, and hence better become
unused.

This patch revives some old discussion [1] and proposes a much simpler
solution: Clear the bond's TLS features bits. Everyone should stop
reading them.

[1] https://lore.kernel.org/netdev/20210526095747.22446-1-tariqt@nvidia.com/

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221025105300.4718-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: remove duplicate dma queue channel macros

It doesn't need extra macros for queue 0 & 4. Same macro could
be used for all 8 queues. Related queue/channel functions could
be combined together.

Original macro which has two same parameters is unsafe macro and
might have potential side effects. Each MTL RxQ DMA channel mask
is 4 bits, so using (0xf << chan) instead of GENMASK(x + 3, x) to
avoid unsafe macro.

Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
Link: https://lore.kernel.org/r/20221025081747.1884926-1-junxiao.chang@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests: tc-testing: Add matchJSON to tdc

This allows the use of a matchJSON field in tests to match
against JSON output from the command under test, if that
command outputs JSON.

You specify what you want to match against as a JSON array
or object in the test's matchJSON field. You can leave out
any fields you don't want to match against that are present
in the output and they will be skipped.

An example matchJSON value would look like this:

"matchJSON": [
  {
    "Value": {
      "neighIP": {
        "family": 4,
        "addr": "AQIDBA==",
        "width": 32
      },
      "nsflags": 142,
      "ncflags": 0,
      "LLADDR": "ESIzRFVm"
    }
  }
]

The real output from the command under test might have some
extra fields that we don't care about for matching, and
since we didn't include them in our matchJSON value, those
fields will not be attempted to be matched. If everything
we included above has the same values as the real command
output, the test will pass.

The matchJSON field's type must be the same as the command
output's type, otherwise the test will fail. So if the
command outputs an array, then the value of matchJSON must
also be an array.

If matchJSON is an array, it must not contain more elements
than the command output's array, otherwise the test will
fail.

Signed-off-by: Jeremy Carter <jeremy@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20221024111603.2185410-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ave: Remove duplicate phy_resume() calls

ave_open() in ave_resume() executes __phy_resume() via phy_start(), so no
need to call phy_resume() explicitly. Remove it.

Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Link: https://lore.kernel.org/r/20221024072314.24969-1-hayashi.kunihiko@socionext.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: remove variable total_copied

Variable total_copied is just being incremented and it's never used
anywhere else. The variable and the increment are redundant so
remove it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20221024135046.2159523-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

phylink: require valid state argument to phylink_validate_mask_caps()

state is deferenced earlier in the function, the NULL check
is pointless. Since we don't have any crash reports presumably
it's safe to assume state is not NULL.

Fixes: f392a1846489 ("net: phylink: provide phylink_validate_mask_caps() helper")
Reviewed-by: Sean Anderson <sean.anderson@seco.com>
Link: https://lore.kernel.org/r/20221025185126.1720553-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix tracking issue in mptcp_subflow_create_socket()

My recent patch missed that mptcp_subflow_create_socket()
was creating a 'kernel' socket, then converted it to 'user' socket.

Fixes: 0cafd77dcd03 ("net: add a refcount tracker for kernel sockets")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20221025180546.652251-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ieee802154-for-net-next-2022-10-26' of git://git./linux/kernel/git/sschmidt/wpan-next

Re-pull from Stefan to fix the warnings.

Stefan Schmidt says:

====================
pull-request v2: ieee802154-next 2022-10-26

* tag 'ieee802154-for-net-next-2022-10-26' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next:
net: mac802154: Fixup function parameter name in docs
====================

Link: https://lore.kernel.org/r/20221026075638.578840-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ieee802154-for-net-next-2022-10-25' of git://git./linux/kernel/git/sschmidt/wpan-next

Stefan Schmidt says:

====================

==
One of the biggest cycles for ieee802154 in a long time. We are landing the
first pieces of a big enhancements in managing PAN's. We might have another pull
request ready for this cycle later on, but I want to get this one out first.

Miquel Raynal added support for sending frames synchronously as a dependency
to handle MLME commands. Also introducing more filtering levels to match with
the needs of a device when scanning or operating as a pan coordinator.
To support development and testing the hwsim driver for ieee802154 was also
enhanced for the new filtering levels and to update the PIB attributes.

Alexander Aring fixed quite a few bugs spotted during reviewing changes. He
also added support for TRAC in the atusb driver to have better failure
handling if the firmware provides the needed information.

Jilin Yuan fixed a comment with a repeated word in it.
==================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
this is a pull request of 29 patches for net-next/master.

The first patch is by Daniel S. Trevitz and adds documentation for
switchable termination resistors.

Zhang Changzhong's patch fixes a debug output in the j13939 stack.

Oliver Hartkopp finally removes the pch_can driver, which is
superseded by the generic c_can driver.

Gustavo A. R. Silva replaces a zero-length array with
DECLARE_FLEX_ARRAY() in the ucan driver.

Kees Cook's patch removes a no longer needed silencing of
"-Warray-bounds" warnings for the kvaser_usb driver.

The next 2 patches target the m_can driver. The first is by me cleans
up the LEC error handling, the second is by Vivek Yadav and extends
the LEC error handling to the data phase of CAN-FD frames.

The next 9 patches all target the gs_usb driver. The first 5 patches
are by me and improve the Kconfig prompt and help text, set
netdev->dev_id to distinguish multi CAN channel devices, allow
loopback and listen only at the same time, and clean up the
gs_can_open() function a bit. The remaining 4 patches are by Jeroen
Hofstee and add support for 2 new features: Bus Error Reporting and
Get State.

Jimmy Assarsson and Anssi Hannula contribute 10 patches for the
kvaser_usb driver. They first add Listen Only and Bus Error Reporting
support, handle CMD_ERROR_EVENT errors, improve CAN state handling,
restart events, and configuration of the bit timing parameters.

Another patch by me which fixes the indention in the m_can driver.

A patch by Dongliang Mu cleans up the ucan_disconnect() function in
the ucan driver.

The last patch by Biju Das is for the rcan_canfd driver and cleans up
the reset handling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

can: rcar_canfd: Use devm_reset_control_get_optional_exclusive

Replace devm_reset_control_get_exclusive->devm_reset_control_
get_optional_exclusive so that we can avoid unnecessary
SoC specific check in probe().

Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/all/20221025155657.1426948-4-biju.das.jz@bp.renesas.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: ucan: ucan_disconnect(): change unregister_netdev() to unregister_candev()

From API pairing, change unregister_netdev() to unregister_candev()
since the registration function is register_candev(). Actually, they
are the same.

Signed-off-by: Dongliang Mu <dzm91@hust.edu.cn>
Link: https://lore.kernel.org/all/20221024110033.727542-1-dzm91@hust.edu.cn
[mkl: adjust subject + commit message]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: m_can: use consistent indention

The driver uses indent-with-tab for defines. Replace spaces after
IR_ERR_BUS_31X with tab for consistent indention.

Link: https://lore.kernel.org/all/20221024185544.68240-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge patch series "can: kvaser_usb: Fixes and improvements"

Jimmy Assarsson <extja@kvaser.com> says:

Split v4 series, since it got rejected [1]. This part only contains
non-critical fixes and improvements.

Note: This series depend on changes in [2].

Changes in v5:
- Split v4 series, since it got rejected [1].
   This part only contains non-critical fixes and improvements.

Changes in v4: https://lore.kernel.org/all/20220903182344.139-1-extja@kvaser.com
- Add Tested-by: Anssi Hannula to
   [PATCH v4 04/15] can: kvaser_usb: kvaser_usb_leaf: Get capabilities from device
- Update commit message in
   [PATCH v4 04/15] can: kvaser_usb: kvaser_usb_leaf: Get capabilities from device

Changes in v3: https://lore.kernel.org/all/20220901122729.271-1-extja@kvaser.com
- Rebase on top of commit
   1d5eeda23f36 ("can: kvaser_usb: advertise timestamping capabilities and add ioctl support")
- Add Tested-by: Anssi Hannula
- Add stable@vger.kernel.org to CC.
- Add my S-o-b to all patches
- Fix regression introduced in
   [PATCH v2 04/15] can: kvaser_usb: kvaser_usb_leaf: Get capabilities from device
   found by Anssi Hannula
   https://lore.kernel.org/all/b25bc059-d776-146d-0b3c-41aecf4bd9f8@bitwise.fi

v2: https://lore.kernel.org/all/20220708115709.232815-1-extja@kvaser.com

v1: https://lore.kernel.org/all/20220516134748.3724796-1-anssi.hannula@bitwise.fi

[1] https://lore.kernel.org/all/20220920122246.00dbe946@kernel.org
[2] https://lore.kernel.org/all/20221010150829.199676-1-extja@kvaser.com

Link: https://lore.kernel.org/all/20221010185237.319219-1-extja@kvaser.com
[mkl: move "1/1 can: kvaser_usb: Fix possible completions during init_completion" to linux-can tree]
[mkl: improve change log, update links]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb: Compare requested bittiming parameters with actual parameters in do_set_{,data}_bittiming

The device will respond with a CMD_ERROR_EVENT command, with error_code
KVASER_USB_{LEAF,HYDRA}_ERROR_EVENT_PARAM, if the CMD_SET_BUSPARAMS_REQ
contains invalid bittiming parameters.
However, this command does not contain any channel reference.

To check if the CMD_SET_BUSPARAMS_REQ was successful, redback and compare
the requested bittiming parameters with the device reported parameters.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Fixes: aec5fb2268b7 ("can: kvaser_usb: Add support for Kvaser USB hydra family")
Tested-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Co-developed-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-12-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb: Add struct kvaser_usb_busparams

Add struct kvaser_usb_busparams containing the busparameters used in
CMD_{SET,GET}_BUSPARAMS* commands.

Tested-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-11-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb_leaf: Fix bogus restart events

When auto-restart is enabled, the kvaser_usb_leaf driver considers
transition from any state >= CAN_STATE_BUS_OFF as a bus-off recovery
event (restart).

However, these events may occur at interface startup time before
kvaser_usb_open() has set the state to CAN_STATE_ERROR_ACTIVE, causing
restarts counter to increase and CAN_ERR_RESTARTED to be sent despite no
actual restart having occurred.

Fix that by making the auto-restart condition checks more strict so that
they only trigger when the interface was actually in the BUS_OFF state.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Jimmy Assarsson <extja@kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-10-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb_leaf: Ignore stale bus-off after start

With 0bfd:0124 Kvaser Mini PCI Express 2xHS FW 4.18.778 it was observed
that if the device was bus-off when stopped, at next start (either via
interface down/up or manual bus-off restart) the initial
CMD_CHIP_STATE_EVENT received just after CMD_START_CHIP_REPLY will have
the M16C_STATE_BUS_OFF bit still set, causing the interface to
immediately go bus-off again.

The bit seems to internally clear quickly afterwards but we do not get
another CMD_CHIP_STATE_EVENT.

Fix the issue by ignoring any initial bus-off state until we see at
least one bus-on state. Also, poll the state periodically until that
occurs.

It is possible we lose one actual immediately occurring bus-off event
here in which case the HW will auto-recover and we see the recovery
event. We will then catch the next bus-off event, if any.

This issue did not reproduce with 0bfd:0017 Kvaser Memorator
Professional HS/HS FW 2.0.50.

Fixes: 71873a9b38d1 ("can: kvaser_usb: Add support for more Kvaser Leaf v2 devices")
Tested-by: Jimmy Assarsson <extja@kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-9-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb_leaf: Fix wrong CAN state after stopping

0bfd:0124 Kvaser Mini PCI Express 2xHS FW 4.18.778 sends a
CMD_CHIP_STATE_EVENT indicating bus-off after stopping the device,
causing a stopped device to appear as CAN_STATE_BUS_OFF instead of
CAN_STATE_STOPPED.

Fix that by not handling error events on stopped devices.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Jimmy Assarsson <extja@kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-8-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb_leaf: Fix improved state not being reported

The tested 0bfd:0017 Kvaser Memorator Professional HS/HS FW 2.0.50 and
0bfd:0124 Kvaser Mini PCI Express 2xHS FW 4.18.778 do not seem to send
any unsolicited events when error counters decrease or when the device
transitions from ERROR_PASSIVE to ERROR_ACTIVE (or WARNING).

This causes the interface to e.g. indefinitely stay in the ERROR_PASSIVE
state.

Fix that by asking for chip state (inc. counters) event every 0.5 secs
when error counters are non-zero.

Since there are non-error-counter devices, also always poll in
ERROR_PASSIVE even if the counters show zero.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Jimmy Assarsson <extja@kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-7-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb_leaf: Set Warning state even without bus errors

kvaser_usb_leaf_rx_error_update_can_state() sets error state according
to error counters when the hardware does not indicate a specific state
directly.

However, this is currently gated behind a check for
M16C_STATE_BUS_ERROR which does not always seem to be set when error
counters are increasing, and may not be set when error counters are
decreasing.

This causes the CAN_STATE_ERROR_WARNING state to not be set in some
cases even when appropriate.

Change the code to set error state from counters even without
M16C_STATE_BUS_ERROR.

The Error-Passive case seems superfluous as it is already set via
M16C_STATE_BUS_PASSIVE flag above, but it is kept for now.

Tested with 0bfd:0124 Kvaser Mini PCI Express 2xHS FW 4.18.778.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Jimmy Assarsson <extja@kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-6-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb: kvaser_usb_leaf: Handle CMD_ERROR_EVENT

The device will send an error event command, to indicate certain errors.
This indicates a misbehaving driver, and should never occur.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Co-developed-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-5-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb: kvaser_usb_leaf: Rename {leaf,usbcan}_cmd_error_event to {leaf,usbcan}_cmd_can_error_event

Prepare for handling CMD_ERROR_EVENT. Rename struct
{leaf,usbcan}_cmd_error_event to {leaf,usbcan}_cmd_can_error_event.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Reported-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Tested-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-4-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: kvaser_usb: kvaser_usb_leaf: Get capabilities from device

Use the CMD_GET_CAPABILITIES_REQ command to query the device for certain
capabilities. We are only interested in LISTENONLY mode and wither the
device reports CAN error counters.

Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Reported-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Tested-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja@kvaser.com>
Link: https://lore.kernel.org/all/20221010185237.319219-3-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

net: mac802154: Fixup function parameter name in docs

The function parameter name was wrong in kdocs.

net/mac802154/util.c:27: warning: Function parameter or member 'hw' not described in 'ieee802154_wake_queue'
net/mac802154/util.c:27: warning: Excess function parameter 'local' description in 'ieee802154_wake_queue'
net/mac802154/util.c:53: warning: Function parameter or member 'hw' not described in 'ieee802154_stop_queue'
net/mac802154/util.c:53: warning: Excess function parameter 'local' description in 'ieee802154_stop_queue'

Fixing name and description.

Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

net: dl2k: remove variable tx_use

Variable tx_use is just being incremented and it's never used
anywhere else. The variable and the increment are redundant so
remove it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20221024143501.2163720-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bna: remove variable num_entries

Variable num_entries is just being incremented and it's never used
anywhere else. The variable and the increment are redundant so
remove it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20221024125951.2155434-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hinic: Set max_mtu/min_mtu directly to simplify the code.

Set max_mtu/min_mtu directly to avoid making the validity judgment
when set mtu, because the judgment is made in net/core: dev_validate_mtu,
so to simplify the code.

Signed-off-by: caihuoqing <cai.huoqing@linux.dev>
Link: https://lore.kernel.org/r/20221024103349.4494-1-cai.huoqing@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-lan743x-pci11010-pci11414-devices-enhancements'

Raju Lakkaraju says:

====================
net: lan743x: PCI11010 / PCI11414 devices Enhancements

This patch series continues with the addition of supported features for the
Ethernet function of the PCI11010 / PCI11414 devices to the LAN743x driver.
====================

Link: https://lore.kernel.org/r/20221024082516.661199-1-Raju.Lakkaraju@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: micrel: Add PHY Auto/MDI/MDI-X set driver for KSZ9131

Add support for MDI-X status and configuration for KSZ9131 chips

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lan743x: Add support for get_pauseparam and set_pauseparam

Add pause get and set functions

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dev: Convert sa_data to flexible array in struct sockaddr

One of the worst offenders of "fake flexible arrays" is struct sockaddr,
as it is the classic example of why GCC and Clang have been traditionally
forced to treat all trailing arrays as fake flexible arrays: in the
distant misty past, sa_data became too small, and code started just
treating it as a flexible array, even though it was fixed-size. The
special case by the compiler is specifically that sizeof(sa->sa_data)
and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1))
do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat
it as a flexible array.

However, the coming -fstrict-flex-arrays compiler flag will remove
these special cases so that FORTIFY_SOURCE can gain coverage over all
the trailing arrays in the kernel that are _not_ supposed to be treated
as a flexible array. To deal with this change, convert sa_data to a true
flexible array. To keep the structure size the same, move sa_data into
a union with a newly introduced sa_data_min with the original size. The
result is that FORTIFY_SOURCE can continue to have no idea how large
sa_data may actually be, but anything using sizeof(sa->sa_data) must
switch to sizeof(sa->sa_data_min).

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Petr Machata <petrm@nvidia.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: syzbot <syzkaller@googlegroups.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnx2: Use kmalloc_size_roundup() to match ksize() usage

Round up allocations with kmalloc_size_roundup() so that build_skb()'s
use of ksize() is always accurate and no special handling of the memory
is needed by KASAN, UBSAN_BOUNDS, nor FORTIFY_SOURCE.

Cc: Rasesh Mody <rmody@marvell.com>
Cc: GR-Linux-NIC-Dev@marvell.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221022021004.gonna.489-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mptcp-socket-option-updates'

Mat Martineau says:

====================
mptcp: Socket option updates

Patches 1 and 3 refactor a recent socket option helper function for more
generic use, and make use of it in a couple of places.

Patch 2 adds TCP_FASTOPEN_NO_COOKIE functionality to MPTCP sockets,
similar to TCP_FASTOPEN_CONNECT support recently added in v6.1
====================

Link: https://lore.kernel.org/r/20221022004505.160988-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: sockopt: use new helper for TCP_DEFER_ACCEPT

mptcp_setsockopt_sol_tcp_defer() was doing the same thing as
mptcp_setsockopt_first_sf_only() except for the returned code in case of
error.

Ignoring the error is needed to mimic how TCP_DEFER_ACCEPT is handled
when used with "plain" TCP sockets.

The specific function for TCP_DEFER_ACCEPT can be replaced by the new
mptcp_setsockopt_first_sf_only() helper and errors can be ignored to
stay compatible with TCP. A bit of cleanup.

Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: add TCP_FASTOPEN_NO_COOKIE support

The goal of this socket option is to configure MPTCP + TFO without
cookie per socket.

It was already possible to enable TFO without a cookie per netns by
setting net.ipv4.tcp_fastopen sysctl knob to the right value. Per route
was also supported by setting 'fastopen_no_cookie' option. This patch
adds a per socket support like it is possible to do with TCP thanks to
TCP_FASTOPEN_NO_COOKIE socket option.

The only thing to do here is to relay the request to the first subflow
like it is already done for TCP_FASTOPEN_CONNECT.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

mptcp: sockopt: make 'tcp_fastopen_connect' generic

There are other socket options that need to act only on the first
subflow, e.g. all TCP_FASTOPEN* socket options.

This is similar to the getsockopt version.

In the next commit, this new mptcp_setsockopt_first_sf_only() helper is
used by other another option.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'soreuseport-fix-broken-so_incoming_cpu'

Kuniyuki Iwashima says:

====================
soreuseport: Fix broken SO_INCOMING_CPU.

setsockopt(SO_INCOMING_CPU) for UDP/TCP is broken since 4.5/4.6 due to
these commits:

  * e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
  * c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")

These commits introduced the O(1) socket selection algorithm and removed
O(n) iteration over the list, but it ignores the score calculated by
compute_score().  As a result, it caused two misbehaviours:

  * Unconnected sockets receive packets sent to connected sockets
  * SO_INCOMING_CPU does not work

The former is fixed by commit acdcecc61285 ("udp: correct reuseport
selection with connected sockets").  This series fixes the latter and
adds some tests for SO_INCOMING_CPU.
====================

Link: https://lore.kernel.org/r/20221021204435.4259-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftest: Add test for SO_INCOMING_CPU.

Some highly optimised applications use SO_INCOMING_CPU to make them
efficient, but they didn't test if it's working correctly by getsockopt()
to avoid slowing down.  As a result, no one noticed it had been broken
for years, so it's a good time to add a test to catch future regression.

The test does

  1) Create $(nproc) TCP listeners associated with each CPU.

  2) Create 32 child sockets for each listener by calling
     sched_setaffinity() for each CPU.

  3) Check if accept()ed sockets' sk_incoming_cpu matches
     listener's one.

If we see -EAGAIN, SO_INCOMING_CPU is broken.  However, we might not see
any error even if broken; the kernel could miraculously distribute all SYN
to correct listeners.  Not to let that happen, we must increase the number
of clients and CPUs to some extent, so the test requires $(nproc) >= 2 and
creates 64 sockets at least.

Test:
  $ nproc
  96
  $ ./so_incoming_cpu

Before the previous patch:

  # Starting 12 tests from 5 test cases.
  #  RUN           so_incoming_cpu.before_reuseport.test1 ...
  # so_incoming_cpu.c:191:test1:Expected cpu (5) == i (0)
  # test1: Test terminated by assertion
  #          FAIL  so_incoming_cpu.before_reuseport.test1
  not ok 1 so_incoming_cpu.before_reuseport.test1
  ...
  # FAILED: 0 / 12 tests passed.
  # Totals: pass:0 fail:12 xfail:0 xpass:0 skip:0 error:0

After:

  # Starting 12 tests from 5 test cases.
  #  RUN           so_incoming_cpu.before_reuseport.test1 ...
  # so_incoming_cpu.c:199:test1:SO_INCOMING_CPU is very likely to be working correctly with 3072 sockets.
  #            OK  so_incoming_cpu.before_reuseport.test1
  ok 1 so_incoming_cpu.before_reuseport.test1
  ...
  # PASSED: 12 / 12 tests passed.
  # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

soreuseport: Fix socket selection for SO_INCOMING_CPU.

Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.

The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU with the flow.  However, the score gets
ignored after the blamed two commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incomnig_cpu
under reuseport_lock.  Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.

cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_lock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-ipa-validation-cleanup'

Alex Elder says:

====================
net: ipa: validation cleanup

This series gathers a set of IPA driver cleanups, mostly involving
code that ensures certain things are known to be correct *early*
(either at build or initializatin time), so they can be assumed good
during normal operation.

The first removes three constant symbols, by making a (reasonable)
assumption that a routing table consists of entries for the modem
followed by entries for the AP, with no unused entries between them.

The second removes two checks that are redundant (they verify the
sizes of two memory regions are in range, which will have been done
earlier for all regions).

The third adds some new checks to routing and filter tables that
can be done at "init time" (without requiring any access to IPA
hardware).

The fourth moves a check that routing and filter table addresses can
be encoded within certain IPA immediate commands, so it's performed
earlier; the checks can be done without touching IPA hardware. The
fifth moves some other command-related checks earlier, for the same
reason.

The sixth removes the definition ipa_table_valid(), because what it
does has become redundant. Finally, the last patch moves two more
validation calls so they're done very early in the probe process.
This will be required by some upcoming patches, which will record
the size of the routing and filter tables at this time so they're
available for subsequent initialization.
====================

Link: https://lore.kernel.org/r/20221021191340.4187935-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: check table memory regions earlier

Verify that the sizes of the routing and filter table memory regions
are valid as part of memory initialization, rather than waiting for
table initialization. The main reason to do this is that upcoming
patches use these memory region sizes to determine the number of
entries in these tables, and we'll want to know these sizes are good
sooner.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: kill ipa_table_valid()

What ipa_table_valid() (and ipa_table_valid_one(), which it calls)
does is ensure that the memory regions that hold routing and filter
tables have reasonable size.  Specifically, it checks that the size
of a region is sufficient (or rather, exactly the right size) to
hold the maximum number of entries supported by the driver.  (There
is an additional check that's erroneous, but in practice it is never
reached.)

Recently ipa_table_mem_valid() was added, which is called by
ipa_table_init().  That function verifies that all table memory
regions are of sufficient size, and requires hashed tables to have
zero size if hashing is not supported.  It only ensures the filter
table is large enough to hold the number of endpoints that support
filtering, but that is adequate.

Therefore everything that ipa_table_valid() does is redundant, so
get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: introduce ipa_cmd_init()

Currently, ipa_cmd_data_valid() is called by ipa_mem_config().
Nothing it does requires access to hardware though, so it can be
done during the init phase of IPA driver startup.

Create a new function ipa_cmd_init(), whose purpose is to do early
initialization related to IPA immediate commands. It will call the
build-time validation function, then will make the two calls made
previously by ipa_cmd_data_valid(). This make ipa_cmd_data_valid()
unnecessary, so get rid of it.

Rename ipa_cmd_header_valid() to be ipa_cmd_header_init_local_valid(),
so its name is clearer about which IPA immediate command it is
associated with.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: verify table sizes fit in commands early

We currently verify the table size and offset fit in the immediate
command fields that must encode them in ipa_table_valid_one(). We
can now make this check earlier, in ipa_table_mem_valid().

The non-hashed IPv4 filter and route tables will always exist, and
their sizes will match the IPv6 tables, as well as the hashed tables
(if supported). So it's sufficient to verify the offset and size of
the IPv4 non-hashed tables fit into these fields.

Rename the function ipa_cmd_table_init_valid(), to reinforce that
it is the TABLE_INIT immediate command fields we're checking.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: validate IPA table memory earlier

Add checks in ipa_table_init() to ensure the memory regions defined
for IPA filter and routing tables are valid.

For routing tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 routing tables are defined
  - The non-hashed IPv4 and IPv6 routing tables are the same size
  - The number entries in the non-hashed IPv4 routing table is enough
    to hold the number entries available to the modem, plus at least
    one usable by the AP.

For filter tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 filter tables are defined
  - The non-hashed IPv4 and IPv6 filter tables are the same size
  - The number entries in the non-hashed IPv4 filter table is enough
    to hold the endpoint bitmap, plus an entry for each defined
    endpoint that supports filtering.

In addition, for both routing and filter tables:
  - If hashing isn't supported (IPA v4.2), hashed tables are zero size
  - If hashing *is* supported, all hashed tables are the same size as
    their non-hashed counterparts.

When validating the size of routing tables, require the AP to have
at least one entry (in addition to those used by the modem).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: remove two memory region checks

There's no need to ensure table memory regions fit within the
IPA-local memory range. And there's no need to ensure the modem
header memory region is in range either. These are verified for all
memory regions in ipa_mem_size_valid(), once we have settled on the
size of IPA memory.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipa: kill two constant symbols

The entries in each IPA routing table are divided between the modem
and the AP.  The modem always gets some number of entries located at
the base of the table; the AP gets all those that follow.

There's no reason to think the modem will use anything different
from the first entries in a routing table, so:
  - Get rid of IPA_ROUTE_MODEM_MIN (just assume it's 0)
  - Get rid of IPA_ROUTE_AP_MIN (just assume it's IPA_ROUTE_MODEM_COUNT)
And finally:
  - Open-code IPA_ROUTE_AP_COUNT and remove its definition

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'extend-action-skbedit-to-rx-queue-mapping'

Amritha Nambiar says:

====================
Extend action skbedit to RX queue mapping

Based on the discussion on
https://lore.kernel.org/netdev/166260012413.81018.8010396115034847972.stgit@anambiarhost.jf.intel.com/ ,
the following series extends skbedit tc action to RX queue mapping.
Currently, skbedit action in tc allows overriding of transmit queue.
Extending this ability of skedit action supports the selection of
receive queue for incoming packets. On the receive side, this action
is supported only in hardware, so the skip_sw flag is enforced.

Enabled ice driver to offload this type of filter into the hardware
for accepting packets to the device's receive queue.
====================

Link: https://lore.kernel.org/r/166633888716.52141.3425659377117969638.stgit@anambiarhost.jf.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Documentation: networking: TC queue based filtering

Add tc-queue-filters.rst with notes on TC filters for
selecting a set of queues and/or a queue.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ice: Enable RX queue selection using skbedit action

This patch uses TC skbedit queue_mapping action to support
forwarding packets to a device queue. Such filters with action
forward to queue will be the highest priority switch filter in
HW.
Example:
$ tc filter add dev ens4f0 protocol ip ingress flower\
dst_ip 192.168.1.12 ip_proto tcp dst_port 5001\
action skbedit queue_mapping 5 skip_sw

The above command adds an ingress filter, incoming packets
qualifying the match will be accepted into queue 5. The queue
number is in decimal format.

Refactored ice_add_tc_flower_adv_fltr() to consolidate code with
action FWD_TO_VSI and FWD_TO QUEUE.

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

act_skbedit: skbedit queue mapping for receive queue

Add support for skbedit queue mapping action on receive
side. This is supported only in hardware, so the skip_sw
flag is enforced. This enables offloading filters for
receive queue selection in the hardware using the
skbedit action. Traffic arrives on the Rx queue requested
in the skbedit action parameter. A new tc action flag
TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
traffic direction the action queue_mapping is requested
on during filter addition. This is used to disallow
offloading the skbedit queue mapping action on transmit
side.

Example:
$tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
action skbedit queue_mapping $rxq_id skip_sw

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-sfp-improve-high-power-module-implementation'

Russell King says:

====================
net: sfp: improve high power module implementation

This series aims to improve the power level switching between standard
level 1 and the higher power levels.

The first patch updates the DT binding documentation to include the
minimum and default of 1W, which is the base level that every SFP cage
must support. Hence, it makes sense to document this in the binding.

The second patch enforces a minimum of 1W when parsing the firmware
description, and optimises the code for that case; there's no need to
check for SFF8472 compliance since we will not need to touch the
A2h registers.

Patch 3 validates that the module supports SFF-8472 rev 10.2 before
checking for power level 2 - rev 10.2 is where support for power
levels was introduced, so if the module doesn't support this revision,
it doesn't support power levels. Setting the power level 2 declaration
bit is likely to be spurious.

Patch 4 does the same for power level 3, except this was introduced in
SFF-8472 rev 11.9. The revision code was never updated, so we use the
rev 11.4 to signify this.

Patch 5 cleans up the code - rather than using BIT(0), we now use a
properly named value for the power level select bit.

Patch 6 introduces a read-modify-write helper.

Patch 7 gets rid of the DM7052 hack (which sets a power level
declaration bit but is not compatible with SFF-8472 rev 10.2, and
the module does not implement the A2h I2C address.)

Series tested with my DM7052.

v2: update sff.sfp.yaml with Rob's feedback
====================

Andrew's review tags from v1.

Link: https://lore.kernel.org/r/Y0%2F7dAB8OU3jrbz6@shell.armlinux.org.uk
Link: https://lore.kernel.org/r/Y1K17UtfFopACIi2@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: get rid of DM7052 hack when enabling high power

Since we no longer mis-detect high-power mode with the DM7052 module,
we no longer need the hack in sfp_module_enable_high_power(), and can
now switch this to use sfp_modify_u8().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: add sfp_modify_u8() helper

Add a helper to modify bits in a single byte in memory space, and use
it when updating the soft tx-disable flag in the module.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: provide a definition for the power level select bit

Provide a named definition for the power level select bit in the
extended status register, rather than using BIT(0) in the code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: ignore power level 3 prior to SFF-8472 Rev 11.4

Power level 3 was included in SFF-8472 revision 11.9, but this does
not have a compliance code. Use revision 11.4 as the minimum
compliance level instead.

This should avoid any spurious indication of 2W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: ignore power level 2 prior to SFF-8472 Rev 10.2

Power level 2 was introduced by SFF-8472 revision 10.2. Ignore
the power declaration bit for modules that are not compliant with
at least this revision.

This should remove any spurious indication of 1.5W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: check firmware provided max power

Check that the firmware provided maximum power is at least 1W, which
is the minimum power level for any SFP module.

Now that we enforce the minimum of 1W, we can exit early from
sfp_module_parse_power() if the module power is 1W or less.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: sff,sfp: update binding

Add a minimum and default for the maximum-power-milliwatt option;
module power levels were originally up to 1W, so this is the default
and the minimum power level we can have for a functional SFP cage.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-driver-updates'

Michael Chan says:

====================
bnxt_en: Driver updates

This patchset adds .get_module_eeprom_by_page() support and adds
an NVRAM resize step to allow larger firmware images to be flashed
to older firmware.
====================

Link: https://lore.kernel.org/r/1666334243-23866-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: check and resize NVRAM UPDATE entry before flashing

Resize of the UPDATE entry is required if the image to
be flashed is larger than the available space. Add this step,
otherwise flashing larger firmware images by ethtool or devlink
may fail.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: add .get_module_eeprom_by_page() support

Add support for .get_module_eeprom_by_page() callback which
implements generic solution for module`s eeprom access.

v3: Add bnxt_get_module_status() to get a more specific extack error
    string.
    Return -EINVAL from bnxt_get_module_eeprom_by_page() when we
    don't want to fallback to old method.
v2: Simplification suggested by Ido Schimmel

Link: https://lore.kernel.org/netdev/YzVJ%2FvKJugoz15yV@shredder/
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Update firmware interface to 1.10.2.118

The main changes are PTM timestamp support, CMIS EEPROM support, and
asymmetric CoS queues support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git./linux/kernel/git/netdev/net

include/linux/net.h
a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag")
e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.1-rc3-1' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.

  The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe
  I'm biased.

  Current release - regressions:

   - eth: fman: re-expose location of the MAC address to userspace,
     apparently some udev scripts depended on the exact value

  Current release - new code bugs:

   - bpf:
       - wait for busy refill_work when destroying bpf memory allocator
       - allow bpf_user_ringbuf_drain() callbacks to return 1
       - fix dispatcher patchable function entry to 5 bytes nop

  Previous releases - regressions:

   - net-memcg: avoid stalls when under memory pressure

   - tcp: fix indefinite deferral of RTO with SACK reneging

   - tipc: fix a null-ptr-deref in tipc_topsrv_accept

   - eth: macb: specify PHY PM management done by MAC

   - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()

  Previous releases - always broken:

   - eth: amd-xgbe: SFP fixes and compatibility improvements

  Misc:

   - docs: netdev: offer performance feedback to contributors"

* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
  net-memcg: avoid stalls when under memory pressure
  tcp: fix indefinite deferral of RTO with SACK reneging
  tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
  net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
  net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
  docs: netdev: offer performance feedback to contributors
  kcm: annotate data-races around kcm->rx_wait
  kcm: annotate data-races around kcm->rx_psock
  net: fman: Use physical address for userspace interfaces
  net/mlx5e: Cleanup MACsec uninitialization routine
  atlantic: fix deadlock at aq_nic_stop
  nfp: only clean `sp_indiff` when application firmware is unloaded
  amd-xgbe: add the bit rate quirk for Molex cables
  amd-xgbe: fix the SFP compliance codes check for DAC cables
  amd-xgbe: enable PLL_CTL for fixed PHY modes only
  amd-xgbe: use enums for mailbox cmd and sub_cmds
  amd-xgbe: Yellow carp devices do not need rrc
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  MAINTAINERS: add keyword match on PTP
  ...

Merge tag 'rcu-urgent.2022.10.20a' of git://git./linux/kernel/git/paulmck/linux-rcu

Pull RCU fix from Paul McKenney:
"Fix a regression caused by commit bf95b2bc3e42 ("rcu: Switch polled
  grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
  interrupts enabled after an early-boot call to synchronize_rcu().

  Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
  to properly interact with polled grace periods, but the code did not
  take into account the possibility of synchronize_rcu() being invoked
  from the portion of the boot sequence during which interrupts are
  disabled.

  This commit therefore switches the lock acquisition and release from
  irq to irqsave/irqrestore"

* tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Keep synchronize_rcu() from enabling irqs in early boot

Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull KUnit fixes from Shuah Khan:
"One single fix to update alloc_string_stream() callers to check for
  IS_ERR() instead of NULL to be in sync with alloc_string_stream()
  returning an ERR_PTR()"

* tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: update NULL vs IS_ERR() tests

Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:

- futex, intel_pstate, kexec build fixes

- ftrace dynamic_events dependency check fix

- memory-hotplug fix to remove redundant warning from test report

* tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/ftrace: fix dynamic_events dependency check
  selftests/memory-hotplug: Remove the redundant warning information
  selftests/kexec: fix build for ARCH=x86_64
  selftests/intel_pstate: fix build for ARCH=x86_64
  selftests/futex: fix build for clang

Merge tag 'pinctrl-v6.1-3' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

- Fix typos in UART1 and MMC in the Ingenic driver

- A really well researched glitch bug fix to the Qualcomm driver that
   was tracked down and fixed by Dough Anderson from Chromium. Hats off
   for this one!

- Revert two patches on the Xilinx ZynqMP driver: this needs a proper
   solution making use of firmware version information to adapt to
   different firmware releases

- Fix interrupt triggers in the Ocelot driver

* tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: ocelot: Fix incorrect trigger of the interrupt.
  Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
  Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
  pinctrl: qcom: Avoid glitching lines when we first mux to output
  pinctrl: Ingenic: JZ4755 bug fixes

net-memcg: avoid stalls when under memory pressure

As Shakeel explains the commit under Fixes had the unintended
side-effect of no longer pre-loading the cached memory allowance.
Even tho we previously dropped the first packet received when
over memory limit - the consecutive ones would get thru by using
the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per one drop.

After the change we no longer force charge, there will be no
cache filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.

Some of the latency can be recovered by improving SACK reneg
handling but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worst latency).

Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.

Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9fe5 ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: fix indefinite deferral of RTO with SACK reneging

This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.

The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.

This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.

Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.

Fixes: 5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK reneging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2022-10-23

We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).

The main changes are:

1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.

2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.

3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.

4) Prevent decl_tag from being referenced in func_proto, from Stanislav.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  bpf: Fix dispatcher patchable function entry to 5 bytes nop
  bpf: prevent decl_tag from being referenced in func_proto
  selftests/bpf: Add reproducer for decl_tag in func_proto return type
  selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
  bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================

Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ptp-ocxp-Oroli-ART-CARD'

Vadim Fedorenko says:

====================
ptp: ocp: add support for Orolia ART-CARD

Orolia company created alternative open source TimeCard. The hardware of
the card provides similar to OCP's card functions, that's why the support
is added to current driver.

The first patch in the series changes the way to store information about
serial ports and is more like preparation.

The patches 2 to 4 introduces actual hardware support.

The last patch removes fallback from devlink flashing interface to protect
against flashing wrong image. This became actual now as we have 2 different
boards supported and wrong image can ruin hardware easily.

v2:
  Address comments from Jonathan Lemon

v3:
  Fix issue reported by kernel test robot <lkp@intel.com>

v4:
  Fix clang build issue

v5:
  Fix warnings and per-patch build errors

v6:
  Fix more style issues
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: remove flash image header check fallback

Previously there was a fallback mode to flash firmware image without
proper header. But now we have different supported vendors and flashing
wrong image could destroy the hardware. Remove fallback mode and force
header check. Both vendors have published firmware images with headers.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: expose config and temperature for ART card

Orolia card has disciplining configuration and temperature table
stored in EEPROM. This patch exposes them as binary attributes to
have read and write access.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ocp: add serial port of mRO50 MAC on ART card

ART card provides interface to access to serial port of miniature atomic
clock found on the card. Add support for this device and configure it
during init phase.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>