OSDN Git Service

tomoyo/tomoyo-test1.git
5 years agonet_sched: sch_fq: avoid calling ktime_get_ns() if not needed
Eric Dumazet [Tue, 20 Nov 2018 01:30:19 +0000 (17:30 -0800)]
net_sched: sch_fq: avoid calling ktime_get_ns() if not needed

There are two cases were we can avoid calling ktime_get_ns() :

1) Queue is empty.
2) Internal queue is not empty.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'gred-add-offload-support'
David S. Miller [Tue, 20 Nov 2018 02:53:46 +0000 (18:53 -0800)]
Merge branch 'gred-add-offload-support'

Jakub Kicinski says:

====================
gred: add offload support

This series adds support for GRED offload in the nfp driver.  So
far we have only supported the RED Qdisc offload, but we need a
way to differentiate traffic types e.g. based on DSCP marking.

It may seem like PRIO+RED is a good match for this job, however,
(a) we don't need strict priority behaviour of PRIO, and (b) PRIO
uses the legacy way of mapping ToS fields to bands, which is quite
awkward and limitting.

The less commonly used GRED Qdisc is a better much for the scenario,
it allows multiple sets of RED parameters and queue lengths to be
maintained with a single FIFO queue.  This is exactly how nfp offload
behaves.  We use a trivial u32 classifier to assign packets to virtual
queues.

There is also the minor advantage that GRED can't have its child
changed, therefore limitting ways in which the configuration of SW
path can diverge from HW offload.

Last patch of the series adds support for (G)RED in non-ECN mode,
where packets are dropped instead of marked.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: add support for more threshold actions
Jakub Kicinski [Mon, 19 Nov 2018 23:21:50 +0000 (15:21 -0800)]
nfp: abm: add support for more threshold actions

Original FW only allowed us to perform ECN marking.  Newer releases
also support plain old drop.  Add the ability to configure drop
policy.  This is particularly useful in combination with GRED,
because different bands can have different ECN marking setting.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: add cls_u32 offload for simple band classification
Jakub Kicinski [Mon, 19 Nov 2018 23:21:49 +0000 (15:21 -0800)]
nfp: abm: add cls_u32 offload for simple band classification

Use offload of very simple u32 filters to direct packets to GRED
bands based on the DSCP marking.  No u32 hashing is supported,
just plain simple filters matching on ToS or Priority with
appropriate mask device can support.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: add functions to update DSCP -> virtual queue map
Jakub Kicinski [Mon, 19 Nov 2018 23:21:48 +0000 (15:21 -0800)]
nfp: abm: add functions to update DSCP -> virtual queue map

Learn how to set the DSCP map.  FW uses a packed array which
geometry depends on the number of supported priorities and
virtual queues.  Write code to assemble this map and to communicate
the setting to the FW via mailbox.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: calculate PRIO map len and check mailbox size
Jakub Kicinski [Mon, 19 Nov 2018 23:21:47 +0000 (15:21 -0800)]
nfp: abm: calculate PRIO map len and check mailbox size

In preparation for PRIO offload calculate how long the prio map
for FW will be and make sure the configuration can be performed
via the vNIC mailbox.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: cls_u32: add res to offload information
Jakub Kicinski [Mon, 19 Nov 2018 23:21:46 +0000 (15:21 -0800)]
net: sched: cls_u32: add res to offload information

In case of egress offloads the class/flowid assigned by the filter
may be very important for offloaded Qdisc selection.  Provide this
info to drivers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: add GRED offload
Jakub Kicinski [Mon, 19 Nov 2018 23:21:45 +0000 (15:21 -0800)]
nfp: abm: add GRED offload

Add support for GRED offload.  It behaves much like RED, but
can apply different parameters to different bands.  GRED operates
pretty much exactly like our HW/FW with a single FIFO and different
RED state instances.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: wrap RED parameters in bands
Jakub Kicinski [Mon, 19 Nov 2018 23:21:44 +0000 (15:21 -0800)]
nfp: abm: wrap RED parameters in bands

Wrap RED parameters and stats into a structure, and a 1-element
array.  Upcoming GRED offload will add the support for more bands.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: gred: support reporting stats from offloads
Jakub Kicinski [Mon, 19 Nov 2018 23:21:43 +0000 (15:21 -0800)]
net: sched: gred: support reporting stats from offloads

Allow drivers which offload GRED to report back statistics.  Since
A lot of GRED stats is fairly ad hoc in nature pass to drivers the
standard struct gnet_stats_basic/gnet_stats_queue pairs, and
untangle the values in the core.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: gred: add basic Qdisc offload
Jakub Kicinski [Mon, 19 Nov 2018 23:21:42 +0000 (15:21 -0800)]
net: sched: gred: add basic Qdisc offload

Add basic offload for the GRED Qdisc.  Inform the drivers any
time Qdisc or virtual queue configuration changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: add up bands for sto/non-sto stats
Jakub Kicinski [Mon, 19 Nov 2018 23:21:41 +0000 (15:21 -0800)]
nfp: abm: add up bands for sto/non-sto stats

Add up stats for all bands for the extra ethtool statistics.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: switch to extended stats for reading packet/byte counts
Jakub Kicinski [Mon, 19 Nov 2018 23:21:40 +0000 (15:21 -0800)]
nfp: abm: switch to extended stats for reading packet/byte counts

In PRIO-enabled FW read the statistics from per-band symbol, rather
than from the standard per-PCIe-queue counters.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: size threshold table to account for bands
Jakub Kicinski [Mon, 19 Nov 2018 23:21:39 +0000 (15:21 -0800)]
nfp: abm: size threshold table to account for bands

Make sure the threshold table is large enough to hold information
for all bands.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: pass band parameter to functions
Jakub Kicinski [Mon, 19 Nov 2018 23:21:38 +0000 (15:21 -0800)]
nfp: abm: pass band parameter to functions

In preparation for per-band RED offload pass band parameter to
functions.  For now it will always be 0.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: abm: map per-band symbols
Jakub Kicinski [Mon, 19 Nov 2018 23:21:37 +0000 (15:21 -0800)]
nfp: abm: map per-band symbols

In preparation for multi-band RED offload if FW is capable map
the extended symbols which will allow us to set per-band parameters
and read stats.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add common validation in hclge_dcb
Yunsheng Lin [Mon, 19 Nov 2018 13:02:15 +0000 (21:02 +0800)]
net: hns3: add common validation in hclge_dcb

Before setting tm related configuration to hardware, driver
needs to check the configuration provided by user is valid.
Currently hclge_ieee_setets and hclge_setup_tc both implement
their own checking, which has a lot in common.

This patch addes hclge_dcb_common_validate to do the common
checking. The checking in hclge_tm_prio_tc_info_update
and hclge_tm_schd_info_update is unnecessary now, so change
the return type to void, which removes the need to do error
handling when one of the checking fails.

Also, ets->prio_tc is indexed by user prio and ets->tc_tsa is
indexed by tc num, so this patch changes them to use different
index.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'selftests-Add-tests-for-VXLAN-at-an-802-1d-bridge'
David S. Miller [Tue, 20 Nov 2018 01:59:45 +0000 (17:59 -0800)]
Merge branch 'selftests-Add-tests-for-VXLAN-at-an-802-1d-bridge'

Ido Schimmel says:

====================
selftests: Add tests for VXLAN at an 802.1d bridge

Petr says:

This patchset adds several tests for VXLAN attached to an 802.1d bridge
and fixes a related bug.

First patch #1 fixes a bug in propagating SKB already-forwarded marks
over veth to bridges, where they are irrelevant. This bug causes the
vxlan_bridge_1d test suite from this patchset to fail as the packets
aren't forwarded by br2.

In patches #2 and #3, lib.sh is extended to support network namespaces.
The use of namespaces is necessitated by VXLAN, which allows only one
VXLAN device with a given VNI per namespace. Thus to host full topology
on a single box for selftests, the "remote" endpoints need to be in
namespaces.

In patches #4-#6, lib.sh is extended in other ways to facilitate the
following patches.

In patches #7-#15, first the skeleton, and later the generic tests
themselves are added.

Patch #16 then adds another test that serves as a wrapper around the
previous one, and runs it with a non-default port number.

Patches #17 and #18 add mlxsw-specific tests. About those, Ido writes:

The first test creates various configurations with regards to the VxLAN
and bridge devices and makes sure the driver correctly forbids
unsupported configuration and permits supported ones. It also verifies
that the driver correctly sets the offload indication on FDB entries and
the local route used for VxLAN decapsulation.

The second test verifies that the driver correctly configures the singly
linked list used to flood BUM traffic and that traffic is flooded as
expected.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: mlxsw: Add a test for VxLAN flooding
Ido Schimmel [Mon, 19 Nov 2018 16:11:27 +0000 (16:11 +0000)]
selftests: mlxsw: Add a test for VxLAN flooding

The device stores flood records in a singly linked list where each
record stores up to three IPv4 addresses of remote VTEPs. The test
verifies that packets are correctly flooded in various cases such as
deletion of a record in the middle of the list.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: mlxsw: Add a test for VxLAN configuration
Ido Schimmel [Mon, 19 Nov 2018 16:11:26 +0000 (16:11 +0000)]
selftests: mlxsw: Add a test for VxLAN configuration

Test various aspects of VxLAN offloading which are specific to mlxsw,
such as sanitization of invalid configurations and offload indication.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d_port_8472: New test
Petr Machata [Mon, 19 Nov 2018 16:11:25 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d_port_8472: New test

This simple wrapper reruns the VXLAN ping test with a port number of
8472.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add an ECN decap test
Petr Machata [Mon, 19 Nov 2018 16:11:24 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add an ECN decap test

Test that when decapsulating from VXLAN, the values of inner and outer
TOS are handled appropriately. Because VXLAN driver on its own won't
produce the arbitrary TOS combinations necessary to test this feature,
simply open-code a single ICMP packet and have mausezahn assemble it.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add an ECN encap test
Petr Machata [Mon, 19 Nov 2018 16:11:22 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add an ECN encap test

Test that ECN bits in the VXLAN envelope are correctly deduced from the
overlay packet.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add a TOS test
Petr Machata [Mon, 19 Nov 2018 16:11:21 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add a TOS test

Test that TOS is inherited from the tunneled packet into the envelope as
configured at the VXLAN device.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add a TTL test
Petr Machata [Mon, 19 Nov 2018 16:11:20 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add a TTL test

This tests whether TTL of VXLAN envelope packets is properly set based
on the device configuration.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Reconfigure & rerun tests
Petr Machata [Mon, 19 Nov 2018 16:11:19 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Reconfigure & rerun tests

The ordering of the topology creation can have impact on whether a
driver is successful in offloading VXLAN. Therefore add a pseudo-test
that reshuffles bits of the topology, and then reruns the same suite of
tests again to make sure that the new setup is supported as well.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add unicast test
Petr Machata [Mon, 19 Nov 2018 16:11:18 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add unicast test

Test that when sending traffic to a learned MAC address, the traffic is
forwarded accurately only to the right endpoint.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add flood test
Petr Machata [Mon, 19 Nov 2018 16:11:17 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add flood test

Test that when sending traffic to an unlearned MAC address, the traffic
is flooded to both remote VXLAN endpoints.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: vxlan_bridge_1d: Add ping test
Petr Machata [Mon, 19 Nov 2018 16:11:15 +0000 (16:11 +0000)]
selftests: forwarding: vxlan_bridge_1d: Add ping test

Test end-to-end reachability between local and remote endpoints.

Note that because learning is disabled on the VXLAN device, the ICMP
requests will end up being flooded to all remotes.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: Add a skeleton of vxlan_bridge_1d
Petr Machata [Mon, 19 Nov 2018 16:11:14 +0000 (16:11 +0000)]
selftests: forwarding: Add a skeleton of vxlan_bridge_1d

This skeleton sets up a topology with three VXLAN endpoints: one
"local", possibly offloaded, and two "remote", formed using veth pairs
and likely purely software bridges. The "local" endpoint is connected to
host systems by a VLAN-unaware bridge.

Since VXLAN tunnels must be unique per namespace, each of the "remote"
endpoints is in its own namespace. H3 forms the bridge between the three
domains.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: lib: Add link_stats_rx_errors_get()
Petr Machata [Mon, 19 Nov 2018 16:11:13 +0000 (16:11 +0000)]
selftests: forwarding: lib: Add link_stats_rx_errors_get()

Such a function will be useful for counting malformed packets in the ECN
decap test.

To that end, introduce a common handler for handling stat-fetching, and
reuse it in link_stats_tx_packets_get() and link_stats_rx_errors_get().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: ping{6, }_do(): Allow passing ping arguments
Petr Machata [Mon, 19 Nov 2018 16:11:12 +0000 (16:11 +0000)]
selftests: forwarding: ping{6, }_do(): Allow passing ping arguments

Make the ping routine more generic by allowing passing arbitrary ping
command-line arguments.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: ping{6, }_test(): Add description argument
Petr Machata [Mon, 19 Nov 2018 16:11:11 +0000 (16:11 +0000)]
selftests: forwarding: ping{6, }_test(): Add description argument

Have ping_test() recognize an optional argument with a description of
the test. This is handy if there are several ping test, to make it clear
which is which.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: lib: Add in_ns()
Petr Machata [Mon, 19 Nov 2018 16:11:10 +0000 (16:11 +0000)]
selftests: forwarding: lib: Add in_ns()

In order to run a certain command inside another network namespace, it's
possible to use "ip netns exec ns command". However then one can't use
functions defined in lib.sh or a test suite.

One option is to do "ip netns exec ns bash -c command", provided that
all functions that one wishes to use (and their dependencies) are
published using "export -f". That may not be practical.

Therefore, introduce a helper in_ns(), which wraps a given command in a
boilerplate of "ip netns exec" and "source lib.sh", thus making all
library functions available. (Custom functions that a script wishes to
run within a namespace still need to be exported.)

Because quotes in "$@" aren't recognized in heredoc, hand-expand the
array in an explicit for loop, leveraging printf %q to handle proper
quoting.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: forwarding: lib: Support NUM_NETIFS of 0
Petr Machata [Mon, 19 Nov 2018 16:11:08 +0000 (16:11 +0000)]
selftests: forwarding: lib: Support NUM_NETIFS of 0

So far the case of NUM_NETIFS of 0 has not been interesting. However if
one wishes to reuse the lib.sh routines in a setup of a separate
namespace, being able to import like this is handy.

Therefore replace the {1..$NUM_NETIFS} references, which cause iteration
over 1 and 0, with an explicit for loop like we do in setup_wait() and
tc_offload_check(), so that for NUM_NETIFS of 0 no iteration is done.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: skb_scrub_packet(): Scrub offload_fwd_mark
Petr Machata [Mon, 19 Nov 2018 16:11:07 +0000 (16:11 +0000)]
net: skb_scrub_packet(): Scrub offload_fwd_mark

When a packet is trapped and the corresponding SKB marked as
already-forwarded, it retains this marking even after it is forwarded
across veth links into another bridge. There, since it ingresses the
bridge over veth, which doesn't have offload_fwd_mark, it triggers a
warning in nbp_switchdev_frame_mark().

Then nbp_switchdev_allowed_egress() decides not to allow egress from
this bridge through another veth, because the SKB is already marked, and
the mark (of 0) of course matches. Thus the packet is incorrectly
blocked.

Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
function is called from tunnels and also from veth, and thus catches the
cases where traffic is forwarded between bridges and transformed in a
way that invalidates the marking.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'octeontx2-af-NPC-MCAM-support-and-FLR-handling'
David S. Miller [Tue, 20 Nov 2018 01:56:09 +0000 (17:56 -0800)]
Merge branch 'octeontx2-af-NPC-MCAM-support-and-FLR-handling'

Sunil Goutham says:

====================
octeontx2-af: NPC MCAM support and FLR handling

This patchset is a continuation to earlier submitted three patch
series to add a new driver for Marvell's OcteonTX2 SOC's
Resource virtualization unit (RVU) admin function driver.

1. octeontx2-af: Add RVU Admin Function driver
   https://www.spinics.net/lists/netdev/msg528272.html
2. octeontx2-af: NPA and NIX blocks initialization
   https://www.spinics.net/lists/netdev/msg529163.html
3. octeontx2-af: NPC parser and NIX blocks initialization
   https://www.spinics.net/lists/netdev/msg530252.html

This patch series adds support for below
RVU generic:
- Function Level Reset irq handler
  When FLR is triggered for PFs, AF receives interrupt.
  This patchset adds logic for cleaning up of NPA, NIX
  and NPC block resources being used by PF.

- Mailbox communication between AF and it's VFs.
  Unlike VFs of PF1-PFn, AF which is PF0 can communicate
  with it's VFs directly. Added support for the same.

- AF's VFs IO configuration
  These VFs are mapped to use internal HW loopback channels
  instead of CGX LMACs. Each pair of VFs work as two of ends
  of hardwired interfaces. VF0's TX is VF1's Rx & viceversa.

NPC block:
- MCAM entry management
  Alloc/Free of contiguous/non-contiguous and lower/higher
  priority MCAM entry allocation and programming support.
- MCAM counters management and map/unmap with MCAM entries
- Default KEY extract profile
- HW errata workarounds

NIX block:
- Minimum and maximum allowed packet length config
- HW errata workarounds

Few more changes like shift to use mutex instead of spinlock etc
are done in this patchset.

Changes from v2:
 1 Fixed commit message of patch 'Relax resource lock into mutex'
   to a more unambiguous one.
   - Suggested by David Miller.

Changes from v1:
 1 Converted all mailbox message handler API names to small letters
   from mixed small and capital letters.
   - Suggested by David Miller.
 2 Fixed endian issues in patch 'Add support for stripping STAG/CTAG'
   - Suggested by Arnd Bergmann
 3 Elaborated commit message of patch 'Add FLR interrupt handler'
   to make it a bit more easy to understand.
   - Suggested by Arnd Bergmann

 Will fix the padding and alignment in mailbox message structure
 in a follow-up patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Workarounds for HW errata
Sunil Goutham [Mon, 19 Nov 2018 10:47:43 +0000 (16:17 +0530)]
octeontx2-af: Workarounds for HW errata

Errata 35038
  Software sets NIX_AF_RX_SW_SYNC[ENA] to sync (flush) in-flight packets
  the RX data path before configuration changes (e.g. disabling one or
  more RQs). Hardware clears [ENA] to indicate sync is done

  An issue exists whereby NIX may clear NIX_AF_RX_SW_SYNC [ENA] too
  early.

Errata 35057
  NIX may corrupt internal state when conditional clocks turn off.
  So turnon all clocks by default.

Errata 35786
 Parse nibble enable NPC configuration for KEY generation has to be
 identical for both Rx and Tx interfaces.

Also corrected endianness configuration for NIX i.e NIX_AF_CFG[AF_BE]
is bit8 and not bit1.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Add interrupt handlers for Master Enable event
Linu Cherian [Mon, 19 Nov 2018 10:47:42 +0000 (16:17 +0530)]
octeontx2-af: Add interrupt handlers for Master Enable event

- Add interrupt handlers for Master Enable events from PFs
  and Master Enable events from VFs of AF
- Master Enable is required for the MSIX delivery to work
- Master Enable bit trap handler doesn't have to do any anything
  other than clearing the TRPEND bit, since the enable/disable
  requirements are already taken care using mbox requests/flr handler.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Add FLR handling support for AF's VFs
Sunil Goutham [Mon, 19 Nov 2018 10:47:41 +0000 (16:17 +0530)]
octeontx2-af: Add FLR handling support for AF's VFs

Added support to handle FLR for AF's VFs (i.e LBK VFs).
Just the FLR interrupt enable/disable, handler registration
etc, actual HW resource cleanup or LFs teardown logic is
already there.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Configure AF VFs to talk over LBK channels
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:40 +0000 (16:17 +0530)]
octeontx2-af: Configure AF VFs to talk over LBK channels

Configure AF VFs such that they are able to talk over consecutive
loopback channels.

If 8 VFs are attached to AF then communication will work as below:

TX      RX
lbk0 -> lbk1
lbk1 -> lbk0

lbk2 -> lbk3
lbk3 -> lbk2

lbk4 -> lbk5
lbk5 -> lbk4

lbk6 -> lbk7
lbk7 -> lbk6

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Enable sriov on AF to create VFs
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:39 +0000 (16:17 +0530)]
octeontx2-af: Enable sriov on AF to create VFs

Enable all AF VFs during probe. Since AF's VFs work in pairs
(eg: Pkts sent on VF0 are received by VF1 and viceversa),
enable only even number of VFs out of totalVFs, which should
again be less than number of loopback (LBK) channels.

Also enable VF's mailbox interrupts.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Mbox communication support btw AF and it's VFs
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:38 +0000 (16:17 +0530)]
octeontx2-af: Mbox communication support btw AF and it's VFs

VFs attached to PFs other than AF can not communicate with AF
directly. Instead they are supposed to first send message to
the PF they are residing on and PF forwards it to the AF.
Responses to messages are handled in the reverse order.

On the other hand if VFs are on AF (PF0) itself then direct mailbox
communication is possible since there's no other PF in the way.

This patch addresses this particular case and adds support for
handling it.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Marko Kallio <mkallio@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Teardown NPA, NIX LF upon receiving FLR
Geetha sowjanya [Mon, 19 Nov 2018 10:47:37 +0000 (16:17 +0530)]
octeontx2-af: Teardown NPA, NIX LF upon receiving FLR

Upon receiving FLR IRQ for a RVU PF, teardown or cleanup
resources held by that PF_FUNC. This patch cleans up,
NIX LF
 - Stop ingress/egress traffic
 - Disable NPC MCAM entries being used.
 - Free Tx scheduler queues
 - Disable RQ/SQ/CQ HW contexts
NPA LF
 - Disable Pool/Aura HW contexts
In future teardown of SSO/SSOW/TIM/CPT will be added.

Also added a mailbox message for a RVU PF to request
AF, to perform FLR for a RVU VF under it.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Add FLR interrupt handler
Geetha sowjanya [Mon, 19 Nov 2018 10:47:36 +0000 (16:17 +0530)]
octeontx2-af: Add FLR interrupt handler

RVU admin function (AF) has all the priviliges to cleanup
HW state when VFIO triggers a PCIe function level reset (FLR)
due to either reset or a VM crash. FLR for RVU PF1-PFn will
trigger an IRQ to AF.

This patch enables all RVU PF's FLR interrupts and registers a
handler. Upon receiving an interrupt, a workqueue is scheduled
to cleanup all hardware blocks being used by the PF which
received the FLR.

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Verify NPA/SSO/NIX PF_FUNC mapping
Sunil Goutham [Mon, 19 Nov 2018 10:47:35 +0000 (16:17 +0530)]
octeontx2-af: Verify NPA/SSO/NIX PF_FUNC mapping

While mapping a NIX LF to a NPA LF attached PF_FUNC or
SSO LF attached PF_FUNC, verify if PF_FUNC is valid and
if that PF_FUNC has a LF of that block attached to it or not.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Add support for stripping STAG/CTAG
Tomasz Duszynski [Mon, 19 Nov 2018 10:47:34 +0000 (16:17 +0530)]
octeontx2-af: Add support for stripping STAG/CTAG

This works by shadowing existing UCAST MCAM entry
with a new one additionally matching either NPC_LT_LB_CTAG
or NPC_LT_LB_STAG. For this to fully work one needs to
send properly configured NIX_VTAG_CFG message afterwards i.e with
strip and capture enabled and type set to 0.

On receiving tagged packet NIX will remove outer VLAN and capture
TCI in NIX_RX_PARSE_S.

Also simplified RX Vtag configuration flow
With this setting STRIP/CAPTURE VTAG actions separately would be
possible. Following combinations are possible: STRIP,
STRIP and CAPTURE, CAPTURE or nothing (0 disables respective actions).

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Support to enable/disable default MCAM entries
Sunil Goutham [Mon, 19 Nov 2018 10:47:33 +0000 (16:17 +0530)]
octeontx2-af: Support to enable/disable default MCAM entries

For a PF/VF with a NIXLF attached has default/reserved MCAM entries
for receiving Ucast/Bcast/Promisc traffic. Ideally traffic should be
forwarded to NIXLF only after it's contexts are initialized. This
patch keeps these default entries disabled and adds mbox messages
for a PF/VF to enable these once NPA/NIXLF initialization is done.
Likewise while PF/VF is being teared down, it can send the disable
mailbox message to stop receiving traffic.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Add MKEX default profile
Santosh Shukla [Mon, 19 Nov 2018 10:47:32 +0000 (16:17 +0530)]
octeontx2-af: Add MKEX default profile

Added basic default MKEX profile. This profile tells
hardware what data to extract from packet and where to
place it (bit offset) in final KEY generated for the
parsed packet. Based on the bit placement of the packet
data, MCAM entries have to programmed for matching.

Also added a msg to retrieve this MKEX profile from PF/VF
which inturn can process it to determine how MCAM entry
has to be populated.

Signed-off-by: Santosh Shukla <sshukla@marvell.com>
Signed-off-by: Yuri Tolstov <ytolstov@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Alloc and config NPC MCAM entry at a time
Sunil Goutham [Mon, 19 Nov 2018 10:47:31 +0000 (16:17 +0530)]
octeontx2-af: Alloc and config NPC MCAM entry at a time

A new mailbox message is added to support allocating a MCAM entry
along with a counter and configuring it in one go. This reduces
the amount of mailbox communication involved in installing a new
MCAM rule.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Map or unmap NPC MCAM entry and counter
Sunil Goutham [Mon, 19 Nov 2018 10:47:30 +0000 (16:17 +0530)]
octeontx2-af: Map or unmap NPC MCAM entry and counter

Alloc memory to save MCAM 'entry to counter' mapping and since
multiple entries can map to same counter, added counter's reference
count tracking.

Do 'entry to counter' mapping when a entry is being installed
and mbox msg sender requested to configure a counter as well.
Mapping is removed when a entry or counter is being freed or
a explicit mbox msg is received to unmap them.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Support for NPC MCAM counters
Sunil Goutham [Mon, 19 Nov 2018 10:47:29 +0000 (16:17 +0530)]
octeontx2-af: Support for NPC MCAM counters

NPC HW has counters which can be mapped to MCAM
entries to gather entry match statistics. This
patch adds support to allocate, free, clear and retrieve
stats of NPC MCAM counters. New mailbox messages have
been added for this. Similar to MCAM entries both
contiguous and non-contiguous counter allocation is
supported.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: MCAM entry installation support
Sunil Goutham [Mon, 19 Nov 2018 10:47:28 +0000 (16:17 +0530)]
octeontx2-af: MCAM entry installation support

Add support for a RVU PF/VF to enable, disable, configure
and shuffle MCAM entries via mbox commands. This patch adds
mailbox message formats and handling of these commands.

As of now otherthan validating MCAM entry index, info like
channel number e.t.c in MCAM config data sent by PF/VF are
not validated.

Also a max of 64 MCAM entries can be shuffled with a single
mbox command.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: NPC MCAM entry alloc/free support
Sunil Goutham [Mon, 19 Nov 2018 10:47:27 +0000 (16:17 +0530)]
octeontx2-af: NPC MCAM entry alloc/free support

This patch adds NPC MCAM entry management and support for
allocating and freeing them via mailbox. Both contiguous and
non-contiguous allocations are supported. Incase of contiguous,
if request cannot be met then max contiguous number of available
entries are allocated.

High or low priority index allocation w.r.t a reference MCAM index
is also supported.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Relax resource lock into mutex
Stanislaw Kardach [Mon, 19 Nov 2018 10:47:26 +0000 (16:17 +0530)]
octeontx2-af: Relax resource lock into mutex

Mailbox message handling is done in a workqueue context scheduled
from interrupt handler. So resource locks does not need to be a spinlock.
Therefore relax them into a mutex so that later on we may use them
in routines that might sleep.

Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Support to get NIX HW constants from AF
Kiran Kumar [Mon, 19 Nov 2018 10:47:25 +0000 (16:17 +0530)]
octeontx2-af: Support to get NIX HW constants from AF

This patch adds reading HW limits like number of Rx/Tx stats,
number of queue IRQs supported per NIX LF from AF registers
and sync them to PF/VF.

Signed-off-by: Kiran Kumar <kirankumark@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Support to modify min/max allowed packet lengths
Sunil Goutham [Mon, 19 Nov 2018 10:47:24 +0000 (16:17 +0530)]
octeontx2-af: Support to modify min/max allowed packet lengths

This patch adds support for RVU PF/VFs to modify min/max
packet lengths allowed by HW. For VFs on PF0, settings will
be automatically applied on LBK link. RX link's min/maxlen
is configured to min/max of PF and it's all VFs. On the TX side
if requested all SMQs attached to the requesting NIXLF will be
updated with new min/max lengths.

Also updates transmit credits for Tx links based on new maxlen.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoocteontx2-af: Convert mbox handlers APIs to lowercase
Sunil Goutham [Mon, 19 Nov 2018 10:47:23 +0000 (16:17 +0530)]
octeontx2-af: Convert mbox handlers APIs to lowercase

This patch converts all mailbox message handler API
names to lowercase.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'r8169-series-with-further-smaller-improvements'
David S. Miller [Tue, 20 Nov 2018 01:32:15 +0000 (17:32 -0800)]
Merge branch 'r8169-series-with-further-smaller-improvements'

Heiner Kallweit says:

====================
r8169: series with further smaller improvements

Again nothing exciting, just smaller improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: improve chip version identification
Heiner Kallweit [Mon, 19 Nov 2018 21:41:35 +0000 (22:41 +0100)]
r8169: improve chip version identification

Only the upper 12 bits are used for chip identification, this helps
to reduce the size of array mac_info.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: simplify ocp functions
Heiner Kallweit [Mon, 19 Nov 2018 21:40:04 +0000 (22:40 +0100)]
r8169: simplify ocp functions

rtl8168_oob_notify is used in rtl8168dp_driver_start and
rtl8168dp_driver_stop only, so we can rename it to r8168dp_oob_notify.
The same applies to condition rtl_ocp_read_cond which can be renamed
to rtl_dp_ocp_read_cond. This allows to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove workaround for ancient gcc bug
Heiner Kallweit [Mon, 19 Nov 2018 21:39:14 +0000 (22:39 +0100)]
r8169: remove workaround for ancient gcc bug

The kernel can't be built any longer with this ancient GCC version.
Eventually it becomes clear what this statement actually does.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove manual padding in struct ring_info
Heiner Kallweit [Mon, 19 Nov 2018 21:38:22 +0000 (22:38 +0100)]
r8169: remove manual padding in struct ring_info

The compiler takes care of alignment and padding, I see no need to
bother him with manual hints.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove "not PCI Express" message
Heiner Kallweit [Mon, 19 Nov 2018 21:37:34 +0000 (22:37 +0100)]
r8169: remove "not PCI Express" message

The ones who want to know can easily identify whether chip is PCI or
PCIe based on the chip name. I doubt there's any benefit in this
message, so remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove print_mac_version
Heiner Kallweit [Mon, 19 Nov 2018 21:36:15 +0000 (22:36 +0100)]
r8169: remove print_mac_version

The syslog message printed on driver load allows to easily identify
the mac version number (based on chip name and XID). So we don't
need this extra debug message which is wrong anyway because e.g.
RTL_GIGA_MAC_VER_01 has value 0.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: use PCI_VDEVICE macro
Heiner Kallweit [Mon, 19 Nov 2018 21:35:08 +0000 (22:35 +0100)]
r8169: use PCI_VDEVICE macro

Using macro PCI_VDEVICE helps to simplify the PCI ID table.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: replace event_slow with irq_mask
Heiner Kallweit [Mon, 19 Nov 2018 21:34:17 +0000 (22:34 +0100)]
r8169: replace event_slow with irq_mask

Recently the "slow event" handler was removed, therefore the member
name isn't appropriate any longer. In addition store the full mask,
including the RTL_EVENT_NAPI interrupt source bits.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: remove unused interrupt sources
Heiner Kallweit [Mon, 19 Nov 2018 21:33:00 +0000 (22:33 +0100)]
r8169: remove unused interrupt sources

Setting PCSTimeout interrupt source was copied from the vendor driver
which uses the chip programmable timer interrupt. The mainline driver
doesn't use this timer interrupt.

SYSErr indicates a PCI error and isn't defined on the PCIe models.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: use dev_get_drvdata where possible
Heiner Kallweit [Mon, 19 Nov 2018 21:32:18 +0000 (22:32 +0100)]
r8169: use dev_get_drvdata where possible

Using dev_get_drvdata directly is simpler here.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: merge rtl_irq_enable and rtl_irq_enable_all
Heiner Kallweit [Mon, 19 Nov 2018 21:31:32 +0000 (22:31 +0100)]
r8169: merge rtl_irq_enable and rtl_irq_enable_all

After the recent changes to the interrupt handler rtl_irq_enable and
rtl_irq_enable_all can be merged.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'sctp-add-subscribe-per-asoc-and-sockopt-SCTP_EVENT'
David S. Miller [Mon, 19 Nov 2018 20:25:43 +0000 (12:25 -0800)]
Merge branch 'sctp-add-subscribe-per-asoc-and-sockopt-SCTP_EVENT'

Xin Long says:

====================
sctp: add subscribe per asoc and sockopt SCTP_EVENT

This patchset mainly adds the Event Subscription sockopt described in
rfc6525#section-6.2:

"Subscribing to events as described in [RFC6458] uses a setsockopt()
call with the SCTP_EVENT socket option.  This option takes the
following structure, which specifies the association, the event type
(using the same value found in the event type field), and an on/off
boolean.

  struct sctp_event {
    sctp_assoc_t se_assoc_id;
    uint16_t     se_type;
    uint8_t      se_on;
  };

The user fills in the se_type field with the same value found in the
strreset_type field, i.e., SCTP_STREAM_RESET_EVENT.  The user will
also fill in the se_assoc_id field with either the association to set
this event on (this field is ignored for one-to-one style sockets) or
one of the reserved constant values defined in [RFC6458].  Finally,
the se_on field is set with a 1 to enable the event or a 0 to disable
the event."

As for the old SCTP_EVENTS Option with struct sctp_event_subscribe,
it's being DEPRECATED.

v1->v2:
  - fix some key word in changelog that triggerred the filters at
    vger.kernel.org.
v2->v3:
  - fix an array out of bounds noticed by Neil in patch 1/4.
====================

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosctp: add sockopt SCTP_EVENT
Xin Long [Sun, 18 Nov 2018 08:08:54 +0000 (16:08 +0800)]
sctp: add sockopt SCTP_EVENT

This patch adds sockopt SCTP_EVENT described in rfc6525#section-6.2.
With this sockopt users can subscribe to an event from a specified
asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosctp: rename enum sctp_event to sctp_event_type
Xin Long [Sun, 18 Nov 2018 08:08:53 +0000 (16:08 +0800)]
sctp: rename enum sctp_event to sctp_event_type

sctp_event is a structure name defined in RFC for sockopt
SCTP_EVENT. To avoid the conflict, rename it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosctp: add subscribe per asoc
Xin Long [Sun, 18 Nov 2018 08:08:52 +0000 (16:08 +0800)]
sctp: add subscribe per asoc

The member subscribe should be per asoc, so that sockopt SCTP_EVENT
in the next patch can subscribe a event from one asoc only.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosctp: define subscribe in sctp_sock as __u16
Xin Long [Sun, 18 Nov 2018 08:08:51 +0000 (16:08 +0800)]
sctp: define subscribe in sctp_sock as __u16

The member subscribe in sctp_sock is used to indicate to which of
the events it is subscribed, more like a group of flags. So it's
better to be defined as __u16 (2 bytpes), instead of struct
sctp_event_subscribe (13 bytes).

Note that sctp_event_subscribe is an UAPI struct, used on sockopt
calls, and thus it will not be removed. This patch only changes
the internal storage of the flags.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Mon, 19 Nov 2018 18:55:00 +0000 (10:55 -0800)]
Merge git://git./linux/kernel/git/davem/net

5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Mon, 19 Nov 2018 17:24:04 +0000 (09:24 -0800)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix some potentially uninitialized variables and use-after-free in
    kvaser_usb can drier, from Jimmy Assarsson.

 2) Fix leaks in qed driver, from Denis Bolotin.

 3) Socket leak in l2tp, from Xin Long.

 4) RSS context allocation fix in bnxt_en from Michael Chan.

 5) Fix cxgb4 build errors, from Ganesh Goudar.

 6) Route leaks in ipv6 when removing exceptions, from Xin Long.

 7) Memory leak in IDR allocation handling of act_pedit, from Davide
    Caratti.

 8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.

 9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
    Sabrina Dubroca.

10) When NAPI cached skb is reused, we must set it to the proper initial
    state which includes skb->pkt_type. From Eric Dumazet.

11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.

12) Set RX queue properly in various tuntap receive paths, from Matthew
    Cover.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  tuntap: fix multiqueue rx
  ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
  tipc: don't assume linear buffer when reading ancillary data
  tipc: fix lockdep warning when reinitilaizing sockets
  net-gro: reset skb->pkt_type in napi_reuse_skb()
  tc-testing: tdc.py: Guard against lack of returncode in executed command
  tc-testing: tdc.py: ignore errors when decoding stdout/stderr
  ip_tunnel: don't force DF when MTU is locked
  MAINTAINERS: Add entry for CAKE qdisc
  net: bridge: fix vlan stats use-after-free on destruction
  socket: do a generic_file_splice_read when proto_ops has no splice_read
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  net/sched: act_pedit: fix memory leak when IDR allocation fails
  net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
  ipv6: fix a dst leak when removing its exception
  net: mvneta: Don't advertise 2.5G modes
  drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
  net/mlx4: Fix UBSAN warning of signed integer overflow
  ...

5 years agotuntap: fix multiqueue rx
Matthew Cover [Sun, 18 Nov 2018 07:46:00 +0000 (00:46 -0700)]
tuntap: fix multiqueue rx

When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.

Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.

The rx traffic can be generated by either of the following.
  1. a simple tap program which spins up multiple queues and writes packets
     to each of the file descriptors
  2. tx from a qemu vm with a tap multiqueue netdev

The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
  1. a simple netmap program which opens and reads from per-queue
     descriptors
  2. configuring RPS and doing per-cpu captures with rxtxcpu

Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.

Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.

Signed-off-by: Matthew Cover <matthew.cover@stackpath.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
David Ahern [Sun, 18 Nov 2018 18:45:30 +0000 (10:45 -0800)]
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF

Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.

Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack")
Reported-by: Preethi Ramachandra <preethir@juniper.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Expose discard counters via ethtool
Shalom Toledo [Sun, 18 Nov 2018 16:43:03 +0000 (16:43 +0000)]
mlxsw: spectrum: Expose discard counters via ethtool

Expose packets discard counters via ethtool to help with debugging.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotun: use netdev_alloc_frag() in tun_napi_alloc_frags()
Eric Dumazet [Sun, 18 Nov 2018 15:37:33 +0000 (07:37 -0800)]
tun: use netdev_alloc_frag() in tun_napi_alloc_frags()

In order to cook skbs in the same way than Ethernet drivers,
it is probably better to not use GFP_KERNEL, but rather
use the GFP_ATOMIC and PFMEMALLOC mechanisms provided by
netdev_alloc_frag().

This would allow to use tun driver even in memory stress
situations, especially if swap is used over this tun channel.

Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petar Penkov <peterpenkov96@gmail.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'IP101GR-devicetree-based-configuration-of-SEL_INTR32'
David S. Miller [Mon, 19 Nov 2018 00:16:20 +0000 (16:16 -0800)]
Merge branch 'IP101GR-devicetree-based-configuration-of-SEL_INTR32'

Martin Blumenstingl says:

====================
IP101GR: devicetree based configuration of SEL_INTR32

The IP101GR is a 32-pin QFN package variant of the IP101G/IP101GA
Ethernet PHY. Due to it's limited amount of pins the RXER (receive
error) and INTR32 (interrupt) functions share pin 21.

The goal of this series is:
- some small cleanups in patches 3, 4 and 5
- allowing the kernel to detect IRQ floods on boards where the IP101GR
  is configured in RXER mode but the RXER line is configured on the
  host SoC as interrupt line (patch 6)
- configuration of the SEL_INTR32 register so we can use the interrupt
  function on boards where the RXER/INTR32 pin (pin 21) is routed to
  one of the host SoC's interrupt inputs (patches 1, 2, 7)

A use-case where this is needed is the Endless Mini (EC-100). I have
tested my changes on that board. This also confirms that Heiner
Kallweit's recent icplus.c PHY driver changes are working (at least on
my setup).

This series is based on net-next commit 7c460cf9cd1a ("net: aquantia:
fix spelling mistake "specfield" -> "specified"")

Changes since v1 at [0]:
- collected Andrew's Reviewed-by's (thank you!)
- updated description of patch #2 to explain why two properties were
  added instead of adding an "this is a IP101GR" property
- validate that there's no conflicting configuration in patch #7
- rebased on top of latest net-next

[0] https://patchwork.ozlabs.org/cover/999371/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: allow configuring the interrupt function on IP101GR
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:59 +0000 (22:23 +0100)]
net: phy: icplus: allow configuring the interrupt function on IP101GR

The IP101GR is a 32-pin QFN package variant of the IP101G/IP101GA
Ethernet PHY. Due to it's limited amount of pins the RXER (receive
error) and INTR32 (interrupt) functions share pin 21.
By default the PHY is configured to output the "receive error" status on
pin 21. Depending on the board layout and requirements we may want to
re-configure the PHY to output the interrupt signal there.

The mode of pin 21 can be configured in the "Digital I/O Specific
Control Register" (register 29), bit 2:
- 0 = RXER function
- 1 = INTR(32) function

Depending on the devicetree configuration we will now:
- change the mode to either ther RXER or INTR32 function
- keep the SEL_INTR32 value set by the bootloader (default) if no
  configuration is provided (to ensure that we're not breaking existing
  boards)
- error out if conflicting configuration is given (RXER and INTR32 mode
  are enabled at the same time)

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: implement .did_interrupt for IP101A/G
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:58 +0000 (22:23 +0100)]
net: phy: icplus: implement .did_interrupt for IP101A/G

The IP101A_G_IRQ_CONF_STATUS register has bits to detect which
interrupts have fired. Implement the .did_interrupt callback to let the
PHY core know whether the interrupt was for this specific PHY.

This is useful for debugging interrupt problems with 32-pin IP101GR PHYs
where the interrupt line is shared with the RX_ERR (receive error
status) signal. The default values are:
- RX_ERR is enabled by default (LOW means that there is no receive
  error)
- the PHY's interrupt line is configured "active low" by default

Without any additional changes there is a flood of interrupts if the
RX_ERR/INTR32 signal is configured in RX_ERR mode (which is the
default). Having a did_interrupt ensures that the PHY core returns
IRQ_NONE instead of endlessly triggering the PHY state machine.
Additionally the kernel will report this after a while:
  irq 28: nobody cared (try booting with the "irqpoll" option)

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: rename IP101A_G_NO_IRQ to IP101A_G_IRQ_ALL_MASK
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:57 +0000 (22:23 +0100)]
net: phy: icplus: rename IP101A_G_NO_IRQ to IP101A_G_IRQ_ALL_MASK

The datasheet uses the name "All Mask" for this bit. Change the name of
our #define to be consistent with the datasheet. While here also replace
the tab between the #define and IP101A_G_IRQ_ALL_MASK with a space.
No functional changes.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: use the BIT macro where possible
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:56 +0000 (22:23 +0100)]
net: phy: icplus: use the BIT macro where possible

This makes the code consistent by using the BIT() macro instead of
manual bit-shifting for some of the fields. No functional changes.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: icplus: keep all ip101a_g functions together
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:55 +0000 (22:23 +0100)]
net: phy: icplus: keep all ip101a_g functions together

This simply moves ip101a_g_config_init right above
ip101a_g_config_intr so all functions for the ICPlus IP101A/G PHYs are
grouped together.
No functional changes.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodt-bindings: net: phy: add bindings for the IC Plus Corp. IP101A/G PHYs
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:54 +0000 (22:23 +0100)]
dt-bindings: net: phy: add bindings for the IC Plus Corp. IP101A/G PHYs

The IP101A and IP101G series both have various models. Depending on the
board implementation we need a special property for the IP101GR (32-pin
LQFP package) PHY:
pin 21 ("RXER/INTR_32") outputs the "receive error" signal by default
(LOW means "normal operation", HIGH means that there's either a decoding
error of the received signal or that the PHY is receiving LPI). This pin
can also be switched to INTR32 mode, where the interrupt signal is
routed to this pin. The other PHYs don't need this special handling
because they have more pins available so the interrupt function gets a
dedicated pin.

This adds two properties to either select the "receive error" or
"interrupt" function of pin 21. Not specifying any function means that
the default set by the bootloader is used. This is required because the
IP101GR cannot be differentiated between other IP101 PHYs as the PHY
identification registers on all of these is 0x02430c54.

The IP101G (sold as die only, without package) may suffer from the same
issue depending on how it's integrated into a multi chip package by
another manufacturer. If only the RXER/INTR_32 pin is routed then the
users of the die-only variant may also have to explicitly configure the
mode of hte RXER/INTR_32 pin. This is the reason why no "is-ip101gr"
property was added. I have no evidence though which would confirm this
theory - so the binding itself is independent of that.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodt-bindings: vendor-prefix: add prefix for IC Plus Corp.
Martin Blumenstingl [Sun, 18 Nov 2018 21:23:53 +0000 (22:23 +0100)]
dt-bindings: vendor-prefix: add prefix for IC Plus Corp.

IC Plus Corp. has various Ethernet related products such as Ethernet
transceivers, Ethernet controllers, Ethernet switches, etc.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoLinux 4.20-rc3 v4.20-rc3
Linus Torvalds [Sun, 18 Nov 2018 21:33:44 +0000 (13:33 -0800)]
Linux 4.20-rc3

5 years agotg3: optionally use eth_platform_get_mac_address() to get mac address
thesven73@gmail.com [Sat, 17 Nov 2018 15:56:18 +0000 (10:56 -0500)]
tg3: optionally use eth_platform_get_mac_address() to get mac address

This function will try to determine the mac address via the devicetree,
or via an architecture-specific method (e.g. a PROM on SPARC).

The SPARC-specific code in this driver (#ifdef SPARC) did exactly this,
and is therefore removed.

Note that you can now specify the tg3 mac address via the devicetree,
on any platform, not just SPARC:

Devicetree example:
(see Documentation/devicetree/bindings/pci/pci.txt)

&pcie {
host@0 {
#address-cells = <3>;
#size-cells = <2>;
reg = <0 0 0 0 0>;
bcm5778: bcm5778@0 {
reg = <0 0 0 0 0>;
mac-address = [CA 11 AB 1E 10 01];
};
};
};

Signed-off-by: Sven Van Asbroeck <svendev@arcx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Add part of TCP counts explanations in snmp_counters.rst
yupeng [Fri, 16 Nov 2018 19:17:40 +0000 (11:17 -0800)]
net: Add part of TCP counts explanations in snmp_counters.rst

Add explanations of some generic TCP counters, fast open
related counters and TCP abort related counters and several
examples.

Signed-off-by: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 20:21:09 +0000 (12:21 -0800)]
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
 "A small batch of fixes for v4.20-rc3.

  The overflow continuation fix addresses something that has been broken
  for several releases. Arguably it could wait even longer, but it's a
  one line fix and this finishes the last of the known address range
  scrub bug reports. The revert addresses a lockdep regression. The unit
  tests are not critical to fix, but no reason to hold this fix back.

  Summary:

   - Address Range Scrub overflow continuation handling has been broken
     since it was initially merged. It was only recently that error
     injection and platform-BIOS support enabled this corner case to be
     exercised.

   - The recent attempt to provide more isolation for the kernel Address
     Range Scrub state machine from userapace initiated sessions
     triggers a lockdep report. Revert and try again at the next merge
     window.

   - Fix a kasan reported buffer overflow in libnvdimm unit test
     infrastrucutre (nfit_test)"

* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  Revert "acpi, nfit: Further restrict userspace ARS start requests"
  acpi, nfit: Fix ARS overflow continuation
  tools/testing/nvdimm: Fix the array size for dimm devices.

5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 18 Nov 2018 19:31:26 +0000 (11:31 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "16 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
  mm, page_alloc: check for max order in hot path
  scripts/spdxcheck.py: make python3 compliant
  tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
  lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
  mm/vmstat.c: fix NUMA statistics updates
  mm/gup.c: fix follow_page_mask() kerneldoc comment
  ocfs2: free up write context when direct IO failed
  scripts/faddr2line: fix location of start_kernel in comment
  mm: don't reclaim inodes with many attached pages
  mm, memory_hotplug: check zone_movable in has_unmovable_pages
  mm/swapfile.c: use kvzalloc for swap_info_struct allocation
  MAINTAINERS: update OMAP MMC entry
  hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
  kernel/sched/psi.c: simplify cgroup_move_task()
  z3fold: fix possible reclaim races

5 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 18:58:20 +0000 (10:58 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix an exec() related scalability/performance regression, which was
  caused by incorrectly calculating load and migrating tasks on exec()
  when they shouldn't be"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix cpu_util_wake() for 'execl' type workloads

5 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 18:54:59 +0000 (10:54 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Fix uncore PMU enumeration for CofeeLake CPUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
  perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs

5 years agoMerge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 18 Nov 2018 18:52:26 +0000 (10:52 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull EFI fixes from Ingo Molnar:
 "Misc fixes: two warning splat fixes, a leak fix and persistent memory
  allocation fixes for ARM"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: Permit calling efi_mem_reserve_persistent() from atomic context
  efi/arm: Defer persistent reservations until after paging_init()
  efi/arm/libstub: Pack FDT after populating it
  efi/arm: Revert deferred unmap of early memmap mapping
  efi: Fix debugobjects warning on 'efi_rts_work'

5 years agoMerge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm
Linus Torvalds [Sun, 18 Nov 2018 18:45:09 +0000 (10:45 -0800)]
Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM spectre updates from Russell King:
 "These are the currently known final bits that resolve the Spectre
  issues. big.Little systems used to be sufficiently identical in that
  there were no differences between individual CPUs in the system that
  mattered to the kernel. With the advent of the Spectre problem, the
  CPUs now have differences in how the workaround is applied.

  As a result of previous Spectre patches, these systems ended up
  reporting quite a lot of:

     "CPUx: Spectre v2: incorrect context switching function, system vulnerable"

  messages due to the action of the big.Little switcher causing the CPUs
  to be re-initialised regularly. This series resolves that issue by
  making the CPU vtable unique to each CPU.

  However, since this is used very early, before per-cpu is setup,
  per-cpu can't be used. We also have a problem that two of the methods
  are not called from preempt-safe paths, but thankfully these remain
  identical between all CPUs in the system. To make sure, we validate
  that these are identical during boot"

* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: spectre-v2: per-CPU vtables to work around big.Little systems
  ARM: add PROC_VTABLE and PROC_TABLE macros
  ARM: clean up per-processor check_bugs method call
  ARM: split out processor lookup
  ARM: make lookup_processor_type() non-__init

5 years agomm/memblock.c: fix a typo in __next_mem_pfn_range() comments
Chen Chang [Fri, 16 Nov 2018 23:08:57 +0000 (15:08 -0800)]
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments

Link: http://lkml.kernel.org/r/20181107100247.13359-1-rainccrun@gmail.com
Signed-off-by: Chen Chang <rainccrun@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, page_alloc: check for max order in hot path
Michal Hocko [Fri, 16 Nov 2018 23:08:53 +0000 (15:08 -0800)]
mm, page_alloc: check for max order in hot path

Konstantin has noticed that kvmalloc might trigger the following
warning:

  WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
  [...]
  Call Trace:
   fragmentation_index+0x76/0x90
   compaction_suitable+0x4f/0xf0
   shrink_node+0x295/0x310
   node_reclaim+0x205/0x250
   get_page_from_freelist+0x649/0xad0
   __alloc_pages_nodemask+0x12a/0x2a0
   kmalloc_large_node+0x47/0x90
   __kmalloc_node+0x22b/0x2e0
   kvmalloc_node+0x3e/0x70
   xt_alloc_table_info+0x3a/0x80 [x_tables]
   do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
   nf_setsockopt+0x44/0x60
   SyS_setsockopt+0x6f/0xc0
   do_syscall_64+0x67/0x120
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already.  This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile.  A recent UBSAN report just underlines that
by the following report

  UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
  shift exponent 51 is too large for 32-bit type 'int'
  CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0xd2/0x148 lib/dump_stack.c:113
   ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
   __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
   __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
   zone_watermark_fast mm/page_alloc.c:3216 [inline]
   get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
   __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
   alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
   alloc_pages include/linux/gfp.h:509 [inline]
   __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
   dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
   raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
   raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
   fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
   fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
   __blkdev_driver_ioctl block/ioctl.c:303 [inline]
   blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
   block_ioctl+0x105/0x150 fs/block_dev.c:1883
   vfs_ioctl fs/ioctl.c:46 [inline]
   do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
   ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
   __do_sys_ioctl fs/ioctl.c:709 [inline]
   __se_sys_ioctl fs/ioctl.c:707 [inline]
   __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
   do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path.  It is just that the fast path
really depends on having sanitzed order as well.  Therefore move the
order check to the fast path.

Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>