OSDN Git Service

uclinux-h8/linux.git
8 years ago6lowpan: add support for 802.15.4 short addr handling
Alexander Aring [Wed, 15 Jun 2016 19:20:27 +0000 (21:20 +0200)]
6lowpan: add support for 802.15.4 short addr handling

This patch adds necessary handling for use the short address for
802.15.4 6lowpan. It contains support for IPHC address compression
and new matching algorithmn to decide which link layer address will be
used for 802.15.4 frame.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago6lowpan: add support for getting short address
Alexander Aring [Wed, 15 Jun 2016 19:20:26 +0000 (21:20 +0200)]
6lowpan: add support for getting short address

In case of sending RA messages we need some way to get the short address
from an 802.15.4 6LoWPAN interface. This patch will add a temporary
debugfs entry for experimental userspace api.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago6lowpan: introduce 6lowpan-nd
Alexander Aring [Wed, 15 Jun 2016 19:20:25 +0000 (21:20 +0200)]
6lowpan: introduce 6lowpan-nd

This patch introduce different 6lowpan handling for receive and transmit
NS/NA messages for the ipv6 neighbour discovery. The first use-case is
for supporting 802.15.4 short addresses inside the option fields and
handling for RFC6775 6CO option field as userspace option.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoipv6: export several functions
Alexander Aring [Wed, 15 Jun 2016 19:20:24 +0000 (21:20 +0200)]
ipv6: export several functions

This patch exports some neighbour discovery functions which can be used
by 6lowpan neighbour discovery ops functionality then.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoipv6: introduce neighbour discovery ops
Alexander Aring [Wed, 15 Jun 2016 19:20:23 +0000 (21:20 +0200)]
ipv6: introduce neighbour discovery ops

This patch introduces neighbour discovery ops callback structure. The
idea is to separate the handling for 6LoWPAN into the 6lowpan module.

These callback offers 6lowpan different handling, such as 802.15.4 short
address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoaddrconf: put prefix address add in an own function
Alexander Aring [Wed, 15 Jun 2016 19:20:22 +0000 (21:20 +0200)]
addrconf: put prefix address add in an own function

This patch moves the functionality to add a RA PIO prefix generated
address in an own function. This move prepares to add a hook for
adding a second address for a second link-layer address. E.g. short
address for 802.15.4 6LoWPAN.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agondisc: add __ndisc_fill_addr_option function
Alexander Aring [Wed, 15 Jun 2016 19:20:21 +0000 (21:20 +0200)]
ndisc: add __ndisc_fill_addr_option function

This patch adds __ndisc_fill_addr_option as low-level function for
ndisc_fill_addr_option which doesn't depend on net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agondisc: add __ndisc_opt_addr_data function
Alexander Aring [Wed, 15 Jun 2016 19:20:20 +0000 (21:20 +0200)]
ndisc: add __ndisc_opt_addr_data function

This patch adds __ndisc_opt_addr_data as low-level function for
ndisc_opt_addr_data which doesn't depend on net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agondisc: add __ndisc_opt_addr_space function
Alexander Aring [Wed, 15 Jun 2016 19:20:19 +0000 (21:20 +0200)]
ndisc: add __ndisc_opt_addr_space function

This patch adds __ndisc_opt_addr_space as low-level function for
ndisc_opt_addr_space which doesn't depend on net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago6lowpan: remove ipv6 module request
Alexander Aring [Wed, 15 Jun 2016 19:20:18 +0000 (21:20 +0200)]
6lowpan: remove ipv6 module request

Since we use exported function from ipv6 kernel module we don't need to
request the module anymore to have ipv6 functionality.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago6lowpan: add 802.15.4 short addr slaac
Alexander Aring [Wed, 15 Jun 2016 19:20:17 +0000 (21:20 +0200)]
6lowpan: add 802.15.4 short addr slaac

This patch adds the autoconfiguration if a valid 802.15.4 short address
is available for 802.15.4 6LoWPAN interfaces.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years ago6lowpan: add private neighbour data
Alexander Aring [Wed, 15 Jun 2016 19:20:16 +0000 (21:20 +0200)]
6lowpan: add private neighbour data

This patch will introduce a 6lowpan neighbour private data. Like the
interface private data we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.

The current first use case if to save the short address for a 802.15.4
6lowpan neighbour.

Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'cxgb4-sriov-sysfs'
David S. Miller [Wed, 15 Jun 2016 21:46:05 +0000 (14:46 -0700)]
Merge branch 'cxgb4-sriov-sysfs'

Hariprasad Shenai says:

====================
Add SRIOV configuration via sysfs and few fixes

This series adds support to configure SR-IOV via PCI sysfs interface,
reduces resource allocation in kdump kernel by disabling offload. Also
synchronize unicast and multicast mac address, even in the interface is in
Promiscuous mode.

This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agocxgb4/cxgb4vf: Synchronize all MAC addresses
Hariprasad Shenai [Tue, 14 Jun 2016 09:09:32 +0000 (14:39 +0530)]
cxgb4/cxgb4vf: Synchronize all MAC addresses

Even if interface is in Promiscuous mode/Allmulti mode synchronize
MAC addresses.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agocxgb4: Enable SR-IOV configuration via PCI sysfs interface
Hariprasad Shenai [Tue, 14 Jun 2016 09:09:31 +0000 (14:39 +0530)]
cxgb4: Enable SR-IOV configuration via PCI sysfs interface

Implement callback in the driver for the new PCI bus driver
interface that allows the user to enable/disable SR-IOV
virtual functions in a device via the sysfs interface.

Deprecate module parameter used to configure SRIOV

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agocxgb4: Force cxgb4 driver as MASTER in kdump kernel
Hariprasad Shenai [Tue, 14 Jun 2016 09:09:30 +0000 (14:39 +0530)]
cxgb4: Force cxgb4 driver as MASTER in kdump kernel

When is_kdump_kernel() is true, Forcing cxgb4 driver as Master so we can
reinitialize the Firmware/Chip. Also reduce memory usage by disabling
offload.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'sched_skb_free_defer'
David S. Miller [Wed, 15 Jun 2016 21:08:36 +0000 (14:08 -0700)]
Merge branch 'sched_skb_free_defer'

Eric Dumazet says:

====================
net_sched: defer skb freeing while changing qdiscs

qdiscs/classes are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

I saw spikes of 50+ ms on quite fast hardware...

This patch series adds a simple queue protected by RTNL
where skbs can be placed until RTNL is released.

Note that this might also serve in the future for optional
reinjection of packets when a qdisc is replaced.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_fq: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:59 +0000 (20:21 -0700)]
net_sched: sch_fq: defer skb freeing

sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_pie: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:58 +0000 (20:21 -0700)]
net_sched: sch_pie: defer skb freeing

pie_change() can use rtnl_qdisc_drop() to benefit from
deferred freeing.

pie_reset() is already using qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_netem: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:57 +0000 (20:21 -0700)]
net_sched: sch_netem: defer skb freeing

rtnl_kfree_skbs() can be used in tfifo_reset()

It would be nice if we could iterate through rb tree instead
of removing one skb at a time, and build a single skb chain.
But this is left for a future patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_htb: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:56 +0000 (20:21 -0700)]
net_sched: sch_htb: defer skb freeing

Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
instead of __skb_queue_purge() to defer skb freeing of internal
queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_hhf: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:55 +0000 (20:21 -0700)]
net_sched: sch_hhf: defer skb freeing

Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: fq_codel: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:54 +0000 (20:21 -0700)]
net_sched: fq_codel: defer skb freeing

Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_fq: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:53 +0000 (20:21 -0700)]
net_sched: sch_fq: defer skb freeing

Both fq_change() and fq_reset() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_codel: defer skb freeing in codel_change()
Eric Dumazet [Tue, 14 Jun 2016 03:21:52 +0000 (20:21 -0700)]
net_sched: sch_codel: defer skb freeing in codel_change()

codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_choke: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:51 +0000 (20:21 -0700)]
net_sched: sch_choke: defer skb freeing

choke_reset() and choke_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: add the ability to defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:50 +0000 (20:21 -0700)]
net_sched: add the ability to defer skb freeing

qdisc are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.

Actual freeing happens right after RTNL is released,
with appropriate scheduling points.

rtnl_qdisc_drop() can also be used in place
of disc_drop() when RTNL is held.

qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agotipc: add neighbor monitoring framework
Jon Paul Maloy [Tue, 14 Jun 2016 00:46:22 +0000 (20:46 -0400)]
tipc: add neighbor monitoring framework

TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.

This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.

This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N
  known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from
  itself in the list, and chooses to actively monitor those. This is
  called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly
  monitored node has gone up or down. If a node is indicated lost, the
  receiving node temporarily activates its own direct monitoring towards
  that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  will be more than two direct monitoring hops away. Because of this,
  each node, apart from monitoring the member of its local domain, will
  also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying  unaltered domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the
  new algorithm through configuration. We do this by keeping a threshold
  value for the cluster size; a cluster that grows beyond this value
  will switch from full-mesh to ring monitoring, and vice versa when
  it shrinks below the value. This means that if the threshold is set to
  a value larger than any anticipated cluster size (default size is 32)
  the new algorithm is effectively disabled. A patch set for altering the
  threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: vrf: Update flags and features settings
David Ahern [Tue, 14 Jun 2016 00:14:12 +0000 (17:14 -0700)]
net: vrf: Update flags and features settings

1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
   can add one as desired.

2. Disable adding a VLAN to a VRF device.

3. Enable offloads and hardware features similar to other logical
   devices (e.g., dummy, veth)

Change provides a significant boost in TCP stream Tx performance,
from ~2,700 Mbps to ~18,100 Mbps and makes throughput close to the
performance without a VRF (18,500 Mbps). netperf TCP_STREAM benchmark
using qemu with virtio+vhost for the NICs

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agotun: fix csum generation for tap devices
Paolo Abeni [Mon, 13 Jun 2016 22:00:04 +0000 (00:00 +0200)]
tun: fix csum generation for tap devices

The commit 34166093639b ("tuntap: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implictly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidate the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.

Fixes: 34166093639 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'skb_array'
David S. Miller [Wed, 15 Jun 2016 20:58:34 +0000 (13:58 -0700)]
Merge branch 'skb_array'

Michael S. Tsirkin says:

====================
skb_array: array based FIFO for skbs

This is in response to the proposal by Jason to make tun
rx packet queue lockless using a circular buffer.
My testing seems to show that at least for the common usecase
in networking, which isn't lockless, circular buffer
with indices does not perform that well, because
each index access causes a cache line to bounce between
CPUs, and index access causes stalls due to the dependency.

By comparison, an array of pointers where NULL means invalid
and !NULL means valid, can be updated without messing up barriers
at all and does not have this issue.

On the flip side, cache pressure may be caused by using large queues.
tun has a queue of 1000 entries by default and that's 8K.
At this point I'm not sure this can be solved efficiently.
The correct solution might be sizing the queues appropriately.

Here's an implementation of this idea: it can be used more
or less whenever sk_buff_head can be used, except you need
to know the queue size in advance.

As this might be useful outside of networking, I implemented
a generic array of void pointers, with a type-safe wrapper for skbs.

It remains to be seen whether resizing is required, in case it is
I included patches implementing resizing by holding both the
consumer and the producer locks.

I think this code works fine without any extra memory barriers since we
always read and write the same location, so the accesses can not be
reordered.
Multiple writes of the same value into memory would mess things up
for us, I don't think compilers would do it though.
But if people feel it's better to be safe wrt compiler optimizations,
specifying queue as volatile would probably do it in a cleaner way
than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?

The only issue is with calls within a loop using the __ptr_ring_XXX
accessors - in theory compiler could hoist accesses out of the loop.

Following volatile-considered-harmful.txt I merely
documented that callers that busy-poll should invoke cpu_relax().
Most people will use the external skb_array_XXX APIs with a spinlock,
so this should not be an issue for them.

Eric Dumazet suggested adding an extra pointer to skb for when
we have a single outstanding packet. I could not figure out
a way to implement this without a shared consumer/producer lock
though, which would cause cache line bounces by itself.

Jesper, Jason, I know that both of you tested this,
please post Tested-by tags for whatever was tested.

changes since v7
fix typos noticed by Jesper Brouer

changes since v6
resize implemented. peek/full calls are no longer lockless

replaced _FIELD macros with _CALL which invoke a function
on the pointer rather than just returning a value

destroy now scans the array and frees all queued skbs

changes since v5
implemented a generic ptr_ring api, and
made skb_array a type-safe wrapper
apis for taking the spinlock in different contexts
following expected usecase in tun
changes since v4 (v3 was never posted)
documentation
dropped SKB_ARRAY_MIN_SIZE heuristic
unit test (in userspace, included as patch 2)

changes since v2:
        fixed integer overflow pointed out by Eric.
        added some comments.

changes since v1:
        fixed bug pointed out by Eric.
====================

Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoskb_array: resize support
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:50 +0000 (23:54 +0300)]
skb_array: resize support

Update skb_array after ptr_ring API changes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoptr_ring: resize support
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:45 +0000 (23:54 +0300)]
ptr_ring: resize support

This adds ring resize support. Seems to be necessary as
users such as tun allow userspace control over queue size.

If resize is used, this costs us ability to peek at queue without
consumer lock - should not be a big deal as peek and consumer are
usually run on the same CPU.

If ring is made bigger, ring contents is preserved.  If ring is made
smaller, extra pointers are passed to an optional destructor callback.

Cleanup function also gains destructor callback such that
all pointers in queue can be cleaned up.

This changes some APIs but we don't have any users yet,
so it won't break bisect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoskb_array: array based FIFO for skbs
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:41 +0000 (23:54 +0300)]
skb_array: array based FIFO for skbs

A simple array based FIFO of pointers.  Intended for net stack so uses
skbs for type safety. Implemented as a set of wrappers around ptr_ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoptr_ring: ring test
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:36 +0000 (23:54 +0300)]
ptr_ring: ring test

Add ringtest based unit test for ptr ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoptr_ring: array based FIFO for pointers
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:31 +0000 (23:54 +0300)]
ptr_ring: array based FIFO for pointers

A simple array based FIFO of pointers.  Intended for net stack which
commonly has a single consumer/producer.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: make tcf_hash_check() boolean
WANG Cong [Mon, 13 Jun 2016 20:46:28 +0000 (13:46 -0700)]
net_sched: make tcf_hash_check() boolean

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'vrf-ipv6-mcast-link-local'
David S. Miller [Wed, 15 Jun 2016 19:34:34 +0000 (12:34 -0700)]
Merge branch 'vrf-ipv6-mcast-link-local'

David Ahern says:

====================
net: vrf: Handle ipv6 multicast and link-local addresses

IPv6 multicast and link-local addresses require special handling by the
VRF driver. Rather than using the VRF device index and full FIB lookups,
packets to/from these addresses should use direct FIB lookups based on
the VRF device table.

Multicast routes do not make sense for the L3 master device directly.
Accordingly, do not add mcast routes for the device, and the VRF driver
should fail attempts to send packets to ipv6 mcast addresses on the
device (e.g, ping6 ff02::1%<vrf> should fail)

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses (icmp, tcp, and udp).  e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1
    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: vrf: Handle ipv6 multicast and link-local addresses
David Ahern [Mon, 13 Jun 2016 20:44:19 +0000 (13:44 -0700)]
net: vrf: Handle ipv6 multicast and link-local addresses

IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
   packets to/from these addresses should use direct FIB lookups based on
   the VRF device table.

2. fail sends/receives on a VRF device to/from a multicast address
   (e.g, make ping6 ff02::1%<vrf> fail)

3. move the setting of the flow oif to the first dst lookup and revert
   the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF
   support to IPv6 stack"). Linklocal/mcast addresses require use of the
   skb->dev.

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses work (icmp, tcp, and udp)
e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1

    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: ipv6: Do not add multicast route for l3 master devices
David Ahern [Mon, 13 Jun 2016 20:44:18 +0000 (13:44 -0700)]
net: ipv6: Do not add multicast route for l3 master devices

L3 master devices are virtual devices similar to the loopback
device. Link local and multicast routes for these devices do
not make sense. The ipv6 addrconf code already skips adding a
linklocal address; do the same for the mcast route.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: l3mdev: Remove const from flowi6 arg to get_rt6_dst
David Ahern [Mon, 13 Jun 2016 20:44:17 +0000 (13:44 -0700)]
net: l3mdev: Remove const from flowi6 arg to get_rt6_dst

Allow drivers to pass flow arg to functions where the arg is not const
and allow the driver to make updates as needed (eg., setting oif).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'af_iucv-big-bufs'
David S. Miller [Wed, 15 Jun 2016 19:21:05 +0000 (12:21 -0700)]
Merge branch 'af_iucv-big-bufs'

Ursula Braun says:

====================
s390: af_iucv patches

here are improvements for af_iucv relaxing the pressure to allocate
big contiguous kernel buffers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoaf_iucv: use paged SKBs for big inbound messages
Eugene Crosser [Mon, 13 Jun 2016 16:46:16 +0000 (18:46 +0200)]
af_iucv: use paged SKBs for big inbound messages

When an inbound message is bigger than a page, allocate a paged SKB,
and subsequently use IUCV receive primitive with IPBUFLST flag.
This relaxes the pressure to allocate big contiguous kernel buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoaf_iucv: remove fragment_skb() to use paged SKBs
Eugene Crosser [Mon, 13 Jun 2016 16:46:15 +0000 (18:46 +0200)]
af_iucv: remove fragment_skb() to use paged SKBs

Before introducing paged skbs in the receive path, get rid of the
function `iucv_fragment_skb()` that replaces one large linear skb
with several smaller linear skbs.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoaf_iucv: use paged SKBs for big outbound messages
Eugene Crosser [Mon, 13 Jun 2016 16:46:14 +0000 (18:46 +0200)]
af_iucv: use paged SKBs for big outbound messages

When an outbound message is bigger than a page, allocate and fill
a paged SKB, and subsequently use IUCV send primitive with IPBUFLST
flag. This relaxes the pressure to allocate big contiguous kernel
buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodt: bindings: Add bindings for Cirrus Logic CS89x0 ethernet chip
Alexander Shiyan [Mon, 13 Jun 2016 15:52:17 +0000 (18:52 +0300)]
dt: bindings: Add bindings for Cirrus Logic CS89x0 ethernet chip

Add device tree binding documentation details for Cirrus Logic
CS8900/CS8920 ethernet chip.

Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: cx89x0: Add DT support
Alexander Shiyan [Mon, 13 Jun 2016 15:51:05 +0000 (18:51 +0300)]
net: cx89x0: Add DT support

Add DT support to the Cirrus Logic CS89x0 driver.

Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoact_police: rename tcf_act_police_locate() to tcf_act_police_init()
WANG Cong [Mon, 13 Jun 2016 17:47:44 +0000 (10:47 -0700)]
act_police: rename tcf_act_police_locate() to tcf_act_police_init()

This function is just ->init(), rename it to make it obvious.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: remove internal use of TC_POLICE_*
WANG Cong [Mon, 13 Jun 2016 17:47:43 +0000 (10:47 -0700)]
net_sched: remove internal use of TC_POLICE_*

These should be gone when we removed CONFIG_NET_CLS_POLICE.
We can not totally remove them since they are exposed
to userspace.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'rds-mprds-foundations'
David S. Miller [Wed, 15 Jun 2016 06:50:44 +0000 (23:50 -0700)]
Merge branch 'rds-mprds-foundations'

Sowmini Varadhan says:

====================
RDS: multiple connection paths for scaling

Today RDS-over-TCP is implemented by demux-ing multiple PF_RDS sockets
between any 2 endpoints (where endpoint == [IP address, port]) over a
single TCP socket between the 2 IP addresses involved. This has the
limitation that it ends up funneling multiple RDS flows over a single
TCP flow, thus the rds/tcp connection is
   (a) upper-bounded to the single-flow bandwidth,
   (b) suffers from head-of-line blocking for the RDS sockets.

Better throughput (for a fixed small packet size, MTU) can be achieved
by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
connection. RDS sockets will be attached to a path based on some hash
(e.g., of local address and RDS port number) and packets for that RDS
socket will be sent over the attached path using TCP to segment/reassemble
RDS datagrams on that path.

The table below, generated using a prototype that implements mprds,
shows that this is significant for scaling to 40G.  Packet sizes
used were: 8K byte req, 256 byte resp. MTU: 1500.  The parameters for
RDS-concurrency used below are described in the rds-stress(1) man page-
the number listed is proportional to the number of threads at which max
throughput was attained.

  -------------------------------------------------------------------
     RDS-concurrency   Num of       tx+rx K/s (iops)       throughput
     (-t N -d N)       TCP paths
  -------------------------------------------------------------------
        16             1             600K -  700K            4 Gbps
        28             8            5000K - 6000K           32 Gbps
  -------------------------------------------------------------------

FAQ: what is the relation between mprds and mptcp?
  mprds is orthogonal to mptcp. Whereas mptcp creates
  sub-flows for a single TCP connection, mprds parallelizes tx/rx
  at the RDS layer. MPRDS with N paths will allow N datagrams to
  be sent in parallel; each path will continue to send one
  datagram at a time, with sender and receiver keeping track of
  the retransmit and dgram-assembly state based on the RDS header.
  If desired, mptcp can additionally be used to speed up each TCP
  path. That acceleration is orthogonal to the parallelization benefits
  of mprds.

This patch series lays down the foundational data-structures to support
mprds in the kernel. It implements the changes to split up the
rds_connection structure into a common (to all paths) part,
and a per-path rds_conn_path. All I/O workqs are driven from
the rds_conn_path.

Note that this patchset does not (yet) actually enable multipathing
for any of the transports; all transports will continue to use a
single path with the refactored data-structures. A subsequent patchset
will  add the changes to the rds-tcp module to actually use mprds
in rds-tcp.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Update rds_conn_destroy to be MP capable
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:42 +0000 (09:44 -0700)]
RDS: Update rds_conn_destroy to be MP capable

Refactor rds_conn_destroy() so that the per-path dismantling
is done in rds_conn_path_destroy, and then iterate as needed
over rds_conn_path_destroy().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Update rds_conn_shutdown to work with rds_conn_path
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:41 +0000 (09:44 -0700)]
RDS: Update rds_conn_shutdown to work with rds_conn_path

This commit changes rds_conn_shutdown to take a rds_conn_path *
argument, allowing it to shutdown paths other than c_path[0] for
MP-capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:40 +0000 (09:44 -0700)]
RDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create

Add a for() loop in __rds_conn_create to initialize all the
conn_paths, in preparate for MP capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Add rds_conn_path_error()
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:39 +0000 (09:44 -0700)]
RDS: Add rds_conn_path_error()

rds_conn_path_error() is the MP-aware analog of rds_conn_error,
to be used by multipath-capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: update rds-info related functions to traverse multiple conn_paths
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:38 +0000 (09:44 -0700)]
RDS: update rds-info related functions to traverse multiple conn_paths

This commit updates the callbacks related to the rds-info command
so that they walk through all the rds_conn_path structures and
report the requested info.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Add rds_conn_path_connect_if_down() for MP-aware callers
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:37 +0000 (09:44 -0700)]
RDS: Add rds_conn_path_connect_if_down() for MP-aware callers

rds_conn_path_connect_if_down() works on the rds_conn_path
that it is passed. Callers who are not t_m_capable may continue
calling rds_conn_connect_if_down, which will invoke
rds_conn_path_connect_if_down() with the default c_path[0].

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Make rds_send_pong() take a rds_conn_path argument
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:36 +0000 (09:44 -0700)]
RDS: Make rds_send_pong() take a rds_conn_path argument

This commit allows rds_send_pong() callers to send back
the rds pong message on some path other than c_path[0] by
passing in a struct rds_conn_path * argument.  It also
removes the last dependency on the #defines in rds_single.h
from send.c

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Extract rds_conn_path from i_conn_path in rds_send_drop_to() for MP-capable...
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:35 +0000 (09:44 -0700)]
RDS: Extract rds_conn_path from i_conn_path in rds_send_drop_to() for MP-capable transports

Explicitly set up rds_conn_path, either from i_conn_path (for
MP capable transpots) or as c_path[0], and use this in
rds_send_drop_to()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Pass rds_conn_path to rds_send_xmit()
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:34 +0000 (09:44 -0700)]
RDS: Pass rds_conn_path to rds_send_xmit()

Pass a struct rds_conn_path to rds_send_xmit so that MP capable
transports can transmit packets on something other than c_path[0].
The eventual goal for MP capable transports is to hash the rds
socket to a path based on the bound local address/port, and use
this path as the argument to rds_send_xmit()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Make rds_send_queue_rm() rds_conn_path aware
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:33 +0000 (09:44 -0700)]
RDS: Make rds_send_queue_rm() rds_conn_path aware

Pass the rds_conn_path to rds_send_queue_rm, and use it to initialize
the i_conn_path field in struct rds_incoming. This commit also makes
rds_send_queue_rm() MP capable, because it now takes locks
specific to the rds_conn_path passed in, instead of defaulting to
the c_path[0] based defines from rds_single_path.h

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Remove stale function rds_send_get_message()
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:32 +0000 (09:44 -0700)]
RDS: Remove stale function rds_send_get_message()

The only caller of rds_send_get_message() was
rds_iw_send_cq_comp_handler() which was removed as part of
commit dcdede0406d3 ("RDS: Drop stale iWARP RDMA transport"),
so remove rds_send_get_message() for the same reason.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Add rds_send_path_drop_acked()
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:31 +0000 (09:44 -0700)]
RDS: Add rds_send_path_drop_acked()

rds_send_path_drop_acked() is the path-specific version of
rds_send_drop_acked() to be invoked by MP capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: Add rds_send_path_reset()
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:30 +0000 (09:44 -0700)]
RDS: Add rds_send_path_reset()

rds_send_path_reset() is the path specific version of rds_send_reset()
intended for MP capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: rds_inc_path_init() helper function for MP capable transports
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:29 +0000 (09:44 -0700)]
RDS: rds_inc_path_init() helper function for MP capable transports

t_mp_capable transports can use rds_inc_path_init to initialize
all fields in struct rds_incoming, including the i_conn_path.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: recv path gets the conn_path from rds_incoming for MP capable transports
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:28 +0000 (09:44 -0700)]
RDS: recv path gets the conn_path from rds_incoming for MP capable transports

Transports that are t_mp_capable should set the rds_conn_path
on which the datagram was recived in the ->i_conn_path field
of struct rds_incoming.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: add t_mp_capable bit to be set by MP capable transports
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:27 +0000 (09:44 -0700)]
RDS: add t_mp_capable bit to be set by MP capable transports

The t_mp_capable bit will be used in the core rds module
to support multipathing logic when the transport supports it.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoRDS: split out connection specific state from rds_connection to rds_conn_path
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:26 +0000 (09:44 -0700)]
RDS: split out connection specific state from rds_connection to rds_conn_path

In preparation for multipath RDS, split the rds_connection
structure into a base structure, and a per-path struct rds_conn_path.
The base structure tracks information and locks common to all
paths. The workqs for send/recv/shutdown etc are tracked per
rds_conn_path. Thus the workq callbacks now work with rds_conn_path.

This commit allows for one rds_conn_path per rds_connection, and will
be extended into multiple conn_paths in  subsequent commits.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agotcp: return sizeof tcp_dctcp_info in dctcp_get_info()
Neal Cardwell [Mon, 13 Jun 2016 15:20:35 +0000 (11:20 -0400)]
tcp: return sizeof tcp_dctcp_info in dctcp_get_info()

Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info).  There is no problem yet, but that union that may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agosctp: fix error return code in sctp_init()
Wei Yongjun [Mon, 13 Jun 2016 15:08:26 +0000 (23:08 +0800)]
sctp: fix error return code in sctp_init()

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git...
David S. Miller [Wed, 15 Jun 2016 06:30:32 +0000 (23:30 -0700)]
Merge tag 'rxrpc-rewrite-20160613' of git://git./linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Rename rxrpc source files

Here's the next part of the AF_RXRPC rewrite.  In this set I rename some of
the files in the net/rxrpc/ directory and adjust the Makefile and
ar-internal.h to reflect the changes.

The aim is twofold:

 (1) Remove the "ar-" prefix on those files that have it as it's not really
     useful, especially now that I'm building rxkad in.

 (2) To aid splitting the local, peer, connection and call handling code
     into separate files for object and event handling in future patches by
     making it easier to come up with new filenames.

There are two commits:

 (1) The first commit does a bunch of renames of .c files and alters the
     Makefile.  ar-internal.h isn't renamed at this time to avoid having to
     change the contents of the files being renamed.

 (2) The second commit changes the section label comments in ar-internal.h
     to reflect the changed filenames and reorders the file so that the
     sections are back in filename order.

The patches can be found here also:

http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160613
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: hns: update the dependency
Kejian Yan [Mon, 13 Jun 2016 08:56:20 +0000 (16:56 +0800)]
net: hns: update the dependency

After the patchset about adding support of ACPI (commit id is 6343488)
being applied, HNS does not depend on OF. It depends on OF or ACPI, so
the Kconfig file needs to be updated.

Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'r8152-phy-adjustments'
David S. Miller [Wed, 15 Jun 2016 05:38:02 +0000 (22:38 -0700)]
Merge branch 'r8152-phy-adjustments'

Hayes Wang says:

====================
r8152: code adjustment for PHY

These patches are for adjusting the code about PHY and setting speed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agor8152: save the speed
hayeswang [Mon, 13 Jun 2016 09:49:38 +0000 (17:49 +0800)]
r8152: save the speed

The user may change the speed. Use it to replace the default one.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agor8152: move the setting for the default speed
hayeswang [Mon, 13 Jun 2016 09:49:37 +0000 (17:49 +0800)]
r8152: move the setting for the default speed

Move calling set_speed() from open() to rtl_hw_phy_work_func_t().
Then, we would set the default speed only for first initialization
or after resuming.

Besides, the set_speed() could handle the flag of PHY_RESET which
would be set in rtl_ops.hw_phy_cfg().

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agor8152: move the settings of PHY to a work queue
hayeswang [Mon, 13 Jun 2016 09:49:36 +0000 (17:49 +0800)]
r8152: move the settings of PHY to a work queue

Move the settings of PHY to a work queue and schedule it after
rtl_ops.init().

There are some reasons for this. First, the settings are only
needed for the first time initialization or after the power
down occurs.

Second, the settings are independent with the others.

Last, the settings may take more time than the others. Leave
they in probe() or open() may delay the following flows.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet/sched: flower: Return error when hw can't offload and skip_sw is set
Amir Vadai [Mon, 13 Jun 2016 09:06:39 +0000 (12:06 +0300)]
net/sched: flower: Return error when hw can't offload and skip_sw is set

When skip_sw is set and hardware fails to apply filter, return error to
user. This will make error propagation logic similar to the one
currently used in u32 classifier.
Also, changed code to use tc_skip_sw() utility function.

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'bnxt_en-updates'
David S. Miller [Tue, 14 Jun 2016 23:16:18 +0000 (19:16 -0400)]
Merge branch 'bnxt_en-updates'

Michael Chan says:

====================
bnxt_en: Updates for net-next.

-Add default VLAN support for VFs.
-Add NPAR (NIC partioning) support.
-Add support for new device 5731x and 5741x. GRO logic is different.
-Support new ETHTOOL_{G|S}LINKSETTINGS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Support new ETHTOOL_{G|S}LINKSETTINGS API.
Michael Chan [Mon, 13 Jun 2016 06:25:38 +0000 (02:25 -0400)]
bnxt_en: Support new ETHTOOL_{G|S}LINKSETTINGS API.

To fully support 25G and 50G link settings.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Don't allow autoneg on cards that don't support it.
Michael Chan [Mon, 13 Jun 2016 06:25:37 +0000 (02:25 -0400)]
bnxt_en: Don't allow autoneg on cards that don't support it.

Some cards do not support autoneg.  The current code does not prevent the
user from enabling autoneg with ethtool on such cards, causing confusion.
Firmware provides the autoneg capability information and we just need to
store it in the support_auto_speeds field in bnxt_link_info struct.
The ethtool set_settings() call will check this field before proceeding
with autoneg.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Add BCM5731X and BCM5741X device IDs.
Michael Chan [Mon, 13 Jun 2016 06:25:36 +0000 (02:25 -0400)]
bnxt_en: Add BCM5731X and BCM5741X device IDs.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Add GRO logic for BCM5731X chips.
Michael Chan [Mon, 13 Jun 2016 06:25:35 +0000 (02:25 -0400)]
bnxt_en: Add GRO logic for BCM5731X chips.

Add bnxt_gro_func_5731x() to handle GRO packets for this chip.  The
completion structures used in the new chip have new data to help determine
the header offsets.  The offsets can be off by 4 if the packet is an
internal loopback packet (e.g. from one VF to another VF).  Some additional
logic is added to adjust the offsets if it is a loopback packet.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Refactor bnxt_gro_skb().
Michael Chan [Mon, 13 Jun 2016 06:25:34 +0000 (02:25 -0400)]
bnxt_en: Refactor bnxt_gro_skb().

Newer chips require different logic to handle GRO packets.  So refactor
the code so that we can call different functions depending on the chip.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Define the supported chip numbers.
Michael Chan [Mon, 13 Jun 2016 06:25:33 +0000 (02:25 -0400)]
bnxt_en: Define the supported chip numbers.

Define all the supported chip numbers and chip categories.  Store the
chip_num returned by firmware.  If the call to get the version and chip
number fails, we should abort.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Add PCI device ID for 57404 NPAR devices.
Michael Chan [Mon, 13 Jun 2016 06:25:32 +0000 (02:25 -0400)]
bnxt_en: Add PCI device ID for 57404 NPAR devices.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Enable NPAR (NIC Partitioning) Support.
Satish Baddipadige [Mon, 13 Jun 2016 06:25:31 +0000 (02:25 -0400)]
bnxt_en: Enable NPAR (NIC Partitioning) Support.

NPAR type is read from bnxt_hwrm_func_qcfg.  Do not allow changing link
parameters if in NPAR mode sinc ethe port is shared among multiple
partitions.  The link parameters are set up by firmware.

Signed-off-by: Satish Baddipadige <sbaddipa@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Handle VF_CFG_CHANGE event from firmware.
Michael Chan [Mon, 13 Jun 2016 06:25:30 +0000 (02:25 -0400)]
bnxt_en: Handle VF_CFG_CHANGE event from firmware.

When the VF driver gets this event, the VF configuration has changed (such
as default VLAN).  The VF driver will initiate a silent reset to pick up
the new configuration.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Add new function bnxt_reset().
Michael Chan [Mon, 13 Jun 2016 06:25:29 +0000 (02:25 -0400)]
bnxt_en: Add new function bnxt_reset().

When a default VLAN is added to the VF, the VF driver needs to reset to
pick up the default VLAN ID.  We can use the same tx timeout reset logic
to do that, without the debug output.  This new function, with the
silent parameter to suppress debug output will now serve both purposes.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agobnxt_en: Add function for VF driver to query default VLAN.
Michael Chan [Mon, 13 Jun 2016 06:25:28 +0000 (02:25 -0400)]
bnxt_en: Add function for VF driver to query default VLAN.

The PF can setup a default VLAN for a VF.  The default VLAN tag is
automatically inserted and stripped without the knowledge of the
stack running on the VF.  The VF driver needs to know that default
VLAN is enabled as VLAN acceleration on the RX side is no longer
supported.  Call netdev_update_features() to fix up the VLAN features
as necessary.  Also, VLAN strip mode must be enabled to strip out
the default VLAN tag.

Only allow VF default VLAN to be set if the firmware spec is >= 1.2.1.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: ethernet: enic: move to new ethtool api {get|set}_link_ksettings
Philippe Reynes [Sun, 12 Jun 2016 21:30:54 +0000 (23:30 +0200)]
net: ethernet: enic: move to new ethtool api {get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move the enic driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agovirtio_net: fix csum generation for virtio-net devices
Mike Rapoport [Tue, 14 Jun 2016 05:29:38 +0000 (08:29 +0300)]
virtio_net: fix csum generation for virtio-net devices

The commit e858fae2b0b8 ("virtio_net: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implictly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidate the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.

Fixes: e858fae2b0b8 ("virtio_net: use common code for virtio_net_hdr and
skb GSO conversion")

Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agorxrpc: Update the comments in ar-internal.h to reflect renames
David Howells [Mon, 13 Jun 2016 12:30:30 +0000 (13:30 +0100)]
rxrpc: Update the comments in ar-internal.h to reflect renames

Update the section comments in ar-internal.h that indicate the locations of
the referenced items to reflect the renames done to the .c files in
net/rxrpc/.

This also involves some rearrangement to reflect keep the sections in order
of filename.

Signed-off-by: David Howells <dhowells@redhat.com>
8 years agorxrpc: Rename files matching ar-*.c to git rid of the "ar-" prefix
David Howells [Mon, 13 Jun 2016 11:16:05 +0000 (12:16 +0100)]
rxrpc: Rename files matching ar-*.c to git rid of the "ar-" prefix

Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
This will aid splitting those files by making easier to come up with new
names.

Note that the not all files are simply renamed from ar-X.c to X.c.  The
following exceptions are made:

 (*) ar-call.c -> call_object.c
     ar-ack.c -> call_event.c

     call_object.c is going to contain the core of the call object
     handling.  Call event handling is all going to be in call_event.c.

 (*) ar-accept.c -> call_accept.c

     Incoming call handling is going to be here.

 (*) ar-connection.c -> conn_object.c
     ar-connevent.c -> conn_event.c

     The former file is going to have the basic connection object handling,
     but there will likely be some differentiation between client
     connections and service connections in additional files later.  The
     latter file will have all the connection-level event handling.

 (*) ar-local.c -> local_object.c

     This will have the local endpoint object handling code.  The local
     endpoint event handling code will later be split out into
     local_event.c.

 (*) ar-peer.c -> peer_object.c

     This will have the peer endpoint object handling code.  Peer event
     handling code will be placed in peer_event.c (for the moment, there is
     none).

 (*) ar-error.c -> peer_event.c

     This will become the peer event handling code, though for the moment
     it's actually driven from the local endpoint's perspective.

Note that I haven't renamed ar-transport.c to transport_object.c as the
intention is to delete it when the rxrpc_transport struct is excised.

The only file that actually has its contents changed is net/rxrpc/Makefile.

net/rxrpc/ar-internal.h will need its section marker comments updating, but
I'll do that in a separate patch to make it easier for git to follow the
history across the rename.  I may also want to rename ar-internal.h at some
point - but that would mean updating all the #includes and I'd rather do
that in a separate step.

Signed-off-by: David Howells <dhowells@redhat.com.
8 years agosched: remove NET_XMIT_POLICED
Florian Westphal [Sat, 11 Jun 2016 10:46:04 +0000 (12:46 +0200)]
sched: remove NET_XMIT_POLICED

sch_atm returns this when TC_ACT_SHOT classification occurs.

But all other schedulers that use tc_classify
(htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
in this case so just do that in atm.

BATMAN uses it as an intermediate return value to signal
forwarding vs. buffering, but it did not return POLICED to
callers outside of BATMAN.

Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet: fec: handle small PHY reset durations more precisely
Stefan Wahren [Wed, 8 Jun 2016 20:42:46 +0000 (20:42 +0000)]
net: fec: handle small PHY reset durations more precisely

Since msleep is based on jiffies the PHY reset could take longer
than expected. So use msleep for values greater than 20 msec otherwise
usleep_range.

Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoipv6: use TOS marks from sockets for routing decision
Hannes Frederic Sowa [Sat, 11 Jun 2016 18:08:19 +0000 (20:08 +0200)]
ipv6: use TOS marks from sockets for routing decision

In IPv6 the ToS values are part of the flowlabel in flowi6 and get
extracted during fib rule lookup, but we forgot to correctly initialize
the flowlabel before the routing lookup.

Reported-by: <liam.mcbirnie@boeing.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'remove-qdisc-throttle'
David S. Miller [Sat, 11 Jun 2016 06:58:29 +0000 (23:58 -0700)]
Merge branch 'remove-qdisc-throttle'

Eric Dumazet says:

====================
net_sched: remove qdisc_is_throttled()

HTB, CBQ and HFSC pay a very high cost updating the qdisc 'throttled'
status that nothing but CBQ seems to use.

CBQ usage is flaky anyway, since no qdisc ->enqueue() updates the
'throttled' qdisc status.

This looks like some 'optimization' that actually cost more than code
without the optimization, and might cause latency issues with CBQ.

In my tests, I could achieve a 8 % performance increase in TCP_RR
workload through HTB qdisc, in presence of throttled classes,
and 5 % without throttled classes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: remove generic throttled management
Eric Dumazet [Fri, 10 Jun 2016 23:41:39 +0000 (16:41 -0700)]
net_sched: remove generic throttled management

__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf02b5b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: netem: remove qdisc_is_throttled() use
Eric Dumazet [Fri, 10 Jun 2016 23:41:38 +0000 (16:41 -0700)]
net_sched: netem: remove qdisc_is_throttled() use

Looks like it is only there as some optimization attempt.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
and netem is the last user, just remove this check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: cbq: remove a flaky use of qdisc_is_throttled()
Eric Dumazet [Fri, 10 Jun 2016 23:41:37 +0000 (16:41 -0700)]
net_sched: cbq: remove a flaky use of qdisc_is_throttled()

So far no qdisc ever unset the throttled bit at enqueue() time,
so CBQ usage of qdisc_is_throttled() was flaky.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive
considering that only CBQ was eventually caring for this status,
it would make sense to implement a Qdisc ops ->is_throttled()
if we find that this is needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet_sched: sch_plug: use a private throttled status
Eric Dumazet [Fri, 10 Jun 2016 23:41:36 +0000 (16:41 -0700)]
net_sched: sch_plug: use a private throttled status

We want to get rid of generic qdisc throttled management,
so this qdisc has to use a private flag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>