summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2016-12-23sctp: fix recovering from 0 win with small data chunksMarcelo Ricardo Leitner1-1/+1
Currently if SCTP closes the receive window with window pressure, mostly caused by excessive skb overhead on payload/overheads ratio, SCTP will close the window abruptly while saving the delta on rwnd_press. It will start recovering rwnd as the chunks are consumed by the application and the rwnd_press will be only recovered after rwnd reach the same value as of rwnd_press, mostly to prevent silly window syndrome. Thing is, this is very inefficient with small data chunks, as with those it will never reach back that value, and thus it will never recover from such pressure. This means that we will not issue window updates when recovering from 0 window and will rely on a sender retransmit to notice it. The fix here is to remove such threshold, as no value is good enough: it depends on the (avg) chunk sizes being used. Test with netperf -t SCTP_STREAM -- -m 1, and trigger 0 window by sending SIGSTOP to netserver, sleep 1.2, and SIGCONT. Rate limited to 845kbps, for visibility. Capture done at netserver side. Previously: 01.500751 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632372996] [a_rwnd 99153] [ 01.500752 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632372997] [SID: 0] [SS 01.517471 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS 01.517483 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap 01.517485 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS 01.517488 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap 01.534168 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373096] [SID: 0] [SS 01.534180 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap 01.534181 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373169] [SID: 0] [SS 01.534185 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap 02.525978 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS 02.526021 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap (window update missed) 04.573807 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS 04.779370 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373082] [a_rwnd 859] [#g 04.789162 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS 04.789323 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373156] [SID: 0] [SS 04.789372 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373228] [a_rwnd 786] [#g After: 02.568957 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098728] [a_rwnd 99153] 02.568961 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098729] [SID: 0] [S 02.585631 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S 02.585666 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga 02.585671 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S 02.585683 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga 02.602330 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098828] [SID: 0] [S 02.602359 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga 02.602363 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098901] [SID: 0] [S 02.602372 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga 03.600788 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S 03.600830 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga 03.619455 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 13508] 03.619479 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 27017] 03.619497 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 40526] 03.619516 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 54035] 03.619533 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 67544] 03.619552 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 81053] 03.619570 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 94562] (following data transmission triggered by window updates above) 03.633504 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S 03.836445 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098814] [a_rwnd 100000] 03.843125 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S 03.843285 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098888] [SID: 0] [S 03.843345 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098960] [a_rwnd 99894] 03.856546 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098961] [SID: 0] [S 03.866450 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490099011] [SID: 0] [S Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23sctp: do not loose window information if in rwnd_overMarcelo Ricardo Leitner1-1/+1
It's possible that we receive a packet that is larger than current window. If it's the first packet in this way, it will cause it to increase rwnd_over. Then, if we receive another data chunk (specially as SCTP allows you to have one data chunk in flight even during 0 window), rwnd_over will be overwritten instead of added to. In the long run, this could cause the window to grow bigger than its initial size, as rwnd_over would be charged only for the last received data chunk while the code will try open the window for all packets that were received and had its value in rwnd_over overwritten. This, then, can lead to the worsening of payload/buffer ratio and cause rwnd_press to kick in more often. The fix is to sum it too, same as is done for rwnd_press, so that if we receive 3 chunks after closing the window, we still have to release that same amount before re-opening it. Log snippet from sctp_test exhibiting the issue: [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000 rwnd decreased by 1 to (0, 1, 114221) [ 146.209232] sctp: sctp_assoc_rwnd_decrease: association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1! [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000 rwnd decreased by 1 to (0, 1, 114221) [ 146.209232] sctp: sctp_assoc_rwnd_decrease: association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1! [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000 rwnd decreased by 1 to (0, 1, 114221) [ 146.209232] sctp: sctp_assoc_rwnd_decrease: association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1! [ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000 rwnd decreased by 1 to (0, 1, 114221) Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23neigh: Send netevent after marking neigh as deadIdo Schimmel1-0/+1
neigh_cleanup_and_release() is always called after marking a neighbour as dead, but it only notifies user space and not in-kernel listeners of the netevent notification chain. This can cause multiple problems. In my specific use case, it causes the listener (a switch driver capable of L3 offloads) to believe a neighbour entry is still valid, and is thus erroneously kept in the device's table. Fix that by sending a netevent after marking the neighbour as dead. Fixes: a6bf9e933daf ("mlxsw: spectrum_router: Offload neighbours based on NUD state change") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23ipv6: handle -EFAULT from skb_copy_bitsDave Jones1-1/+5
By setting certain socket options on ipv6 raw sockets, we can confuse the length calculation in rawv6_push_pending_frames triggering a BUG_ON. RIP: 0010:[<ffffffff817c6390>] [<ffffffff817c6390>] rawv6_sendmsg+0xc30/0xc40 RSP: 0018:ffff881f6c4a7c18 EFLAGS: 00010282 RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002 RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00 RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009 R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030 R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80 Call Trace: [<ffffffff8118ba23>] ? unmap_page_range+0x693/0x830 [<ffffffff81772697>] inet_sendmsg+0x67/0xa0 [<ffffffff816d93f8>] sock_sendmsg+0x38/0x50 [<ffffffff816d982f>] SYSC_sendto+0xef/0x170 [<ffffffff816da27e>] SyS_sendto+0xe/0x10 [<ffffffff81002910>] do_syscall_64+0x50/0xa0 [<ffffffff817f7cbc>] entry_SYSCALL64_slow_path+0x25/0x25 Handle by jumping to the failure path if skb_copy_bits gets an EFAULT. Reproducer: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #define LEN 504 int main(int argc, char* argv[]) { int fd; int zero = 0; char buf[LEN]; memset(buf, 0, LEN); fd = socket(AF_INET6, SOCK_RAW, 7); setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4); setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN); sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110); } Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23inet: fix IP(V6)_RECVORIGDSTADDR for udp socketsWillem de Bruijn2-2/+2
Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within the packet. For sockets that have transport headers pulled, transport offset can be negative. Use signed comparison to avoid overflow. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Nisar Jagabar <njagabar@cloudmark.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23net/sched: cls_flower: Mandate mask when matching on flagsOr Gerlitz1-11/+12
When matching on flags, we should require the user to provide the mask and avoid using an all-ones mask. Not doing so causes matching on flags provided w.o mask to hit on the value being unset for all flags, which may not what the user wanted to happen. Fixes: faa3ffce7829 ('net/sched: cls_flower: Add support for matching on flags') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23net/sched: act_tunnel_key: Fix setting UDP dst port in metadata under IPv6Or Gerlitz1-2/+2
The UDP dst port was provided to the helper function which sets the IPv6 IP tunnel meta-data under a wrong param order, fix that. Fixes: 75bfbca01e48 ('net/sched: act_tunnel_key: Add UDP dst port option') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-22net: ipv4: Don't crash if passing a null sk to ip_do_redirect.Lorenzo Colitti1-1/+2
Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") made ip_do_redirect call sock_net(sk) to determine the network namespace of the passed-in socket. This crashes if sk is NULL. Fix this by getting the network namespace from the skb instead. Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.") Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-21tcp: add a missing barrier in tcp_tasklet_func()Eric Dumazet1-0/+1
Madalin reported crashes happening in tcp_tasklet_func() on powerpc64 Before TSQ_QUEUED bit is cleared, we must ensure the changes done by list_del(&tp->tsq_node); are committed to memory, otherwise corruption might happen, as an other cpu could catch TSQ_QUEUED clearance too soon. We can notice that old kernels were immune to this bug, because TSQ_QUEUED was cleared after a bh_lock_sock(sk)/bh_unlock_sock(sk) section, but they could have missed a kick to write additional bytes, when NIC interrupts for a given flow are spread to multiple cpus. Affected TCP flows would need an incoming ACK or RTO timer to add more packets to the pipe. So overall situation should be better now. Fixes: b223feb9de2a ("tcp: tsq: add shortcut in tcp_tasklet_func()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Madalin Bucur <madalin.bucur@nxp.com> Tested-by: Madalin Bucur <madalin.bucur@nxp.com> Tested-by: Xing Lei <xing.lei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20RDS: use rb_entry()Geliang Tang1-1/+1
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20net_sched: sch_netem: use rb_entry()Geliang Tang1-1/+1
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20net_sched: sch_fq: use rb_entry()Geliang Tang1-7/+7
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20sctp: not copying duplicate addrs to the assoc's bind address listXin Long2-0/+6
sctp.local_addr_list is a global address list that is supposed to include all the local addresses. sctp updates this list according to NETDEV_UP/ NETDEV_DOWN notifications. However, if multiple NICs have the same address, the global list would have duplicate addresses. Even if for one NIC, promote secondaries in __inet_del_ifa can also lead to accumulating duplicate addresses. When sctp binds address 'ANY' and creates a connection, it copies all the addresses from global list into asoc's bind addr list, which makes sctp pack the duplicate addresses into INIT/INIT_ACK packets. This patch is to filter the duplicate addresses when copying the addrs from global list in sctp_copy_local_addr_list and unpacking addr_param from cookie in sctp_raw_to_bind_addrs to asoc's bind addr list. Note that we can't filter the duplicate addrs when global address list gets updated, As NETDEV_DOWN event may remove an addr that still exists in another NIC. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20sctp: reduce indent level in sctp_copy_local_addr_listXin Long1-18/+19
This patch is to reduce indent level by using continue when the addr is not allowed, and also drop end_copy by using break. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20openvswitch: Add a missing break statement.Jarno Rajahalme1-0/+1
Add a break statement to prevent fall-through from OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL. Without the break actions setting ethernet addresses fail to validate with log messages complaining about invalid tunnel attributes. Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20ipv4: Should use consistent conditional judgement for ip fragment in ↵zheng li1-1/+1
__ip_append_data and ip_finish_output There is an inconsistent conditional judgement in __ip_append_data and ip_finish_output functions, the variable length in __ip_append_data just include the length of application's payload and udp header, don't include the length of ip header, but in ip_finish_output use (skb->len > ip_skb_dst_mtu(skb)) as judgement, and skb->len include the length of ip header. That causes some particular application's udp payload whose length is between (MTU - IP Header) and MTU were fragmented by ip_fragment even though the rst->dev support UFO feature. Add the length of ip header to length in __ip_append_data to keep consistent conditional judgement as ip_finish_output for ip fragment. Signed-off-by: Zheng Li <james.z.li@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds19-89/+105
Pull networking fixes and cleanups from David Miller: 1) Revert bogus nla_ok() change, from Alexey Dobriyan. 2) Various bpf validator fixes from Daniel Borkmann. 3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04 drivers, from Dongpo Li. 4) Several ethtool ksettings conversions from Philippe Reynes. 5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert. 6) XDP support for virtio_net, from John Fastabend. 7) Fix NAT handling within a vrf, from David Ahern. 8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits) net: mv643xx_eth: fix build failure isdn: Constify some function parameters mlxsw: spectrum: Mark split ports as such cgroup: Fix CGROUP_BPF config qed: fix old-style function definition net: ipv6: check route protocol when deleting routes r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected irda: w83977af_ir: cleanup an indent issue net: sfc: use new api ethtool_{get|set}_link_ksettings net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings bpf: fix mark_reg_unknown_value for spilled regs on map value marking bpf: fix overflow in prog accounting bpf: dynamically allocate digest scratch buffer gtp: Fix initialization of Flags octet in GTPv1 header gtp: gtp_check_src_ms_ipv4() always return success net/x25: use designated initializers isdn: use designated initializers ...
2016-12-17Merge tag 'mac80211-for-davem-2016-12-16' of ↵David S. Miller3-8/+11
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Three fixes: * avoid a WARN_ON() when trying to use WEP with AP_VLANs * ensure enough headroom on mesh forwarding packets * don't report unknown/invalid rates to userspace ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net: ipv6: check route protocol when deleting routesMantas M1-0/+2
The protocol field is checked when deleting IPv4 routes, but ignored for IPv6, which causes problems with routing daemons accidentally deleting externally set routes (observed by multiple bird6 users). This can be verified using `ip -6 route del <prefix> proto something`. Signed-off-by: Mantas Mikulėnas <grawity@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net/x25: use designated initializersKees Cook1-1/+1
Prepare to mark sensitive kernel structures for randomization by making sure they're using designated initializers. These were identified during allyesconfig builds of x86, arm, and arm64, with most initializer fixes extracted from grsecurity. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net: use designated initializersKees Cook1-1/+1
Prepare to mark sensitive kernel structures for randomization by making sure they're using designated initializers. These were identified during allyesconfig builds of x86, arm, and arm64, with most initializer fixes extracted from grsecurity. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17ATM: use designated initializersKees Cook4-55/+54
Prepare to mark sensitive kernel structures for randomization by making sure they're using designated initializers. These were identified during allyesconfig builds of x86, arm, and arm64, with most initializer fixes extracted from grsecurity. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net: xdp: add invalid buffer warningJohn Fastabend1-0/+6
This adds a warning for drivers to use when encountering an invalid buffer for XDP. For normal cases this should not happen but to catch this in virtual/qemu setups that I may not have expected from the emulation layer having a standard warning is useful. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17sctp: sctp_transport_lookup_process should rcu_read_unlock when transport is ↵Xin Long1-4/+3
null Prior to this patch, sctp_transport_lookup_process didn't rcu_read_unlock when it failed to find a transport by sctp_addrs_lookup_transport. This patch is to fix it by moving up rcu_read_unlock right before checking transport and also to remove the out path. Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17sctp: sctp_epaddr_lookup_transport should be protected by rcu_read_lockXin Long1-1/+4
Since commit 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable"), sctp has changed to use rhlist_lookup to look up transport, but rhlist_lookup doesn't call rcu_read_lock inside, unlike rhashtable_lookup_fast. It is called in sctp_epaddr_lookup_transport and sctp_addrs_lookup_transport. sctp_addrs_lookup_transport is always in the protection of rcu_read_lock(), as __sctp_lookup_association is called in rx path or sctp_lookup_association which are in the protection of rcu_read_lock() already. But sctp_epaddr_lookup_transport is called by sctp_endpoint_lookup_assoc, it doesn't call rcu_read_lock, which may cause "suspicious rcu_dereference_check usage' in __rhashtable_lookup. This patch is to fix it by adding rcu_read_lock in sctp_endpoint_lookup_assoc before calling sctp_epaddr_lookup_transport. Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17irda: irnet: add member name to the miscdevice declarationLABBE Corentin1-3/+3
Since the struct miscdevice have many members, it is dangerous to init it without members name relying only on member order. This patch add member name to the init declaration. Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17irda: irnet: Remove unused IRNET_MAJOR defineLABBE Corentin1-3/+0
The IRNET_MAJOR define is not used, so this patch remove it. Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17irnet: ppp: move IRNET_MINOR to include/linux/miscdevice.hLABBE Corentin1-1/+0
This patch move the define for IRNET_MINOR to include/linux/miscdevice.h It is better that all minor number definitions are in the same place. Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17irda: irnet: Move linux/miscdevice.h includeLABBE Corentin2-1/+1
The only use of miscdevice is irda_ppp so no need to include linux/miscdevice.h for all irda files. This patch move the linux/miscdevice.h include to irnet_ppp.h Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17irda: irproc.c: Remove unneeded linux/miscdevice.h includeLABBE Corentin1-1/+0
irproc.c does not use any miscdevice so this patch remove this unnecessary inclusion. Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17inet: Fix get port to handle zero port number with soreuseport setTom Herbert2-8/+13
A user may call listen with binding an explicit port with the intent that the kernel will assign an available port to the socket. In this case inet_csk_get_port does a port scan. For such sockets, the user may also set soreuseport with the intent a creating more sockets for the port that is selected. The problem is that the initial socket being opened could inadvertently choose an existing and unreleated port number that was already created with soreuseport. This patch adds a boolean parameter to inet_bind_conflict that indicates rather soreuseport is allowed for the check (in addition to sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port the argument is set to true if an explicit port is being looked up (snum argument is nonzero), and is false if port scan is done. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17inet: Don't go into port scan when looking for specific bind portTom Herbert1-1/+1
inet_csk_get_port is called with port number (snum argument) that may be zero or nonzero. If it is zero, then the intent is to find an available ephemeral port number to bind to. If snum is non-zero then the caller is asking to allocate a specific port number. In the latter case we never want to perform the scan in ephemeral port range. It is conceivable that this can happen if the "goto again" in "tb_found:" is done. This patch adds a check that snum is zero before doing the "goto again". Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net/sched: cls_flower: Use masked key when calling HW offloadsPaul Blakey1-1/+1
Zero bits on the mask signify a "don't care" on the corresponding bits in key. Some HWs require those bits on the key to be zero. Since these bits are masked anyway, it's okay to provide the masked key to all drivers. Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support') Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net/sched: cls_flower: Use mask for addr_typePaul Blakey1-0/+4
When addr_type is set, mask should also be set. Fixes: 66530bdf85eb ('sched,cls_flower: set key address type when present') Fixes: bc3103f1ed40 ('net/sched: cls_flower: Classify packet in ip tunnels') Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-16Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds9-507/+256
Pull ceph updates from Ilya Dryomov: "A varied set of changes: - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK (myself). Also fixed a deadlock caused by a bogus allocation on the writeback path and authorize reply verification. - a fix for long stalls during fsync (Jeff Layton). The client now has a way to force the MDS log flush, leading to ~100x speedups in some synthetic tests. - a new [no]require_active_mds mount option (Zheng Yan). On mount, we will now check whether any of the MDSes are available and bail rather than block if none are. This check can be avoided by specifying the "no" option. - a couple of MDS cap handling fixes and a few assorted patches throughout" * tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits) libceph: remove now unused finish_request() wrapper libceph: always signal completion when done ceph: avoid creating orphan object when checking pool permission ceph: properly set issue_seq for cap release ceph: add flags parameter to send_cap_msg ceph: update cap message struct version to 10 ceph: define new argument structure for send_cap_msg ceph: move xattr initialzation before the encoding past the ceph_mds_caps ceph: fix minor typo in unsafe_request_wait ceph: record truncate size/seq for snap data writeback ceph: check availability of mds cluster on mount ceph: fix splice read for no Fc capability case ceph: try getting buffer capability for readahead/fadvise ceph: fix scheduler warning due to nested blocking ceph: fix printing wrong return variable in ceph_direct_read_write() crush: include mapper.h in mapper.c rbd: silence bogus -Wmaybe-uninitialized warning libceph: no need to drop con->mutex for ->get_authorizer() libceph: drop len argument of *verify_authorizer_reply() libceph: verify authorize reply on connect ...
2016-12-16Merge branch 'overlayfs-linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This update contains: - try to clone on copy-up - allow renaming a directory - split source into managable chunks - misc cleanups and fixes It does not contain the read-only fd data inconsistency fix, which Al didn't like. I'll leave that to the next year..." * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits) ovl: fix reStructuredText syntax errors in documentation ovl: fix return value of ovl_fill_super ovl: clean up kstat usage ovl: fold ovl_copy_up_truncate() into ovl_copy_up() ovl: create directories inside merged parent opaque ovl: opaque cleanup ovl: show redirect_dir mount option ovl: allow setting max size of redirect ovl: allow redirect_dir to default to "on" ovl: check for emptiness of redirect dir ovl: redirect on rename-dir ovl: lookup redirects ovl: consolidate lookup for underlying layers ovl: fix nested overlayfs mount ovl: check namelen ovl: split super.c ovl: use d_is_dir() ovl: simplify lookup ovl: check lower existence of rename target ovl: rename: simplify handling of lower/merged directory ...
2016-12-16Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linuxLinus Torvalds9-169/+101
Pull nfsd updates from Bruce Fields: "The one new feature is support for a new NFSv4.2 mode_umask attribute that makes ACL inheritance a little more useful in environments that default to restrictive umasks. Requires client-side support, also on its way for 4.10. Other than that, miscellaneous smaller fixes and cleanup, especially to the server rdma code" [ The client side of the umask attribute was merged yesterday ] * tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux: nfsd: add support for the umask attribute sunrpc: use DEFINE_SPINLOCK() svcrdma: Further clean-up of svc_rdma_get_inv_rkey() svcrdma: Break up dprintk format in svc_rdma_accept() svcrdma: Remove unused variable in rdma_copy_tail() svcrdma: Remove unused variables in xprt_rdma_bc_allocate() svcrdma: Remove svc_rdma_op_ctxt::wc_status svcrdma: Remove DMA map accounting svcrdma: Remove BH-disabled spin locking in svc_rdma_send() svcrdma: Renovate sendto chunk list parsing svcauth_gss: Close connection when dropping an incoming message svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm nfsd: constify reply_cache_stats_operations structure nfsd: update workqueue creation sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code nfsd: catch errors in decode_fattr earlier nfsd: clean up supported attribute handling nfsd: fix error handling for clients that fail to return the layout nfsd: more robust allocation failure handling in nfsd_reply_cache_init
2016-12-16Merge branch 'for-linus' of ↵Linus Torvalds6-16/+13
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: - more ->d_init() stuff (work.dcache) - pathname resolution cleanups (work.namei) - a few missing iov_iter primitives - copy_from_iter_full() and friends. Either copy the full requested amount, advance the iterator and return true, or fail, return false and do _not_ advance the iterator. Quite a few open-coded callers converted (and became more readable and harder to fuck up that way) (work.iov_iter) - several assorted patches, the big one being logfs removal * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: logfs: remove from tree vfs: fix put_compat_statfs64() does not handle errors namei: fold should_follow_link() with the step into not-followed link namei: pass both WALK_GET and WALK_MORE to should_follow_link() namei: invert WALK_PUT logics namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link() namei: saner calling conventions for mountpoint_last() namei.c: get rid of user_path_parent() switch getfrag callbacks to ..._full() primitives make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success [iov_iter] new primitives - copy_from_iter_full() and friends don't open-code file_inode() ceph: switch to use of ->d_init() ceph: unify dentry_operations instances lustre: switch to use of ->d_init()
2016-12-16Revert "af_unix: fix hard linked sockets on overlay"Miklos Szeredi1-3/+3
This reverts commit eb0a4a47ae89aaa0674ab3180de6a162f3be2ddf. Since commit 51f7e52dc943 ("ovl: share inode for hard link") there's no need to call d_real_inode() to check two overlay inodes for equality. Side effect of this revert is that it's no longer possible to connect one socket on overlayfs to one on the underlying layer (something which didn't make sense anyway). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-15Merge tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds10-120/+138
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - Fix a pnfs deadlock between read resends and layoutreturn - Don't invalidate the layout stateid while a layout return is outstanding - Don't schedule a layoutreturn if the layout stateid is marked as invalid - On a pNFS error, do not send LAYOUTGET until the LAYOUTRETURN is complete - SUNRPC: fix refcounting problems with auth_gss messages. Features: - Add client support for the NFSv4 umask attribute. - NFSv4: Correct support for flock() stateids. - Add a LAYOUTRETURN operation to CLOSE and DELEGRETURN when return-on-close is specified - Allow the pNFS/flexfiles layoutstat information to piggyback on LAYOUTRETURN - Optimise away redundant GETATTR calls when doing state recovery and/or when not required by cache revalidation rules or close-to-open cache consistency. - Attribute cache improvements - RPC/RDMA support for SG_GAP devices Bugfixes: - NFS: Fix performance regressions in readdir - pNFS/flexfiles: Fix a deadlock on LAYOUTGET - NFSv4: Add missing nfs_put_lock_context() - NFSv4.1: Fix regression in callback retry handling - Fix false positive NFSv4.0 trunking detection. - pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated - Various layout stateid related bugfixes - RPC/RDMA bugfixes" * tag 'nfs-for-4.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (82 commits) SUNRPC: fix refcounting problems with auth_gss messages. nfs: add support for the umask attribute pNFS/flexfiles: Ensure we have enough buffer for layoutreturn pNFS/flexfiles: Remove a redundant parameter in ff_layout_encode_ioerr() pNFS/flexfiles: Fix a deadlock on LAYOUTGET pNFS: Layoutreturn must free the layout after the layout-private data pNFS/flexfiles: Fix ff_layout_add_ds_error_locked() NFSv4: Add missing nfs_put_lock_context() pNFS: Release NFS_LAYOUT_RETURN when invalidating the layout stateid NFSv4.1: Don't schedule lease recovery in nfs4_schedule_session_recovery() NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE NFS: Only look at the change attribute cache state in nfs_check_verifier NFS: Fix incorrect size revalidation when holding a delegation NFS: Fix incorrect mapping revalidation when holding a delegation pNFS/flexfiles: Support sending layoutstats in layoutreturn pNFS/flexfiles: Minor refactoring before adding iostats to layoutreturn NFS: Fix up read of mirror stats pNFS/flexfiles: Clean up layoutstats pNFS/flexfiles: Refactor encoding of the layoutreturn payload pNFS: Add a layoutreturn callback to performa layout-private setup ...
2016-12-15Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds9-22/+13
Pull virtio updates from Michael Tsirkin: "virtio, vhost: new device, fixes, speedups This includes the new virtio crypto device, and fixes all over the place. In particular enabling endian-ness checks for sparse builds found some bugs which this fixes. And it appears that everyone is in agreement that disabling endian-ness sparse checks shouldn't be necessary any longer. So this enables them for everyone, and drops the __CHECK_ENDIAN__ and __bitwise__ APIs. IRQ handling in virtio has been refactored somewhat, the larger switch to IRQ_SHARED will have to wait as it proved too aggressive" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits) Makefile: drop -D__CHECK_ENDIAN__ from cflags fs/logfs: drop __CHECK_ENDIAN__ Documentation/sparse: drop __CHECK_ENDIAN__ linux: drop __bitwise__ everywhere checkpatch: replace __bitwise__ with __bitwise Documentation/sparse: drop __bitwise__ tools: enable endian checks for all sparse builds linux/types.h: enable endian checks for all sparse builds virtio_mmio: Set dev.release() to avoid warning vhost: remove unused feature bit virtio_ring: fix description of virtqueue_get_buf vhost/scsi: Remove unused but set variable tools/virtio: use {READ,WRITE}_ONCE() in uaccess.h vringh: kill off ACCESS_ONCE() tools/virtio: fix READ_ONCE() crypto: add virtio-crypto driver vhost: cache used event for better performance vsock: lookup and setup guest_cid inside vhost_vsock_lock virtio_pci: split vp_try_to_find_vqs into INTx and MSI-X variants virtio_pci: merge vp_free_vectors into vp_del_vqs ...
2016-12-16Makefile: drop -D__CHECK_ENDIAN__ from cflagsMichael S. Tsirkin5-9/+1
That's the default now, no need for makefiles to set it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Kalle Valo <kvalo@codeaurora.org> Acked-by: Marcel Holtmann <marcel@holtmann.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
2016-12-16linux: drop __bitwise__ everywhereMichael S. Tsirkin2-3/+3
__bitwise__ used to mean "yes, please enable sparse checks unconditionally", but now that we dropped __CHECK_ENDIAN__ __bitwise is exactly the same. There aren't many users, replace it by __bitwise everywhere. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Akced-by: Lee Duncan <lduncan@suse.com>
2016-12-15Merge tag 'for-linus' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is the complete update for the rdma stack for this release cycle. Most of it is typical driver and core updates, but there is the entirely new VMWare pvrdma driver. You may have noticed that there were changes in DaveM's pull request to the bnxt Ethernet driver to support a RoCE RDMA driver. The bnxt_re driver was tentatively set to be pulled in this release cycle, but it simply wasn't ready in time and was dropped (a few review comments still to address, and some multi-arch build issues like prefetch() not working across all arches). Summary: - shared mlx5 updates with net stack (will drop out on merge if Dave's tree has already been merged) - driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe - debug cleanups - new connection rejection helpers - SRP updates - various misc fixes - new paravirt driver from vmware" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (210 commits) IB: Add vmw_pvrdma driver IB/mlx4: fix improper return value IB/ocrdma: fix bad initialization infiniband: nes: return value of skb_linearize should be handled MAINTAINERS: Update Intel RDMA RNIC driver maintainers MAINTAINERS: Remove Mitesh Ahuja from emulex maintainers IB/core: fix unmap_sg argument qede: fix general protection fault may occur on probe IB/mthca: Replace pci_pool_alloc by pci_pool_zalloc mlx5, calc_sq_size(): Make a debug message more informative mlx5: Remove a set-but-not-used variable mlx5: Use { } instead of { 0 } to init struct IB/srp: Make writing the add_target sysfs attr interruptible IB/srp: Make mapping failures easier to debug IB/srp: Make login failures easier to debug IB/srp: Introduce a local variable in srp_add_one() IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build IB/multicast: Check ib_find_pkey() return value IPoIB: Avoid reading an uninitialized member variable IB/mad: Fix an array index check ...
2016-12-15mac80211: fix legacy and invalid rx-rate reportBen Greear1-6/+8
This fixes obtaining the rate info via sta_set_sinfo when the rx rate is invalid (for instance, on IBSS interface that has received no frames from one of its peers). Also initialize rinfo->flags for legacy rates, to not rely on the whole sinfo being initialized to zero. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-15vsock/virtio: fix src/dst cid formatMichael S. Tsirkin1-7/+7
These fields are 64 bit, using le32_to_cpu and friends on these will not do the right thing. Fix this up. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15vsock/virtio: mark an internal function staticMichael S. Tsirkin1-2/+1
virtio_transport_alloc_pkt is only used locally, make it static. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15vsock/virtio: add a missing __le annotationMichael S. Tsirkin1-1/+1
guest cid is read from config space, therefore it's in little endian format and is treated as such, annotate it accordingly. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-7/+8
Merge more updates from Andrew Morton: - a few misc things - kexec updates - DMA-mapping updates to better support networking DMA operations - IPC updates - various MM changes to improve DAX fault handling - lots of radix-tree changes, mainly to the test suite. All leading up to reimplementing the IDA/IDR code to be a wrapper layer over the radix-tree. However the final trigger-pulling patch is held off for 4.11. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) radix tree test suite: delete unused rcupdate.c radix tree test suite: add new tag check radix-tree: ensure counts are initialised radix tree test suite: cache recently freed objects radix tree test suite: add some more functionality idr: reduce the number of bits per level from 8 to 6 rxrpc: abstract away knowledge of IDR internals tpm: use idr_find(), not idr_find_slowpath() idr: add ida_is_empty radix tree test suite: check multiorder iteration radix-tree: fix replacement for multiorder entries radix-tree: add radix_tree_split_preload() radix-tree: add radix_tree_split radix-tree: add radix_tree_join radix-tree: delete radix_tree_range_tag_if_tagged() radix-tree: delete radix_tree_locate_item() radix-tree: improve multiorder iterators btrfs: fix race in btrfs_free_dummy_fs_info() radix-tree: improve dump output radix-tree: make radix_tree_find_next_bit more useful ...
2016-12-14rxrpc: abstract away knowledge of IDR internalsMatthew Wilcox2-7/+8
Add idr_get_cursor() / idr_set_cursor() APIs, and remove the reference to IDR_SIZE. Link: http://lkml.kernel.org/r/1480369871-5271-65-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>