summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2017-02-23bpf: Fix bpf_xdp_event_outputMartin KaFai Lau1-2/+2
Fix a typo. xdp->data instead of xdp should be copied to the perf-event's dst_buff. Fixes: 4de16969523c ("bpf: enable event output helper also for xdp types") Reported-by: Huapeng Zhou <hzhou@fb.com> Tested-by: Feixiong Zhang <feixiong@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller6-39/+81
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Revisit warning logic when not applying default helper assignment. Jiri Kosina considers we are breaking existing setups and not warning our users accordinly now that automatic helper assignment has been turned off by default. So let's make him happy by spotting the warning by when we find a helper but we cannot attach, instead of warning on the former deprecated behaviour. Patch from Jiri Kosina. 2) Two patches to fix regression in ctnetlink interfaces with nfnetlink_queue. Specifically, perform more relaxed in CTA_STATUS and do not bail out if CTA_HELP indicates the same helper that we already have. Patches from Kevin Cernekee. 3) A couple of bugfixes for ipset via Jozsef Kadlecsik. Due to wrong index logic in hash set types and null pointer exception in the list:set type. 4) hashlimit bails out with correct userspace parameters due to wrong arithmetics in the code that avoids "divide by zero" when transforming the userspace timing in milliseconds to token credits. Patch from Alban Browaeys. 5) Fix incorrect NFQA_VLAN_MAX definition, patch from Ken-ichirou MATSUZAWA. 6) Don't not declare nfnetlink batch error list as static, since this may be used by several subsystems at the same time. Patch from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22tcp: account for ts offset only if tsecr not zeroAlexey Kodanev1-1/+2
We can get SYN with zero tsecr, don't apply offset in this case. Fixes: ee684b6f2830 ("tcp: send packets with a socket timestamp") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22tcp: setup timestamp offset when write_seq already setAlexey Kodanev2-12/+20
Found that when randomized tcp offsets are enabled (by default) TCP client can still start new connections without them. Later, if server does active close and re-uses sockets in TIME-WAIT state, new SYN from client can be rejected on PAWS check inside tcp_timewait_state_process(), because either tw_ts_recent or rcv_tsval doesn't really have an offset set. Here is how to reproduce it with LTP netstress tool: netstress -R 1 & netstress -H 127.0.0.1 -lr 1000000 -a1 [...] < S seq 1956977072 win 43690 TS val 295618 ecr 459956970 > . ack 1956911535 win 342 TS val 459967184 ecr 1547117608 < R seq 1956911535 win 0 length 0 +1. < S seq 1956977072 win 43690 TS val 296640 ecr 459956970 > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640 Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each connection") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22net/dccp: fix use after free in tw_timer_handler()Andrey Ryabinin2-0/+12
DCCP doesn't purge timewait sockets on network namespace shutdown. So, after net namespace destroyed we could still have an active timer which will trigger use after free in tw_timer_handler(): BUG: KASAN: use-after-free in tw_timer_handler+0x4a/0xa0 at addr ffff88010e0d1e10 Read of size 8 by task swapper/1/0 Call Trace: __asan_load8+0x54/0x90 tw_timer_handler+0x4a/0xa0 call_timer_fn+0x127/0x480 expire_timers+0x1db/0x2e0 run_timer_softirq+0x12f/0x2a0 __do_softirq+0x105/0x5b4 irq_exit+0xdd/0xf0 smp_apic_timer_interrupt+0x57/0x70 apic_timer_interrupt+0x90/0xa0 Object at ffff88010e0d1bc0, in cache net_namespace size: 6848 Allocated: save_stack_trace+0x1b/0x20 kasan_kmalloc+0xee/0x180 kasan_slab_alloc+0x12/0x20 kmem_cache_alloc+0x134/0x310 copy_net_ns+0x8d/0x280 create_new_namespaces+0x23f/0x340 unshare_nsproxy_namespaces+0x75/0xf0 SyS_unshare+0x299/0x4f0 entry_SYSCALL_64_fastpath+0x18/0xad Freed: save_stack_trace+0x1b/0x20 kasan_slab_free+0xae/0x180 kmem_cache_free+0xb4/0x350 net_drop_ns+0x3f/0x50 cleanup_net+0x3df/0x450 process_one_work+0x419/0xbb0 worker_thread+0x92/0x850 kthread+0x192/0x1e0 ret_from_fork+0x2e/0x40 Add .exit_batch hook to dccp_v4_ops()/dccp_v6_ops() which will purge timewait sockets on net namespace destruction and prevent above issue. Fixes: f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22l2tp: Avoid schedule while atomic in exit_netRidge Kennedy1-1/+3
While destroying a network namespace that contains a L2TP tunnel a "BUG: scheduling while atomic" can be observed. Enabling lockdep shows that this is happening because l2tp_exit_net() is calling l2tp_tunnel_closeall() (via l2tp_tunnel_delete()) from within an RCU critical section. l2tp_exit_net() takes rcu_read_lock_bh() << list_for_each_entry_rcu() >> l2tp_tunnel_delete() l2tp_tunnel_closeall() __l2tp_session_unhash() synchronize_rcu() << Illegal inside RCU critical section >> BUG: sleeping function called from invalid context in_atomic(): 1, irqs_disabled(): 0, pid: 86, name: kworker/u16:2 INFO: lockdep is turned off. CPU: 2 PID: 86 Comm: kworker/u16:2 Tainted: G W O 4.4.6-at1 #2 Hardware name: Xen HVM domU, BIOS 4.6.1-xs125300 05/09/2016 Workqueue: netns cleanup_net 0000000000000000 ffff880202417b90 ffffffff812b0013 ffff880202410ac0 ffffffff81870de8 ffff880202417bb8 ffffffff8107aee8 ffffffff81870de8 0000000000000c51 0000000000000000 ffff880202417be0 ffffffff8107b024 Call Trace: [<ffffffff812b0013>] dump_stack+0x85/0xc2 [<ffffffff8107aee8>] ___might_sleep+0x148/0x240 [<ffffffff8107b024>] __might_sleep+0x44/0x80 [<ffffffff810b21bd>] synchronize_sched+0x2d/0xe0 [<ffffffff8109be6d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8105c7bb>] ? __local_bh_enable_ip+0x6b/0xc0 [<ffffffff816a1b00>] ? _raw_spin_unlock_bh+0x30/0x40 [<ffffffff81667482>] __l2tp_session_unhash+0x172/0x220 [<ffffffff81667397>] ? __l2tp_session_unhash+0x87/0x220 [<ffffffff8166888b>] l2tp_tunnel_closeall+0x9b/0x140 [<ffffffff81668c74>] l2tp_tunnel_delete+0x14/0x60 [<ffffffff81668dd0>] l2tp_exit_net+0x110/0x270 [<ffffffff81668d5c>] ? l2tp_exit_net+0x9c/0x270 [<ffffffff815001c3>] ops_exit_list.isra.6+0x33/0x60 [<ffffffff81501166>] cleanup_net+0x1b6/0x280 ... This bug can easily be reproduced with a few steps: $ sudo unshare -n bash # Create a shell in a new namespace # ip link set lo up # ip addr add 127.0.0.1 dev lo # ip l2tp add tunnel remote 127.0.0.1 local 127.0.0.1 tunnel_id 1 \ peer_tunnel_id 1 udp_sport 50000 udp_dport 50000 # ip l2tp add session name foo tunnel_id 1 session_id 1 \ peer_session_id 1 # ip link set foo up # exit # Exit the shell, in turn exiting the namespace $ dmesg ... [942121.089216] BUG: scheduling while atomic: kworker/u16:3/13872/0x00000200 ... To fix this, move the call to l2tp_tunnel_closeall() out of the RCU critical section, and instead call it from l2tp_tunnel_del_work(), which is running from the l2tp_wq workqueue. Fixes: 2b551c6e7d5b ("l2tp: close sessions before initiating tunnel delete") Signed-off-by: Ridge Kennedy <ridge.kennedy@alliedtelesis.co.nz> Acked-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds455-6052/+19494
Pull networking updates from David Miller: "Highlights: 1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini Varadhan. 2) Simplify classifier state on sk_buff in order to shrink it a bit. From Willem de Bruijn. 3) Introduce SIPHASH and it's usage for secure sequence numbers and syncookies. From Jason A. Donenfeld. 4) Reduce CPU usage for ICMP replies we are going to limit or suppress, from Jesper Dangaard Brouer. 5) Introduce Shared Memory Communications socket layer, from Ursula Braun. 6) Add RACK loss detection and allow it to actually trigger fast recovery instead of just assisting after other algorithms have triggered it. From Yuchung Cheng. 7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot. 8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert. 9) Export MPLS packet stats via netlink, from Robert Shearman. 10) Significantly improve inet port bind conflict handling, especially when an application is restarted and changes it's setting of reuseport. From Josef Bacik. 11) Implement TX batching in vhost_net, from Jason Wang. 12) Extend the dummy device so that VF (virtual function) features, such as configuration, can be more easily tested. From Phil Sutter. 13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric Dumazet. 14) Add new bpf MAP, implementing a longest prefix match trie. From Daniel Mack. 15) Packet sample offloading support in mlxsw driver, from Yotam Gigi. 16) Add new aquantia driver, from David VomLehn. 17) Add bpf tracepoints, from Daniel Borkmann. 18) Add support for port mirroring to b53 and bcm_sf2 drivers, from Florian Fainelli. 19) Remove custom busy polling in many drivers, it is done in the core networking since 4.5 times. From Eric Dumazet. 20) Support XDP adjust_head in virtio_net, from John Fastabend. 21) Fix several major holes in neighbour entry confirmation, from Julian Anastasov. 22) Add XDP support to bnxt_en driver, from Michael Chan. 23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan. 24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi. 25) Support GRO in IPSEC protocols, from Steffen Klassert" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits) Revert "ath10k: Search SMBIOS for OEM board file extension" net: socket: fix recvmmsg not returning error from sock_error bnxt_en: use eth_hw_addr_random() bpf: fix unlocking of jited image when module ronx not set arch: add ARCH_HAS_SET_MEMORY config net: napi_watchdog() can use napi_schedule_irqoff() tcp: Revert "tcp: tcp_probe: use spin_lock_bh()" net/hsr: use eth_hw_addr_random() net: mvpp2: enable building on 64-bit platforms net: mvpp2: switch to build_skb() in the RX path net: mvpp2: simplify MVPP2_PRS_RI_* definitions net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT net: mvpp2: remove unused register definitions net: mvpp2: simplify mvpp2_bm_bufs_add() net: mvpp2: drop useless fields in mvpp2_bm_pool and related code net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue' net: mvpp2: release reference to txq_cpu[] entry after unmapping net: mvpp2: handle too large value in mvpp2_rx_time_coal_set() net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set() net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set ...
2017-02-21Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds1-3/+14
Pull audit updates from Paul Moore: "The audit changes for v4.11 are relatively small compared to what we did for v4.10, both in terms of size and impact. - two patches from Steve tweak the formatting for some of the audit records to make them more consistent with other audit records. - three patches from Richard record the name of a module on module load, fix the logging of sockaddr information when using socketcall() on 32-bit systems, and add the ability to reset audit's lost record counter. - my lone patch just fixes an annoying style nit that I was reminded about by one of Richard's patches. All these patches pass our test suite" * 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit: audit: remove unnecessary curly braces from switch/case statements audit: log module name on init_module audit: log 32-bit socketcalls audit: add feature audit_lost reset audit: Make AUDIT_ANOM_ABEND event normalized audit: Make AUDIT_KERNEL event conform to the specification
2017-02-21net: socket: fix recvmmsg not returning error from sock_errorMaxime Jayat1-1/+3
Commit 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path"), changed the exit path of recvmmsg to always return the datagrams variable and modified the error paths to set the variable to the error code returned by recvmsg if necessary. However in the case sock_error returned an error, the error code was then ignored, and recvmmsg returned 0. Change the error path of recvmmsg to correctly return the error code of sock_error. The bug was triggered by using recvmmsg on a CAN interface which was not up. Linux 4.6 and later return 0 in this case while earlier releases returned -ENETDOWN. Fixes: 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path") Signed-off-by: Maxime Jayat <maxime.jayat@mobile-devices.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: napi_watchdog() can use napi_schedule_irqoff()Eric Dumazet1-1/+1
hrtimer handlers run with masked hard IRQ, we can therefore use napi_schedule_irqoff() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"Eric Dumazet1-2/+2
This reverts commit e70ac171658679ecf6bea4bbd9e9325cd6079d2b. jtcp_rcv_established() is in fact called with hard irq being disabled. Initial bug report from Ricardo Nabinger Sanchez [1] still needs to be investigated, but does not look like a TCP bug. [1] https://www.spinics.net/lists/netdev/msg420960.html Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <xiaolong.ye@intel.com> Cc: Ricardo Nabinger Sanchez <rnsanchez@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net/hsr: use eth_hw_addr_random()Tobias Klauser1-1/+1
Use eth_hw_addr_random() to set a random MAC address in order to make sure dev->addr_assign_type will be properly set to NET_ADDR_RANDOM. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: sock: Use USEC_PER_SEC macro instead of literal 1000000Gao Feng1-3/+3
The USEC_PER_SEC is used once in sock_set_timeout as the max value of tv_usec. But there are other similar codes which use the literal 1000000 in this file. It is minor cleanup to keep consitent. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21ip: fix IP_CHECKSUM handlingPaolo Abeni1-4/+4
The skbs processed by ip_cmsg_recv() are not guaranteed to be linear e.g. when sending UDP packets over loopback with MSGMORE. Using csum_partial() on [potentially] the whole skb len is dangerous; instead be on the safe side and use skb_checksum(). Thanks to syzkaller team to detect the issue and provide the reproducer. v1 -> v2: - move the variable declaration in a tighter scope Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21rtnl: simplify error return path in rtnl_create_link()Tobias Klauser1-6/+1
There is only one possible error path which reaches the err label, so return ERR_PTR(-ENOMEM) directly if alloc_netdev_mqs() fails. This also allows to omit the err variable. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21Merge branch 'master' of git://blackhole.kfki.hu/nfPablo Neira Ayuso2-4/+7
Jozsef Kadlecsik says: ==================== ipset patches for nf Please apply the next patches for ipset in your nf branch. Both patches should go into the stable kernel branches as well, because these are important bugfixes: * Sometimes valid entries in hash:* types of sets were evicted due to a typo in an index. The wrong evictions happen when entries are deleted from the set and the bucket is shrinked. Bug was reported by Eric Ewanco and the patch fixes netfilter bugzilla id #1119. * Fixing of a null pointer exception when someone wants to add an entry to an empty list type of set and specifies an add before/after option. The fix is from Vishwanath Pai. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-21netfilter: nfnetlink: remove static declaration from err_listLiping Zhang1-1/+1
Otherwise, different subsys will race to access the err_list, with holding the different nfnl_lock(subsys_id). But this will not happen now, since ->call_batch is only implemented by nftables, so the err_list is protected by nfnl_lock(NFNL_SUBSYS_NFTABLES). Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-20Merge branch 'locking-core-for-linus' of ↵Linus Torvalds10-25/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - Implement wraparound-safe refcount_t and kref_t types based on generic atomic primitives (Peter Zijlstra) - Improve and fix the ww_mutex code (Nicolai Hähnle) - Add self-tests to the ww_mutex code (Chris Wilson) - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr Bueso) - Micro-optimize the current-task logic all around the core kernel (Davidlohr Bueso) - Tidy up after recent optimizations: remove stale code and APIs, clean up the code (Waiman Long) - ... plus misc fixes, updates and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) fork: Fix task_struct alignment locking/spinlock/debug: Remove spinlock lockup detection code lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS lkdtm: Convert to refcount_t testing kref: Implement 'struct kref' using refcount_t refcount_t: Introduce a special purpose refcount type sched/wake_q: Clarify queue reinit comment sched/wait, rcuwait: Fix typo in comment locking/mutex: Fix lockdep_assert_held() fail locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock() locking/rwsem: Reinit wake_q after use locking/rwsem: Remove unnecessary atomic_long_t casts jump_labels: Move header guard #endif down where it belongs locking/atomic, kref: Implement kref_put_lock() locking/ww_mutex: Turn off __must_check for now locking/atomic, kref: Avoid more abuse locking/atomic, kref: Use kref_get_unless_zero() more locking/atomic, kref: Kill kref_sub() locking/atomic, kref: Add kref_read() locking/atomic, kref: Add KREF_INIT() ...
2017-02-20Merge branch 'for-upstream' of ↵David S. Miller2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2017-02-19 Here's a set of Bluetooth patches for the 4.11 kernel: - New USB IDs to the btusb driver - Race fix in btmrvl driver - Added out-of-band wakeup support to the btusb driver - NULL dereference fix to bt_sock_recvmsg Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20net: mpls: Add support for netconfDavid Ahern2-3/+211
Add netconf support to MPLS. Allows userpsace to learn and be notified of changes to 'input' enable setting per interface. Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20sctp: add support for MSG_MOREXin Long2-6/+4
This patch is to add support for MSG_MORE on sctp. It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after creating datamsg according to the send flag. sctp_packet_can_append_data then uses it to decide if the chunks of this msg will be sent at once or delay it. Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of in assoc. As sctp enqueues the chunks first, then dequeue them one by one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may affect other chunks' bundling. Since last patch, sctp flush out queue once assoc state falls into SHUTDOWN_PENDING, the close block problem mentioned in [1] has been solved as well. [1] https://patchwork.ozlabs.org/patch/372404/ Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20sctp: flush out queue once assoc state falls into SHUTDOWN_PENDINGXin Long1-0/+4
This patch is to flush out queue when assoc state falls into SHUTDOWN_PENDING if there are still chunks in it, so that the data can be sent out as soon as possible before sending SHUTDOWN chunk. When sctp supports MSG_MORE flag in next patch, this improvement can also solve the problem that the chunks with MSG_MORE flag may be stuck in queue when closing an assoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20openvswitch: Set event bit after initializing labels.Jarno Rajahalme1-3/+6
Connlabels are included in conntrack netlink event messages only if the IPCT_LABEL bit is set in the event cache (see ctnetlink_conntrack_event()). Set it after initializing labels for a new connection. Found upon further system testing, where it was noticed that labels were missing from the conntrack events. Fixes: 193e30967897 ("openvswitch: Do not trigger events for unconfirmed connections.") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: check duplicate node before inserting a new transportXin Long1-0/+13
sctp has changed to use rhlist for transport rhashtable since commit 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable"). But rhltable_insert_key doesn't check the duplicate node when inserting a node, unlike rhashtable_lookup_insert_key. It may cause duplicate assoc/transport in rhashtable. like: client (addr A, B) server (addr X, Y) connect to X INIT (1) ------------> connect to Y INIT (2) ------------> INIT_ACK (1) <------------ INIT_ACK (2) <------------ After sending INIT (2), one transport will be created and hashed into rhashtable. But when receiving INIT_ACK (1) and processing the address params, another transport will be created and hashed into rhashtable with the same addr Y and EP as the last transport. This will confuse the assoc/transport's lookup. This patch is to fix it by returning err if any duplicate node exists before inserting it. Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable") Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: add reconf chunk eventXin Long1-0/+30
This patch is to add reconf chunk event based on the sctp event frame in rx path, it will call sctp_sf_do_reconf to process the reconf chunk. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: add reconf chunk processXin Long1-0/+54
This patch is to add a function to process the incoming reconf chunk, in which it verifies the chunk, and traverses the param and process it with the right function one by one. sctp_sf_do_reconf would be the process function of reconf chunk event. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: add a function to verify the sctp reconf chunkXin Long1-0/+59
This patch is to add a function sctp_verify_reconf to do some length check and multi-params check for sctp stream reconf according to rfc6525 section 3.1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: implement receiver-side procedures for the Incoming SSN Reset Request ↵Xin Long2-7/+71
Parameter This patch is to implement Receiver-Side Procedures for the Incoming SSN Reset Request Parameter described in rfc6525 section 5.2.3. It's also to move str_list endian conversion out of sctp_make_strreset_req, so that sctp_make_strreset_req can be used more conveniently to process inreq. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: implement receiver-side procedures for the Outgoing SSN Reset Request ↵Xin Long1-0/+113
Parameter This patch is to implement Receiver-Side Procedures for the Outgoing SSN Reset Request Parameter described in rfc6525 section 5.2.2. Note that some checks must be after request_seq check, as even those checks fail, strreset_inseq still has to be increase by 1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: add support for generating stream ssn reset event notificationXin Long1-0/+29
This patch is to add Stream Reset Event described in rfc6525 section 6.1.1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19sctp: add support for generating stream reconf resp chunkXin Long1-0/+74
This patch is to define Re-configuration Response Parameter described in rfc6525 section 4.4. As optional fields are only for SSN/TSN Reset Request Parameter, it uses another function to make that. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19netfilter: xt_hashlimit: Fix integer divide round to zero.Alban Browaeys1-16/+9
Diving the divider by the multiplier before applying to the input. When this would "divide by zero", divide the multiplier by the divider first then multiply the input by this value. Currently user2creds outputs zero when input value is bigger than the number of slices and lower than scale. This as then user input is applied an integer divide operation to a number greater than itself (scale). That rounds up to zero, then we multiply zero by the credits slice size. iptables -t filter -I INPUT --protocol tcp --match hashlimit --hashlimit 40/second --hashlimit-burst 20 --hashlimit-mode srcip --hashlimit-name syn-flood --jump RETURN thus trigger the overflow detection code: xt_hashlimit: overflow, try lower: 25000/20 (25000 as hashlimit avg and 20 the burst) Here: 134217 slices of (HZ * CREDITS_PER_JIFFY) size. 500000 is user input value 1000000 is XT_HASHLIMIT_SCALE_v2 gives: 0 as user2creds output Setting burst to "1" typically solve the issue ... but setting it to "40" does too ! This is on 32bit arch calling into revision 2 of hashlimit. Signed-off-by: Alban Browaeys <alban.browaeys@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-19netfilter: ipset: Null pointer exception in ipset list:setVishwanath Pai1-3/+6
If we use before/after to add an element to an empty list it will cause a kernel panic. $> cat crash.restore create a hash:ip create b hash:ip create test list:set timeout 5 size 4 add test b before a $> ipset -R < crash.restore Executing the above will crash the kernel. Signed-off-by: Vishwanath Pai <vpai@akamai.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2017-02-19Fix bug: sometimes valid entries in hash:* types of sets were evictedJozsef Kadlecsik1-1/+1
Wrong index was used and therefore when shrinking a hash bucket at deleting an entry, valid entries could be evicted as well. Thanks to Eric Ewanco for the thorough bugreport. Fixes netfilter bugzilla #1119 Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2017-02-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-30/+44
2017-02-18ipv6: release dst on error in ip6_dst_lookup_tailWillem de Bruijn1-2/+4
If ip6_dst_lookup_tail has acquired a dst and fails the IPv4-mapped check, release the dst before returning an error. Fixes: ec5e3b0a1d41 ("ipv6: Inhibit IPv4-mapped src address on the wire.") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17irda: Fix lockdep annotations in hashbin_delete().David S. Miller1-18/+16
A nested lock depth was added to the hasbin_delete() code but it doesn't actually work some well and results in tons of lockdep splats. Fix the code instead to properly drop the lock around the operation and just keep peeking the head of the hashbin queue. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17tcp: use page_ref_inc() in tcp_sendmsg()Eric Dumazet1-1/+1
sk_page_frag_refill() allocates either a compound page or an order-0 page. We can use page_ref_inc() which is slightly faster than get_page() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17tcp: accommodate sequence number to a peer's shrunk receive window caused by ↵Cui, Cheng1-2/+5
precision loss in window scaling Prevent sending out a left-shifted sequence number from a Linux sender in response to a peer's shrunk receive-window caused by losing least significant bits in window-scaling. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Cheng Cui <Cheng.Cui@netapp.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17rds:Remove unnecessary ib_ring unallocZhu Yanjun1-1/+0
In the function rds_ib_xmit_atomic, ib_ring is not allocated successfully. As such, it is not necessary to unalloc it. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17pkt_sched: Remove useless qdisc_stab_lockGao Feng1-12/+0
The qdisc_stab_lock is used in qdisc_get_stab and qdisc_put_stab. These two functions are invoked in qdisc_create, qdisc_change, and qdisc_destroy which run fully under RTNL. So it already makes sure only one could access the qdisc_stab_list at the same time. Then it is unnecessary to use qdisc_stab_lock now. Signed-off-by: Gao Feng <fgao@ikuai8.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17rxrpc: Change module filename to rxrpc.koDavid Howells1-6/+6
Change module filename from af-rxrpc.ko to rxrpc.ko so as to be consistent with the other protocol drivers. Also adjust the documentation to reflect this. Further, there is no longer a standalone rxkad module, as it has been merged into the rxrpc core, so get rid of references to that. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17rtnl: don't account unused struct ifla_port_vsi in rtnl_port_sizeDaniel Borkmann1-4/+7
When allocating rtnl dump messages, struct ifla_port_vsi is never dumped, so we can save header plus payload in rtnl_port_size(). Infact, attribute IFLA_PORT_VSI_TYPE and struct ifla_port_vsi are not used anywhere in the kernel. We only need to keep the nla policy should applications in user space be filling this out. Same NLA_BINARY issue exists as was fixed in 364d5716a7ad ("rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY") and others, but then again IFLA_PORT_VSI_TYPE is not used anywhere, so just add a comment that it's unused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17bridge: don't indicate expiry on NTF_EXT_LEARNED fdb entriesRoopa Prabhu1-1/+1
added_by_external_learn fdb entries are added and expired by external entities like switchdev driver or external controllers. ageing is already disabled for such entries. Hence, don't indicate expiry for such fdb entries. CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17bpf: make jited programs visible in tracesDaniel Borkmann2-1/+9
Long standing issue with JITed programs is that stack traces from function tracing check whether a given address is kernel code through {__,}kernel_text_address(), which checks for code in core kernel, modules and dynamically allocated ftrace trampolines. But what is still missing is BPF JITed programs (interpreted programs are not an issue as __bpf_prog_run() will be attributed to them), thus when a stack trace is triggered, the code walking the stack won't see any of the JITed ones. The same for address correlation done from user space via reading /proc/kallsyms. This is read by tools like perf, but the latter is also useful for permanent live tracing with eBPF itself in combination with stack maps when other eBPF types are part of the callchain. See offwaketime example on dumping stack from a map. This work tries to tackle that issue by making the addresses and symbols known to the kernel. The lookup from *kernel_text_address() is implemented through a latched RB tree that can be read under RCU in fast-path that is also shared for symbol/size/offset lookup for a specific given address in kallsyms. The slow-path iteration through all symbols in the seq file done via RCU list, which holds a tiny fraction of all exported ksyms, usually below 0.1 percent. Function symbols are exported as bpf_prog_<tag>, in order to aide debugging and attribution. This facility is currently enabled for root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening is active in any mode. The rationale behind this is that still a lot of systems ship with world read permissions on kallsyms thus addresses should not get suddenly exposed for them. If that situation gets much better in future, we always have the option to change the default on this. Likewise, unprivileged programs are not allowed to add entries there either, but that is less of a concern as most such programs types relevant in this context are for root-only anyway. If enabled, call graphs and stack traces will then show a correct attribution; one example is illustrated below, where the trace is now visible in tooling such as perf script --kallsyms=/proc/kallsyms and friends. Before: 7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux) f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so) After: 7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux) [...] 7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux) f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17bpf: mark all registered map/prog types as __ro_after_initDaniel Borkmann1-9/+9
All map types and prog types are registered to the BPF core through bpf_register_map_type() and bpf_register_prog_type() during init and remain unchanged thereafter. As by design we don't (and never will) have any pluggable code that can register to that at any later point in time, lets mark all the existing bpf_{map,prog}_type_list objects in the tree as __ro_after_init, so they can be moved to read-only section from then onwards. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17dccp: fix freeing skb too early for IPV6_RECVPKTINFOAndrey Konovalov1-1/+2
In the current DCCP implementation an skb for a DCCP_PKT_REQUEST packet is forcibly freed via __kfree_skb in dccp_rcv_state_process if dccp_v6_conn_request successfully returns. However, if IPV6_RECVPKTINFO is set on a socket, the address of the skb is saved to ireq->pktopts and the ref count for skb is incremented in dccp_v6_conn_request, so skb is still in use. Nevertheless, it gets freed in dccp_rcv_state_process. Fix by calling consume_skb instead of doing goto discard and therefore calling __kfree_skb. Similar fixes for TCP: fb7e2399ec17f1004c0e0ccfd17439f8759ede01 [TCP]: skb is unexpectedly freed. 0aea76d35c9651d55bbaf746e7914e5f9ae5a25d tcp: SYN packets are now simply consumed Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17bridge: vlan_tunnel: explicitly reset metadata attrs to NULL on failureRoopa Prabhu1-0/+2
Fixes: efa5356b0d97 ("bridge: per vlan dst_metadata netlink support") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17tipc: Fix tipc_sk_reinit race conditionsHerbert Xu2-11/+23
There are two problems with the function tipc_sk_reinit. Firstly it's doing a manual walk over an rhashtable. This is broken as an rhashtable can be resized and if you manually walk over it during a resize then you may miss entries. Secondly it's missing memory barriers as previously the code used spinlocks which provide the barriers implicitly. This patch fixes both problems. Fixes: 07f6c4bc048a ("tipc: convert tipc reference table to...") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17net/sched: cls_bpf: Reflect HW offload statusOr Gerlitz1-2/+11
BPF classifier support for the "in hw" offloading flags. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Amir Vadai <amir@vadai.me> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>