summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2020-05-25tcp: allow traceroute -Mtcp for unpriv usersEric Dumazet1-0/+2
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones. $ traceroute -Mtcp 8.8.8.8 You do not have enough privileges to use this traceroute method. $ traceroute -n -Mudp 8.8.8.8 traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets 1 192.168.86.1 3.631 ms 3.512 ms 3.405 ms 2 10.1.10.1 4.183 ms 4.125 ms 4.072 ms 3 96.120.88.125 20.621 ms 19.462 ms 20.553 ms 4 96.110.177.65 24.271 ms 25.351 ms 25.250 ms 5 69.139.199.197 44.492 ms 43.075 ms 44.346 ms 6 68.86.143.93 27.969 ms 25.184 ms 25.092 ms 7 96.112.146.18 25.323 ms 96.112.146.22 25.583 ms 96.112.146.26 24.502 ms 8 72.14.239.204 24.405 ms 74.125.37.224 16.326 ms 17.194 ms 9 209.85.251.9 18.154 ms 209.85.247.55 14.449 ms 209.85.251.9 26.296 ms^C We can easily support traceroute over TCP, by queueing an error message into socket error queue. Note that applications need to set IP_RECVERR/IPV6_RECVERR option to enable this feature, and that the error message is only queued while in SYN_SNT state. socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0 setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0 connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host) recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}], msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}}, {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}], msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144 Suggested-by: Maciej Żenczykowski <maze@google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-3/+4
The MSCC bug fix in 'net' had to be slightly adjusted because the register accesses are done slightly differently in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-3/+6
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-05-23 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 8 day(s) which contain a total of 109 files changed, 2776 insertions(+), 2887 deletions(-). The main changes are: 1) Add a new AF_XDP buffer allocation API to the core in order to help lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe as well as mlx5 have been moved over to the new API and also gained a small improvement in performance, from Björn Töpel and Magnus Karlsson. 2) Add getpeername()/getsockname() attach types for BPF sock_addr programs in order to allow for e.g. reverse translation of load-balancer backend to service address/port tuple from a connected peer, from Daniel Borkmann. 3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers being non-NULL, e.g. if after an initial test another non-NULL test on that pointer follows in a given path, then it can be pruned right away, from John Fastabend. 4) Larger rework of BPF sockmap selftests to make output easier to understand and to reduce overall runtime as well as adding new BPF kTLS selftests that run in combination with sockmap, also from John Fastabend. 5) Batch of misc updates to BPF selftests including fixing up test_align to match verifier output again and moving it under test_progs, allowing bpf_iter selftest to compile on machines with older vmlinux.h, and updating config options for lirc and v6 segment routing helpers, from Stanislav Fomichev, Andrii Nakryiko and Alan Maguire. 6) Conversion of BPF tracing samples outdated internal BPF loader to use libbpf API instead, from Daniel T. Lee. 7) Follow-up to BPF kernel test infrastructure in order to fix a flake in the XDP selftests, from Jesper Dangaard Brouer. 8) Minor improvements to libbpf's internal hashmap implementation, from Ian Rogers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22ip6_tunnel: add generic MPLS receive supportVadim Fedorenko1-0/+59
Add support for MPLS in receive side. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22tunnel6: support for IPPROTO_MPLSVadim Fedorenko1-4/+83
This patch is just preparation for MPLS support in ip6_tunnel Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22ip6_tunnel: add MPLS transmit supportVadim Fedorenko1-2/+6
Add ETH_P_MPLS_UC as supported protocol. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22ip6_tunnel: simplify transmit pathVadim Fedorenko1-103/+79
Merge ip{4,6}ip6_tnl_xmit functions into one universal ipxip6_tnl_xmit in preparation for adding MPLS support. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22nexthop: support for fdb ecmp nexthopsRoopa Prabhu1-0/+5
This patch introduces ecmp nexthops and nexthop groups for mac fdb entries. In subsequent patches this is used by the vxlan driver fdb entries. The use case is E-VPN multihoming [1,2,3] which requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (This is analogous to a multi-homed LAG but over vxlan). Changes include new nexthop flag NHA_FDB for nexthops referenced by fdb entries. These nexthops only have ip. This patch includes appropriate checks to avoid routes referencing such nexthops. example: $ip nexthop add id 12 via 172.16.1.2 fdb $ip nexthop add id 13 via 172.16.1.3 fdb $ip nexthop add id 102 group 12/13 fdb $bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self [1] E-VPN https://tools.ietf.org/html/rfc7432 [2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365 [3] LPC talk with mention of nexthop groups for L2 ecmp http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf v4 - fixed uninitialized variable reported by kernel test robot Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21net: don't return invalid table id error when we fall back to PF_UNSPECSabrina Dubroca2-2/+2
In case we can't find a ->dumpit callback for the requested (family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're in the same situation as if userspace had requested a PF_UNSPEC dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all the registered RTM_GETROUTE handlers. The requested table id may or may not exist for all of those families. commit ae677bbb4441 ("net: Don't return invalid table id error when dumping all families") fixed the problem when userspace explicitly requests a PF_UNSPEC dump, but missed the fallback case. For example, when we pass ipv6.disable=1 to a kernel with CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y, the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in rtnl_dump_all, and listing IPv6 routes will unexpectedly print: # ip -6 r Error: ipv4: MR table does not exist. Dump terminated commit ae677bbb4441 introduced the dump_all_families variable, which gets set when userspace requests a PF_UNSPEC dump. However, we can't simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the fallback case to get dump_all_families == true, because some messages types (for example RTM_GETRULE and RTM_GETNEIGH) only register the PF_UNSPEC handler and use the family to filter in the kernel what is dumped to userspace. We would then export more entries, that userspace would have to filter. iproute does that, but other programs may not. Instead, this patch removes dump_all_families and updates the RTM_GETROUTE handlers to check if the family that is being dumped is their own. When it's not, which covers both the intentional PF_UNSPEC dumps (as dump_all_families did) and the fallback case, ignore the missing table id error. Fixes: cb167893f41e ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-20handle the group_source_req options directlyAl Viro1-2/+21
Native ->setsockopt() handling of these options (MCAST_..._SOURCE_GROUP and MCAST_{,UN}BLOCK_SOURCE) consists of copyin + call of a helper that does the actual work. The only change needed for ->compat_setsockopt() is a slightly different copyin - the helpers can be reused as-is. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20ipv6: take handling of group_source_req options into a helperAl Viro1-29/+36
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20ipv[46]: do compat setsockopt for MCAST_{JOIN,LEAVE}_GROUP directlyAl Viro1-0/+28
direct parallel to the way these two are handled in the native ->setsockopt() instances - the helpers that do the real work are already separated and can be reused as-is in this case. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20ipv6: do compat setsockopt for MCAST_MSFILTER directlyAl Viro1-1/+47
similar to the ipv4 counterpart of that patch - the same trick used to align the tail array properly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20ip6_mc_msfilter(): pass the address list separatelyAl Viro2-4/+5
that way we'll be able to reuse it for compat case Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20get rid of compat_mc_getsockopt()Al Viro1-3/+38
now we can do MCAST_MSFILTER in compat ->getsockopt() without playing silly buggers with copying things back and forth. We can form a native struct group_filter (sans the variable-length tail) on stack, pass that + pointer to the tail of original request to the helper doing the bulk of the work, then do the rest of copyout - same as the native getsockopt() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20ip*_mc_gsfget(): lift copyout of struct group_filter into callersAl Viro2-11/+17
pass the userland pointer to the array in its tail, so that part gets copied out by our functions; copyout of everything else is done in the callers. Rationale: reuse for compat; the array is the same in native and compat, the layout of parts before it is different for compat. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20compat_ip{,v6}_setsockopt(): enumerate MCAST_... options explicitlyAl Viro1-1/+9
We want to check if optname is among the MCAST_... ones; do that as an explicit switch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-19ipv6: use ->ndo_tunnel_ctl in addrconf_set_dstaddrChristoph Hellwig1-7/+2
Use the new ->ndo_tunnel_ctl instead of overriding the address limit and using ->ndo_do_ioctl just to do a pointless user copy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19ipv6: streamline addrconf_set_dstaddrChristoph Hellwig1-49/+38
Factor out a addrconf_set_sit_dstaddr helper for the actual work if we found a SIT device, and only hold the rtnl lock around the device lookup and that new helper, as there is no point in holding it over a copy_from_user call. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19ipv6: stub out even more of addrconf_set_dstaddr if SIT is disabledChristoph Hellwig1-2/+3
There is no point in copying the structure from userspace or looking up a device if SIT support is not disabled and we'll eventually return -ENODEV anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19sit: impement ->ndo_tunnel_ctlChristoph Hellwig1-39/+34
Implement the ->ndo_tunnel_ctl method, and use ip_tunnel_ioctl to handle userspace requests for the SIOCGETTUNNEL, SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19sit: refactor ipip6_tunnel_ioctlChristoph Hellwig1-158/+210
Split the ioctl handler into one function per command instead of having a all the logic sit in one giant switch statement. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19bpf: Add get{peer, sock}name attach types for sock_addrDaniel Borkmann1-3/+6
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-18ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctlChristoph Hellwig2-0/+54
To prepare removing the global routing_ioctl hack start lifting the code into a newly added ipv6 ->compat_ioctl handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18ipv6: lift copy_from_user out of ipv6_route_ioctlChristoph Hellwig2-34/+26
Prepare for better compat ioctl handling by moving the user copy out of ipv6_route_ioctl. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16netns: enable to inherit devconf from current netnsNicolas Dichtel1-3/+20
The goal is to be able to inherit the initial devconf parameters from the current netns, ie the netns where this new netns has been created. This is useful in a containers environment where /proc/sys is read only. For example, if a pod is created with specifics devconf parameters and has the capability to create netns, the user expects to get the same parameters than his 'init_net', which is not the real init_net in this case. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16ipv6: Fix suspicious RCU usage warning in ip6mrMadhuparna Bhowmik1-1/+2
This patch fixes the following warning: ============================= WARNING: suspicious RCU usage 5.7.0-rc4-next-20200507-syzkaller #0 Not tainted ----------------------------- net/ipv6/ip6mr.c:124 RCU-list traversed in non-reader section!! ipmr_new_table() returns an existing table, but there is no table at init. Therefore the condition: either holding rtnl or the list is empty is used. Fixes: d1db275dd3f6e ("ipv6: ip6mr: support multiple tables") Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-3/+6
Move the bpf verifier trace check into the new switch statement in HEAD. Resolve the overlapping changes in hinic, where bug fixes overlap the addition of VF support. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-12/+112
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-05-14 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON. 2) support for narrow loads in bpf_sock_addr progs and additional helpers in cg-skb progs, from Andrey. 3) bpf benchmark runner, from Andrii. 4) arm and riscv JIT optimizations, from Luke. 5) bpf iterator infrastructure, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13ipv6: set msg_control_is_user in do_ipv6_getsockoptChristoph Hellwig1-0/+1
While do_ipv6_getsockopt does not call the high-level recvmsg helper, the msghdr eventually ends up being passed to put_cmsg anyway, and thus needs msg_control_is_user set to the proper value. Fixes: 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for ->msg_control") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13bpf: Enable bpf_iter targets registering ctx argument typesYonghong Song2-5/+5
Commit b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL support") adds a field btf_id_or_null_non0_off to bpf_prog->aux structure to indicate that the first ctx argument is PTR_TO_BTF_ID reg_type and all others are PTR_TO_BTF_ID_OR_NULL. This approach does not really scale if we have other different reg types in the future, e.g., a pointer to a buffer. This patch enables bpf_iter targets registering ctx argument reg types which may be different from the default one. For example, for pointers to structures, the default reg_type is PTR_TO_BTF_ID for tracing program. The target can register a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can be used by the verifier to enforce accesses. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
2020-05-13bpf: Change func bpf_iter_unreg_target() signatureYonghong Song1-1/+1
Change func bpf_iter_unreg_target() parameter from target name to target reg_info, similar to bpf_iter_reg_target(). Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200513180220.2949737-1-yhs@fb.com
2020-05-13bpf: net: Refactor bpf_iter target registrationYonghong Song1-9/+9
Currently bpf_iter_reg_target takes parameters from target and allocates memory to save them. This is really not necessary, esp. in the future we may grow information passed from targets to bpf_iter manager. The patch refactors the code so target reg_info becomes static and bpf_iter manager can just take a reference to it. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com
2020-05-12netlabel: cope with NULL catmapPaolo Abeni1-1/+2
The cipso and calipso code can set the MLS_CAT attribute on successful parsing, even if the corresponding catmap has not been allocated, as per current configuration and external input. Later, selinux code tries to access the catmap if the MLS_CAT flag is present via netlbl_catmap_getlong(). That may cause null ptr dereference while processing incoming network traffic. Address the issue setting the MLS_CAT flag only if the catmap is really allocated. Additionally let netlbl_catmap_getlong() cope with NULL catmap. Reported-by: Matthew Sheets <matthew.sheets@gd-ms.com> Fixes: 4b8feff251da ("netlabel: fix the horribly broken catmap functions") Fixes: ceba1832b1b2 ("calipso: Set the calipso socket label to match the secattr.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-09net: bpf: Add netlink and ipv6_route bpf_iter targetsYonghong Song2-2/+100
This patch added netlink and ipv6_route targets, using the same seq_ops (except show() and minor changes for stop()) for /proc/net/{netlink,ipv6_route}. The net namespace for these targets are the current net namespace at file open stage, similar to /proc/net/{netlink,ipv6_route} reference counting the net namespace at seq_file open stage. Since module is not supported for now, ipv6_route is supported only if the IPV6 is built-in, i.e., not compiled as a module. The restriction can be lifted once module is properly supported for bpf_iter. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
2020-05-08ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()Eric Dumazet1-1/+1
We currently have to adjust ipv6 route gc_thresh/max_size depending on number of cpus on a server, this makes very little sense. If the kernels sets /proc/sys/net/ipv6/route/gc_thresh to 1024 and /proc/sys/net/ipv6/route/max_size to 4096, then we better not track the percpu dst that our implementation uses. Only routes not added (directly or indirectly) by the admin should be tracked and limited. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@kernel.org> Cc: Maciej Żenczykowski <maze@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08net/dst: use a smaller percpu_counter batch for dst entries accountingEric Dumazet1-0/+3
percpu_counter_add() uses a default batch size which is quite big on platforms with 256 cpus. (2*256 -> 512) This means dst_entries_get_fast() can be off by +/- 2*(nr_cpus^2) (131072 on servers with 256 cpus) Reduce the batch size to something more reasonable, and add logic to ip6_dst_gc() to call dst_entries_get_slow() before calling the _very_ expensive fib6_run_gc() function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-09bpf: Allow any port in bpf_bind helperStanislav Fomichev1-5/+7
We want to have a tighter control on what ports we bind to in the BPF_CGROUP_INET{4,6}_CONNECT hooks even if it means connect() becomes slightly more expensive. The expensive part comes from the fact that we now need to call inet_csk_get_port() that verifies that the port is not used and allocates an entry in the hash table for it. Since we can't rely on "snum || !bind_address_no_port" to prevent us from calling POST_BIND hook anymore, let's add another bind flag to indicate that the call site is BPF program. v5: * fix wrong AF_INET (should be AF_INET6) in the bpf program for v6 v3: * More bpf_bind documentation refinements (Martin KaFai Lau) * Add UDP tests as well (Martin KaFai Lau) * Don't start the thread, just do socket+bind+listen (Martin KaFai Lau) v2: * Update documentation (Andrey Ignatov) * Pass BIND_FORCE_ADDRESS_NO_PORT conditionally (Andrey Ignatov) Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200508174611.228805-5-sdf@google.com
2020-05-09net: Refactor arguments of inet{,6}_bindStanislav Fomichev1-5/+5
The intent is to add an additional bind parameter in the next commit. Instead of adding another argument, let's convert all existing flag arguments into an extendable bit field. No functional changes. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200508174611.228805-4-sdf@google.com
2020-05-07Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu"Maciej Żenczykowski1-2/+4
This reverts commit 19bda36c4299ce3d7e5bce10bebe01764a655a6d: | ipv6: add mtu lock check in __ip6_rt_update_pmtu | | Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu. | It leaded to that mtu lock doesn't really work when receiving the pkt | of ICMPV6_PKT_TOOBIG. | | This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4 | did in __ip_rt_update_pmtu. The above reasoning is incorrect. IPv6 *requires* icmp based pmtu to work. There's already a comment to this effect elsewhere in the kernel: $ git grep -p -B1 -A3 'RTAX_MTU lock' net/ipv6/route.c=4813= static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg) ... /* In IPv6 pmtu discovery is not optional, so that RTAX_MTU lock cannot disable it. We still use this lock to block changes caused by addrconf/ndisc. */ This reverts to the pre-4.9 behaviour. Cc: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Maciej Żenczykowski <maze@google.com> Fixes: 19bda36c4299 ("ipv6: add mtu lock check in __ip6_rt_update_pmtu") Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-2/+33
Conflicts were all overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06seg6: fix SRH processing to comply with RFC8754Ahmed Abdelsalam1-2/+8
The Segment Routing Header (SRH) which defines the SRv6 dataplane is defined in RFC8754. RFC8754 (section 4.1) defines the SR source node behavior which encapsulates packets into an outer IPv6 header and SRH. The SR source node encodes the full list of Segments that defines the packet path in the SRH. Then, the first segment from list of Segments is copied into the Destination address of the outer IPv6 header and the packet is sent to the first hop in its path towards the destination. If the Segment list has only one segment, the SR source node can omit the SRH as he only segment is added in the destination address. RFC8754 (section 4.1.1) defines the Reduced SRH, when a source does not require the entire SID list to be preserved in the SRH. A reduced SRH does not contain the first segment of the related SR Policy (the first segment is the one already in the DA of the IPv6 header), and the Last Entry field is set to n-2, where n is the number of elements in the SR Policy. RFC8754 (section 4.3.1.1) defines the SRH processing and the logic to validate the SRH (S09, S10, S11) which works for both reduced and non-reduced behaviors. This patch updates seg6_validate_srh() to validate the SRH as per RFC8754. Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06ipv6: Implement draft-ietf-6man-rfc4941bisFernando Gont1-52/+39
Implement the upcoming rev of RFC4941 (IPv6 temporary addresses): https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09 * Reduces the default Valid Lifetime to 2 days The number of extra addresses employed when Valid Lifetime was 7 days exacerbated the stress caused on network elements/devices. Additionally, the motivation for temporary addresses is indeed privacy and reduced exposure. With a default Valid Lifetime of 7 days, an address that becomes revealed by active communication is reachable and exposed for one whole week. The only use case for a Valid Lifetime of 7 days could be some application that is expecting to have long lived connections. But if you want to have a long lived connections, you shouldn't be using a temporary address in the first place. Additionally, in the era of mobile devices, general applications should nevertheless be prepared and robust to address changes (e.g. nodes swap wifi <-> 4G, etc.) * Employs different IIDs for different prefixes To avoid network activity correlation among addresses configured for different prefixes * Uses a simpler algorithm for IID generation No need to store "history" anywhere Signed-off-by: Fernando Gont <fgont@si6networks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-26/+18
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-05-01 (v2) The following pull-request contains BPF updates for your *net-next* tree. We've added 61 non-merge commits during the last 6 day(s) which contain a total of 153 files changed, 6739 insertions(+), 3367 deletions(-). The main changes are: 1) pulled work.sysctl from vfs tree with sysctl bpf changes. 2) bpf_link observability, from Andrii. 3) BTF-defined map in map, from Andrii. 4) asan fixes for selftests, from Andrii. 5) Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH, from Jakub. 6) production cloudflare classifier as a selftes, from Lorenz. 7) bpf_ktime_get_*_ns() helper improvements, from Maciej. 8) unprivileged bpftool feature probe, from Quentin. 9) BPF_ENABLE_STATS command, from Song. 10) enable bpf_[gs]etsockopt() helpers for sock_ops progs, from Stanislav. 11) enable a bunch of common helpers for cg-device, sysctl, sockopt progs, from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01ipv6: Use global sernum for dst validation with nexthop objectsDavid Ahern1-0/+25
Nik reported a bug with pcpu dst cache when nexthop objects are used illustrated by the following: $ ip netns add foo $ ip -netns foo li set lo up $ ip -netns foo addr add 2001:db8:11::1/128 dev lo $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1 $ ip li add veth1 type veth peer name veth2 $ ip li set veth1 up $ ip addr add 2001:db8:10::1/64 dev veth1 $ ip li set dev veth2 netns foo $ ip -netns foo li set veth2 up $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2 $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1 $ ip -6 route add 2001:db8:11::1/128 nhid 100 Create a pcpu entry on cpu 0: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1 Re-add the route entry: $ ip -6 ro del 2001:db8:11::1 $ ip -6 route add 2001:db8:11::1/128 nhid 100 Route get on cpu 0 returns the stale pcpu: $ taskset -a -c 0 ip -6 route get 2001:db8:11::1 RTNETLINK answers: Network is unreachable While cpu 1 works: $ taskset -a -c 1 ip -6 route get 2001:db8:11::1 2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium Conversion of FIB entries to work with external nexthop objects missed an important difference between IPv4 and IPv6 - how dst entries are invalidated when the FIB changes. IPv4 has a per-network namespace generation id (rt_genid) that is bumped on changes to the FIB. Checking if a dst_entry is still valid means comparing rt_genid in the rtable to the current value of rt_genid for the namespace. IPv6 also has a per network namespace counter, fib6_sernum, but the count is saved per fib6_node. With the per-node counter only dst_entries based on fib entries under the node are invalidated when changes are made to the routes - limiting the scope of invalidations. IPv6 uses a reference in the rt6_info, 'from', to track the corresponding fib entry used to create the dst_entry. When validating a dst_entry, the 'from' is used to backtrack to the fib6_node and check the sernum of it to the cookie passed to the dst_check operation. With the inline format (nexthop definition inline with the fib6_info), dst_entries cached in the fib6_nh have a 1:1 correlation between fib entries, nexthop data and dst_entries. With external nexthops, IPv6 looks more like IPv4 which means multiple fib entries across disparate fib6_nodes can all reference the same fib6_nh. That means validation of dst_entries based on external nexthops needs to use the IPv4 format - the per-network namespace counter. Add sernum to rt6_info and set it when creating a pcpu dst entry. Update rt6_get_cookie to return sernum if it is set and update dst_check for IPv6 to look for sernum set and based the check on it if so. Finally, rt6_get_pcpu_route needs to validate the cached entry before returning a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and __mkroute_output for IPv4). This problem only affects routes using the new, external nexthops. Thanks to the kbuild test robot for catching the IS_ENABLED needed around rt_genid_ipv6 before I sent this out. Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects") Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29ila: remove unused inline function ila_addr_is_ilaYueHaibing1-5/+0
There's no callers in-tree anymore since commit 84287bb32856 ("ila: add checksum neutral map auto"). Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28docs: networking: convert ipv6.txt to ReSTMauro Carvalho Chehab1-1/+1
Not much to be done here: - add SPDX header; - add a document title; - mark a literal as such, in order to avoid a warning; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28net: ipv4: add sysctl for nexthop api compatibility modeRoopa Prabhu1-1/+2
Current route nexthop API maintains user space compatibility with old route API by default. Dumps and netlink notifications support both new and old API format. In systems which have moved to the new API, this compatibility mode cancels some of the performance benefits provided by the new nexthop API. This patch adds new sysctl nexthop_compat_mode which is on by default but provides the ability to turn off compatibility mode allowing systems to run entirely with the new routing API. Old route API behaviour and support is not modified by this sysctl. Uses a single sysctl to cover both ipv4 and ipv6 following other sysctls. Covers dumps and delete notifications as suggested by David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28net: ipv6: new arg skip_notify to ip6_rt_delRoopa Prabhu5-14/+18
Used in subsequent work to skip route delete notifications on nexthop deletes. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28Merge branch 'work.sysctl' of ↵Daniel Borkmann4-26/+18
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/