summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2023-08-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-42/+76
Cross-merge networking fixes after downstream PR. Conflicts: net/dsa/port.c 9945c1fb03a3 ("net: dsa: fix older DSA drivers using phylink") a88dd7538461 ("net: dsa: remove legacy_pre_march2020 detection") https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/ net/xdp/xsk.c 3c5b4d69c358 ("net: annotate data-races around sk->sk_mark") b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path") https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/ drivers/net/ethernet/broadcom/bnxt/bnxt.c 37b61cda9c16 ("bnxt: don't handle XDP in netpoll") 2b56b3d99241 ("eth: bnxt: handle invalid Tx completions more gracefully") https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c 62da08331f1a ("net/mlx5e: Set proper IPsec source port in L4 selector") fbd517549c32 ("net/mlx5e: Add function to get IPsec offload namespace") drivers/net/ethernet/sfc/selftest.c 55c1528f9b97 ("sfc: fix field-spanning memcpy in selftest") ae9d445cd41f ("sfc: Miscellaneous comment removals") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopenEric Dumazet1-4/+5
Whenever tcpm_new() reclaims an old entry, tcpm_suck_dst() would overwrite data that could be read from tcp_fastopen_cache_get() or tcp_metrics_fill_info(). We need to acquire fastopen_seqlock to maintain consistency. For newly allocated objects, tcpm_new() can switch to kzalloc() to avoid an extra fastopen_seqlock acquisition. Fixes: 1fe4c481ba63 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03tcp_metrics: annotate data-races around tm->tcpm_netEric Dumazet1-4/+7
tm->tcpm_net can be read or written locklessly. Instead of changing write_pnet() and read_pnet() and potentially hurt performance, add the needed READ_ONCE()/WRITE_ONCE() in tm_net() and tcpm_new(). Fixes: 849e8a0ca8d5 ("tcp_metrics: Add a field tcpm_net and verify it matches on lookup") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03tcp_metrics: annotate data-races around tm->tcpm_vals[]Eric Dumazet1-9/+14
tm->tcpm_vals[] values can be read or written locklessly. Add needed READ_ONCE()/WRITE_ONCE() to document this, and force use of tcp_metric_get() and tcp_metric_set() Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03tcp_metrics: annotate data-races around tm->tcpm_lockEric Dumazet1-2/+4
tm->tcpm_lock can be read or written locklessly. Add needed READ_ONCE()/WRITE_ONCE() to document this. Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03tcp_metrics: annotate data-races around tm->tcpm_stampEric Dumazet1-6/+13
tm->tcpm_stamp can be read or written locklessly. Add needed READ_ONCE()/WRITE_ONCE() to document this. Also constify tcpm_check_stamp() dst argument. Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03tcp_metrics: fix addr_same() helperEric Dumazet1-1/+1
Because v4 and v6 families use separate inetpeer trees (respectively net->ipv4.peers and net->ipv6.peers), inetpeer_addr_cmp(a, b) assumes a & b share the same family. tcp_metrics use a common hash table, where entries can have different families. We must therefore make sure to not call inetpeer_addr_cmp() if the families do not match. Fixes: d39d14ffa24c ("net: Add helper function to compare inetpeer addresses") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230802131500.1478140-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGESDavid Howells1-0/+9
__ip_append_data() can get into an infinite loop when asked to splice into a partially-built UDP message that has more than the frag-limit data and up to the MTU limit. Something like: pipe(pfd); sfd = socket(AF_INET, SOCK_DGRAM, 0); connect(sfd, ...); send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE); write(pfd[1], buffer, 8); splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0); where the amount of data given to send() is dependent on the MTU size (in this instance an interface with an MTU of 8192). The problem is that the calculation of the amount to copy in __ip_append_data() goes negative in two places, and, in the second place, this gets subtracted from the length remaining, thereby increasing it. This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and MSG_SPLICE_PAGES), because the terms in: copy = datalen - transhdrlen - fraggap - pagedlen; then mostly cancel when pagedlen is substituted for, leaving just -fraggap. This causes: length -= copy + transhdrlen; to increase the length to more than the amount of data in msg->msg_iter, which causes skb_splice_from_iter() to be unable to fill the request and it returns less than 'copied' - which means that length never gets to 0 and we never exit the loop. Fix this by: (1) Insert a note about the dodgy calculation of 'copy'. (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above equation, so that 'offset' isn't regressed and 'length' isn't increased, which will mean that length and thus copy should match the amount left in the iterator. (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if we're asked to splice more than is in the iterator. It might be better to not give the warning or even just give a 'short' write. [!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY avoids the problem by simply assuming that everything asked for got copied, not just the amount that was in the iterator. This is a potential bug for the future. Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES") Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1420063.1690904933@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-29net: annotate data-races around sk->sk_priorityEric Dumazet4-5/+5
sk_getsockopt() runs locklessly. This means sk->sk_priority can be read while other threads are changing its value. Other reads also happen without socket lock being held. Add missing annotations where needed. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29net: annotate data-races around sk->sk_markEric Dumazet4-7/+7
sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29net: gro: fix misuse of CB in udp socket lookupRichard Gobert2-4/+11
This patch fixes a misuse of IP{6}CB(skb) in GRO, while calling to `udp6_lib_lookup2` when handling udp tunnels. `udp6_lib_lookup2` fetch the device from CB. The fix changes it to fetch the device from `skb->dev`. l3mdev case requires special attention since it has a master and a slave device. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Reported-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-20nexthop: Do not return invalid nexthop object during multipath selectionBenjamin Poirier1-6/+8
With legacy nexthops, when net.ipv4.fib_multipath_use_neigh is set, fib_select_multipath() will never set res->nhc to a nexthop that is not good (as per fib_good_nh()). OTOH, with nexthop objects, nexthop_select_path_hthr() may return a nexthop that failed the nexthop_is_good_nh() test even if there was one that passed. Refactor nexthop_select_path_hthr() to follow a selection logic more similar to fib_select_multipath(). The issue can be demonstrated with the following sequence of commands. The first block shows that things work as expected with legacy nexthops. The last sequence of `ip rou get` in the second block shows the problem case - some routes still use the .2 nexthop. sysctl net.ipv4.fib_multipath_use_neigh=1 ip link add dummy1 up type dummy ip rou add 198.51.100.0/24 nexthop via 192.0.2.1 dev dummy1 onlink nexthop via 192.0.2.2 dev dummy1 onlink for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh add 192.0.2.1 dev dummy1 nud failed echo ".1 failed:" # results should not use .1 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh del 192.0.2.1 dev dummy1 ip neigh add 192.0.2.2 dev dummy1 nud failed echo ".2 failed:" # results should not use .2 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip link del dummy1 ip link add dummy1 up type dummy ip nexthop add id 1 via 192.0.2.1 dev dummy1 onlink ip nexthop add id 2 via 192.0.2.2 dev dummy1 onlink ip nexthop add id 1001 group 1/2 ip rou add 198.51.100.0/24 nhid 1001 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh add 192.0.2.1 dev dummy1 nud failed echo ".1 failed:" # results should not use .1 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh del 192.0.2.1 dev dummy1 ip neigh add 192.0.2.2 dev dummy1 nud failed echo ".2 failed:" # results should not use .2 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip link del dummy1 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230719-nh_select-v2-3-04383e89f868@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20nexthop: Factor out neighbor validity checkBenjamin Poirier1-13/+16
For legacy nexthops, there is fib_good_nh() to check the neighbor validity. In order to make the nexthop object code more similar to the legacy nexthop code, factor out the nexthop object neighbor validity check into its own function. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230719-nh_select-v2-2-04383e89f868@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20nexthop: Factor out hash threshold fdb nexthop selectionBenjamin Poirier1-2/+20
The loop in nexthop_select_path_hthr() includes code to check for neighbor validity. Since this does not apply to fdb nexthops, simplify the loop by moving the fdb nexthop selection to its own function. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230719-nh_select-v2-1-04383e89f868@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski11-68/+73
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around fastopenq.max_qlenEric Dumazet2-3/+5
This field can be read locklessly. Fixes: 1536e2857bd3 ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-12-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around icsk->icsk_user_timeoutEric Dumazet1-3/+3
This field can be read locklessly from do_tcp_getsockopt() Fixes: dca43c75e7e5 ("tcp: Add TCP_USER_TIMEOUT socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->notsent_lowatEric Dumazet1-2/+2
tp->notsent_lowat can be read locklessly from do_tcp_getsockopt() and tcp_poll(). Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around rskq_defer_acceptEric Dumazet1-5/+6
do_tcp_getsockopt() reads rskq_defer_accept while another cpu might change its value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->linger2Eric Dumazet1-4/+4
do_tcp_getsockopt() reads tp->linger2 while another cpu might change its value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around icsk->icsk_syn_retriesEric Dumazet2-4/+4
do_tcp_getsockopt() and reqsk_timer_handler() read icsk->icsk_syn_retries while another cpu might change its value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->keepalive_probesEric Dumazet1-2/+3
do_tcp_getsockopt() reads tp->keepalive_probes while another cpu might change its value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->keepalive_intvlEric Dumazet1-2/+2
do_tcp_getsockopt() reads tp->keepalive_intvl while another cpu might change its value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->keepalive_timeEric Dumazet1-1/+2
do_tcp_getsockopt() reads tp->keepalive_time while another cpu might change its value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->tsoffsetEric Dumazet2-4/+5
do_tcp_getsockopt() reads tp->tsoffset while another cpu might change its value. Fixes: 93be6ce0e91b ("tcp: set and get per-socket timestamp") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: annotate data-races around tp->tcp_tx_delayEric Dumazet1-2/+2
do_tcp_getsockopt() reads tp->tcp_tx_delay while another cpu might change its value. Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719212857.3943972-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20tcp: add TCP_OLD_SEQUENCE drop reasonEric Dumazet1-5/+11
tcp_sequence() uses two conditions to decide to drop a packet, and we currently report generic TCP_INVALID_SEQUENCE drop reason. Duplicates are common, we need to distinguish them from the other case. I chose to not reuse TCP_OLD_DATA, and instead added TCP_OLD_SEQUENCE drop reason. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230719064754.2794106-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-19tcp: tcp_enter_quickack_mode() should be staticEric Dumazet1-2/+1
After commit d2ccd7bc8acd ("tcp: avoid resetting ACK timer in DCTCP"), tcp_enter_quickack_mode() is only used from net/ipv4/tcp_input.c. Fixes: d2ccd7bc8acd ("tcp: avoid resetting ACK timer in DCTCP") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20230718162049.1444938-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19udp: use indirect call wrapper for data ready()Paolo Abeni1-1/+1
In most cases UDP sockets use the default data ready callback. Leverage the indirect call wrapper for such callback to avoid an indirect call in fastpath. The above gives small but measurable performance gain under UDP flood. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/d47d53e6f8ee7a11228ca2f025d6243cc04b77f3.1689691004.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19Revert "tcp: avoid the lookup process failing to get sk in ehash table"Kuniyuki Iwashima2-19/+6
This reverts commit 3f4ca5fafc08881d7a57daa20449d171f2887043. Commit 3f4ca5fafc08 ("tcp: avoid the lookup process failing to get sk in ehash table") reversed the order in how a socket is inserted into ehash to fix an issue that ehash-lookup could fail when reqsk/full sk/twsk are swapped. However, it introduced another lookup failure. The full socket in ehash is allocated from a slab with SLAB_TYPESAFE_BY_RCU and does not have SOCK_RCU_FREE, so the socket could be reused even while it is being referenced on another CPU doing RCU lookup. Let's say a socket is reused and inserted into the same hash bucket during lookup. After the blamed commit, a new socket is inserted at the end of the list. If that happens, we will skip sockets placed after the previous position of the reused socket, resulting in ehash lookup failure. As described in Documentation/RCU/rculist_nulls.rst, we should insert a new socket at the head of the list to avoid such an issue. This issue, the swap-lookup-failure, and another variant reported in [0] can all be handled properly by adding a locked ehash lookup suggested by Eric Dumazet [1]. However, this issue could occur for every packet, thus more likely than the other two races, so let's revert the change for now. Link: https://lore.kernel.org/netdev/20230606064306.9192-1-duanmuquan@baidu.com/ [0] Link: https://lore.kernel.org/netdev/CANn89iK8snOz8TYOhhwfimC7ykYA78GA3Nyv8x06SZYa1nKdyA@mail.gmail.com/ [1] Fixes: 3f4ca5fafc08 ("tcp: avoid the lookup process failing to get sk in ehash table") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230717215918.15723-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19Merge tag 'for-netdev' of ↵Jakub Kicinski1-2/+0
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-07-19 We've added 45 non-merge commits during the last 3 day(s) which contain a total of 71 files changed, 7808 insertions(+), 592 deletions(-). The main changes are: 1) multi-buffer support in AF_XDP, from Maciej Fijalkowski, Magnus Karlsson, Tirthendu Sarkar. 2) BPF link support for tc BPF programs, from Daniel Borkmann. 3) Enable bpf_map_sum_elem_count kfunc for all program types, from Anton Protopopov. 4) Add 'owner' field to bpf_rb_node to fix races in shared ownership, Dave Marchevsky. 5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) bpf, net: Introduce skb_pointer_if_linear(). bpf: sync tools/ uapi header with selftests/bpf: Add mprog API tests for BPF tcx links selftests/bpf: Add mprog API tests for BPF tcx opts bpftool: Extend net dump with tcx progs libbpf: Add helper macro to clear opts structs libbpf: Add link-based API for tcx libbpf: Add opts-based attach/detach/query API for tcx bpf: Add fd-based tcx multi-prog infra with link support bpf: Add generic attach/detach/query API for multi-progs selftests/xsk: reset NIC settings to default after running test suite selftests/xsk: add test for too many frags selftests/xsk: add metadata copy test for multi-buff selftests/xsk: add invalid descriptor test for multi-buffer selftests/xsk: add unaligned mode test for multi-buffer selftests/xsk: add basic multi-buffer test selftests/xsk: transmit and receive multi-buffer packets xsk: add multi-buffer documentation i40e: xsk: add TX multi-buffer support ice: xsk: Tx multi-buffer support ... ==================== Link: https://lore.kernel.org/r/20230719175424.75717-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19ipv4: ip_gre: fix return value check in erspan_xmit()Yuanjun Gong1-1/+2
goto free_skb if an unexpected result is returned by pskb_tirm() in erspan_xmit(). Signed-off-by: Yuanjun Gong <ruc_gongyuanjun@163.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-19ipv4: ip_gre: fix return value check in erspan_fb_xmit()Yuanjun Gong1-1/+2
goto err_free_skb if an unexpected result is returned by pskb_tirm() in erspan_fb_xmit(). Signed-off-by: Yuanjun Gong <ruc_gongyuanjun@163.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-19net: ipv4: Use kfree_sensitive instead of kfreeWang Ming1-1/+1
key might contain private part of the key, so better use kfree_sensitive to free it. Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP") Signed-off-by: Wang Ming <machel@vivo.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-18tcp: annotate data-races around tcp_rsk(req)->ts_recentEric Dumazet3-5/+8
TCP request sockets are lockless, tcp_rsk(req)->ts_recent can change while being read by another cpu as syzbot noticed. This is harmless, but we should annotate the known races. Note that tcp_check_req() changes req->ts_recent a bit early, we might change this in the future. BUG: KCSAN: data-race in tcp_check_req / tcp_check_req write to 0xffff88813c8afb84 of 4 bytes by interrupt on cpu 1: tcp_check_req+0x694/0xc70 net/ipv4/tcp_minisocks.c:762 tcp_v4_rcv+0x12db/0x1b70 net/ipv4/tcp_ipv4.c:2071 ip_protocol_deliver_rcu+0x356/0x6d0 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x13c/0x1a0 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:303 [inline] ip_local_deliver+0xec/0x1c0 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:468 [inline] ip_rcv_finish net/ipv4/ip_input.c:449 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip_rcv+0x197/0x270 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core net/core/dev.c:5493 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5607 process_backlog+0x21f/0x380 net/core/dev.c:5935 __napi_poll+0x60/0x3b0 net/core/dev.c:6498 napi_poll net/core/dev.c:6565 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6698 __do_softirq+0xc1/0x265 kernel/softirq.c:571 do_softirq+0x7e/0xb0 kernel/softirq.c:472 __local_bh_enable_ip+0x64/0x70 kernel/softirq.c:396 local_bh_enable+0x1f/0x20 include/linux/bottom_half.h:33 rcu_read_unlock_bh include/linux/rcupdate.h:843 [inline] __dev_queue_xmit+0xabb/0x1d10 net/core/dev.c:4271 dev_queue_xmit include/linux/netdevice.h:3088 [inline] neigh_hh_output include/net/neighbour.h:528 [inline] neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0x700/0x840 net/ipv4/ip_output.c:229 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:292 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:431 dst_output include/net/dst.h:458 [inline] ip_local_out net/ipv4/ip_output.c:126 [inline] __ip_queue_xmit+0xa4d/0xa70 net/ipv4/ip_output.c:533 ip_queue_xmit+0x38/0x40 net/ipv4/ip_output.c:547 __tcp_transmit_skb+0x1194/0x16e0 net/ipv4/tcp_output.c:1399 tcp_transmit_skb net/ipv4/tcp_output.c:1417 [inline] tcp_write_xmit+0x13ff/0x2fd0 net/ipv4/tcp_output.c:2693 __tcp_push_pending_frames+0x6a/0x1a0 net/ipv4/tcp_output.c:2877 tcp_push_pending_frames include/net/tcp.h:1952 [inline] __tcp_sock_set_cork net/ipv4/tcp.c:3336 [inline] tcp_sock_set_cork+0xe8/0x100 net/ipv4/tcp.c:3343 rds_tcp_xmit_path_complete+0x3b/0x40 net/rds/tcp_send.c:52 rds_send_xmit+0xf8d/0x1420 net/rds/send.c:422 rds_send_worker+0x42/0x1d0 net/rds/threads.c:200 process_one_work+0x3e6/0x750 kernel/workqueue.c:2408 worker_thread+0x5f2/0xa10 kernel/workqueue.c:2555 kthread+0x1d7/0x210 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 read to 0xffff88813c8afb84 of 4 bytes by interrupt on cpu 0: tcp_check_req+0x32a/0xc70 net/ipv4/tcp_minisocks.c:622 tcp_v4_rcv+0x12db/0x1b70 net/ipv4/tcp_ipv4.c:2071 ip_protocol_deliver_rcu+0x356/0x6d0 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x13c/0x1a0 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:303 [inline] ip_local_deliver+0xec/0x1c0 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:468 [inline] ip_rcv_finish net/ipv4/ip_input.c:449 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip_rcv+0x197/0x270 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core net/core/dev.c:5493 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5607 process_backlog+0x21f/0x380 net/core/dev.c:5935 __napi_poll+0x60/0x3b0 net/core/dev.c:6498 napi_poll net/core/dev.c:6565 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6698 __do_softirq+0xc1/0x265 kernel/softirq.c:571 run_ksoftirqd+0x17/0x20 kernel/softirq.c:939 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 value changed: 0x1cd237f1 -> 0x1cd237f2 Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230717144445.653164-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18tcp: annotate data-races around tcp_rsk(req)->txhashEric Dumazet3-4/+5
TCP request sockets are lockless, some of their fields can change while being read by another cpu as syzbot noticed. This is usually harmless, but we should annotate the known races. This patch takes care of tcp_rsk(req)->txhash, a separate one is needed for tcp_rsk(req)->ts_recent. BUG: KCSAN: data-race in tcp_make_synack / tcp_rtx_synack write to 0xffff8881362304bc of 4 bytes by task 32083 on cpu 1: tcp_rtx_synack+0x9d/0x2a0 net/ipv4/tcp_output.c:4213 inet_rtx_syn_ack+0x38/0x80 net/ipv4/inet_connection_sock.c:880 tcp_check_req+0x379/0xc70 net/ipv4/tcp_minisocks.c:665 tcp_v6_rcv+0x125b/0x1b20 net/ipv6/tcp_ipv6.c:1673 ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437 ip6_input_finish net/ipv6/ip6_input.c:482 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:468 [inline] ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79 NF_HOOK include/linux/netfilter.h:303 [inline] ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core net/core/dev.c:5452 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566 netif_receive_skb_internal net/core/dev.c:5652 [inline] netif_receive_skb+0x4a/0x310 net/core/dev.c:5711 tun_rx_batched+0x3bf/0x400 tun_get_user+0x1d24/0x22b0 drivers/net/tun.c:1997 tun_chr_write_iter+0x18e/0x240 drivers/net/tun.c:2043 call_write_iter include/linux/fs.h:1871 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x4ab/0x7d0 fs/read_write.c:584 ksys_write+0xeb/0x1a0 fs/read_write.c:637 __do_sys_write fs/read_write.c:649 [inline] __se_sys_write fs/read_write.c:646 [inline] __x64_sys_write+0x42/0x50 fs/read_write.c:646 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff8881362304bc of 4 bytes by task 32078 on cpu 0: tcp_make_synack+0x367/0xb40 net/ipv4/tcp_output.c:3663 tcp_v6_send_synack+0x72/0x420 net/ipv6/tcp_ipv6.c:544 tcp_conn_request+0x11a8/0x1560 net/ipv4/tcp_input.c:7059 tcp_v6_conn_request+0x13f/0x180 net/ipv6/tcp_ipv6.c:1175 tcp_rcv_state_process+0x156/0x1de0 net/ipv4/tcp_input.c:6494 tcp_v6_do_rcv+0x98a/0xb70 net/ipv6/tcp_ipv6.c:1509 tcp_v6_rcv+0x17b8/0x1b20 net/ipv6/tcp_ipv6.c:1735 ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437 ip6_input_finish net/ipv6/ip6_input.c:482 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:468 [inline] ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79 NF_HOOK include/linux/netfilter.h:303 [inline] ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core net/core/dev.c:5452 [inline] __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566 netif_receive_skb_internal net/core/dev.c:5652 [inline] netif_receive_skb+0x4a/0x310 net/core/dev.c:5711 tun_rx_batched+0x3bf/0x400 tun_get_user+0x1d24/0x22b0 drivers/net/tun.c:1997 tun_chr_write_iter+0x18e/0x240 drivers/net/tun.c:2043 call_write_iter include/linux/fs.h:1871 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x4ab/0x7d0 fs/read_write.c:584 ksys_write+0xeb/0x1a0 fs/read_write.c:637 __do_sys_write fs/read_write.c:649 [inline] __se_sys_write fs/read_write.c:646 [inline] __x64_sys_write+0x42/0x50 fs/read_write.c:646 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x91d25731 -> 0xe79325cd Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 32078 Comm: syz-executor.4 Not tainted 6.5.0-rc1-syzkaller-00033-geb26cbb1a754 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023 Fixes: 58d607d3e52f ("tcp: provide skb->hash to synack packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230717144445.653164-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18tcp: get rid of sysctl_tcp_adv_win_scaleEric Dumazet2-12/+18
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue: TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path. tcp_win_from_space() implementation is limited to few cases. For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers. For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory. For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes. This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio. This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18bpf: Drop useless btf_vmlinux in bpf_tcp_caGeliang Tang1-2/+0
The code using btf_vmlinux in bpf_tcp_ca is removed by the commit 9f0265e921de ("bpf: Require only one of cong_avoid() and cong_control() from a TCP CC") so drop this useless btf_vmlinux declaration. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/4d38da4eadaba476bd92ffcd7a5a03a5e28745c0.1689582557.git.geliang.tang@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-14gso: fix dodgy bit handling for GSO_UDP_L4Yan Zhai1-5/+11
Commit 1fd54773c267 ("udp: allow header check for dodgy GSO_UDP_L4 packets.") checks DODGY bit for UDP, but for packets that can be fed directly to the device after gso_segs reset, it actually falls through to fragmentation: https://lore.kernel.org/all/CAJPywTKDdjtwkLVUW6LRA2FU912qcDmQOQGt2WaDo28KzYDg+A@mail.gmail.com/ This change restores the expected behavior of GSO_UDP_L4 packets. Fixes: 1fd54773c267 ("udp: allow header check for dodgy GSO_UDP_L4 packets.") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-12tcp: add a scheduling point in established_get_first()Jian Wen1-0/+3
Kubernetes[1] is going to stick with /proc/net/tcp for a while. This commit reduces the scheduling latency introduced by established_get_first(), similar to commit acffb584cda7 ("net: diag: add a scheduling point in inet_diag_dump_icsk()"). In our environment, the scheduling latency affects the performance of latency-sensitive services like Redis. Changes in V2 : - call cond_resched() before checking if a bucket is empty as suggested by Eric Dumazet - removed the delay of synchronize_net() from the commit message [1] https://github.com/google/cadvisor/blob/v0.47.2/container/libcontainer/handler.go#L130 Signed-off-by: Jian Wen <wenjian1@xiaomi.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230711032405.3253025-1-wenjian1@xiaomi.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-03tcp: annotate data races in __tcp_oow_rate_limited()Eric Dumazet1-3/+9
request sockets are lockless, __tcp_oow_rate_limited() could be called on the same object from different cpus. This is harmless. Add READ_ONCE()/WRITE_ONCE() annotations to avoid a KCSAN report. Fixes: 4ce7e93cb3fe ("tcp: rate limit ACK sent by SYN_RECV request sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-24sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)David Howells7-100/+6
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24tcp_bpf, smc, tls, espintcp, siw: Reduce MSG_SENDPAGE_NOTLAST usageDavid Howells1-2/+3
As MSG_SENDPAGE_NOTLAST is being phased out along with sendpage(), don't use it further in than the sendpage methods, but rather translate it to MSG_MORE and use that instead. Signed-off-by: David Howells <dhowells@redhat.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: Bernard Metzler <bmt@zurich.ibm.com> cc: Jason Gunthorpe <jgg@ziepe.ca> cc: Leon Romanovsky <leon@kernel.org> cc: John Fastabend <john.fastabend@gmail.com> cc: Jakub Sitnicki <jakub@cloudflare.com> cc: David Ahern <dsahern@kernel.org> cc: Karsten Graul <kgraul@linux.ibm.com> cc: Wenjia Zhang <wenjia@linux.ibm.com> cc: Jan Karcher <jaka@linux.ibm.com> cc: "D. Wythe" <alibuda@linux.alibaba.com> cc: Tony Lu <tonylu@linux.alibaba.com> cc: Wen Gu <guwen@linux.alibaba.com> cc: Boris Pismenny <borisp@nvidia.com> cc: Steffen Klassert <steffen.klassert@secunet.com> cc: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20230623225513.2732256-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24net/tcp: optimise locking for blocking splicePavel Begunkov1-1/+1
Even when tcp_splice_read() reads all it was asked for, for blocking sockets it'll release and immediately regrab the socket lock, loop around and break on the while check. Check tss.len right after we adjust it, and return if we're done. That saves us one release_sock(); lock_sock(); pair per successful blocking splice read. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/80736a2cc6d478c383ea565ba825eaf4d1abd876.1687523671.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24Merge tag 'for-netdev' of ↵Jakub Kicinski1-10/+1
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-06-23 We've added 49 non-merge commits during the last 24 day(s) which contain a total of 70 files changed, 1935 insertions(+), 442 deletions(-). The main changes are: 1) Extend bpf_fib_lookup helper to allow passing the route table ID, from Louis DeLosSantos. 2) Fix regsafe() in verifier to call check_ids() for scalar registers, from Eduard Zingerman. 3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and() and a rework of bpf_cpumask_any*() kfuncs. Additionally, add selftests, from David Vernet. 4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings, from Gilad Sever. 5) Change bpf_link_put() to use workqueue unconditionally to fix it under PREEMPT_RT, from Sebastian Andrzej Siewior. 6) Follow-ups to address issues in the bpf_refcount shared ownership implementation, from Dave Marchevsky. 7) A few general refactorings to BPF map and program creation permissions checks which were part of the BPF token series, from Andrii Nakryiko. 8) Various fixes for benchmark framework and add a new benchmark for BPF memory allocator to BPF selftests, from Hou Tao. 9) Documentation improvements around iterators and trusted pointers, from Anton Protopopov. 10) Small cleanup in verifier to improve allocated object check, from Daniel T. Lee. 11) Improve performance of bpf_xdp_pointer() by avoiding access to shared_info when XDP packet does not have frags, from Jesper Dangaard Brouer. 12) Silence a harmless syzbot-reported warning in btf_type_id_size(), from Yonghong Song. 13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper, from Jarkko Sakkinen. 14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits) bpf, docs: Document existing macros instead of deprecated bpf, docs: BPF Iterator Document selftests/bpf: Fix compilation failure for prog vrf_socket_lookup selftests/bpf: Add vrf_socket_lookup tests bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint bpf: Factor out socket lookup functions for the TC hookpoint. selftests/bpf: Set the default value of consumer_cnt as 0 selftests/bpf: Ensure that next_cpu() returns a valid CPU number selftests/bpf: Output the correct error code for pthread APIs selftests/bpf: Use producer_cnt to allocate local counter array xsk: Remove unused inline function xsk_buff_discard() bpf: Keep BPF_PROG_LOAD permission checks clear of validations bpf: Centralize permissions checks for all BPF map types bpf: Inline map creation logic in map_create() function bpf: Move unprivileged checks into map_create() and bpf_prog_load() bpf: Remove in_atomic() from bpf_link_put(). selftests/bpf: Verify that check_ids() is used for scalars in regsafe() bpf: Verify scalar ids mapping in regsafe() using check_ids() selftests/bpf: Check if mark_chain_precision() follows scalar ids ... ==================== Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-22tcp: fix comment typoYueh-Shun Li1-1/+1
Spell "transmissions" properly. Found by searching for keyword "tranm". Signed-off-by: Yueh-Shun Li <shamrocklee@posteo.net> Link: https://lore.kernel.org/r/20230622012627.15050-6-shamrocklee@posteo.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-0/+4
Cross-merge networking fixes after downstream PR. Conflicts: tools/testing/selftests/net/fcnal-test.sh d7a2fc1437f7 ("selftests: net: fcnal-test: check if FIPS mode is enabled") dd017c72dde6 ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.") https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/ https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21inet: Cleanup on charging memory for newly accepted socketsAbel Wu1-7/+10
If there is no net-memcg associated with the sock, don't bother calculating its memory usage for charge. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Link: https://lore.kernel.org/r/20230620092712.16217-1-wuyun.abel@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20Merge tag 'ipsec-2023-06-20' of ↵David S. Miller2-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec ipsec-2023-06-20
2023-06-18tcp: Use per-vma locking for receive zerocopyArjun Roy1-8/+37
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode. Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation. Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock. However, with per-vma locking, both of these problems can be avoided. As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled. When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%. Signed-off-by: Arjun Roy <arjunroy@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>