summaryrefslogtreecommitdiff
path: root/net/core/xdp.c
AgeCommit message (Collapse)AuthorFilesLines
2024-07-14xdp: fix invalid wait context of page_pool_destroy()Taehee Yoo1-3/+1
If the driver uses a page pool, it creates a page pool with page_pool_create(). The reference count of page pool is 1 as default. A page pool will be destroyed only when a reference count reaches 0. page_pool_destroy() is used to destroy page pool, it decreases a reference count. When a page pool is destroyed, ->disconnect() is called, which is mem_allocator_disconnect(). This function internally acquires mutex_lock(). If the driver uses XDP, it registers a memory model with xdp_rxq_info_reg_mem_model(). The xdp_rxq_info_reg_mem_model() internally increases a page pool reference count if a memory model is a page pool. Now the reference count is 2. To destroy a page pool, the driver should call both page_pool_destroy() and xdp_unreg_mem_model(). The xdp_unreg_mem_model() internally calls page_pool_destroy(). Only page_pool_destroy() decreases a reference count. If a driver calls page_pool_destroy() then xdp_unreg_mem_model(), we will face an invalid wait context warning. Because xdp_unreg_mem_model() calls page_pool_destroy() with rcu_read_lock(). The page_pool_destroy() internally acquires mutex_lock(). Splat looks like: ============================= [ BUG: Invalid wait context ] 6.10.0-rc6+ #4 Tainted: G W ----------------------------- ethtool/1806 is trying to lock: ffffffff90387b90 (mem_id_lock){+.+.}-{4:4}, at: mem_allocator_disconnect+0x73/0x150 other info that might help us debug this: context-{5:5} 3 locks held by ethtool/1806: stack backtrace: CPU: 0 PID: 1806 Comm: ethtool Tainted: G W 6.10.0-rc6+ #4 f916f41f172891c800f2fed Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021 Call Trace: <TASK> dump_stack_lvl+0x7e/0xc0 __lock_acquire+0x1681/0x4de0 ? _printk+0x64/0xe0 ? __pfx_mark_lock.part.0+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 lock_acquire+0x1b3/0x580 ? mem_allocator_disconnect+0x73/0x150 ? __wake_up_klogd.part.0+0x16/0xc0 ? __pfx_lock_acquire+0x10/0x10 ? dump_stack_lvl+0x91/0xc0 __mutex_lock+0x15c/0x1690 ? mem_allocator_disconnect+0x73/0x150 ? __pfx_prb_read_valid+0x10/0x10 ? mem_allocator_disconnect+0x73/0x150 ? __pfx_llist_add_batch+0x10/0x10 ? console_unlock+0x193/0x1b0 ? lockdep_hardirqs_on+0xbe/0x140 ? __pfx___mutex_lock+0x10/0x10 ? tick_nohz_tick_stopped+0x16/0x90 ? __irq_work_queue_local+0x1e5/0x330 ? irq_work_queue+0x39/0x50 ? __wake_up_klogd.part.0+0x79/0xc0 ? mem_allocator_disconnect+0x73/0x150 mem_allocator_disconnect+0x73/0x150 ? __pfx_mem_allocator_disconnect+0x10/0x10 ? mark_held_locks+0xa5/0xf0 ? rcu_is_watching+0x11/0xb0 page_pool_release+0x36e/0x6d0 page_pool_destroy+0xd7/0x440 xdp_unreg_mem_model+0x1a7/0x2a0 ? __pfx_xdp_unreg_mem_model+0x10/0x10 ? kfree+0x125/0x370 ? bnxt_free_ring.isra.0+0x2eb/0x500 ? bnxt_free_mem+0x5ac/0x2500 xdp_rxq_info_unreg+0x4a/0xd0 bnxt_free_mem+0x1356/0x2500 bnxt_close_nic+0xf0/0x3b0 ? __pfx_bnxt_close_nic+0x10/0x10 ? ethnl_parse_bit+0x2c6/0x6d0 ? __pfx___nla_validate_parse+0x10/0x10 ? __pfx_ethnl_parse_bit+0x10/0x10 bnxt_set_features+0x2a8/0x3e0 __netdev_update_features+0x4dc/0x1370 ? ethnl_parse_bitset+0x4ff/0x750 ? __pfx_ethnl_parse_bitset+0x10/0x10 ? __pfx___netdev_update_features+0x10/0x10 ? mark_held_locks+0xa5/0xf0 ? _raw_spin_unlock_irqrestore+0x42/0x70 ? __pm_runtime_resume+0x7d/0x110 ethnl_set_features+0x32d/0xa20 To fix this problem, it uses rhashtable_lookup_fast() instead of rhashtable_lookup() with rcu_read_lock(). Using xa without rcu_read_lock() here is safe. xa is freed by __xdp_mem_allocator_rcu_free() and this is called by call_rcu() of mem_xa_remove(). The mem_xa_remove() is called by page_pool_destroy() if a reference count reaches 0. The xa is already protected by the reference count mechanism well in the control plane. So removing rcu_read_lock() for page_pool_destroy() is safe. Fixes: c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240712095116.3801586-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24xdp: Remove WARN() from __xdp_reg_mem_model()Daniil Dulov1-3/+1
syzkaller reports a warning in __xdp_reg_mem_model(). The warning occurs only if __mem_id_init_hash_table() returns an error. It returns the error in two cases: 1. memory allocation fails; 2. rhashtable_init() fails when some fields of rhashtable_params struct are not initialized properly. The second case cannot happen since there is a static const rhashtable_params struct with valid fields. So, warning is only triggered when there is a problem with memory allocation. Thus, there is no sense in using WARN() to handle this error and it can be safely removed. WARNING: CPU: 0 PID: 5065 at net/core/xdp.c:299 __xdp_reg_mem_model+0x2d9/0x650 net/core/xdp.c:299 CPU: 0 PID: 5065 Comm: syz-executor883 Not tainted 6.8.0-syzkaller-05271-gf99c5f563c17 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:__xdp_reg_mem_model+0x2d9/0x650 net/core/xdp.c:299 Call Trace: xdp_reg_mem_model+0x22/0x40 net/core/xdp.c:344 xdp_test_run_setup net/bpf/test_run.c:188 [inline] bpf_test_run_xdp_live+0x365/0x1e90 net/bpf/test_run.c:377 bpf_prog_test_run_xdp+0x813/0x11b0 net/bpf/test_run.c:1267 bpf_prog_test_run+0x33a/0x3b0 kernel/bpf/syscall.c:4240 __sys_bpf+0x48d/0x810 kernel/bpf/syscall.c:5649 __do_sys_bpf kernel/bpf/syscall.c:5738 [inline] __se_sys_bpf kernel/bpf/syscall.c:5736 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Found by Linux Verification Center (linuxtesting.org) with syzkaller. Fixes: 8d5d88527587 ("xdp: rhashtable with allocator ID to pointer mapping") Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/all/20240617162708.492159-1-d.dulov@aladdin.ru Link: https://lore.kernel.org/bpf/20240624080747.36858-1-d.dulov@aladdin.ru
2024-03-07net: move skbuff_cache(s) to net_hotdataEric Dumazet1-2/+3
skbuff_cache, skbuff_fclone_cache and skb_small_head_cache are used in rx/tx fast paths. Move them to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-02Merge tag 'for-netdev' of ↵Jakub Kicinski1-2/+2
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-02-29 We've added 119 non-merge commits during the last 32 day(s) which contain a total of 150 files changed, 3589 insertions(+), 995 deletions(-). The main changes are: 1) Extend the BPF verifier to enable static subprog calls in spin lock critical sections, from Kumar Kartikeya Dwivedi. 2) Fix confusing and incorrect inference of PTR_TO_CTX argument type in BPF global subprogs, from Andrii Nakryiko. 3) Larger batch of riscv BPF JIT improvements and enabling inlining of the bpf_kptr_xchg() for RV64, from Pu Lehui. 4) Allow skeleton users to change the values of the fields in struct_ops maps at runtime, from Kui-Feng Lee. 5) Extend the verifier's capabilities of tracking scalars when they are spilled to stack, especially when the spill or fill is narrowing, from Maxim Mikityanskiy & Eduard Zingerman. 6) Various BPF selftest improvements to fix errors under gcc BPF backend, from Jose E. Marchesi. 7) Avoid module loading failure when the module trying to register a struct_ops has its BTF section stripped, from Geliang Tang. 8) Annotate all kfuncs in .BTF_ids section which eventually allows for automatic kfunc prototype generation from bpftool, from Daniel Xu. 9) Several updates to the instruction-set.rst IETF standardization document, from Dave Thaler. 10) Shrink the size of struct bpf_map resp. bpf_array, from Alexei Starovoitov. 11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer, from Benjamin Tissoires. 12) Fix bpftool to be more portable to musl libc by using POSIX's basename(), from Arnaldo Carvalho de Melo. 13) Add libbpf support to gcc in CORE macro definitions, from Cupertino Miranda. 14) Remove a duplicate type check in perf_event_bpf_event, from Florian Lehner. 15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them with notrace correctly, from Yonghong Song. 16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible array to fix build warnings, from Kees Cook. 17) Fix resolve_btfids cross-compilation to non host-native endianness, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) selftests/bpf: Test if shadow types work correctly. bpftool: Add an example for struct_ops map and shadow type. bpftool: Generated shadow variables for struct_ops maps. libbpf: Convert st_ops->data to shadow type. libbpf: Set btf_value_type_id of struct bpf_map for struct_ops. bpf: Replace bpf_lpm_trie_key 0-length array with flexible array bpf, arm64: use bpf_prog_pack for memory management arm64: patching: implement text_poke API bpf, arm64: support exceptions arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT bpf: add is_async_callback_calling_insn() helper bpf: introduce in_sleepable() helper bpf: allow more maps in sleepable bpf programs selftests/bpf: Test case for lacking CFI stub functions. bpf: Check cfi_stubs before registering a struct_ops type. bpf: Clarify batch lookup/lookup_and_delete semantics bpf, docs: specify which BPF_ABS and BPF_IND fields were zero bpf, docs: Fix typos in instruction-set.rst selftests/bpf: update tcp_custom_syncookie to use scalar packet offset bpf: Shrink size of struct bpf_map/bpf_array. ... ==================== Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-31bpf: treewide: Annotate BPF kfuncs in BTFDaniel Xu1-2/+2
This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we now also return error inside kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as initcall machinery does not handle errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30xdp: Remove usage of the deprecated ida_simple_xx() APIChristophe JAILLET1-3/+3
ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_range() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/8e889d18a6c881b09db4650d4b30a62d76f4fe77.1705734073.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-13xdp: Add VLAN tag hintLarysa Zaremba1-0/+33
Implement functionality that enables drivers to expose VLAN tag to XDP code. VLAN tag is represented by 2 variables: - protocol ID, which is passed to bpf code in BE - VLAN TCI, in host byte order Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20231205210847.28460-10-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01bpf: Add __bpf_kfunc_{start,end}_defs macrosDave Marchevsky1-4/+2
BPF kfuncs are meant to be called from BPF programs. Accordingly, most kfuncs are not called from anywhere in the kernel, which the -Wmissing-prototypes warning is unhappy about. We've peppered __diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are defined in the codebase to suppress this warning. This patch adds two macros meant to bound one or many kfunc definitions. All existing kfunc definitions which use these __diag calls to suppress -Wmissing-prototypes are migrated to use the newly-introduced macros. A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the __bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier version of this patch [0] and another recent mailing list thread [1]. In the future we might need to ignore different warnings or do other kfunc-specific things. This change will make it easier to make such modifications for all kfunc defs. [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-15bpf: expose information about supported xdp metadata kfuncStanislav Fomichev1-2/+2
Add new xdp-rx-metadata-features member to netdev netlink which exports a bitmask of supported kfuncs. Most of the patch is autogenerated (headers), the only relevant part is netdev.yaml and the changes in netdev-genl.c to marshal into netlink. Example output on veth: $ ip link add veth0 type veth peer name veth1 # ifndex == 12 $ ./tools/net/ynl/samples/netdev 12 Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12 veth1[12] xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0 Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15bpf: make it easier to add new metadata kfuncStanislav Fomichev1-2/+2
No functional changes. Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc, move kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx. This way, any time new kfunc is added, we don't have to touch bpf_dev_bound_resolve_kfunc. Also document XDP_METADATA_KFUNC_xxx arguments since we now have more than two and it might be confusing what is what. Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-07page_pool: split types and declarations from page_pool.hYunsheng Lin1-1/+1
Split types and pure function declarations from page_pool.h and add them in page_page/types.h, so that C sources can include page_pool.h and headers should generally only include page_pool/types.h as suggested by jakub. Rename page_pool.h to page_pool/helpers.h to have both in one place. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com [Jakub: change microsoft/mana, fix kdoc paths in Documentation] Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-11xdp: use trusted arguments in XDP hints kfuncsLarysa Zaremba1-1/+1
Currently, verifier does not reject XDP programs that pass NULL pointer to hints functions. At the same time, this case is not handled in any driver implementation (including veth). For example, changing bpf_xdp_metadata_rx_timestamp(ctx, &timestamp); to bpf_xdp_metadata_rx_timestamp(ctx, NULL); in xdp_metadata test successfully crashes the system. Add KF_TRUSTED_ARGS flag to hints kfunc definitions, so driver code does not have to worry about getting invalid pointers. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Reported-by: Stanislav Fomichev <sdf@google.com> Closes: https://lore.kernel.org/bpf/ZKWo0BbpLfkZHbyE@google.com/ Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230711105930.29170-1-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13Daniel Borkmann says:Jakub Kicinski1-17/+2
==================== pull-request: bpf-next 2023-04-13 We've added 260 non-merge commits during the last 36 day(s) which contain a total of 356 files changed, 21786 insertions(+), 11275 deletions(-). The main changes are: 1) Rework BPF verifier log behavior and implement it as a rotating log by default with the option to retain old-style fixed log behavior, from Andrii Nakryiko. 2) Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params, from Christian Ehrig. 3) Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton, from Alexei Starovoitov. 4) Optimize hashmap lookups when key size is multiple of 4, from Anton Protopopov. 5) Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps, from David Vernet. 6) Add support for stashing local BPF kptr into a map value via bpf_kptr_xchg(). This is useful e.g. for rbtree node creation for new cgroups, from Dave Marchevsky. 7) Fix BTF handling of is_int_ptr to skip modifiers to work around tracing issues where a program cannot be attached, from Feng Zhou. 8) Migrate a big portion of test_verifier unit tests over to test_progs -a verifier_* via inline asm to ease {read,debug}ability, from Eduard Zingerman. 9) Several updates to the instruction-set.rst documentation which is subject to future IETF standardization (https://lwn.net/Articles/926882/), from Dave Thaler. 10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register known bits information propagation, from Daniel Borkmann. 11) Add skb bitfield compaction work related to BPF with the overall goal to make more of the sk_buff bits optional, from Jakub Kicinski. 12) BPF selftest cleanups for build id extraction which stand on its own from the upcoming integration work of build id into struct file object, from Jiri Olsa. 13) Add fixes and optimizations for xsk descriptor validation and several selftest improvements for xsk sockets, from Kal Conley. 14) Add BPF links for struct_ops and enable switching implementations of BPF TCP cong-ctls under a given name by replacing backing struct_ops map, from Kui-Feng Lee. 15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable offset stack read as earlier Spectre checks cover this, from Luis Gerhorst. 16) Fix issues in copy_from_user_nofault() for BPF and other tracers to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner and Alexei Starovoitov. 17) Add --json-summary option to test_progs in order for CI tooling to ease parsing of test results, from Manu Bretelle. 18) Batch of improvements and refactoring to prep for upcoming bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator, from Martin KaFai Lau. 19) Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations, from Quentin Monnet. 20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting the module name from BTF of the target and searching kallsyms of the correct module, from Viktor Malik. 21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better detect whether in particular jmp32 branches are taken, from Yonghong Song. 22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock. A built-in cc or one from a kernel module is already able to write to app_limited, from Yixin Shen. Conflicts: Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") 0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ include/net/ip_tunnels.h bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts") ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices") https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/ net/bpf/test_run.c e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") 294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES") https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/ ==================== Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13xdp: rss hash types representationJesper Dangaard Brouer1-1/+9
The RSS hash type specifies what portion of packet data NIC hardware used when calculating RSS hash value. The RSS types are focused on Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4 primarily TCP vs UDP, but some hardware supports SCTP. Hardware RSS types are differently encoded for each hardware NIC. Most hardware represent RSS hash type as a number. Determining L3 vs L4 often requires a mapping table as there often isn't a pattern or sorting according to ISO layer. The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that contains both BITs for the L3/L4 types, and combinations to be used by drivers for their mapping tables. The enum xdp_rss_type_bits get exposed to BPF via BTF, and it is up to the BPF-programmer to match using these defines. This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding a pointer value argument for provide the RSS hash type. Change signature for all xmo_rx_hash calls in drivers to make it compile. The RSS type implementations for each driver comes as separate patches. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-23Merge tag 'for-netdev' of ↵Jakub Kicinski1-2/+8
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-03-23 We've added 8 non-merge commits during the last 13 day(s) which contain a total of 21 files changed, 238 insertions(+), 161 deletions(-). The main changes are: 1) Fix verification issues in some BPF programs due to their stack usage patterns, from Eduard Zingerman. 2) Fix to add missing overflow checks in xdp_umem_reg and return an error in such case, from Kal Conley. 3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia. 4) Fix insufficient bpf_jit_limit default to avoid users running into hard to debug seccomp BPF errors, from Daniel Borkmann. 5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc to make it unambiguous from other errors, from Jesper Dangaard Brouer. 6) Two BPF selftest fixes to address compilation errors from recent changes in kernel structures, from Alexei Starovoitov. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support bpf: Adjust insufficient default bpf_jit_limit xsk: Add missing overflow check in xdp_umem_reg selftests/bpf: Fix progs/test_deny_namespace.c issues. selftests/bpf: Fix progs/find_vma_fail1.c build error. libbpf: Revert poisoning of strlcpy selftests/bpf: Tests for uninitialized stack reads bpf: Allow reads from uninit stack ==================== Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-22xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver supportJesper Dangaard Brouer1-2/+8
When driver doesn't implement a bpf_xdp_metadata kfunc the fallback implementation returns EOPNOTSUPP, which indicate device driver doesn't implement this kfunc. Currently many drivers also return EOPNOTSUPP when the hint isn't available, which is ambiguous from an API point of view. Instead change drivers to return ENODATA in these cases. There can be natural cases why a driver doesn't provide any hardware info for a specific hint, even on a frame to frame basis (e.g. PTP). Lets keep these cases as separate return codes. When describing the return values, adjust the function kernel-doc layout to get proper rendering for the return values. Fixes: ab46182d0dcb ("net/mlx4_en: Support RX XDP metadata") Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata") Fixes: 306531f0249f ("veth: Support RX XDP metadata") Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/167940675120.2718408.8176058626864184420.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-16net: xdp: don't call notifiers during driver initJakub Kicinski1-1/+3
Drivers will commonly perform feature setting during init, if they use the xdp_set_features_flag() helper they'll likely run into an ASSERT_RTNL() inside call_netdevice_notifiers_info(). Don't call the notifier until the device is actually registered. Nothing should be tracking the device until its registered and after its unregistration has started. Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Link: https://lore.kernel.org/r/20230316220234.598091-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-14xdp: remove unused {__,}xdp_release_frame()Alexander Lobakin1-15/+0
__xdp_build_skb_from_frame() was the last user of {__,}xdp_release_frame(), which detaches pages from the page_pool. All the consumers now recycle Page Pool skbs and page, except mlx5, stmmac and tsnep drivers, which use page_pool_release_page() directly (might change one day). It's safe to assume this functionality is not needed anymore and can be removed (in favor of recycling). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-5-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-14xdp: recycle Page Pool backed skbs built from XDP framesAlexander Lobakin1-2/+2
__xdp_build_skb_from_frame() state(d): /* Until page_pool get SKB return path, release DMA here */ Page Pool got skb pages recycling in April 2021, but missed this function. xdp_release_frame() is relevant only for Page Pool backed frames and it detaches the page from the corresponding page_pool in order to make it freeable via page_frag_free(). It can instead just mark the output skb as eligible for recycling if the frame is backed by a pp. No change for other memory model types (the same condition check as before). cpumap redirect and veth on Page Pool drivers now become zero-alloc (or almost). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-4-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10xdp: add xdp_set_features_flag utility routineLorenzo Bianconi1-7/+19
Introduce xdp_set_features_flag utility routine in order to update dynamically xdp_features according to the dynamic hw configuration via ethtool (e.g. changing number of hw rx/tx queues). Add xdp_clear_features_flag() in order to clear all xdp_feature flag. Reviewed-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10Daniel Borkmann says:Jakub Kicinski1-2/+21
==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit 5b246e533d01 ("ice: split probe into smaller functions") from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev() is otherwise there a 2nd time, and add XDP features to the existing ice_cfg_netdev() one: [...] ice_set_netdev_features(netdev); netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | NETDEV_XDP_ACT_XSK_ZEROCOPY; ice_set_ops(netdev); [...] Stephen's merge conflict mail: https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/ The main changes are: 1) Add support for BPF trampoline on s390x which finally allows to remove many test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich. 2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski. 3) Add capability to export the XDP features supported by the NIC. Along with that, add a XDP compliance test tool, from Lorenzo Bianconi & Marek Majtyka. 4) Add __bpf_kfunc tag for marking kernel functions as kfuncs, from David Vernet. 5) Add a deep dive documentation about the verifier's register liveness tracking algorithm, from Eduard Zingerman. 6) Fix and follow-up cleanups for resolve_btfids to be compiled as a host program to avoid cross compile issues, from Jiri Olsa & Ian Rogers. 7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted when testing on different NICs, from Jesper Dangaard Brouer. 8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang. 9) Extend libbpf to add an option for when the perf buffer should wake up, from Jon Doron. 10) Follow-up fix on xdp_metadata selftest to just consume on TX completion, from Stanislav Fomichev. 11) Extend the kfuncs.rst document with description on kfunc lifecycle & stability expectations, from David Vernet. 12) Fix bpftool prog profile to skip attaching to offline CPUs, from Tonghao Zhang. ==================== Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10net: skbuff: drop the word head from skb cacheJakub Kicinski1-3/+2
skbuff_head_cache is misnamed (perhaps for historical reasons?) because it does not hold heads. Head is the buffer which skb->data points to, and also where shinfo lives. struct sk_buff is a metadata structure, not the head. Eric recently added skb_small_head_cache (which allocates actual head buffers), let that serve as an excuse to finally clean this up :) Leave the user-space visible name intact, it could possibly be uAPI. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-02drivers: net: turn on XDP featuresMarek Majtyka1-0/+18
A summary of the flags being set for various drivers is given below. Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features that can be turned off and on at runtime. This means that these flags may be set and unset under RTNL lock protection by the driver. Hence, READ_ONCE must be used by code loading the flag value. Also, these flags are not used for synchronization against the availability of XDP resources on a device. It is merely a hint, and hence the read may race with the actual teardown of XDP resources on the device. This may change in the future, e.g. operations taking a reference on the XDP resources of the driver, and in turn inhibiting turning off this flag. However, for now, it can only be used as a hint to check whether device supports becoming a redirection target. Turn 'hw-offload' feature flag on for: - netronome (nfp) - netdevsim. Turn 'native' and 'zerocopy' features flags on for: - intel (i40e, ice, ixgbe, igc) - mellanox (mlx5). - stmmac - netronome (nfp) Turn 'native' features flags on for: - amazon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2, enetc) - funeth - intel (igb) - marvell (mvneta, mvpp2, octeontx2) - mellanox (mlx4) - mtk_eth_soc - qlogic (qede) - sfc - socionext (netsec) - ti (cpsw) - tap - tsnep - veth - xen - virtio_net. Turn 'basic' (tx, pass, aborted and drop) features flags on for: - netronome (nfp) - cavium (thunder) - hyperv. Turn 'redirect_target' feature flag on for: - amanzon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2) - intel (i40e, ice, igb, ixgbe) - ti (cpsw) - marvell (mvneta, mvpp2) - sfc - socionext (netsec) - qlogic (qede) - mellanox (mlx5) - tap - veth - virtio_net - xen Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Marek Majtyka <alardam@gmail.com> Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-02bpf: Add __bpf_kfunc tag to all kfuncsDavid Vernet1-2/+3
Now that we have the __bpf_kfunc tag, we should use add it to all existing kfuncs to ensure that they'll never be elided in LTO builds. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
2023-01-23bpf: XDP metadata RX kfuncsStanislav Fomichev1-0/+64
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not supported by the target device, the default implementation is called instead. The verifier, at load time, replaces a call to the generic kfunc with a call to the per-device one. Per-device kfunc pointers are stored in separate struct xdp_metadata_ops. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-26xdp: improve page_pool xdp_return performanceJesper Dangaard Brouer1-6/+4
During LPC2022 I meetup with my page_pool co-maintainer Ilias. When discussing page_pool code we realised/remembered certain optimizations had not been fully utilised. Since commit c07aea3ef4d4 ("mm: add a signature in struct page") struct page have a direct pointer to the page_pool object this page was allocated from. Thus, with this info it is possible to skip the rhashtable_lookup to find the page_pool object in __xdp_return(). The rcu_read_lock can be removed as it was tied to xdp_mem_allocator. The page_pool object is still safe to access as it tracks inflight pages and (potentially) schedules final release from a work queue. Created a micro benchmark of XDP redirecting from mlx5 into veth with XDP_DROP bpf-prog on the peer veth device. This increased performance 6.5% from approx 8.45Mpps to 9Mpps corresponding to using 7 nanosec (27 cycles at 3.8GHz) less per packet. Suggested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/166377993287.1737053.10258297257583703949.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-22Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-0/+1
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-03-21 v2 We've added 137 non-merge commits during the last 17 day(s) which contain a total of 143 files changed, 7123 insertions(+), 1092 deletions(-). The main changes are: 1) Custom SEC() handling in libbpf, from Andrii. 2) subskeleton support, from Delyan. 3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao. 4) Fix net.core.bpf_jit_harden race, from Hou. 5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub. 6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami. The arch specific bits will come later. 7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri. 8) Enable non-atomic allocations in local storage, from Joanne. 9) Various var_off ptr_to_btf_id fixed, from Kumar. 10) bpf_ima_file_hash helper, from Roberto. 11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits) selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" bpftool: Fix a bug in subskeleton code generation bpf: Fix bpf_prog_pack when PMU_SIZE is not defined bpf: Fix bpf_prog_pack for multi-node setup bpf: Fix warning for cast from restricted gfp_t in verifier bpf, arm: Fix various typos in comments libbpf: Close fd in bpf_object__reuse_map bpftool: Fix print error when show bpf map bpf: Fix kprobe_multi return probe backtrace Revert "bpf: Add support to inline bpf_get_func_ip helper on x86" bpf: Simplify check in btf_parse_hdr() selftests/bpf/test_lirc_mode2.sh: Exit with proper code bpf: Check for NULL return from bpf_get_btf_vmlinux selftests/bpf: Test skipping stacktrace bpf: Adjust BPF stack helper functions to accommodate skip > 0 bpf: Select proper size for bpf_prog_pack ... ==================== Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-17veth: Rework veth_xdp_rcv_skb in order to accept non-linear skbLorenzo Bianconi1-0/+1
Introduce veth_convert_skb_to_xdp_buff routine in order to convert a non-linear skb into a xdp buffer. If the received skb is cloned or shared, veth_convert_skb_to_xdp_buff will copy it in a new skb composed by order-0 pages for the linear and the fragmented area. Moreover veth_convert_skb_to_xdp_buff guarantees we have enough headroom for xdp. This is a preliminary patch to allow attaching xdp programs with frags support on veth devices. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/8d228b106bc1903571afd1d77e797bffe9a5ea7c.1646989407.git.lorenzo@kernel.org
2022-03-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+2
net/dsa/dsa2.c commit afb3cc1a397d ("net: dsa: unlock the rtnl_mutex when dsa_master_setup() fails") commit e83d56537859 ("net: dsa: replay master state events in dsa_tree_{setup,teardown}_master") https://lore.kernel.org/all/20220307101436.7ae87da0@canb.auug.org.au/ drivers/net/ethernet/intel/ice/ice.h commit 97b0129146b1 ("ice: Fix error with handling of bonding MTU") commit 43113ff73453 ("ice: add TTY for GNSS module for E810T device") https://lore.kernel.org/all/20220310112843.3233bcf1@canb.auug.org.au/ drivers/staging/gdm724x/gdm_lte.c commit fc7f750dc9d1 ("staging: gdm724x: fix use after free in gdm_lte_rx()") commit 4bcc4249b4cf ("staging: Use netif_rx().") https://lore.kernel.org/all/20220308111043.1018a59d@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10xdp: xdp_mem_allocator can be NULL in trace_mem_connect().Sebastian Andrzej Siewior1-1/+2
Since the commit mentioned below __xdp_reg_mem_model() can return a NULL pointer. This pointer is dereferenced in trace_mem_connect() which leads to segfault. The trace points (mem_connect + mem_disconnect) were put in place to pair connect/disconnect using the IDs. The ID is only assigned if __xdp_reg_mem_model() does not return NULL. That connect trace point is of no use if there is no ID. Skip that connect trace point if xdp_alloc is NULL. [ Toke Høiland-Jørgensen delivered the reasoning for skipping the trace point ] Fixes: 4a48ef70b93b8 ("xdp: Allow registering memory model without rxq reference") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/YikmmXsffE+QajTB@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-21bpf: add frags support to the bpf_xdp_adjust_tail() APIEelco Chaudron1-5/+7
This change adds support for tail growing and shrinking for XDP frags. When called on a non-linear packet with a grow request, it will work on the last fragment of the packet. So the maximum grow size is the last fragments tailroom, i.e. no new buffer will be allocated. A XDP frags capable driver is expected to set frag_size in xdp_rxq_info data structure to notify the XDP core the fragment size. frag_size set to 0 is interpreted by the XDP core as tail growing is not allowed. Introduce __xdp_rxq_info_reg utility routine to initialize frag_size field. When shrinking, it will work from the last fragment, all the way down to the base buffer depending on the shrinking size. It's important to mention that once you shrink down the fragment(s) are freed, so you can not grow again to the original size. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/eabda3485dda4f2f158b477729337327e609461d.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-21xdp: add frags support to xdp_return_{buff/frame}Lorenzo Bianconi1-1/+53
Take into account if the received xdp_buff/xdp_frame is non-linear recycling/returning the frame memory to the allocator or into xdp_frame_bulk. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/a961069febc868508ce1bdf5e53a343eb4e57cb2.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-21net: xdp: add xdp_update_skb_shared_info utility routineLorenzo Bianconi1-0/+12
Introduce xdp_update_skb_shared_info routine to update frags array metadata in skb_shared_info data structure converting to a skb from a xdp_buff or xdp_frame. According to the current skb_shared_info architecture in xdp_frame/xdp_buff and to the xdp frags support, there is no need to run skb_add_rx_frag() and reset frags array converting the buffer to a skb since the frag array will be in the same position for xdp_buff/xdp_frame and for the skb, we just need to update memory metadata. Introduce XDP_FLAGS_PF_MEMALLOC flag in xdp_buff_flags in order to mark the xdp_buff or xdp_frame as under memory-pressure if pages of the frags array are under memory pressure. Doing so we can avoid looping over all fragments in xdp_update_skb_shared_info routine. The driver is expected to set the flag constructing the xdp_buffer using xdp_buff_set_frag_pfmemalloc utility routine. Rely on xdp_update_skb_shared_info in __xdp_build_skb_from_frame routine converting the non-linear xdp_frame to a skb after performing a XDP_REDIRECT. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/bfd23fb8a8d7438724f7819c567cdf99ffd6226f.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-05page_pool: Store the XDP mem idToke Høiland-Jørgensen1-1/+1
Store the XDP mem ID inside the page_pool struct so it can be retrieved later for use in bpf_prog_run(). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com
2022-01-05xdp: Allow registering memory model without rxq referenceToke Høiland-Jørgensen1-30/+62
The functions that register an XDP memory model take a struct xdp_rxq as parameter, but the RXQ is not actually used for anything other than pulling out the struct xdp_mem_info that it embeds. So refactor the register functions and export variants that just take a pointer to the xdp_mem_info. This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a page_pool instance that is not connected to any network device. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com
2021-12-18xdp: move the if dev statements to the firstYajun Deng1-5/+5
The xdp_rxq_info_unreg() called by xdp_rxq_info_reg() is meaningless when dev is NULL, so move the if dev statements to the first. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27xdp: Remove redundant warningYajun Deng1-2/+0
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is redundant. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20211027013856.1866-1-yajun.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-28xdp: Move the rxq_info.mem clearing to unreg_mem_model()Jakub Kicinski1-5/+6
xdp_rxq_info_unreg() implicitly calls xdp_rxq_info_unreg_mem_model(). This may well be confusing to the driver authors, and lead to double free if they call xdp_rxq_info_unreg_mem_model() before xdp_rxq_info_unreg() (when mem model type == MEM_TYPE_PAGE_POOL). In fact error path of mvpp2_rxq_init() seems to currently do exactly that. The double free will result in refcount underflow in page_pool_destroy(). Make the interface a little more programmer friendly by clearing type and id so that xdp_rxq_info_unreg_mem_model() can be called multiple times. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210625221612.2637086-1-kuba@kernel.org
2021-05-26xdp: Extend xdp_redirect_map with broadcast supportHangbin Liu1-0/+28
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to extend xdp_redirect_map for broadcast support. With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be excluded when do broadcasting. When getting the devices in dev hash map via dev_map_hash_get_next_key(), there is a possibility that we fall back to the first key when a device was removed. This will duplicate packets on some interfaces. So just walk the whole buckets to avoid this issue. For dev array map, we also walk the whole map to find valid interfaces. Function bpf_clear_redirect_map() was removed in commit ee75aef23afe ("bpf, xdp: Restructure redirect actions"). Add it back as we need to use ri->map again. With test topology: +-------------------+ +-------------------+ | Host A (i40e 10G) | ---------- | eno1(i40e 10G) | +-------------------+ | | | Host B | +-------------------+ | | | Host C (i40e 10G) | ---------- | eno2(i40e 10G) | +-------------------+ | | | +------+ | | veth0 -- | Peer | | | veth1 -- | | | | veth2 -- | NS | | | +------+ | +-------------------+ On Host A: # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64 On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory): Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing. All the veth peers in the NS have a XDP_DROP program loaded. The forward_map max_entries in xdp_redirect_map_multi is modify to 4. Testing the performance impact on the regular xdp_redirect path with and without patch (to check impact of additional check for broadcast mode): 5.12 rc4 | redirect_map i40e->i40e | 2.0M | 9.7M 5.12 rc4 | redirect_map i40e->veth | 1.7M | 11.8M 5.12 rc4 + patch | redirect_map i40e->i40e | 2.0M | 9.6M 5.12 rc4 + patch | redirect_map i40e->veth | 1.7M | 11.7M Testing the performance when cloning packets with the redirect_map_multi test, using a redirect map size of 4, filled with 1-3 devices: 5.12 rc4 + patch | redirect_map multi i40e->veth (x1) | 1.7M | 11.4M 5.12 rc4 + patch | redirect_map multi i40e->veth (x2) | 1.1M | 4.3M 5.12 rc4 + patch | redirect_map multi i40e->veth (x3) | 0.8M | 2.6M Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
2021-03-31xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory modelOng Boon Leong1-1/+2
xdp_return_frame() may be called outside of NAPI context to return xdpf back to page_pool. xdp_return_frame() calls __xdp_return() with napi_direct = false. For page_pool memory model, __xdp_return() calls xdp_return_frame_no_direct() unconditionally and below false negative kernel BUG throw happened under preempt-rt build: [ 430.450355] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3884 [ 430.451678] caller is __xdp_return+0x1ff/0x2e0 [ 430.452111] CPU: 0 PID: 3884 Comm: modprobe Tainted: G U E 5.12.0-rc2+ #45 Changes in v2: - This patch fixes the issue by making xdp_return_frame_no_direct() is only called if napi_direct = true, as recommended for better by Jesper Dangaard Brouer. Thanks! Fixes: 2539650fadbf ("xdp: Helpers for disabling napi_direct of xdp_return_frame") Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-04net, veth: Alloc skb in bulk for ndo_xdp_xmitLorenzo Bianconi1-0/+11
Split ndo_xdp_xmit and ndo_start_xmit use cases in veth_xdp_rcv routine in order to alloc skbs in bulk for XDP_PASS verdict. Introduce xdp_alloc_skb_bulk utility routine to alloc skb bulk list. The proposed approach has been tested in the following scenario: eth (ixgbe) --> XDP_REDIRECT --> veth0 --> (remote-ns) veth1 --> XDP_PASS XDP_REDIRECT: xdp_redirect_map bpf sample XDP_PASS: xdp_rxq_info bpf sample traffic generator: pkt_gen sending udp traffic on a remote device bpf-next master: ~3.64Mpps bpf-next + skb bulking allocation: ~3.79Mpps Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/a14a30d3c06fff24e13f836c733d80efc0bd6eb5.1611957532.git.lorenzo@kernel.org
2021-01-20net, xdp: Introduce xdp_build_skb_from_frame utility routineLorenzo Bianconi1-0/+15
Introduce xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Respect to __xdp_build_skb_from_frame, xdp_build_skb_from_frame will allocate the skb object. Rely on xdp_build_skb_from_frame in veth driver. Introduce missing xdp metadata support in veth_xdp_rcv_one routine. Add missing metadata support in veth_xdp_rcv_one(). Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-20net, xdp: Introduce __xdp_build_skb_from_frame utility routineLorenzo Bianconi1-0/+44
Introduce __xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in cpumap code. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-20/+11
xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09xdp: Remove the xdp_attachment_flags_ok() callbackToke Høiland-Jørgensen1-12/+0
Since commit 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device"), the XDP program attachment info is now maintained in the core code. This interacts badly with the xdp_attachment_flags_ok() check that prevents unloading an XDP program with different load flags than it was loaded with. In practice, two kinds of failures are seen: - An XDP program loaded without specifying a mode (and which then ends up in driver mode) cannot be unloaded if the program mode is specified on unload. - The dev_xdp_uninstall() hook always calls the driver callback with the mode set to the type of the program but an empty flags argument, which means the flags_ok() check prevents the program from being removed, leading to bpf prog reference leaks. The original reason this check was added was to avoid ambiguity when multiple programs were loaded. With the way the checks are done in the core now, this is quite simple to enforce in the core code, so let's add a check there and get rid of the xdp_attachment_flags_ok() callback entirely. Fixes: 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
2020-12-01xsk: Propagate napi_id to XDP socket Rx pathBjörn Töpel1-1/+2
Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-11-30xdp: Handle MEM_TYPE_XSK_BUFF_POOL correctly in xdp_return_buff()Björn Töpel1-7/+10
It turns out that it does exist a path where xdp_return_buff() is being passed an XDP buffer of type MEM_TYPE_XSK_BUFF_POOL. This path is when AF_XDP zero-copy mode is enabled, and a buffer is redirected to a DEVMAP with an attached XDP program that drops the buffer. This change simply puts the handling of MEM_TYPE_XSK_BUFF_POOL back into xdp_return_buff(). Fixes: 82c41671ca4f ("xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}") Reported-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maxim Mikityanskiy <maximmi@nvidia.com> Link: https://lore.kernel.org/bpf/20201127171726.123627-1-bjorn.topel@gmail.com
2020-11-14net: page_pool: Add bulk support for ptr_ringLorenzo Bianconi1-7/+2
Introduce the capability to batch page_pool ptr_ring refill since it is usually run inside the driver NAPI tx completion loop. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/bpf/08dd249c9522c001313f520796faa777c4089e1c.1605267335.git.lorenzo@kernel.org
2020-11-14net: xdp: Introduce bulking for xdp tx return pathLorenzo Bianconi1-0/+59
XDP bulk APIs introduce a defer/flush mechanism to return pages belonging to the same xdp_mem_allocator object (identified via the mem.id field) in bulk to optimize I-cache and D-cache since xdp_return_frame is usually run inside the driver NAPI tx completion loop. The bulk queue size is set to 16 to be aligned to how XDP_REDIRECT bulking works. The bulk is flushed when it is full or when mem.id changes. xdp_frame_bulk is usually stored/allocated on the function call-stack to avoid locking penalties. Current implementation considers only page_pool memory model. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/bpf/e190c03eac71b20c8407ae0fc2c399eda7835f49.1605267335.git.lorenzo@kernel.org
2020-07-25bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commandsAndrii Nakryiko1-9/+0
Now that BPF program/link management is centralized in generic net_device code, kernel code never queries program id from drivers, so XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary. This patch removes all the implementations of those commands in kernel, along the xdp_attachment_query(). This patch was compile-tested on allyesconfig. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com