summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-09-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller10-168/+733
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 9 day(s) which contain a total of 79 files changed, 5275 insertions(+), 600 deletions(-). The main changes are: 1) Basic BTF validation in libbpf, from Andrii Nakryiko. 2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi. 3) next_thread cleanups, from Oleg Nesterov. 4) Add mcpu=v4 support to arm32, from Puranjay Mohan. 5) Add support for __percpu pointers in bpf progs, from Yonghong Song. 6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang. 7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman", Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle), Song Liu, Stanislav Fomichev, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-16bpf: Fix kfunc callback register type handlingKumar Kartikeya Dwivedi1-0/+4
The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg type before using reg->subprogno. This can accidently permit invalid pointers from being passed into callback helpers (e.g. silently from different paths). Likewise, reg->subprogno from the per-register type union may not be meaningful either. We need to reject any other type except PTR_TO_FUNC. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Fixes: 5d92ddc3de1b ("bpf: Add callback validation to kfunc verifier logic") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Disallow fentry/fexit/freplace for exception callbacksKumar Kartikeya Dwivedi2-0/+7
During testing, it was discovered that extensions to exception callbacks had no checks, upon running a testcase, the kernel ended up running off the end of a program having final call as bpf_throw, and hitting int3 instructions. The reason is that while the default exception callback would have reset the stack frame to return back to the main program's caller, the replacing extension program will simply return back to bpf_throw, which will instead return back to the program and the program will continue execution, now in an undefined state where anything could happen. The way to support extensions to an exception callback would be to mark the BPF_PROG_TYPE_EXT main subprog as an exception_cb, and prevent it from calling bpf_throw. This would make the JIT produce a prologue that restores saved registers and reset the stack frame. But let's not do that until there is a concrete use case for this, and simply disallow this for now. Similar issues will exist for fentry and fexit cases, where trampoline saves data on the stack when invoking exception callback, which however will then end up resetting the stack frame, and on return, the fexit program will never will invoked as the return address points to the main program's caller in the kernel. Instead of additional complexity and back and forth between the two stacks to enable such a use case, simply forbid it. One key point here to note is that currently X86_TAIL_CALL_OFFSET didn't require any modifications, even though we emit instructions before the corresponding endbr64 instruction. This is because we ensure that a main subprog never serves as an exception callback, and therefore the exception callback (which will be a global subprog) can never serve as the tail call target, eliminating any discrepancies. However, once we support a BPF_PROG_TYPE_EXT to also act as an exception callback, it will end up requiring change to the tail call offset to account for the extra instructions. For simplicitly, tail calls could be disabled for such targets. Noting the above, it appears better to wait for a concrete use case before choosing to permit extension programs to replace exception callbacks. As a precaution, we disable fentry and fexit for exception callbacks as well. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Detect IP == ksym.end as part of BPF programKumar Kartikeya Dwivedi1-1/+5
Now that bpf_throw kfunc is the first such call instruction that has noreturn semantics within the verifier, this also kicks in dead code elimination in unprecedented ways. For one, any instruction following a bpf_throw call will never be marked as seen. Moreover, if a callchain ends up throwing, any instructions after the call instruction to the eventually throwing subprog in callers will also never be marked as seen. The tempting way to fix this would be to emit extra 'int3' instructions which bump the jited_len of a program, and ensure that during runtime when a program throws, we can discover its boundaries even if the call instruction to bpf_throw (or to subprogs that always throw) is emitted as the final instruction in the program. An example of such a program would be this: do_something(): ... r0 = 0 exit foo(): r1 = 0 call bpf_throw r0 = 0 exit bar(cond): if r1 != 0 goto pc+2 call do_something exit call foo r0 = 0 // Never seen by verifier exit // main(ctx): r1 = ... call bar r0 = 0 exit Here, if we do end up throwing, the stacktrace would be the following: bpf_throw foo bar main In bar, the final instruction emitted will be the call to foo, as such, the return address will be the subsequent instruction (which the JIT emits as int3 on x86). This will end up lying outside the jited_len of the program, thus, when unwinding, we will fail to discover the return address as belonging to any program and end up in a panic due to the unreliable stack unwinding of BPF programs that we never expect. To remedy this case, make bpf_prog_ksym_find treat IP == ksym.end as part of the BPF program, so that is_bpf_text_address returns true when such a case occurs, and we are able to unwind reliably when the final instruction ends up being a call instruction. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Prevent KASAN false positive with bpf_throwKumar Kartikeya Dwivedi1-0/+6
The KASAN stack instrumentation when CONFIG_KASAN_STACK is true poisons the stack of a function when it is entered and unpoisons it when leaving. However, in the case of bpf_throw, we will never return as we switch our stack frame to the BPF exception callback. Later, this discrepancy will lead to confusing KASAN splats when kernel resumes execution on return from the BPF program. Fix this by unpoisoning everything below the stack pointer of the BPF program, which should cover the range that would not be unpoisoned. An example splat is below: BUG: KASAN: stack-out-of-bounds in stack_trace_consume_entry+0x14e/0x170 Write of size 8 at addr ffffc900013af958 by task test_progs/227 CPU: 0 PID: 227 Comm: test_progs Not tainted 6.5.0-rc2-g43f1c6c9052a-dirty #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-2.fc39 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4a/0x80 print_report+0xcf/0x670 ? arch_stack_walk+0x79/0x100 kasan_report+0xda/0x110 ? stack_trace_consume_entry+0x14e/0x170 ? stack_trace_consume_entry+0x14e/0x170 ? __pfx_stack_trace_consume_entry+0x10/0x10 stack_trace_consume_entry+0x14e/0x170 ? __sys_bpf+0xf2e/0x41b0 arch_stack_walk+0x8b/0x100 ? __sys_bpf+0xf2e/0x41b0 ? bpf_prog_test_run_skb+0x341/0x1c70 ? bpf_prog_test_run_skb+0x341/0x1c70 stack_trace_save+0x9b/0xd0 ? __pfx_stack_trace_save+0x10/0x10 ? __kasan_slab_free+0x109/0x180 ? bpf_prog_test_run_skb+0x341/0x1c70 ? __sys_bpf+0xf2e/0x41b0 ? __x64_sys_bpf+0x78/0xc0 ? do_syscall_64+0x3c/0x90 ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 kasan_save_stack+0x33/0x60 ? kasan_save_stack+0x33/0x60 ? kasan_set_track+0x25/0x30 ? kasan_save_free_info+0x2b/0x50 ? __kasan_slab_free+0x109/0x180 ? kmem_cache_free+0x191/0x460 ? bpf_prog_test_run_skb+0x341/0x1c70 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2b/0x50 __kasan_slab_free+0x109/0x180 kmem_cache_free+0x191/0x460 bpf_prog_test_run_skb+0x341/0x1c70 ? __pfx_bpf_prog_test_run_skb+0x10/0x10 ? __fget_light+0x51/0x220 __sys_bpf+0xf2e/0x41b0 ? __might_fault+0xa2/0x170 ? __pfx___sys_bpf+0x10/0x10 ? lock_release+0x1de/0x620 ? __might_fault+0xcd/0x170 ? __pfx_lock_release+0x10/0x10 ? __pfx_blkcg_maybe_throttle_current+0x10/0x10 __x64_sys_bpf+0x78/0xc0 ? syscall_enter_from_user_mode+0x20/0x50 do_syscall_64+0x3c/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f0fbb38880d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 45 12 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe13907de8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141 RAX: ffffffffffffffda RBX: 00007ffe13908708 RCX: 00007f0fbb38880d RDX: 0000000000000050 RSI: 00007ffe13907e20 RDI: 000000000000000a RBP: 00007ffe13907e00 R08: 0000000000000000 R09: 00007ffe13907e20 R10: 0000000000000064 R11: 0000000000000206 R12: 0000000000000003 R13: 0000000000000000 R14: 00007f0fbb532000 R15: 0000000000cfbd90 </TASK> The buggy address belongs to stack of task test_progs/227 KASAN internal error: frame info validation failed; invalid marker: 0 The buggy address belongs to the virtual mapping at [ffffc900013a8000, ffffc900013b1000) created by: kernel_clone+0xcd/0x600 The buggy address belongs to the physical page: page:00000000b70f4332 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11418f flags: 0x2fffe0000000000(node=0|zone=2|lastcpupid=0x7fff) page_type: 0xffffffff() raw: 02fffe0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffc900013af800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900013af880: 00 00 00 f1 f1 f1 f1 00 00 00 f3 f3 f3 f3 f3 00 >ffffc900013af900: 00 00 00 00 00 00 00 00 00 00 00 f1 00 00 00 00 ^ ffffc900013af980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900013afa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Disabling lock debugging due to kernel taint Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Treat first argument as return value for bpf_throwKumar Kartikeya Dwivedi1-13/+24
In case of the default exception callback, change the behavior of bpf_throw, where the passed cookie value is no longer ignored, but is instead the return value of the default exception callback. As such, we need to place restrictions on the value being passed into bpf_throw in such a case, only allowing those permitted by the check_return_code function. Thus, bpf_throw can now control the return value of the program from each call site without having the user install a custom exception callback just to override the return value when an exception is thrown. We also modify the hidden subprog instructions to now move BPF_REG_1 to BPF_REG_0, so as to set the return value before exit in the default callback. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Perform CFG walk for exception callbackKumar Kartikeya Dwivedi1-2/+13
Since exception callbacks are not referenced using bpf_pseudo_func and bpf_pseudo_call instructions, check_cfg traversal will never explore instructions of the exception callback. Even after adding the subprog, the program will then fail with a 'unreachable insn' error. We thus need to begin walking from the start of the exception callback again in check_cfg after a complete CFG traversal finishes, so as to explore the CFG rooted at the exception callback. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Add support for custom exception callbacksKumar Kartikeya Dwivedi2-16/+126
By default, the subprog generated by the verifier to handle a thrown exception hardcodes a return value of 0. To allow user-defined logic and modification of the return value when an exception is thrown, introduce the 'exception_callback:' declaration tag, which marks a callback as the default exception handler for the program. The format of the declaration tag is 'exception_callback:<value>', where <value> is the name of the exception callback. Each main program can be tagged using this BTF declaratiion tag to associate it with an exception callback. In case the tag is absent, the default callback is used. As such, the exception callback cannot be modified at runtime, only set during verification. Allowing modification of the callback for the current program execution at runtime leads to issues when the programs begin to nest, as any per-CPU state maintaing this information will have to be saved and restored. We don't want it to stay in bpf_prog_aux as this takes a global effect for all programs. An alternative solution is spilling the callback pointer at a known location on the program stack on entry, and then passing this location to bpf_throw as a parameter. However, since exceptions are geared more towards a use case where they are ideally never invoked, optimizing for this use case and adding to the complexity has diminishing returns. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Refactor check_btf_func and split into two phasesKumar Kartikeya Dwivedi1-28/+100
This patch splits the check_btf_info's check_btf_func check into two separate phases. The first phase sets up the BTF and prepares func_info, but does not perform any validation of required invariants for subprogs just yet. This is left to the second phase, which happens where check_btf_info executes currently, and performs the line_info and CO-RE relocation. The reason to perform this split is to obtain the userspace supplied func_info information before we perform the add_subprog call, where we would now require finding and adding subprogs that may not have a bpf_pseudo_call or bpf_pseudo_func instruction in the program. We require this as we want to enable userspace to supply exception callbacks that can override the default hidden subprogram generated by the verifier (which performs a hardcoded action). In such a case, the exception callback may never be referenced in an instruction, but will still be suitably annotated (by way of BTF declaration tags). For finding this exception callback, we would require the program's BTF information, and the supplied func_info information which maps BTF type IDs to subprograms. Since the exception callback won't actually be referenced through instructions, later checks in check_cfg and do_check_subprogs will not verify the subprog. This means that add_subprog needs to add them in the add_subprog_and_kfunc phase before we move forward, which is why the BTF and func_info are required at that point. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Implement BPF exceptionsKumar Kartikeya Dwivedi3-15/+141
This patch implements BPF exceptions, and introduces a bpf_throw kfunc to allow programs to throw exceptions during their execution at runtime. A bpf_throw invocation is treated as an immediate termination of the program, returning back to its caller within the kernel, unwinding all stack frames. This allows the program to simplify its implementation, by testing for runtime conditions which the verifier has no visibility into, and assert that they are true. In case they are not, the program can simply throw an exception from the other branch. BPF exceptions are explicitly *NOT* an unlikely slowpath error handling primitive, and this objective has guided design choices of the implementation of the them within the kernel (with the bulk of the cost for unwinding the stack offloaded to the bpf_throw kfunc). The implementation of this mechanism requires use of add_hidden_subprog mechanism introduced in the previous patch, which generates a couple of instructions to move R1 to R0 and exit. The JIT then rewrites the prologue of this subprog to take the stack pointer and frame pointer as inputs and reset the stack frame, popping all callee-saved registers saved by the main subprog. The bpf_throw function then walks the stack at runtime, and invokes this exception subprog with the stack and frame pointers as parameters. Reviewers must take note that currently the main program is made to save all callee-saved registers on x86_64 during entry into the program. This is because we must do an equivalent of a lightweight context switch when unwinding the stack, therefore we need the callee-saved registers of the caller of the BPF program to be able to return with a sane state. Note that we have to additionally handle r12, even though it is not used by the program, because when throwing the exception the program makes an entry into the kernel which could clobber r12 after saving it on the stack. To be able to preserve the value we received on program entry, we push r12 and restore it from the generated subprogram when unwinding the stack. For now, bpf_throw invocation fails when lingering resources or locks exist in that path of the program. In a future followup, bpf_throw will be extended to perform frame-by-frame unwinding to release lingering resources for each stack frame, removing this limitation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Implement support for adding hidden subprogsKumar Kartikeya Dwivedi3-10/+40
Introduce support in the verifier for generating a subprogram and include it as part of a BPF program dynamically after the do_check phase is complete. The first user will be the next patch which generates default exception callbacks if none are set for the program. The phase of invocation will be do_misc_fixups. Note that this is an internal verifier function, and should be used with instruction blocks which uphold the invariants stated in check_subprogs. Since these subprogs are always appended to the end of the instruction sequence of the program, it becomes relatively inexpensive to do the related adjustments to the subprog_info of the program. Only the fake exit subprogram is shifted forward, making room for our new subprog. This is useful to insert a new subprogram, get it JITed, and obtain its function pointer. The next patch will use this functionality to insert a default exception callback which will be invoked after unwinding the stack. Note that these added subprograms are invisible to userspace, and never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single subprogram is supported, but more can be easily supported in the future. To this end, two function counts are introduced now, the existing func_cnt, and real_func_cnt, the latter including hidden programs. This allows us to conver the JIT code to use the real_func_cnt for management of resources while syscall path continues working with existing func_cnt. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16arch/x86: Implement arch_bpf_stack_walkKumar Kartikeya Dwivedi1-0/+9
The plumbing for offline unwinding when we throw an exception in programs would require walking the stack, hence introduce a new arch_bpf_stack_walk function. This is provided when the JIT supports exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific code is really minimal, hence it should be straightforward to extend this support to other architectures as well, as it reuses the logic of arch_stack_walk, but allowing access to unwind_state data. Once the stack pointer and frame pointer are known for the main subprog during the unwinding, we know the stack layout and location of any callee-saved registers which must be restored before we return back to the kernel. This handling will be added in the subsequent patches. Note that while we primarily unwind through BPF frames, which are effectively CONFIG_UNWINDER_FRAME_POINTER, we still need one of this or CONFIG_UNWINDER_ORC to be able to unwind through the bpf_throw frame from which we begin walking the stack. We also require both sp and bp (stack and frame pointers) from the unwind_state structure, which are only available when one of these two options are enabled. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-15bpf: Allow to use kfunc XDP hints and frags togetherLarysa Zaremba1-1/+8
There is no fundamental reason, why multi-buffer XDP and XDP kfunc RX hints cannot coexist in a single program. Allow those features to be used together by modifying the flags condition for dev-bound-only programs, segments are still prohibited for fully offloaded programs, hence additional check. Suggested-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/CAKH8qBuzgtJj=OKMdsxEkyML36VsAuZpcrsXcyqjdKXSJCBq=Q@mail.gmail.com/ Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230915083914.65538-1-larysa.zaremba@intel.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15bpf: expose information about supported xdp metadata kfuncStanislav Fomichev1-1/+1
Add new xdp-rx-metadata-features member to netdev netlink which exports a bitmask of supported kfuncs. Most of the patch is autogenerated (headers), the only relevant part is netdev.yaml and the changes in netdev-genl.c to marshal into netlink. Example output on veth: $ ip link add veth0 type veth peer name veth1 # ifndex == 12 $ ./tools/net/ynl/samples/netdev 12 Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12 veth1[12] xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0 Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15bpf: make it easier to add new metadata kfuncStanislav Fomichev1-4/+5
No functional changes. Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc, move kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx. This way, any time new kfunc is added, we don't have to touch bpf_dev_bound_resolve_kfunc. Also document XDP_METADATA_KFUNC_xxx arguments since we now have more than two and it might be confusing what is what. Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-14bpf: Charge modmem for struct_ops trampolineSong Liu1-4/+22
Current code charges modmem for regular trampoline, but not for struct_ops trampoline. Add bpf_jit_[charge|uncharge]_modmem() to struct_ops so the trampoline is charged in both cases. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914222542.2986059-1-song@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-13Merge tag 'trace-v6.6-rc1' of ↵Linus Torvalds6-30/+88
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add missing LOCKDOWN checks for eventfs callers When LOCKDOWN is active for tracing, it causes inconsistent state when some functions succeed and others fail. - Use dput() to free the top level eventfs descriptor There was a race between accesses and freeing it. - Fix a long standing bug that eventfs exposed due to changing timings by dynamically creating files. That is, If a event file is opened for an instance, there's nothing preventing the instance from being removed which will make accessing the files cause use-after-free bugs. - Fix a ring buffer race that happens when iterating over the ring buffer while writers are active. Check to make sure not to read the event meta data if it's beyond the end of the ring buffer sub buffer. - Fix the print trigger that disappeared because the test to create it was looking for the event dir field being filled, but now it has the "ef" field filled for the eventfs structure. - Remove the unused "dir" field from the event structure. - Fix the order of the trace_dynamic_info as it had it backwards for the offset and len fields for which one was for which endianess. - Fix NULL pointer dereference with eventfs_remove_rec() If an allocation fails in one of the eventfs_add_*() functions, the caller of it in event_subsystem_dir() or event_create_dir() assigns the result to the structure. But it's assigning the ERR_PTR and not NULL. This was passed to eventfs_remove_rec() which expects either a good pointer or a NULL, not ERR_PTR. The fix is to not assign the ERR_PTR to the structure, but to keep it NULL on error. - Fix list_for_each_rcu() to use list_for_each_srcu() in dcache_dir_open_wrapper(). One iteration of the code used RCU but because it had to call sleepable code, it had to be changed to use SRCU, but one of the iterations was missed. - Fix synthetic event print function to use "as_u64" instead of passing in a pointer to the union. To fix big/little endian issues, the u64 that represented several types was turned into a union to define the types properly. * tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec() tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper() tracing/synthetic: Print out u64 values properly tracing/synthetic: Fix order of struct trace_dynamic_info selftests/ftrace: Fix dependencies for some of the synthetic event tests tracing: Remove unused trace_event_file dir field tracing: Use the new eventfs descriptor for print trigger ring-buffer: Do not attempt to read past "commit" tracefs/eventfs: Free top level files on removal ring-buffer: Avoid softlockup in ring_buffer_resize() tracing: Have event inject files inc the trace array ref count tracing: Have option files inc the trace array ref count tracing: Have current_trace inc the trace array ref count tracing: Have tracing_max_latency inc the trace array ref count tracing: Increase trace array ref count on enable and filter files tracefs/eventfs: Use dput to free the toplevel events directory tracefs/eventfs: Add missing lockdown checks tracefs: Add missing lockdown check to tracefs_create_dir()
2023-09-12bpf, x64: Fix tailcall infinite loopLeon Hwang2-2/+5
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before. From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64. From commit 5b92a28aae4dd0f8 ("bpf: Support attaching tracing BPF program to other BPF programs"), BPF program is able to trace other BPF programs. How about combining them all together? 1. FENTRY/FEXIT on a BPF subprogram. 2. A tailcall runs in the BPF subprogram. 3. The tailcall calls the subprogram's caller. As a result, a tailcall infinite loop comes up. And the loop would halt the machine. As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So do in trampolines. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()Jinjie Ruan1-4/+9
Inject fault while probing btrfs.ko, if kstrdup() fails in eventfs_prepare_ef() in eventfs_add_dir(), it will return ERR_PTR to assign file->ef. But the eventfs_remove() check NULL in trace_module_remove_events(), which causes the below NULL pointer dereference. As both Masami and Steven suggest, allocater side should handle the error carefully and remove it, so fix the places where it failed. Could not create tracefs 'raid56_write' directory Btrfs loaded, zoned=no, fsverity=no Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102544000 [000000000000001c] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 8021q garp mrp stp llc ipv6 [last unloaded: btrfs] CPU: 15 PID: 1343 Comm: rmmod Tainted: G N 6.5.0+ #40 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : eventfs_remove_rec+0x24/0xc0 lr : eventfs_remove+0x68/0x1d8 sp : ffff800082d63b60 x29: ffff800082d63b60 x28: ffffb84b80ddd00c x27: ffffb84b3054ba40 x26: 0000000000000002 x25: ffff800082d63bf8 x24: ffffb84b8398e440 x23: ffffb84b82af3000 x22: dead000000000100 x21: dead000000000122 x20: ffff800082d63bf8 x19: fffffffffffffff4 x18: ffffb84b82508820 x17: 0000000000000000 x16: 0000000000000000 x15: 000083bc876a3166 x14: 000000000000006d x13: 000000000000006d x12: 0000000000000000 x11: 0000000000000001 x10: 00000000000017e0 x9 : 0000000000000001 x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffb84b84289804 x5 : 0000000000000000 x4 : 9696969696969697 x3 : ffff33a5b7601f38 x2 : 0000000000000000 x1 : ffff800082d63bf8 x0 : fffffffffffffff4 Call trace: eventfs_remove_rec+0x24/0xc0 eventfs_remove+0x68/0x1d8 remove_event_file_dir+0x88/0x100 event_remove+0x140/0x15c trace_module_notify+0x1fc/0x230 notifier_call_chain+0x98/0x17c blocking_notifier_call_chain+0x4c/0x74 __arm64_sys_delete_module+0x1a4/0x298 invoke_syscall+0x44/0x100 el0_svc_common.constprop.1+0x68/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x3c/0xc4 el0t_64_sync_handler+0xa0/0xc4 el0t_64_sync+0x174/0x178 Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: 0x384b00c00000 from 0xffff800080000000 PHYS_OFFSET: 0xffffcc5b80000000 CPU features: 0x88000203,3c020000,1000421b Memory Limit: none Rebooting in 1 seconds.. Link: https://lore.kernel.org/linux-trace-kernel/20230912134752.1838524-1-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20230912025808.668187-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230911052818.1020547-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230909072817.182846-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230908074816.3724716-1-ruanjinjie@huawei.com/ Cc: Ajay Kaher <akaher@vmware.com> Fixes: 5bdcd5f5331a ("eventfs: Implement removal of meta data from eventfs") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-11tracing/synthetic: Print out u64 values properlyTero Kristo1-1/+1
The synth traces incorrectly print pointer to the synthetic event values instead of the actual value when using u64 type. Fix by addressing the contents of the union properly. Link: https://lore.kernel.org/linux-trace-kernel/20230911141704.3585965-1-tero.kristo@linux.intel.com Fixes: ddeea494a16f ("tracing/synthetic: Use union instead of casts") Cc: stable@vger.kernel.org Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-09Merge tag 'riscv-for-linus-6.6-mw2-2' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull more RISC-V updates from Palmer Dabbelt: - The kernel now dynamically probes for misaligned access speed, as opposed to relying on a table of known implementations. - Support for non-coherent devices on systems using the Andes AX45MP core, including the RZ/Five SoCs. - Support for the V extension in ptrace(), again. - Support for KASLR. - Support for the BPF prog pack allocator in RISC-V. - A handful of bug fixes and cleanups. * tag 'riscv-for-linus-6.6-mw2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (25 commits) soc: renesas: Kconfig: For ARCH_R9A07G043 select the required configs if dependencies are met riscv: Kconfig.errata: Add dependency for RISCV_SBI in ERRATA_ANDES config riscv: Kconfig.errata: Drop dependency for MMU in ERRATA_ANDES_CMO config riscv: Kconfig: Select DMA_DIRECT_REMAP only if MMU is enabled bpf, riscv: use prog pack allocator in the BPF JIT riscv: implement a memset like function for text riscv: extend patch_text_nosync() for multiple pages bpf: make bpf_prog_pack allocator portable riscv: libstub: Implement KASLR by using generic functions libstub: Fix compilation warning for rv32 arm64: libstub: Move KASLR handling functions to kaslr.c riscv: Dump out kernel offset information on panic riscv: Introduce virtual kernel mapping KASLR RISC-V: Add ptrace support for vectors soc: renesas: Kconfig: Select the required configs for RZ/Five SoC cache: Add L2 cache management for Andes AX45MP RISC-V core dt-bindings: cache: andestech,ax45mp-cache: Add DT binding documentation for L2 cache controller riscv: mm: dma-noncoherent: nonstandard cache operations support riscv: errata: Add Andes alternative ports riscv: asm: vendorid_list: Add Andes Technology to the vendors list ...
2023-09-09Merge tag 'dma-mapping-6.6-2023-09-09' of ↵Linus Torvalds4-13/+18
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - move a dma-debug call that prints a message out from a lock that's causing problems with the lock order in serial drivers (Sergey Senozhatsky) - fix the CONFIG_DMA_NUMA_CMA Kconfig entry to have the right dependency and not default to y (Christoph Hellwig) - move an ifdef a bit to remove a __maybe_unused that seems to trip up some sensitivities (Christoph Hellwig) - revert a bogus check in the CMA allocator (Zhenhua Huang) * tag 'dma-mapping-6.6-2023-09-09' of git://git.infradead.org/users/hch/dma-mapping: Revert "dma-contiguous: check for memory region overlap" dma-pool: remove a __maybe_unused label in atomic_pool_expand dma-contiguous: fix the Kconfig entry for CONFIG_DMA_NUMA_CMA dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lock
2023-09-08tracing: Remove unused trace_event_file dir fieldSteven Rostedt (Google)1-13/+0
Now that eventfs structure is used to create the events directory via the eventfs dynamically allocate code, the "dir" field of the trace_event_file structure is no longer used. Remove it. Link: https://lkml.kernel.org/r/20230908022001.580400115@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08tracing: Use the new eventfs descriptor for print triggerSteven Rostedt (Google)1-2/+2
The check to create the print event "trigger" was using the obsolete "dir" value of the trace_event_file to determine if it should create the trigger or not. But that value will now be NULL because it uses the event file descriptor. Change it to test the "ef" field of the trace_event_file structure so that the trace_marker "trigger" file appears again. Link: https://lkml.kernel.org/r/20230908022001.371815239@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Fixes: 27152bceea1df ("eventfs: Move tracing/events to eventfs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08ring-buffer: Do not attempt to read past "commit"Steven Rostedt (Google)1-0/+5
When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There's barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks to detect the corruption by the writer to reads needs to see the length of the event. If the length in the first 4 bytes is zero then the length is stored in the second 4 bytes. But if the writer is in the process of updating that code, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08Merge tag 'printk-for-6.6-fixup' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Revert exporting symbols needed for dumping the raw printk buffer in panic(). I pushed the export prematurely before the user was ready for merging into the mainline. * tag 'printk-for-6.6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: Revert "printk: export symbols for debug modules"
2023-09-08Merge patch series "bpf, riscv: use BPF prog pack allocator in BPF JIT"Palmer Dabbelt1-4/+4
Puranjay Mohan <puranjay12@gmail.com> says: Here is some data to prove the V2 fixes the problem: Without this series: root@rv-selftester:~/src/kselftest/bpf# time ./test_tag test_tag: OK (40945 tests) real 7m47.562s user 0m24.145s sys 6m37.064s With this series applied: root@rv-selftester:~/src/selftest/bpf# time ./test_tag test_tag: OK (40945 tests) real 7m29.472s user 0m25.865s sys 6m18.401s BPF programs currently consume a page each on RISCV. For systems with many BPF programs, this adds significant pressure to instruction TLB. High iTLB pressure usually causes slow down for the whole system. Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. It packs multiple BPF programs into a single huge page. It is currently only enabled for the x86_64 BPF JIT. I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. ====================================================== Performance Analysis of prog pack allocator on RISCV64 ====================================================== Test setup: =========== Host machine: Debian GNU/Linux 11 (bullseye) Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) u-boot-qemu Version: 2023.07+dfsg-1 opensbi Version: 1.3-1 To test the performance of the BPF prog pack allocator on RV, a stresser tool[4] linked below was built. This tool loads 8 BPF programs on the system and triggers 5 of them in an infinite loop by doing system calls. The runner script starts 20 instances of the above which loads 8*20=160 BPF programs on the system, 5*20=100 of which are being constantly triggered. The script is passed a command which would be run in the above environment. The script was run with following perf command: ./run.sh "perf stat -a \ -e iTLB-load-misses \ -e dTLB-load-misses \ -e dTLB-store-misses \ -e instructions \ --timeout 60000" The output of the above command is discussed below before and after enabling the BPF prog pack allocator. The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs was created using Bjorn's riscv-cross-builder[5] docker container linked below. Results ======= Before enabling prog pack allocator: ------------------------------------ Performance counter stats for 'system wide': 4939048 iTLB-load-misses 5468689 dTLB-load-misses 465234 dTLB-store-misses 1441082097998 instructions 60.045791200 seconds time elapsed After enabling prog pack allocator: ----------------------------------- Performance counter stats for 'system wide': 3430035 iTLB-load-misses 5008745 dTLB-load-misses 409944 dTLB-store-misses 1441535637988 instructions 60.046296600 seconds time elapsed Improvements in metrics ======================= It was expected that the iTLB-load-misses would decrease as now a single huge page is used to keep all the BPF programs compared to a single page for each program earlier. -------------------------------------------- The improvement in iTLB-load-misses: -30.5 % -------------------------------------------- I repeated this expriment more than 100 times in different setups and the improvement was always greater than 30%. This patch series is boot tested on the Starfive VisionFive 2 board[6]. The performance analysis was not done on the board because it doesn't expose iTLB-load-misses, etc. The stresser program was run on the board to test the loading and unloading of BPF programs [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ [4] https://github.com/puranjaymohan/BPF-Allocator-Bench [5] https://github.com/bjoto/riscv-cross-builder [6] https://www.starfivetech.com/en/site/boards * b4-shazam-merge: bpf, riscv: use prog pack allocator in the BPF JIT riscv: implement a memset like function for text riscv: extend patch_text_nosync() for multiple pages bpf: make bpf_prog_pack allocator portable Link: https://lore.kernel.org/r/20230831131229.497941-1-puranjay12@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08bpf: task_group_seq_get_next: simplify the "next tid" logicOleg Nesterov1-7/+4
Kill saved_tid. It looks ugly to update *tid and then restore the previous value if __task_pid_nr_ns() returns 0. Change this code to update *tid and common->pid_visiting once before return. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154656.GA24950@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: kill next_taskOleg Nesterov1-8/+6
It only adds the unnecessary confusion and compicates the "retry" code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154654.GA24945@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: fix the skip_if_dup_files checkOleg Nesterov1-1/+1
Unless I am notally confused it is wrong. We are going to return or skip next_task so we need to check next_task-files, not task->files. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154651.GA24940@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: cleanup the usage of get/put_task_structOleg Nesterov1-10/+2
get_pid_task() makes no sense, the code does put_task_struct() soon after. Use find_task_by_pid_ns() instead of find_pid_ns + get_pid_task and kill put_task_struct(), this allows to do get_task_struct() only once before return. While at it, kill the unnecessary "if (!pid)" check in the "if (!*tid)" block, this matches the next usage of find_pid_ns() + get_pid_task() in this function. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154649.GA24935@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: cleanup the usage of next_thread()Oleg Nesterov1-7/+0
1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we can safely iterate the task->thread_group list. Even if this task exits right after get_pid_task() (or goto retry) and pid_alive() returns 0. Kill the unnecessary pid_alive() check. 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)" check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154646.GA24928@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Enable IRQ after irq_work_raise() completes in unit_free{_rcu}()Hou Tao1-2/+7
Both unit_free() and unit_free_rcu() invoke irq_work_raise() to free freed objects back to slab and the invocation may also be preempted by unit_alloc() and unit_alloc() may return NULL unexpectedly as shown in the following case: task A task B unit_free() // high_watermark = 48 // free_cnt = 49 after free irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 48 after alloc // does unit_alloc() 32-times ...... // free_cnt = 16 unit_alloc() // free_cnt = 15 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 15-times ...... // free_cnt = 0 unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Enable IRQ after irq_work_raise() completes in unit_alloc()Hou Tao1-1/+6
When doing stress test for qp-trie, bpf_mem_alloc() returned NULL unexpectedly because all qp-trie operations were initiated from bpf syscalls and there was still available free memory. bpf_obj_new() has the same problem as shown by the following selftest. The failure is due to the preemption. irq_work_raise() will invoke irq_work_claim() first to mark the irq work as pending and then inovke __irq_work_queue_local() to raise an IPI. So when the current task which is invoking irq_work_raise() is preempted by other task, unit_alloc() may return NULL for preemption task as shown below: task A task B unit_alloc() // low_watermark = 32 // free_cnt = 31 after alloc irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 30 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 30-times ...... unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. An alternative fix is using preempt_{disable|enable}_notrace() pair, but it may have extra overhead. Another feasible fix is to only disable preemption or IRQ before invoking irq_work_queue() and enable preemption or IRQ after the invocation completes, but it can't handle the case when c->low_watermark is 1. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Mark OBJ_RELEASE argument as MEM_RCU when possibleYonghong Song1-0/+20
In previous selftests/bpf patch, we have p = bpf_percpu_obj_new(struct val_t); if (!p) goto out; p1 = bpf_kptr_xchg(&e->pc, p); if (p1) { /* race condition */ bpf_percpu_obj_drop(p1); } p = e->pc; if (!p) goto out; After bpf_kptr_xchg(), we need to re-read e->pc into 'p'. This is due to that the second argument of bpf_kptr_xchg() is marked OBJ_RELEASE and it will be marked as invalid after the call. So after bpf_kptr_xchg(), 'p' is an unknown scalar, and the bpf program needs to reread from the map value. This patch checks if the 'p' has type MEM_ALLOC and MEM_PERCPU, and if 'p' is RCU protected. If this is the case, 'p' can be marked as MEM_RCU. MEM_ALLOC needs to be removed since 'p' is not an owning reference any more. Such a change makes re-read from the map value unnecessary. Note that re-reading 'e->pc' after bpf_kptr_xchg() might get a different value from 'p' if immediately before 'p = e->pc', another cpu may do another bpf_kptr_xchg() and swap in another value into 'e->pc'. If this is the case, then 'p = e->pc' may get either 'p' or another value, and race condition already exists. So removing direct re-reading seems fine too. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152816.2000760-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add bpf_this_cpu_ptr/bpf_per_cpu_ptr support for allocated percpu objYonghong Song1-8/+51
The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed for allocated percpu objects. For an allocated percpu obj, the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'. The return type for these two re-purposed helpera is 'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'. The MEM_ALLOC allows that the per-cpu data can be read and written. Since the memory allocator bpf_mem_alloc() returns a ptr to a percpu ptr for percpu data, the first argument of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched with a dereference before passing to the helper func. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add alloc/xchg/direct_access support for local percpu kptrYonghong Song2-22/+106
Add two new kfunc's, bpf_percpu_obj_new_impl() and bpf_percpu_obj_drop_impl(), to allocate a percpu obj. Two functions are very similar to bpf_obj_new_impl() and bpf_obj_drop_impl(). The major difference is related to percpu handling. bpf_rcu_read_lock() struct val_t __percpu_kptr *v = map_val->percpu_data; ... bpf_rcu_read_unlock() For a percpu data map_val like above 'v', the reg->type is set as PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU if inside rcu critical section. MEM_RCU marking here is similar to NON_OWN_REF as 'v' is not a owning reference. But NON_OWN_REF is trusted and typically inside the spinlock while MEM_RCU is under rcu read lock. RCU is preferred here since percpu data structures mean potential concurrent access into its contents. Also, bpf_percpu_obj_new_impl() is restricted such that no pointers or special fields are allowed. Therefore, the bpf_list_head and bpf_rb_root will not be supported in this patch set to avoid potential memory leak issue due to racing between bpf_obj_free_fields() and another bpf_kptr_xchg() moving an allocated object to bpf_list_head and bpf_rb_root. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152744.1996739-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add BPF_KPTR_PERCPU as a field typeYonghong Song2-0/+9
BPF_KPTR_PERCPU represents a percpu field type like below struct val_t { ... fields ... }; struct t { ... struct val_t __percpu_kptr *percpu_data_ptr; ... }; where #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr"))) While BPF_KPTR_REF points to a trusted kernel object or a trusted local object, BPF_KPTR_PERCPU points to a trusted local percpu object. This patch added basic support for BPF_KPTR_PERCPU related to percpu_kptr field parsing, recording and free operations. BPF_KPTR_PERCPU also supports the same map types as BPF_KPTR_REF does. Note that unlike a local kptr, it is possible that a BPF_KTPR_PERCPU struct may not contain any special fields like other kptr, bpf_spin_lock, bpf_list_head, etc. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add support for non-fix-size percpu mem allocationYonghong Song2-11/+11
This is needed for later percpu mem allocation when the allocation is done by bpf program. For such cases, a global bpf_global_percpu_ma is added where a flexible allocation size is needed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08Revert "dma-contiguous: check for memory region overlap"Zhenhua Huang1-5/+0
This reverts commit 3fa6456ebe13adab3ba1817c8e515a5b88f95dce. The Commit broke the CMA region creation through DT on arm64, as showed below logs with "memblock=debug": [ 0.000000] memblock_phys_alloc_range: 41943040 bytes align=0x200000 from=0x0000000000000000 max_addr=0x00000000ffffffff early_init_dt_alloc_reserved_memory_arch+0x34/0xa0 [ 0.000000] memblock_reserve: [0x00000000fd600000-0x00000000ffdfffff] memblock_alloc_range_nid+0xc0/0x19c [ 0.000000] Reserved memory: overlap with other memblock reserved region >From call flow, region we defined in DT was always reserved before entering into rmem_cma_setup. Also, rmem_cma_setup has one routine cma_init_reserved_mem to ensure the region was reserved. Checking the region not reserved here seems not correct. early_init_fdt_scan_reserved_mem: fdt_scan_reserved_mem __reserved_mem_reserve_reg early_init_dt_reserve_memory memblock_reserve(using “reg” prop case) fdt_init_reserved_mem __reserved_mem_alloc_size *early_init_dt_alloc_reserved_memory_arch* memblock_reserve(dynamic alloc case) __reserved_mem_init_node rmem_cma_setup(region overlap check here should always fail) Example DT can be used to reproduce issue: dump_mem: mem_dump_region { compatible = "shared-dma-pool"; alloc-ranges = <0x0 0x00000000 0x0 0xffffffff>; reusable; size = <0 0x2800000>; }; Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
2023-09-07Merge tag 'net-6.6-rc1' of ↵Linus Torvalds4-38/+20
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking updates from Jakub Kicinski: "Including fixes from netfilter and bpf. Current release - regressions: - eth: stmmac: fix failure to probe without MAC interface specified Current release - new code bugs: - docs: netlink: fix missing classic_netlink doc reference Previous releases - regressions: - deal with integer overflows in kmalloc_reserve() - use sk_forward_alloc_get() in sk_get_meminfo() - bpf_sk_storage: fix the missing uncharge in sk_omem_alloc - fib: avoid warn splat in flow dissector after packet mangling - skb_segment: call zero copy functions before using skbuff frags - eth: sfc: check for zero length in EF10 RX prefix Previous releases - always broken: - af_unix: fix msg_controllen test in scm_pidfd_recv() for MSG_CMSG_COMPAT - xsk: fix xsk_build_skb() dereferencing possible ERR_PTR() - netfilter: - nft_exthdr: fix non-linear header modification - xt_u32, xt_sctp: validate user space input - nftables: exthdr: fix 4-byte stack OOB write - nfnetlink_osf: avoid OOB read - one more fix for the garbage collection work from last release - igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU - bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t - handshake: fix null-deref in handshake_nl_done_doit() - ip: ignore dst hint for multipath routes to ensure packets are hashed across the nexthops - phy: micrel: - correct bit assignments for cable test errata - disable EEE according to the KSZ9477 errata Misc: - docs/bpf: document compile-once-run-everywhere (CO-RE) relocations - Revert "net: macsec: preserve ingress frame ordering", it appears to have been developed against an older kernel, problem doesn't exist upstream" * tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits) net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs() Revert "net: team: do not use dynamic lockdep key" net: hns3: remove GSO partial feature bit net: hns3: fix the port information display when sfp is absent net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue net: hns3: fix debugfs concurrency issue between kfree buffer and read net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read() net: hns3: Support query tx timeout threshold by debugfs net: hns3: fix tx timeout issue net: phy: Provide Module 4 KSZ9477 errata (DS80000754C) netfilter: nf_tables: Unbreak audit log reset netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID netfilter: nfnetlink_osf: avoid OOB read netfilter: nftables: exthdr: fix 4-byte stack OOB write selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc bpf: bpf_sk_storage: Fix invalid wait context lockdep report s390/bpf: Pass through tail call counter in trampolines ...
2023-09-07ring-buffer: Avoid softlockup in ring_buffer_resize()Zheng Yejian1-0/+2
When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be allocated, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have event inject files inc the trace array ref countSteven Rostedt (Google)1-1/+2
The event inject files add events for a specific trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when a event inject file is opened. Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 6c3edaf9fd6a ("tracing: Introduce trace event injection") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have option files inc the trace array ref countSteven Rostedt (Google)1-1/+22
The option files update the options for a given trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when an option file is opened. Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have current_trace inc the trace array ref countSteven Rostedt (Google)1-1/+2
The current_trace updates the trace array tracer. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace array when current_trace is opened. Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have tracing_max_latency inc the trace array ref countSteven Rostedt (Google)1-5/+10
The tracing_max_latency file points to the trace_array max_latency field. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when tracing_max_latency is opened. Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Increase trace array ref count on enable and filter filesSteven Rostedt (Google)3-2/+33
When the trace event enable and filter files are opened, increment the trace array ref counter, otherwise they can be accessed when the trace array is being deleted. The ref counter keeps the trace array from being deleted while those files are opened. Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07Revert "printk: export symbols for debug modules"Christoph Hellwig1-2/+0
This reverts commit 3e00123a13d824d63072b1824c9da59cd78356d9. No, we never export random symbols for out of tree modules. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230905081902.321778-1-hch@lst.de
2023-09-06bpf: make bpf_prog_pack allocator portablePuranjay Mohan1-4/+4
The bpf_prog_pack allocator currently uses module_alloc() and module_memfree() to allocate and free memory. This is not portable because different architectures use different methods for allocating memory for BPF programs. Like ARM64 and riscv use vmalloc()/vfree(). Use bpf_jit_alloc_exec() and bpf_jit_free_exec() for memory management in bpf_prog_pack allocator. Other architectures can override these with their implementation and will be able to use bpf_prog_pack directly. On architectures that don't override bpf_jit_alloc/free_exec() this is basically a NOP. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Björn Töpel <bjorn@kernel.org> Tested-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20230831131229.497941-2-puranjay12@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-06bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_allocMartin KaFai Lau1-1/+1
The commit c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into bpf_local_storage_unlink_nolock() which then later renamed to bpf_local_storage_destroy(). The commit accidentally passed the "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock() which then stopped the uncharge from happening to the sk->sk_omem_alloc. This missing uncharge only happens when the sk is going away (during __sk_destruct). This patch fixes it by always passing "uncharge_mem = true". It is a noop to the task/inode/cgroup storage because they do not have the map_local_storage_(un)charge enabled in the map_ops. A followup patch will be done in bpf-next to remove the uncharge_mem argument. A selftest is added in the next patch. Fixes: c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions for reuse") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev