author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-01-29 19:38:34 -0800 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-01-29 19:38:34 -0800 |
commit | 83fa805bcbfc53ae82eedd65132794ae324798e5 (patch) | |
tree | ff4b2ba048bb5f14194110aedb09c85aab159d4a | |
parent | 896f8d23d0cb5889021d66eab6107e97109c5459 (diff) | |
parent | 8d19f1c8e1937baf74e1962aae9f90fa3aeab463 (diff) |
Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread management updates from Christian Brauner:
"Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
syscall.
This syscall allows for the retrieval of file descriptors of a process
based on its pidfd. A task needs to have ptrace_may_access()
permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
Andy) on the target.
One of the main use-cases is in combination with seccomp's user
notification feature. As a reminder, seccomp's user notification
feature was made available in v5.0. It allows a task to retrieve a
file descriptor for its seccomp filter. The file descriptor is usually
handed off to a more privileged supervising process. The supervisor can
then listen for syscall events caught by the seccomp filter of the
supervisee and perform actions in lieu of the supervisee, usually
emulating syscalls. pidfd_getfd() is needed to expand its uses.
There are currently two major users that wait on pidfd_getfd() and one
future user:
- Netflix, Sargun said, is working on a service mesh where users
should be able to connect to a dns-based VIP. When a user connects
to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
redirected to an envoy process. This service mesh uses seccomp user
notifications and pidfd to intercept all connect calls and instead
of connecting them to 1.2.3.4:80 connects them to e.g.
127.0.0.1:8080.
- LXD uses the seccomp notifier heavily to intercept and emulate
mknod() and mount() syscalls for unprivileged containers/processes.
With pidfd_getfd() more use-cases e.g. bridging socket connections
will be possible.
- The patchset has also seen some interest from the browser corner.
Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
broker process. In the future glibc will start blocking all signals
during dlopen() rendering this type of sandbox impossible. Hence,
in the future Firefox will switch to a seccomp-user-notification
based sandbox which also makes use of file descriptor retrieval.
The thread for this can be found at
https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html
With pidfd_getfd() it is e.g. possible to bridge socket connections
for the supervisee (binding to a privileged port) and to take actions
on file descriptors on behalf of the supervisee in general.
Sargun's first version was using an ioctl on pidfds but various people
pushed for it to be a proper syscall which he duly implemented as
well over various review cycles. Selftests are of course included.
I've also added instructions how to deal with merge conflicts below.
There's also a small fix coming from the kernel mentee project to
correctly annotate struct sighand_struct with __rcu to fix various
sparse warnings. We've received a few more such fixes and even though
they are mostly trivial I've decided to postpone them until after -rc1
since they came in rather late and I don't want to risk introducing
build warnings.
Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
needed to avoid allocation recursions triggerable by storage drivers
that have userspace parts that run in the IO path (e.g. dm-multipath,
iscsi, etc). These allocation recursions deadlock the device.
The new prctl() allows such privileged userspace components to avoid
allocation recursions by setting the PF_MEMALLOC_NOIO and
PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
relevant maintainers and is routed here as part of prctl()
thread-management."
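As a rough illustration of the prctl() interface described in the message above, a userspace component in the IO path might toggle the flag as sketched below. The helper names set_io_flusher() and get_io_flusher() are hypothetical. The fallback values 57 and 58 are taken from the include/uapi/linux/prctl.h hunk in this merge, and per the kernel/sys.c hunk both commands fail with EPERM unless the caller has CAP_SYS_RESOURCE.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values from this merge's prctl.h hunk. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark (enable != 0) or unmark (enable == 0) the
 * calling thread as an IO flusher. Returns 0 on success, -1 with errno
 * set on failure (EPERM without CAP_SYS_RESOURCE, EINVAL on kernels
 * that do not know the command).
 */
static int set_io_flusher(int enable)
{
	/* The kernel rejects non-zero arg3..arg5, so pass explicit zeros. */
	return prctl(PR_SET_IO_FLUSHER, enable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: query the IO flusher state of the calling thread.
 * Returns 1 if set, 0 if clear, -1 with errno set on failure.
 */
static int get_io_flusher(void)
{
	/* All of arg2..arg5 must be zero for the GET command. */
	return prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);
}
```

A daemon would presumably call set_io_flusher(1) once at startup, before it begins servicing IO, and treat EPERM or EINVAL as "not available" rather than as a fatal error.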
* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
sched.h: Annotate sighand_struct with __rcu
test: Add test for pidfd getfd
arch: wire up pidfd_getfd syscall
pid: Implement pidfd_getfd syscall
vfs, fdtable: Add fget_task helper
32 files changed, 427 insertions, 7 deletions
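For orientation, here is a minimal userspace sketch of calling the new syscall. It goes through syscall(2) directly, since a libc wrapper cannot be assumed. The fallback numbers 434 and 438 are assumptions that match the tables in this merge for architectures using the common numbering (for example x86-64 and arm64); they are wrong for alpha (544 and 548) and for MIPS, where the numbers are offset by an ABI-specific base. The helper name fetch_remote_fd() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed fallbacks for old headers; valid only for the common numbering. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd 438
#endif

static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(__NR_pidfd_open, pid, flags);
}

/* flags is reserved and must be 0; the kernel returns EINVAL otherwise. */
static int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags)
{
	return (int)syscall(__NR_pidfd_getfd, pidfd, fd, flags);
}

/*
 * Hypothetical helper: duplicate descriptor remote_fd of process pid into
 * the calling process. Returns a new close-on-exec descriptor, or -1 with
 * errno set (EPERM without ptrace access to the target, EBADF if the
 * target has no such descriptor, ESRCH if the process is gone).
 */
static int fetch_remote_fd(pid_t pid, int remote_fd)
{
	int pidfd, fd, saved_errno;

	pidfd = sys_pidfd_open(pid, 0);
	if (pidfd < 0)
		return -1;

	fd = sys_pidfd_getfd(pidfd, remote_fd, 0);

	/* Preserve the errno of pidfd_getfd() across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return fd;
}
```

A supervisor that already holds a pidfd for the target (for example from clone3() or from an earlier pidfd_open()) would skip the open step and call sys_pidfd_getfd() directly.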
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 4d7f2ffa957c..36d42da7466a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -476,3 +476,4 @@
 544	common	pidfd_open	sys_pidfd_open
 # 545 reserved for clone3
 547	common	openat2	sys_openat2
+548	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 4ba54bc7e19a..4d1cf74a2caa 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -450,3 +450,4 @@
 434	common	pidfd_open	sys_pidfd_open
 435	common	clone3	sys_clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 0f255a23733d..1dd22da1c3a9 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END	(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls	438
+#define __NR_compat_syscalls	439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 57f6f592d460..c1c61635f89c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -881,6 +881,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 __SYSCALL(__NR_clone3, sys_clone3)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 8d36f2e2dc89..042911e670b8 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -357,3 +357,4 @@
 434	common	pidfd_open	sys_pidfd_open
 # 435 reserved for clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b911e0f50a71..f4f49fcb76d0 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 434	common	pidfd_open	sys_pidfd_open
 435	common	clone3	__sys_clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index c04385e60833..4c67b11f9c9e 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 434	common	pidfd_open	sys_pidfd_open
 435	common	clone3	sys_clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 68c9ec06851f..1f9e8ad636cc 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -375,3 +375,4 @@
 434	n32	pidfd_open	sys_pidfd_open
 435	n32	clone3	__sys_clone3
 437	n32	openat2	sys_openat2
+438	n32	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 42a72d010050..c0b9d802dbf6 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -351,3 +351,4 @@
 434	n64	pidfd_open	sys_pidfd_open
 435	n64	clone3	__sys_clone3
 437	n64	openat2	sys_openat2
+438	n64	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f114c4aed0ed..ac586774c980 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -424,3 +424,4 @@
 434	o32	pidfd_open	sys_pidfd_open
 435	o32	clone3	__sys_clone3
 437	o32	openat2	sys_openat2
+438	o32	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index b550ae9a7fea..52a15f5cd130 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -434,3 +434,4 @@
 434	common	pidfd_open	sys_pidfd_open
 435	common	clone3	sys_clone3_wrapper
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index a8b5ecb5b602..35b61bfc1b1a 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -518,3 +518,4 @@
 434	common	pidfd_open	sys_pidfd_open
 435	nospu	clone3	ppc_clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 16b571c06161..bd7bd3581a0f 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 434	common	pidfd_open	sys_pidfd_open	sys_pidfd_open
 435	common	clone3	sys_clone3	sys_clone3
 437	common	openat2	sys_openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index a7185cc18626..c7a30fcd135f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 434	common	pidfd_open	sys_pidfd_open
 # 435 reserved for clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index b11c19552022..f13615ecdecc 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -482,3 +482,4 @@
 434	common	pidfd_open	sys_pidfd_open
 # 435 reserved for clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d22a8b5c3fab..c17cb77eb150 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -441,3 +441,4 @@
 434	i386	pidfd_open	sys_pidfd_open	__ia32_sys_pidfd_open
 435	i386	clone3	sys_clone3	__ia32_sys_clone3
 437	i386	openat2	sys_openat2	__ia32_sys_openat2
+438	i386	pidfd_getfd	sys_pidfd_getfd	__ia32_sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9035647ef236..44d510bc9b78 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -358,6 +358,7 @@
 434	common	pidfd_open	__x64_sys_pidfd_open
 435	common	clone3	__x64_sys_clone3/ptregs
 437	common	openat2	__x64_sys_openat2
+438	common	pidfd_getfd	__x64_sys_pidfd_getfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index f0a68013c038..85a9ab1bc04d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -407,3 +407,4 @@
 434	common	pidfd_open	sys_pidfd_open
 435	common	clone3	sys_clone3
 437	common	openat2	sys_openat2
+438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/fs/file.c b/fs/file.c
index fb7081bfac2b..a364e1a9b7e8 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -708,9 +708,9 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
+static struct file *__fget_files(struct files_struct *files, unsigned int fd,
+				 fmode_t mask, unsigned int refs)
 {
-	struct files_struct *files = current->files;
 	struct file *file;
 
 	rcu_read_lock();
@@ -731,6 +731,12 @@ loop:
 	return file;
 }
 
+static inline struct file *__fget(unsigned int fd, fmode_t mask,
+				  unsigned int refs)
+{
+	return __fget_files(current->files, fd, mask, refs);
+}
+
 struct file *fget_many(unsigned int fd, unsigned int refs)
 {
 	return __fget(fd, FMODE_PATH, refs);
@@ -748,6 +754,18 @@ struct file *fget_raw(unsigned int fd)
 }
 EXPORT_SYMBOL(fget_raw);
 
+struct file *fget_task(struct task_struct *task, unsigned int fd)
+{
+	struct file *file = NULL;
+
+	task_lock(task);
+	if (task->files)
+		file = __fget_files(task->files, fd, 0, 1);
+	task_unlock(task);
+
+	return file;
+}
+
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
  *
diff --git a/include/linux/file.h b/include/linux/file.h
index 3fcddff56bc4..c6c7b24ea9f7 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -16,6 +16,7 @@ extern void fput(struct file *);
 extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
+struct task_struct;
 struct vfsmount;
 struct dentry;
 struct inode;
@@ -47,6 +48,7 @@ static inline void fdput(struct fd fd)
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
+extern struct file *fget_task(struct task_struct *task, unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
 extern unsigned long __fdget_pos(unsigned int fd);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 716ad1d8d95e..04278493bf15 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -917,7 +917,7 @@ struct task_struct {
 
 	/* Signal handlers: */
 	struct signal_struct		*signal;
-	struct sighand_struct		*sighand;
+	struct sighand_struct __rcu		*sighand;
 	sigset_t			blocked;
 	sigset_t			real_blocked;
 	/* Restored if set_restore_sigmask() was used: */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ac3a663137b6..1815065d52f3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1002,6 +1002,7 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
+asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d4122c091472..3a3201e4618e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -853,9 +853,11 @@ __SYSCALL(__NR_clone3, sys_clone3)
 
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 
 #undef __NR_syscalls
-#define __NR_syscalls 438
+#define __NR_syscalls 439
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 240fdb9a60f6..272dc69fa080 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -301,6 +301,7 @@ struct vfs_ns_cap_data {
 /* Allow more than 64hz interrupts from the real-time clock */
 /* Override max number of consoles on console allocation */
 /* Override max number of keymaps */
+/* Control memory reclaim behavior */
 
 #define CAP_SYS_RESOURCE     24
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7da1b37b27aa..07b4f8131e36 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -234,4 +234,8 @@ struct prctl_mm_map {
 #define PR_GET_TAGGED_ADDR_CTRL		56
 # define PR_TAGGED_ADDR_ENABLE		(1UL << 0)
 
+/* Control reclaim behavior when allocating memory */
+#define PR_SET_IO_FLUSHER		57
+#define PR_GET_IO_FLUSHER		58
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 2278e249141d..0f4ecb57214c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -578,3 +578,93 @@ void __init pid_idr_init(void)
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
 }
+
+static struct file *__pidfd_fget(struct task_struct *task, int fd)
+{
+	struct file *file;
+	int ret;
+
+	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
+		file = fget_task(task, fd);
+	else
+		file = ERR_PTR(-EPERM);
+
+	mutex_unlock(&task->signal->cred_guard_mutex);
+
+	return file ?: ERR_PTR(-EBADF);
+}
+
+static int pidfd_getfd(struct pid *pid, int fd)
+{
+	struct task_struct *task;
+	struct file *file;
+	int ret;
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		return -ESRCH;
+
+	file = __pidfd_fget(task, fd);
+	put_task_struct(task);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = security_file_receive(file);
+	if (ret) {
+		fput(file);
+		return ret;
+	}
+
+	ret = get_unused_fd_flags(O_CLOEXEC);
+	if (ret < 0)
+		fput(file);
+	else
+		fd_install(ret, file);
+
+	return ret;
+}
+
+/**
+ * sys_pidfd_getfd() - Get a file descriptor from another process
+ *
+ * @pidfd:	the pidfd file descriptor of the process
+ * @fd:	the file descriptor number to get
+ * @flags:	flags on how to get the fd (reserved)
+ *
+ * This syscall gets a copy of a file descriptor from another process
+ * based on the pidfd, and file descriptor number. It requires that
+ * the calling process has the ability to ptrace the process represented
+ * by the pidfd. The process which is having its file descriptor copied
+ * is otherwise unaffected.
+ *
+ * Return: On success, a cloexec file descriptor is returned.
+ *         On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
+		unsigned int, flags)
+{
+	struct pid *pid;
+	struct fd f;
+	int ret;
+
+	/* flags is currently unused - make sure it's unset */
+	if (flags)
+		return -EINVAL;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid))
+		ret = PTR_ERR(pid);
+	else
+		ret = pidfd_getfd(pid, fd);
+
+	fdput(f);
+	return ret;
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index bcd46f547db3..9ad8dea93dbb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1383,7 +1383,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 		 * must see ->sighand == NULL.
 		 */
 		spin_lock_irqsave(&sighand->siglock, *flags);
-		if (likely(sighand == tsk->sighand))
+		if (likely(sighand == rcu_access_pointer(tsk->sighand)))
 			break;
 		spin_unlock_irqrestore(&sighand->siglock, *flags);
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index a9331f101883..f9bc5c303e3f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2261,6 +2261,8 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
+#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2488,6 +2490,29 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = GET_TAGGED_ADDR_CTRL();
 		break;
+	case PR_SET_IO_FLUSHER:
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+
+		if (arg2 == 1)
+			current->flags |= PR_IO_FLUSHER;
+		else if (!arg2)
+			current->flags &= ~PR_IO_FLUSHER;
+		else
+			return -EINVAL;
+		break;
+	case PR_GET_IO_FLUSHER:
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+
+		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 8d069490e17b..3a779c084d96 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -2,3 +2,4 @@ pidfd_open_test
 pidfd_poll_test
 pidfd_test
 pidfd_wait
+pidfd_getfd_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 43db1b98e845..75a545861375 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -g -I../../../../usr/include/ -pthread
 
-TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait
+TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait pidfd_getfd_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index c6bc68329f4b..d482515604db 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -36,6 +36,10 @@
 #define __NR_clone3 -1
 #endif
 
+#ifndef __NR_pidfd_getfd
+#define __NR_pidfd_getfd -1
+#endif
+
 /*
  * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
  * That means, when it wraps around any pid < 300 will be skipped.
@@ -84,4 +88,9 @@ static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
 	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
 }
 
+static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
+{
+	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
+}
+
 #endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_getfd_test.c b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
new file mode 100644
index 000000000000..401a7c1d0312
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <linux/kcmp.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+#include "../kselftest_harness.h"
+
+/*
+ * UNKNOWN_FD is an fd number that should never exist in the child, as it is
+ * used to check the negative case.
+ */
+#define UNKNOWN_FD 111
+#define UID_NOBODY 65535
+
+static int sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
+		    unsigned long idx2)
+{
+	return syscall(__NR_kcmp, pid1, pid2, type, idx1, idx2);
+}
+
+static int sys_memfd_create(const char *name, unsigned int flags)
+{
+	return syscall(__NR_memfd_create, name, flags);
+}
+
+static int __child(int sk, int memfd)
+{
+	int ret;
+	char buf;
+
+	/*
+	 * Ensure we don't leave around a bunch of orphaned children if our
+	 * tests fail.
+	 */
+	ret = prctl(PR_SET_PDEATHSIG, SIGKILL);
+	if (ret) {
+		fprintf(stderr, "%s: Child could not set DEATHSIG\n",
+			strerror(errno));
+		return -1;
+	}
+
+	ret = send(sk, &memfd, sizeof(memfd), 0);
+	if (ret != sizeof(memfd)) {
+		fprintf(stderr, "%s: Child failed to send fd number\n",
+			strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * The fixture setup is completed at this point. The tests will run.
+	 *
+	 * This blocking recv enables the parent to message the child.
+	 * Either we will read 'P' off of the sk, indicating that we need
+	 * to disable ptrace, or we will read a 0, indicating that the other
+	 * side has closed the sk. This occurs during fixture teardown time,
+	 * indicating that the child should exit.
+	 */
+	while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
+		if (buf == 'P') {
+			ret = prctl(PR_SET_DUMPABLE, 0);
+			if (ret < 0) {
+				fprintf(stderr,
+					"%s: Child failed to disable ptrace\n",
+					strerror(errno));
+				return -1;
+			}
+		} else {
+			fprintf(stderr, "Child received unknown command %c\n",
+				buf);
+			return -1;
+		}
+		ret = send(sk, &buf, sizeof(buf), 0);
+		if (ret != 1) {
+			fprintf(stderr, "%s: Child failed to ack\n",
+				strerror(errno));
+			return -1;
+		}
+	}
+	if (ret < 0) {
+		fprintf(stderr, "%s: Child failed to read from socket\n",
+			strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int child(int sk)
+{
+	int memfd, ret;
+
+	memfd = sys_memfd_create("test", 0);
+	if (memfd < 0) {
+		fprintf(stderr, "%s: Child could not create memfd\n",
+			strerror(errno));
+		ret = -1;
+	} else {
+		ret = __child(sk, memfd);
+		close(memfd);
+	}
+
+	close(sk);
+	return ret;
+}
+
+FIXTURE(child)
+{
+	/*
+	 * remote_fd is the number of the FD which we are trying to retrieve
+	 * from the child.
+	 */
+	int remote_fd;
+	/* pid points to the child which we are fetching FDs from */
+	pid_t pid;
+	/* pidfd is the pidfd of the child */
+	int pidfd;
+	/*
+	 * sk is our side of the socketpair used to communicate with the child.
+	 * When it is closed, the child will exit.
+	 */
+	int sk;
+};
+
+FIXTURE_SETUP(child)
+{
+	int ret, sk_pair[2];
+
+	ASSERT_EQ(0, socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair)) {
+		TH_LOG("%s: failed to create socketpair", strerror(errno));
+	}
+	self->sk = sk_pair[0];
+
+	self->pid = fork();
+	ASSERT_GE(self->pid, 0);
+
+	if (self->pid == 0) {
+		close(sk_pair[0]);
+		if (child(sk_pair[1]))
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	close(sk_pair[1]);
+
+	self->pidfd = sys_pidfd_open(self->pid, 0);
+	ASSERT_GE(self->pidfd, 0);
+
+	/*
+	 * Wait for the child to complete setup. It'll send the remote memfd's
+	 * number when ready.
+	 */
+	ret = recv(sk_pair[0], &self->remote_fd, sizeof(self->remote_fd), 0);
+	ASSERT_EQ(sizeof(self->remote_fd), ret);
+}
+
+FIXTURE_TEARDOWN(child)
+{
+	EXPECT_EQ(0, close(self->pidfd));
+	EXPECT_EQ(0, close(self->sk));
+
+	EXPECT_EQ(0, wait_for_pid(self->pid));
+}
+
+TEST_F(child, disable_ptrace)
+{
+	int uid, fd;
+	char c;
+
+	/*
+	 * Turn into nobody if we're root, to avoid CAP_SYS_PTRACE
+	 *
+	 * The tests should run in their own process, so even this test fails,
+	 * it shouldn't result in subsequent tests failing.
+	 */
+	uid = getuid();
+	if (uid == 0)
+		ASSERT_EQ(0, seteuid(UID_NOBODY));
+
+	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
+	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
+
+	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
+	EXPECT_EQ(-1, fd);
+	EXPECT_EQ(EPERM, errno);
+
+	if (uid == 0)
+		ASSERT_EQ(0, seteuid(0));
+}
+
+TEST_F(child, fetch_fd)
+{
+	int fd, ret;
+
+	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
+	ASSERT_GE(fd, 0);
+
+	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
+
+	ret = fcntl(fd, F_GETFD);
+	ASSERT_GE(ret, 0);
+	EXPECT_GE(ret & FD_CLOEXEC, 0);
+
+	close(fd);
+}
+
+TEST_F(child, test_unknown_fd)
+{
+	int fd;
+
+	fd = sys_pidfd_getfd(self->pidfd, UNKNOWN_FD, 0);
+	EXPECT_EQ(-1, fd) {
+		TH_LOG("getfd succeeded while fetching unknown fd");
+	};
+	EXPECT_EQ(EBADF, errno) {
+		TH_LOG("%s: getfd did not get EBADF", strerror(errno));
+	}
+}
+
+TEST(flags_set)
+{
+	ASSERT_EQ(-1, sys_pidfd_getfd(0, 0, 1));
+	EXPECT_EQ(errno, EINVAL);
+}
+
+#if __NR_pidfd_getfd == -1
+int main(void)
+{
+	fprintf(stderr, "__NR_pidfd_getfd undefined. The pidfd_getfd syscall is unavailable. Test aborting\n");
+	return KSFT_SKIP;
+}
+#else
+TEST_HARNESS_MAIN
+#endif