From c5f4546593e9911800f0926c1090959b58bc5c93 Mon Sep 17 00:00:00 2001 From: Seth Jennings Date: Tue, 16 Dec 2014 11:58:18 -0600 Subject: livepatch: kernel: add TAINT_LIVEPATCH This adds a new taint flag to indicate when the kernel or a kernel module has been live patched. This will provide a clean indication in bug reports that live patching was used. Additionally, if the crash occurs in a live patched function, the live patch module will appear beside the patched function in the backtrace. Signed-off-by: Seth Jennings Acked-by: Josh Poimboeuf Reviewed-by: Miroslav Benes Reviewed-by: Petr Mladek Reviewed-by: Masami Hiramatsu Signed-off-by: Jiri Kosina --- kernel/panic.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/panic.c b/kernel/panic.c index 4d8d6f906dec..8136ad76e5fd 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -226,6 +226,7 @@ static const struct tnt tnts[] = { { TAINT_OOT_MODULE, 'O', ' ' }, { TAINT_UNSIGNED_MODULE, 'E', ' ' }, { TAINT_SOFTLOCKUP, 'L', ' ' }, + { TAINT_LIVEPATCH, 'K', ' ' }, }; /** @@ -246,6 +247,7 @@ static const struct tnt tnts[] = { * 'O' - Out-of-tree module has been loaded. * 'E' - Unsigned module has been loaded. * 'L' - A soft lockup has previously occurred. + * 'K' - Kernel has been live patched. * * The string is overwritten by the next call to print_tainted(). */ -- cgit v1.2.3 From b700e7f03df5d92f85fa5247fe1f557528d3363d Mon Sep 17 00:00:00 2001 From: Seth Jennings Date: Tue, 16 Dec 2014 11:58:19 -0600 Subject: livepatch: kernel: add support for live patching This commit introduces code for the live patching core. It implements an ftrace-based mechanism and kernel interface for doing live patching of kernel and kernel module functions. It represents the greatest common functionality set between kpatch and kgraft and can accept patches built using either method. This first version does not implement any consistency mechanism that ensures that old and new code do not run together. In practice, ~90% of CVEs are safe to apply in this way, since they simply add a conditional check. However, any function change that can not execute safely with the old version of the function can _not_ be safely applied in this version. 
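For illustration, here is a sketch of the kind of change that is safe to apply this way. Everything below is hypothetical (struct pkt and the function names are not from this patch set); the point is that the new version only adds a conditional check, so any interleaving of tasks running the old and new code remains correct:

	#include <linux/types.h>
	#include <linux/string.h>
	#include <linux/errno.h>

	/* hypothetical packet structure, for illustration only */
	struct pkt {
		void *payload;
		size_t payload_len;
	};

	/* old (vulnerable) version */
	int pkt_copy_payload(struct pkt *p, void *dst, size_t len)
	{
		memcpy(dst, p->payload, len);
		return 0;
	}

	/* new (patched) version: identical except for one added bounds
	 * check, so it stays correct even while some tasks still execute
	 * the old function */
	int pkt_copy_payload_patched(struct pkt *p, void *dst, size_t len)
	{
		if (len > p->payload_len)	/* the CVE fix */
			return -EINVAL;
		memcpy(dst, p->payload, len);
		return 0;
	}
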
[ jkosina@suse.cz: due to the number of contributions that got folded into this original patch from Seth Jennings, add SUSE's copyright as well, as discussed via e-mail ] Signed-off-by: Seth Jennings Signed-off-by: Josh Poimboeuf Reviewed-by: Miroslav Benes Reviewed-by: Petr Mladek Reviewed-by: Masami Hiramatsu Signed-off-by: Miroslav Benes Signed-off-by: Petr Mladek Signed-off-by: Jiri Kosina --- Documentation/ABI/testing/sysfs-kernel-livepatch | 44 ++ MAINTAINERS | 13 + arch/x86/Kconfig | 3 + arch/x86/include/asm/livepatch.h | 37 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/livepatch.c | 90 +++ include/linux/livepatch.h | 133 ++++ kernel/Makefile | 1 + kernel/livepatch/Kconfig | 18 + kernel/livepatch/Makefile | 3 + kernel/livepatch/core.c | 930 +++++++++++++++++++++++ 11 files changed, 1273 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-livepatch create mode 100644 arch/x86/include/asm/livepatch.h create mode 100644 arch/x86/kernel/livepatch.c create mode 100644 include/linux/livepatch.h create mode 100644 kernel/livepatch/Kconfig create mode 100644 kernel/livepatch/Makefile create mode 100644 kernel/livepatch/core.c (limited to 'kernel') diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch new file mode 100644 index 000000000000..5bf42a840b22 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-livepatch @@ -0,0 +1,44 @@ +What: /sys/kernel/livepatch +Date: Nov 2014 +KernelVersion: 3.19.0 +Contact: live-patching@vger.kernel.org +Description: + Interface for kernel live patching + + The /sys/kernel/livepatch directory contains subdirectories for + each loaded live patch module. + +What: /sys/kernel/livepatch/<patch> +Date: Nov 2014 +KernelVersion: 3.19.0 +Contact: live-patching@vger.kernel.org +Description: + The patch directory contains subdirectories for each kernel + object (vmlinux or a module) in which it patched functions. + +What: /sys/kernel/livepatch/<patch>/enabled +Date: Nov 2014 +KernelVersion: 3.19.0 +Contact: live-patching@vger.kernel.org +Description: + A writable attribute that indicates whether the patched + code is currently applied. Writing 0 will disable the patch + while writing 1 will re-enable the patch. + +What: /sys/kernel/livepatch/<patch>/<object> +Date: Nov 2014 +KernelVersion: 3.19.0 +Contact: live-patching@vger.kernel.org +Description: + The object directory contains subdirectories for each function + that is patched within the object. + +What: /sys/kernel/livepatch/<patch>/<object>/<function> +Date: Nov 2014 +KernelVersion: 3.19.0 +Contact: live-patching@vger.kernel.org +Description: + The function directory contains attributes regarding the + properties and state of the patched function. + + There are currently no such attributes.
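As a usage sketch (not part of this commit), the enabled attribute can be toggled from userspace; the helper and the patch name livepatch_sample below are illustrative:

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* write "0" (disable) or "1" (enable) to the patch's sysfs file */
	static int set_patch_enabled(const char *patch, int enable)
	{
		char path[128];
		int fd, ret;

		snprintf(path, sizeof(path),
			 "/sys/kernel/livepatch/%s/enabled", patch);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return -1;
		ret = (write(fd, enable ? "1" : "0", 1) == 1) ? 0 : -1;
		close(fd);
		return ret;
	}

	/* e.g. set_patch_enabled("livepatch_sample", 0); */
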
diff --git a/MAINTAINERS b/MAINTAINERS index ddb9ac8d32b3..df6a0784b466 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5784,6 +5784,19 @@ F: Documentation/misc-devices/lis3lv02d F: drivers/misc/lis3lv02d/ F: drivers/platform/x86/hp_accel.c +LIVE PATCHING +M: Josh Poimboeuf +M: Seth Jennings +M: Jiri Kosina +M: Vojtech Pavlik +S: Maintained +F: kernel/livepatch/ +F: include/linux/livepatch.h +F: arch/x86/include/asm/livepatch.h +F: arch/x86/kernel/livepatch.c +F: Documentation/ABI/testing/sysfs-kernel-livepatch +L: live-patching@vger.kernel.org + LLC (802.2) M: Arnaldo Carvalho de Melo S: Maintained diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ba397bde7948..460b31b79938 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -17,6 +17,7 @@ config X86_64 depends on 64BIT select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF + select ARCH_HAVE_LIVE_PATCHING ### Arch settings config X86 @@ -2008,6 +2009,8 @@ config CMDLINE_OVERRIDE This is used to work around broken boot loaders. This should be set to 'N' under normal conditions. +source "kernel/livepatch/Kconfig" + endmenu config ARCH_ENABLE_MEMORY_HOTPLUG diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h new file mode 100644 index 000000000000..d529db1b1edf --- /dev/null +++ b/arch/x86/include/asm/livepatch.h @@ -0,0 +1,37 @@ +/* + * livepatch.h - x86-specific Kernel Live Patching Core + * + * Copyright (C) 2014 Seth Jennings + * Copyright (C) 2014 SUSE + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see . 
+ */ + +#ifndef _ASM_X86_LIVEPATCH_H +#define _ASM_X86_LIVEPATCH_H + +#include + +#ifdef CONFIG_LIVE_PATCHING +#ifndef CC_USING_FENTRY +#error Your compiler must support -mfentry for live patching to work +#endif +extern int klp_write_module_reloc(struct module *mod, unsigned long type, + unsigned long loc, unsigned long value); + +#else +#error Live patching support is disabled; check CONFIG_LIVE_PATCHING +#endif + +#endif /* _ASM_X86_LIVEPATCH_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 5d4502c8b983..316b34e74c15 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -63,6 +63,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o obj-y += apic/ obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o +obj-$(CONFIG_LIVE_PATCHING) += livepatch.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o obj-$(CONFIG_X86_TSC) += trace_clock.o diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c new file mode 100644 index 000000000000..ff3c3101d003 --- /dev/null +++ b/arch/x86/kernel/livepatch.c @@ -0,0 +1,90 @@ +/* + * livepatch.c - x86-specific Kernel Live Patching Core + * + * Copyright (C) 2014 Seth Jennings + * Copyright (C) 2014 SUSE + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see . + */ + +#include +#include +#include +#include +#include +#include + +/** + * klp_write_module_reloc() - write a relocation in a module + * @mod: module in which the section to be modified is found + * @type: ELF relocation type (see asm/elf.h) + * @loc: address that the relocation should be written to + * @value: relocation value (sym address + addend) + * + * This function writes a relocation to the specified location for + * a particular module. + */ +int klp_write_module_reloc(struct module *mod, unsigned long type, + unsigned long loc, unsigned long value) +{ + int ret, numpages, size = 4; + bool readonly; + unsigned long val; + unsigned long core = (unsigned long)mod->module_core; + unsigned long core_ro_size = mod->core_ro_size; + unsigned long core_size = mod->core_size; + + switch (type) { + case R_X86_64_NONE: + return 0; + case R_X86_64_64: + val = value; + size = 8; + break; + case R_X86_64_32: + val = (u32)value; + break; + case R_X86_64_32S: + val = (s32)value; + break; + case R_X86_64_PC32: + val = (u32)(value - loc); + break; + default: + /* unsupported relocation type */ + return -EINVAL; + } + + if (loc < core || loc >= core + core_size) + /* loc does not point to any symbol inside the module */ + return -EINVAL; + + if (loc < core + core_ro_size) + readonly = true; + else + readonly = false; + + /* determine if the relocation spans a page boundary */ + numpages = ((loc & PAGE_MASK) == ((loc + size) & PAGE_MASK)) ? 
1 : 2; + + if (readonly) + set_memory_rw(loc & PAGE_MASK, numpages); + + ret = probe_kernel_write((void *)loc, &val, size); + + if (readonly) + set_memory_ro(loc & PAGE_MASK, numpages); + + return ret; +} diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h new file mode 100644 index 000000000000..950bc615842f --- /dev/null +++ b/include/linux/livepatch.h @@ -0,0 +1,133 @@ +/* + * livepatch.h - Kernel Live Patching Core + * + * Copyright (C) 2014 Seth Jennings + * Copyright (C) 2014 SUSE + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see . + */ + +#ifndef _LINUX_LIVEPATCH_H_ +#define _LINUX_LIVEPATCH_H_ + +#include +#include + +#if IS_ENABLED(CONFIG_LIVE_PATCHING) + +#include + +enum klp_state { + KLP_DISABLED, + KLP_ENABLED +}; + +/** + * struct klp_func - function structure for live patching + * @old_name: name of the function to be patched + * @new_func: pointer to the patched function code + * @old_addr: a hint conveying at what address the old function + * can be found (optional, vmlinux patches only) + * @kobj: kobject for sysfs resources + * @fops: ftrace operations structure + * @state: tracks function-level patch application state + */ +struct klp_func { + /* external */ + const char *old_name; + void *new_func; + /* + * The old_addr field is optional and can be used to resolve + * duplicate symbol names in the vmlinux object. If this + * information is not present, the symbol is located by name + * with kallsyms. If the name is not unique and old_addr is + * not provided, the patch application fails as there is no + * way to resolve the ambiguity. 
+ */ + unsigned long old_addr; + + /* internal */ + struct kobject kobj; + struct ftrace_ops *fops; + enum klp_state state; +}; + +/** + * struct klp_reloc - relocation structure for live patching + * @loc: address where the relocation will be written + * @val: address of the referenced symbol (optional, + * vmlinux patches only) + * @type: ELF relocation type + * @name: name of the referenced symbol (for lookup/verification) + * @addend: offset from the referenced symbol + * @external: symbol is either exported or within the live patch module itself + */ +struct klp_reloc { + unsigned long loc; + unsigned long val; + unsigned long type; + const char *name; + int addend; + int external; +}; + +/** + * struct klp_object - kernel object structure for live patching + * @name: module name (or NULL for vmlinux) + * @relocs: relocation entries to be applied at load time + * @funcs: function entries for functions to be patched in the object + * @kobj: kobject for sysfs resources + * @mod: kernel module associated with the patched object + * (NULL for vmlinux) + * @state: tracks object-level patch application state + */ +struct klp_object { + /* external */ + const char *name; + struct klp_reloc *relocs; + struct klp_func *funcs; + + /* internal */ + struct kobject *kobj; + struct module *mod; + enum klp_state state; +}; + +/** + * struct klp_patch - patch structure for live patching + * @mod: reference to the live patch module + * @objs: object entries for kernel objects to be patched + * @list: list node for global list of registered patches + * @kobj: kobject for sysfs resources + * @state: tracks patch-level application state + */ +struct klp_patch { + /* external */ + struct module *mod; + struct klp_object *objs; + + /* internal */ + struct list_head list; + struct kobject kobj; + enum klp_state state; +}; + +extern int klp_register_patch(struct klp_patch *); +extern int klp_unregister_patch(struct klp_patch *); +extern int klp_enable_patch(struct klp_patch *); +extern int klp_disable_patch(struct klp_patch *); + +#endif /* CONFIG_LIVE_PATCHING */ + +#endif /* _LINUX_LIVEPATCH_H_ */ diff --git a/kernel/Makefile b/kernel/Makefile index a59481a3fa6c..616994f0a76f 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -26,6 +26,7 @@ obj-y += power/ obj-y += printk/ obj-y += irq/ obj-y += rcu/ +obj-y += livepatch/ obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o obj-$(CONFIG_FREEZER) += freezer.o diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig new file mode 100644 index 000000000000..96da00fbc120 --- /dev/null +++ b/kernel/livepatch/Kconfig @@ -0,0 +1,18 @@ +config ARCH_HAVE_LIVE_PATCHING + boolean + help + Arch supports kernel live patching + +config LIVE_PATCHING + boolean "Kernel Live Patching" + depends on DYNAMIC_FTRACE_WITH_REGS + depends on MODULES + depends on SYSFS + depends on KALLSYMS_ALL + depends on ARCH_HAVE_LIVE_PATCHING + help + Say Y here if you want to support kernel live patching. + This option has no runtime impact until a kernel "patch" + module uses the interface provided by this option to register + a patch, causing calls to patched functions to be redirected + to new function code contained in the patch module. 
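To show how the structures and calls declared above fit together, here is a minimal sketch of a patch module built on this interface. It is illustrative only: the choice of the vmlinux function cmdline_proc_show as the patch target and the replacement body are example assumptions, and error handling is trimmed. Note that the funcs[] and objs[] arrays are terminated by an empty element, matching how the core iterates them:

	#include <linux/module.h>
	#include <linux/kernel.h>
	#include <linux/seq_file.h>
	#include <linux/livepatch.h>

	/* replacement for the vmlinux function cmdline_proc_show() */
	static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
	{
		seq_printf(m, "%s\n", "this has been live patched");
		return 0;
	}

	static struct klp_func funcs[] = {
		{
			.old_name = "cmdline_proc_show",
			.new_func = livepatch_cmdline_proc_show,
		}, { }
	};

	static struct klp_object objs[] = {
		{
			/* name being NULL means vmlinux */
			.funcs = funcs,
		}, { }
	};

	static struct klp_patch patch = {
		.mod = THIS_MODULE,
		.objs = objs,
	};

	static int livepatch_init(void)
	{
		int ret;

		ret = klp_register_patch(&patch);
		if (ret)
			return ret;
		ret = klp_enable_patch(&patch);
		if (ret) {
			WARN_ON(klp_unregister_patch(&patch));
			return ret;
		}
		return 0;
	}

	static void livepatch_exit(void)
	{
		/* a patch must be disabled before it can be unregistered */
		WARN_ON(klp_disable_patch(&patch));
		WARN_ON(klp_unregister_patch(&patch));
	}

	module_init(livepatch_init);
	module_exit(livepatch_exit);
	MODULE_LICENSE("GPL");
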
diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile new file mode 100644 index 000000000000..7c1f00861428 --- /dev/null +++ b/kernel/livepatch/Makefile @@ -0,0 +1,3 @@ +obj-$(CONFIG_LIVE_PATCHING) += livepatch.o + +livepatch-objs := core.o diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c new file mode 100644 index 000000000000..f99fe189d596 --- /dev/null +++ b/kernel/livepatch/core.c @@ -0,0 +1,930 @@ +/* + * core.c - Kernel Live Patching Core + * + * Copyright (C) 2014 Seth Jennings + * Copyright (C) 2014 SUSE + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see . + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * The klp_mutex protects the klp_patches list and state transitions of any + * structure reachable from the patches list. References to any structure must + * be obtained under mutex protection. + */ + +static DEFINE_MUTEX(klp_mutex); +static LIST_HEAD(klp_patches); + +static struct kobject *klp_root_kobj; + +static bool klp_is_module(struct klp_object *obj) +{ + return obj->name; +} + +static bool klp_is_object_loaded(struct klp_object *obj) +{ + return !obj->name || obj->mod; +} + +/* sets obj->mod if object is not vmlinux and module is found */ +static void klp_find_object_module(struct klp_object *obj) +{ + if (!klp_is_module(obj)) + return; + + mutex_lock(&module_mutex); + /* + * We don't need to take a reference on the module here because we have + * the klp_mutex, which is also taken by the module notifier. This + * prevents any module from unloading until we release the klp_mutex. + */ + obj->mod = find_module(obj->name); + mutex_unlock(&module_mutex); +} + +/* klp_mutex must be held by caller */ +static bool klp_is_patch_registered(struct klp_patch *patch) +{ + struct klp_patch *mypatch; + + list_for_each_entry(mypatch, &klp_patches, list) + if (mypatch == patch) + return true; + + return false; +} + +static bool klp_initialized(void) +{ + return klp_root_kobj; +} + +struct klp_find_arg { + const char *objname; + const char *name; + unsigned long addr; + /* + * If count == 0, the symbol was not found. If count == 1, a unique + * match was found and addr is set. If count > 1, there is + * unresolvable ambiguity among "count" number of symbols with the same + * name in the same object. + */ + unsigned long count; +}; + +static int klp_find_callback(void *data, const char *name, + struct module *mod, unsigned long addr) +{ + struct klp_find_arg *args = data; + + if ((mod && !args->objname) || (!mod && args->objname)) + return 0; + + if (strcmp(args->name, name)) + return 0; + + if (args->objname && strcmp(args->objname, mod->name)) + return 0; + + /* + * args->addr might be overwritten if another match is found + * but klp_find_object_symbol() handles this and only returns the + * addr if count == 1. 
+ */ + args->addr = addr; + args->count++; + + return 0; +} + +static int klp_find_object_symbol(const char *objname, const char *name, + unsigned long *addr) +{ + struct klp_find_arg args = { + .objname = objname, + .name = name, + .addr = 0, + .count = 0 + }; + + kallsyms_on_each_symbol(klp_find_callback, &args); + + if (args.count == 0) + pr_err("symbol '%s' not found in symbol table\n", name); + else if (args.count > 1) + pr_err("unresolvable ambiguity (%lu matches) on symbol '%s' in object '%s'\n", + args.count, name, objname); + else { + *addr = args.addr; + return 0; + } + + *addr = 0; + return -EINVAL; +} + +struct klp_verify_args { + const char *name; + const unsigned long addr; +}; + +static int klp_verify_callback(void *data, const char *name, + struct module *mod, unsigned long addr) +{ + struct klp_verify_args *args = data; + + if (!mod && + !strcmp(args->name, name) && + args->addr == addr) + return 1; + + return 0; +} + +static int klp_verify_vmlinux_symbol(const char *name, unsigned long addr) +{ + struct klp_verify_args args = { + .name = name, + .addr = addr, + }; + + if (kallsyms_on_each_symbol(klp_verify_callback, &args)) + return 0; + + pr_err("symbol '%s' not found at specified address 0x%016lx, kernel mismatch?", + name, addr); + return -EINVAL; +} + +static int klp_find_verify_func_addr(struct klp_object *obj, + struct klp_func *func) +{ + int ret; + +#if defined(CONFIG_RANDOMIZE_BASE) + /* KASLR is enabled, disregard old_addr from user */ + func->old_addr = 0; +#endif + + if (!func->old_addr || klp_is_module(obj)) + ret = klp_find_object_symbol(obj->name, func->old_name, + &func->old_addr); + else + ret = klp_verify_vmlinux_symbol(func->old_name, + func->old_addr); + + return ret; +} + +/* + * external symbols are located outside the parent object (where the parent + * object is either vmlinux or the kmod being patched). 
+ */ +static int klp_find_external_symbol(struct module *pmod, const char *name, + unsigned long *addr) +{ + const struct kernel_symbol *sym; + + /* first, check if it's an exported symbol */ + preempt_disable(); + sym = find_symbol(name, NULL, NULL, true, true); + preempt_enable(); + if (sym) { + *addr = sym->value; + return 0; + } + + /* otherwise check if it's in another .o within the patch module */ + return klp_find_object_symbol(pmod->name, name, addr); +} + +static int klp_write_object_relocations(struct module *pmod, + struct klp_object *obj) +{ + int ret; + struct klp_reloc *reloc; + + if (WARN_ON(!klp_is_object_loaded(obj))) + return -EINVAL; + + if (WARN_ON(!obj->relocs)) + return -EINVAL; + + for (reloc = obj->relocs; reloc->name; reloc++) { + if (!klp_is_module(obj)) { + ret = klp_verify_vmlinux_symbol(reloc->name, + reloc->val); + if (ret) + return ret; + } else { + /* module, reloc->val needs to be discovered */ + if (reloc->external) + ret = klp_find_external_symbol(pmod, + reloc->name, + &reloc->val); + else + ret = klp_find_object_symbol(obj->mod->name, + reloc->name, + &reloc->val); + if (ret) + return ret; + } + ret = klp_write_module_reloc(pmod, reloc->type, reloc->loc, + reloc->val + reloc->addend); + if (ret) { + pr_err("relocation failed for symbol '%s' at 0x%016lx (%d)\n", + reloc->name, reloc->val, ret); + return ret; + } + } + + return 0; +} + +static void notrace klp_ftrace_handler(unsigned long ip, + unsigned long parent_ip, + struct ftrace_ops *ops, + struct pt_regs *regs) +{ + struct klp_func *func = ops->private; + + regs->ip = (unsigned long)func->new_func; +} + +static int klp_disable_func(struct klp_func *func) +{ + int ret; + + if (WARN_ON(func->state != KLP_ENABLED)) + return -EINVAL; + + if (WARN_ON(!func->old_addr)) + return -EINVAL; + + ret = unregister_ftrace_function(func->fops); + if (ret) { + pr_err("failed to unregister ftrace handler for function '%s' (%d)\n", + func->old_name, ret); + return ret; + } + + ret = ftrace_set_filter_ip(func->fops, func->old_addr, 1, 0); + if (ret) + pr_warn("function unregister succeeded but failed to clear the filter\n"); + + func->state = KLP_DISABLED; + + return 0; +} + +static int klp_enable_func(struct klp_func *func) +{ + int ret; + + if (WARN_ON(!func->old_addr)) + return -EINVAL; + + if (WARN_ON(func->state != KLP_DISABLED)) + return -EINVAL; + + ret = ftrace_set_filter_ip(func->fops, func->old_addr, 0, 0); + if (ret) { + pr_err("failed to set ftrace filter for function '%s' (%d)\n", + func->old_name, ret); + return ret; + } + + ret = register_ftrace_function(func->fops); + if (ret) { + pr_err("failed to register ftrace handler for function '%s' (%d)\n", + func->old_name, ret); + ftrace_set_filter_ip(func->fops, func->old_addr, 1, 0); + } else { + func->state = KLP_ENABLED; + } + + return ret; +} + +static int klp_disable_object(struct klp_object *obj) +{ + struct klp_func *func; + int ret; + + for (func = obj->funcs; func->old_name; func++) { + if (func->state != KLP_ENABLED) + continue; + + ret = klp_disable_func(func); + if (ret) + return ret; + } + + obj->state = KLP_DISABLED; + + return 0; +} + +static int klp_enable_object(struct klp_object *obj) +{ + struct klp_func *func; + int ret; + + if (WARN_ON(obj->state != KLP_DISABLED)) + return -EINVAL; + + if (WARN_ON(!klp_is_object_loaded(obj))) + return -EINVAL; + + for (func = obj->funcs; func->old_name; func++) { + ret = klp_enable_func(func); + if (ret) + goto unregister; + } + obj->state = KLP_ENABLED; + + return 0; + +unregister: + 
WARN_ON(klp_disable_object(obj)); + return ret; +} + +static int __klp_disable_patch(struct klp_patch *patch) +{ + struct klp_object *obj; + int ret; + + pr_notice("disabling patch '%s'\n", patch->mod->name); + + for (obj = patch->objs; obj->funcs; obj++) { + if (obj->state != KLP_ENABLED) + continue; + + ret = klp_disable_object(obj); + if (ret) + return ret; + } + + patch->state = KLP_DISABLED; + + return 0; +} + +/** + * klp_disable_patch() - disables a registered patch + * @patch: The registered, enabled patch to be disabled + * + * Unregisters the patched functions from ftrace. + * + * Return: 0 on success, otherwise error + */ +int klp_disable_patch(struct klp_patch *patch) +{ + int ret; + + mutex_lock(&klp_mutex); + + if (!klp_is_patch_registered(patch)) { + ret = -EINVAL; + goto err; + } + + if (patch->state == KLP_DISABLED) { + ret = -EINVAL; + goto err; + } + + ret = __klp_disable_patch(patch); + +err: + mutex_unlock(&klp_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(klp_disable_patch); + +static int __klp_enable_patch(struct klp_patch *patch) +{ + struct klp_object *obj; + int ret; + + if (WARN_ON(patch->state != KLP_DISABLED)) + return -EINVAL; + + pr_notice_once("tainting kernel with TAINT_LIVEPATCH\n"); + add_taint(TAINT_LIVEPATCH, LOCKDEP_STILL_OK); + + pr_notice("enabling patch '%s'\n", patch->mod->name); + + for (obj = patch->objs; obj->funcs; obj++) { + klp_find_object_module(obj); + + if (!klp_is_object_loaded(obj)) + continue; + + ret = klp_enable_object(obj); + if (ret) + goto unregister; + } + + patch->state = KLP_ENABLED; + + return 0; + +unregister: + WARN_ON(__klp_disable_patch(patch)); + return ret; +} + +/** + * klp_enable_patch() - enables a registered patch + * @patch: The registered, disabled patch to be enabled + * + * Performs the needed symbol lookups and code relocations, + * then registers the patched functions with ftrace. 
+ * + * Return: 0 on success, otherwise error + */ +int klp_enable_patch(struct klp_patch *patch) +{ + int ret; + + mutex_lock(&klp_mutex); + + if (!klp_is_patch_registered(patch)) { + ret = -EINVAL; + goto err; + } + + ret = __klp_enable_patch(patch); + +err: + mutex_unlock(&klp_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(klp_enable_patch); + +/* + * Sysfs Interface + * + * /sys/kernel/livepatch + * /sys/kernel/livepatch/ + * /sys/kernel/livepatch//enabled + * /sys/kernel/livepatch// + * /sys/kernel/livepatch/// + */ + +static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct klp_patch *patch; + int ret; + unsigned long val; + + ret = kstrtoul(buf, 10, &val); + if (ret) + return -EINVAL; + + if (val != KLP_DISABLED && val != KLP_ENABLED) + return -EINVAL; + + patch = container_of(kobj, struct klp_patch, kobj); + + mutex_lock(&klp_mutex); + + if (val == patch->state) { + /* already in requested state */ + ret = -EINVAL; + goto err; + } + + if (val == KLP_ENABLED) { + ret = __klp_enable_patch(patch); + if (ret) + goto err; + } else { + ret = __klp_disable_patch(patch); + if (ret) + goto err; + } + + mutex_unlock(&klp_mutex); + + return count; + +err: + mutex_unlock(&klp_mutex); + return ret; +} + +static ssize_t enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct klp_patch *patch; + + patch = container_of(kobj, struct klp_patch, kobj); + return snprintf(buf, PAGE_SIZE-1, "%d\n", patch->state); +} + +static struct kobj_attribute enabled_kobj_attr = __ATTR_RW(enabled); +static struct attribute *klp_patch_attrs[] = { + &enabled_kobj_attr.attr, + NULL +}; + +static void klp_kobj_release_patch(struct kobject *kobj) +{ + /* + * Once we have a consistency model we'll need to module_put() the + * patch module here. See klp_register_patch() for more details. + */ +} + +static struct kobj_type klp_ktype_patch = { + .release = klp_kobj_release_patch, + .sysfs_ops = &kobj_sysfs_ops, + .default_attrs = klp_patch_attrs, +}; + +static void klp_kobj_release_func(struct kobject *kobj) +{ + struct klp_func *func; + + func = container_of(kobj, struct klp_func, kobj); + kfree(func->fops); +} + +static struct kobj_type klp_ktype_func = { + .release = klp_kobj_release_func, + .sysfs_ops = &kobj_sysfs_ops, +}; + +/* + * Free all functions' kobjects in the array up to some limit. When limit is + * NULL, all kobjects are freed. + */ +static void klp_free_funcs_limited(struct klp_object *obj, + struct klp_func *limit) +{ + struct klp_func *func; + + for (func = obj->funcs; func->old_name && func != limit; func++) + kobject_put(&func->kobj); +} + +/* Clean up when a patched object is unloaded */ +static void klp_free_object_loaded(struct klp_object *obj) +{ + struct klp_func *func; + + obj->mod = NULL; + + for (func = obj->funcs; func->old_name; func++) + func->old_addr = 0; +} + +/* + * Free all objects' kobjects in the array up to some limit. When limit is + * NULL, all kobjects are freed. 
+ */ +static void klp_free_objects_limited(struct klp_patch *patch, + struct klp_object *limit) +{ + struct klp_object *obj; + + for (obj = patch->objs; obj->funcs && obj != limit; obj++) { + klp_free_funcs_limited(obj, NULL); + kobject_put(obj->kobj); + } +} + +static void klp_free_patch(struct klp_patch *patch) +{ + klp_free_objects_limited(patch, NULL); + if (!list_empty(&patch->list)) + list_del(&patch->list); + kobject_put(&patch->kobj); +} + +static int klp_init_func(struct klp_object *obj, struct klp_func *func) +{ + struct ftrace_ops *ops; + int ret; + + ops = kzalloc(sizeof(*ops), GFP_KERNEL); + if (!ops) + return -ENOMEM; + + ops->private = func; + ops->func = klp_ftrace_handler; + ops->flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_DYNAMIC; + func->fops = ops; + func->state = KLP_DISABLED; + + ret = kobject_init_and_add(&func->kobj, &klp_ktype_func, + obj->kobj, func->old_name); + if (ret) { + kfree(func->fops); + return ret; + } + + return 0; +} + +/* parts of the initialization that is done only when the object is loaded */ +static int klp_init_object_loaded(struct klp_patch *patch, + struct klp_object *obj) +{ + struct klp_func *func; + int ret; + + if (obj->relocs) { + ret = klp_write_object_relocations(patch->mod, obj); + if (ret) + return ret; + } + + for (func = obj->funcs; func->old_name; func++) { + ret = klp_find_verify_func_addr(obj, func); + if (ret) + return ret; + } + + return 0; +} + +static int klp_init_object(struct klp_patch *patch, struct klp_object *obj) +{ + struct klp_func *func; + int ret; + const char *name; + + if (!obj->funcs) + return -EINVAL; + + obj->state = KLP_DISABLED; + + klp_find_object_module(obj); + + name = klp_is_module(obj) ? obj->name : "vmlinux"; + obj->kobj = kobject_create_and_add(name, &patch->kobj); + if (!obj->kobj) + return -ENOMEM; + + for (func = obj->funcs; func->old_name; func++) { + ret = klp_init_func(obj, func); + if (ret) + goto free; + } + + if (klp_is_object_loaded(obj)) { + ret = klp_init_object_loaded(patch, obj); + if (ret) + goto free; + } + + return 0; + +free: + klp_free_funcs_limited(obj, func); + kobject_put(obj->kobj); + return ret; +} + +static int klp_init_patch(struct klp_patch *patch) +{ + struct klp_object *obj; + int ret; + + if (!patch->objs) + return -EINVAL; + + mutex_lock(&klp_mutex); + + patch->state = KLP_DISABLED; + + ret = kobject_init_and_add(&patch->kobj, &klp_ktype_patch, + klp_root_kobj, patch->mod->name); + if (ret) + goto unlock; + + for (obj = patch->objs; obj->funcs; obj++) { + ret = klp_init_object(patch, obj); + if (ret) + goto free; + } + + list_add(&patch->list, &klp_patches); + + mutex_unlock(&klp_mutex); + + return 0; + +free: + klp_free_objects_limited(patch, obj); + kobject_put(&patch->kobj); +unlock: + mutex_unlock(&klp_mutex); + return ret; +} + +/** + * klp_unregister_patch() - unregisters a patch + * @patch: Disabled patch to be unregistered + * + * Frees the data structures and removes the sysfs interface. 
+ * + * Return: 0 on success, otherwise error + */ +int klp_unregister_patch(struct klp_patch *patch) +{ + int ret = 0; + + mutex_lock(&klp_mutex); + + if (!klp_is_patch_registered(patch)) { + ret = -EINVAL; + goto out; + } + + if (patch->state == KLP_ENABLED) { + ret = -EBUSY; + goto out; + } + + klp_free_patch(patch); + +out: + mutex_unlock(&klp_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(klp_unregister_patch); + +/** + * klp_register_patch() - registers a patch + * @patch: Patch to be registered + * + * Initializes the data structure associated with the patch and + * creates the sysfs interface. + * + * Return: 0 on success, otherwise error + */ +int klp_register_patch(struct klp_patch *patch) +{ + int ret; + + if (!klp_initialized()) + return -ENODEV; + + if (!patch || !patch->mod) + return -EINVAL; + + /* + * A reference is taken on the patch module to prevent it from being + * unloaded. Right now, we don't allow patch modules to unload since + * there is currently no method to determine if a thread is still + * running in the patched code contained in the patch module once + * the ftrace registration is successful. + */ + if (!try_module_get(patch->mod)) + return -ENODEV; + + ret = klp_init_patch(patch); + if (ret) + module_put(patch->mod); + + return ret; +} +EXPORT_SYMBOL_GPL(klp_register_patch); + +static void klp_module_notify_coming(struct klp_patch *patch, + struct klp_object *obj) +{ + struct module *pmod = patch->mod; + struct module *mod = obj->mod; + int ret; + + ret = klp_init_object_loaded(patch, obj); + if (ret) + goto err; + + if (patch->state == KLP_DISABLED) + return; + + pr_notice("applying patch '%s' to loading module '%s'\n", + pmod->name, mod->name); + + ret = klp_enable_object(obj); + if (!ret) + return; + +err: + pr_warn("failed to apply patch '%s' to module '%s' (%d)\n", + pmod->name, mod->name, ret); +} + +static void klp_module_notify_going(struct klp_patch *patch, + struct klp_object *obj) +{ + struct module *pmod = patch->mod; + struct module *mod = obj->mod; + int ret; + + if (patch->state == KLP_DISABLED) + goto disabled; + + pr_notice("reverting patch '%s' on unloading module '%s'\n", + pmod->name, mod->name); + + ret = klp_disable_object(obj); + if (ret) + pr_warn("failed to revert patch '%s' on module '%s' (%d)\n", + pmod->name, mod->name, ret); + +disabled: + klp_free_object_loaded(obj); +} + +static int klp_module_notify(struct notifier_block *nb, unsigned long action, + void *data) +{ + struct module *mod = data; + struct klp_patch *patch; + struct klp_object *obj; + + if (action != MODULE_STATE_COMING && action != MODULE_STATE_GOING) + return 0; + + mutex_lock(&klp_mutex); + + list_for_each_entry(patch, &klp_patches, list) { + for (obj = patch->objs; obj->funcs; obj++) { + if (!klp_is_module(obj) || strcmp(obj->name, mod->name)) + continue; + + if (action == MODULE_STATE_COMING) { + obj->mod = mod; + klp_module_notify_coming(patch, obj); + } else /* MODULE_STATE_GOING */ + klp_module_notify_going(patch, obj); + + break; + } + } + + mutex_unlock(&klp_mutex); + + return 0; +} + +static struct notifier_block klp_module_nb = { + .notifier_call = klp_module_notify, + .priority = INT_MIN+1, /* called late but before ftrace notifier */ +}; + +static int klp_init(void) +{ + int ret; + + ret = register_module_notifier(&klp_module_nb); + if (ret) + return ret; + + klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj); + if (!klp_root_kobj) { + ret = -ENOMEM; + goto unregister; + } + + return 0; + +unregister: + 
unregister_module_notifier(&klp_module_nb); + return ret; +} + +module_init(klp_init); -- cgit v1.2.3 From b5bfc51707f1b56b0b733980bb4fcc0562bf02d8 Mon Sep 17 00:00:00 2001 From: Li Bin Date: Fri, 19 Dec 2014 14:11:17 +0800 Subject: livepatch: move x86 specific ftrace handler code to arch/x86 The execution-flow redirection performed in the livepatch ftrace handler depends on the specific architecture. This patch introduces a klp_arch_set_pc() interface (like kgdb_arch_set_pc()) to change the pt_regs. Signed-off-by: Li Bin Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- arch/x86/include/asm/livepatch.h | 5 +++++ kernel/livepatch/core.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h index d529db1b1edf..b5608d7757fd 100644 --- a/arch/x86/include/asm/livepatch.h +++ b/arch/x86/include/asm/livepatch.h @@ -22,6 +22,7 @@ #define _ASM_X86_LIVEPATCH_H #include +#include #ifdef CONFIG_LIVE_PATCHING #ifndef CC_USING_FENTRY @@ -30,6 +31,10 @@ extern int klp_write_module_reloc(struct module *mod, unsigned long type, unsigned long loc, unsigned long value); +static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip) +{ + regs->ip = ip; +} #else #error Live patching support is disabled; check CONFIG_LIVE_PATCHING #endif diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index f99fe189d596..07a2db9d01e6 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -272,7 +272,7 @@ static void notrace klp_ftrace_handler(unsigned long ip, { struct klp_func *func = ops->private; - regs->ip = (unsigned long)func->new_func; + klp_arch_set_pc(regs, (unsigned long)func->new_func); } static int klp_disable_func(struct klp_func *func) -- cgit v1.2.3 From 33e8612f64d4973ae3617a01224ec02b7b879597 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Mon, 22 Dec 2014 13:39:54 +0100 Subject: livepatch: use FTRACE_OPS_FL_IPMODIFY Use the FTRACE_OPS_FL_IPMODIFY flag to prevent conflicts with other ftrace users who also modify regs->ip. Signed-off-by: Josh Poimboeuf Reviewed-by: Petr Mladek Acked-by: Masami Hiramatsu Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 07a2db9d01e6..6f6387912da7 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -641,7 +641,8 @@ static int klp_init_func(struct klp_object *obj, struct klp_func *func) ops->private = func; ops->func = klp_ftrace_handler; - ops->flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_DYNAMIC; + ops->flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_DYNAMIC | + FTRACE_OPS_FL_IPMODIFY; func->fops = ops; func->state = KLP_DISABLED; -- cgit v1.2.3 From cf6ab6d9143b157786bf29bca5c32e55234bb07d Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 15 Dec 2014 20:13:31 -0500 Subject: tracing: Add ref count to tracer for when it is being read by pipe When one of the trace pipe files is being read (by either trace_pipe or trace_pipe_raw), do not allow the current_trace to change. Adding a ref count that is incremented when the pipe files are opened prevents the current_trace from being changed while they are open. This will allow for the removal of the global trace_types_lock from reading the pipe buffers (which is currently a bottleneck).
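Distilled from the diff below, the guard is a plain reference count; in the kernel all three paths run under trace_types_lock, so the check cannot race the increments. A simplified model (the two structures are stand-ins, not the real definitions):

	#include <linux/errno.h>

	struct tracer { int ref; };
	struct trace_array { struct tracer *current_trace; };

	static void pipe_open(struct trace_array *tr)
	{
		tr->current_trace->ref++;	/* a pipe reader attached */
	}

	static void pipe_release(struct trace_array *tr)
	{
		tr->current_trace->ref--;	/* the reader went away */
	}

	static int set_tracer(struct trace_array *tr, struct tracer *t)
	{
		if (tr->current_trace->ref)
			return -EBUSY;		/* readers active: refuse */
		tr->current_trace = t;		/* safe: nobody is reading */
		return 0;
	}
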
Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 16 +++++++++++++++- kernel/trace/trace.h | 1 + 2 files changed, 16 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2e767972e99c..ed3fba1d6570 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4140,6 +4140,12 @@ static int tracing_set_tracer(struct trace_array *tr, const char *buf) goto out; } + /* If trace pipe files are being read, we can't change the tracer */ + if (tr->current_trace->ref) { + ret = -EBUSY; + goto out; + } + trace_branch_disable(); tr->current_trace->enabled--; @@ -4363,6 +4369,8 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp) iter->trace->pipe_open(iter); nonseekable_open(inode, filp); + + tr->current_trace->ref++; out: mutex_unlock(&trace_types_lock); return ret; @@ -4382,6 +4390,8 @@ static int tracing_release_pipe(struct inode *inode, struct file *file) mutex_lock(&trace_types_lock); + tr->current_trace->ref--; + if (iter->trace->pipe_close) iter->trace->pipe_close(iter); @@ -5331,6 +5341,8 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp) filp->private_data = info; + tr->current_trace->ref++; + mutex_unlock(&trace_types_lock); ret = nonseekable_open(inode, filp); @@ -5437,6 +5449,8 @@ static int tracing_buffers_release(struct inode *inode, struct file *file) mutex_lock(&trace_types_lock); + iter->tr->current_trace->ref--; + __trace_array_put(iter->tr); if (info->spare) @@ -6416,7 +6430,7 @@ static int instance_delete(const char *name) goto out_unlock; ret = -EBUSY; - if (tr->ref) + if (tr->ref || (tr->current_trace && tr->current_trace->ref)) goto out_unlock; list_del(&tr->list); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 8de48bac1ce2..0eddfeb05fee 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -388,6 +388,7 @@ struct tracer { struct tracer *next; struct tracer_flags *flags; int enabled; + int ref; bool print_max; bool allow_instances; #ifdef CONFIG_TRACER_MAX_TRACE -- cgit v1.2.3 From d716ff71dd12bc6328f84a9ec1c3647daf01c827 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 15 Dec 2014 22:31:07 -0500 Subject: tracing: Remove taking of trace_types_lock in pipe files Taking the global mutex "trace_types_lock" in the trace_pipe files causes a bottleneck, as most of the pipe files can be read per cpu and there is no reason to serialize them. The current_trace variable was given a ref count and it cannot change while the ref count is non-zero. Opening the trace_pipe files increments the ref count (and closing them decrements it), so that the lock no longer needs to be taken when accessing the current_trace variable. Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 110 +++++++++++++-------------------------------- 1 file changed, 28 insertions(+), 82 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index ed3fba1d6570..7669b1f3234e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4332,17 +4332,7 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp) } trace_seq_init(&iter->seq); - - /* - * We make a copy of the current tracer to avoid concurrent - * changes on it while we are reading.
- */ - iter->trace = kmalloc(sizeof(*iter->trace), GFP_KERNEL); - if (!iter->trace) { - ret = -ENOMEM; - goto fail; - } - *iter->trace = *tr->current_trace; + iter->trace = tr->current_trace; if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) { ret = -ENOMEM; @@ -4399,7 +4389,6 @@ static int tracing_release_pipe(struct inode *inode, struct file *file) free_cpumask_var(iter->started); mutex_destroy(&iter->mutex); - kfree(iter->trace); kfree(iter); trace_array_put(tr); @@ -4432,7 +4421,7 @@ tracing_poll_pipe(struct file *filp, poll_table *poll_table) return trace_poll(iter, filp, poll_table); } -/* Must be called with trace_types_lock mutex held. */ +/* Must be called with iter->mutex held. */ static int tracing_wait_pipe(struct file *filp) { struct trace_iterator *iter = filp->private_data; @@ -4477,7 +4466,6 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) { struct trace_iterator *iter = filp->private_data; - struct trace_array *tr = iter->tr; ssize_t sret; /* return any leftover data */ @@ -4487,12 +4475,6 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, trace_seq_init(&iter->seq); - /* copy the tracer to avoid using a global lock all around */ - mutex_lock(&trace_types_lock); - if (unlikely(iter->trace->name != tr->current_trace->name)) - *iter->trace = *tr->current_trace; - mutex_unlock(&trace_types_lock); - /* * Avoid more than one consumer on a single file descriptor * This is just a matter of traces coherency, the ring buffer itself @@ -4652,7 +4634,6 @@ static ssize_t tracing_splice_read_pipe(struct file *filp, .ops = &tracing_pipe_buf_ops, .spd_release = tracing_spd_release_pipe, }; - struct trace_array *tr = iter->tr; ssize_t ret; size_t rem; unsigned int i; @@ -4660,12 +4641,6 @@ static ssize_t tracing_splice_read_pipe(struct file *filp, if (splice_grow_spd(pipe, &spd)) return -ENOMEM; - /* copy the tracer to avoid using a global lock all around */ - mutex_lock(&trace_types_lock); - if (unlikely(iter->trace->name != tr->current_trace->name)) - *iter->trace = *tr->current_trace; - mutex_unlock(&trace_types_lock); - mutex_lock(&iter->mutex); if (iter->trace->splice_read) { @@ -5373,21 +5348,16 @@ tracing_buffers_read(struct file *filp, char __user *ubuf, if (!count) return 0; - mutex_lock(&trace_types_lock); - #ifdef CONFIG_TRACER_MAX_TRACE - if (iter->snapshot && iter->tr->current_trace->use_max_tr) { - size = -EBUSY; - goto out_unlock; - } + if (iter->snapshot && iter->tr->current_trace->use_max_tr) + return -EBUSY; #endif if (!info->spare) info->spare = ring_buffer_alloc_read_page(iter->trace_buffer->buffer, iter->cpu_file); - size = -ENOMEM; if (!info->spare) - goto out_unlock; + return -ENOMEM; /* Do we have previous read data to read? 
*/ if (info->read < PAGE_SIZE) @@ -5403,21 +5373,16 @@ tracing_buffers_read(struct file *filp, char __user *ubuf, if (ret < 0) { if (trace_empty(iter)) { - if ((filp->f_flags & O_NONBLOCK)) { - size = -EAGAIN; - goto out_unlock; - } - mutex_unlock(&trace_types_lock); + if ((filp->f_flags & O_NONBLOCK)) + return -EAGAIN; + ret = wait_on_pipe(iter, false); - mutex_lock(&trace_types_lock); - if (ret) { - size = ret; - goto out_unlock; - } + if (ret) + return ret; + goto again; } - size = 0; - goto out_unlock; + return 0; } info->read = 0; @@ -5427,18 +5392,14 @@ tracing_buffers_read(struct file *filp, char __user *ubuf, size = count; ret = copy_to_user(ubuf, info->spare + info->read, size); - if (ret == size) { - size = -EFAULT; - goto out_unlock; - } + if (ret == size) + return -EFAULT; + size -= ret; *ppos += size; info->read += size; - out_unlock: - mutex_unlock(&trace_types_lock); - return size; } @@ -5536,30 +5497,20 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, int entries, size, i; ssize_t ret = 0; - mutex_lock(&trace_types_lock); - #ifdef CONFIG_TRACER_MAX_TRACE - if (iter->snapshot && iter->tr->current_trace->use_max_tr) { - ret = -EBUSY; - goto out; - } + if (iter->snapshot && iter->tr->current_trace->use_max_tr) + return -EBUSY; #endif - if (splice_grow_spd(pipe, &spd)) { - ret = -ENOMEM; - goto out; - } + if (splice_grow_spd(pipe, &spd)) + return -ENOMEM; - if (*ppos & (PAGE_SIZE - 1)) { - ret = -EINVAL; - goto out; - } + if (*ppos & (PAGE_SIZE - 1)) + return -EINVAL; if (len & (PAGE_SIZE - 1)) { - if (len < PAGE_SIZE) { - ret = -EINVAL; - goto out; - } + if (len < PAGE_SIZE) + return -EINVAL; len &= PAGE_MASK; } @@ -5620,25 +5571,20 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, /* did we read anything? */ if (!spd.nr_pages) { if (ret) - goto out; + return ret; + + if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) + return -EAGAIN; - if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) { - ret = -EAGAIN; - goto out; - } - mutex_unlock(&trace_types_lock); ret = wait_on_pipe(iter, true); - mutex_lock(&trace_types_lock); if (ret) - goto out; + return ret; goto again; } ret = splice_to_pipe(pipe, &spd); splice_shrink_spd(&spd); -out: - mutex_unlock(&trace_types_lock); return ret; } -- cgit v1.2.3 From 74d23cc704d19732e70ef1579a669f7d5f09dd9a Mon Sep 17 00:00:00 2001 From: Richard Cochran Date: Sun, 21 Dec 2014 19:46:56 +0100 Subject: time: move the timecounter/cyclecounter code into its own file. The timecounter code has almost nothing to do with the clocksource code. Let it live in its own file. This will help isolate the timecounter users from the clocksource users in the source tree. Signed-off-by: Richard Cochran Acked-by: Jeff Kirsher Signed-off-by: David S. 
Miller --- drivers/net/ethernet/amd/xgbe/xgbe.h | 2 +- drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 +- drivers/net/ethernet/freescale/fec.h | 1 + drivers/net/ethernet/intel/e1000e/e1000.h | 2 +- drivers/net/ethernet/intel/igb/igb.h | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +- drivers/net/ethernet/ti/cpts.h | 1 + include/clocksource/arm_arch_timer.h | 2 +- include/linux/clocksource.h | 102 ----------------------- include/linux/mlx4/device.h | 2 +- include/linux/timecounter.h | 122 ++++++++++++++++++++++++++++ include/linux/types.h | 3 + kernel/time/Makefile | 2 +- kernel/time/clocksource.c | 76 ----------------- kernel/time/timecounter.c | 95 ++++++++++++++++++++++ sound/pci/hda/hda_priv.h | 2 +- 16 files changed, 231 insertions(+), 187 deletions(-) create mode 100644 include/linux/timecounter.h create mode 100644 kernel/time/timecounter.c (limited to 'kernel') diff --git a/drivers/net/ethernet/amd/xgbe/xgbe.h b/drivers/net/ethernet/amd/xgbe/xgbe.h index f9ec762ac3f0..2af6affc35a7 100644 --- a/drivers/net/ethernet/amd/xgbe/xgbe.h +++ b/drivers/net/ethernet/amd/xgbe/xgbe.h @@ -124,7 +124,7 @@ #include #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h index c3a6072134f5..792ba72fb5c8 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h @@ -22,7 +22,7 @@ #include #include -#include +#include /* compilation time flags */ diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h index 469691ad4a1e..df8bbddaeb37 100644 --- a/drivers/net/ethernet/freescale/fec.h +++ b/drivers/net/ethernet/freescale/fec.h @@ -16,6 +16,7 @@ #include #include #include +#include #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \ defined(CONFIG_M520x) || defined(CONFIG_M532x) || \ diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h index 7785240a0da1..9416e5a7e0c8 100644 --- a/drivers/net/ethernet/intel/e1000e/e1000.h +++ b/drivers/net/ethernet/intel/e1000e/e1000.h @@ -34,7 +34,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index 82d891e183b1..ee22da391474 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -29,7 +29,7 @@ #include "e1000_mac.h" #include "e1000_82575.h" -#include +#include #include #include #include diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index b6137be43920..38fc64cf5dca 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -38,7 +38,7 @@ #include #include -#include +#include #include #include diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h index 1a581ef7eee8..69a46b92c7d6 100644 --- a/drivers/net/ethernet/ti/cpts.h +++ b/drivers/net/ethernet/ti/cpts.h @@ -27,6 +27,7 @@ #include #include #include +#include struct cpsw_cpts { u32 idver; /* Identification and version */ diff --git a/include/clocksource/arm_arch_timer.h b/include/clocksource/arm_arch_timer.h index 6d26b40cbf5d..9916d0e4eff5 100644 --- a/include/clocksource/arm_arch_timer.h +++ b/include/clocksource/arm_arch_timer.h @@ -16,7 +16,7 @@ #ifndef __CLKSOURCE_ARM_ARCH_TIMER_H #define __CLKSOURCE_ARM_ARCH_TIMER_H -#include +#include #include #define ARCH_TIMER_CTRL_ENABLE 
(1 << 0) diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index abcafaa20b86..9c78d15d33e4 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -18,8 +18,6 @@ #include #include -/* clocksource cycle base type */ -typedef u64 cycle_t; struct clocksource; struct module; @@ -27,106 +25,6 @@ struct module; #include #endif -/** - * struct cyclecounter - hardware abstraction for a free running counter - * Provides completely state-free accessors to the underlying hardware. - * Depending on which hardware it reads, the cycle counter may wrap - * around quickly. Locking rules (if necessary) have to be defined - * by the implementor and user of specific instances of this API. - * - * @read: returns the current cycle value - * @mask: bitmask for two's complement - * subtraction of non 64 bit counters, - * see CLOCKSOURCE_MASK() helper macro - * @mult: cycle to nanosecond multiplier - * @shift: cycle to nanosecond divisor (power of two) - */ -struct cyclecounter { - cycle_t (*read)(const struct cyclecounter *cc); - cycle_t mask; - u32 mult; - u32 shift; -}; - -/** - * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds - * Contains the state needed by timecounter_read() to detect - * cycle counter wrap around. Initialize with - * timecounter_init(). Also used to convert cycle counts into the - * corresponding nanosecond counts with timecounter_cyc2time(). Users - * of this code are responsible for initializing the underlying - * cycle counter hardware, locking issues and reading the time - * more often than the cycle counter wraps around. The nanosecond - * counter will only wrap around after ~585 years. - * - * @cc: the cycle counter used by this instance - * @cycle_last: most recent cycle counter value seen by - * timecounter_read() - * @nsec: continuously increasing count - */ -struct timecounter { - const struct cyclecounter *cc; - cycle_t cycle_last; - u64 nsec; -}; - -/** - * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds - * @cc: Pointer to cycle counter. - * @cycles: Cycles - * - * XXX - This could use some mult_lxl_ll() asm optimization. Same code - * as in cyc2ns, but with unsigned result. - */ -static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc, - cycle_t cycles) -{ - u64 ret = (u64)cycles; - ret = (ret * cc->mult) >> cc->shift; - return ret; -} - -/** - * timecounter_init - initialize a time counter - * @tc: Pointer to time counter which is to be initialized/reset - * @cc: A cycle counter, ready to be used. - * @start_tstamp: Arbitrary initial time stamp. - * - * After this call the current cycle register (roughly) corresponds to - * the initial time stamp. Every call to timecounter_read() increments - * the time stamp counter by the number of elapsed nanoseconds. - */ -extern void timecounter_init(struct timecounter *tc, - const struct cyclecounter *cc, - u64 start_tstamp); - -/** - * timecounter_read - return nanoseconds elapsed since timecounter_init() - * plus the initial time stamp - * @tc: Pointer to time counter. - * - * In other words, keeps track of time since the same epoch as - * the function which generated the initial time stamp. - */ -extern u64 timecounter_read(struct timecounter *tc); - -/** - * timecounter_cyc2time - convert a cycle counter to same - * time base as values returned by - * timecounter_read() - * @tc: Pointer to time counter. 
- * @cycle_tstamp: a value returned by tc->cc->read() - * - * Cycle counts that are converted correctly as long as they - * fall into the interval [-1/2 max cycle count, +1/2 max cycle count], - * with "max cycle count" == cs->mask+1. - * - * This allows conversion of cycle counter values which were generated - * in the past. - */ -extern u64 timecounter_cyc2time(struct timecounter *tc, - cycle_t cycle_tstamp); - /** * struct clocksource - hardware abstraction for a free running counter * Provides mostly state-free accessors to the underlying hardware. diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 25c791e295fd..f1e41b33462f 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -42,7 +42,7 @@ #include -#include +#include #define MAX_MSIX_P_PORT 17 #define MAX_MSIX 64 diff --git a/include/linux/timecounter.h b/include/linux/timecounter.h new file mode 100644 index 000000000000..146f07a6651b --- /dev/null +++ b/include/linux/timecounter.h @@ -0,0 +1,122 @@ +/* + * linux/include/linux/timecounter.h + * + * based on code that migrated away from + * linux/include/linux/clocksource.h + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ +#ifndef _LINUX_TIMECOUNTER_H +#define _LINUX_TIMECOUNTER_H + +#include + +/** + * struct cyclecounter - hardware abstraction for a free running counter + * Provides completely state-free accessors to the underlying hardware. + * Depending on which hardware it reads, the cycle counter may wrap + * around quickly. Locking rules (if necessary) have to be defined + * by the implementor and user of specific instances of this API. + * + * @read: returns the current cycle value + * @mask: bitmask for two's complement + * subtraction of non 64 bit counters, + * see CLOCKSOURCE_MASK() helper macro + * @mult: cycle to nanosecond multiplier + * @shift: cycle to nanosecond divisor (power of two) + */ +struct cyclecounter { + cycle_t (*read)(const struct cyclecounter *cc); + cycle_t mask; + u32 mult; + u32 shift; +}; + +/** + * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds + * Contains the state needed by timecounter_read() to detect + * cycle counter wrap around. Initialize with + * timecounter_init(). Also used to convert cycle counts into the + * corresponding nanosecond counts with timecounter_cyc2time(). Users + * of this code are responsible for initializing the underlying + * cycle counter hardware, locking issues and reading the time + * more often than the cycle counter wraps around. The nanosecond + * counter will only wrap around after ~585 years. + * + * @cc: the cycle counter used by this instance + * @cycle_last: most recent cycle counter value seen by + * timecounter_read() + * @nsec: continuously increasing count + */ +struct timecounter { + const struct cyclecounter *cc; + cycle_t cycle_last; + u64 nsec; +}; + +/** + * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds + * @cc: Pointer to cycle counter. + * @cycles: Cycles + * + * XXX - This could use some mult_lxl_ll() asm optimization. 
Same code + * as in cyc2ns, but with unsigned result. + */ +static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc, + cycle_t cycles) +{ + u64 ret = (u64)cycles; + ret = (ret * cc->mult) >> cc->shift; + return ret; +} + +/** + * timecounter_init - initialize a time counter + * @tc: Pointer to time counter which is to be initialized/reset + * @cc: A cycle counter, ready to be used. + * @start_tstamp: Arbitrary initial time stamp. + * + * After this call the current cycle register (roughly) corresponds to + * the initial time stamp. Every call to timecounter_read() increments + * the time stamp counter by the number of elapsed nanoseconds. + */ +extern void timecounter_init(struct timecounter *tc, + const struct cyclecounter *cc, + u64 start_tstamp); + +/** + * timecounter_read - return nanoseconds elapsed since timecounter_init() + * plus the initial time stamp + * @tc: Pointer to time counter. + * + * In other words, keeps track of time since the same epoch as + * the function which generated the initial time stamp. + */ +extern u64 timecounter_read(struct timecounter *tc); + +/** + * timecounter_cyc2time - convert a cycle counter to same + * time base as values returned by + * timecounter_read() + * @tc: Pointer to time counter. + * @cycle_tstamp: a value returned by tc->cc->read() + * + * Cycle counts that are converted correctly as long as they + * fall into the interval [-1/2 max cycle count, +1/2 max cycle count], + * with "max cycle count" == cs->mask+1. + * + * This allows conversion of cycle counter values which were generated + * in the past. + */ +extern u64 timecounter_cyc2time(struct timecounter *tc, + cycle_t cycle_tstamp); + +#endif diff --git a/include/linux/types.h b/include/linux/types.h index a0bb7048687f..62323825cff9 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -213,5 +213,8 @@ struct callback_head { }; #define rcu_head callback_head +/* clocksource cycle base type */ +typedef u64 cycle_t; + #endif /* __ASSEMBLY__ */ #endif /* _LINUX_TYPES_H */ diff --git a/kernel/time/Makefile b/kernel/time/Makefile index f622cf28628a..c09c07817d7a 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -1,6 +1,6 @@ obj-y += time.o timer.o hrtimer.o itimer.o posix-timers.o posix-cpu-timers.o obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o -obj-y += timeconv.o posix-clock.o alarmtimer.o +obj-y += timeconv.o timecounter.o posix-clock.o alarmtimer.o obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD) += clockevents.o obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index b79f39bda7e1..4892352f0e49 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -34,82 +34,6 @@ #include "tick-internal.h" #include "timekeeping_internal.h" -void timecounter_init(struct timecounter *tc, - const struct cyclecounter *cc, - u64 start_tstamp) -{ - tc->cc = cc; - tc->cycle_last = cc->read(cc); - tc->nsec = start_tstamp; -} -EXPORT_SYMBOL_GPL(timecounter_init); - -/** - * timecounter_read_delta - get nanoseconds since last call of this function - * @tc: Pointer to time counter - * - * When the underlying cycle counter runs over, this will be handled - * correctly as long as it does not run over more than once between - * calls. - * - * The first call to this function for a new time counter initializes - * the time tracking and returns an undefined result. 
- */ -static u64 timecounter_read_delta(struct timecounter *tc) -{ - cycle_t cycle_now, cycle_delta; - u64 ns_offset; - - /* read cycle counter: */ - cycle_now = tc->cc->read(tc->cc); - - /* calculate the delta since the last timecounter_read_delta(): */ - cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask; - - /* convert to nanoseconds: */ - ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta); - - /* update time stamp of timecounter_read_delta() call: */ - tc->cycle_last = cycle_now; - - return ns_offset; -} - -u64 timecounter_read(struct timecounter *tc) -{ - u64 nsec; - - /* increment time by nanoseconds since last call */ - nsec = timecounter_read_delta(tc); - nsec += tc->nsec; - tc->nsec = nsec; - - return nsec; -} -EXPORT_SYMBOL_GPL(timecounter_read); - -u64 timecounter_cyc2time(struct timecounter *tc, - cycle_t cycle_tstamp) -{ - u64 cycle_delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask; - u64 nsec; - - /* - * Instead of always treating cycle_tstamp as more recent - * than tc->cycle_last, detect when it is too far in the - * future and treat it as old time stamp instead. - */ - if (cycle_delta > tc->cc->mask / 2) { - cycle_delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask; - nsec = tc->nsec - cyclecounter_cyc2ns(tc->cc, cycle_delta); - } else { - nsec = cyclecounter_cyc2ns(tc->cc, cycle_delta) + tc->nsec; - } - - return nsec; -} -EXPORT_SYMBOL_GPL(timecounter_cyc2time); - /** * clocks_calc_mult_shift - calculate mult/shift factors for scaled math of clocks * @mult: pointer to mult variable diff --git a/kernel/time/timecounter.c b/kernel/time/timecounter.c new file mode 100644 index 000000000000..59a1ec3a57cb --- /dev/null +++ b/kernel/time/timecounter.c @@ -0,0 +1,95 @@ +/* + * linux/kernel/time/timecounter.c + * + * based on code that migrated away from + * linux/kernel/time/clocksource.c + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include + +void timecounter_init(struct timecounter *tc, + const struct cyclecounter *cc, + u64 start_tstamp) +{ + tc->cc = cc; + tc->cycle_last = cc->read(cc); + tc->nsec = start_tstamp; +} +EXPORT_SYMBOL_GPL(timecounter_init); + +/** + * timecounter_read_delta - get nanoseconds since last call of this function + * @tc: Pointer to time counter + * + * When the underlying cycle counter runs over, this will be handled + * correctly as long as it does not run over more than once between + * calls. + * + * The first call to this function for a new time counter initializes + * the time tracking and returns an undefined result. 
+ */ +static u64 timecounter_read_delta(struct timecounter *tc) +{ + cycle_t cycle_now, cycle_delta; + u64 ns_offset; + + /* read cycle counter: */ + cycle_now = tc->cc->read(tc->cc); + + /* calculate the delta since the last timecounter_read_delta(): */ + cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask; + + /* convert to nanoseconds: */ + ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta); + + /* update time stamp of timecounter_read_delta() call: */ + tc->cycle_last = cycle_now; + + return ns_offset; +} + +u64 timecounter_read(struct timecounter *tc) +{ + u64 nsec; + + /* increment time by nanoseconds since last call */ + nsec = timecounter_read_delta(tc); + nsec += tc->nsec; + tc->nsec = nsec; + + return nsec; +} +EXPORT_SYMBOL_GPL(timecounter_read); + +u64 timecounter_cyc2time(struct timecounter *tc, + cycle_t cycle_tstamp) +{ + u64 cycle_delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask; + u64 nsec; + + /* + * Instead of always treating cycle_tstamp as more recent + * than tc->cycle_last, detect when it is too far in the + * future and treat it as old time stamp instead. + */ + if (cycle_delta > tc->cc->mask / 2) { + cycle_delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask; + nsec = tc->nsec - cyclecounter_cyc2ns(tc->cc, cycle_delta); + } else { + nsec = cyclecounter_cyc2ns(tc->cc, cycle_delta) + tc->nsec; + } + + return nsec; +} +EXPORT_SYMBOL_GPL(timecounter_cyc2time); diff --git a/sound/pci/hda/hda_priv.h b/sound/pci/hda/hda_priv.h index 166e3e84b963..daf458299753 100644 --- a/sound/pci/hda/hda_priv.h +++ b/sound/pci/hda/hda_priv.h @@ -15,7 +15,7 @@ #ifndef __SOUND_HDA_PRIV_H #define __SOUND_HDA_PRIV_H -#include +#include #include #include -- cgit v1.2.3 From 2eebdde6528a722fbf8e2cffcf7aa52cbb4c2de0 Mon Sep 17 00:00:00 2001 From: Richard Cochran Date: Sun, 21 Dec 2014 19:47:06 +0100 Subject: timecounter: keep track of accumulated fractional nanoseconds MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The current timecounter implementation will drop a variable amount of resolution, depending on the magnitude of the time delta. In other words, reading the clock too often or too close to a time stamp conversion will introduce errors into the time values. This patch fixes the issue by introducing a fractional nanosecond field that accumulates the low order bits. Reported-by: Janusz Użycki Signed-off-by: Richard Cochran Signed-off-by: David S. Miller --- drivers/net/ethernet/mellanox/mlx4/en_clock.c | 4 ++-- include/linux/timecounter.h | 19 ++++++++++------ kernel/time/timecounter.c | 31 +++++++++++++++++++++------ virt/kvm/arm/arch_timer.c | 3 ++- 4 files changed, 40 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c index df35d0e1b899..e9cce4f72b24 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c @@ -240,7 +240,7 @@ void mlx4_en_init_timestamp(struct mlx4_en_dev *mdev) { struct mlx4_dev *dev = mdev->dev; unsigned long flags; - u64 ns; + u64 ns, zero = 0; rwlock_init(&mdev->clock_lock); @@ -265,7 +265,7 @@ void mlx4_en_init_timestamp(struct mlx4_en_dev *mdev) /* Calculate period in seconds to call the overflow watchdog - to make * sure counter is checked at least once every wrap around. 
*/ - ns = cyclecounter_cyc2ns(&mdev->cycles, mdev->cycles.mask); + ns = cyclecounter_cyc2ns(&mdev->cycles, mdev->cycles.mask, zero, &zero); do_div(ns, NSEC_PER_SEC / 2 / HZ); mdev->overflow_period = ns; diff --git a/include/linux/timecounter.h b/include/linux/timecounter.h index af3dfa4e90f0..74f45496e6d1 100644 --- a/include/linux/timecounter.h +++ b/include/linux/timecounter.h @@ -55,27 +55,32 @@ struct cyclecounter { * @cycle_last: most recent cycle counter value seen by * timecounter_read() * @nsec: continuously increasing count + * @mask: bit mask for maintaining the 'frac' field + * @frac: accumulated fractional nanoseconds */ struct timecounter { const struct cyclecounter *cc; cycle_t cycle_last; u64 nsec; + u64 mask; + u64 frac; }; /** * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds * @cc: Pointer to cycle counter. * @cycles: Cycles - * - * XXX - This could use some mult_lxl_ll() asm optimization. Same code - * as in cyc2ns, but with unsigned result. + * @mask: bit mask for maintaining the 'frac' field + * @frac: pointer to storage for the fractional nanoseconds. */ static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc, - cycle_t cycles) + cycle_t cycles, u64 mask, u64 *frac) { - u64 ret = (u64)cycles; - ret = (ret * cc->mult) >> cc->shift; - return ret; + u64 ns = (u64) cycles; + + ns = (ns * cc->mult) + *frac; + *frac = ns & mask; + return ns >> cc->shift; } /** diff --git a/kernel/time/timecounter.c b/kernel/time/timecounter.c index 59a1ec3a57cb..4687b3104bae 100644 --- a/kernel/time/timecounter.c +++ b/kernel/time/timecounter.c @@ -25,6 +25,8 @@ void timecounter_init(struct timecounter *tc, tc->cc = cc; tc->cycle_last = cc->read(cc); tc->nsec = start_tstamp; + tc->mask = (1ULL << cc->shift) - 1; + tc->frac = 0; } EXPORT_SYMBOL_GPL(timecounter_init); @@ -51,7 +53,8 @@ static u64 timecounter_read_delta(struct timecounter *tc) cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask; /* convert to nanoseconds: */ - ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta); + ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta, + tc->mask, &tc->frac); /* update time stamp of timecounter_read_delta() call: */ tc->cycle_last = cycle_now; @@ -72,22 +75,36 @@ u64 timecounter_read(struct timecounter *tc) } EXPORT_SYMBOL_GPL(timecounter_read); +/* + * This is like cyclecounter_cyc2ns(), but it is used for computing a + * time previous to the time stored in the cycle counter. + */ +static u64 cc_cyc2ns_backwards(const struct cyclecounter *cc, + cycle_t cycles, u64 mask, u64 frac) +{ + u64 ns = (u64) cycles; + + ns = ((ns * cc->mult) - frac) >> cc->shift; + + return ns; +} + u64 timecounter_cyc2time(struct timecounter *tc, cycle_t cycle_tstamp) { - u64 cycle_delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask; - u64 nsec; + u64 delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask; + u64 nsec = tc->nsec, frac = tc->frac; /* * Instead of always treating cycle_tstamp as more recent * than tc->cycle_last, detect when it is too far in the * future and treat it as old time stamp instead. 
*/ - if (cycle_delta > tc->cc->mask / 2) { - cycle_delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask; - nsec = tc->nsec - cyclecounter_cyc2ns(tc->cc, cycle_delta); + if (delta > tc->cc->mask / 2) { + delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask; + nsec -= cc_cyc2ns_backwards(tc->cc, delta, tc->mask, frac); } else { - nsec = cyclecounter_cyc2ns(tc->cc, cycle_delta) + tc->nsec; + nsec += cyclecounter_cyc2ns(tc->cc, delta, tc->mask, &frac); } return nsec; diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c index 1c0772b340d8..6e54f3542126 100644 --- a/virt/kvm/arm/arch_timer.c +++ b/virt/kvm/arm/arch_timer.c @@ -152,7 +152,8 @@ void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu) return; } - ns = cyclecounter_cyc2ns(timecounter->cc, cval - now); + ns = cyclecounter_cyc2ns(timecounter->cc, cval - now, timecounter->mask, + &timecounter->frac); timer_arm(timer, ns); } -- cgit v1.2.3 From 113948d841e8d78039e5dbbb5248f5b73e99eafa Mon Sep 17 00:00:00 2001 From: Thomas Graf Date: Fri, 2 Jan 2015 23:00:19 +0100 Subject: spinlock: Add spin_lock_bh_nested() Signed-off-by: Thomas Graf Signed-off-by: David S. Miller --- include/linux/spinlock.h | 8 ++++++++ include/linux/spinlock_api_smp.h | 2 ++ include/linux/spinlock_api_up.h | 1 + kernel/locking/spinlock.c | 8 ++++++++ 4 files changed, 19 insertions(+) (limited to 'kernel') diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h index 262ba4ef9a8e..3e18379dfa6f 100644 --- a/include/linux/spinlock.h +++ b/include/linux/spinlock.h @@ -190,6 +190,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC # define raw_spin_lock_nested(lock, subclass) \ _raw_spin_lock_nested(lock, subclass) +# define raw_spin_lock_bh_nested(lock, subclass) \ + _raw_spin_lock_bh_nested(lock, subclass) # define raw_spin_lock_nest_lock(lock, nest_lock) \ do { \ @@ -205,6 +207,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) # define raw_spin_lock_nested(lock, subclass) \ _raw_spin_lock(((void)(subclass), (lock))) # define raw_spin_lock_nest_lock(lock, nest_lock) _raw_spin_lock(lock) +# define raw_spin_lock_bh_nested(lock, subclass) _raw_spin_lock_bh(lock) #endif #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) @@ -324,6 +327,11 @@ do { \ raw_spin_lock_nested(spinlock_check(lock), subclass); \ } while (0) +#define spin_lock_bh_nested(lock, subclass) \ +do { \ + raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\ +} while (0) + #define spin_lock_nest_lock(lock, nest_lock) \ do { \ raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock); \ diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h index 42dfab89e740..5344268e6e62 100644 --- a/include/linux/spinlock_api_smp.h +++ b/include/linux/spinlock_api_smp.h @@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr); void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) __acquires(lock); void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) __acquires(lock); +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass) + __acquires(lock); void __lockfunc _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map) __acquires(lock); diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h index d0d188861ad6..d3afef9d8dbe 100644 --- a/include/linux/spinlock_api_up.h +++ b/include/linux/spinlock_api_up.h @@ -57,6 +57,7 @@ #define _raw_spin_lock(lock) __LOCK(lock) #define _raw_spin_lock_nested(lock, 
subclass) __LOCK(lock) +#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock) #define _raw_read_lock(lock) __LOCK(lock) #define _raw_write_lock(lock) __LOCK(lock) #define _raw_spin_lock_bh(lock) __LOCK_BH(lock) diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c index 4b082b5cac9e..db3ccb1dd614 100644 --- a/kernel/locking/spinlock.c +++ b/kernel/locking/spinlock.c @@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) } EXPORT_SYMBOL(_raw_spin_lock_nested); +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass) +{ + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET); + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); +} +EXPORT_SYMBOL(_raw_spin_lock_bh_nested); + unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) { -- cgit v1.2.3 From 83ac237a950e130c974d0170cce30891dcd8f250 Mon Sep 17 00:00:00 2001 From: Christoph Jaeger Date: Mon, 5 Jan 2015 22:34:09 -0500 Subject: livepatch: kconfig: use bool instead of boolean Keyword 'boolean' for type definition attributes is considered deprecated and should not be used anymore. No functional changes. Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com Reference: http://lkml.kernel.org/r/1419108071-11607-1-git-send-email-cj@linux.com Signed-off-by: Christoph Jaeger Reviewed-by: Petr Mladek Reviewed-by: Jingoo Han Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- kernel/livepatch/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig index 96da00fbc120..66797aa597c0 100644 --- a/kernel/livepatch/Kconfig +++ b/kernel/livepatch/Kconfig @@ -1,10 +1,10 @@ config ARCH_HAVE_LIVE_PATCHING - boolean + bool help Arch supports kernel live patching config LIVE_PATCHING - boolean "Kernel Live Patching" + bool "Kernel Live Patching" depends on DYNAMIC_FTRACE_WITH_REGS depends on MODULES depends on SYSFS -- cgit v1.2.3 From b9dfe0bed999d23ee8838d389637dd8aef83fafa Mon Sep 17 00:00:00 2001 From: Jiri Kosina Date: Fri, 9 Jan 2015 10:53:21 +0100 Subject: livepatch: handle ancient compilers with more grace We are aborting a build in case when gcc doesn't support fentry on x86_64 (regs->ip modification can't really reliably work with mcount). This however breaks allmodconfig for people with older gccs that don't support -mfentry. Turn the build-time failure into runtime failure, resulting in the whole infrastructure not being initialized if CC_USING_FENTRY is unset. 
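Background on the -mfentry requirement: with plain -pg (mcount) profiling, the compiler emits the profiling call after the function prologue, so an ftrace handler cannot safely rewrite regs->ip -- the old function's stack frame has already been set up by the time the handler runs. With -mfentry, the call is the very first instruction of the function, so nothing has executed yet and redirecting execution to a patched replacement is safe. A rough sketch of the generated entry code (illustrative only, not actual compiler output):

	/*
	 * with -pg (mcount):        with -mfentry:
	 *   func:                     func:
	 *     push %rbp                 call __fentry__   <-- first insn
	 *     mov  %rsp, %rbp           push %rbp
	 *     call mcount               mov  %rsp, %rbp
	 *     ...                       ...
	 */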
Reported-by: Andrew Morton Signed-off-by: Jiri Kosina Signed-off-by: Andrew Morton Acked-by: Josh Poimboeuf --- arch/x86/include/asm/livepatch.h | 6 +++++- kernel/livepatch/core.c | 6 ++++++ 2 files changed, 11 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h index b5608d7757fd..26e58134c8cb 100644 --- a/arch/x86/include/asm/livepatch.h +++ b/arch/x86/include/asm/livepatch.h @@ -25,9 +25,13 @@ #include #ifdef CONFIG_LIVE_PATCHING +static inline int klp_check_compiler_support(void) +{ #ifndef CC_USING_FENTRY -#error Your compiler must support -mfentry for live patching to work + return 1; #endif + return 0; +} extern int klp_write_module_reloc(struct module *mod, unsigned long type, unsigned long loc, unsigned long value); diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 6f6387912da7..ce42d3b930dc 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -911,6 +911,12 @@ static int klp_init(void) { int ret; + ret = klp_check_compiler_support(); + if (ret) { + pr_info("Your compiler is too old; turning off.\n"); + return -EINVAL; + } + ret = register_module_notifier(&klp_module_nb); if (ret) return ret; -- cgit v1.2.3 From 665d92e38f65d70796aad2b8e49e42e80815d4a4 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 25 Dec 2014 14:31:24 +0900 Subject: kbuild: do not add $(call ...) to invoke cc-version or cc-fullversion The macros cc-version, cc-fullversion and ld-version take no argument. It is not necessary to add $(call ...) to invoke them. Signed-off-by: Masahiro Yamada Acked-by: Helge Deller [parisc] Signed-off-by: Michal Marek --- Documentation/kbuild/makefiles.txt | 4 ++-- arch/parisc/Makefile | 2 +- arch/powerpc/Makefile | 6 +++--- arch/x86/Makefile.um | 2 +- kernel/gcov/Makefile | 2 +- scripts/Kbuild.include | 7 ++----- 6 files changed, 10 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index a311db829e9b..7b3487a67476 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -524,7 +524,7 @@ more details, with real examples. Example: #arch/x86/Makefile cflags-y += $(shell \ - if [ $(call cc-version) -ge 0300 ] ; then \ + if [ $(cc-version) -ge 0300 ] ; then \ echo "-mregparm=3"; fi ;) In the above example, -mregparm=3 is only used for gcc version greater @@ -552,7 +552,7 @@ more details, with real examples. Example: #arch/powerpc/Makefile - $(Q)if test "$(call cc-fullversion)" = "040200" ; then \ + $(Q)if test "$(cc-fullversion)" = "040200" ; then \ echo -n '*** GCC-4.2.0 cannot compile the 64-bit powerpc ' ; \ false ; \ fi diff --git a/arch/parisc/Makefile b/arch/parisc/Makefile index 5db8882f732c..fc1aca379fe2 100644 --- a/arch/parisc/Makefile +++ b/arch/parisc/Makefile @@ -149,7 +149,7 @@ endef # we require gcc 3.3 or above to compile the kernel archprepare: checkbin checkbin: - @if test "$(call cc-version)" -lt "0303"; then \ + @if test "$(cc-version)" -lt "0303"; then \ echo -n "Sorry, GCC v3.3 or above is required to build " ; \ echo "the kernel." 
; \ false ; \ diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 132d9c681d6a..fc502e042438 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -314,7 +314,7 @@ TOUT := .tmp_gas_check # - Require gcc 4.0 or above on 64-bit # - gcc-4.2.0 has issues compiling modules on 64-bit checkbin: - @if test "$(call cc-version)" = "0304" ; then \ + @if test "$(cc-version)" = "0304" ; then \ if ! /bin/echo mftb 5 | $(AS) -v -mppc -many -o $(TOUT) >/dev/null 2>&1 ; then \ echo -n '*** ${VERSION}.${PATCHLEVEL} kernels no longer build '; \ echo 'correctly with gcc-3.4 and your version of binutils.'; \ @@ -322,13 +322,13 @@ checkbin: false; \ fi ; \ fi - @if test "$(call cc-version)" -lt "0400" \ + @if test "$(cc-version)" -lt "0400" \ && test "x${CONFIG_PPC64}" = "xy" ; then \ echo -n "Sorry, GCC v4.0 or above is required to build " ; \ echo "the 64-bit powerpc kernel." ; \ false ; \ fi - @if test "$(call cc-fullversion)" = "040200" \ + @if test "$(cc-fullversion)" = "040200" \ && test "x${CONFIG_MODULES}${CONFIG_PPC64}" = "xyy" ; then \ echo -n '*** GCC-4.2.0 cannot compile the 64-bit powerpc ' ; \ echo 'kernel with modules enabled.' ; \ diff --git a/arch/x86/Makefile.um b/arch/x86/Makefile.um index 36b62bc52638..95eba554baf9 100644 --- a/arch/x86/Makefile.um +++ b/arch/x86/Makefile.um @@ -30,7 +30,7 @@ cflags-y += -ffreestanding # Disable unit-at-a-time mode on pre-gcc-4.0 compilers, it makes gcc use # a lot more stack due to the lack of sharing of stacklots. Also, gcc # 4.3.0 needs -funit-at-a-time for extern inline functions. -KBUILD_CFLAGS += $(shell if [ $(call cc-version) -lt 0400 ] ; then \ +KBUILD_CFLAGS += $(shell if [ $(cc-version) -lt 0400 ] ; then \ echo $(call cc-option,-fno-unit-at-a-time); \ else echo $(call cc-option,-funit-at-a-time); fi ;) diff --git a/kernel/gcov/Makefile b/kernel/gcov/Makefile index 52aa7e8de927..6f01fa3d9ca1 100644 --- a/kernel/gcov/Makefile +++ b/kernel/gcov/Makefile @@ -21,7 +21,7 @@ else # is not available. We can probably move if-lt to Kbuild.include, so it's also # not defined during clean or to include Kbuild.include in # scripts/Makefile.clean. But the following workaround seems least invasive. 
- cc-ver := $(if $(call cc-version),$(call cc-version),0) + cc-ver := $(if $(cc-version),$(cc-version),0) endif obj-$(CONFIG_GCOV_KERNEL) := base.o fs.o diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index 34a87fc77f71..ddf0ebdc2ca8 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -129,17 +129,15 @@ cc-disable-warning = $(call try-run,\ $(CC) $(KBUILD_CPPFLAGS) $(KBUILD_CFLAGS) -W$(strip $(1)) -c -x c /dev/null -o "$$TMP",-Wno-$(strip $(1))) # cc-version -# Usage gcc-ver := $(call cc-version) cc-version = $(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-version.sh $(CC)) # cc-fullversion -# Usage gcc-ver := $(call cc-fullversion) cc-fullversion = $(shell $(CONFIG_SHELL) \ $(srctree)/scripts/gcc-version.sh -p $(CC)) # cc-ifversion # Usage: EXTRA_CFLAGS += $(call cc-ifversion, -lt, 0402, -O1) -cc-ifversion = $(shell [ $(call cc-version) $(1) $(2) ] && echo $(3)) +cc-ifversion = $(shell [ $(cc-version) $(1) $(2) ] && echo $(3)) # cc-ldoption # Usage: ldflags += $(call cc-ldoption, -Wl$(comma)--hash-style=both) @@ -157,13 +155,12 @@ ld-option = $(call try-run,\ ar-option = $(call try-run, $(AR) rc$(1) "$$TMP",$(1),$(2)) # ld-version -# Usage: $(call ld-version) # Note this is mainly for HJ Lu's 3 number binutil versions ld-version = $(shell $(LD) --version | $(srctree)/scripts/ld-version.sh) # ld-ifversion # Usage: $(call ld-ifversion, -ge, 22252, y) -ld-ifversion = $(shell [ $(call ld-version) $(1) $(2) ] && echo $(3)) +ld-ifversion = $(shell [ $(ld-version) $(1) $(2) ] && echo $(3)) ###### -- cgit v1.2.3 From 842857dedc40df7c86e990b7153a069d0040e552 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 25 Dec 2014 14:31:25 +0900 Subject: kbuild,gcov: remove unnecessary workaround Since commit 371fdc77af44 (kbuild: collect shorthands into scripts/Kbuild.include), scripts/Makefile.clean includes scripts/Kbuild.include. The workaround and the comment block in kernel/gcov/Makefile are no longer necessary. Signed-off-by: Masahiro Yamada Cc: Peter Oberparleiter Signed-off-by: Michal Marek --- kernel/gcov/Makefile | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/gcov/Makefile b/kernel/gcov/Makefile index 6f01fa3d9ca1..42323c7ec106 100644 --- a/kernel/gcov/Makefile +++ b/kernel/gcov/Makefile @@ -10,18 +10,7 @@ ifeq ($(CONFIG_GCOV_FORMAT_3_4),y) else ifeq ($(CONFIG_GCOV_FORMAT_4_7),y) cc-ver := 0407 else -# Use cc-version if available, otherwise set 0 -# -# scripts/Kbuild.include, which contains cc-version function, is not included -# during make clean "make -f scripts/Makefile.clean obj=kernel/gcov" -# Meaning cc-ver is empty causing if-lt test to fail with -# "/bin/sh: line 0: [: -lt: unary operator expected" error mesage. -# This has no affect on the clean phase, but the error message could be -# confusing/annoying. So this dummy workaround sets cc-ver to zero if cc-version -# is not available. We can probably move if-lt to Kbuild.include, so it's also -# not defined during clean or to include Kbuild.include in -# scripts/Makefile.clean. But the following workaround seems least invasive. - cc-ver := $(if $(cc-version),$(cc-version),0) + cc-ver := $(cc-version) endif obj-$(CONFIG_GCOV_KERNEL) := base.o fs.o -- cgit v1.2.3 From 3df8094727bd1972eb8e231e56ecdbd6e8fb2210 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 25 Dec 2014 14:31:26 +0900 Subject: kbuild,gcov: simplify kernel/gcov/Makefile Kbuild descends into kernel/gcov/ directory only when CONFIG_GCOV_KERNEL is enabled. 
(See kernel/Makefile) CONFIG_GCOV_KERNEL check can be omitted in kernel/gcov/Makefile. Signed-off-by: Masahiro Yamada Cc: Peter Oberparleiter Signed-off-by: Michal Marek --- kernel/gcov/Makefile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/gcov/Makefile b/kernel/gcov/Makefile index 42323c7ec106..69b3515c1806 100644 --- a/kernel/gcov/Makefile +++ b/kernel/gcov/Makefile @@ -13,10 +13,10 @@ else cc-ver := $(cc-version) endif -obj-$(CONFIG_GCOV_KERNEL) := base.o fs.o +obj-y := base.o fs.o ifeq ($(call if-lt, $(cc-ver), 0407),1) - obj-$(CONFIG_GCOV_KERNEL) += gcc_3_4.o + obj-y += gcc_3_4.o else - obj-$(CONFIG_GCOV_KERNEL) += gcc_4_7.o + obj-y += gcc_4_7.o endif -- cgit v1.2.3 From a75f8b8dab0f73459fa47a1daa10c84c4e8400a8 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Thu, 25 Dec 2014 14:31:28 +0900 Subject: kbuild,gcov: simplify kernel/gcov/Makefile more CONFIG_GCOV_FORMAT_3_4 / _4_7 / _AUTODETECT are exclusive. Compare the CC version only when _AUTODETECT is enabled. This change should have no impact. Signed-off-by: Masahiro Yamada Cc: Peter Oberparleiter Signed-off-by: Michal Marek --- kernel/gcov/Makefile | 23 ++++------------------- 1 file changed, 4 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/gcov/Makefile b/kernel/gcov/Makefile index 69b3515c1806..752d6486b67e 100644 --- a/kernel/gcov/Makefile +++ b/kernel/gcov/Makefile @@ -1,22 +1,7 @@ ccflags-y := -DSRCTREE='"$(srctree)"' -DOBJTREE='"$(objtree)"' -# if-lt -# Usage VAR := $(call if-lt, $(a), $(b)) -# Returns 1 if (a < b) -if-lt = $(shell [ $(1) -lt $(2) ] && echo 1) - -ifeq ($(CONFIG_GCOV_FORMAT_3_4),y) - cc-ver := 0304 -else ifeq ($(CONFIG_GCOV_FORMAT_4_7),y) - cc-ver := 0407 -else - cc-ver := $(cc-version) -endif - obj-y := base.o fs.o - -ifeq ($(call if-lt, $(cc-ver), 0407),1) - obj-y += gcc_3_4.o -else - obj-y += gcc_4_7.o -endif +obj-$(CONFIG_GCOV_FORMAT_3_4) += gcc_3_4.o +obj-$(CONFIG_GCOV_FORMAT_4_7) += gcc_4_7.o +obj-$(CONFIG_GCOV_FORMAT_AUTODETECT) += $(call cc-ifversion, -lt, 0407, \ + gcc_3_4.o, gcc_4_7.o) -- cgit v1.2.3 From 99590ba565a22c9f58f7528a94881d0455eef018 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Fri, 9 Jan 2015 14:03:04 -0600 Subject: livepatch: fix deferred module patching order When applying multiple patches to a module, if the module is loaded after the patches are loaded, the patches are applied in reverse order: $ insmod patch1.ko [ 43.172992] livepatch: enabling patch 'patch1' $ insmod patch2.ko [ 46.571563] livepatch: enabling patch 'patch2' $ modprobe nfsd [ 52.888922] livepatch: applying patch 'patch2' to loading module 'nfsd' [ 52.899847] livepatch: applying patch 'patch1' to loading module 'nfsd' Fix the loading order by storing the klp_patches list in queue order. 
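The one-line fix below relies on standard <linux/list.h> semantics: list_add() inserts at the head of the list, so list_for_each_entry() visits patches newest-first, while list_add_tail() appends and preserves load order. A minimal before/after sketch (each case assumes a fresh list; p1 and p2 are hypothetical struct klp_patch pointers, not the patched code verbatim):

	LIST_HEAD(klp_patches);

	/* before the fix: */
	list_add(&p1->list, &klp_patches);       /* head -> p1       */
	list_add(&p2->list, &klp_patches);       /* head -> p2 -> p1 */
	/* list_for_each_entry() visits p2 first: reverse load order */

	/* after the fix: */
	list_add_tail(&p1->list, &klp_patches);  /* head -> p1       */
	list_add_tail(&p2->list, &klp_patches);  /* head -> p1 -> p2 */
	/* list_for_each_entry() visits p1 first: load order         */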
Signed-off-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index ce42d3b930dc..3d9c00b5223a 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -739,7 +739,7 @@ static int klp_init_patch(struct klp_patch *patch) goto free; } - list_add(&patch->list, &klp_patches); + list_add_tail(&patch->list, &klp_patches); mutex_unlock(&klp_mutex); -- cgit v1.2.3 From cbf6ab52add20b845f903decc973afbd5463c527 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Mon, 5 Jan 2015 19:29:32 +0800 Subject: kprobes: Pass the original kprobe for preparing optimized kprobe Pass the original kprobe for preparing an optimized kprobe arch-dep part, since for some architecture (e.g. ARM32) requires the information in original kprobe. Signed-off-by: Masami Hiramatsu Signed-off-by: Wang Nan Signed-off-by: Jon Medhurst --- arch/x86/kernel/kprobes/opt.c | 3 ++- include/linux/kprobes.h | 3 ++- kernel/kprobes.c | 4 ++-- 3 files changed, 6 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 7c523bbf3dc8..0dd8d089c315 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -322,7 +322,8 @@ void arch_remove_optimized_kprobe(struct optimized_kprobe *op) * Target instructions MUST be relocatable (checked inside) * This is called when new aggr(opt)probe is allocated or reused. */ -int arch_prepare_optimized_kprobe(struct optimized_kprobe *op) +int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, + struct kprobe *__unused) { u8 *buf; int ret; diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h index 5297f9fa0ef2..1ab54754a86d 100644 --- a/include/linux/kprobes.h +++ b/include/linux/kprobes.h @@ -308,7 +308,8 @@ struct optimized_kprobe { /* Architecture dependent functions for direct jump optimization */ extern int arch_prepared_optinsn(struct arch_optimized_insn *optinsn); extern int arch_check_optimized_kprobe(struct optimized_kprobe *op); -extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op); +extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, + struct kprobe *orig); extern void arch_remove_optimized_kprobe(struct optimized_kprobe *op); extern void arch_optimize_kprobes(struct list_head *oplist); extern void arch_unoptimize_kprobes(struct list_head *oplist, diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 06f58309fed2..bad4e959f2f7 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -717,7 +717,7 @@ static void prepare_optimized_kprobe(struct kprobe *p) struct optimized_kprobe *op; op = container_of(p, struct optimized_kprobe, kp); - arch_prepare_optimized_kprobe(op); + arch_prepare_optimized_kprobe(op, p); } /* Allocate new optimized_kprobe and try to prepare optimized instructions */ @@ -731,7 +731,7 @@ static struct kprobe *alloc_aggr_kprobe(struct kprobe *p) INIT_LIST_HEAD(&op->list); op->kp.addr = p->addr; - arch_prepare_optimized_kprobe(op); + arch_prepare_optimized_kprobe(op, p); return &op->kp; } -- cgit v1.2.3 From 053c095a82cf773075e83d7233b5cc19a1f73ece Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Fri, 16 Jan 2015 22:09:00 +0100 Subject: netlink: make nlmsg_end() and genlmsg_end() void Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will 
necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg Signed-off-by: David S. 
Miller --- drivers/acpi/event.c | 7 +------ drivers/net/ethernet/rocker/rocker.c | 3 ++- drivers/net/vxlan.c | 3 ++- drivers/net/wireless/mac80211_hwsim.c | 3 ++- drivers/scsi/pmcraid.c | 8 +------- drivers/target/target_core_user.c | 4 +--- drivers/thermal/thermal_core.c | 6 +----- fs/dlm/netlink.c | 7 +------ include/net/genetlink.h | 4 ++-- include/net/netlink.h | 6 +----- kernel/taskstats.c | 13 ++----------- net/bridge/br_fdb.c | 3 ++- net/bridge/br_mdb.c | 3 ++- net/bridge/br_netlink.c | 3 ++- net/can/gw.c | 3 ++- net/core/fib_rules.c | 3 ++- net/core/neighbour.c | 12 ++++++++---- net/core/rtnetlink.c | 9 ++++++--- net/decnet/dn_dev.c | 3 ++- net/decnet/dn_route.c | 3 ++- net/decnet/dn_table.c | 3 ++- net/ieee802154/netlink.c | 12 ++---------- net/ieee802154/nl-mac.c | 3 ++- net/ieee802154/nl-phy.c | 3 ++- net/ieee802154/nl802154.c | 6 ++++-- net/ipv4/devinet.c | 8 +++++--- net/ipv4/fib_semantics.c | 3 ++- net/ipv4/inet_diag.c | 9 ++++++--- net/ipv4/ipmr.c | 3 ++- net/ipv4/route.c | 3 ++- net/ipv4/tcp_metrics.c | 3 ++- net/ipv6/addrconf.c | 32 +++++++++++++++++++------------- net/ipv6/addrlabel.c | 5 +++-- net/ipv6/ip6_fib.c | 1 - net/ipv6/ip6mr.c | 3 ++- net/ipv6/route.c | 3 ++- net/l2tp/l2tp_netlink.c | 10 ++++++---- net/netfilter/ipvs/ip_vs_ctl.c | 9 ++++++--- net/netfilter/nf_tables_api.c | 18 ++++++++++++------ net/netlabel/netlabel_cipso_v4.c | 3 ++- net/netlabel/netlabel_mgmt.c | 6 ++++-- net/netlabel/netlabel_unlabeled.c | 3 ++- net/netlink/diag.c | 3 ++- net/netlink/genetlink.c | 6 ++++-- net/nfc/netlink.c | 12 +++++++----- net/openvswitch/datapath.c | 9 ++++++--- net/packet/diag.c | 3 ++- net/phonet/pn_netlink.c | 16 +++++++++++----- net/unix/diag.c | 3 ++- net/wireless/nl80211.c | 27 ++++++++++++++++++--------- net/xfrm/xfrm_user.c | 27 ++++++++++++++++++--------- 51 files changed, 203 insertions(+), 158 deletions(-) (limited to 'kernel') diff --git a/drivers/acpi/event.c b/drivers/acpi/event.c index ef2d730734dc..e24ea4e796e4 100644 --- a/drivers/acpi/event.c +++ b/drivers/acpi/event.c @@ -100,7 +100,6 @@ int acpi_bus_generate_netlink_event(const char *device_class, struct acpi_genl_event *event; void *msg_header; int size; - int result; /* allocate memory */ size = nla_total_size(sizeof(struct acpi_genl_event)) + @@ -137,11 +136,7 @@ int acpi_bus_generate_netlink_event(const char *device_class, event->data = data; /* send multicast genetlink message */ - result = genlmsg_end(skb, msg_header); - if (result < 0) { - nlmsg_free(skb); - return result; - } + genlmsg_end(skb, msg_header); genlmsg_multicast(&acpi_event_genl_family, skb, 0, 0, GFP_ATOMIC); return 0; diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c index 964d719b150f..d54781e71cd4 100644 --- a/drivers/net/ethernet/rocker/rocker.c +++ b/drivers/net/ethernet/rocker/rocker.c @@ -3674,7 +3674,8 @@ static int rocker_fdb_fill_info(struct sk_buff *skb, if (vid && nla_put_u16(skb, NDA_VLAN, vid)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 6b6b45622a0a..c5f79e7513a6 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -363,7 +363,8 @@ static int vxlan_fdb_info(struct sk_buff *skb, struct vxlan_dev *vxlan, if (nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/drivers/net/wireless/mac80211_hwsim.c 
b/drivers/net/wireless/mac80211_hwsim.c index 494e7335aa64..4a4c6586a8d2 100644 --- a/drivers/net/wireless/mac80211_hwsim.c +++ b/drivers/net/wireless/mac80211_hwsim.c @@ -2557,7 +2557,8 @@ static int mac80211_hwsim_get_radio(struct sk_buff *skb, if (res < 0) goto out_err; - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; out_err: genlmsg_cancel(skb, hdr); diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c index 8c27b6a77ec4..cf222f46eac5 100644 --- a/drivers/scsi/pmcraid.c +++ b/drivers/scsi/pmcraid.c @@ -1473,13 +1473,7 @@ static int pmcraid_notify_aen( } /* send genetlink multicast message to notify appplications */ - result = genlmsg_end(skb, msg_header); - - if (result < 0) { - pmcraid_err("genlmsg_end failed\n"); - nlmsg_free(skb); - return result; - } + genlmsg_end(skb, msg_header); result = genlmsg_multicast(&pmcraid_event_family, skb, 0, 0, GFP_ATOMIC); diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c index 1157b559683b..1a1bcf71ec9d 100644 --- a/drivers/target/target_core_user.c +++ b/drivers/target/target_core_user.c @@ -784,9 +784,7 @@ static int tcmu_netlink_event(enum tcmu_genl_cmd cmd, const char *name, int mino if (ret < 0) goto free_skb; - ret = genlmsg_end(skb, msg_header); - if (ret < 0) - goto free_skb; + genlmsg_end(skb, msg_header); ret = genlmsg_multicast(&tcmu_genl_family, skb, 0, TCMU_MCGRP_CONFIG, GFP_KERNEL); diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index 87e0b0782023..48491d1a81d6 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -1759,11 +1759,7 @@ int thermal_generate_netlink_event(struct thermal_zone_device *tz, thermal_event->event = event; /* send multicast genetlink message */ - result = genlmsg_end(skb, msg_header); - if (result < 0) { - nlmsg_free(skb); - return result; - } + genlmsg_end(skb, msg_header); result = genlmsg_multicast(&thermal_event_genl_family, skb, 0, 0, GFP_ATOMIC); diff --git a/fs/dlm/netlink.c b/fs/dlm/netlink.c index e7cfbaf8d0e2..1e6e227134d7 100644 --- a/fs/dlm/netlink.c +++ b/fs/dlm/netlink.c @@ -56,13 +56,8 @@ static int send_data(struct sk_buff *skb) { struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); void *data = genlmsg_data(genlhdr); - int rv; - rv = genlmsg_end(skb, data); - if (rv < 0) { - nlmsg_free(skb); - return rv; - } + genlmsg_end(skb, data); return genlmsg_unicast(&init_net, skb, listener_nlportid); } diff --git a/include/net/genetlink.h b/include/net/genetlink.h index 84125088c309..f24aa83b80b6 100644 --- a/include/net/genetlink.h +++ b/include/net/genetlink.h @@ -245,9 +245,9 @@ static inline void *genlmsg_put_reply(struct sk_buff *skb, * @skb: socket buffer the message is stored in * @hdr: user specific header */ -static inline int genlmsg_end(struct sk_buff *skb, void *hdr) +static inline void genlmsg_end(struct sk_buff *skb, void *hdr) { - return nlmsg_end(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN); + nlmsg_end(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN); } /** diff --git a/include/net/netlink.h b/include/net/netlink.h index d5869b90bfbb..e010ee8da41d 100644 --- a/include/net/netlink.h +++ b/include/net/netlink.h @@ -490,14 +490,10 @@ static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags) * Corrects the netlink message header to include the appeneded * attributes. Only necessary if attributes have been added to * the message. - * - * Returns the total data length of the skb. 
*/ -static inline int nlmsg_end(struct sk_buff *skb, struct nlmsghdr *nlh) +static inline void nlmsg_end(struct sk_buff *skb, struct nlmsghdr *nlh) { nlh->nlmsg_len = skb_tail_pointer(skb) - (unsigned char *)nlh; - - return skb->len; } /** diff --git a/kernel/taskstats.c b/kernel/taskstats.c index 670fff88a961..21f82c29c914 100644 --- a/kernel/taskstats.c +++ b/kernel/taskstats.c @@ -111,13 +111,8 @@ static int send_reply(struct sk_buff *skb, struct genl_info *info) { struct genlmsghdr *genlhdr = nlmsg_data(nlmsg_hdr(skb)); void *reply = genlmsg_data(genlhdr); - int rc; - rc = genlmsg_end(skb, reply); - if (rc < 0) { - nlmsg_free(skb); - return rc; - } + genlmsg_end(skb, reply); return genlmsg_reply(skb, info); } @@ -134,11 +129,7 @@ static void send_cpu_listeners(struct sk_buff *skb, void *reply = genlmsg_data(genlhdr); int rc, delcount = 0; - rc = genlmsg_end(skb, reply); - if (rc < 0) { - nlmsg_free(skb); - return; - } + genlmsg_end(skb, reply); rc = 0; down_read(&listeners->sem); diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c index 03667e65cc29..08bf04bdac58 100644 --- a/net/bridge/br_fdb.c +++ b/net/bridge/br_fdb.c @@ -633,7 +633,8 @@ static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br, if (fdb->vlan_id && nla_put(skb, NDA_VLAN, sizeof(u16), &fdb->vlan_id)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index fed61c971b17..409608960899 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -190,7 +190,8 @@ static int nlmsg_populate_mdb_fill(struct sk_buff *skb, nla_nest_end(skb, nest2); nla_nest_end(skb, nest); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; end: nla_nest_end(skb, nest); diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 163950b10d8c..528cf2790a5f 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -263,7 +263,8 @@ static int br_fill_ifinfo(struct sk_buff *skb, } done: - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/can/gw.c b/net/can/gw.c index 295f62e62eb3..a6f448e18ea8 100644 --- a/net/can/gw.c +++ b/net/can/gw.c @@ -575,7 +575,8 @@ static int cgw_put_job(struct sk_buff *skb, struct cgw_job *gwj, int type, goto cancel; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; cancel: nlmsg_cancel(skb, nlh); diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 185c341fafbd..44706e81b2e0 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -609,7 +609,8 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule, if (ops->fill(rule, skb, frh) < 0) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 8d614c93f86a..d36d564f149f 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1884,7 +1884,8 @@ static int neightbl_fill_info(struct sk_buff *skb, struct neigh_table *tbl, goto nla_put_failure; read_unlock_bh(&tbl->lock); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: read_unlock_bh(&tbl->lock); @@ -1917,7 +1918,8 @@ static int neightbl_fill_param_info(struct sk_buff *skb, goto errout; read_unlock_bh(&tbl->lock); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; errout: read_unlock_bh(&tbl->lock); nlmsg_cancel(skb, nlh); @@ -2202,7 
+2204,8 @@ static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -2232,7 +2235,8 @@ static int pneigh_fill_info(struct sk_buff *skb, struct pneigh_entry *pn, if (nla_put(skb, NDA_DST, tbl->key_len, pn->key)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index eadc5c0e2dfa..e13b9dbdf154 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1199,7 +1199,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, nla_nest_end(skb, af_spec); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -2326,7 +2327,8 @@ static int nlmsg_populate_fdb_fill(struct sk_buff *skb, if (nla_put(skb, NDA_LLADDR, ETH_ALEN, addr)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -2809,7 +2811,8 @@ int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq, nla_nest_end(skb, protinfo); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); return -EMSGSIZE; diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c index 4400da7739da..b2c26b081134 100644 --- a/net/decnet/dn_dev.c +++ b/net/decnet/dn_dev.c @@ -702,7 +702,8 @@ static int dn_nl_fill_ifaddr(struct sk_buff *skb, struct dn_ifaddr *ifa, nla_put_string(skb, IFA_LABEL, ifa->ifa_label)) || nla_put_u32(skb, IFA_FLAGS, ifa_flags)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index daccc4a36d80..812e5e6e88fb 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1616,7 +1616,8 @@ static int dn_rt_fill_info(struct sk_buff *skb, u32 portid, u32 seq, nla_put_u32(skb, RTA_IIF, rt->fld.flowidn_iif) < 0) goto errout; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; errout: nlmsg_cancel(skb, nlh); diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c index 3f19fcbf126d..1540b506e3e0 100644 --- a/net/decnet/dn_table.c +++ b/net/decnet/dn_table.c @@ -367,7 +367,8 @@ static int dn_fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, nla_nest_end(skb, mp_head); } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; errout: nlmsg_cancel(skb, nlh); diff --git a/net/ieee802154/netlink.c b/net/ieee802154/netlink.c index fa1464762d0d..c8133c07ceee 100644 --- a/net/ieee802154/netlink.c +++ b/net/ieee802154/netlink.c @@ -63,13 +63,9 @@ int ieee802154_nl_mcast(struct sk_buff *msg, unsigned int group) struct nlmsghdr *nlh = nlmsg_hdr(msg); void *hdr = genlmsg_data(nlmsg_data(nlh)); - if (genlmsg_end(msg, hdr) < 0) - goto out; + genlmsg_end(msg, hdr); return genlmsg_multicast(&nl802154_family, msg, 0, group, GFP_ATOMIC); -out: - nlmsg_free(msg); - return -ENOBUFS; } struct sk_buff *ieee802154_nl_new_reply(struct genl_info *info, @@ -96,13 +92,9 @@ int ieee802154_nl_reply(struct sk_buff *msg, struct genl_info *info) struct nlmsghdr *nlh = nlmsg_hdr(msg); void *hdr = genlmsg_data(nlmsg_data(nlh)); - if (genlmsg_end(msg, hdr) < 0) - goto out; + genlmsg_end(msg, hdr); return genlmsg_reply(msg, info); -out: - nlmsg_free(msg); - return 
-ENOBUFS; } static const struct genl_ops ieee8021154_ops[] = { diff --git a/net/ieee802154/nl-mac.c b/net/ieee802154/nl-mac.c index 3c902e9516fb..9105265920fe 100644 --- a/net/ieee802154/nl-mac.c +++ b/net/ieee802154/nl-mac.c @@ -136,7 +136,8 @@ static int ieee802154_nl_fill_iface(struct sk_buff *msg, u32 portid, } wpan_phy_put(phy); - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: wpan_phy_put(phy); diff --git a/net/ieee802154/nl-phy.c b/net/ieee802154/nl-phy.c index 7baf98b14611..1b9d25f6e898 100644 --- a/net/ieee802154/nl-phy.c +++ b/net/ieee802154/nl-phy.c @@ -65,7 +65,8 @@ static int ieee802154_nl_fill_phy(struct sk_buff *msg, u32 portid, goto nla_put_failure; mutex_unlock(&phy->pib_lock); kfree(buf); - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: mutex_unlock(&phy->pib_lock); diff --git a/net/ieee802154/nl802154.c b/net/ieee802154/nl802154.c index a25b9bbd077b..a4daf91b8d0a 100644 --- a/net/ieee802154/nl802154.c +++ b/net/ieee802154/nl802154.c @@ -306,7 +306,8 @@ static int nl802154_send_wpan_phy(struct cfg802154_registered_device *rdev, goto nla_put_failure; finish: - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -489,7 +490,8 @@ nl802154_send_iface(struct sk_buff *msg, u32 portid, u32 seq, int flags, if (nla_put_u8(msg, NL802154_ATTR_LBT_MODE, wpan_dev->lbt)) goto nla_put_failure; - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 214882e7d6de..5f344eb3fc25 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -1522,7 +1522,8 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa, preferred, valid)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -1566,7 +1567,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) if (inet_fill_ifaddr(skb, ifa, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, - RTM_NEWADDR, NLM_F_MULTI) <= 0) { + RTM_NEWADDR, NLM_F_MULTI) < 0) { rcu_read_unlock(); goto done; } @@ -1749,7 +1750,8 @@ static int inet_netconf_fill_devconf(struct sk_buff *skb, int ifindex, IPV4_DEVCONF(*devconf, PROXY_ARP)) < 0) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index d2b7b5521b1b..265cb72b7c1b 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -1091,7 +1091,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, nla_nest_end(skb, mp); } #endif - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index e34dccbc4d70..81751f12645f 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -203,7 +203,8 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, icsk->icsk_ca_ops->get_info(sk, ext, skb); out: - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; errout: nlmsg_cancel(skb, nlh); @@ -271,7 +272,8 @@ static int inet_twsk_diag_fill(struct inet_timewait_sock *tw, } #endif - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, @@ -758,7 +760,8 @@ static int 
inet_diag_fill_req(struct sk_buff *skb, struct sock *sk, } #endif - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk, diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index c8034587859d..9d78427652d2 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2290,7 +2290,8 @@ static int ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, if (err < 0 && err != -ENOENT) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/ipv4/route.c b/net/ipv4/route.c index ce112d0f2698..f6e43ca5e641 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2390,7 +2390,8 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src, if (rtnl_put_cacheinfo(skb, &rt->dst, 0, expires, error) < 0) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index ed9c9a91851c..e5f41bd5ec1b 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -886,7 +886,8 @@ static int tcp_metrics_dump_info(struct sk_buff *skb, if (tcp_metrics_fill_info(skb, tm) < 0) goto nla_put_failure; - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index f7c8bbeb27b7..8975d9501d50 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -489,7 +489,8 @@ static int inet6_netconf_fill_devconf(struct sk_buff *skb, int ifindex, nla_put_s32(skb, NETCONFA_PROXY_NEIGH, devconf->proxy_ndp) < 0) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -619,7 +620,7 @@ static int inet6_netconf_dump_devconf(struct sk_buff *skb, cb->nlh->nlmsg_seq, RTM_NEWNETCONF, NLM_F_MULTI, - -1) <= 0) { + -1) < 0) { rcu_read_unlock(); goto done; } @@ -635,7 +636,7 @@ cont: NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWNETCONF, NLM_F_MULTI, - -1) <= 0) + -1) < 0) goto done; else h++; @@ -646,7 +647,7 @@ cont: NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWNETCONF, NLM_F_MULTI, - -1) <= 0) + -1) < 0) goto done; else h++; @@ -4047,7 +4048,8 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa, if (nla_put_u32(skb, IFA_FLAGS, ifa->flags) < 0) goto error; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; error: nlmsg_cancel(skb, nlh); @@ -4076,7 +4078,8 @@ static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca, return -EMSGSIZE; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int inet6_fill_ifacaddr(struct sk_buff *skb, struct ifacaddr6 *ifaca, @@ -4101,7 +4104,8 @@ static int inet6_fill_ifacaddr(struct sk_buff *skb, struct ifacaddr6 *ifaca, return -EMSGSIZE; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } enum addr_type_t { @@ -4134,7 +4138,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, cb->nlh->nlmsg_seq, RTM_NEWADDR, NLM_F_MULTI); - if (err <= 0) + if (err < 0) break; nl_dump_check_consistent(cb, nlmsg_hdr(skb)); } @@ -4151,7 +4155,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, cb->nlh->nlmsg_seq, RTM_GETMULTICAST, NLM_F_MULTI); - if (err <= 0) + if (err < 0) break; } break; @@ -4166,7 +4170,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, 
cb->nlh->nlmsg_seq, RTM_GETANYCAST, NLM_F_MULTI); - if (err <= 0) + if (err < 0) break; } break; @@ -4638,7 +4642,8 @@ static int inet6_fill_ifinfo(struct sk_buff *skb, struct inet6_dev *idev, goto nla_put_failure; nla_nest_end(skb, protoinfo); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -4670,7 +4675,7 @@ static int inet6_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb) if (inet6_fill_ifinfo(skb, idev, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, - RTM_NEWLINK, NLM_F_MULTI) <= 0) + RTM_NEWLINK, NLM_F_MULTI) < 0) goto out; cont: idx++; @@ -4747,7 +4752,8 @@ static int inet6_fill_prefix(struct sk_buff *skb, struct inet6_dev *idev, ci.valid_time = ntohl(pinfo->valid); if (nla_put(skb, PREFIX_CACHEINFO, sizeof(ci), &ci)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c index fd0dc47f471d..e43e79d0a612 100644 --- a/net/ipv6/addrlabel.c +++ b/net/ipv6/addrlabel.c @@ -490,7 +490,8 @@ static int ip6addrlbl_fill(struct sk_buff *skb, return -EMSGSIZE; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int ip6addrlbl_dump(struct sk_buff *skb, struct netlink_callback *cb) @@ -510,7 +511,7 @@ static int ip6addrlbl_dump(struct sk_buff *skb, struct netlink_callback *cb) cb->nlh->nlmsg_seq, RTM_NEWADDRLABEL, NLM_F_MULTI); - if (err <= 0) + if (err < 0) break; } idx++; diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 03c520a4ebeb..53775ee7d376 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -277,7 +277,6 @@ static int fib6_dump_node(struct fib6_walker *w) w->leaf = rt; return 1; } - WARN_ON(res == 0); } w->leaf = NULL; return 0; diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c index 722669754bbf..34b682617f50 100644 --- a/net/ipv6/ip6mr.c +++ b/net/ipv6/ip6mr.c @@ -2388,7 +2388,8 @@ static int ip6mr_fill_mroute(struct mr6_table *mrt, struct sk_buff *skb, if (err < 0 && err != -ENOENT) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 34dcbb59df75..c60f15775c53 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2725,7 +2725,8 @@ static int rt6_fill_node(struct net *net, if (rtnl_put_cacheinfo(skb, &rt->dst, 0, expires, rt->dst.error) < 0) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c index 6b16598f31d5..b4e923f77954 100644 --- a/net/l2tp/l2tp_netlink.c +++ b/net/l2tp/l2tp_netlink.c @@ -390,7 +390,8 @@ static int l2tp_nl_tunnel_send(struct sk_buff *skb, u32 portid, u32 seq, int fla } out: - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); @@ -451,7 +452,7 @@ static int l2tp_nl_cmd_tunnel_dump(struct sk_buff *skb, struct netlink_callback if (l2tp_nl_tunnel_send(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NLM_F_MULTI, - tunnel, L2TP_CMD_TUNNEL_GET) <= 0) + tunnel, L2TP_CMD_TUNNEL_GET) < 0) goto out; ti++; @@ -752,7 +753,8 @@ static int l2tp_nl_session_send(struct sk_buff *skb, u32 portid, u32 seq, int fl goto nla_put_failure; nla_nest_end(skb, nest); - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); @@ -816,7 +818,7 @@ static int 
l2tp_nl_cmd_session_dump(struct sk_buff *skb, struct netlink_callback if (l2tp_nl_session_send(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, NLM_F_MULTI, - session, L2TP_CMD_SESSION_GET) <= 0) + session, L2TP_CMD_SESSION_GET) < 0) break; si++; diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c index b8295a430a56..e55759056361 100644 --- a/net/netfilter/ipvs/ip_vs_ctl.c +++ b/net/netfilter/ipvs/ip_vs_ctl.c @@ -2887,7 +2887,8 @@ static int ip_vs_genl_dump_service(struct sk_buff *skb, if (ip_vs_genl_fill_service(skb, svc) < 0) goto nla_put_failure; - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); @@ -3079,7 +3080,8 @@ static int ip_vs_genl_dump_dest(struct sk_buff *skb, struct ip_vs_dest *dest, if (ip_vs_genl_fill_dest(skb, dest) < 0) goto nla_put_failure; - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); @@ -3215,7 +3217,8 @@ static int ip_vs_genl_dump_daemon(struct sk_buff *skb, __u32 state, if (ip_vs_genl_fill_daemon(skb, state, mcast_ifn, syncid)) goto nla_put_failure; - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 3b3ddb4fb9ee..70f697827b9b 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -427,7 +427,8 @@ static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net, nla_put_be32(skb, NFTA_TABLE_USE, htonl(table->use))) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_trim(skb, nlh); @@ -971,7 +972,8 @@ static int nf_tables_fill_chain_info(struct sk_buff *skb, struct net *net, if (nla_put_be32(skb, NFTA_CHAIN_USE, htonl(chain->use))) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_trim(skb, nlh); @@ -1707,7 +1709,8 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net, nla_put(skb, NFTA_RULE_USERDATA, rule->ulen, nft_userdata(rule))) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_trim(skb, nlh); @@ -2361,7 +2364,8 @@ static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx, goto nla_put_failure; nla_nest_end(skb, desc); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_trim(skb, nlh); @@ -3035,7 +3039,8 @@ static int nf_tables_fill_setelem_info(struct sk_buff *skb, nla_nest_end(skb, nest); - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_trim(skb, nlh); @@ -3324,7 +3329,8 @@ static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net, if (nla_put_be32(skb, NFTA_GEN_ID, htonl(net->nft.base_seq))) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_trim(skb, nlh); diff --git a/net/netlabel/netlabel_cipso_v4.c b/net/netlabel/netlabel_cipso_v4.c index c2f2a53a4879..179625353cac 100644 --- a/net/netlabel/netlabel_cipso_v4.c +++ b/net/netlabel/netlabel_cipso_v4.c @@ -641,7 +641,8 @@ static int netlbl_cipsov4_listall_cb(struct cipso_v4_doi *doi_def, void *arg) if (ret_val != 0) goto listall_cb_failure; - return genlmsg_end(cb_arg->skb, data); + genlmsg_end(cb_arg->skb, data); + return 0; listall_cb_failure: genlmsg_cancel(cb_arg->skb, data); diff --git a/net/netlabel/netlabel_mgmt.c 
b/net/netlabel/netlabel_mgmt.c index e66e977ef2fa..8b3b789c43c2 100644 --- a/net/netlabel/netlabel_mgmt.c +++ b/net/netlabel/netlabel_mgmt.c @@ -456,7 +456,8 @@ static int netlbl_mgmt_listall_cb(struct netlbl_dom_map *entry, void *arg) goto listall_cb_failure; cb_arg->seq++; - return genlmsg_end(cb_arg->skb, data); + genlmsg_end(cb_arg->skb, data); + return 0; listall_cb_failure: genlmsg_cancel(cb_arg->skb, data); @@ -620,7 +621,8 @@ static int netlbl_mgmt_protocols_cb(struct sk_buff *skb, if (ret_val != 0) goto protocols_cb_failure; - return genlmsg_end(skb, data); + genlmsg_end(skb, data); + return 0; protocols_cb_failure: genlmsg_cancel(skb, data); diff --git a/net/netlabel/netlabel_unlabeled.c b/net/netlabel/netlabel_unlabeled.c index 78a63c18779e..aec7994f78cf 100644 --- a/net/netlabel/netlabel_unlabeled.c +++ b/net/netlabel/netlabel_unlabeled.c @@ -1163,7 +1163,8 @@ static int netlbl_unlabel_staticlist_gen(u32 cmd, goto list_cb_failure; cb_arg->seq++; - return genlmsg_end(cb_arg->skb, data); + genlmsg_end(cb_arg->skb, data); + return 0; list_cb_failure: genlmsg_cancel(cb_arg->skb, data); diff --git a/net/netlink/diag.c b/net/netlink/diag.c index bb59a7ed0859..3ee63a3cff30 100644 --- a/net/netlink/diag.c +++ b/net/netlink/diag.c @@ -91,7 +91,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, sk_diag_put_rings_cfg(sk, skb)) goto out_nlmsg_trim; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; out_nlmsg_trim: nlmsg_cancel(skb, nlh); diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 2e11061ef885..f52a7d5734cd 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -756,7 +756,8 @@ static int ctrl_fill_info(struct genl_family *family, u32 portid, u32 seq, nla_nest_end(skb, nla_grps); } - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); @@ -796,7 +797,8 @@ static int ctrl_fill_mcgrp_info(struct genl_family *family, nla_nest_end(skb, nest); nla_nest_end(skb, nla_grps); - return genlmsg_end(skb, hdr); + genlmsg_end(skb, hdr); + return 0; nla_put_failure: genlmsg_cancel(skb, hdr); diff --git a/net/nfc/netlink.c b/net/nfc/netlink.c index 44989fc8cddf..be387e6219a0 100644 --- a/net/nfc/netlink.c +++ b/net/nfc/netlink.c @@ -102,7 +102,8 @@ static int nfc_genl_send_target(struct sk_buff *msg, struct nfc_target *target, goto nla_put_failure; } - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -518,7 +519,8 @@ static int nfc_genl_send_device(struct sk_buff *msg, struct nfc_dev *dev, nla_put_u8(msg, NFC_ATTR_RF_MODE, dev->rf_mode)) goto nla_put_failure; - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -908,7 +910,8 @@ static int nfc_genl_send_params(struct sk_buff *msg, nla_put_u16(msg, NFC_ATTR_LLC_PARAM_MIUX, be16_to_cpu(local->miux))) goto nla_put_failure; - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: @@ -1247,8 +1250,7 @@ static int nfc_genl_send_se(struct sk_buff *msg, struct nfc_dev *dev, nla_put_u8(msg, NFC_ATTR_SE_TYPE, se->type)) goto nla_put_failure; - if (genlmsg_end(msg, hdr) < 0) - goto nla_put_failure; + genlmsg_end(msg, hdr); } return 0; diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 8bda3cc12344..f45f1bf4422c 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -799,7 +799,8 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int 
dp_ifindex, if (err) goto error; - return genlmsg_end(skb, ovs_header); + genlmsg_end(skb, ovs_header); + return 0; error: genlmsg_cancel(skb, ovs_header); @@ -1349,7 +1350,8 @@ static int ovs_dp_cmd_fill_info(struct datapath *dp, struct sk_buff *skb, if (nla_put_u32(skb, OVS_DP_ATTR_USER_FEATURES, dp->user_features)) goto nla_put_failure; - return genlmsg_end(skb, ovs_header); + genlmsg_end(skb, ovs_header); + return 0; nla_put_failure: genlmsg_cancel(skb, ovs_header); @@ -1723,7 +1725,8 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb, if (err == -EMSGSIZE) goto error; - return genlmsg_end(skb, ovs_header); + genlmsg_end(skb, ovs_header); + return 0; nla_put_failure: err = -EMSGSIZE; diff --git a/net/packet/diag.c b/net/packet/diag.c index 92f2c7107eec..0ed68f0238bf 100644 --- a/net/packet/diag.c +++ b/net/packet/diag.c @@ -177,7 +177,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, PACKET_DIAG_FILTER)) goto out_nlmsg_trim; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; out_nlmsg_trim: nlmsg_cancel(skb, nlh); diff --git a/net/phonet/pn_netlink.c b/net/phonet/pn_netlink.c index b64151ade6b3..54d766842c2b 100644 --- a/net/phonet/pn_netlink.c +++ b/net/phonet/pn_netlink.c @@ -121,7 +121,8 @@ static int fill_addr(struct sk_buff *skb, struct net_device *dev, u8 addr, ifm->ifa_index = dev->ifindex; if (nla_put_u8(skb, IFA_LOCAL, addr)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -190,7 +191,8 @@ static int fill_route(struct sk_buff *skb, struct net_device *dev, u8 dst, if (nla_put_u8(skb, RTA_DST, dst) || nla_put_u32(skb, RTA_OIF, dev->ifindex)) goto nla_put_failure; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; nla_put_failure: nlmsg_cancel(skb, nlh); @@ -282,9 +284,13 @@ static int route_dumpit(struct sk_buff *skb, struct netlink_callback *cb) if (addr_idx++ < addr_start_idx) continue; - if (fill_route(skb, dev, addr << 2, NETLINK_CB(cb->skb).portid, - cb->nlh->nlmsg_seq, RTM_NEWROUTE)) - goto out; + fill_route(skb, dev, addr << 2, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, RTM_NEWROUTE); + /* fill_route() used to return > 0 (or negative errors) but + * never 0 - ignore the return value and just go out to + * call dumpit again from outside to preserve the behavior + */ + goto out; } out: diff --git a/net/unix/diag.c b/net/unix/diag.c index 86fa0f3b2caf..ef542fbca9fe 100644 --- a/net/unix/diag.c +++ b/net/unix/diag.c @@ -155,7 +155,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, struct unix_diag_r if (nla_put_u8(skb, UNIX_DIAG_SHUTDOWN, sk->sk_shutdown)) goto out_nlmsg_trim; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; out_nlmsg_trim: nlmsg_cancel(skb, nlh); diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c index 380784378df8..4ed9039bd5f9 100644 --- a/net/wireless/nl80211.c +++ b/net/wireless/nl80211.c @@ -1721,7 +1721,8 @@ static int nl80211_send_wiphy(struct cfg80211_registered_device *rdev, break; } finish: - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -2404,7 +2405,8 @@ static int nl80211_send_iface(struct sk_buff *msg, u32 portid, u32 seq, int flag goto nla_put_failure; } - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -3825,7 +3827,8 @@ static int nl80211_send_station(struct sk_buff *msg, u32 cmd, u32 portid, 
sinfo->assoc_req_ies)) goto nla_put_failure; - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -4555,7 +4558,8 @@ static int nl80211_send_mpath(struct sk_buff *msg, u32 portid, u32 seq, nla_nest_end(msg, pinfoattr); - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -5507,7 +5511,8 @@ static int nl80211_send_regdom(struct sk_buff *msg, struct netlink_callback *cb, nla_put_flag(msg, NL80211_ATTR_WIPHY_SELF_MANAGED_REG)) goto nla_put_failure; - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -6577,7 +6582,8 @@ static int nl80211_send_bss(struct sk_buff *msg, struct netlink_callback *cb, nla_nest_end(msg, bss); - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; fail_unlock_rcu: rcu_read_unlock(); @@ -6686,7 +6692,8 @@ static int nl80211_send_survey(struct sk_buff *msg, u32 portid, u32 seq, nla_nest_end(msg, infoattr); - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -11025,7 +11032,8 @@ static int nl80211_send_scan_msg(struct sk_buff *msg, /* ignore errors and send incomplete event anyway */ nl80211_add_scan_req(msg, rdev); - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); @@ -11048,7 +11056,8 @@ nl80211_send_sched_scan_msg(struct sk_buff *msg, nla_put_u32(msg, NL80211_ATTR_IFINDEX, netdev->ifindex)) goto nla_put_failure; - return genlmsg_end(msg, hdr); + genlmsg_end(msg, hdr); + return 0; nla_put_failure: genlmsg_cancel(msg, hdr); diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 8128594ab379..7de2ed9ec46d 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1019,7 +1019,8 @@ static int build_spdinfo(struct sk_buff *skb, struct net *net, return err; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_set_spdinfo(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -1121,7 +1122,8 @@ static int build_sadinfo(struct sk_buff *skb, struct net *net, return err; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_get_sadinfo(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -1842,7 +1844,8 @@ static int build_aevent(struct sk_buff *skb, struct xfrm_state *x, const struct if (err) goto out_cancel; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; out_cancel: nlmsg_cancel(skb, nlh); @@ -2282,7 +2285,8 @@ static int build_migrate(struct sk_buff *skb, const struct xfrm_migrate *m, goto out_cancel; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; out_cancel: nlmsg_cancel(skb, nlh); @@ -2490,7 +2494,8 @@ static int build_expire(struct sk_buff *skb, struct xfrm_state *x, const struct if (err) return err; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_exp_state_notify(struct xfrm_state *x, const struct km_event *c) @@ -2712,7 +2717,8 @@ static int build_acquire(struct sk_buff *skb, struct xfrm_state *x, return err; } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_send_acquire(struct xfrm_state *x, struct xfrm_tmpl *xt, @@ -2827,7 +2833,8 @@ static int build_polexpire(struct sk_buff *skb, struct xfrm_policy *xp, } upe->hard = !!hard; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_exp_policy_notify(struct xfrm_policy *xp, int dir, const struct km_event *c) 
@@ -2986,7 +2993,8 @@ static int build_report(struct sk_buff *skb, u8 proto, return err; } } - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_send_report(struct net *net, u8 proto, @@ -3031,7 +3039,8 @@ static int build_mapping(struct sk_buff *skb, struct xfrm_state *x, um->old_sport = x->encap->encap_sport; um->reqid = x->props.reqid; - return nlmsg_end(skb, nlh); + nlmsg_end(skb, nlh); + return 0; } static int xfrm_send_mapping(struct xfrm_state *x, xfrm_address_t *ipaddr, -- cgit v1.2.3 From 32b7eb877165fdb29f1722071c6c64ced1789014 Mon Sep 17 00:00:00 2001 From: Miroslav Benes Date: Tue, 20 Jan 2015 12:49:35 +0100 Subject: livepatch: change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING Change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING in Kconfigs. HAVE_ bools are prevalent there and we should go with the flow. Suggested-by: Andrew Morton Signed-off-by: Miroslav Benes Acked-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- arch/x86/Kconfig | 2 +- kernel/livepatch/Kconfig | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 460b31b79938..29b095231276 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -17,7 +17,7 @@ config X86_64 depends on 64BIT select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF - select ARCH_HAVE_LIVE_PATCHING + select HAVE_LIVE_PATCHING ### Arch settings config X86 diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig index 66797aa597c0..347ee2221137 100644 --- a/kernel/livepatch/Kconfig +++ b/kernel/livepatch/Kconfig @@ -1,4 +1,4 @@ -config ARCH_HAVE_LIVE_PATCHING +config HAVE_LIVE_PATCHING bool help Arch supports kernel live patching @@ -9,7 +9,7 @@ config LIVE_PATCHING depends on MODULES depends on SYSFS depends on KALLSYMS_ALL - depends on ARCH_HAVE_LIVE_PATCHING + depends on HAVE_LIVE_PATCHING help Say Y here if you want to support kernel live patching. This option has no runtime impact until a kernel "patch" -- cgit v1.2.3 From 2fded7f44b8fcf79e274c3f0cfbd0298f95308f3 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Tue, 23 Dec 2014 16:39:54 -0500 Subject: audit: remove vestiges of vers_ops Should have been removed with commit 18900909 ("audit: remove the old depricated kernel interface"). 
Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- include/linux/audit.h | 1 - kernel/auditfilter.c | 2 -- 2 files changed, 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/audit.h b/include/linux/audit.h index 93331929d643..b481779a8dee 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -46,7 +46,6 @@ struct audit_tree; struct sk_buff; struct audit_krule { - int vers_ops; u32 pflags; u32 flags; u32 listnr; diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 103586e239a2..81c94d739e3f 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -425,7 +425,6 @@ static struct audit_entry *audit_data_to_entry(struct audit_rule_data *data, goto exit_nofree; bufp = data->buf; - entry->rule.vers_ops = 2; for (i = 0; i < data->field_count; i++) { struct audit_field *f = &entry->rule.fields[i]; @@ -758,7 +757,6 @@ struct audit_entry *audit_dupe_rule(struct audit_krule *old) return ERR_PTR(-ENOMEM); new = &entry->rule; - new->vers_ops = old->vers_ops; new->flags = old->flags; new->pflags = old->pflags; new->listnr = old->listnr; -- cgit v1.2.3 From 83a90bb1345767f0cb96d242fd8b9db44b2b0e17 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Tue, 20 Jan 2015 09:26:18 -0600 Subject: livepatch: enforce patch stacking semantics Only allow the topmost patch on the stack to be enabled or disabled, so that patches can't be removed or added in an arbitrary order. Suggested-by: Jiri Kosina Signed-off-by: Josh Poimboeuf Reviewed-by: Jiri Slaby Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 3d9c00b5223a..2401e7f955d3 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -379,6 +379,11 @@ static int __klp_disable_patch(struct klp_patch *patch) struct klp_object *obj; int ret; + /* enforce stacking: only the last enabled patch can be disabled */ + if (!list_is_last(&patch->list, &klp_patches) && + list_next_entry(patch, list)->state == KLP_ENABLED) + return -EBUSY; + pr_notice("disabling patch '%s'\n", patch->mod->name); for (obj = patch->objs; obj->funcs; obj++) { @@ -435,6 +440,11 @@ static int __klp_enable_patch(struct klp_patch *patch) if (WARN_ON(patch->state != KLP_DISABLED)) return -EINVAL; + /* enforce stacking: only the first disabled patch can be enabled */ + if (patch->list.prev != &klp_patches && + list_prev_entry(patch, list)->state == KLP_DISABLED) + return -EBUSY; + pr_notice_once("tainting kernel with TAINT_LIVEPATCH\n"); add_taint(TAINT_LIVEPATCH, LOCKDEP_STILL_OK); -- cgit v1.2.3 From 3c33f5b99d688deafd21d4a770303691c7c3a320 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Tue, 20 Jan 2015 09:26:19 -0600 Subject: livepatch: support for repatching a function Add support for patching a function multiple times. If multiple patches affect a function, the function in the most recently enabled patch "wins". This enables a cumulative patch upgrade path, where each patch is a superset of previous patches. This requires restructuring the data a little bit. With the current design, where each klp_func struct has its own ftrace_ops, we'd have to unregister the old ops and then register the new ops, because FTRACE_OPS_FL_IPMODIFY prevents us from having two ops registered for the same function at the same time. That would leave a regression window where the function isn't patched at all (not good for a patch upgrade path). 
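To make the IPMODIFY limitation concrete, here is an illustrative sketch (hypothetical handler names; it assumes, consistent with the description above, that ftrace rejects a second IPMODIFY ops on an address that already has one):

    #include <linux/ftrace.h>

    /* Hypothetical handlers for two versions of the same patched function */
    static void patch_v1_handler(unsigned long ip, unsigned long parent_ip,
                                 struct ftrace_ops *ops, struct pt_regs *regs);
    static void patch_v2_handler(unsigned long ip, unsigned long parent_ip,
                                 struct ftrace_ops *ops, struct pt_regs *regs);

    static struct ftrace_ops ops_v1 = {
            .func  = patch_v1_handler,
            .flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_IPMODIFY,
    };
    static struct ftrace_ops ops_v2 = {
            .func  = patch_v2_handler,
            .flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_IPMODIFY,
    };

    /* With one ftrace_ops per klp_func, upgrading v1 -> v2 would have to
     * unregister v1 first, because two IPMODIFY ops cannot be registered
     * for old_addr at the same time: */
    static int upgrade_with_per_func_ops(unsigned long old_addr)
    {
            int ret = unregister_ftrace_function(&ops_v1);

            if (ret)
                    return ret;
            /* regression window: old_addr runs unpatched here */
            ret = ftrace_set_filter_ip(&ops_v2, old_addr, 0, 0);
            if (ret)
                    return ret;
            return register_ftrace_function(&ops_v2);
    }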
This patch replaces the per-klp_func ftrace_ops with a global klp_ops list, with one ftrace_ops per original function. A single ftrace_ops is shared between all klp_funcs which have the same old_addr. This allows the switch between function versions to happen instantaneously by updating the klp_ops struct's func_stack list. The winner is the klp_func at the top of the func_stack (front of the list). [ jkosina@suse.cz: turn WARN_ON() into WARN_ON_ONCE() in ftrace handler to avoid storm in pathological cases ] Signed-off-by: Josh Poimboeuf Reviewed-by: Jiri Slaby Signed-off-by: Jiri Kosina --- include/linux/livepatch.h | 4 +- kernel/livepatch/core.c | 170 ++++++++++++++++++++++++++++++++-------------- 2 files changed, 121 insertions(+), 53 deletions(-) (limited to 'kernel') diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 950bc615842f..f14c6fb262b4 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -40,8 +40,8 @@ enum klp_state { * @old_addr: a hint conveying at what address the old function * can be found (optional, vmlinux patches only) * @kobj: kobject for sysfs resources - * @fops: ftrace operations structure * @state: tracks function-level patch application state + * @stack_node: list node for klp_ops func_stack list */ struct klp_func { /* external */ @@ -59,8 +59,8 @@ struct klp_func { /* internal */ struct kobject kobj; - struct ftrace_ops *fops; enum klp_state state; + struct list_head stack_node; }; /** diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 2401e7f955d3..bc05d390ce85 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -29,17 +29,53 @@ #include #include -/* - * The klp_mutex protects the klp_patches list and state transitions of any - * structure reachable from the patches list. References to any structure must - * be obtained under mutex protection. +/** + * struct klp_ops - structure for tracking registered ftrace ops structs + * + * A single ftrace_ops is shared between all enabled replacement functions + * (klp_func structs) which have the same old_addr. This allows the switch + * between function versions to happen instantaneously by updating the klp_ops + * struct's func_stack list. The winner is the klp_func at the top of the + * func_stack (front of the list). + * + * @node: node for the global klp_ops list + * @func_stack: list head for the stack of klp_func's (active func is on top) + * @fops: registered ftrace ops struct */ +struct klp_ops { + struct list_head node; + struct list_head func_stack; + struct ftrace_ops fops; +}; +/* + * The klp_mutex protects the global lists and state transitions of any + * structure reachable from them. References to any structure must be obtained + * under mutex protection (except in klp_ftrace_handler(), which uses RCU to + * ensure it gets consistent data). 
+ */ static DEFINE_MUTEX(klp_mutex); + static LIST_HEAD(klp_patches); +static LIST_HEAD(klp_ops); static struct kobject *klp_root_kobj; +static struct klp_ops *klp_find_ops(unsigned long old_addr) +{ + struct klp_ops *ops; + struct klp_func *func; + + list_for_each_entry(ops, &klp_ops, node) { + func = list_first_entry(&ops->func_stack, struct klp_func, + stack_node); + if (func->old_addr == old_addr) + return ops; + } + + return NULL; +} + static bool klp_is_module(struct klp_object *obj) { return obj->name; @@ -267,16 +303,28 @@ static int klp_write_object_relocations(struct module *pmod, static void notrace klp_ftrace_handler(unsigned long ip, unsigned long parent_ip, - struct ftrace_ops *ops, + struct ftrace_ops *fops, struct pt_regs *regs) { - struct klp_func *func = ops->private; + struct klp_ops *ops; + struct klp_func *func; + + ops = container_of(fops, struct klp_ops, fops); + + rcu_read_lock(); + func = list_first_or_null_rcu(&ops->func_stack, struct klp_func, + stack_node); + rcu_read_unlock(); + + if (WARN_ON_ONCE(!func)) + return; klp_arch_set_pc(regs, (unsigned long)func->new_func); } static int klp_disable_func(struct klp_func *func) { + struct klp_ops *ops; int ret; if (WARN_ON(func->state != KLP_ENABLED)) @@ -285,16 +333,28 @@ static int klp_disable_func(struct klp_func *func) if (WARN_ON(!func->old_addr)) return -EINVAL; - ret = unregister_ftrace_function(func->fops); - if (ret) { - pr_err("failed to unregister ftrace handler for function '%s' (%d)\n", - func->old_name, ret); - return ret; - } + ops = klp_find_ops(func->old_addr); + if (WARN_ON(!ops)) + return -EINVAL; - ret = ftrace_set_filter_ip(func->fops, func->old_addr, 1, 0); - if (ret) - pr_warn("function unregister succeeded but failed to clear the filter\n"); + if (list_is_singular(&ops->func_stack)) { + ret = unregister_ftrace_function(&ops->fops); + if (ret) { + pr_err("failed to unregister ftrace handler for function '%s' (%d)\n", + func->old_name, ret); + return ret; + } + + ret = ftrace_set_filter_ip(&ops->fops, func->old_addr, 1, 0); + if (ret) + pr_warn("function unregister succeeded but failed to clear the filter\n"); + + list_del_rcu(&func->stack_node); + list_del(&ops->node); + kfree(ops); + } else { + list_del_rcu(&func->stack_node); + } func->state = KLP_DISABLED; @@ -303,6 +363,7 @@ static int klp_disable_func(struct klp_func *func) static int klp_enable_func(struct klp_func *func) { + struct klp_ops *ops; int ret; if (WARN_ON(!func->old_addr)) @@ -311,22 +372,50 @@ static int klp_enable_func(struct klp_func *func) if (WARN_ON(func->state != KLP_DISABLED)) return -EINVAL; - ret = ftrace_set_filter_ip(func->fops, func->old_addr, 0, 0); - if (ret) { - pr_err("failed to set ftrace filter for function '%s' (%d)\n", - func->old_name, ret); - return ret; - } + ops = klp_find_ops(func->old_addr); + if (!ops) { + ops = kzalloc(sizeof(*ops), GFP_KERNEL); + if (!ops) + return -ENOMEM; + + ops->fops.func = klp_ftrace_handler; + ops->fops.flags = FTRACE_OPS_FL_SAVE_REGS | + FTRACE_OPS_FL_DYNAMIC | + FTRACE_OPS_FL_IPMODIFY; + + list_add(&ops->node, &klp_ops); + + INIT_LIST_HEAD(&ops->func_stack); + list_add_rcu(&func->stack_node, &ops->func_stack); + + ret = ftrace_set_filter_ip(&ops->fops, func->old_addr, 0, 0); + if (ret) { + pr_err("failed to set ftrace filter for function '%s' (%d)\n", + func->old_name, ret); + goto err; + } + + ret = register_ftrace_function(&ops->fops); + if (ret) { + pr_err("failed to register ftrace handler for function '%s' (%d)\n", + func->old_name, ret); + 
ftrace_set_filter_ip(&ops->fops, func->old_addr, 1, 0); + goto err; + } + - ret = register_ftrace_function(func->fops); - if (ret) { - pr_err("failed to register ftrace handler for function '%s' (%d)\n", - func->old_name, ret); - ftrace_set_filter_ip(func->fops, func->old_addr, 1, 0); } else { - func->state = KLP_ENABLED; + list_add_rcu(&func->stack_node, &ops->func_stack); } + func->state = KLP_ENABLED; + + return ret; + +err: + list_del_rcu(&func->stack_node); + list_del(&ops->node); + kfree(ops); return ret; } @@ -582,10 +671,6 @@ static struct kobj_type klp_ktype_patch = { static void klp_kobj_release_func(struct kobject *kobj) { - struct klp_func *func; - - func = container_of(kobj, struct klp_func, kobj); - kfree(func->fops); } static struct kobj_type klp_ktype_func = { @@ -642,28 +727,11 @@ static void klp_free_patch(struct klp_patch *patch) static int klp_init_func(struct klp_object *obj, struct klp_func *func) { - struct ftrace_ops *ops; - int ret; - - ops = kzalloc(sizeof(*ops), GFP_KERNEL); - if (!ops) - return -ENOMEM; - - ops->private = func; - ops->func = klp_ftrace_handler; - ops->flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_DYNAMIC | - FTRACE_OPS_FL_IPMODIFY; - func->fops = ops; + INIT_LIST_HEAD(&func->stack_node); func->state = KLP_DISABLED; - ret = kobject_init_and_add(&func->kobj, &klp_ktype_func, - obj->kobj, func->old_name); - if (ret) { - kfree(func->fops); - return ret; - } - - return 0; + return kobject_init_and_add(&func->kobj, &klp_ktype_func, + obj->kobj, func->old_name); } /* parts of the initialization that is done only when the object is loaded */ -- cgit v1.2.3 From dbed7ddab967550d1181633e8ac7905808f29a94 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Tue, 20 Jan 2015 16:07:55 -0600 Subject: livepatch: fix uninitialized return value Fix a potentially uninitialized return value in klp_enable_func(). Signed-off-by: Josh Poimboeuf Reviewed-by: Miroslav Benes Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index bc05d390ce85..9adf86bfbcd5 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -410,7 +410,7 @@ static int klp_enable_func(struct klp_func *func) func->state = KLP_ENABLED; - return ret; + return 0; err: list_del_rcu(&func->stack_node); -- cgit v1.2.3 From 3efb5f21a36fbddd524cffe36426a84622ce580e Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Tue, 20 Jan 2015 11:28:28 -0500 Subject: tracing: Remove unneeded includes of debugfs.h and fs.h The creation of tracing files and directories is for the most part encapsulated in helper functions in trace.c. Other files do not need to include debugfs.h or fs.h, as they may have needed to in the past. Remove them from the files that do not need them. 
Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 2 -- kernel/trace/trace_branch.c | 1 - kernel/trace/trace_export.c | 2 -- kernel/trace/trace_irqsoff.c | 2 -- kernel/trace/trace_nop.c | 2 -- kernel/trace/trace_printk.c | 2 -- kernel/trace/trace_sched_switch.c | 2 -- kernel/trace/trace_sched_wakeup.c | 2 -- kernel/trace/trace_stack.c | 2 -- 9 files changed, 17 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 7a4104cb95cb..96079180de3d 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,7 +9,6 @@ #include #include #include -#include #include #include #include /* for self test */ @@ -23,7 +22,6 @@ #include #include #include -#include #include diff --git a/kernel/trace/trace_branch.c b/kernel/trace/trace_branch.c index 7d6e2afde669..57cbf1efdd44 100644 --- a/kernel/trace/trace_branch.c +++ b/kernel/trace/trace_branch.c @@ -7,7 +7,6 @@ #include #include #include -#include #include #include #include diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c index d4ddde28a81a..12e2b99be862 100644 --- a/kernel/trace/trace_export.c +++ b/kernel/trace/trace_export.c @@ -6,12 +6,10 @@ #include #include #include -#include #include #include #include #include -#include #include "trace_output.h" diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index 9bb104f748d0..8523ea345f2b 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -10,11 +10,9 @@ * Copyright (C) 2004 Nadia Yvette Chambers */ #include -#include #include #include #include -#include #include "trace.h" diff --git a/kernel/trace/trace_nop.c b/kernel/trace/trace_nop.c index fcf0a9e48916..8bb2071474dd 100644 --- a/kernel/trace/trace_nop.c +++ b/kernel/trace/trace_nop.c @@ -6,8 +6,6 @@ */ #include -#include -#include #include #include "trace.h" diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c index c4e70b6bd7fa..7ee4b5cc1ce5 100644 --- a/kernel/trace/trace_printk.c +++ b/kernel/trace/trace_printk.c @@ -5,7 +5,6 @@ * */ #include -#include #include #include #include @@ -15,7 +14,6 @@ #include #include #include -#include #include "trace.h" diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c index 2e293beb186e..419ca37e72c9 100644 --- a/kernel/trace/trace_sched_switch.c +++ b/kernel/trace/trace_sched_switch.c @@ -5,8 +5,6 @@ * */ #include -#include -#include #include #include #include diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 8fb84b362816..d6e1003724e9 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -10,8 +10,6 @@ * Copyright (C) 2004 Nadia Yvette Chambers */ #include -#include -#include #include #include #include diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index 16eddb308c33..e80927b88eb0 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -7,12 +7,10 @@ #include #include #include -#include #include #include #include #include -#include #include -- cgit v1.2.3 From 14a5ae40f0def33a422a45b2ed09198adb7bf11c Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Tue, 20 Jan 2015 11:14:16 -0500 Subject: tracing: Use IS_ERR() check for return value of tracing_init_dentry() tracing_init_dentry() will soon return NULL as a valid pointer for the top level tracing directory. NULL cannot be used as an error value. Instead, switch to ERR_PTR() and check the return status with IS_ERR().
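For reference, the ERR_PTR()/IS_ERR() convention being adopted looks roughly like this (a simplified sketch with made-up function names, not code from this patch):

    #include <linux/debugfs.h>
    #include <linux/err.h>

    /* The error is encoded inside the pointer value itself, so NULL
     * stays available as a legitimate return value. */
    static struct dentry *example_init_dentry(void)
    {
            if (!debugfs_initialized())
                    return ERR_PTR(-ENODEV);        /* real error */
            return NULL;                            /* valid result */
    }

    static int example_caller(void)
    {
            struct dentry *d = example_init_dentry();

            if (IS_ERR(d))                  /* true only for ERR_PTR() values */
                    return PTR_ERR(d);      /* recovers -ENODEV */
            /* d may legitimately be NULL here */
            return 0;
    }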
Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 2 +- kernel/trace/trace.c | 8 ++++---- kernel/trace/trace_events.c | 2 +- kernel/trace/trace_functions_graph.c | 2 +- kernel/trace/trace_kprobe.c | 2 +- kernel/trace/trace_printk.c | 2 +- kernel/trace/trace_stack.c | 2 +- kernel/trace/trace_stat.c | 2 +- kernel/trace/trace_uprobe.c | 2 +- 9 files changed, 12 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 929a733d302e..80c9d34540dd 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -5419,7 +5419,7 @@ static __init int ftrace_init_debugfs(void) struct dentry *d_tracer; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; ftrace_init_dyn_debugfs(d_tracer); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 7669b1f3234e..acd27555dc5b 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5820,7 +5820,7 @@ struct dentry *tracing_init_dentry_tr(struct trace_array *tr) return tr->dir; if (!debugfs_initialized()) - return NULL; + return ERR_PTR(-ENODEV); if (tr->flags & TRACE_ARRAY_FL_GLOBAL) tr->dir = debugfs_create_dir("tracing", NULL); @@ -5844,7 +5844,7 @@ static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu) return tr->percpu_dir; d_tracer = tracing_init_dentry_tr(tr); - if (!d_tracer) + if (IS_ERR(d_tracer)) return NULL; tr->percpu_dir = debugfs_create_dir("per_cpu", d_tracer); @@ -6047,7 +6047,7 @@ static struct dentry *trace_options_init_dentry(struct trace_array *tr) return tr->options; d_tracer = tracing_init_dentry_tr(tr); - if (!d_tracer) + if (IS_ERR(d_tracer)) return NULL; tr->options = debugfs_create_dir("options", d_tracer); @@ -6538,7 +6538,7 @@ static __init int tracer_init_debugfs(void) trace_access_lock_init(); d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; init_tracer_debugfs(&global_trace, d_tracer); diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 366a78a3e61e..4ff8c1394017 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2490,7 +2490,7 @@ static __init int event_trace_init(void) return -ENODEV; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; entry = debugfs_create_file("available_events", 0444, d_tracer, diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index ba476009e5de..2d25ad1526bb 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -1437,7 +1437,7 @@ static __init int init_graph_debugfs(void) struct dentry *d_tracer; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; trace_create_file("max_graph_depth", 0644, d_tracer, diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 5edb518be345..b4a00def88f5 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -1320,7 +1320,7 @@ static __init int init_kprobe_trace(void) return -EINVAL; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; entry = debugfs_create_file("kprobe_events", 0644, d_tracer, diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c index 7ee4b5cc1ce5..36c1455b7567 100644 --- a/kernel/trace/trace_printk.c +++ b/kernel/trace/trace_printk.c @@ -347,7 +347,7 @@ static __init int init_trace_printk_function_export(void) struct dentry *d_tracer; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if 
(IS_ERR(d_tracer)) return 0; trace_create_file("printk_formats", 0444, d_tracer, diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index e80927b88eb0..c3e4fcfddd45 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -460,7 +460,7 @@ static __init int stack_trace_init(void) struct dentry *d_tracer; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; trace_create_file("stack_max_size", 0644, d_tracer, diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c index 7af67360b330..75e19e86c954 100644 --- a/kernel/trace/trace_stat.c +++ b/kernel/trace/trace_stat.c @@ -276,7 +276,7 @@ static int tracing_stat_init(void) struct dentry *d_tracing; d_tracing = tracing_init_dentry(); - if (!d_tracing) + if (IS_ERR(d_tracing)) return 0; stat_dir = debugfs_create_dir("trace_stat", d_tracing); diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 8520acc34b18..5f0eba9e5e6b 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1321,7 +1321,7 @@ static __init int init_uprobe_trace(void) struct dentry *d_tracer; d_tracer = tracing_init_dentry(); - if (!d_tracer) + if (IS_ERR(d_tracer)) return 0; trace_create_file("uprobe_events", 0644, d_tracer, -- cgit v1.2.3 From fd3522fdc84023b050bb40318d9fc71a9adc22bc Mon Sep 17 00:00:00 2001 From: Paul Moore Date: Thu, 22 Jan 2015 00:00:10 -0500 Subject: audit: enable filename recording via getname_kernel() Enable recording of filenames in getname_kernel() and remove the kludgy workaround in __audit_inode() now that we have proper filename logging for kernel users. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore Reviewed-by: Richard Guy Briggs Signed-off-by: Al Viro --- fs/namei.c | 1 + kernel/auditsc.c | 40 +++------------------------------------- 2 files changed, 4 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/fs/namei.c b/fs/namei.c index 5ec3515162e6..a3fde77d4abf 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -243,6 +243,7 @@ getname_kernel(const char * filename) memcpy((char *)result->name, filename, len); result->uptr = NULL; result->aname = NULL; + audit_getname(result); return result; } diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 072566dd0caf..132dbcdef6ec 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1882,44 +1882,10 @@ out_alloc: n = audit_alloc_name(context, AUDIT_TYPE_UNKNOWN); if (!n) return; - /* unfortunately, while we may have a path name to record with the - * inode, we can't always rely on the string lasting until the end of - * the syscall so we need to create our own copy, it may fail due to - * memory allocation issues, but we do our best */ - if (name) { - /* we can't use getname_kernel() due to size limits */ - size_t len = strlen(name->name) + 1; - struct filename *new = __getname(); - - if (unlikely(!new)) - goto out; + if (name) + /* no need to set ->name_put as the original will cleanup */ + n->name = name; - if (len <= (PATH_MAX - sizeof(*new))) { - new->name = (char *)(new) + sizeof(*new); - new->separate = false; - } else if (len <= PATH_MAX) { - /* this looks odd, but is due to final_putname() */ - struct filename *new2; - - new2 = kmalloc(sizeof(*new2), GFP_KERNEL); - if (unlikely(!new2)) { - __putname(new); - goto out; - } - new2->name = (char *)new; - new2->separate = true; - new = new2; - } else { - /* we should never get here, but let's be safe */ - __putname(new); - goto out; - } - strlcpy((char *)new->name, name->name, 
len); - new->uptr = NULL; - new->aname = n; - n->name = new; - n->name_put = true; - } out: if (parent) { n->name_len = n->name ? parent_len(n->name->name) : AUDIT_NAME_FULL; -- cgit v1.2.3 From 57c59f5837bdfd0b4fee3b02a44857e263a09bfa Mon Sep 17 00:00:00 2001 From: Paul Moore Date: Thu, 22 Jan 2015 00:00:16 -0500 Subject: audit: fix filename matching in __audit_inode() and __audit_inode_child() In all likelihood there were some subtle, and perhaps not so subtle, bugs with filename matching in audit_inode() and audit_inode_child() for some time, however, recent changes to the audit filename code have definitely broken the filename matching code. The breakage could result in duplicate filenames in the audit log and other odd audit record entries. This patch fixes the filename matching code and restores some sanity to the filename audit records. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore Signed-off-by: Al Viro --- kernel/auditsc.c | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 132dbcdef6ec..4f521964ccaa 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1846,6 +1846,7 @@ void __audit_inode(struct filename *name, const struct dentry *dentry, /* The struct filename _must_ have a populated ->name */ BUG_ON(!name->name); #endif + /* * If we have a pointer to an audit_names entry already, then we can * just use it directly if the type is correct. @@ -1863,7 +1864,17 @@ void __audit_inode(struct filename *name, const struct dentry *dentry, } list_for_each_entry_reverse(n, &context->names_list, list) { - if (!n->name || strcmp(n->name->name, name->name)) + if (n->ino) { + /* valid inode number, use that for the comparison */ + if (n->ino != inode->i_ino || + n->dev != inode->i_sb->s_dev) + continue; + } else if (n->name) { + /* inode number has not been set, check the name */ + if (strcmp(n->name->name, name->name)) + continue; + } else + /* no inode and no name (?!) ... this is odd ... */ continue; /* match the correct record type */ @@ -1936,11 +1947,16 @@ void __audit_inode_child(const struct inode *parent, /* look for a parent entry first */ list_for_each_entry(n, &context->names_list, list) { - if (!n->name || n->type != AUDIT_TYPE_PARENT) + if (!n->name || + (n->type != AUDIT_TYPE_PARENT && + n->type != AUDIT_TYPE_UNKNOWN)) continue; - if (n->ino == parent->i_ino && - !audit_compare_dname_path(dname, n->name->name, n->name_len)) { + if (n->ino == parent->i_ino && n->dev == parent->i_sb->s_dev && + !audit_compare_dname_path(dname, + n->name->name, n->name_len)) { + if (n->type == AUDIT_TYPE_UNKNOWN) + n->type = AUDIT_TYPE_PARENT; found_parent = n; break; } @@ -1949,11 +1965,8 @@ void __audit_inode_child(const struct inode *parent, /* is there a matching child entry? */ list_for_each_entry(n, &context->names_list, list) { /* can only match entries that have a name */ - if (!n->name || n->type != type) - continue; - - /* if we found a parent, make sure this one is a child of it */ - if (found_parent && (n->name != found_parent->name)) + if (!n->name || + (n->type != type && n->type != AUDIT_TYPE_UNKNOWN)) continue; if (!strcmp(dname, n->name->name) || @@ -1961,6 +1974,8 @@ void __audit_inode_child(const struct inode *parent, found_parent ? 
found_parent->name_len : AUDIT_NAME_FULL)) { + if (n->type == AUDIT_TYPE_UNKNOWN) + n->type = type; found_child = n; break; } @@ -1989,6 +2004,7 @@ void __audit_inode_child(const struct inode *parent, found_child->name_put = false; } } + if (inode) audit_copy_inode(found_child, dentry, inode); else -- cgit v1.2.3 From 55422d0bd292f5ad143cc32cb8bb8505257274c4 Mon Sep 17 00:00:00 2001 From: Paul Moore Date: Thu, 22 Jan 2015 00:00:23 -0500 Subject: audit: replace getname()/putname() hacks with reference counters In order to ensure that filenames are not released before the audit subsystem is done with the strings there are a number of hacks built into the fs and audit subsystems around getname() and putname(). To say these hacks are "ugly" would be kind. This patch removes the filename hackery in favor of a more conventional reference count based approach. The diffstat below tells most of the story; lots of audit/fs specific code is replaced with a traditional reference count based approach that is easily understood, even by those not familiar with the audit and/or fs subsystems. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore Signed-off-by: Al Viro --- fs/namei.c | 29 +++++++------- include/linux/audit.h | 3 -- include/linux/fs.h | 9 +---- kernel/audit.h | 17 +------- kernel/auditsc.c | 105 ++++++-------------------------------------------- 5 files changed, 29 insertions(+), 134 deletions(-) (limited to 'kernel') diff --git a/fs/namei.c b/fs/namei.c index a3fde77d4abf..96ca11dea4a2 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -118,15 +118,6 @@ * POSIX.1 2.4: an empty pathname is invalid (ENOENT). * PATH_MAX includes the nul terminator --RR. */ -void final_putname(struct filename *name) -{ - if (name->separate) { - __putname(name->name); - kfree(name); - } else { - __putname(name); - } -} #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename)) @@ -145,6 +136,7 @@ getname_flags(const char __user *filename, int flags, int *empty) result = __getname(); if (unlikely(!result)) return ERR_PTR(-ENOMEM); + result->refcnt = 1; /* * First, try to embed the struct filename inside the names_cache @@ -179,6 +171,7 @@ recopy: } result->name = kname; result->separate = true; + result->refcnt = 1; max = PATH_MAX; goto recopy; } @@ -202,7 +195,7 @@ recopy: return result; error: - final_putname(result); + putname(result); return err; } @@ -243,19 +236,25 @@ getname_kernel(const char * filename) memcpy((char *)result->name, filename, len); result->uptr = NULL; result->aname = NULL; + result->refcnt = 1; audit_getname(result); return result; } -#ifdef CONFIG_AUDITSYSCALL void putname(struct filename *name) { - if (unlikely(!audit_dummy_context())) - return audit_putname(name); - final_putname(name); + BUG_ON(name->refcnt <= 0); + + if (--name->refcnt > 0) + return; + + if (name->separate) { + __putname(name->name); + kfree(name); + } else + __putname(name); } -#endif static int check_acl(struct inode *inode, int mask) { diff --git a/include/linux/audit.h b/include/linux/audit.h index af84234e1f6e..87c2d347d255 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -128,7 +128,6 @@ extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1, extern void __audit_syscall_exit(int ret_success, long ret_value); extern struct filename *__audit_reusename(const __user char *uptr); extern void __audit_getname(struct filename *name); -extern void audit_putname(struct filename *name); #define AUDIT_INODE_PARENT 1 /* dentry represents the parent */ 
#define AUDIT_INODE_HIDDEN 2 /* audit record should be hidden */ @@ -353,8 +352,6 @@ static inline struct filename *audit_reusename(const __user char *name) } static inline void audit_getname(struct filename *name) { } -static inline void audit_putname(struct filename *name) -{ } static inline void __audit_inode(struct filename *name, const struct dentry *dentry, unsigned int flags) diff --git a/include/linux/fs.h b/include/linux/fs.h index f90c0282c114..b49842fe203f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2080,6 +2080,7 @@ struct filename { const char *name; /* pointer to actual string */ const __user char *uptr; /* original userland pointer */ struct audit_names *aname; + int refcnt; bool separate; /* should "name" be freed? */ }; @@ -2101,6 +2102,7 @@ extern int filp_close(struct file *, fl_owner_t id); extern struct filename *getname_flags(const char __user *, int, int *); extern struct filename *getname(const char __user *); extern struct filename *getname_kernel(const char *); +extern void putname(struct filename *name); enum { FILE_CREATED = 1, @@ -2121,15 +2123,8 @@ extern void __init vfs_caches_init(unsigned long); extern struct kmem_cache *names_cachep; -extern void final_putname(struct filename *name); - #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL) #define __putname(name) kmem_cache_free(names_cachep, (void *)(name)) -#ifndef CONFIG_AUDITSYSCALL -#define putname(name) final_putname(name) -#else -extern void putname(struct filename *name); -#endif #ifdef CONFIG_BLOCK extern int register_blkdev(unsigned int, const char *); diff --git a/kernel/audit.h b/kernel/audit.h index 3cdffad5a1d9..1caa0d345d90 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -24,12 +24,6 @@ #include #include -/* 0 = no checking - 1 = put_count checking - 2 = verbose put_count checking -*/ -#define AUDIT_DEBUG 0 - /* AUDIT_NAMES is the number of slots we reserve in the audit_context * for saving names from getname(). If we get more names we will allocate * a name dynamically and also add those to the list anchored by names_list. */ @@ -74,9 +68,8 @@ struct audit_cap_data { }; }; -/* When fs/namei.c:getname() is called, we store the pointer in name and - * we don't let putname() free it (instead we free all of the saved - * pointers at syscall exit time). +/* When fs/namei.c:getname() is called, we store the pointer in name and bump + * the refcnt in the associated filename struct. * * Further, in fs/namei.c:path_lookup() we store the inode and device. */ @@ -86,7 +79,6 @@ struct audit_names { struct filename *name; int name_len; /* number of chars to log */ bool hidden; /* don't log this record */ - bool name_put; /* call __putname()? 
*/ unsigned long ino; dev_t dev; @@ -208,11 +200,6 @@ struct audit_context { }; int fds[2]; struct audit_proctitle proctitle; - -#if AUDIT_DEBUG - int put_count; - int ino_count; -#endif }; extern u32 audit_ever_enabled; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 4f521964ccaa..0c38604a630c 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -866,33 +866,10 @@ static inline void audit_free_names(struct audit_context *context) { struct audit_names *n, *next; -#if AUDIT_DEBUG == 2 - if (context->put_count + context->ino_count != context->name_count) { - int i = 0; - - pr_err("%s:%d(:%d): major=%d in_syscall=%d" - " name_count=%d put_count=%d ino_count=%d" - " [NOT freeing]\n", __FILE__, __LINE__, - context->serial, context->major, context->in_syscall, - context->name_count, context->put_count, - context->ino_count); - list_for_each_entry(n, &context->names_list, list) { - pr_err("names[%d] = %p = %s\n", i++, n->name, - n->name->name ?: "(null)"); - } - dump_stack(); - return; - } -#endif -#if AUDIT_DEBUG - context->put_count = 0; - context->ino_count = 0; -#endif - list_for_each_entry_safe(n, next, &context->names_list, list) { list_del(&n->list); - if (n->name && n->name_put) - final_putname(n->name); + if (n->name) + putname(n->name); if (n->should_free) kfree(n); } @@ -1711,9 +1688,6 @@ static struct audit_names *audit_alloc_name(struct audit_context *context, list_add_tail(&aname->list, &context->names_list); context->name_count++; -#if AUDIT_DEBUG - context->ino_count++; -#endif return aname; } @@ -1734,8 +1708,10 @@ __audit_reusename(const __user char *uptr) list_for_each_entry(n, &context->names_list, list) { if (!n->name) continue; - if (n->name->uptr == uptr) + if (n->name->uptr == uptr) { + n->name->refcnt++; return n->name; + } } return NULL; } @@ -1752,19 +1728,8 @@ void __audit_getname(struct filename *name) struct audit_context *context = current->audit_context; struct audit_names *n; - if (!context->in_syscall) { -#if AUDIT_DEBUG == 2 - pr_err("%s:%d(:%d): ignoring getname(%p)\n", - __FILE__, __LINE__, context->serial, name); - dump_stack(); -#endif + if (!context->in_syscall) return; - } - -#if AUDIT_DEBUG - /* The filename _must_ have a populated ->name */ - BUG_ON(!name->name); -#endif n = audit_alloc_name(context, AUDIT_TYPE_UNKNOWN); if (!n) @@ -1772,56 +1737,13 @@ void __audit_getname(struct filename *name) n->name = name; n->name_len = AUDIT_NAME_FULL; - n->name_put = true; name->aname = n; + name->refcnt++; if (!context->pwd.dentry) get_fs_pwd(current->fs, &context->pwd); } -/* audit_putname - intercept a putname request - * @name: name to intercept and delay for putname - * - * If we have stored the name from getname in the audit context, - * then we delay the putname until syscall exit. - * Called from include/linux/fs.h:putname(). 
- */ -void audit_putname(struct filename *name) -{ - struct audit_context *context = current->audit_context; - - BUG_ON(!context); - if (!name->aname || !context->in_syscall) { -#if AUDIT_DEBUG == 2 - pr_err("%s:%d(:%d): final_putname(%p)\n", - __FILE__, __LINE__, context->serial, name); - if (context->name_count) { - struct audit_names *n; - int i = 0; - - list_for_each_entry(n, &context->names_list, list) - pr_err("name[%d] = %p = %s\n", i++, n->name, - n->name->name ?: "(null)"); - } -#endif - final_putname(name); - } -#if AUDIT_DEBUG - else { - ++context->put_count; - if (context->put_count > context->name_count) { - pr_err("%s:%d(:%d): major=%d in_syscall=%d putname(%p)" - " name_count=%d put_count=%d\n", - __FILE__, __LINE__, - context->serial, context->major, - context->in_syscall, name->name, - context->name_count, context->put_count); - dump_stack(); - } - } -#endif -} - /** * __audit_inode - store the inode and device from a lookup * @name: name being audited @@ -1842,11 +1764,6 @@ void __audit_inode(struct filename *name, const struct dentry *dentry, if (!name) goto out_alloc; -#if AUDIT_DEBUG - /* The struct filename _must_ have a populated ->name */ - BUG_ON(!name->name); -#endif - /* * If we have a pointer to an audit_names entry already, then we can * just use it directly if the type is correct. @@ -1893,9 +1810,10 @@ out_alloc: n = audit_alloc_name(context, AUDIT_TYPE_UNKNOWN); if (!n) return; - if (name) - /* no need to set ->name_put as the original will cleanup */ + if (name) { n->name = name; + name->refcnt++; + } out: if (parent) { @@ -2000,8 +1918,7 @@ void __audit_inode_child(const struct inode *parent, if (found_parent) { found_child->name = found_parent->name; found_child->name_len = AUDIT_NAME_FULL; - /* don't call __putname() */ - found_child->name_put = false; + found_child->name->refcnt++; } } -- cgit v1.2.3 From e2e64a932556cdfae455497dbe94a8db151fc9fa Mon Sep 17 00:00:00 2001 From: Jesse Brandeburg Date: Thu, 18 Dec 2014 17:22:06 -0800 Subject: genirq: Set initial affinity in irq_set_affinity_hint() Problem: The default behavior of the kernel is somewhat undesirable as all requested interrupts end up on CPU0 after registration. A user can run the irqbalance daemon, or can manually configure smp_affinity via the proc filesystem, but the default affinity of the interrupts for all devices is always CPU zero; this can cause performance problems or very heavy CPU use of only one core if not noticed and fixed by the user. Solution: Enable the setting of the initial affinity directly when the driver sets a hint. This means that kernel drivers can provide an initial affinity setting for the interrupt, instead of all interrupts starting out life on CPU0. Of course if irqbalance is still running then the interrupts will get moved as before. This function is currently called by drivers in block, crypto, infiniband, ethernet and scsi trees, but only a handful, so these will be the devices affected by this change. Tested on i40e, and default interrupts were spread across the CPUs according to the hint.
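A usage sketch (hypothetical multi-queue driver, not taken from this patch); after this change the hint below also sets the initial affinity:

    #include <linux/interrupt.h>
    #include <linux/cpumask.h>

    /* Hypothetical driver with one vector per queue: the hint spreads
     * queues across CPUs, and with this patch it also determines where
     * each interrupt starts, instead of everything beginning on CPU0. */
    static void example_spread_vectors(unsigned int *irqs, unsigned int nvec)
    {
            unsigned int i;

            for (i = 0; i < nvec; i++)
                    irq_set_affinity_hint(irqs[i],
                                          cpumask_of(i % num_online_cpus()));
    }

The in-tree call sites that currently set a hint, and therefore pick up this behavior, are: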
drivers/block/mtip32xx/mtip32xx.c:3 drivers/block/nvme-core.c:2 drivers/crypto/qat/qat_dh895xcc/adf_isr.c:3 drivers/infiniband/hw/qib/qib_iba7322.c:2 drivers/net/ethernet/intel/i40e/i40e_main.c:3 drivers/net/ethernet/intel/i40evf/i40evf_main.c:3 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3 drivers/net/ethernet/mellanox/mlx4/en_cq.c:2 drivers/scsi/hpsa.c:3 drivers/scsi/lpfc/lpfc_init.c:3 drivers/scsi/megaraid/megaraid_sas_base.c:8 drivers/soc/ti/knav_qmss_acc.c:1 drivers/soc/ti/knav_qmss_queue.c:2 drivers/virtio/virtio_pci_common.c:2 Signed-off-by: Jesse Brandeburg Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20141219012206.4220.27491.stgit@jbrandeb-cp2.jf.intel.com Signed-off-by: Thomas Gleixner --- kernel/irq/manage.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 80692373abd6..f038e586a4b9 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -243,6 +243,8 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m) return -EINVAL; desc->affinity_hint = m; irq_put_desc_unlock(desc, flags); + /* set the initial affinity to prevent every interrupt being on CPU0 */ + __irq_set_affinity(irq, m, false); return 0; } EXPORT_SYMBOL_GPL(irq_set_affinity_hint); -- cgit v1.2.3 From 89f703f0932341b316b2312581dacddba14b3876 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 13 Jan 2015 22:24:00 +0100 Subject: X.509: shut up about included cert for silent build Every kernel build that includes X.509 support prints out a message like - Including cert signing_key.x509 This may be useful for some cases, but when doing automated build tests, it just means noise. To hide the message, this uses '$(kecho)' for printing the message, which means we still see it when building with V=1, but not at the normal level or when building with 'make -s'. Signed-off-by: Arnd Bergmann Signed-off-by: David Howells --- kernel/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index a59481a3fa6c..23e17a7e7a63 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -142,7 +142,7 @@ endif kernel/system_certificates.o: $(obj)/x509_certificate_list quiet_cmd_x509certs = CERTS $@ - cmd_x509certs = cat $(X509_CERTIFICATES) /dev/null >$@ $(foreach X509,$(X509_CERTIFICATES),; echo " - Including cert $(X509)") + cmd_x509certs = cat $(X509_CERTIFICATES) /dev/null >$@ $(foreach X509,$(X509_CERTIFICATES),; $(kecho) " - Including cert $(X509)") targets += $(obj)/x509_certificate_list $(obj)/x509_certificate_list: $(X509_CERTIFICATES) $(obj)/.x509.list -- cgit v1.2.3 From f5f4eda4c9b9e26ec7d7c8ce083e386016ffc0f8 Mon Sep 17 00:00:00 2001 From: Nishanth Menon Date: Fri, 5 Dec 2014 11:19:08 -0600 Subject: PM / QoS: Add debugfs support to view the list of constraints PM QoS requests are notoriously hard to debug and made even more so due to their highly dynamic nature. Having visibility into the internal data representation per constraint allows us to have much better appreciation of potential issues or bad usage by drivers in the system. So introduce for all classes of PM QoS, an entry in /sys/kernel/debug/pm_qos that shall show all the current requests as well as the snapshot of the value these requests boil down to. 
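The requests shown by this interface are ordinary PM QoS requests added through the existing kernel API. A minimal sketch of a request that would appear as "Active" (the function names here are hypothetical; the value 4444 mirrors the sample output below):

	#include <linux/pm_qos.h>

	static struct pm_qos_request example_req;

	/*
	 * Hypothetical caller: while this request is held, it shows up as
	 * an "Active" line under /sys/kernel/debug/pm_qos/cpu_dma_latency.
	 */
	static void example_hold_latency(void)
	{
		pm_qos_add_request(&example_req, PM_QOS_CPU_DMA_LATENCY, 4444);
	}

	static void example_release_latency(void)
	{
		pm_qos_remove_request(&example_req);
	}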
For example: ==> /sys/kernel/debug/pm_qos/cpu_dma_latency <== 1: 4444: Active 2: 2000000000: Default 3: 2000000000: Default 4: 2000000000: Default Type=Minimum, Value=4444, Requests: active=1 / total=4 ==> /sys/kernel/debug/pm_qos/memory_bandwidth <== Empty! ... The actual values listed take their meaning from the QoS class they belong to; the 'Type' indicates the logic used to collate the information - Minimum, Maximum, or Sum. Value is the collation of all requests. This interface also compares the values with the defaults for the QoS class and marks the ones that are currently active. Signed-off-by: Nishanth Menon Signed-off-by: Dave Gerlach Acked-by: Kevin Hilman Signed-off-by: Rafael J. Wysocki --- kernel/power/qos.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 89 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 5f4c006c4b1e..97b0df71303e 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -41,6 +41,8 @@ #include #include #include +#include +#include #include #include @@ -182,6 +184,81 @@ static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) c->target_value = value; } +static inline int pm_qos_get_value(struct pm_qos_constraints *c); +static int pm_qos_dbg_show_requests(struct seq_file *s, void *unused) +{ + struct pm_qos_object *qos = (struct pm_qos_object *)s->private; + struct pm_qos_constraints *c; + struct pm_qos_request *req; + char *type; + unsigned long flags; + int tot_reqs = 0; + int active_reqs = 0; + + if (IS_ERR_OR_NULL(qos)) { + pr_err("%s: bad qos param!\n", __func__); + return -EINVAL; + } + c = qos->constraints; + if (IS_ERR_OR_NULL(c)) { + pr_err("%s: Bad constraints on qos?\n", __func__); + return -EINVAL; + } + + /* Lock to ensure we have a snapshot */ + spin_lock_irqsave(&pm_qos_lock, flags); + if (plist_head_empty(&c->list)) { + seq_puts(s, "Empty!\n"); + goto out; + } + + switch (c->type) { + case PM_QOS_MIN: + type = "Minimum"; + break; + case PM_QOS_MAX: + type = "Maximum"; + break; + case PM_QOS_SUM: + type = "Sum"; + break; + default: + type = "Unknown"; + } + + plist_for_each_entry(req, &c->list, node) { + char *state = "Default"; + + if ((req->node).prio != c->default_value) { + active_reqs++; + state = "Active"; + } + tot_reqs++; + seq_printf(s, "%d: %d: %s\n", tot_reqs, + (req->node).prio, state); + } + + seq_printf(s, "Type=%s, Value=%d, Requests: active=%d / total=%d\n", + type, pm_qos_get_value(c), active_reqs, tot_reqs); + +out: + spin_unlock_irqrestore(&pm_qos_lock, flags); + return 0; +} + +static int pm_qos_dbg_open(struct inode *inode, struct file *file) +{ + return single_open(file, pm_qos_dbg_show_requests, + inode->i_private); +} + +static const struct file_operations pm_qos_debug_fops = { + .open = pm_qos_dbg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + /** * pm_qos_update_target - manages the constraints list and calls the notifiers * if needed @@ -509,12 +586,17 @@ int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier) EXPORT_SYMBOL_GPL(pm_qos_remove_notifier); /* User space interface to PM QoS classes via misc devices */ -static int register_pm_qos_misc(struct pm_qos_object *qos) +static int register_pm_qos_misc(struct pm_qos_object *qos, struct dentry *d) { qos->pm_qos_power_miscdev.minor = MISC_DYNAMIC_MINOR; qos->pm_qos_power_miscdev.name = qos->name; qos->pm_qos_power_miscdev.fops = &pm_qos_power_fops; + if (d) { +
(void)debugfs_create_file(qos->name, S_IRUGO, d, + (void *)qos, &pm_qos_debug_fops); + } + return misc_register(&qos->pm_qos_power_miscdev); } @@ -608,11 +690,16 @@ static int __init pm_qos_power_init(void) { int ret = 0; int i; + struct dentry *d; BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); + d = debugfs_create_dir("pm_qos", NULL); + if (IS_ERR_OR_NULL(d)) + d = NULL; + for (i = PM_QOS_CPU_DMA_LATENCY; i < PM_QOS_NUM_CLASSES; i++) { - ret = register_pm_qos_misc(pm_qos_array[i]); + ret = register_pm_qos_misc(pm_qos_array[i], d); if (ret < 0) { printk(KERN_ERR "pm_qos_param: %s setup failed\n", pm_qos_array[i]->name); -- cgit v1.2.3 From d78cb3680c8e95c100d5a351b8cfc7dd3c589b68 Mon Sep 17 00:00:00 2001 From: Rickard Strandqvist Date: Sun, 11 Jan 2015 23:27:38 +0100 Subject: PM / hibernate: Remove unused function Remove the function get_safe_write_buffer() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist Acked-by: Pavel Machek Signed-off-by: Rafael J. Wysocki --- kernel/power/snapshot.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 0c40c16174b4..8e75a5a1e9b4 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -2310,8 +2310,6 @@ static inline void free_highmem_data(void) free_image_page(buffer, PG_UNSAFE_CLEAR); } #else -static inline int get_safe_write_buffer(void) { return 0; } - static unsigned int count_highmem_image_pages(struct memory_bitmap *bm) { return 0; } -- cgit v1.2.3 From f4a4a8b1252a08b60cde66a6622bbca4a7f4af2e Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sun, 28 Dec 2014 09:27:07 -0500 Subject: file->f_path.dentry is pinned down for as long as the file is open... 
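Since the open struct file already holds a reference on its dentry, callers no longer need a dget()/dput() pair around uses of file->f_path.dentry. A minimal sketch of the resulting pattern (the helper is hypothetical, not part of this patch):

	#include <linux/fs.h>
	#include <linux/dcache.h>

	/*
	 * Hypothetical helper: while "file" is open, file->f_path.dentry is
	 * already pinned, so it can be used directly with no extra
	 * dget()/dput() around the access.
	 */
	static umode_t example_file_mode(struct file *file)
	{
		return file->f_path.dentry->d_inode->i_mode;
	}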
Signed-off-by: Al Viro --- kernel/auditsc.c | 5 +---- security/commoncap.c | 6 +----- 2 files changed, 2 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 072566dd0caf..55f82fce2526 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2405,7 +2405,6 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm, struct audit_aux_data_bprm_fcaps *ax; struct audit_context *context = current->audit_context; struct cpu_vfs_cap_data vcaps; - struct dentry *dentry; ax = kmalloc(sizeof(*ax), GFP_KERNEL); if (!ax) @@ -2415,9 +2414,7 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm, ax->d.next = context->aux; context->aux = (void *)ax; - dentry = dget(bprm->file->f_path.dentry); - get_vfs_caps_from_disk(dentry, &vcaps); - dput(dentry); + get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps); ax->fcap.permitted = vcaps.permitted; ax->fcap.inheritable = vcaps.inheritable; diff --git a/security/commoncap.c b/security/commoncap.c index 2915d8503054..f66713bd7450 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -434,7 +434,6 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data */ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_cap) { - struct dentry *dentry; int rc = 0; struct cpu_vfs_cap_data vcaps; @@ -446,9 +445,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) return 0; - dentry = dget(bprm->file->f_path.dentry); - - rc = get_vfs_caps_from_disk(dentry, &vcaps); + rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps); if (rc < 0) { if (rc == -EINVAL) printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n", @@ -464,7 +461,6 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c __func__, rc, bprm->filename); out: - dput(dentry); if (rc) bprm_clear_caps(bprm); -- cgit v1.2.3 From 9e251d02041432487d89cb340e72490c4bbc198a Mon Sep 17 00:00:00 2001 From: Al Viro Date: Fri, 9 Jan 2015 20:40:02 -0500 Subject: kill pin_put() Signed-off-by: Al Viro --- fs/fs_pin.c | 11 ----------- include/linux/fs_pin.h | 1 - kernel/acct.c | 14 ++++++++++---- 3 files changed, 10 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/fs/fs_pin.c b/fs/fs_pin.c index 9368236ca100..f173313760b8 100644 --- a/fs/fs_pin.c +++ b/fs/fs_pin.c @@ -4,19 +4,8 @@ #include "internal.h" #include "mount.h" -static void pin_free_rcu(struct rcu_head *head) -{ - kfree(container_of(head, struct fs_pin, rcu)); -} - static DEFINE_SPINLOCK(pin_lock); -void pin_put(struct fs_pin *p) -{ - if (atomic_long_dec_and_test(&p->count)) - call_rcu(&p->rcu, pin_free_rcu); -} - void pin_remove(struct fs_pin *pin) { spin_lock(&pin_lock); diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h index f66525e72ccf..68a54b7741aa 100644 --- a/include/linux/fs_pin.h +++ b/include/linux/fs_pin.h @@ -12,6 +12,5 @@ struct fs_pin { void (*kill)(struct fs_pin *); }; -void pin_put(struct fs_pin *); void pin_remove(struct fs_pin *); void pin_insert(struct fs_pin *, struct vfsmount *); diff --git a/kernel/acct.c b/kernel/acct.c index 33738ef972f3..7bb9e659a7da 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -124,6 +124,12 @@ out: return acct->active; } +static void acct_put(struct bsd_acct_struct *p) +{ + if (atomic_long_dec_and_test(&p->pin.count)) + kfree_rcu(p, pin.rcu); +} + static struct bsd_acct_struct *acct_get(struct pid_namespace *ns) { struct bsd_acct_struct *res; @@ -144,7 
+150,7 @@ again: mutex_lock(&res->lock); if (!res->ns) { mutex_unlock(&res->lock); - pin_put(&res->pin); + acct_put(res); goto again; } return res; @@ -175,7 +181,7 @@ static void acct_kill(struct bsd_acct_struct *acct, acct->ns = NULL; atomic_long_dec(&acct->pin.count); mutex_unlock(&acct->lock); - pin_put(&acct->pin); + acct_put(acct); } } @@ -186,7 +192,7 @@ static void acct_pin_kill(struct fs_pin *pin) mutex_lock(&acct->lock); if (!acct->ns) { mutex_unlock(&acct->lock); - pin_put(pin); + acct_put(acct); acct = NULL; } acct_kill(acct, NULL); @@ -576,7 +582,7 @@ static void slow_acct_process(struct pid_namespace *ns) if (acct) { do_acct_process(acct); mutex_unlock(&acct->lock); - pin_put(&acct->pin); + acct_put(acct); } } } -- cgit v1.2.3 From 32426f6653cbfde1ca16aff27a530ee36332f796 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 10 Jan 2015 00:07:35 -0500 Subject: pull bumping refcount into ->kill() there will be one more change of ->kill() calling conventions; this isn't final. Signed-off-by: Al Viro --- fs/fs_pin.c | 12 ------------ kernel/acct.c | 6 ++++++ 2 files changed, 6 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/fs/fs_pin.c b/fs/fs_pin.c index f173313760b8..5eb39a93a560 100644 --- a/fs/fs_pin.c +++ b/fs/fs_pin.c @@ -34,12 +34,6 @@ void mnt_pin_kill(struct mount *m) break; } pin = hlist_entry(p, struct fs_pin, m_list); - if (!atomic_long_inc_not_zero(&pin->count)) { - rcu_read_unlock(); - cpu_relax(); - continue; - } - rcu_read_unlock(); pin->kill(pin); } } @@ -56,12 +50,6 @@ void sb_pin_kill(struct super_block *sb) break; } pin = hlist_entry(p, struct fs_pin, s_list); - if (!atomic_long_inc_not_zero(&pin->count)) { - rcu_read_unlock(); - cpu_relax(); - continue; - } - rcu_read_unlock(); pin->kill(pin); } } diff --git a/kernel/acct.c b/kernel/acct.c index 7bb9e659a7da..a74f3d1a0c26 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -189,6 +189,12 @@ static void acct_pin_kill(struct fs_pin *pin) { struct bsd_acct_struct *acct; acct = container_of(pin, struct bsd_acct_struct, pin); + if (!atomic_long_inc_not_zero(&pin->count)) { + rcu_read_unlock(); + cpu_relax(); + return; + } + rcu_read_unlock(); mutex_lock(&acct->lock); if (!acct->ns) { mutex_unlock(&acct->lock); -- cgit v1.2.3 From 34cece2e8a1d2b66f00e153a19b80b4d4cec4eb8 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 10 Jan 2015 12:47:38 -0500 Subject: take count and rcu_head out of fs_pin Signed-off-by: Al Viro --- include/linux/fs_pin.h | 10 ++-------- kernel/acct.c | 14 ++++++++------ 2 files changed, 10 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h index 68a54b7741aa..a2e71912e758 100644 --- a/include/linux/fs_pin.h +++ b/include/linux/fs_pin.h @@ -1,14 +1,8 @@ #include struct fs_pin { - atomic_long_t count; - union { - struct { - struct hlist_node s_list; - struct hlist_node m_list; - }; - struct rcu_head rcu; - }; + struct hlist_node s_list; + struct hlist_node m_list; void (*kill)(struct fs_pin *); }; diff --git a/kernel/acct.c b/kernel/acct.c index a74f3d1a0c26..b8fbefb8678f 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -80,6 +80,8 @@ static void do_acct_process(struct bsd_acct_struct *acct); struct bsd_acct_struct { struct fs_pin pin; + atomic_long_t count; + struct rcu_head rcu; struct mutex lock; int active; unsigned long needcheck; @@ -126,8 +128,8 @@ out: static void acct_put(struct bsd_acct_struct *p) { - if (atomic_long_dec_and_test(&p->pin.count)) - kfree_rcu(p, pin.rcu); + if 
(atomic_long_dec_and_test(&p->count)) + kfree_rcu(p, rcu); } static struct bsd_acct_struct *acct_get(struct pid_namespace *ns) @@ -141,7 +143,7 @@ again: rcu_read_unlock(); return NULL; } - if (!atomic_long_inc_not_zero(&res->pin.count)) { + if (!atomic_long_inc_not_zero(&res->count)) { rcu_read_unlock(); cpu_relax(); goto again; @@ -179,7 +181,7 @@ static void acct_kill(struct bsd_acct_struct *acct, pin_remove(&acct->pin); ns->bacct = new; acct->ns = NULL; - atomic_long_dec(&acct->pin.count); + atomic_long_dec(&acct->count); mutex_unlock(&acct->lock); acct_put(acct); } @@ -189,7 +191,7 @@ static void acct_pin_kill(struct fs_pin *pin) { struct bsd_acct_struct *acct; acct = container_of(pin, struct bsd_acct_struct, pin); - if (!atomic_long_inc_not_zero(&pin->count)) { + if (!atomic_long_inc_not_zero(&acct->count)) { rcu_read_unlock(); cpu_relax(); return; @@ -250,7 +252,7 @@ static int acct_on(struct filename *pathname) mnt = file->f_path.mnt; file->f_path.mnt = internal; - atomic_long_set(&acct->pin.count, 1); + atomic_long_set(&acct->count, 1); acct->pin.kill = acct_pin_kill; acct->file = file; acct->needcheck = jiffies; -- cgit v1.2.3 From 3b994d98a815d934ab6a77a380882865982c14f9 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 10 Jan 2015 17:18:37 -0500 Subject: get rid of the second argument of acct_kill() Replace the old ns->bacct only with NULL and only if it still points to acct. And assign the new value to it *before* calling acct_kill() in acct_on(). That way we don't need to pass the new acct to acct_kill(). Signed-off-by: Al Viro --- kernel/acct.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/acct.c b/kernel/acct.c index b8fbefb8678f..cf6588ab517b 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -168,8 +168,7 @@ static void close_work(struct work_struct *work) complete(&acct->done); } -static void acct_kill(struct bsd_acct_struct *acct, - struct bsd_acct_struct *new) +static void acct_kill(struct bsd_acct_struct *acct) { if (acct) { struct pid_namespace *ns = acct->ns; @@ -179,7 +178,7 @@ static void acct_kill(struct bsd_acct_struct *acct, schedule_work(&acct->work); wait_for_completion(&acct->done); pin_remove(&acct->pin); - ns->bacct = new; + cmpxchg(&ns->bacct, acct, NULL); acct->ns = NULL; atomic_long_dec(&acct->count); mutex_unlock(&acct->lock); @@ -203,7 +202,7 @@ static void acct_pin_kill(struct fs_pin *pin) acct_put(acct); acct = NULL; } - acct_kill(acct, NULL); + acct_kill(acct); } static int acct_on(struct filename *pathname) @@ -262,10 +261,8 @@ static int acct_on(struct filename *pathname) pin_insert(&acct->pin, mnt); old = acct_get(ns); - if (old) - acct_kill(old, acct); - else - ns->bacct = acct; + ns->bacct = acct; + acct_kill(old); mutex_unlock(&acct->lock); mnt_drop_write(mnt); mntput(mnt); @@ -302,7 +299,7 @@ SYSCALL_DEFINE1(acct, const char __user *, name) mutex_unlock(&acct_on_mutex); putname(tmp); } else { - acct_kill(acct_get(task_active_pid_ns(current)), NULL); + acct_kill(acct_get(task_active_pid_ns(current))); } return error; @@ -310,7 +307,7 @@ SYSCALL_DEFINE1(acct, const char __user *, name) void acct_exit_ns(struct pid_namespace *ns) { - acct_kill(acct_get(ns), NULL); + acct_kill(acct_get(ns)); } /* -- cgit v1.2.3 From 59eda0e07f43c950d31756213b607af673e551f0 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 10 Jan 2015 17:53:21 -0500 Subject: new fs_pin killing logics Signed-off-by: Al Viro --- fs/fs_pin.c | 54 +++++++++++++++++++++++++---- include/linux/fs_pin.h | 13 ++++++- 
include/linux/pid_namespace.h | 4 +-- kernel/acct.c | 81 ++++++++++++++++++------------------------- 4 files changed, 96 insertions(+), 56 deletions(-) (limited to 'kernel') diff --git a/fs/fs_pin.c b/fs/fs_pin.c index 50ef7d2ef03c..0c77bdc238b2 100644 --- a/fs/fs_pin.c +++ b/fs/fs_pin.c @@ -1,4 +1,5 @@ #include +#include #include #include #include "internal.h" @@ -12,6 +13,10 @@ void pin_remove(struct fs_pin *pin) hlist_del(&pin->m_list); hlist_del(&pin->s_list); spin_unlock(&pin_lock); + spin_lock_irq(&pin->wait.lock); + pin->done = 1; + wake_up_locked(&pin->wait); + spin_unlock_irq(&pin->wait.lock); } void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p) @@ -28,19 +33,58 @@ void pin_insert(struct fs_pin *pin, struct vfsmount *m) pin_insert_group(pin, m, &m->mnt_sb->s_pins); } +void pin_kill(struct fs_pin *p) +{ + wait_queue_t wait; + + if (!p) { + rcu_read_unlock(); + return; + } + init_wait(&wait); + spin_lock_irq(&p->wait.lock); + if (likely(!p->done)) { + p->done = -1; + spin_unlock_irq(&p->wait.lock); + rcu_read_unlock(); + p->kill(p); + return; + } + if (p->done > 0) { + spin_unlock_irq(&p->wait.lock); + rcu_read_unlock(); + return; + } + __add_wait_queue(&p->wait, &wait); + while (1) { + set_current_state(TASK_UNINTERRUPTIBLE); + spin_unlock_irq(&p->wait.lock); + rcu_read_unlock(); + schedule(); + rcu_read_lock(); + if (likely(list_empty(&wait.task_list))) + break; + /* OK, we know p couldn't have been freed yet */ + spin_lock_irq(&p->wait.lock); + if (p->done > 0) { + spin_unlock_irq(&p->wait.lock); + break; + } + } + rcu_read_unlock(); +} + void mnt_pin_kill(struct mount *m) { while (1) { struct hlist_node *p; - struct fs_pin *pin; rcu_read_lock(); p = ACCESS_ONCE(m->mnt_pins.first); if (!p) { rcu_read_unlock(); break; } - pin = hlist_entry(p, struct fs_pin, m_list); - pin->kill(pin); + pin_kill(hlist_entry(p, struct fs_pin, m_list)); } } @@ -48,14 +92,12 @@ void group_pin_kill(struct hlist_head *p) { while (1) { struct hlist_node *q; - struct fs_pin *pin; rcu_read_lock(); q = ACCESS_ONCE(p->first); if (!q) { rcu_read_unlock(); break; } - pin = hlist_entry(q, struct fs_pin, s_list); - pin->kill(pin); + pin_kill(hlist_entry(q, struct fs_pin, s_list)); } } diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h index 2be38d1464ae..9dc4e0384bfb 100644 --- a/include/linux/fs_pin.h +++ b/include/linux/fs_pin.h @@ -1,11 +1,22 @@ -#include +#include struct fs_pin { + wait_queue_head_t wait; + int done; struct hlist_node s_list; struct hlist_node m_list; void (*kill)(struct fs_pin *); }; +struct vfsmount; + +static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *)) +{ + init_waitqueue_head(&p->wait); + p->kill = kill; +} + void pin_remove(struct fs_pin *); void pin_insert_group(struct fs_pin *, struct vfsmount *, struct hlist_head *); void pin_insert(struct fs_pin *, struct vfsmount *); +void pin_kill(struct fs_pin *); diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index b9cf6c51b181..918b117a7cd3 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -19,7 +19,7 @@ struct pidmap { #define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) #define PIDMAP_ENTRIES ((PID_MAX_LIMIT+BITS_PER_PAGE-1)/BITS_PER_PAGE) -struct bsd_acct_struct; +struct fs_pin; struct pid_namespace { struct kref kref; @@ -37,7 +37,7 @@ struct pid_namespace { struct dentry *proc_thread_self; #endif #ifdef CONFIG_BSD_PROCESS_ACCT - struct bsd_acct_struct *bacct; + struct fs_pin *bacct; #endif struct user_namespace 
*user_ns; struct work_struct proc_work; diff --git a/kernel/acct.c b/kernel/acct.c index cf6588ab517b..e6c10d1a4058 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -76,7 +76,6 @@ int acct_parm[3] = {4, 2, 30}; /* * External references and all of the globals. */ -static void do_acct_process(struct bsd_acct_struct *acct); struct bsd_acct_struct { struct fs_pin pin; @@ -91,6 +90,8 @@ struct bsd_acct_struct { struct completion done; }; +static void do_acct_process(struct bsd_acct_struct *acct); + /* * Check the amount of free space and suspend/resume accordingly. */ @@ -132,13 +133,18 @@ static void acct_put(struct bsd_acct_struct *p) kfree_rcu(p, rcu); } +static inline struct bsd_acct_struct *to_acct(struct fs_pin *p) +{ + return p ? container_of(p, struct bsd_acct_struct, pin) : NULL; +} + static struct bsd_acct_struct *acct_get(struct pid_namespace *ns) { struct bsd_acct_struct *res; again: smp_rmb(); rcu_read_lock(); - res = ACCESS_ONCE(ns->bacct); + res = to_acct(ACCESS_ONCE(ns->bacct)); if (!res) { rcu_read_unlock(); return NULL; @@ -150,7 +156,7 @@ again: } rcu_read_unlock(); mutex_lock(&res->lock); - if (!res->ns) { + if (res != to_acct(ACCESS_ONCE(ns->bacct))) { mutex_unlock(&res->lock); acct_put(res); goto again; @@ -158,6 +164,19 @@ again: return res; } +static void acct_pin_kill(struct fs_pin *pin) +{ + struct bsd_acct_struct *acct = to_acct(pin); + mutex_lock(&acct->lock); + do_acct_process(acct); + schedule_work(&acct->work); + wait_for_completion(&acct->done); + cmpxchg(&acct->ns->bacct, pin, NULL); + mutex_unlock(&acct->lock); + pin_remove(pin); + acct_put(acct); +} + static void close_work(struct work_struct *work) { struct bsd_acct_struct *acct = container_of(work, struct bsd_acct_struct, work); @@ -168,49 +187,13 @@ static void close_work(struct work_struct *work) complete(&acct->done); } -static void acct_kill(struct bsd_acct_struct *acct) -{ - if (acct) { - struct pid_namespace *ns = acct->ns; - do_acct_process(acct); - INIT_WORK(&acct->work, close_work); - init_completion(&acct->done); - schedule_work(&acct->work); - wait_for_completion(&acct->done); - pin_remove(&acct->pin); - cmpxchg(&ns->bacct, acct, NULL); - acct->ns = NULL; - atomic_long_dec(&acct->count); - mutex_unlock(&acct->lock); - acct_put(acct); - } -} - -static void acct_pin_kill(struct fs_pin *pin) -{ - struct bsd_acct_struct *acct; - acct = container_of(pin, struct bsd_acct_struct, pin); - if (!atomic_long_inc_not_zero(&acct->count)) { - rcu_read_unlock(); - cpu_relax(); - return; - } - rcu_read_unlock(); - mutex_lock(&acct->lock); - if (!acct->ns) { - mutex_unlock(&acct->lock); - acct_put(acct); - acct = NULL; - } - acct_kill(acct); -} - static int acct_on(struct filename *pathname) { struct file *file; struct vfsmount *mnt, *internal; struct pid_namespace *ns = task_active_pid_ns(current); - struct bsd_acct_struct *acct, *old; + struct bsd_acct_struct *acct; + struct fs_pin *old; int err; acct = kzalloc(sizeof(struct bsd_acct_struct), GFP_KERNEL); @@ -252,18 +235,20 @@ static int acct_on(struct filename *pathname) file->f_path.mnt = internal; atomic_long_set(&acct->count, 1); - acct->pin.kill = acct_pin_kill; + init_fs_pin(&acct->pin, acct_pin_kill); acct->file = file; acct->needcheck = jiffies; acct->ns = ns; mutex_init(&acct->lock); + INIT_WORK(&acct->work, close_work); + init_completion(&acct->done); mutex_lock_nested(&acct->lock, 1); /* nobody has seen it yet */ pin_insert(&acct->pin, mnt); - old = acct_get(ns); - ns->bacct = acct; - acct_kill(old); + rcu_read_lock(); + old = xchg(&ns->bacct, 
&acct->pin); mutex_unlock(&acct->lock); + pin_kill(old); mnt_drop_write(mnt); mntput(mnt); return 0; @@ -299,7 +284,8 @@ SYSCALL_DEFINE1(acct, const char __user *, name) mutex_unlock(&acct_on_mutex); putname(tmp); } else { - acct_kill(acct_get(task_active_pid_ns(current))); + rcu_read_lock(); + pin_kill(task_active_pid_ns(current)->bacct); } return error; @@ -307,7 +293,8 @@ SYSCALL_DEFINE1(acct, const char __user *, name) void acct_exit_ns(struct pid_namespace *ns) { - acct_kill(acct_get(ns)); + rcu_read_lock(); + pin_kill(ns->bacct); } /* -- cgit v1.2.3 From edb0ec0725bb9797ded0deaf172a6795e43795c8 Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Sun, 25 Jan 2015 19:50:34 +0100 Subject: kexec, Kconfig: spell "architecture" properly Grepping for "archicture" showed it actually twice! Most unusual spelling error, very interesting. :) Signed-off-by: Borislav Petkov Signed-off-by: Jiri Kosina --- kernel/kexec.c | 2 +- lib/Kconfig.debug | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kexec.c b/kernel/kexec.c index 2abf9f6e9a61..b2ed6a3d99fb 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -2512,7 +2512,7 @@ static int kexec_apply_relocations(struct kimage *image) continue; /* - * Respective archicture needs to provide support for applying + * Respective architecture needs to provide support for applying * relocations of type SHT_RELA/SHT_REL. */ if (sechdrs[i].sh_type == SHT_RELA) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 4e35a5d767ed..cfe8ca6717c2 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -620,7 +620,7 @@ config DEBUG_STACKOVERFLOW depends on DEBUG_KERNEL && HAVE_DEBUG_STACKOVERFLOW ---help--- Say Y here if you want to check for overflows of kernel, IRQ - and exception stacks (if your archicture uses them). This + and exception stacks (if your architecture uses them). This option will show detailed messages if free stack space drops below a certain limit. -- cgit v1.2.3 From 69a1c994cc540cc84469acf9952f72b899b38e8b Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Tue, 27 Jan 2015 17:17:20 +0100 Subject: tracing: Remove newline from trace_printk warning banner Remove the output-confusing newline below: [ 0.191328] ********************************************************** [ 0.191493] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** [ 0.191586] ** ** ... Link: http://lkml.kernel.org/r/1422375440-31970-1-git-send-email-bp@alien8.de Signed-off-by: Borislav Petkov [ added an extra '\n' by itself, to keep what it was supposed to do ] Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index acd27555dc5b..f82e53b0e5a7 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2036,7 +2036,8 @@ void trace_printk_init_buffers(void) /* trace_printk() is for debug use only. Don't use it in production. */ - pr_warning("\n**********************************************************\n"); + pr_warning("\n"); + pr_warning("**********************************************************\n"); pr_warning("** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **\n"); pr_warning("** **\n"); pr_warning("** trace_printk() being used. Allocating extra memory.
**\n"); -- cgit v1.2.3 From 6ea22486ba46bcb665de36514094d74575cd1330 Mon Sep 17 00:00:00 2001 From: Dave Martin Date: Wed, 28 Jan 2015 12:48:53 +0000 Subject: tracing: Add array printing helper If a trace event contains an array, there is currently no standard way to format this for text output. Drivers are currently hacking around this by a) local hacks that use the trace_seq functionailty directly, or b) just not printing that information. For fixed size arrays, formatting of the elements can be open-coded, but this gets cumbersome for arrays of non-trivial size. These approaches result in non-standard content of the event format description delivered to userspace, so userland tools needs to be taught to understand and parse each array printing method individually. This patch implements a __print_array() helper that tracepoint implementations can use instead of reinventing it. A simple C-style syntax is used to delimit the array and its elements {like,this}. So that the helpers can be used with large static arrays as well as dynamic arrays, they take a pointer and element count: they can be used with __get_dynamic_array() for use with dynamic arrays. Link: http://lkml.kernel.org/r/1422449335-8289-2-git-send-email-javi.merino@arm.com Cc: Ingo Molnar Signed-off-by: Dave Martin Signed-off-by: Javi Merino Signed-off-by: Steven Rostedt --- include/linux/ftrace_event.h | 4 ++++ include/trace/ftrace.h | 9 +++++++++ kernel/trace/trace_output.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 57 insertions(+) (limited to 'kernel') diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 0bebb5c348b8..5aa4a9269547 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -44,6 +44,10 @@ const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr, const char *ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int len); +const char *ftrace_print_array_seq(struct trace_seq *p, + const void *buf, int buf_len, + size_t el_size); + struct trace_iterator; struct trace_event; diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 139b5067345b..304901fc5f34 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -263,6 +263,14 @@ #undef __print_hex #define __print_hex(buf, buf_len) ftrace_print_hex_seq(p, buf, buf_len) +#undef __print_array +#define __print_array(array, count, el_size) \ + ({ \ + BUILD_BUG_ON(el_size != 1 && el_size != 2 && \ + el_size != 4 && el_size != 8); \ + ftrace_print_array_seq(p, array, count, el_size); \ + }) + #undef DECLARE_EVENT_CLASS #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ static notrace enum print_line_t \ @@ -674,6 +682,7 @@ static inline void ftrace_test_probe_##call(void) \ #undef __get_dynamic_array_len #undef __get_str #undef __get_bitmask +#undef __print_array #undef TP_printk #define TP_printk(fmt, args...) 
"\"" fmt "\", " __stringify(args) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index b77b9a697619..692bf7184c8c 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -177,6 +177,50 @@ ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len) } EXPORT_SYMBOL(ftrace_print_hex_seq); +const char * +ftrace_print_array_seq(struct trace_seq *p, const void *buf, int buf_len, + size_t el_size) +{ + const char *ret = trace_seq_buffer_ptr(p); + const char *prefix = ""; + void *ptr = (void *)buf; + + trace_seq_putc(p, '{'); + + while (ptr < buf + buf_len) { + switch (el_size) { + case 1: + trace_seq_printf(p, "%s0x%x", prefix, + *(u8 *)ptr); + break; + case 2: + trace_seq_printf(p, "%s0x%x", prefix, + *(u16 *)ptr); + break; + case 4: + trace_seq_printf(p, "%s0x%x", prefix, + *(u32 *)ptr); + break; + case 8: + trace_seq_printf(p, "%s0x%llx", prefix, + *(u64 *)ptr); + break; + default: + trace_seq_printf(p, "BAD SIZE:%zu 0x%x", el_size, + *(u8 *)ptr); + el_size = 1; + } + prefix = ","; + ptr += el_size; + } + + trace_seq_putc(p, '}'); + trace_seq_putc(p, 0); + + return ret; +} +EXPORT_SYMBOL(ftrace_print_array_seq); + int ftrace_raw_output_prep(struct trace_iterator *iter, struct trace_event *trace_event) { -- cgit v1.2.3 From da194930ede6d87945786f72b1c36c7a4ed0f3a6 Mon Sep 17 00:00:00 2001 From: Tina Ruchandani Date: Wed, 28 Jan 2015 19:46:11 +0530 Subject: trace: Use 64-bit timekeeping The ring_buffer_producer uses 'struct timeval' to measure its start and end times. 'struct timeval' on 32-bit systems will have its tv_sec value overflow in year 2038 and beyond. This patch replaces struct timeval with 'ktime_t' which uses 64-bit representation for nanoseconds. Link: http://lkml.kernel.org/r/20150128141611.GA2701@tinar Suggested-by: Arnd Bergmann Suggested-by: Steven Rostedt Signed-off-by: Tina Ruchandani Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer_benchmark.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c index 3f9e328c30b5..13d945c0d03f 100644 --- a/kernel/trace/ring_buffer_benchmark.c +++ b/kernel/trace/ring_buffer_benchmark.c @@ -7,7 +7,7 @@ #include #include #include -#include +#include #include struct rb_page { @@ -17,7 +17,7 @@ struct rb_page { }; /* run time and sleep time in seconds */ -#define RUN_TIME 10 +#define RUN_TIME 10ULL #define SLEEP_TIME 10 /* number of events for writer to wake up the reader */ @@ -212,8 +212,7 @@ static void ring_buffer_consumer(void) static void ring_buffer_producer(void) { - struct timeval start_tv; - struct timeval end_tv; + ktime_t start_time, end_time, timeout; unsigned long long time; unsigned long long entries; unsigned long long overruns; @@ -227,7 +226,8 @@ static void ring_buffer_producer(void) * make the system stall) */ trace_printk("Starting ring buffer hammer\n"); - do_gettimeofday(&start_tv); + start_time = ktime_get(); + timeout = ktime_add_ns(start_time, RUN_TIME * NSEC_PER_SEC); do { struct ring_buffer_event *event; int *entry; @@ -244,7 +244,7 @@ static void ring_buffer_producer(void) ring_buffer_unlock_commit(buffer, event); } } - do_gettimeofday(&end_tv); + end_time = ktime_get(); cnt++; if (consumer && !(cnt % wakeup_interval)) @@ -264,7 +264,7 @@ static void ring_buffer_producer(void) cond_resched(); #endif - } while (end_tv.tv_sec < (start_tv.tv_sec + RUN_TIME) && !kill_test); + } while (ktime_before(end_time, timeout) 
&& !kill_test); trace_printk("End ring buffer hammer\n"); if (consumer) { @@ -280,9 +280,7 @@ static void ring_buffer_producer(void) wait_for_completion(&read_done); } - time = end_tv.tv_sec - start_tv.tv_sec; - time *= USEC_PER_SEC; - time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec); + time = ktime_us_delta(end_time, start_time); entries = ring_buffer_entries(buffer); overruns = ring_buffer_overruns(buffer); -- cgit v1.2.3 From c0a80c0c27e5e65b180a25e6c4c2f7ef9e386cd3 Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Fri, 9 Jan 2015 13:06:33 +0100 Subject: ftrace: allow architectures to specify ftrace compile options If the kernel is compiled with function tracer support the -pg compile option is passed to gcc to generate extra code into the prologue of each function. This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE makefile variable which architectures can override if a different option should be used for code generation. Acked-by: Steven Rostedt Signed-off-by: Heiko Carstens Signed-off-by: Martin Schwidefsky --- Makefile | 6 +++++- kernel/Makefile | 4 ++-- kernel/events/Makefile | 2 +- kernel/locking/Makefile | 8 ++++---- kernel/sched/Makefile | 2 +- kernel/trace/Makefile | 4 ++-- lib/Makefile | 2 +- scripts/Makefile.build | 5 +++-- 8 files changed, 19 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/Makefile b/Makefile index fd80c6e9bc23..11c6fe8f708f 100644 --- a/Makefile +++ b/Makefile @@ -724,10 +724,14 @@ KBUILD_CFLAGS += $(call cc-option, -femit-struct-debug-baseonly) \ endif ifdef CONFIG_FUNCTION_TRACER +ifndef CC_FLAGS_FTRACE +CC_FLAGS_FTRACE := -pg +endif +export CC_FLAGS_FTRACE ifdef CONFIG_HAVE_FENTRY CC_USING_FENTRY := $(call cc-option, -mfentry -DCC_USING_FENTRY) endif -KBUILD_CFLAGS += -pg $(CC_USING_FENTRY) +KBUILD_CFLAGS += $(CC_FLAGS_FTRACE) $(CC_USING_FENTRY) KBUILD_AFLAGS += $(CC_USING_FENTRY) ifdef CONFIG_DYNAMIC_FTRACE ifdef CONFIG_HAVE_C_RECORDMCOUNT diff --git a/kernel/Makefile b/kernel/Makefile index a59481a3fa6c..13af308f2460 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -13,8 +13,8 @@ obj-y = fork.o exec_domain.o panic.o \ ifdef CONFIG_FUNCTION_TRACER # Do not trace debug files and internal ftrace files -CFLAGS_REMOVE_cgroup-debug.o = -pg -CFLAGS_REMOVE_irq_work.o = -pg +CFLAGS_REMOVE_cgroup-debug.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_irq_work.o = $(CC_FLAGS_FTRACE) endif # cond_syscall is currently not LTO compatible diff --git a/kernel/events/Makefile b/kernel/events/Makefile index 103f5d147b2f..2925188f50ea 100644 --- a/kernel/events/Makefile +++ b/kernel/events/Makefile @@ -1,5 +1,5 @@ ifdef CONFIG_FUNCTION_TRACER -CFLAGS_REMOVE_core.o = -pg +CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE) endif obj-y := core.o ring_buffer.o callchain.o diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile index 8541bfdfd232..4caca3f7af53 100644 --- a/kernel/locking/Makefile +++ b/kernel/locking/Makefile @@ -2,10 +2,10 @@ obj-y += mutex.o semaphore.o rwsem.o mcs_spinlock.o ifdef CONFIG_FUNCTION_TRACER -CFLAGS_REMOVE_lockdep.o = -pg -CFLAGS_REMOVE_lockdep_proc.o = -pg -CFLAGS_REMOVE_mutex-debug.o = -pg -CFLAGS_REMOVE_rtmutex-debug.o = -pg +CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_lockdep_proc.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_mutex-debug.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_rtmutex-debug.o = $(CC_FLAGS_FTRACE) endif obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index ab32b7b0db5c..46be87024875 100644 --- 
a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -1,5 +1,5 @@ ifdef CONFIG_FUNCTION_TRACER -CFLAGS_REMOVE_clock.o = -pg +CFLAGS_REMOVE_clock.o = $(CC_FLAGS_FTRACE) endif ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y) diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 67d6369ddf83..9a7e0e60a1a8 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -3,11 +3,11 @@ ifdef CONFIG_FUNCTION_TRACER ORIG_CFLAGS := $(KBUILD_CFLAGS) -KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS)) +KBUILD_CFLAGS = $(subst $(CC_FLAGS_FTRACE),,$(ORIG_CFLAGS)) ifdef CONFIG_FTRACE_SELFTEST # selftest needs instrumentation -CFLAGS_trace_selftest_dynamic.o = -pg +CFLAGS_trace_selftest_dynamic.o = $(CC_FLAGS_FTRACE) obj-y += trace_selftest_dynamic.o endif endif diff --git a/lib/Makefile b/lib/Makefile index 3c3b30b9e020..f3f73e50519a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -4,7 +4,7 @@ ifdef CONFIG_FUNCTION_TRACER ORIG_CFLAGS := $(KBUILD_CFLAGS) -KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS)) +KBUILD_CFLAGS = $(subst $(CC_FLAGS_FTRACE),,$(ORIG_CFLAGS)) endif lib-y := ctype.o string.o vsprintf.o cmdline.o \ diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 649ce6844033..01df30af4d4a 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -234,8 +234,9 @@ sub_cmd_record_mcount = set -e ; perl $(srctree)/scripts/recordmcount.pl "$(ARCH "$(if $(part-of-module),1,0)" "$(@)"; recordmcount_source := $(srctree)/scripts/recordmcount.pl endif -cmd_record_mcount = \ - if [ "$(findstring -pg,$(_c_flags))" = "-pg" ]; then \ +cmd_record_mcount = \ + if [ "$(findstring $(CC_FLAGS_FTRACE),$(_c_flags))" = \ + "$(CC_FLAGS_FTRACE)" ]; then \ $(sub_cmd_record_mcount) \ fi; endif -- cgit v1.2.3 From c9257f78b4f785d200737d60453fff51b57a2356 Mon Sep 17 00:00:00 2001 From: Todd E Brandt Date: Fri, 12 Dec 2014 20:06:46 -0800 Subject: PM / sleep: export suspend_resume trace event Export the suspend_resume tracepoint so it can be used in loadable modules. Signed-off-by: Todd Brandt Signed-off-by: Rafael J. Wysocki --- kernel/trace/power-traces.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c index 1c71382b283d..eb4220a132ec 100644 --- a/kernel/trace/power-traces.c +++ b/kernel/trace/power-traces.c @@ -13,5 +13,6 @@ #define CREATE_TRACE_POINTS #include +EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume); EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle); -- cgit v1.2.3 From fd979c0132074856975a6e79bc2226b99435ec5b Mon Sep 17 00:00:00 2001 From: Cody P Schafer Date: Fri, 30 Jan 2015 13:45:57 -0800 Subject: perf: provide sysfs_show for struct perf_pmu_events_attr (struct perf_pmu_events_attr) is defined in include/linux/perf_event.h, but the only "show" for it is in x86 and contains x86 specific stuff. Make a generic one for those of us who are just using the event_str. 
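A minimal sketch of the intended usage (the event name and the encoding string are hypothetical):

	#include <linux/perf_event.h>
	#include <linux/sysfs.h>

	/*
	 * Hypothetical PMU event attribute: with perf_event_sysfs_show() as
	 * the show routine, reading the sysfs file simply prints event_str.
	 */
	static struct perf_pmu_events_attr example_event_attr = {
		.attr      = __ATTR(example-cycles, 0444,
				    perf_event_sysfs_show, NULL),
		.event_str = "event=0x11",
	};

Such an attribute would then be listed in the PMU's "events" attribute group like any other event.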
Signed-off-by: Cody P Schafer Signed-off-by: Sukadev Bhattiprolu Acked-by: Jiri Olsa Signed-off-by: Michael Ellerman --- include/linux/perf_event.h | 3 +++ kernel/events/core.c | 12 ++++++++++++ 2 files changed, 15 insertions(+) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 486e84ccb1f9..58f59bdb590b 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -897,6 +897,9 @@ struct perf_pmu_events_attr { const char *event_str; }; +ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr, + char *page); + #define PMU_EVENT_ATTR(_name, _var, _id, _show) \ static struct perf_pmu_events_attr _var = { \ .attr = __ATTR(_name, 0444, _show, NULL), \ diff --git a/kernel/events/core.c b/kernel/events/core.c index 4c1ee7f2bebc..934687f8d51b 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -8276,6 +8276,18 @@ void __init perf_event_init(void) != 1024); } +ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr, + char *page) +{ + struct perf_pmu_events_attr *pmu_attr = + container_of(attr, struct perf_pmu_events_attr, attr); + + if (pmu_attr->event_str) + return sprintf(page, "%s\n", pmu_attr->event_str); + + return 0; +} + static int __init perf_event_sysfs_init(void) { struct pmu *pmu; -- cgit v1.2.3 From c602894814adc93589dde028182101818c5f938b Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 26 Jan 2015 20:38:39 -0500 Subject: tracing: Make tracing_init_dentry_tr() static tracing_init_dentry_tr() is not used outside of trace.c, so it should be static. Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 2 +- kernel/trace/trace.h | 1 - 2 files changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index f82e53b0e5a7..2668a0d742ee 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5815,7 +5815,7 @@ static __init int register_snapshot_cmd(void) static inline __init int register_snapshot_cmd(void) { return 0; } #endif /* defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE) */ -struct dentry *tracing_init_dentry_tr(struct trace_array *tr) +static struct dentry *tracing_init_dentry_tr(struct trace_array *tr) { if (tr->dir) return tr->dir; diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 0eddfeb05fee..dd8205a35760 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -542,7 +542,6 @@ struct dentry *trace_create_file(const char *name, void *data, const struct file_operations *fops); -struct dentry *tracing_init_dentry_tr(struct trace_array *tr); struct dentry *tracing_init_dentry(void); struct ring_buffer_event; -- cgit v1.2.3 From 7eeafbcab47fe9860e5106286db83d60e3f35644 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Mon, 26 Jan 2015 21:00:48 -0500 Subject: tracing: Separate out initializing top level dir from instances The top level trace array is treated a little differently from the instances, as it has to deal with more of the general tracing. The tr->dir is the tracing directory, which is an immutable dentry, whereas the tr->dir of an instance is the dentry that was created, and can be destroyed later. These should have different functions accessing them. As only tracing_init_dentry() deals with the top level array, fold the code for it into that function, and remove the trace_init_dentry_tr() that was also used by the instances to get their directory dentry.
Add a tracing_get_dentry() to just get the tracing dir entry for instances as well as the top level array. Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 51 ++++++++++++++++++++++++++++++--------------------- 1 file changed, 30 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2668a0d742ee..5afce60e1b68 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5815,28 +5815,11 @@ static __init int register_snapshot_cmd(void) static inline __init int register_snapshot_cmd(void) { return 0; } #endif /* defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE) */ -static struct dentry *tracing_init_dentry_tr(struct trace_array *tr) +static struct dentry *tracing_get_dentry(struct trace_array *tr) { - if (tr->dir) - return tr->dir; - - if (!debugfs_initialized()) - return ERR_PTR(-ENODEV); - - if (tr->flags & TRACE_ARRAY_FL_GLOBAL) - tr->dir = debugfs_create_dir("tracing", NULL); - - if (!tr->dir) - pr_warn_once("Could not create debugfs directory 'tracing'\n"); - return tr->dir; } -struct dentry *tracing_init_dentry(void) -{ - return tracing_init_dentry_tr(&global_trace); -} - static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu) { struct dentry *d_tracer; @@ -5844,7 +5827,7 @@ static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu) if (tr->percpu_dir) return tr->percpu_dir; - d_tracer = tracing_init_dentry_tr(tr); + d_tracer = tracing_get_dentry(tr); if (IS_ERR(d_tracer)) return NULL; @@ -6047,7 +6030,7 @@ static struct dentry *trace_options_init_dentry(struct trace_array *tr) if (tr->options) return tr->options; - d_tracer = tracing_init_dentry_tr(tr); + d_tracer = tracing_get_dentry(tr); if (IS_ERR(d_tracer)) return NULL; @@ -6532,6 +6515,33 @@ init_tracer_debugfs(struct trace_array *tr, struct dentry *d_tracer) } +/** + * tracing_init_dentry - initialize top level trace array + * + * This is called when creating files or directories in the tracing + * directory. It is called via fs_initcall() by any of the boot up code + * and expects to return the dentry of the top level tracing directory. + */ +struct dentry *tracing_init_dentry(void) +{ + struct trace_array *tr = &global_trace; + + if (tr->dir) + return tr->dir; + + if (WARN_ON(!debugfs_initialized())) + return ERR_PTR(-ENODEV); + + tr->dir = debugfs_create_dir("tracing", NULL); + + if (!tr->dir) { + pr_warn_once("Could not create debugfs directory 'tracing'\n"); + return ERR_PTR(-ENOMEM); + } + + return tr->dir; +} + static __init int tracer_init_debugfs(void) { struct dentry *d_tracer; @@ -6772,7 +6782,6 @@ __init static int tracer_alloc_buffers(void) int ring_buf_size; int ret = -ENOMEM; - if (!alloc_cpumask_var(&tracing_buffer_mask, GFP_KERNEL)) goto out; -- cgit v1.2.3 From a64fc82c4f1598db024de7b3d79e7213379769ff Mon Sep 17 00:00:00 2001 From: Wonhong Kwon Date: Tue, 3 Feb 2015 17:22:00 +0900 Subject: PM / hibernate: exclude freed pages from allocated pages printout hibernate_preallocate_memory() prints out how many pages are allocated, but it doesn't take into consideration the pages freed by free_unnecessary_pages(). Therefore, it always shows a count higher than the number actually allocated. Signed-off-by: Wonhong Kwon [ rjw: Subject ] Signed-off-by: Rafael J.
Wysocki --- kernel/power/snapshot.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 8e75a5a1e9b4..c24d5a23bf93 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1472,9 +1472,9 @@ static inline unsigned long preallocate_highmem_fraction(unsigned long nr_pages, /** * free_unnecessary_pages - Release preallocated pages not needed for the image */ -static void free_unnecessary_pages(void) +static unsigned long free_unnecessary_pages(void) { - unsigned long save, to_free_normal, to_free_highmem; + unsigned long save, to_free_normal, to_free_highmem, free; save = count_data_pages(); if (alloc_normal >= save) { @@ -1495,6 +1495,7 @@ static void free_unnecessary_pages(void) else to_free_normal = 0; } + free = to_free_normal + to_free_highmem; memory_bm_position_reset(©_bm); @@ -1518,6 +1519,8 @@ static void free_unnecessary_pages(void) swsusp_unset_page_free(page); __free_page(page); } + + return free; } /** @@ -1707,7 +1710,7 @@ int hibernate_preallocate_memory(void) * pages in memory, but we have allocated more. Release the excessive * ones now. */ - free_unnecessary_pages(); + pages -= free_unnecessary_pages(); out: stop = ktime_get(); -- cgit v1.2.3 From 12cf89b550d13eb7cb86ef182bd6c04345a33a1f Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Tue, 3 Feb 2015 16:45:18 -0600 Subject: livepatch: rename config to CONFIG_LIVEPATCH Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of the config and the code more consistent. Signed-off-by: Josh Poimboeuf Reviewed-by: Jingoo Han Signed-off-by: Jiri Kosina --- arch/x86/Kconfig | 2 +- arch/x86/include/asm/livepatch.h | 4 ++-- arch/x86/kernel/Makefile | 2 +- include/linux/livepatch.h | 4 ++-- kernel/livepatch/Kconfig | 6 +++--- kernel/livepatch/Makefile | 2 +- samples/Kconfig | 4 ++-- samples/livepatch/Makefile | 2 +- 8 files changed, 13 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 29b095231276..11970b076862 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -17,7 +17,7 @@ config X86_64 depends on 64BIT select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF - select HAVE_LIVE_PATCHING + select HAVE_LIVEPATCH ### Arch settings config X86 diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h index 26e58134c8cb..a455a53d789a 100644 --- a/arch/x86/include/asm/livepatch.h +++ b/arch/x86/include/asm/livepatch.h @@ -24,7 +24,7 @@ #include #include -#ifdef CONFIG_LIVE_PATCHING +#ifdef CONFIG_LIVEPATCH static inline int klp_check_compiler_support(void) { #ifndef CC_USING_FENTRY @@ -40,7 +40,7 @@ static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip) regs->ip = ip; } #else -#error Live patching support is disabled; check CONFIG_LIVE_PATCHING +#error Live patching support is disabled; check CONFIG_LIVEPATCH #endif #endif /* _ASM_X86_LIVEPATCH_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 316b34e74c15..732223496968 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -63,7 +63,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o obj-y += apic/ obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o -obj-$(CONFIG_LIVE_PATCHING) += livepatch.o +obj-$(CONFIG_LIVEPATCH) += livepatch.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o obj-$(CONFIG_X86_TSC) += trace_clock.o diff --git 
a/include/linux/livepatch.h b/include/linux/livepatch.h index f14c6fb262b4..95023fd8b00d 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -24,7 +24,7 @@ #include #include -#if IS_ENABLED(CONFIG_LIVE_PATCHING) +#if IS_ENABLED(CONFIG_LIVEPATCH) #include @@ -128,6 +128,6 @@ extern int klp_unregister_patch(struct klp_patch *); extern int klp_enable_patch(struct klp_patch *); extern int klp_disable_patch(struct klp_patch *); -#endif /* CONFIG_LIVE_PATCHING */ +#endif /* CONFIG_LIVEPATCH */ #endif /* _LINUX_LIVEPATCH_H_ */ diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig index 347ee2221137..045022557936 100644 --- a/kernel/livepatch/Kconfig +++ b/kernel/livepatch/Kconfig @@ -1,15 +1,15 @@ -config HAVE_LIVE_PATCHING +config HAVE_LIVEPATCH bool help Arch supports kernel live patching -config LIVE_PATCHING +config LIVEPATCH bool "Kernel Live Patching" depends on DYNAMIC_FTRACE_WITH_REGS depends on MODULES depends on SYSFS depends on KALLSYMS_ALL - depends on HAVE_LIVE_PATCHING + depends on HAVE_LIVEPATCH help Say Y here if you want to support kernel live patching. This option has no runtime impact until a kernel "patch" diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile index 7c1f00861428..e8780c0901d9 100644 --- a/kernel/livepatch/Makefile +++ b/kernel/livepatch/Makefile @@ -1,3 +1,3 @@ -obj-$(CONFIG_LIVE_PATCHING) += livepatch.o +obj-$(CONFIG_LIVEPATCH) += livepatch.o livepatch-objs := core.o diff --git a/samples/Kconfig b/samples/Kconfig index 0aed20df5f0b..224ebb46bed5 100644 --- a/samples/Kconfig +++ b/samples/Kconfig @@ -63,9 +63,9 @@ config SAMPLE_RPMSG_CLIENT to communicate with an AMP-configured remote processor over the rpmsg bus. -config SAMPLE_LIVE_PATCHING +config SAMPLE_LIVEPATCH tristate "Build live patching sample -- loadable modules only" - depends on LIVE_PATCHING && m + depends on LIVEPATCH && m help Builds a sample live patch that replaces the procfs handler for /proc/cmdline to print "this has been live patched". diff --git a/samples/livepatch/Makefile b/samples/livepatch/Makefile index 7f1cdc131a02..10319d7ea0b1 100644 --- a/samples/livepatch/Makefile +++ b/samples/livepatch/Makefile @@ -1 +1 @@ -obj-$(CONFIG_SAMPLE_LIVE_PATCHING) += livepatch-sample.o +obj-$(CONFIG_SAMPLE_LIVEPATCH) += livepatch-sample.o -- cgit v1.2.3 From 1e0fb9ec679c9273a641f1d6f3d25ea47baef2bb Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Fri, 24 Oct 2014 15:58:10 -0700 Subject: perf: Add pmu callbacks to track event mapping and unmapping Signed-off-by: Andy Lutomirski Signed-off-by: Peter Zijlstra (Intel) Cc: Paul Mackerras Cc: Arnaldo Carvalho de Melo Cc: Kees Cook Cc: Andrea Arcangeli Cc: Vince Weaver Cc: "hillf.zj" Cc: Valdis Kletnieks Cc: Linus Torvalds Link: http://lkml.kernel.org/r/266afcba1d1f91ea5501e4e16e94bbbc1a9339b6.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar --- include/linux/perf_event.h | 7 +++++++ kernel/events/core.c | 9 +++++++++ 2 files changed, 16 insertions(+) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 5cad0e6f3552..33262004c310 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -202,6 +202,13 @@ struct pmu { */ int (*event_init) (struct perf_event *event); + /* + * Notification that the event was mapped or unmapped. Called + * in the context of the mapping task. 
+ */ + void (*event_mapped) (struct perf_event *event); /*optional*/ + void (*event_unmapped) (struct perf_event *event); /*optional*/ + #define PERF_EF_START 0x01 /* start the counter when adding */ #define PERF_EF_RELOAD 0x02 /* reload the counter when starting */ #define PERF_EF_UPDATE 0x04 /* update the counter when stopping */ diff --git a/kernel/events/core.c b/kernel/events/core.c index 7f2fbb8b5069..cc1487145d33 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4293,6 +4293,9 @@ static void perf_mmap_open(struct vm_area_struct *vma) atomic_inc(&event->mmap_count); atomic_inc(&event->rb->mmap_count); + + if (event->pmu->event_mapped) + event->pmu->event_mapped(event); } /* @@ -4312,6 +4315,9 @@ static void perf_mmap_close(struct vm_area_struct *vma) int mmap_locked = rb->mmap_locked; unsigned long size = perf_data_size(rb); + if (event->pmu->event_unmapped) + event->pmu->event_unmapped(event); + atomic_dec(&rb->mmap_count); if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) @@ -4513,6 +4519,9 @@ unlock: vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP; vma->vm_ops = &perf_mmap_vmops; + if (event->pmu->event_mapped) + event->pmu->event_mapped(event); + return ret; } -- cgit v1.2.3 From c1317ec2b906442930318d9d6e51425c5a69e9cb Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Fri, 24 Oct 2014 15:58:11 -0700 Subject: perf: Pass the event to arch_perf_update_userpage() Signed-off-by: Andy Lutomirski Signed-off-by: Peter Zijlstra (Intel) Cc: Paul Mackerras Cc: Arnaldo Carvalho de Melo Cc: Kees Cook Cc: Andrea Arcangeli Cc: Vince Weaver Cc: "hillf.zj" Cc: Valdis Kletnieks Cc: Linus Torvalds Link: http://lkml.kernel.org/r/0fea9a7fac3c1eea86cb0a5954184e74f4213666.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/perf_event.c | 3 ++- kernel/events/core.c | 5 +++-- 2 files changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 6b5acd5f4a34..73e84a348de1 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -1915,7 +1915,8 @@ static struct pmu pmu = { .flush_branch_stack = x86_pmu_flush_branch_stack, }; -void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now) +void arch_perf_update_userpage(struct perf_event *event, + struct perf_event_mmap_page *userpg, u64 now) { struct cyc2ns_data *data; diff --git a/kernel/events/core.c b/kernel/events/core.c index cc1487145d33..13209a90b751 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4101,7 +4101,8 @@ unlock: rcu_read_unlock(); } -void __weak arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now) +void __weak arch_perf_update_userpage( + struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now) { } @@ -4151,7 +4152,7 @@ void perf_event_update_userpage(struct perf_event *event) userpg->time_running = running + atomic64_read(&event->child_total_time_running); - arch_perf_update_userpage(userpg, now); + arch_perf_update_userpage(event, userpg, now); barrier(); ++userpg->lock; -- cgit v1.2.3 From 90e97820619dc912b52cc9d103272819d8b51259 Mon Sep 17 00:00:00 2001 From: Jiang Liu Date: Thu, 5 Feb 2015 13:44:43 +0800 Subject: resources: Move struct resource_list_entry from ACPI into resource core Currently ACPI, PCI and pnp all implement the same resource list management with different data structure. 
We need to transfer from one data structure into another when passing resources from one subsystem into another subsystem. So move struct resource_list_entry from ACPI into resource core and rename it as resource_entry, then it could be reused by different subystems and avoid the data structure conversion. Introduce dedicated header file resource_ext.h instead of embedding it into ioport.h to avoid header file inclusion order issues. Signed-off-by: Jiang Liu Acked-by: Vinod Koul Signed-off-by: Rafael J. Wysocki --- drivers/acpi/acpi_lpss.c | 8 ++--- drivers/acpi/acpi_platform.c | 4 +-- drivers/acpi/resource.c | 17 ++++------ drivers/dma/acpi-dma.c | 10 +++--- include/linux/acpi.h | 12 +------ include/linux/resource_ext.h | 77 ++++++++++++++++++++++++++++++++++++++++++++ kernel/resource.c | 25 ++++++++++++++ 7 files changed, 120 insertions(+), 33 deletions(-) create mode 100644 include/linux/resource_ext.h (limited to 'kernel') diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index 4f3febf8a589..dfd1b8095dad 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -313,7 +313,7 @@ static int acpi_lpss_create_device(struct acpi_device *adev, { struct lpss_device_desc *dev_desc; struct lpss_private_data *pdata; - struct resource_list_entry *rentry; + struct resource_entry *rentry; struct list_head resource_list; struct platform_device *pdev; int ret; @@ -333,12 +333,12 @@ static int acpi_lpss_create_device(struct acpi_device *adev, goto err_out; list_for_each_entry(rentry, &resource_list, node) - if (resource_type(&rentry->res) == IORESOURCE_MEM) { + if (resource_type(rentry->res) == IORESOURCE_MEM) { if (dev_desc->prv_size_override) pdata->mmio_size = dev_desc->prv_size_override; else - pdata->mmio_size = resource_size(&rentry->res); - pdata->mmio_base = ioremap(rentry->res.start, + pdata->mmio_size = resource_size(rentry->res); + pdata->mmio_base = ioremap(rentry->res->start, pdata->mmio_size); break; } diff --git a/drivers/acpi/acpi_platform.c b/drivers/acpi/acpi_platform.c index 6ba8beb6b9d2..1284138e42ab 100644 --- a/drivers/acpi/acpi_platform.c +++ b/drivers/acpi/acpi_platform.c @@ -45,7 +45,7 @@ struct platform_device *acpi_create_platform_device(struct acpi_device *adev) struct platform_device *pdev = NULL; struct acpi_device *acpi_parent; struct platform_device_info pdevinfo; - struct resource_list_entry *rentry; + struct resource_entry *rentry; struct list_head resource_list; struct resource *resources = NULL; int count; @@ -71,7 +71,7 @@ struct platform_device *acpi_create_platform_device(struct acpi_device *adev) } count = 0; list_for_each_entry(rentry, &resource_list, node) - resources[count++] = rentry->res; + resources[count++] = *rentry->res; acpi_dev_free_resource_list(&resource_list); } diff --git a/drivers/acpi/resource.c b/drivers/acpi/resource.c index 3ea0d17eb951..4752b9939987 100644 --- a/drivers/acpi/resource.c +++ b/drivers/acpi/resource.c @@ -444,12 +444,7 @@ EXPORT_SYMBOL_GPL(acpi_dev_resource_interrupt); */ void acpi_dev_free_resource_list(struct list_head *list) { - struct resource_list_entry *rentry, *re; - - list_for_each_entry_safe(rentry, re, list, node) { - list_del(&rentry->node); - kfree(rentry); - } + resource_list_free(list); } EXPORT_SYMBOL_GPL(acpi_dev_free_resource_list); @@ -464,16 +459,16 @@ struct res_proc_context { static acpi_status acpi_dev_new_resource_entry(struct resource_win *win, struct res_proc_context *c) { - struct resource_list_entry *rentry; + struct resource_entry *rentry; - rentry = 
kmalloc(sizeof(*rentry), GFP_KERNEL); + rentry = resource_list_create_entry(NULL, 0); if (!rentry) { c->error = -ENOMEM; return AE_NO_MEMORY; } - rentry->res = win->res; + *rentry->res = win->res; rentry->offset = win->offset; - list_add_tail(&rentry->node, c->list); + resource_list_add_tail(rentry, c->list); c->count++; return AE_OK; } @@ -534,7 +529,7 @@ static acpi_status acpi_dev_process_resource(struct acpi_resource *ares, * returned as the final error code. * * The resultant struct resource objects are put on the list pointed to by - * @list, that must be empty initially, as members of struct resource_list_entry + * @list, that must be empty initially, as members of struct resource_entry * objects. Callers of this routine should use %acpi_dev_free_resource_list() to * free that list. * diff --git a/drivers/dma/acpi-dma.c b/drivers/dma/acpi-dma.c index de361a156b34..5a635646e05c 100644 --- a/drivers/dma/acpi-dma.c +++ b/drivers/dma/acpi-dma.c @@ -43,7 +43,7 @@ static int acpi_dma_parse_resource_group(const struct acpi_csrt_group *grp, { const struct acpi_csrt_shared_info *si; struct list_head resource_list; - struct resource_list_entry *rentry; + struct resource_entry *rentry; resource_size_t mem = 0, irq = 0; int ret; @@ -56,10 +56,10 @@ static int acpi_dma_parse_resource_group(const struct acpi_csrt_group *grp, return 0; list_for_each_entry(rentry, &resource_list, node) { - if (resource_type(&rentry->res) == IORESOURCE_MEM) - mem = rentry->res.start; - else if (resource_type(&rentry->res) == IORESOURCE_IRQ) - irq = rentry->res.start; + if (resource_type(rentry->res) == IORESOURCE_MEM) + mem = rentry->res->start; + else if (resource_type(rentry->res) == IORESOURCE_IRQ) + irq = rentry->res->start; } acpi_dev_free_resource_list(&resource_list); diff --git a/include/linux/acpi.h b/include/linux/acpi.h index e818decb631f..e53822148b6a 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -27,6 +27,7 @@ #include #include /* for struct resource */ +#include #include #include @@ -285,11 +286,6 @@ extern int pnpacpi_disabled; #define PXM_INVAL (-1) -struct resource_win { - struct resource res; - resource_size_t offset; -}; - bool acpi_dev_resource_memory(struct acpi_resource *ares, struct resource *res); bool acpi_dev_resource_io(struct acpi_resource *ares, struct resource *res); bool acpi_dev_resource_address_space(struct acpi_resource *ares, @@ -300,12 +296,6 @@ unsigned long acpi_dev_irq_flags(u8 triggering, u8 polarity, u8 shareable); bool acpi_dev_resource_interrupt(struct acpi_resource *ares, int index, struct resource *res); -struct resource_list_entry { - struct list_head node; - struct resource res; - resource_size_t offset; -}; - void acpi_dev_free_resource_list(struct list_head *list); int acpi_dev_get_resources(struct acpi_device *adev, struct list_head *list, int (*preproc)(struct acpi_resource *, void *), diff --git a/include/linux/resource_ext.h b/include/linux/resource_ext.h new file mode 100644 index 000000000000..e2bf63d881d4 --- /dev/null +++ b/include/linux/resource_ext.h @@ -0,0 +1,77 @@ +/* + * Copyright (C) 2015, Intel Corporation + * Author: Jiang Liu + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for + * more details. + */ +#ifndef _LINUX_RESOURCE_EXT_H +#define _LINUX_RESOURCE_EXT_H +#include +#include +#include +#include + +/* Represent resource window for bridge devices */ +struct resource_win { + struct resource res; /* In master (CPU) address space */ + resource_size_t offset; /* Translation offset for bridge */ +}; + +/* + * Common resource list management data structure and interfaces to support + * ACPI, PNP and PCI host bridge etc. + */ +struct resource_entry { + struct list_head node; + struct resource *res; /* In master (CPU) address space */ + resource_size_t offset; /* Translation offset for bridge */ + struct resource __res; /* Default storage for res */ +}; + +extern struct resource_entry * +resource_list_create_entry(struct resource *res, size_t extra_size); +extern void resource_list_free(struct list_head *head); + +static inline void resource_list_add(struct resource_entry *entry, + struct list_head *head) +{ + list_add(&entry->node, head); +} + +static inline void resource_list_add_tail(struct resource_entry *entry, + struct list_head *head) +{ + list_add_tail(&entry->node, head); +} + +static inline void resource_list_del(struct resource_entry *entry) +{ + list_del(&entry->node); +} + +static inline void resource_list_free_entry(struct resource_entry *entry) +{ + kfree(entry); +} + +static inline void +resource_list_destroy_entry(struct resource_entry *entry) +{ + resource_list_del(entry); + resource_list_free_entry(entry); +} + +#define resource_list_for_each_entry(entry, list) \ + list_for_each_entry((entry), (list), node) + +#define resource_list_for_each_entry_safe(entry, tmp, list) \ + list_for_each_entry_safe((entry), (tmp), (list), node) + +#endif /* _LINUX_RESOURCE_EXT_H */ diff --git a/kernel/resource.c b/kernel/resource.c index 0bcebffc4e77..19f2357dfda3 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -22,6 +22,7 @@ #include #include #include +#include #include @@ -1529,6 +1530,30 @@ int iomem_is_exclusive(u64 addr) return err; } +struct resource_entry *resource_list_create_entry(struct resource *res, + size_t extra_size) +{ + struct resource_entry *entry; + + entry = kzalloc(sizeof(*entry) + extra_size, GFP_KERNEL); + if (entry) { + INIT_LIST_HEAD(&entry->node); + entry->res = res ? res : &entry->__res; + } + + return entry; +} +EXPORT_SYMBOL(resource_list_create_entry); + +void resource_list_free(struct list_head *head) +{ + struct resource_entry *entry, *tmp; + + list_for_each_entry_safe(entry, tmp, head, node) + resource_list_destroy_entry(entry); +} +EXPORT_SYMBOL(resource_list_free); + static int __init strict_iomem(char *str) { if (strstr(str, "relaxed")) -- cgit v1.2.3 From de96d79f343321d26ff920af25fcefe6895ca544 Mon Sep 17 00:00:00 2001 From: Andrey Tsyvarev Date: Fri, 6 Feb 2015 15:09:57 +1030 Subject: kernel/module.c: Free lock-classes if parse_args failed parse_args call module parameters' .set handlers, which may use locks defined in the module. So, these classes should be freed in case parse_args returns error(e.g. due to incorrect parameter passed). 
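To make the hazard concrete, consider a hypothetical module (illustration only, not taken from this patch) whose parameter .set handler takes a mutex defined in module memory. If a later parameter fails to parse, load_module() bails out after this handler has already registered a lock class with lockdep:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(example_lock);      /* lockdep key lives in module memory */
static int example_value;

static int example_set(const char *val, const struct kernel_param *kp)
{
        int ret;

        mutex_lock(&example_lock);      /* registers the module's lock class */
        ret = param_set_int(val, kp);
        mutex_unlock(&example_lock);
        return ret;
}

static const struct kernel_param_ops example_ops = {
        .set    = example_set,
        .get    = param_get_int,
};
module_param_cb(example_param, &example_ops, &example_value, 0644);

Without the lockdep_free_key_range() call added below, unloading the failed module would leave a stale lock class pointing into freed memory.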
Signed-off-by: Andrey Tsyvarev Signed-off-by: Rusty Russell --- kernel/module.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index d856e96a3cce..441ed3fc9c89 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3356,6 +3356,9 @@ static int load_module(struct load_info *info, const char __user *uargs, module_bug_cleanup(mod); mutex_unlock(&module_mutex); + /* Free lock-classes: */ + lockdep_free_key_range(mod->module_core, mod->core_size); + /* we can't deallocate the module until we clear memory protection */ unset_module_init_ro_nx(mod); unset_module_core_ro_nx(mod); -- cgit v1.2.3 From ab92ebbb8e10d402f4fe73c6b3d85be72614f1fa Mon Sep 17 00:00:00 2001 From: Marcel Holtmann Date: Fri, 6 Feb 2015 15:09:57 +1030 Subject: module: Remove double spaces in module verification taint message The warning message when loading modules with a wrong signature has two spaces in it: "module verification failed: signature and/or required key missing" Signed-off-by: Marcel Holtmann Signed-off-by: Rusty Russell --- kernel/module.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 441ed3fc9c89..2461370813b3 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3265,7 +3265,7 @@ static int load_module(struct load_info *info, const char __user *uargs, mod->sig_ok = info->sig_ok; if (!mod->sig_ok) { pr_notice_once("%s: module verification failed: signature " - "and/or required key missing - tainting " + "and/or required key missing - tainting " "kernel\n", mod->name); add_taint_module(mod, TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK); } -- cgit v1.2.3 From f638f4dc0880d515c807a67b8210885a4a4f18bb Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Fri, 6 Feb 2015 10:36:32 -0600 Subject: livepatch: add missing newline to error message Signed-off-by: Josh Poimboeuf Signed-off-by: Jiri Kosina --- kernel/livepatch/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 9adf86bfbcd5..ff7f47d026ac 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -211,7 +211,7 @@ static int klp_verify_vmlinux_symbol(const char *name, unsigned long addr) if (kallsyms_on_each_symbol(klp_verify_callback, &args)) return 0; - pr_err("symbol '%s' not found at specified address 0x%016lx, kernel mismatch?", + pr_err("symbol '%s' not found at specified address 0x%016lx, kernel mismatch?\n", name, addr); return -EINVAL; } -- cgit v1.2.3 From 4fe7ffb7e17ca6ad9173b8de35f260c9c8fc2f79 Mon Sep 17 00:00:00 2001 From: Jesse Brandeburg Date: Wed, 28 Jan 2015 10:57:39 -0800 Subject: genirq: Fix null pointer reference in irq_set_affinity_hint() The recent set_affinity commit by me introduced some null pointer dereferences on driver unload, because some drivers call this function with a NULL argument. This fixes the issue by just checking for null before setting the affinity mask. 
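The crash pattern, sketched as a hypothetical driver teardown path (the function name here is made up; only the irq_set_affinity_hint() call reflects real driver usage):

#include <linux/interrupt.h>

static void example_free_vector(unsigned int irq, void *dev_data)
{
        /* Drivers clear the hint by passing a NULL mask. */
        irq_set_affinity_hint(irq, NULL);
        free_irq(irq, dev_data);
}

With the fix, the NULL still clears desc->affinity_hint but is no longer forwarded to __irq_set_affinity().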
Fixes: e2e64a932556 ("genirq: Set initial affinity in irq_set_affinity_hint()") Reported-by: Yinghai Lu Signed-off-by: Jesse Brandeburg CC: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20150128185739.9689.84588.stgit@jbrandeb-cp2.jf.intel.com Signed-off-by: Ingo Molnar --- kernel/irq/manage.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index f038e586a4b9..196a06fbc122 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -244,7 +244,8 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m) desc->affinity_hint = m; irq_put_desc_unlock(desc, flags); /* set the initial affinity to prevent every interrupt being on CPU0 */ - __irq_set_affinity(irq, m, false); + if (m) + __irq_set_affinity(irq, m, false); return 0; } EXPORT_SYMBOL_GPL(irq_set_affinity_hint); -- cgit v1.2.3 From 7215853e985a4bef1a6c14e00e89dfec84f1e457 Mon Sep 17 00:00:00 2001 From: Vikram Mulukutla Date: Wed, 17 Dec 2014 18:50:56 -0800 Subject: tracing: Fix unmapping loop in tracing_mark_write Commit 6edb2a8a385f0cdef51dae37ff23e74d76d8a6ce introduced an array map_pages that contains the addresses returned by kmap_atomic. However, when unmapping those pages, map_pages[0] is unmapped before map_pages[1], breaking the nesting requirement as specified in the documentation for kmap_atomic/kunmap_atomic. This was caught by the highmem debug code present in kunmap_atomic. Fix the loop to do the unmapping properly. Link: http://lkml.kernel.org/r/1418871056-6614-1-git-send-email-markivx@codeaurora.org Cc: stable@vger.kernel.org # 3.5+ Reviewed-by: Stephen Boyd Reported-by: Lime Yang Signed-off-by: Vikram Mulukutla Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 5afce60e1b68..2078b86750e0 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4928,7 +4928,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, *fpos += written; out_unlock: - for (i = 0; i < nr_pages; i++){ + for (i = nr_pages - 1; i >= 0; i--) { kunmap_atomic(map_page[i]); put_page(pages[i]); } -- cgit v1.2.3 From 27ba0644ea9dfe6e7693abc85837b60e40583b96 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Tue, 10 Feb 2015 14:09:59 -0800 Subject: rmap: drop support of non-linear mappings We don't create non-linear mappings anymore. Let's drop code which handles them in rmap. Signed-off-by: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cachetlb.txt | 8 +- fs/inode.c | 1 - include/linux/fs.h | 4 +- include/linux/mm.h | 6 -- include/linux/mm_types.h | 4 +- include/linux/rmap.h | 2 - kernel/fork.c | 8 +- mm/migrate.c | 32 ------- mm/mmap.c | 24 ++--- mm/rmap.c | 225 +-------------------------------------------- mm/swap.c | 4 +- 11 files changed, 18 insertions(+), 300 deletions(-) (limited to 'kernel') diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index d79b008e4a32..3f9f808b5119 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -317,10 +317,10 @@ maps this page at its virtual address. about doing this. The idea is, first at flush_dcache_page() time, if - page->mapping->i_mmap is an empty tree and ->i_mmap_nonlinear - an empty list, just mark the architecture private page flag bit. 
- Later, in update_mmu_cache(), a check is made of this flag bit, - and if set the flush is done and the flag bit is cleared. + page->mapping->i_mmap is an empty tree, just mark the architecture + private page flag bit. Later, in update_mmu_cache(), a check is + made of this flag bit, and if set the flush is done and the flag + bit is cleared. IMPORTANT NOTE: It is often important, if you defer the flush, that the actual flush occurs on the same CPU diff --git a/fs/inode.c b/fs/inode.c index aa149e7262ac..c760fac33c92 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -355,7 +355,6 @@ void address_space_init_once(struct address_space *mapping) INIT_LIST_HEAD(&mapping->private_list); spin_lock_init(&mapping->private_lock); mapping->i_mmap = RB_ROOT; - INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); } EXPORT_SYMBOL(address_space_init_once); diff --git a/include/linux/fs.h b/include/linux/fs.h index 47f557c7ef7e..60acab209701 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -401,7 +401,6 @@ struct address_space { spinlock_t tree_lock; /* and lock protecting it */ atomic_t i_mmap_writable;/* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ - struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */ /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ @@ -493,8 +492,7 @@ static inline void i_mmap_unlock_read(struct address_space *mapping) */ static inline int mapping_mapped(struct address_space *mapping) { - return !RB_EMPTY_ROOT(&mapping->i_mmap) || - !list_empty(&mapping->i_mmap_nonlinear); + return !RB_EMPTY_ROOT(&mapping->i_mmap); } /* diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ddd9d1d6268..18391eec4864 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1796,12 +1796,6 @@ struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node, for (vma = vma_interval_tree_iter_first(root, start, last); \ vma; vma = vma_interval_tree_iter_next(vma, start, last)) -static inline void vma_nonlinear_insert(struct vm_area_struct *vma, - struct list_head *list) -{ - list_add_tail(&vma->shared.nonlinear, list); -} - void anon_vma_interval_tree_insert(struct anon_vma_chain *node, struct rb_root *root); void anon_vma_interval_tree_remove(struct anon_vma_chain *node, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6d34aa266a8c..3b1d20fb0848 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -273,15 +273,13 @@ struct vm_area_struct { /* * For areas with an address space and backing store, - * linkage into the address_space->i_mmap interval tree, or - * linkage of vma in the address_space->i_mmap_nonlinear list. + * linkage into the address_space->i_mmap interval tree. 
*/ union { struct { struct rb_node rb; unsigned long rb_subtree_last; } linear; - struct list_head nonlinear; } shared; /* diff --git a/include/linux/rmap.h b/include/linux/rmap.h index d9d7e7e56352..b38f559130d5 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -246,7 +246,6 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); * arg: passed to rmap_one() and invalid_vma() * rmap_one: executed on each vma where page is mapped * done: for checking traversing termination condition - * file_nonlinear: for handling file nonlinear mapping * anon_lock: for getting anon_lock by optimized way rather than default * invalid_vma: for skipping uninterested vma */ @@ -255,7 +254,6 @@ struct rmap_walk_control { int (*rmap_one)(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg); int (*done)(struct page *page); - int (*file_nonlinear)(struct page *, struct address_space *, void *arg); struct anon_vma *(*anon_lock)(struct page *page); bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; diff --git a/kernel/fork.c b/kernel/fork.c index 4dc2ddade9f1..b379d9abddc7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -438,12 +438,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) atomic_inc(&mapping->i_mmap_writable); flush_dcache_mmap_lock(mapping); /* insert tmp into the share list, just after mpnt */ - if (unlikely(tmp->vm_flags & VM_NONLINEAR)) - vma_nonlinear_insert(tmp, - &mapping->i_mmap_nonlinear); - else - vma_interval_tree_insert_after(tmp, mpnt, - &mapping->i_mmap); + vma_interval_tree_insert_after(tmp, mpnt, + &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); i_mmap_unlock_write(mapping); } diff --git a/mm/migrate.c b/mm/migrate.c index 344cdf692fc8..6e284bcca8bb 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -178,37 +178,6 @@ out: return SWAP_AGAIN; } -/* - * Congratulations to trinity for discovering this bug. - * mm/fremap.c's remap_file_pages() accepts any range within a single vma to - * convert that vma to VM_NONLINEAR; and generic_file_remap_pages() will then - * replace the specified range by file ptes throughout (maybe populated after). - * If page migration finds a page within that range, while it's still located - * by vma_interval_tree rather than lost to i_mmap_nonlinear list, no problem: - * zap_pte() clears the temporary migration entry before mmap_sem is dropped. - * But if the migrating page is in a part of the vma outside the range to be - * remapped, then it will not be cleared, and remove_migration_ptes() needs to - * deal with it. Fortunately, this part of the vma is of course still linear, - * so we just need to use linear location on the nonlinear list. - */ -static int remove_linear_migration_ptes_from_nonlinear(struct page *page, - struct address_space *mapping, void *arg) -{ - struct vm_area_struct *vma; - /* hugetlbfs does not support remap_pages, so no huge pgoff worries */ - pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); - unsigned long addr; - - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); - if (addr >= vma->vm_start && addr < vma->vm_end) - remove_migration_pte(page, vma, addr, arg); - } - return SWAP_AGAIN; -} - /* * Get rid of all migration entries and replace them by * references to the indicated page. 
@@ -218,7 +187,6 @@ static void remove_migration_ptes(struct page *old, struct page *new) struct rmap_walk_control rwc = { .rmap_one = remove_migration_pte, .arg = old, - .file_nonlinear = remove_linear_migration_ptes_from_nonlinear, }; rmap_walk(new, &rwc); diff --git a/mm/mmap.c b/mm/mmap.c index e023dc5e59a8..14d84666e8ba 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -243,10 +243,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma, mapping_unmap_writable(mapping); flush_dcache_mmap_lock(mapping); - if (unlikely(vma->vm_flags & VM_NONLINEAR)) - list_del_init(&vma->shared.nonlinear); - else - vma_interval_tree_remove(vma, &mapping->i_mmap); + vma_interval_tree_remove(vma, &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); } @@ -649,10 +646,7 @@ static void __vma_link_file(struct vm_area_struct *vma) atomic_inc(&mapping->i_mmap_writable); flush_dcache_mmap_lock(mapping); - if (unlikely(vma->vm_flags & VM_NONLINEAR)) - vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); - else - vma_interval_tree_insert(vma, &mapping->i_mmap); + vma_interval_tree_insert(vma, &mapping->i_mmap); flush_dcache_mmap_unlock(mapping); } } @@ -789,14 +783,11 @@ again: remove_next = 1 + (end > next->vm_end); if (file) { mapping = file->f_mapping; - if (!(vma->vm_flags & VM_NONLINEAR)) { - root = &mapping->i_mmap; - uprobe_munmap(vma, vma->vm_start, vma->vm_end); + root = &mapping->i_mmap; + uprobe_munmap(vma, vma->vm_start, vma->vm_end); - if (adjust_next) - uprobe_munmap(next, next->vm_start, - next->vm_end); - } + if (adjust_next) + uprobe_munmap(next, next->vm_start, next->vm_end); i_mmap_lock_write(mapping); if (insert) { @@ -3177,8 +3168,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * * mmap_sem in write mode is required in order to block all operations * that could modify pagetables and free pages without need of - * altering the vma layout (for example populate_range() with - * nonlinear vmas). It's also needed in write mode to avoid new + * altering the vma layout. It's also needed in write mode to avoid new * anon_vmas to be associated with existing vmas. * * A single task can't take more than one mm_take_all_locks() in a row diff --git a/mm/rmap.c b/mm/rmap.c index 71cd5bd0c17d..70b32498d4f2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -590,9 +590,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) if (!vma->anon_vma || !page__anon_vma || vma->anon_vma->root != page__anon_vma->root) return -EFAULT; - } else if (page->mapping && !(vma->vm_flags & VM_NONLINEAR)) { - if (!vma->vm_file || - vma->vm_file->f_mapping != page->mapping) + } else if (page->mapping) { + if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping) return -EFAULT; } else return -EFAULT; @@ -1274,7 +1273,6 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); set_pte_at(mm, address, pte, swp_pte); - BUG_ON(pte_file(*pte)); } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) { /* Establish migration entry for a file page */ @@ -1316,211 +1314,6 @@ out_mlock: return ret; } -/* - * objrmap doesn't work for nonlinear VMAs because the assumption that - * offset-into-file correlates with offset-into-virtual-addresses does not hold. - * Consequently, given a particular page and its ->index, we cannot locate the - * ptes which are mapping that page without an exhaustive linear search. 
- * - * So what this code does is a mini "virtual scan" of each nonlinear VMA which - * maps the file to which the target page belongs. The ->vm_private_data field - * holds the current cursor into that scan. Successive searches will circulate - * around the vma's virtual address space. - * - * So as more replacement pressure is applied to the pages in a nonlinear VMA, - * more scanning pressure is placed against them as well. Eventually pages - * will become fully unmapped and are eligible for eviction. - * - * For very sparsely populated VMAs this is a little inefficient - chances are - * there there won't be many ptes located within the scan cluster. In this case - * maybe we could scan further - to the end of the pte page, perhaps. - * - * Mlocked pages: check VM_LOCKED under mmap_sem held for read, if we can - * acquire it without blocking. If vma locked, mlock the pages in the cluster, - * rather than unmapping them. If we encounter the "check_page" that vmscan is - * trying to unmap, return SWAP_MLOCK, else default SWAP_AGAIN. - */ -#define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE) -#define CLUSTER_MASK (~(CLUSTER_SIZE - 1)) - -static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, - struct vm_area_struct *vma, struct page *check_page) -{ - struct mm_struct *mm = vma->vm_mm; - pmd_t *pmd; - pte_t *pte; - pte_t pteval; - spinlock_t *ptl; - struct page *page; - unsigned long address; - unsigned long mmun_start; /* For mmu_notifiers */ - unsigned long mmun_end; /* For mmu_notifiers */ - unsigned long end; - int ret = SWAP_AGAIN; - int locked_vma = 0; - - address = (vma->vm_start + cursor) & CLUSTER_MASK; - end = address + CLUSTER_SIZE; - if (address < vma->vm_start) - address = vma->vm_start; - if (end > vma->vm_end) - end = vma->vm_end; - - pmd = mm_find_pmd(mm, address); - if (!pmd) - return ret; - - mmun_start = address; - mmun_end = end; - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); - - /* - * If we can acquire the mmap_sem for read, and vma is VM_LOCKED, - * keep the sem while scanning the cluster for mlocking pages. - */ - if (down_read_trylock(&vma->vm_mm->mmap_sem)) { - locked_vma = (vma->vm_flags & VM_LOCKED); - if (!locked_vma) - up_read(&vma->vm_mm->mmap_sem); /* don't need it */ - } - - pte = pte_offset_map_lock(mm, pmd, address, &ptl); - - /* Update high watermark before we lower rss */ - update_hiwater_rss(mm); - - for (; address < end; pte++, address += PAGE_SIZE) { - if (!pte_present(*pte)) - continue; - page = vm_normal_page(vma, address, *pte); - BUG_ON(!page || PageAnon(page)); - - if (locked_vma) { - if (page == check_page) { - /* we know we have check_page locked */ - mlock_vma_page(page); - ret = SWAP_MLOCK; - } else if (trylock_page(page)) { - /* - * If we can lock the page, perform mlock. - * Otherwise leave the page alone, it will be - * eventually encountered again later. - */ - mlock_vma_page(page); - unlock_page(page); - } - continue; /* don't unmap */ - } - - /* - * No need for _notify because we're within an - * mmu_notifier_invalidate_range_ {start|end} scope. - */ - if (ptep_clear_flush_young(vma, address, pte)) - continue; - - /* Nuke the page table entry. */ - flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); - - /* If nonlinear, store the file page offset in the pte. 
*/ - if (page->index != linear_page_index(vma, address)) { - pte_t ptfile = pgoff_to_pte(page->index); - if (pte_soft_dirty(pteval)) - ptfile = pte_file_mksoft_dirty(ptfile); - set_pte_at(mm, address, pte, ptfile); - } - - /* Move the dirty bit to the physical page now the pte is gone. */ - if (pte_dirty(pteval)) - set_page_dirty(page); - - page_remove_rmap(page); - page_cache_release(page); - dec_mm_counter(mm, MM_FILEPAGES); - (*mapcount)--; - } - pte_unmap_unlock(pte - 1, ptl); - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); - if (locked_vma) - up_read(&vma->vm_mm->mmap_sem); - return ret; -} - -static int try_to_unmap_nonlinear(struct page *page, - struct address_space *mapping, void *arg) -{ - struct vm_area_struct *vma; - int ret = SWAP_AGAIN; - unsigned long cursor; - unsigned long max_nl_cursor = 0; - unsigned long max_nl_size = 0; - unsigned int mapcount; - - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - cursor = (unsigned long) vma->vm_private_data; - if (cursor > max_nl_cursor) - max_nl_cursor = cursor; - cursor = vma->vm_end - vma->vm_start; - if (cursor > max_nl_size) - max_nl_size = cursor; - } - - if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */ - return SWAP_FAIL; - } - - /* - * We don't try to search for this page in the nonlinear vmas, - * and page_referenced wouldn't have found it anyway. Instead - * just walk the nonlinear vmas trying to age and unmap some. - * The mapcount of the page we came in with is irrelevant, - * but even so use it as a guide to how hard we should try? - */ - mapcount = page_mapcount(page); - if (!mapcount) - return ret; - - cond_resched(); - - max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; - if (max_nl_cursor == 0) - max_nl_cursor = CLUSTER_SIZE; - - do { - list_for_each_entry(vma, - &mapping->i_mmap_nonlinear, shared.nonlinear) { - - cursor = (unsigned long) vma->vm_private_data; - while (cursor < max_nl_cursor && - cursor < vma->vm_end - vma->vm_start) { - if (try_to_unmap_cluster(cursor, &mapcount, - vma, page) == SWAP_MLOCK) - ret = SWAP_MLOCK; - cursor += CLUSTER_SIZE; - vma->vm_private_data = (void *) cursor; - if ((int)mapcount <= 0) - return ret; - } - vma->vm_private_data = (void *) max_nl_cursor; - } - cond_resched(); - max_nl_cursor += CLUSTER_SIZE; - } while (max_nl_cursor <= max_nl_size); - - /* - * Don't loop forever (perhaps all the remaining pages are - * in locked vmas). Reset cursor on all unreserved nonlinear - * vmas, now forgetting on which ones it had fallen behind. - */ - list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.nonlinear) - vma->vm_private_data = NULL; - - return ret; -} - bool is_vma_temporary_stack(struct vm_area_struct *vma) { int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP); @@ -1566,7 +1359,6 @@ int try_to_unmap(struct page *page, enum ttu_flags flags) .rmap_one = try_to_unmap_one, .arg = (void *)flags, .done = page_not_mapped, - .file_nonlinear = try_to_unmap_nonlinear, .anon_lock = page_lock_anon_vma_read, }; @@ -1612,12 +1404,6 @@ int try_to_munlock(struct page *page) .rmap_one = try_to_unmap_one, .arg = (void *)TTU_MUNLOCK, .done = page_not_mapped, - /* - * We don't bother to try to find the munlocked page in - * nonlinears. It's costly. Instead, later, page reclaim logic - * may call try_to_unmap() and recover PG_mlocked lazily. 
- */ - .file_nonlinear = NULL, .anon_lock = page_lock_anon_vma_read, }; @@ -1748,13 +1534,6 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) goto done; } - if (!rwc->file_nonlinear) - goto done; - - if (list_empty(&mapping->i_mmap_nonlinear)) - goto done; - - ret = rwc->file_nonlinear(page, mapping, rwc->arg); done: i_mmap_unlock_read(mapping); return ret; diff --git a/mm/swap.c b/mm/swap.c index 8a12b33936b4..5b3087228b99 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1140,10 +1140,8 @@ void __init swap_setup(void) if (bdi_init(swapper_spaces[0].backing_dev_info)) panic("Failed to init swap bdi"); - for (i = 0; i < MAX_SWAPFILES; i++) { + for (i = 0; i < MAX_SWAPFILES; i++) spin_lock_init(&swapper_spaces[i].tree_lock); - INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear); - } #endif /* Use a smaller cluster for small-memory machines */ -- cgit v1.2.3 From 3cd7645de624939c38f5124b4ac15f8b35a1a8b7 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 10 Feb 2015 14:11:33 -0800 Subject: mm, hugetlb: remove unnecessary lower bound on sysctl handlers Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero' leading to out-of-bounds access in proc_doulongvec_minmax(). Use '.extra1 = NULL' instead of '.extra1 = &zero'. Passing NULL is equivalent to passing minimal value, which is 0 for unsigned types. Fixes: ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity") Signed-off-by: Andrey Ryabinin Reported-by: Dmitry Vyukov Suggested-by: Manfred Spraul Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 137c7f69b264..88ea2d6e0031 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1248,7 +1248,6 @@ static struct ctl_table vm_table[] = { .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = hugetlb_sysctl_handler, - .extra1 = &zero, }, #ifdef CONFIG_NUMA { @@ -1257,7 +1256,6 @@ .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = &hugetlb_mempolicy_sysctl_handler, - .extra1 = &zero, }, #endif { @@ -1280,7 +1278,6 @@ .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = hugetlb_overcommit_handler, - .extra1 = &zero, }, #endif { -- cgit v1.2.3 From d64810f56147b53e92228c31442e925576314aa2 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 11 Feb 2015 15:01:13 +1030 Subject: module: Annotate nested sleep in resolve_symbol() Because wait_event() loops are safe vs spurious wakeups we can allow the occasional sleep -- which ends up being very similar. Reported-by: Dave Jones Signed-off-by: Peter Zijlstra (Intel) Tested-by: Dave Jones Signed-off-by: Rusty Russell --- kernel/module.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 2461370813b3..d7a92682fba3 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1225,6 +1225,12 @@ static const struct kernel_symbol *resolve_symbol(struct module *mod, const unsigned long *crc; int err; + /* + * The module_mutex should not be a heavily contended lock; + * if we get the occasional sleep here, we'll go an extra iteration + * in the wait_event_interruptible(), which is harmless.
+ */ + sched_annotate_sleep(); mutex_lock(&module_mutex); sym = find_symbol(name, &owner, &crc, !(mod->taints & (1 << TAINT_PROPRIETARY_MODULE)), true); -- cgit v1.2.3 From 9cc019b8c94fa59e02fd82f15f7b7d689e35c190 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 11 Feb 2015 15:01:13 +1030 Subject: module: Replace over-engineered nested sleep Since the introduction of the nested sleep warning; we've established that the occasional sleep inside a wait_event() is fine. wait_event() loops are invariant wrt. spurious wakeups, and the occasional sleep has a similar effect on them. As long as its occasional its harmless. Therefore replace the 'correct' but verbose wait_woken() thing with a simple annotation to shut up the warning. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Rusty Russell --- kernel/module.c | 36 ++++++++---------------------------- 1 file changed, 8 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index d7a92682fba3..82dc1f899e6d 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2984,6 +2984,12 @@ static bool finished_loading(const char *name) struct module *mod; bool ret; + /* + * The module_mutex should not be a heavily contended lock; + * if we get the occasional sleep here, we'll go an extra iteration + * in the wait_event_interruptible(), which is harmless. + */ + sched_annotate_sleep(); mutex_lock(&module_mutex); mod = find_module_all(name, strlen(name), true); ret = !mod || mod->state == MODULE_STATE_LIVE @@ -3125,32 +3131,6 @@ static int may_init_module(void) return 0; } -/* - * Can't use wait_event_interruptible() because our condition - * 'finished_loading()' contains a blocking primitive itself (mutex_lock). - */ -static int wait_finished_loading(struct module *mod) -{ - DEFINE_WAIT_FUNC(wait, woken_wake_function); - int ret = 0; - - add_wait_queue(&module_wq, &wait); - for (;;) { - if (finished_loading(mod->name)) - break; - - if (signal_pending(current)) { - ret = -ERESTARTSYS; - break; - } - - wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); - } - remove_wait_queue(&module_wq, &wait); - - return ret; -} - /* * We try to place it in the list now to make sure it's unique before * we dedicate too many resources. In particular, temporary percpu @@ -3171,8 +3151,8 @@ again: || old->state == MODULE_STATE_UNFORMED) { /* Wait in case it fails to load. */ mutex_unlock(&module_mutex); - - err = wait_finished_loading(mod); + err = wait_event_interruptible(module_wq, + finished_loading(mod->name)); if (err) goto out_unlocked; goto again; -- cgit v1.2.3 From 1e0d6714aceb770b04161fbedd7765d0e1fc27bd Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Red Hat)" Date: Tue, 10 Feb 2015 22:14:53 -0500 Subject: ring-buffer: Do not wake up a splice waiter when page is not full When an application connects to the ring buffer via splice, it can only read full pages. Splice does not work with partial pages. If there is not enough data to fill a page, the splice command will either block or return -EAGAIN (if set to nonblock). Code was added where if the page is not full, to just sleep again. The problem is, it will get woken up again on the next event. That is, when something is written into the ring buffer, if there is a waiter it will wake it up. The waiter would then check the buffer, see that it still does not have enough data to fill a page and go back to sleep. 
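For context, the consumer side of this interface is a splice() loop over a per-cpu buffer file. A rough userspace sketch (hypothetical reader, not part of the patch; the file name follows the usual per_cpu/cpuN/trace_pipe_raw layout):

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

static void drain(int trace_fd, int pipe_fd)
{
        for (;;) {
                ssize_t ret = splice(trace_fd, NULL, pipe_fd, NULL,
                                     4096, SPLICE_F_MOVE);
                if (ret < 0 && errno == EAGAIN)
                        continue;       /* nonblocking: no full page yet */
                if (ret <= 0)
                        break;
                /* ... consume the page from the pipe ... */
        }
}

Each ring-buffer event wakes the kernel-side waiter inside that blocking splice(), only for it to find the page still not full.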
To make matters worse, when the waiter goes back to sleep, it could cause another event, which would wake it back up again to see it doesn't have enough data and sleep again. This produces a tremendous overhead and fills the ring buffer with noise. For example, recording sched_switch on an idle system for 10 seconds produces 25,350,475 events!!! Create another wait queue for those waiters wanting full pages. When an event is written, it only wakes up waiters if there's a full page of data. It does not wake up the waiter if the page is not yet full. After this change, recording sched_switch on an idle system for 10 seconds produces only 800 events. Getting rid of 25,349,675 useless events (99.9969% of events!!), is something to take seriously. Cc: stable@vger.kernel.org # 3.16+ Cc: Rabin Vincent Fixes: e30f53aad220 "tracing: Do not busy wait in buffer splice" Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 40 +++++++++++++++++++++++++++++++++++----- 1 file changed, 35 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 96079180de3d..5040d44fe5a3 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -445,7 +445,10 @@ int ring_buffer_print_page_header(struct trace_seq *s) struct rb_irq_work { struct irq_work work; wait_queue_head_t waiters; + wait_queue_head_t full_waiters; bool waiters_pending; + bool full_waiters_pending; + bool wakeup_full; }; /* @@ -527,6 +530,10 @@ static void rb_wake_up_waiters(struct irq_work *work) struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, work); wake_up_all(&rbwork->waiters); + if (rbwork->wakeup_full) { + rbwork->wakeup_full = false; + wake_up_all(&rbwork->full_waiters); + } } /** @@ -551,9 +558,11 @@ int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) * data in any cpu buffer, or a specific buffer, put the * caller on the appropriate wait queue. */ - if (cpu == RING_BUFFER_ALL_CPUS) + if (cpu == RING_BUFFER_ALL_CPUS) { work = &buffer->irq_work; - else { + /* Full only makes sense on per cpu reads */ + full = false; + } else { if (!cpumask_test_cpu(cpu, buffer->cpumask)) return -ENODEV; cpu_buffer = buffer->buffers[cpu]; @@ -562,7 +571,10 @@ int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) while (true) { - prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE); + if (full) + prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE); + else + prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE); /* * The events can happen in critical sections where @@ -584,7 +596,10 @@ int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) * that is necessary is that the wake up happens after * a task has been queued. It's OK for spurious wake ups. 
*/ - work->waiters_pending = true; + if (full) + work->full_waiters_pending = true; + else + work->waiters_pending = true; if (signal_pending(current)) { ret = -EINTR; @@ -613,7 +628,10 @@ int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full) schedule(); } - finish_wait(&work->waiters, &wait); + if (full) + finish_wait(&work->full_waiters, &wait); + else + finish_wait(&work->waiters, &wait); return ret; } @@ -1228,6 +1246,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu) init_completion(&cpu_buffer->update_done); init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters); + init_waitqueue_head(&cpu_buffer->irq_work.full_waiters); bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), GFP_KERNEL, cpu_to_node(cpu)); @@ -2799,6 +2818,8 @@ static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer, static __always_inline void rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) { + bool pagebusy; + if (buffer->irq_work.waiters_pending) { buffer->irq_work.waiters_pending = false; /* irq_work_queue() supplies it's own memory barriers */ @@ -2810,6 +2831,15 @@ rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) /* irq_work_queue() supplies it's own memory barriers */ irq_work_queue(&cpu_buffer->irq_work.work); } + + pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; + + if (!pagebusy && cpu_buffer->irq_work.full_waiters_pending) { + cpu_buffer->irq_work.wakeup_full = true; + cpu_buffer->irq_work.full_waiters_pending = false; + /* irq_work_queue() supplies it's own memory barriers */ + irq_work_queue(&cpu_buffer->irq_work.work); + } } /** -- cgit v1.2.3 From c0135d07b013fa8f7ba9ec91b4369c372e6a28cb Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 22 Jan 2015 22:47:14 -0800 Subject: rcu: Clear need_qs flag to prevent splat If the scheduling-clock interrupt sets the current tasks need_qs flag, but if the current CPU passes through a quiescent state in the meantime, then rcu_preempt_qs() will fail to clear the need_qs flag, which can fool RCU into thinking that additional rcu_read_unlock_special() processing is needed. This commit therefore clears the need_qs flag before checking for additional processing. For this problem to occur, we need rcu_preempt_data.passed_quiesce equal to true and current->rcu_read_unlock_special.b.need_qs also equal to true. This condition can occur as follows: 1. CPU 0 is aware of the current preemptible RCU grace period, but has not yet passed through a quiescent state. Among other things, this means that rcu_preempt_data.passed_quiesce is false. 2. Task A running on CPU 0 enters a preemptible RCU read-side critical section. 3. CPU 0 takes a scheduling-clock interrupt, which notices the RCU read-side critical section and the need for a quiescent state, and thus sets current->rcu_read_unlock_special.b.need_qs to true. 4. Task A is preempted, enters the scheduler, eventually invoking rcu_preempt_note_context_switch() which in turn invokes rcu_preempt_qs(). Because rcu_preempt_data.passed_quiesce is false, control enters the body of the "if" statement, which sets rcu_preempt_data.passed_quiesce to true. 5. At this point, CPU 0 takes an interrupt. The interrupt handler contains an RCU read-side critical section, and the rcu_read_unlock() notes that current->rcu_read_unlock_special is nonzero, and thus invokes rcu_read_unlock_special(). 6. 
Once in rcu_read_unlock_special(), the fact that current->rcu_read_unlock_special.b.need_qs is true becomes apparent, so rcu_read_unlock_special() invokes rcu_preempt_qs(). Recursively, given that we interrupted out of that same function in the preceding step. 7. Because rcu_preempt_data.passed_quiesce is now true, rcu_preempt_qs() does nothing, and simply returns. 8. Upon return to rcu_read_unlock_special(), it is noted that current->rcu_read_unlock_special is still nonzero (because the interrupted rcu_preempt_qs() had not yet gotten around to clearing current->rcu_read_unlock_special.b.need_qs). 9. Execution proceeds to the WARN_ON_ONCE(), which notes that we are in an interrupt handler and thus duly splats. The solution, as noted above, is to make rcu_read_unlock_special() clear out current->rcu_read_unlock_special.b.need_qs after calling rcu_preempt_qs(). The interrupted rcu_preempt_qs() will clear it again, but this is harmless. The worst that happens is that we clobber another attempt to set this field, but this is not a problem because we just got done reporting a quiescent state. Reported-by: Sasha Levin Signed-off-by: Paul E. McKenney [ paulmck: Fix embarrassing build bug noted by Sasha Levin. ] Tested-by: Sasha Levin --- kernel/rcu/tree_plugin.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2e850a51bb8f..bca28b00f7e6 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -327,6 +327,7 @@ void rcu_read_unlock_special(struct task_struct *t) special = t->rcu_read_unlock_special; if (special.b.need_qs) { rcu_preempt_qs(); + t->rcu_read_unlock_special.b.need_qs = false; if (!t->rcu_read_unlock_special.s) { local_irq_restore(flags); return; -- cgit v1.2.3 From 49550b605587924b3336386caae53200c68969d3 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 11 Feb 2015 15:26:12 -0800 Subject: oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. 
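The suspend-side caller is then expected to take roughly this shape (a simplified sketch of where the series ends up, not the exact diff):

int freeze_processes(void)
{
        int error;

        error = try_to_freeze_tasks(true);      /* freeze user space first */
        if (!error) {
                /*
                 * Full barrier: either fails, or returns only after
                 * every already-selected OOM victim has exited.
                 */
                if (!oom_killer_disable())
                        error = -EBUSY;
        }
        return error;
}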
I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko Cc: Tejun Heo Cc: David Rientjes Cc: Johannes Weiner Cc: Oleg Nesterov Cc: Cong Wang Cc: "Rafael J. 
Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/staging/android/lowmemorykiller.c | 7 ++++++- include/linux/oom.h | 4 ++++ kernel/exit.c | 2 +- mm/memcontrol.c | 2 +- mm/oom_kill.c | 23 ++++++++++++++++++++--- 5 files changed, 32 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c index b545d3d1da3e..feafa172b155 100644 --- a/drivers/staging/android/lowmemorykiller.c +++ b/drivers/staging/android/lowmemorykiller.c @@ -160,7 +160,12 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc) selected->pid, selected->comm, selected_oom_score_adj, selected_tasksize); lowmem_deathpending_timeout = jiffies + HZ; - set_tsk_thread_flag(selected, TIF_MEMDIE); + /* + * FIXME: lowmemorykiller shouldn't abuse global OOM killer + * infrastructure. There is no real reason why the selected + * task should have access to the memory reserves. + */ + mark_tsk_oom_victim(selected); send_sig(SIGKILL, selected, 0); rem += selected_tasksize; } diff --git a/include/linux/oom.h b/include/linux/oom.h index 76200984d1e2..b42b80f88c3a 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p) return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN); } +extern void mark_tsk_oom_victim(struct task_struct *tsk); + +extern void unmark_oom_victim(void); + extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/kernel/exit.c b/kernel/exit.c index 6806c55475ee..02b3d1ab2ec0 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); - clear_thread_flag(TIF_MEMDIE); + unmark_oom_victim(); } static struct task_struct *find_alive_thread(struct task_struct *p) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 11c9e6a1dad5..fe4d258ef32b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1556,7 +1556,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * quickly exit and free its memory. */ if (fatal_signal_pending(current) || task_will_free_mem(current)) { - set_thread_flag(TIF_MEMDIE); + mark_tsk_oom_victim(current); return; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 294493a7ae4b..80b34e285f96 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -416,6 +416,23 @@ void note_oom_kill(void) atomic_inc(&oom_kills); } +/** + * mark_tsk_oom_victim - marks the given taks as OOM victim. + * @tsk: task to mark + */ +void mark_tsk_oom_victim(struct task_struct *tsk) +{ + set_tsk_thread_flag(tsk, TIF_MEMDIE); +} + +/** + * unmark_oom_victim - unmarks the current task as OOM victim. 
+ */ +void unmark_oom_victim(void) +{ + clear_thread_flag(TIF_MEMDIE); +} + #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -440,7 +457,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, */ task_lock(p); if (p->mm && task_will_free_mem(p)) { - set_tsk_thread_flag(p, TIF_MEMDIE); + mark_tsk_oom_victim(p); task_unlock(p); put_task_struct(p); return; @@ -495,7 +512,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, /* mm cannot safely be dereferenced after task_unlock(victim) */ mm = victim->mm; - set_tsk_thread_flag(victim, TIF_MEMDIE); + mark_tsk_oom_victim(victim); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -652,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, */ if (current->mm && (fatal_signal_pending(current) || task_will_free_mem(current))) { - set_thread_flag(TIF_MEMDIE); + mark_tsk_oom_victim(current); return; } -- cgit v1.2.3 From 35536ae170f01fb7e5ca032d5324d03e9e5a36bd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 11 Feb 2015 15:26:18 -0800 Subject: PM: convert printk to pr_* equivalent While touching this area let's convert printk to pr_*. This also makes the printing of continuation lines done properly. Signed-off-by: Michal Hocko Acked-by: Tejun Heo Cc: David Rientjes Cc: Johannes Weiner Cc: Oleg Nesterov Cc: Cong Wang Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/process.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..3ac45f192e9f 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only) elapsed_msecs = elapsed_msecs64; if (todo) { - printk("\n"); - printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds " + pr_cont("\n"); + pr_err("Freezing of tasks %s after %d.%03d seconds " "(%d tasks refusing to freeze, wq_busy=%d):\n", wakeup ? "aborted" : "failed", elapsed_msecs / 1000, elapsed_msecs % 1000, @@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only) read_unlock(&tasklist_lock); } } else { - printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, + pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, elapsed_msecs % 1000); } @@ -155,7 +155,7 @@ int freeze_processes(void) atomic_inc(&system_freezing_cnt); pm_wakeup_clear(); - printk("Freezing user space processes ... "); + pr_info("Freezing user space processes ... "); pm_freezing = true; oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); @@ -171,13 +171,13 @@ int freeze_processes(void) if (oom_kills_count() != oom_kills_saved && !check_frozen_processes()) { __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); + pr_cont("OOM in progress."); error = -EBUSY; } else { - printk("done."); + pr_cont("done."); } } - printk("\n"); + pr_cont("\n"); BUG_ON(in_atomic()); if (error) @@ -197,13 +197,14 @@ int freeze_kernel_threads(void) { int error; - printk("Freezing remaining freezable tasks ... "); + pr_info("Freezing remaining freezable tasks ... 
"); + pm_nosig_freezing = true; error = try_to_freeze_tasks(false); if (!error) - printk("done."); + pr_cont("done."); - printk("\n"); + pr_cont("\n"); BUG_ON(in_atomic()); if (error) @@ -224,7 +225,7 @@ void thaw_processes(void) oom_killer_enable(); - printk("Restarting tasks ... "); + pr_info("Restarting tasks ... "); __usermodehelper_set_disable_depth(UMH_FREEZING); thaw_workqueues(); @@ -243,7 +244,7 @@ void thaw_processes(void) usermodehelper_enable(); schedule(); - printk("done.\n"); + pr_cont("done.\n"); trace_suspend_resume(TPS("thaw_processes"), 0, false); } @@ -252,7 +253,7 @@ void thaw_kernel_threads(void) struct task_struct *g, *p; pm_nosig_freezing = false; - printk("Restarting kernel threads ... "); + pr_info("Restarting kernel threads ... "); thaw_workqueues(); @@ -264,5 +265,5 @@ void thaw_kernel_threads(void) read_unlock(&tasklist_lock); schedule(); - printk("done.\n"); + pr_cont("done.\n"); } -- cgit v1.2.3 From c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 11 Feb 2015 15:26:24 -0800 Subject: oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. 
This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko Suggested-by: Tejun Heo Cc: David Rientjes Cc: Johannes Weiner Cc: Oleg Nesterov Cc: Cong Wang Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/tty/sysrq.c | 5 +- include/linux/oom.h | 14 ++---- kernel/exit.c | 3 +- kernel/power/process.c | 50 ++++--------------- mm/memcontrol.c | 2 +- mm/oom_kill.c | 132 +++++++++++++++++++++++++++++++++++++++++-------- mm/page_alloc.c | 17 +------ 7 files changed, 132 insertions(+), 91 deletions(-) (limited to 'kernel') diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 0071469ecbf1..259a4d5a4e8f 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -355,8 +355,9 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, - 0, NULL, true); + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), + GFP_KERNEL, 0, NULL, true)) + pr_info("OOM request ignored because killer is disabled\n"); } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index b42b80f88c3a..d5771bed59c9 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -72,22 +72,14 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); extern bool oom_killer_disabled; - -static inline void oom_killer_disable(void) -{ - oom_killer_disabled = true; -} - -static inline void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} +extern bool oom_killer_disable(void); +extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); diff --git a/kernel/exit.c b/kernel/exit.c index 02b3d1ab2ec0..feff10bbb307 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -435,7 +435,8 @@ static void exit_mm(struct task_struct *tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); - unmark_oom_victim(); + if (test_thread_flag(TIF_MEMDIE)) + unmark_oom_victim(); } static struct task_struct *find_alive_thread(struct task_struct *p) diff --git a/kernel/power/process.c b/kernel/power/process.c index 3ac45f192e9f..564f786df470 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? 
-EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -157,29 +132,22 @@ int freeze_processes(void) pm_wakeup_clear(); pr_info("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); - oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - pr_cont("OOM in progress."); - error = -EBUSY; - } else { - pr_cont("done."); - } + pr_cont("done."); } pr_cont("\n"); BUG_ON(in_atomic()); + /* + * Now that the whole userspace is frozen we need to disable + * the OOM killer to disallow any further interference with + * killable tasks. + */ + if (!error && !oom_killer_disable()) + error = -EBUSY; + if (error) thaw_processes(); return error; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fe4d258ef32b..fbf64e6f64e4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1930,7 +1930,7 @@ bool mem_cgroup_oom_synchronize(bool handle) if (!memcg) return false; - if (!handle) + if (!handle || oom_killer_disabled) goto cleanup; owait.memcg = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3cbd76b8c13b..b8df76ee2be3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -398,30 +398,27 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, } /* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. + * Number of OOM victims in flight */ -static atomic_t oom_kills = ATOMIC_INIT(0); +static atomic_t oom_victims = ATOMIC_INIT(0); +static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait); -int oom_kills_count(void) -{ - return atomic_read(&oom_kills); -} - -void note_oom_kill(void) -{ - atomic_inc(&oom_kills); -} +bool oom_killer_disabled __read_mostly; +static DECLARE_RWSEM(oom_sem); /** * mark_tsk_oom_victim - marks the given task as OOM victim. * @tsk: task to mark + * + * Has to be called with oom_sem taken for read and never after + * oom has been disabled already.
 */ void mark_tsk_oom_victim(struct task_struct *tsk) { - set_tsk_thread_flag(tsk, TIF_MEMDIE); - + WARN_ON(oom_killer_disabled); + /* OOM killer might race with memcg OOM */ + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) + return; /* * Make sure that the task is woken up from uninterruptible sleep * if it is frozen because OOM killer wouldn't be able to free @@ -429,14 +426,70 @@ void mark_tsk_oom_victim(struct task_struct *tsk) * that TIF_MEMDIE tasks should be ignored. */ __thaw_task(tsk); + atomic_inc(&oom_victims); } /** * unmark_oom_victim - unmarks the current task as OOM victim. + * + * Wakes up all waiters in oom_killer_disable() */ void unmark_oom_victim(void) { - clear_thread_flag(TIF_MEMDIE); + if (!test_and_clear_thread_flag(TIF_MEMDIE)) + return; + + down_read(&oom_sem); + /* + * There is no need to signal the last oom_victim if there + * is nobody who cares. + */ + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) + wake_up_all(&oom_victims_wait); + up_read(&oom_sem); +} + +/** + * oom_killer_disable - disable OOM killer + * + * Forces all page allocations to fail rather than trigger the OOM killer. + * Will block and wait until all OOM victims are killed. + * + * The function cannot be called when there are runnable user tasks because + * userspace would see unexpected allocation failures as a result. Any + * new usage of this function should be discussed with MM people. + * + * Returns true if successful and false if the OOM killer cannot be + * disabled. + */ +bool oom_killer_disable(void) +{ + /* + * Make sure to not race with an ongoing OOM killer + * and that current is not the victim. + */ + down_write(&oom_sem); + if (test_thread_flag(TIF_MEMDIE)) { + up_write(&oom_sem); + return false; + } + + oom_killer_disabled = true; + up_write(&oom_sem); + + wait_event(oom_victims_wait, !atomic_read(&oom_victims)); + + return true; +} + +/** + * oom_killer_enable - enable OOM killer + */ +void oom_killer_enable(void) +{ + down_write(&oom_sem); + oom_killer_disabled = false; + up_write(&oom_sem); } #define K(x) ((x) << (PAGE_SHIFT-10)) @@ -637,7 +690,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) } /** - * out_of_memory - kill the "best" process when we run out of memory + * __out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 @@ -649,7 +702,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) * OR try to be smart about which process to kill. Note that we * don't have to be perfect here, we just have to be good. */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; @@ -718,6 +771,32 @@ out: schedule_timeout_killable(1); } +/** + * out_of_memory - tries to invoke OOM killer. + * @zonelist: zonelist pointer + * @gfp_mask: memory allocation flags + * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting + * + * Invokes __out_of_memory() and returns true, unless the OOM killer has been + * disabled by oom_killer_disable(), in which case it returns false.
+ */ +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, + int order, nodemask_t *nodemask, bool force_kill) +{ + bool ret = false; + + down_read(&oom_sem); + if (!oom_killer_disabled) { + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); + ret = true; + } + up_read(&oom_sem); + + return ret; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a @@ -727,12 +806,25 @@ void pagefault_out_of_memory(void) { struct zonelist *zonelist; + down_read(&oom_sem); if (mem_cgroup_oom_synchronize(true)) - return; + goto unlock; zonelist = node_zonelist(first_memory_node, GFP_KERNEL); if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { - out_of_memory(NULL, 0, 0, NULL, false); + if (!oom_killer_disabled) + __out_of_memory(NULL, 0, 0, NULL, false); + else + /* + * There shouldn't be any user tasks runnable while the + * OOM killer is disabled so the current task has to + * be a racing OOM victim which oom_killer_disable() + * is waiting for. + */ + WARN_ON(test_thread_flag(TIF_MEMDIE)); + oom_zonelist_unlock(zonelist, GFP_KERNEL); } +unlock: + up_read(&oom_sem); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 641d5a9a8617..134e25525044 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -244,8 +244,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } -bool oom_killer_disabled __read_mostly; - #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -2317,9 +2315,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, *did_some_progress = 0; - if (oom_killer_disabled) - return NULL; - /* * Acquire the per-zone oom lock for each zone. If that * fails, somebody else is making progress for us. @@ -2330,14 +2325,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, return NULL; } - /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if @@ -2372,8 +2359,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false); - *did_some_progress = 1; + if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) + *did_some_progress = 1; out: oom_zonelist_unlock(ac->zonelist, gfp_mask); return page; -- cgit v1.2.3 From dc6c9a35b66b520cf67e05d8ca60ebecad3b0479 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 11 Feb 2015 15:26:50 -0800 Subject: mm: account pmd page tables to the process Dave noticed that an unprivileged process can allocate a significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. The Linux kernel doesn't account PMD tables to the process, only PTE tables. The use-cases below use a few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0.
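For scale: each iteration of the loop below leaves behind one populated PMD page table -- 4 KiB on x86_64 -- that is charged to neither VmRSS nor VmPTE, so 130000 iterations pin roughly 130000 * 4 KiB ~= 508 MiB in page tables; that is the i * 4096 >> 10 KiB figure the program prints and the source of the >500 MiB number above.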
#include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror("re-mmap"), exit(1); } printf("PID %d consumed %lu KiB in PMD page tables\n", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by accounting PMD tables to the process the same way we account PTE tables. The main places where PMD tables are accounted are __pmd_alloc() and free_pmd_range(). But there are a few corner cases: - HugeTLB can share PMD page tables. The patch handles this by accounting the table to all processes that share it. - x86 PAE pre-allocates a few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity check on exit(2). Accounting only happens on configurations where the PMD page table level is present (PMD is not folded). As with nr_ptes we use a per-mm counter. The counter value is used to calculate the baseline for the badness score by the oom-killer. Signed-off-by: Kirill A. Shutemov Reported-by: Dave Hansen Cc: Hugh Dickins Reviewed-by: Cyrill Gorcunov Cc: Pavel Emelyanov Cc: David Rientjes Tested-by: Sedat Dilek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 12 ++++++------ arch/x86/mm/pgtable.c | 14 +++++++++----- fs/proc/task_mmu.c | 9 ++++++--- include/linux/mm.h | 24 ++++++++++++++++++++++++ include/linux/mm_types.h | 3 ++- kernel/fork.c | 3 +++ mm/debug.c | 3 ++- mm/hugetlb.c | 8 ++++++-- mm/memory.c | 15 +++++++++------ mm/mmap.c | 4 +++- mm/oom_kill.c | 9 +++++---- 11 files changed, 75 insertions(+), 29 deletions(-) (limited to 'kernel') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 4415aa915681..e9c706e4627a 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -555,12 +555,12 @@ this is causing problems for your system/application. oom_dump_tasks -Enables a system-wide task dump (excluding kernel threads) to be -produced when the kernel performs an OOM-killing and includes such -information as pid, uid, tgid, vm size, rss, nr_ptes, swapents, -oom_score_adj score, and name. This is helpful to determine why the -OOM killer was invoked, to identify the rogue task that caused it, -and to determine why the OOM killer chose the task it did to kill. +Enables a system-wide task dump (excluding kernel threads) to be produced +when the kernel performs an OOM-killing and includes such information as +pid, uid, tgid, vm size, rss, nr_ptes, nr_pmds, swapents, oom_score_adj +score, and name. This is helpful to determine why the OOM killer was +invoked, to identify the rogue task that caused it, and to determine why +the OOM killer chose the task it did to kill. If this is set to zero, this information is suppressed.
On very large systems with thousands of tasks it may not be feasible to dump diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 6fb6927f9e76..7b22adaad4f1 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -190,7 +190,7 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd) #endif /* CONFIG_X86_PAE */ -static void free_pmds(pmd_t *pmds[]) +static void free_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; @@ -198,10 +198,11 @@ static void free_pmds(pmd_t *pmds[]) if (pmds[i]) { pgtable_pmd_page_dtor(virt_to_page(pmds[i])); free_page((unsigned long)pmds[i]); + mm_dec_nr_pmds(mm); } } -static int preallocate_pmds(pmd_t *pmds[]) +static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; bool failed = false; @@ -215,11 +216,13 @@ static int preallocate_pmds(pmd_t *pmds[]) pmd = NULL; failed = true; } + if (pmd) + mm_inc_nr_pmds(mm); pmds[i] = pmd; } if (failed) { - free_pmds(pmds); + free_pmds(mm, pmds); return -ENOMEM; } @@ -246,6 +249,7 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp) paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT); pmd_free(mm, pmd); + mm_dec_nr_pmds(mm); } } } @@ -283,7 +287,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) mm->pgd = pgd; - if (preallocate_pmds(pmds) != 0) + if (preallocate_pmds(mm, pmds) != 0) goto out_free_pgd; if (paravirt_pgd_alloc(mm) != 0) @@ -304,7 +308,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) return pgd; out_free_pmds: - free_pmds(pmds); + free_pmds(mm, pmds); out_free_pgd: free_page((unsigned long)pgd); out: diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 6396f88c6687..e6e0abeb5d12 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -21,7 +21,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) { - unsigned long data, text, lib, swap; + unsigned long data, text, lib, swap, ptes, pmds; unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss; /* @@ -42,6 +42,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10; lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text; swap = get_mm_counter(mm, MM_SWAPENTS); + ptes = PTRS_PER_PTE * sizeof(pte_t) * atomic_long_read(&mm->nr_ptes); + pmds = PTRS_PER_PMD * sizeof(pmd_t) * mm_nr_pmds(mm); seq_printf(m, "VmPeak:\t%8lu kB\n" "VmSize:\t%8lu kB\n" @@ -54,6 +56,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) "VmExe:\t%8lu kB\n" "VmLib:\t%8lu kB\n" "VmPTE:\t%8lu kB\n" + "VmPMD:\t%8lu kB\n" "VmSwap:\t%8lu kB\n", hiwater_vm << (PAGE_SHIFT-10), total_vm << (PAGE_SHIFT-10), @@ -63,8 +66,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) total_rss << (PAGE_SHIFT-10), data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, - (PTRS_PER_PTE * sizeof(pte_t) * - atomic_long_read(&mm->nr_ptes)) >> 10, + ptes >> 10, + pmds >> 10, swap << (PAGE_SHIFT-10)); } diff --git a/include/linux/mm.h b/include/linux/mm.h index c6bf813a6b3d..644990b83cda 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1438,8 +1438,32 @@ static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud, { return 0; } + +static inline unsigned long mm_nr_pmds(struct mm_struct *mm) +{ + return 0; +} + +static inline void mm_inc_nr_pmds(struct mm_struct *mm) {} +static inline void mm_dec_nr_pmds(struct mm_struct *mm) {} + #else int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address); + +static inline unsigned long mm_nr_pmds(struct mm_struct *mm) +{ + return atomic_long_read(&mm->nr_pmds); +} + +static inline void 
mm_inc_nr_pmds(struct mm_struct *mm) +{ + atomic_long_inc(&mm->nr_pmds); +} + +static inline void mm_dec_nr_pmds(struct mm_struct *mm) +{ + atomic_long_dec(&mm->nr_pmds); +} #endif int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 20ff2105b564..199a03aab8dc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -363,7 +363,8 @@ struct mm_struct { pgd_t * pgd; atomic_t mm_users; /* How many users with user space? */ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ - atomic_long_t nr_ptes; /* Page table pages */ + atomic_long_t nr_ptes; /* PTE page table pages */ + atomic_long_t nr_pmds; /* PMD page table pages */ int map_count; /* number of VMAs */ spinlock_t page_table_lock; /* Protects page tables and some counters */ diff --git a/kernel/fork.c b/kernel/fork.c index b379d9abddc7..c99098c52641 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -555,6 +555,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; atomic_long_set(&mm->nr_ptes, 0); +#ifndef __PAGETABLE_PMD_FOLDED + atomic_long_set(&mm->nr_pmds, 0); +#endif mm->map_count = 0; mm->locked_vm = 0; mm->pinned_vm = 0; diff --git a/mm/debug.c b/mm/debug.c index d69cb5a7ba9a..3eb3ac2fcee7 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -173,7 +173,7 @@ void dump_mm(const struct mm_struct *mm) "get_unmapped_area %p\n" #endif "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" - "pgd %p mm_users %d mm_count %d nr_ptes %lu map_count %d\n" + "pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n" "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" "pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n" "start_code %lx end_code %lx start_data %lx end_data %lx\n" @@ -206,6 +206,7 @@ void dump_mm(const struct mm_struct *mm) mm->pgd, atomic_read(&mm->mm_users), atomic_read(&mm->mm_count), atomic_long_read((atomic_long_t *)&mm->nr_ptes), + mm_nr_pmds((struct mm_struct *)mm), mm->map_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index fd28d6ba5e5d..0a9ac6c26832 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3598,6 +3598,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (saddr) { spte = huge_pte_offset(svma->vm_mm, saddr); if (spte) { + mm_inc_nr_pmds(mm); get_page(virt_to_page(spte)); break; } @@ -3609,11 +3610,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) ptl = huge_pte_lockptr(hstate_vma(vma), mm, spte); spin_lock(ptl); - if (pud_none(*pud)) + if (pud_none(*pud)) { pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK)); - else + } else { put_page(virt_to_page(spte)); + mm_inc_nr_pmds(mm); + } spin_unlock(ptl); out: pte = (pte_t *)pmd_alloc(mm, pud, addr); @@ -3644,6 +3647,7 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) pud_clear(pud); put_page(virt_to_page(ptep)); + mm_dec_nr_pmds(mm); *addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE; return 1; } diff --git a/mm/memory.c b/mm/memory.c index d63849b5188f..bbe6a73a899d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -428,6 +428,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, pmd = pmd_offset(pud, start); pud_clear(pud); pmd_free_tlb(tlb, pmd, start); + mm_dec_nr_pmds(tlb->mm); } 
static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd, @@ -3322,15 +3323,17 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) spin_lock(&mm->page_table_lock); #ifndef __ARCH_HAS_4LEVEL_HACK - if (pud_present(*pud)) /* Another has populated it */ - pmd_free(mm, new); - else + if (!pud_present(*pud)) { + mm_inc_nr_pmds(mm); pud_populate(mm, pud, new); -#else - if (pgd_present(*pud)) /* Another has populated it */ + } else /* Another has populated it */ pmd_free(mm, new); - else +#else + if (!pgd_present(*pud)) { + mm_inc_nr_pmds(mm); pgd_populate(mm, pud, new); + } else /* Another has populated it */ + pmd_free(mm, new); #endif /* __ARCH_HAS_4LEVEL_HACK */ spin_unlock(&mm->page_table_lock); return 0; diff --git a/mm/mmap.c b/mm/mmap.c index 14d84666e8ba..6a7d36d133fb 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2853,7 +2853,9 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); WARN_ON(atomic_long_read(&mm->nr_ptes) > - (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); + round_up(FIRST_USER_ADDRESS, PMD_SIZE) >> PMD_SHIFT); + WARN_ON(mm_nr_pmds(mm) > + round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT); } /* Insert vm structure into process list sorted by address diff --git a/mm/oom_kill.c b/mm/oom_kill.c index b8df76ee2be3..642f38cb175a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -169,8 +169,8 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ - points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) + - get_mm_counter(p->mm, MM_SWAPENTS); + points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + + atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm); task_unlock(p); /* @@ -351,7 +351,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) struct task_struct *p; struct task_struct *task; - pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name\n"); rcu_read_lock(); for_each_process(p) { if (oom_unkillable_task(p, memcg, nodemask)) @@ -367,10 +367,11 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) continue; } - pr_info("[%5d] %5d %5d %8lu %8lu %7ld %8lu %5hd %s\n", + pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu %5hd %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), atomic_long_read(&task->mm->nr_ptes), + mm_nr_pmds(task->mm), get_mm_counter(task->mm, MM_SWAPENTS), task->signal->oom_score_adj, task->comm); task_unlock(task); -- cgit v1.2.3 From b30fe6c7ced70f62862c3d09357e7e8084e98d9f Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 11 Feb 2015 15:26:53 -0800 Subject: mm: fix false-positive warning on exit due mm_nr_pmds(mm) The problem is that we check nr_ptes/nr_pmds in exit_mmap() which happens *before* pgd_free(). And if an arch does pte/pmd allocation in pgd_alloc() and frees them in pgd_free() we see offset in counters by the time of the checks. We tried to workaround this by offsetting expected counter value according to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in exit_mmap(). But it doesn't work in some cases: 1. 
ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but the upper addresses are occupied with huge pmd entries, so the trick with offsetting the expected counter value will get really ugly: we will have to apply it to nr_pmds, but not nr_ptes. 2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd page table allocation in pgd_alloc(); it just sets up a pgd entry which is allocated at boot and shared across all processes. The proposal is to move the check to check_mm() which happens *after* pgd_free() and do proper accounting during pgd_alloc() and pgd_free() which would bring counters to zero if nothing leaked. Signed-off-by: Kirill A. Shutemov Reported-by: Tyler Baker Tested-by: Tyler Baker Tested-by: Nishanth Menon Cc: Russell King Cc: James Hogan Cc: Guan Xuetao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm/mm/pgd.c | 4 ++++ arch/unicore32/mm/pgd.c | 3 +++ kernel/fork.c | 8 ++++++++ mm/mmap.c | 5 ----- 4 files changed, 15 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c index 249379535be2..a3681f11dd9f 100644 --- a/arch/arm/mm/pgd.c +++ b/arch/arm/mm/pgd.c @@ -97,6 +97,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) no_pte: pmd_free(mm, new_pmd); + mm_dec_nr_pmds(mm); no_pmd: pud_free(mm, new_pud); no_pud: @@ -130,9 +131,11 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd_base) pte = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free(mm, pte); + atomic_long_dec(&mm->nr_ptes); no_pmd: pud_clear(pud); pmd_free(mm, pmd); + mm_dec_nr_pmds(mm); no_pud: pgd_clear(pgd); pud_free(mm, pud); @@ -152,6 +155,7 @@ no_pgd: pmd = pmd_offset(pud, 0); pud_clear(pud); pmd_free(mm, pmd); + mm_dec_nr_pmds(mm); pgd_clear(pgd); pud_free(mm, pud); } diff --git a/arch/unicore32/mm/pgd.c b/arch/unicore32/mm/pgd.c index 08b8d4295e70..2ade20d8eab3 100644 --- a/arch/unicore32/mm/pgd.c +++ b/arch/unicore32/mm/pgd.c @@ -69,6 +69,7 @@ pgd_t *get_pgd_slow(struct mm_struct *mm) no_pte: pmd_free(mm, new_pmd); + mm_dec_nr_pmds(mm); no_pmd: free_pages((unsigned long)new_pgd, 0); no_pgd: @@ -96,7 +97,9 @@ void free_pgd_slow(struct mm_struct *mm, pgd_t *pgd) pte = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free(mm, pte); + atomic_long_dec(&mm->nr_ptes); pmd_free(mm, pmd); + mm_dec_nr_pmds(mm); free: free_pages((unsigned long) pgd, 0); } diff --git a/kernel/fork.c b/kernel/fork.c index c99098c52641..66e19c251581 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -606,6 +606,14 @@ static void check_mm(struct mm_struct *mm) printk(KERN_ALERT "BUG: Bad rss-counter state " "mm:%p idx:%d val:%ld\n", mm, i, x); } + + if (atomic_long_read(&mm->nr_ptes)) + pr_alert("BUG: non-zero nr_ptes on freeing mm: %ld\n", + atomic_long_read(&mm->nr_ptes)); + if (mm_nr_pmds(mm)) + pr_alert("BUG: non-zero nr_pmds on freeing mm: %ld\n", + mm_nr_pmds(mm)); + #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS VM_BUG_ON_MM(mm->pmd_huge_pte, mm); #endif diff --git a/mm/mmap.c b/mm/mmap.c index 6a7d36d133fb..c5f44682c0d1 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2851,11 +2851,6 @@ void exit_mmap(struct mm_struct *mm) vma = remove_vma(vma); } vm_unacct_memory(nr_accounted); - - WARN_ON(atomic_long_read(&mm->nr_ptes) > - round_up(FIRST_USER_ADDRESS, PMD_SIZE) >> PMD_SHIFT); - WARN_ON(mm_nr_pmds(mm) > - round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT); } /* Insert vm structure into process list sorted by address -- cgit v1.2.3 From 9791554b45a2acc28247f66a5fd5bbc212a6b8c8 Mon Sep 17 00:00:00 2001 From: Paul Burton Date: Thu, 8 Jan 2015 12:17:37 +0000
Subject: MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPS Userland code may be built using an ABI which permits linking to objects that have more restrictive floating point requirements. For example, userland code may be built to target the O32 FPXX ABI. Such code may be linked with other FPXX code, or code built for either one of the more restrictive FP32 or FP64. When linking with more restrictive code, the overall requirement of the process becomes that of the more restrictive code. The kernel has no way to know in advance which mode the process will need to be executed in, and indeed it may need to change during execution. The dynamic loader is the only code which will know the overall required mode, and so it needs to have a means to instruct the kernel to switch the FP mode of the process. This patch introduces 2 new options to the prctl syscall which provide such a capability. The FP mode of the process is represented as a simple bitmask combining a number of mode bits mirroring those present in the hardware. Userland can either retrieve the current FP mode of the process: mode = prctl(PR_GET_FP_MODE); or modify the current FP mode of the process: err = prctl(PR_SET_FP_MODE, new_mode); Signed-off-by: Paul Burton Cc: Matthew Fortune Cc: Markos Chandras Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/8899/ Signed-off-by: Ralf Baechle --- arch/mips/include/asm/mmu.h | 3 ++ arch/mips/include/asm/mmu_context.h | 2 + arch/mips/include/asm/processor.h | 11 +++++ arch/mips/kernel/process.c | 92 +++++++++++++++++++++++++++++++++++++ arch/mips/kernel/traps.c | 19 ++++++++ include/uapi/linux/prctl.h | 5 ++ kernel/sys.c | 12 +++++ 7 files changed, 144 insertions(+) (limited to 'kernel') diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h index c436138945a8..1afa1f986df8 100644 --- a/arch/mips/include/asm/mmu.h +++ b/arch/mips/include/asm/mmu.h @@ -1,9 +1,12 @@ #ifndef __ASM_MMU_H #define __ASM_MMU_H +#include + typedef struct { unsigned long asid[NR_CPUS]; void *vdso; + atomic_t fp_mode_switching; } mm_context_t; #endif /* __ASM_MMU_H */ diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h index 2f82568a3ee4..87f11072f557 100644 --- a/arch/mips/include/asm/mmu_context.h +++ b/arch/mips/include/asm/mmu_context.h @@ -132,6 +132,8 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm) for_each_possible_cpu(i) cpu_context(i, mm) = 0; + atomic_set(&mm->context.fp_mode_switching, 0); + return 0; } diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h index f1df4cb4a286..9daa38608cd8 100644 --- a/arch/mips/include/asm/processor.h +++ b/arch/mips/include/asm/processor.h @@ -399,4 +399,15 @@ unsigned long get_wchan(struct task_struct *p); #endif +/* + * Functions & macros implementing the PR_GET_FP_MODE & PR_SET_FP_MODE options + * to the prctl syscall. 
+ */ +extern int mips_get_process_fp_mode(struct task_struct *task); +extern int mips_set_process_fp_mode(struct task_struct *task, + unsigned int value); + +#define GET_FP_MODE(task) mips_get_process_fp_mode(task) +#define SET_FP_MODE(task,value) mips_set_process_fp_mode(task, value) + #endif /* _ASM_PROCESSOR_H */ diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c index eb76434828e8..4677b4c67da6 100644 --- a/arch/mips/kernel/process.c +++ b/arch/mips/kernel/process.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -550,3 +551,94 @@ void arch_trigger_all_cpu_backtrace(bool include_self) { smp_call_function(arch_dump_stack, NULL, 1); } + +int mips_get_process_fp_mode(struct task_struct *task) +{ + int value = 0; + + if (!test_tsk_thread_flag(task, TIF_32BIT_FPREGS)) + value |= PR_FP_MODE_FR; + if (test_tsk_thread_flag(task, TIF_HYBRID_FPREGS)) + value |= PR_FP_MODE_FRE; + + return value; +} + +int mips_set_process_fp_mode(struct task_struct *task, unsigned int value) +{ + const unsigned int known_bits = PR_FP_MODE_FR | PR_FP_MODE_FRE; + unsigned long switch_count; + struct task_struct *t; + + /* Check the value is valid */ + if (value & ~known_bits) + return -EOPNOTSUPP; + + /* Avoid inadvertently triggering emulation */ + if ((value & PR_FP_MODE_FR) && cpu_has_fpu && + !(current_cpu_data.fpu_id & MIPS_FPIR_F64)) + return -EOPNOTSUPP; + if ((value & PR_FP_MODE_FRE) && cpu_has_fpu && !cpu_has_fre) + return -EOPNOTSUPP; + + /* Save FP & vector context, then disable FPU & MSA */ + if (task->signal == current->signal) + lose_fpu(1); + + /* Prevent any threads from obtaining live FP context */ + atomic_set(&task->mm->context.fp_mode_switching, 1); + smp_mb__after_atomic(); + + /* + * If there are multiple online CPUs then wait until all threads whose + * FP mode is about to change have been context switched. This approach + * allows us to only worry about whether an FP mode switch is in + * progress when FP is first used in a task's time slice. Pretty much all + * of the mode switch overhead can thus be confined to cases where mode + * switches are actually occurring. That is, to here. However for the + * thread performing the mode switch it may take a while... + */ + if (num_online_cpus() > 1) { + spin_lock_irq(&task->sighand->siglock); + + for_each_thread(task, t) { + if (t == current) + continue; + + switch_count = t->nvcsw + t->nivcsw; + + do { + spin_unlock_irq(&task->sighand->siglock); + cond_resched(); + spin_lock_irq(&task->sighand->siglock); + } while ((t->nvcsw + t->nivcsw) == switch_count); + } + + spin_unlock_irq(&task->sighand->siglock); + } + + /* + * There are now no threads of the process with live FP context, so it + * is safe to proceed with the FP mode switch. 
+ */ + for_each_thread(task, t) { + /* Update desired FP register width */ + if (value & PR_FP_MODE_FR) { + clear_tsk_thread_flag(t, TIF_32BIT_FPREGS); + } else { + set_tsk_thread_flag(t, TIF_32BIT_FPREGS); + clear_tsk_thread_flag(t, TIF_MSA_CTX_LIVE); + } + + /* Update desired FP single layout */ + if (value & PR_FP_MODE_FRE) + set_tsk_thread_flag(t, TIF_HYBRID_FPREGS); + else + clear_tsk_thread_flag(t, TIF_HYBRID_FPREGS); + } + + /* Allow threads to use FP again */ + atomic_set(&task->mm->context.fp_mode_switching, 0); + + return 0; +} diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c index ad3d2031c327..d5fbfb51b9da 100644 --- a/arch/mips/kernel/traps.c +++ b/arch/mips/kernel/traps.c @@ -1134,10 +1134,29 @@ static int default_cu2_call(struct notifier_block *nfb, unsigned long action, return NOTIFY_OK; } +static int wait_on_fp_mode_switch(atomic_t *p) +{ + /* + * The FP mode for this task is currently being switched. That may + * involve modifications to the format of this task's FP context which + * make it unsafe to proceed with execution for the moment. Instead, + * schedule some other task. + */ + schedule(); + return 0; +} + static int enable_restore_fp_context(int msa) { int err, was_fpu_owner, prior_msa; + /* + * If an FP mode switch is currently underway, wait for it to + * complete before proceeding. + */ + wait_on_atomic_t(&current->mm->context.fp_mode_switching, + wait_on_fp_mode_switch, TASK_KILLABLE); + if (!used_math()) { /* First time FP context user. */ preempt_disable(); diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 89f63503f903..31891d9535e2 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -185,4 +185,9 @@ struct prctl_mm_map { #define PR_MPX_ENABLE_MANAGEMENT 43 #define PR_MPX_DISABLE_MANAGEMENT 44 +#define PR_SET_FP_MODE 45 +#define PR_GET_FP_MODE 46 +# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ +# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index a8c9f5a7dda6..08b16bbdcdf0 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -97,6 +97,12 @@ #ifndef MPX_DISABLE_MANAGEMENT # define MPX_DISABLE_MANAGEMENT(a) (-EINVAL) #endif +#ifndef GET_FP_MODE +# define GET_FP_MODE(a) (-EINVAL) +#endif +#ifndef SET_FP_MODE +# define SET_FP_MODE(a,b) (-EINVAL) +#endif /* * this is where the system-wide overflow UID and GID are defined, for @@ -2215,6 +2221,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_MPX_DISABLE_MANAGEMENT: error = MPX_DISABLE_MANAGEMENT(me); break; + case PR_SET_FP_MODE: + error = SET_FP_MODE(me, arg2); + break; + case PR_GET_FP_MODE: + error = GET_FP_MODE(me); + break; default: error = -EINVAL; break; -- cgit v1.2.3 From 01e586598b224d1a427acd8a7afa0b21e879d3a7 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:26 -0800 Subject: cgroup: release css->id after css_free Currently, we release css->id in css_release_work_fn, right before calling the css_free callback, so that when css_free is called, the id may have already been reused for a new cgroup. I am going to use css->id to create unique names for per memcg kmem caches. Since kmem caches are destroyed only on css_free, I need css->id to be freed after css_free has been called to avoid name clashes. This patch therefore moves css->id removal to css_free_work_fn. To prevent css_from_id from returning a pointer to a stale css, it makes css_release_work_fn replace the css ptr at css_idr:css->id with NULL. 
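The resulting ordering, condensed here as a sketch from the two work functions in the diff below (the function names css_release_sketch/css_free_sketch are illustrative only; locking and the cgroup free path are omitted):

static void css_release_sketch(struct cgroup_subsys_state *css)
{
	/* keep the id allocated, but hide the dying css: */
	/* css_from_id() now returns NULL instead of a stale pointer */
	cgroup_idr_replace(&css->ss->css_idr, NULL, css->id);
}

static void css_free_sketch(struct cgroup_subsys_state *css)
{
	int id = css->id;	/* ->css_free() may still rely on css->id */

	css->ss->css_free(css);
	cgroup_idr_remove(&css->ss->css_idr, id);	/* id free for reuse */
}

so a concurrent css_from_id() sees either a live css or NULL, never a freed one, and the id cannot be recycled until after ->css_free() has run.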
Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Acked-by: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cgroup.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 04cfe8ace520..d5f6ec251fb2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4373,16 +4373,20 @@ static void css_free_work_fn(struct work_struct *work) { struct cgroup_subsys_state *css = container_of(work, struct cgroup_subsys_state, destroy_work); + struct cgroup_subsys *ss = css->ss; struct cgroup *cgrp = css->cgroup; percpu_ref_exit(&css->refcnt); - if (css->ss) { + if (ss) { /* css free path */ + int id = css->id; + if (css->parent) css_put(css->parent); - css->ss->css_free(css); + ss->css_free(css); + cgroup_idr_remove(&ss->css_idr, id); cgroup_put(cgrp); } else { /* cgroup free path */ @@ -4434,7 +4438,7 @@ static void css_release_work_fn(struct work_struct *work) if (ss) { /* css release path */ - cgroup_idr_remove(&ss->css_idr, css->id); + cgroup_idr_replace(&ss->css_idr, NULL, css->id); if (ss->css_released) ss->css_released(css); } else { -- cgit v1.2.3 From 2d2f5119b8bb057595e18f5b2f07aa097ea1b233 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 12 Feb 2015 14:59:59 -0800 Subject: mm: do not use mm->nr_pmds on !MMU configurations mm->nr_pmds doesn't make sense on !MMU configurations Signed-off-by: Kirill A. Shutemov Cc: Guenter Roeck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 9 ++++++++- kernel/fork.c | 4 +--- 2 files changed, 9 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/mm.h b/include/linux/mm.h index af4ff88a11e0..bd52e2f14027 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1447,13 +1447,15 @@ static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address); #endif -#ifdef __PAGETABLE_PMD_FOLDED +#if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU) static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) { return 0; } +static inline void mm_nr_pmds_init(struct mm_struct *mm) {} + static inline unsigned long mm_nr_pmds(struct mm_struct *mm) { return 0; @@ -1465,6 +1467,11 @@ static inline void mm_dec_nr_pmds(struct mm_struct *mm) {} #else int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address); +static inline void mm_nr_pmds_init(struct mm_struct *mm) +{ + atomic_long_set(&mm->nr_pmds, 0); +} + static inline unsigned long mm_nr_pmds(struct mm_struct *mm) { return atomic_long_read(&mm->nr_pmds); diff --git a/kernel/fork.c b/kernel/fork.c index 66e19c251581..cf65139615a0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -555,9 +555,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; atomic_long_set(&mm->nr_ptes, 0); -#ifndef __PAGETABLE_PMD_FOLDED - atomic_long_set(&mm->nr_pmds, 0); -#endif + mm_nr_pmds_init(mm); mm->map_count = 0; mm->locked_vm = 0; mm->pinned_vm = 0; -- cgit v1.2.3 From 8f4ab07f4bf1b1069c01b7c6758a7d444406996b Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:16 -0800 Subject: kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __init The only caller of cpuset_init_current_mems_allowed is the __init annotated 
build_all_zonelists_init, so we can also make the former __init. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 64b257f6bca2..c54a1dae6c11 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2400,7 +2400,7 @@ void cpuset_cpus_allowed_fallback(struct task_struct *tsk) */ } -void cpuset_init_current_mems_allowed(void) +void __init cpuset_init_current_mems_allowed(void) { nodes_setall(current->mems_allowed); } -- cgit v1.2.3 From f56141e3e2d9aabf7e6b89680ab572c2cdbb2a24 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 12 Feb 2015 15:01:14 -0800 Subject: all arches, signal: move restart_block to struct task_struct If an attacker can cause a controlled kernel stack overflow, overwriting the restart block is a very juicy exploit target. This is because the restart_block is held in the same memory allocation as the kernel stack. Moving the restart block to struct task_struct prevents this exploit by making the restart_block harder to locate. Note that there are other fields in thread_info that are also easy targets, at least on some architectures. It's also a decent simplification, since the restart code is more or less identical on all architectures. [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack] Signed-off-by: Andy Lutomirski Cc: Thomas Gleixner Cc: Al Viro Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Kees Cook Cc: David Miller Acked-by: Richard Weinberger Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Vineet Gupta Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Haavard Skinnemoen Cc: Hans-Christian Egtvedt Cc: Steven Miao Cc: Mark Salter Cc: Aurelien Jacquiot Cc: Mikael Starvik Cc: Jesper Nilsson Cc: David Howells Cc: Richard Kuo Cc: "Luck, Tony" Cc: Geert Uytterhoeven Cc: Michal Simek Cc: Ralf Baechle Cc: Jonas Bonn Cc: "James E.J. 
Bottomley" Cc: Helge Deller Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Acked-by: Michael Ellerman (powerpc) Tested-by: Michael Ellerman (powerpc) Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Chen Liqin Cc: Lennox Wu Cc: Chris Metcalf Cc: Guan Xuetao Cc: Chris Zankel Cc: Max Filippov Cc: Oleg Nesterov Cc: Guenter Roeck Signed-off-by: James Hogan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/thread_info.h | 5 ----- arch/alpha/kernel/signal.c | 2 +- arch/arc/include/asm/thread_info.h | 4 ---- arch/arc/kernel/signal.c | 2 +- arch/arm/include/asm/thread_info.h | 4 ---- arch/arm/kernel/signal.c | 4 ++-- arch/arm64/include/asm/thread_info.h | 4 ---- arch/arm64/kernel/signal.c | 2 +- arch/arm64/kernel/signal32.c | 4 ++-- arch/avr32/include/asm/thread_info.h | 4 ---- arch/avr32/kernel/asm-offsets.c | 1 - arch/avr32/kernel/signal.c | 2 +- arch/blackfin/include/asm/thread_info.h | 4 ---- arch/blackfin/kernel/signal.c | 2 +- arch/c6x/include/asm/thread_info.h | 4 ---- arch/c6x/kernel/signal.c | 2 +- arch/cris/arch-v10/kernel/signal.c | 2 +- arch/cris/arch-v32/kernel/signal.c | 2 +- arch/cris/include/asm/thread_info.h | 4 ---- arch/frv/include/asm/thread_info.h | 4 ---- arch/frv/kernel/asm-offsets.c | 1 - arch/frv/kernel/signal.c | 2 +- arch/hexagon/include/asm/thread_info.h | 4 ---- arch/hexagon/kernel/signal.c | 2 +- arch/ia64/include/asm/thread_info.h | 4 ---- arch/ia64/kernel/signal.c | 2 +- arch/m32r/include/asm/thread_info.h | 5 ----- arch/m32r/kernel/signal.c | 2 +- arch/m68k/include/asm/thread_info.h | 4 ---- arch/m68k/kernel/signal.c | 4 ++-- arch/metag/include/asm/thread_info.h | 6 +----- arch/metag/kernel/signal.c | 2 +- arch/microblaze/include/asm/thread_info.h | 4 ---- arch/microblaze/kernel/signal.c | 2 +- arch/mips/include/asm/thread_info.h | 4 ---- arch/mips/kernel/asm-offsets.c | 1 - arch/mips/kernel/signal.c | 2 +- arch/mips/kernel/signal32.c | 2 +- arch/mn10300/include/asm/thread_info.h | 4 ---- arch/mn10300/kernel/asm-offsets.c | 1 - arch/mn10300/kernel/signal.c | 2 +- arch/openrisc/include/asm/thread_info.h | 4 ---- arch/openrisc/kernel/signal.c | 2 +- arch/parisc/include/asm/thread_info.h | 4 ---- arch/parisc/kernel/signal.c | 2 +- arch/powerpc/include/asm/thread_info.h | 4 ---- arch/powerpc/kernel/signal_32.c | 4 ++-- arch/powerpc/kernel/signal_64.c | 2 +- arch/s390/include/asm/thread_info.h | 4 ---- arch/s390/kernel/compat_signal.c | 2 +- arch/s390/kernel/signal.c | 2 +- arch/score/include/asm/thread_info.h | 4 ---- arch/score/kernel/asm-offsets.c | 1 - arch/score/kernel/signal.c | 2 +- arch/sh/include/asm/thread_info.h | 4 ---- arch/sh/kernel/asm-offsets.c | 1 - arch/sh/kernel/signal_32.c | 4 ++-- arch/sh/kernel/signal_64.c | 4 ++-- arch/sparc/include/asm/thread_info_32.h | 6 ------ arch/sparc/include/asm/thread_info_64.h | 12 +++--------- arch/sparc/kernel/signal32.c | 4 ++-- arch/sparc/kernel/signal_32.c | 2 +- arch/sparc/kernel/signal_64.c | 2 +- arch/sparc/kernel/traps_64.c | 2 -- arch/tile/include/asm/thread_info.h | 4 ---- arch/tile/kernel/signal.c | 2 +- arch/um/include/asm/thread_info.h | 4 ---- arch/unicore32/include/asm/thread_info.h | 4 ---- arch/unicore32/kernel/signal.c | 2 +- arch/x86/ia32/ia32_signal.c | 2 +- arch/x86/include/asm/thread_info.h | 4 ---- arch/x86/kernel/signal.c | 2 +- arch/x86/um/signal.c | 2 +- arch/xtensa/include/asm/thread_info.h | 5 ----- arch/xtensa/kernel/signal.c | 2 +- fs/select.c | 2 +- include/linux/init_task.h | 3 +++ include/linux/sched.h | 2 ++ kernel/compat.c | 5 ++--- kernel/futex.c | 2 +- 
kernel/signal.c | 2 +- kernel/time/alarmtimer.c | 2 +- kernel/time/hrtimer.c | 2 +- kernel/time/posix-cpu-timers.c | 3 +-- 84 files changed, 62 insertions(+), 194 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h index 48bbea6898b3..d5b98ab514bb 100644 --- a/arch/alpha/include/asm/thread_info.h +++ b/arch/alpha/include/asm/thread_info.h @@ -27,8 +27,6 @@ struct thread_info { int bpt_nsaved; unsigned long bpt_addr[2]; /* breakpoint handling */ unsigned int bpt_insn[2]; - - struct restart_block restart_block; }; /* @@ -40,9 +38,6 @@ struct thread_info { .exec_domain = &default_exec_domain, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c index 6cec2881acbf..8dbfb15f1745 100644 --- a/arch/alpha/kernel/signal.c +++ b/arch/alpha/kernel/signal.c @@ -150,7 +150,7 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs) struct switch_stack *sw = (struct switch_stack *)regs - 1; long i, err = __get_user(regs->pc, &sc->sc_pc); - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; sw->r26 = (unsigned long) ret_from_sys_call; diff --git a/arch/arc/include/asm/thread_info.h b/arch/arc/include/asm/thread_info.h index 02bc5ec0fb2e..1163a1838ac1 100644 --- a/arch/arc/include/asm/thread_info.h +++ b/arch/arc/include/asm/thread_info.h @@ -46,7 +46,6 @@ struct thread_info { struct exec_domain *exec_domain;/* execution domain */ __u32 cpu; /* current CPU */ unsigned long thr_ptr; /* TLS ptr */ - struct restart_block restart_block; }; /* @@ -62,9 +61,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/arc/kernel/signal.c b/arch/arc/kernel/signal.c index cb3142a2d40b..114234e83caa 100644 --- a/arch/arc/kernel/signal.c +++ b/arch/arc/kernel/signal.c @@ -104,7 +104,7 @@ SYSCALL_DEFINE0(rt_sigreturn) struct pt_regs *regs = current_pt_regs(); /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* Since we stacked the signal on a word boundary, * then 'sp' should be word aligned here. 
If it's diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h index d890e41f5520..72812a1f3d1c 100644 --- a/arch/arm/include/asm/thread_info.h +++ b/arch/arm/include/asm/thread_info.h @@ -68,7 +68,6 @@ struct thread_info { #ifdef CONFIG_ARM_THUMBEE unsigned long thumbee_state; /* ThumbEE Handler Base register */ #endif - struct restart_block restart_block; }; #define INIT_THREAD_INFO(tsk) \ @@ -81,9 +80,6 @@ struct thread_info { .cpu_domain = domain_val(DOMAIN_USER, DOMAIN_MANAGER) | \ domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \ domain_val(DOMAIN_IO, DOMAIN_CLIENT), \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index 8aa6f1b87c9e..023ac905e4c3 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -191,7 +191,7 @@ asmlinkage int sys_sigreturn(struct pt_regs *regs) struct sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, @@ -221,7 +221,7 @@ asmlinkage int sys_rt_sigreturn(struct pt_regs *regs) struct rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 459bf8e53208..702e1e6a0d80 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -48,7 +48,6 @@ struct thread_info { mm_segment_t addr_limit; /* address limit */ struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ - struct restart_block restart_block; int preempt_count; /* 0 => preemptable, <0 => bug */ int cpu; /* cpu */ }; @@ -60,9 +59,6 @@ struct thread_info { .flags = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 6fa792137eda..660ccf9f7524 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -131,7 +131,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) struct rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 128-bit boundary, then 'sp' should diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c index e299de396e9b..c20a300e2213 100644 --- a/arch/arm64/kernel/signal32.c +++ b/arch/arm64/kernel/signal32.c @@ -347,7 +347,7 @@ asmlinkage int compat_sys_sigreturn(struct pt_regs *regs) struct compat_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, @@ -381,7 +381,7 @@ asmlinkage int compat_sys_rt_sigreturn(struct pt_regs *regs) struct compat_rt_sigframe __user 
*frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, diff --git a/arch/avr32/include/asm/thread_info.h b/arch/avr32/include/asm/thread_info.h index a978f3fe7c25..d56afa99a514 100644 --- a/arch/avr32/include/asm/thread_info.h +++ b/arch/avr32/include/asm/thread_info.h @@ -30,7 +30,6 @@ struct thread_info { saved by debug handler when setting up trampoline */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -41,9 +40,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall \ - } \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/avr32/kernel/asm-offsets.c b/arch/avr32/kernel/asm-offsets.c index d6a8193a1d2f..e41c84516e5d 100644 --- a/arch/avr32/kernel/asm-offsets.c +++ b/arch/avr32/kernel/asm-offsets.c @@ -18,7 +18,6 @@ void foo(void) OFFSET(TI_preempt_count, thread_info, preempt_count); OFFSET(TI_rar_saved, thread_info, rar_saved); OFFSET(TI_rsr_saved, thread_info, rsr_saved); - OFFSET(TI_restart_block, thread_info, restart_block); BLANK(); OFFSET(TSK_active_mm, task_struct, active_mm); BLANK(); diff --git a/arch/avr32/kernel/signal.c b/arch/avr32/kernel/signal.c index d309fbcc3bd6..8f1c63b9b983 100644 --- a/arch/avr32/kernel/signal.c +++ b/arch/avr32/kernel/signal.c @@ -69,7 +69,7 @@ asmlinkage int sys_rt_sigreturn(struct pt_regs *regs) sigset_t set; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *)regs->sp; pr_debug("SIG return: frame = %p\n", frame); diff --git a/arch/blackfin/include/asm/thread_info.h b/arch/blackfin/include/asm/thread_info.h index 55f473bdad36..57c3a8bd583d 100644 --- a/arch/blackfin/include/asm/thread_info.h +++ b/arch/blackfin/include/asm/thread_info.h @@ -42,7 +42,6 @@ struct thread_info { int cpu; /* cpu we're on */ int preempt_count; /* 0 => preemptable, <0 => BUG */ mm_segment_t addr_limit; /* address limit */ - struct restart_block restart_block; #ifndef CONFIG_SMP struct l1_scratch_task_info l1_task_info; #endif @@ -58,9 +57,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) #define init_stack (init_thread_union.stack) diff --git a/arch/blackfin/kernel/signal.c b/arch/blackfin/kernel/signal.c index ef275571d885..f2a8b5493bd3 100644 --- a/arch/blackfin/kernel/signal.c +++ b/arch/blackfin/kernel/signal.c @@ -44,7 +44,7 @@ rt_restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, int *p int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; #define RESTORE(x) err |= __get_user(regs->x, &sc->sc_##x) diff --git a/arch/c6x/include/asm/thread_info.h b/arch/c6x/include/asm/thread_info.h index d4e9ef87076d..584e253f3217 100644 --- a/arch/c6x/include/asm/thread_info.h +++ b/arch/c6x/include/asm/thread_info.h @@ -45,7 +45,6 @@ struct thread_info { int cpu; /* cpu we're on */ int preempt_count; /* 0 = preemptable, <0 = BUG */ mm_segment_t addr_limit; /* 
thread address space */ - struct restart_block restart_block; }; /* @@ -61,9 +60,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/c6x/kernel/signal.c b/arch/c6x/kernel/signal.c index fe68226f6c4d..3c4bb5a5c382 100644 --- a/arch/c6x/kernel/signal.c +++ b/arch/c6x/kernel/signal.c @@ -68,7 +68,7 @@ asmlinkage int do_rt_sigreturn(struct pt_regs *regs) sigset_t set; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a dword boundary, diff --git a/arch/cris/arch-v10/kernel/signal.c b/arch/cris/arch-v10/kernel/signal.c index 9b32d338838b..74d7ba35120d 100644 --- a/arch/cris/arch-v10/kernel/signal.c +++ b/arch/cris/arch-v10/kernel/signal.c @@ -67,7 +67,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc) unsigned long old_usp; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* restore the regs from &sc->regs (same as sc, since regs is first) * (sc is already checked for VERIFY_READ since the sigframe was diff --git a/arch/cris/arch-v32/kernel/signal.c b/arch/cris/arch-v32/kernel/signal.c index 78ce3b1c9bcb..870e3e069318 100644 --- a/arch/cris/arch-v32/kernel/signal.c +++ b/arch/cris/arch-v32/kernel/signal.c @@ -59,7 +59,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc) unsigned long old_usp; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Restore the registers from &sc->regs. 
sc is already checked diff --git a/arch/cris/include/asm/thread_info.h b/arch/cris/include/asm/thread_info.h index 55dede18c032..7286db5ed90e 100644 --- a/arch/cris/include/asm/thread_info.h +++ b/arch/cris/include/asm/thread_info.h @@ -38,7 +38,6 @@ struct thread_info { 0-0xBFFFFFFF for user-thead 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -56,9 +55,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/frv/include/asm/thread_info.h b/arch/frv/include/asm/thread_info.h index af29e17c0181..6b917f1c2955 100644 --- a/arch/frv/include/asm/thread_info.h +++ b/arch/frv/include/asm/thread_info.h @@ -41,7 +41,6 @@ struct thread_info { * 0-0xBFFFFFFF for user-thead * 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -65,9 +64,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/frv/kernel/asm-offsets.c b/arch/frv/kernel/asm-offsets.c index 9de96843a278..446e89d500cc 100644 --- a/arch/frv/kernel/asm-offsets.c +++ b/arch/frv/kernel/asm-offsets.c @@ -40,7 +40,6 @@ void foo(void) OFFSET(TI_CPU, thread_info, cpu); OFFSET(TI_PREEMPT_COUNT, thread_info, preempt_count); OFFSET(TI_ADDR_LIMIT, thread_info, addr_limit); - OFFSET(TI_RESTART_BLOCK, thread_info, restart_block); BLANK(); /* offsets into register file storage */ diff --git a/arch/frv/kernel/signal.c b/arch/frv/kernel/signal.c index dc3d59de0870..336713ab4745 100644 --- a/arch/frv/kernel/signal.c +++ b/arch/frv/kernel/signal.c @@ -62,7 +62,7 @@ static int restore_sigcontext(struct sigcontext __user *sc, int *_gr8) unsigned long tbr, psr; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; tbr = user->i.tbr; psr = user->i.psr; diff --git a/arch/hexagon/include/asm/thread_info.h b/arch/hexagon/include/asm/thread_info.h index a59dad3b3695..bacd3d6030c5 100644 --- a/arch/hexagon/include/asm/thread_info.h +++ b/arch/hexagon/include/asm/thread_info.h @@ -56,7 +56,6 @@ struct thread_info { * used for syscalls somehow; * seems to have a function pointer and four arguments */ - struct restart_block restart_block; /* Points to the current pt_regs frame */ struct pt_regs *regs; /* @@ -83,9 +82,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = 1, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .sp = 0, \ .regs = NULL, \ } diff --git a/arch/hexagon/kernel/signal.c b/arch/hexagon/kernel/signal.c index eadd70e47e7e..b039a624c170 100644 --- a/arch/hexagon/kernel/signal.c +++ b/arch/hexagon/kernel/signal.c @@ -239,7 +239,7 @@ asmlinkage int sys_rt_sigreturn(void) sigset_t blocked; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *)pt_psp(regs); if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h index 5b17418b4223..c16f21a068ff 100644 --- 
a/arch/ia64/include/asm/thread_info.h +++ b/arch/ia64/include/asm/thread_info.h @@ -27,7 +27,6 @@ struct thread_info { __u32 status; /* Thread synchronous flags */ mm_segment_t addr_limit; /* user-level address space limit */ int preempt_count; /* 0=premptable, <0=BUG; will also serve as bh-counter */ - struct restart_block restart_block; #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE __u64 ac_stamp; __u64 ac_leave; @@ -46,9 +45,6 @@ struct thread_info { .cpu = 0, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #ifndef ASM_OFFSETS_C diff --git a/arch/ia64/kernel/signal.c b/arch/ia64/kernel/signal.c index 6d92170be457..b3a124da71e5 100644 --- a/arch/ia64/kernel/signal.c +++ b/arch/ia64/kernel/signal.c @@ -46,7 +46,7 @@ restore_sigcontext (struct sigcontext __user *sc, struct sigscratch *scr) long err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* restore scratch that always needs gets updated during signal delivery: */ err = __get_user(flags, &sc->sc_flags); diff --git a/arch/m32r/include/asm/thread_info.h b/arch/m32r/include/asm/thread_info.h index 00171703402f..32422d0211c3 100644 --- a/arch/m32r/include/asm/thread_info.h +++ b/arch/m32r/include/asm/thread_info.h @@ -34,7 +34,6 @@ struct thread_info { 0-0xBFFFFFFF for user-thread 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -49,7 +48,6 @@ struct thread_info { #define TI_CPU 0x00000010 #define TI_PRE_COUNT 0x00000014 #define TI_ADDR_LIMIT 0x00000018 -#define TI_RESTART_BLOCK 0x000001C #endif @@ -68,9 +66,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/m32r/kernel/signal.c b/arch/m32r/kernel/signal.c index 95408b8f130a..7736c6660a15 100644 --- a/arch/m32r/kernel/signal.c +++ b/arch/m32r/kernel/signal.c @@ -48,7 +48,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, unsigned int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; #define COPY(x) err |= __get_user(regs->x, &sc->sc_##x) COPY(r4); diff --git a/arch/m68k/include/asm/thread_info.h b/arch/m68k/include/asm/thread_info.h index 21a4784ca5a1..c54256e69e64 100644 --- a/arch/m68k/include/asm/thread_info.h +++ b/arch/m68k/include/asm/thread_info.h @@ -31,7 +31,6 @@ struct thread_info { int preempt_count; /* 0 => preemptable, <0 => BUG */ __u32 cpu; /* should always be 0 on m68k */ unsigned long tp_value; /* thread pointer */ - struct restart_block restart_block; }; #endif /* __ASSEMBLY__ */ @@ -41,9 +40,6 @@ struct thread_info { .exec_domain = &default_exec_domain, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_stack (init_thread_union.stack) diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c index 967a8b7e1527..d7179281e74a 100644 --- a/arch/m68k/kernel/signal.c +++ b/arch/m68k/kernel/signal.c @@ -655,7 +655,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *usc, void __u int err = 0; /* Always make any pending 
restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* get previous context */ if (copy_from_user(&context, usc, sizeof(context))) @@ -693,7 +693,7 @@ rt_restore_ucontext(struct pt_regs *regs, struct switch_stack *sw, int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err = __get_user(temp, &uc->uc_mcontext.version); if (temp != MCONTEXT_VERSION) diff --git a/arch/metag/include/asm/thread_info.h b/arch/metag/include/asm/thread_info.h index 47711336119e..afb3ca4776d1 100644 --- a/arch/metag/include/asm/thread_info.h +++ b/arch/metag/include/asm/thread_info.h @@ -35,9 +35,8 @@ struct thread_info { int preempt_count; /* 0 => preemptable, <0 => BUG */ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; - u8 supervisor_stack[0]; + u8 supervisor_stack[0] __aligned(8); }; #else /* !__ASSEMBLY__ */ @@ -74,9 +73,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/metag/kernel/signal.c b/arch/metag/kernel/signal.c index 0d100d5c1407..ce49d429c74a 100644 --- a/arch/metag/kernel/signal.c +++ b/arch/metag/kernel/signal.c @@ -48,7 +48,7 @@ static int restore_sigcontext(struct pt_regs *regs, int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err = metag_gp_regs_copyin(regs, 0, sizeof(struct user_gp_regs), NULL, &sc->regs); diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h index 8c9d36591a03..b699fbd7de4a 100644 --- a/arch/microblaze/include/asm/thread_info.h +++ b/arch/microblaze/include/asm/thread_info.h @@ -71,7 +71,6 @@ struct thread_info { __u32 cpu; /* current CPU */ __s32 preempt_count; /* 0 => preemptable,< 0 => BUG*/ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; struct cpu_context cpu_context; }; @@ -87,9 +86,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c index 235706055b7f..a1cbaf90e2ea 100644 --- a/arch/microblaze/kernel/signal.c +++ b/arch/microblaze/kernel/signal.c @@ -89,7 +89,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) int rval; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h index e4440f92b366..9e1295f874f0 100644 --- a/arch/mips/include/asm/thread_info.h +++ b/arch/mips/include/asm/thread_info.h @@ -34,7 +34,6 @@ struct thread_info { * 0x7fffffff for user-thead * 0xffffffff for kernel-thread */ - struct restart_block restart_block; struct pt_regs *regs; long syscall; /* syscall number */ }; @@ -50,9 +49,6 @@ struct 
thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/mips/kernel/asm-offsets.c b/arch/mips/kernel/asm-offsets.c index b1d84bd4efb3..3b2dfdb4865f 100644 --- a/arch/mips/kernel/asm-offsets.c +++ b/arch/mips/kernel/asm-offsets.c @@ -98,7 +98,6 @@ void output_thread_info_defines(void) OFFSET(TI_CPU, thread_info, cpu); OFFSET(TI_PRE_COUNT, thread_info, preempt_count); OFFSET(TI_ADDR_LIMIT, thread_info, addr_limit); - OFFSET(TI_RESTART_BLOCK, thread_info, restart_block); OFFSET(TI_REGS, thread_info, regs); DEFINE(_THREAD_SIZE, THREAD_SIZE); DEFINE(_THREAD_MASK, THREAD_MASK); diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c index 545bf11bd2ed..6a28c792d862 100644 --- a/arch/mips/kernel/signal.c +++ b/arch/mips/kernel/signal.c @@ -243,7 +243,7 @@ int restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc) int i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err |= __get_user(regs->cp0_epc, &sc->sc_pc); diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c index d69179c0d49d..19a7705f2a01 100644 --- a/arch/mips/kernel/signal32.c +++ b/arch/mips/kernel/signal32.c @@ -220,7 +220,7 @@ static int restore_sigcontext32(struct pt_regs *regs, int i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err |= __get_user(regs->cp0_epc, &sc->sc_pc); err |= __get_user(regs->hi, &sc->sc_mdhi); diff --git a/arch/mn10300/include/asm/thread_info.h b/arch/mn10300/include/asm/thread_info.h index bf280eaccd36..c1c374f0ec12 100644 --- a/arch/mn10300/include/asm/thread_info.h +++ b/arch/mn10300/include/asm/thread_info.h @@ -50,7 +50,6 @@ struct thread_info { 0-0xBFFFFFFF for user-thead 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -80,9 +79,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/mn10300/kernel/asm-offsets.c b/arch/mn10300/kernel/asm-offsets.c index 47b3bb0c04ff..d780670cbaf3 100644 --- a/arch/mn10300/kernel/asm-offsets.c +++ b/arch/mn10300/kernel/asm-offsets.c @@ -28,7 +28,6 @@ void foo(void) OFFSET(TI_cpu, thread_info, cpu); OFFSET(TI_preempt_count, thread_info, preempt_count); OFFSET(TI_addr_limit, thread_info, addr_limit); - OFFSET(TI_restart_block, thread_info, restart_block); BLANK(); OFFSET(REG_D0, pt_regs, d0); diff --git a/arch/mn10300/kernel/signal.c b/arch/mn10300/kernel/signal.c index a6c0858592c3..8609845f12c5 100644 --- a/arch/mn10300/kernel/signal.c +++ b/arch/mn10300/kernel/signal.c @@ -40,7 +40,7 @@ static int restore_sigcontext(struct pt_regs *regs, unsigned int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (is_using_fpu(current)) fpu_kill_state(current); diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h index d797acc901e4..875f0845a707 100644 --- 
a/arch/openrisc/include/asm/thread_info.h +++ b/arch/openrisc/include/asm/thread_info.h @@ -57,7 +57,6 @@ struct thread_info { 0-0x7FFFFFFF for user-thead 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; /* saved context data */ @@ -79,9 +78,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = 1, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .ksp = 0, \ } diff --git a/arch/openrisc/kernel/signal.c b/arch/openrisc/kernel/signal.c index 7d1b8235bf90..4112175bf803 100644 --- a/arch/openrisc/kernel/signal.c +++ b/arch/openrisc/kernel/signal.c @@ -46,7 +46,7 @@ static int restore_sigcontext(struct pt_regs *regs, int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Restore the regs from &sc->regs. diff --git a/arch/parisc/include/asm/thread_info.h b/arch/parisc/include/asm/thread_info.h index a84611835549..fb13e3865563 100644 --- a/arch/parisc/include/asm/thread_info.h +++ b/arch/parisc/include/asm/thread_info.h @@ -14,7 +14,6 @@ struct thread_info { mm_segment_t addr_limit; /* user-level address space limit */ __u32 cpu; /* current CPU */ int preempt_count; /* 0=premptable, <0=BUG; will also serve as bh-counter */ - struct restart_block restart_block; }; #define INIT_THREAD_INFO(tsk) \ @@ -25,9 +24,6 @@ struct thread_info { .cpu = 0, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall \ - } \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/parisc/kernel/signal.c b/arch/parisc/kernel/signal.c index 012d4fa63d97..9b910a0251b8 100644 --- a/arch/parisc/kernel/signal.c +++ b/arch/parisc/kernel/signal.c @@ -99,7 +99,7 @@ sys_rt_sigreturn(struct pt_regs *regs, int in_syscall) sigframe_size = PARISC_RT_SIGFRAME_SIZE32; #endif - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* Unwind the user stack to get the rt_sigframe structure. 
*/ frame = (struct rt_sigframe __user *) diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index e8abc83e699f..72489799cf02 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -43,7 +43,6 @@ struct thread_info { int cpu; /* cpu we're on */ int preempt_count; /* 0 => preemptable, <0 => BUG */ - struct restart_block restart_block; unsigned long local_flags; /* private flags for thread */ /* low level flags - has atomic operations done on it */ @@ -59,9 +58,6 @@ struct thread_info { .exec_domain = &default_exec_domain, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .flags = 0, \ } diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index b171001698ff..d3a831ac0f92 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -1231,7 +1231,7 @@ long sys_rt_sigreturn(int r3, int r4, int r5, int r6, int r7, int r8, int tm_restore = 0; #endif /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; rt_sf = (struct rt_sigframe __user *) (regs->gpr[1] + __SIGNAL_FRAMESIZE + 16); @@ -1504,7 +1504,7 @@ long sys_sigreturn(int r3, int r4, int r5, int r6, int r7, int r8, #endif /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; sf = (struct sigframe __user *)(regs->gpr[1] + __SIGNAL_FRAMESIZE); sc = &sf->sctx; diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 2cb0c94cafa5..c7c24d2e2bdb 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -666,7 +666,7 @@ int sys_rt_sigreturn(unsigned long r3, unsigned long r4, unsigned long r5, #endif /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, uc, sizeof(*uc))) goto badframe; diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h index 4d62fd5b56e5..ef1df718642d 100644 --- a/arch/s390/include/asm/thread_info.h +++ b/arch/s390/include/asm/thread_info.h @@ -39,7 +39,6 @@ struct thread_info { unsigned long sys_call_table; /* System call table address */ unsigned int cpu; /* current CPU */ int preempt_count; /* 0 => preemptable, <0 => BUG */ - struct restart_block restart_block; unsigned int system_call; __u64 user_timer; __u64 system_timer; @@ -56,9 +55,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/s390/kernel/compat_signal.c b/arch/s390/kernel/compat_signal.c index 34d5fa7b01b5..bc1df12dd4f8 100644 --- a/arch/s390/kernel/compat_signal.c +++ b/arch/s390/kernel/compat_signal.c @@ -209,7 +209,7 @@ static int restore_sigregs32(struct pt_regs *regs,_sigregs32 __user *sregs) int i; /* Alwys make any pending restarted system call return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (__copy_from_user(&user_sregs, &sregs->regs, sizeof(user_sregs))) return -EFAULT; diff --git 
a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c index 6a2ac257d98f..b3ae6f70c6d6 100644 --- a/arch/s390/kernel/signal.c +++ b/arch/s390/kernel/signal.c @@ -162,7 +162,7 @@ static int restore_sigregs(struct pt_regs *regs, _sigregs __user *sregs) _sigregs user_sregs; /* Alwys make any pending restarted system call return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (__copy_from_user(&user_sregs, sregs, sizeof(user_sregs))) return -EFAULT; diff --git a/arch/score/include/asm/thread_info.h b/arch/score/include/asm/thread_info.h index 656b7ada9326..33864fa2a8d4 100644 --- a/arch/score/include/asm/thread_info.h +++ b/arch/score/include/asm/thread_info.h @@ -42,7 +42,6 @@ struct thread_info { * 0-0xFFFFFFFF for kernel-thread */ mm_segment_t addr_limit; - struct restart_block restart_block; struct pt_regs *regs; }; @@ -58,9 +57,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = 1, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/score/kernel/asm-offsets.c b/arch/score/kernel/asm-offsets.c index 57788f44c6fb..b4d5214a7a7e 100644 --- a/arch/score/kernel/asm-offsets.c +++ b/arch/score/kernel/asm-offsets.c @@ -106,7 +106,6 @@ void output_thread_info_defines(void) OFFSET(TI_CPU, thread_info, cpu); OFFSET(TI_PRE_COUNT, thread_info, preempt_count); OFFSET(TI_ADDR_LIMIT, thread_info, addr_limit); - OFFSET(TI_RESTART_BLOCK, thread_info, restart_block); OFFSET(TI_REGS, thread_info, regs); DEFINE(KERNEL_STACK_SIZE, THREAD_SIZE); DEFINE(KERNEL_STACK_MASK, THREAD_MASK); diff --git a/arch/score/kernel/signal.c b/arch/score/kernel/signal.c index 1651807774ad..e381c8c4ff65 100644 --- a/arch/score/kernel/signal.c +++ b/arch/score/kernel/signal.c @@ -141,7 +141,7 @@ score_rt_sigreturn(struct pt_regs *regs) int sig; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *) regs->regs[0]; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h index ad27ffa65e2e..657c03919627 100644 --- a/arch/sh/include/asm/thread_info.h +++ b/arch/sh/include/asm/thread_info.h @@ -33,7 +33,6 @@ struct thread_info { __u32 cpu; int preempt_count; /* 0 => preemptable, <0 => BUG */ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; unsigned long previous_sp; /* sp of previous stack in case of nested IRQ stacks */ __u8 supervisor_stack[0]; @@ -63,9 +62,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/sh/kernel/asm-offsets.c b/arch/sh/kernel/asm-offsets.c index 08a2be775b6c..542225fedb11 100644 --- a/arch/sh/kernel/asm-offsets.c +++ b/arch/sh/kernel/asm-offsets.c @@ -25,7 +25,6 @@ int main(void) DEFINE(TI_FLAGS, offsetof(struct thread_info, flags)); DEFINE(TI_CPU, offsetof(struct thread_info, cpu)); DEFINE(TI_PRE_COUNT, offsetof(struct thread_info, preempt_count)); - DEFINE(TI_RESTART_BLOCK,offsetof(struct thread_info, restart_block)); DEFINE(TI_SIZE, sizeof(struct thread_info)); #ifdef CONFIG_HIBERNATION diff --git a/arch/sh/kernel/signal_32.c 
b/arch/sh/kernel/signal_32.c index 2f002b24fb92..0b34f2a704fe 100644 --- a/arch/sh/kernel/signal_32.c +++ b/arch/sh/kernel/signal_32.c @@ -156,7 +156,7 @@ asmlinkage int sys_sigreturn(void) int r0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; @@ -186,7 +186,7 @@ asmlinkage int sys_rt_sigreturn(void) int r0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; diff --git a/arch/sh/kernel/signal_64.c b/arch/sh/kernel/signal_64.c index 897abe7b871e..71993c6a7d94 100644 --- a/arch/sh/kernel/signal_64.c +++ b/arch/sh/kernel/signal_64.c @@ -260,7 +260,7 @@ asmlinkage int sys_sigreturn(unsigned long r2, unsigned long r3, long long ret; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; @@ -294,7 +294,7 @@ asmlinkage int sys_rt_sigreturn(unsigned long r2, unsigned long r3, long long ret; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; diff --git a/arch/sparc/include/asm/thread_info_32.h b/arch/sparc/include/asm/thread_info_32.h index 025c98446b1e..fd7bd0a440ca 100644 --- a/arch/sparc/include/asm/thread_info_32.h +++ b/arch/sparc/include/asm/thread_info_32.h @@ -47,8 +47,6 @@ struct thread_info { struct reg_window32 reg_window[NSWINS]; /* align for ldd! 
*/ unsigned long rwbuf_stkptrs[NSWINS]; unsigned long w_saved; - - struct restart_block restart_block; }; /* @@ -62,9 +60,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) @@ -103,7 +98,6 @@ register struct thread_info *current_thread_info_reg asm("g6"); #define TI_REG_WINDOW 0x30 #define TI_RWIN_SPTRS 0x230 #define TI_W_SAVED 0x250 -/* #define TI_RESTART_BLOCK 0x25n */ /* Nobody cares */ /* * thread information flag bit numbers diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h index 798f0279a4b5..ff455164732a 100644 --- a/arch/sparc/include/asm/thread_info_64.h +++ b/arch/sparc/include/asm/thread_info_64.h @@ -58,8 +58,6 @@ struct thread_info { unsigned long gsr[7]; unsigned long xfsr[7]; - struct restart_block restart_block; - struct pt_regs *kern_una_regs; unsigned int kern_una_insn; @@ -92,10 +90,9 @@ struct thread_info { #define TI_RWIN_SPTRS 0x000003c8 #define TI_GSR 0x00000400 #define TI_XFSR 0x00000438 -#define TI_RESTART_BLOCK 0x00000470 -#define TI_KUNA_REGS 0x000004a0 -#define TI_KUNA_INSN 0x000004a8 -#define TI_FPREGS 0x000004c0 +#define TI_KUNA_REGS 0x00000470 +#define TI_KUNA_INSN 0x00000478 +#define TI_FPREGS 0x00000480 /* We embed this in the uppermost byte of thread_info->flags */ #define FAULT_CODE_WRITE 0x01 /* Write access, implies D-TLB */ @@ -124,9 +121,6 @@ struct thread_info { .current_ds = ASI_P, \ .exec_domain = &default_exec_domain, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c index 62deba7be1a9..4eed773a7735 100644 --- a/arch/sparc/kernel/signal32.c +++ b/arch/sparc/kernel/signal32.c @@ -150,7 +150,7 @@ void do_sigreturn32(struct pt_regs *regs) int err, i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack(); @@ -235,7 +235,7 @@ asmlinkage void do_rt_sigreturn32(struct pt_regs *regs) int err, i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack(); regs->u_regs[UREG_FP] &= 0x00000000ffffffffUL; diff --git a/arch/sparc/kernel/signal_32.c b/arch/sparc/kernel/signal_32.c index 9ee72fc8e0e4..52aa5e4ce5e7 100644 --- a/arch/sparc/kernel/signal_32.c +++ b/arch/sparc/kernel/signal_32.c @@ -70,7 +70,7 @@ asmlinkage void do_sigreturn(struct pt_regs *regs) int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack(); diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c index 1a6999868031..d88beff47bab 100644 --- a/arch/sparc/kernel/signal_64.c +++ b/arch/sparc/kernel/signal_64.c @@ -254,7 +254,7 @@ void do_rt_sigreturn(struct pt_regs *regs) int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack (); sf = (struct rt_signal_frame 
__user *) diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c index 981a769b9558..a27651e866e7 100644 --- a/arch/sparc/kernel/traps_64.c +++ b/arch/sparc/kernel/traps_64.c @@ -2730,8 +2730,6 @@ void __init trap_init(void) TI_NEW_CHILD != offsetof(struct thread_info, new_child) || TI_CURRENT_DS != offsetof(struct thread_info, current_ds) || - TI_RESTART_BLOCK != offsetof(struct thread_info, - restart_block) || TI_KUNA_REGS != offsetof(struct thread_info, kern_una_regs) || TI_KUNA_INSN != offsetof(struct thread_info, diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h index 48e4fd0f38e4..96c14c1430d8 100644 --- a/arch/tile/include/asm/thread_info.h +++ b/arch/tile/include/asm/thread_info.h @@ -36,7 +36,6 @@ struct thread_info { mm_segment_t addr_limit; /* thread address space (KERNEL_DS or USER_DS) */ - struct restart_block restart_block; struct single_step_state *step_state; /* single step state (if non-zero) */ int align_ctl; /* controls unaligned access */ @@ -57,9 +56,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .step_state = NULL, \ .align_ctl = 0, \ } diff --git a/arch/tile/kernel/signal.c b/arch/tile/kernel/signal.c index bb0a9ce7ae23..8a524e332c1a 100644 --- a/arch/tile/kernel/signal.c +++ b/arch/tile/kernel/signal.c @@ -48,7 +48,7 @@ int restore_sigcontext(struct pt_regs *regs, int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Enforce that sigcontext is like pt_regs, and doesn't mess diff --git a/arch/um/include/asm/thread_info.h b/arch/um/include/asm/thread_info.h index 1c5b2a83046a..e04114c4fcd9 100644 --- a/arch/um/include/asm/thread_info.h +++ b/arch/um/include/asm/thread_info.h @@ -22,7 +22,6 @@ struct thread_info { mm_segment_t addr_limit; /* thread address space: 0-0xBFFFFFFF for user 0-0xFFFFFFFF for kernel */ - struct restart_block restart_block; struct thread_info *real_thread; /* Points to non-IRQ stack */ }; @@ -34,9 +33,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .real_thread = NULL, \ } diff --git a/arch/unicore32/include/asm/thread_info.h b/arch/unicore32/include/asm/thread_info.h index af36d8eabdf1..63e2839dfeb8 100644 --- a/arch/unicore32/include/asm/thread_info.h +++ b/arch/unicore32/include/asm/thread_info.h @@ -79,7 +79,6 @@ struct thread_info { #ifdef CONFIG_UNICORE_FPU_F64 struct fp_state fpstate __attribute__((aligned(8))); #endif - struct restart_block restart_block; }; #define INIT_THREAD_INFO(tsk) \ @@ -89,9 +88,6 @@ struct thread_info { .flags = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/unicore32/kernel/signal.c b/arch/unicore32/kernel/signal.c index 7c8fb7018dc6..d329f85766cc 100644 --- a/arch/unicore32/kernel/signal.c +++ b/arch/unicore32/kernel/signal.c @@ -105,7 +105,7 @@ asmlinkage int __sys_rt_sigreturn(struct pt_regs *regs) struct rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; 
/* * Since we stacked the signal on a 64-bit boundary, diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index f9e181aaba97..d0165c9a2932 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -169,7 +169,7 @@ static int ia32_restore_sigcontext(struct pt_regs *regs, u32 tmp; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; get_user_try { /* diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index e82e95abc92b..1d4e4f279a32 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -31,7 +31,6 @@ struct thread_info { __u32 cpu; /* current CPU */ int saved_preempt_count; mm_segment_t addr_limit; - struct restart_block restart_block; void __user *sysenter_return; unsigned int sig_on_uaccess_error:1; unsigned int uaccess_err:1; /* uaccess failed */ @@ -45,9 +44,6 @@ struct thread_info { .cpu = 0, \ .saved_preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 2a33c8f68319..e5042463c1bc 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -69,7 +69,7 @@ int restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, unsigned int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; get_user_try { diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c index 79d824551c1a..0c8c32bfd792 100644 --- a/arch/x86/um/signal.c +++ b/arch/x86/um/signal.c @@ -157,7 +157,7 @@ static int copy_sc_from_user(struct pt_regs *regs, int err, pid; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err = copy_from_user(&sc, from, sizeof(sc)); if (err) diff --git a/arch/xtensa/include/asm/thread_info.h b/arch/xtensa/include/asm/thread_info.h index 470153e8547c..a9b5d3ba196c 100644 --- a/arch/xtensa/include/asm/thread_info.h +++ b/arch/xtensa/include/asm/thread_info.h @@ -51,7 +51,6 @@ struct thread_info { __s32 preempt_count; /* 0 => preemptable,< 0 => BUG*/ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; unsigned long cpenable; @@ -72,7 +71,6 @@ struct thread_info { #define TI_CPU 0x00000010 #define TI_PRE_COUNT 0x00000014 #define TI_ADDR_LIMIT 0x00000018 -#define TI_RESTART_BLOCK 0x000001C #endif @@ -90,9 +88,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index 4612321c73cc..3d733ba16f28 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -245,7 +245,7 @@ asmlinkage long xtensa_rt_sigreturn(long a0, long a1, long a2, long a3, int ret; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (regs->depc > 64) 
panic("rt_sigreturn in double exception!\n"); diff --git a/fs/select.c b/fs/select.c index 467bb1cb3ea5..f684c750e08a 100644 --- a/fs/select.c +++ b/fs/select.c @@ -971,7 +971,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds, if (ret == -EINTR) { struct restart_block *restart_block; - restart_block = ¤t_thread_info()->restart_block; + restart_block = ¤t->restart_block; restart_block->fn = do_restart_poll; restart_block->poll.ufds = ufds; restart_block->poll.nfds = nfds; diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 3037fc085e8e..d3d43ecf148c 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -193,6 +193,9 @@ extern struct task_group root_task_group; .nr_cpus_allowed= NR_CPUS, \ .mm = NULL, \ .active_mm = &init_mm, \ + .restart_block = { \ + .fn = do_no_restart_syscall, \ + }, \ .se = { \ .group_node = LIST_HEAD_INIT(tsk.se.group_node), \ }, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 8db31ef98d2f..22ee0d5d7f8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1370,6 +1370,8 @@ struct task_struct { unsigned long atomic_flags; /* Flags needing atomic access. */ + struct restart_block restart_block; + pid_t pid; pid_t tgid; diff --git a/kernel/compat.c b/kernel/compat.c index ebb3c369d03d..24f00610c575 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -276,8 +276,7 @@ COMPAT_SYSCALL_DEFINE2(nanosleep, struct compat_timespec __user *, rqtp, * core implementation decides to return random nonsense. */ if (ret == -ERESTART_RESTARTBLOCK) { - struct restart_block *restart - = ¤t_thread_info()->restart_block; + struct restart_block *restart = ¤t->restart_block; restart->fn = compat_nanosleep_restart; restart->nanosleep.compat_rmtp = rmtp; @@ -860,7 +859,7 @@ COMPAT_SYSCALL_DEFINE4(clock_nanosleep, clockid_t, which_clock, int, flags, return -EFAULT; if (err == -ERESTART_RESTARTBLOCK) { - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = compat_clock_nanosleep_restart; restart->nanosleep.compat_rmtp = rmtp; } diff --git a/kernel/futex.c b/kernel/futex.c index 4eeb63de7e54..2a5e3830e953 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2217,7 +2217,7 @@ retry: if (!abs_time) goto out; - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = futex_wait_restart; restart->futex.uaddr = uaddr; restart->futex.val = val; diff --git a/kernel/signal.c b/kernel/signal.c index 16a305295256..33a52759cc0e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2501,7 +2501,7 @@ EXPORT_SYMBOL(unblock_all_signals); */ SYSCALL_DEFINE0(restart_syscall) { - struct restart_block *restart = ¤t_thread_info()->restart_block; + struct restart_block *restart = ¤t->restart_block; return restart->fn(restart); } diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index a7077d3ae52f..1b001ed1edb9 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -788,7 +788,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags, goto out; } - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = alarm_timer_nsleep_restart; restart->nanosleep.clockid = type; restart->nanosleep.expires = exp.tv64; diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 3f5e183c3d97..bee0c1f78091 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1583,7 +1583,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp, goto out; } - restart = 
&current_thread_info()->restart_block; + restart = &current->restart_block; restart->fn = hrtimer_nanosleep_restart; restart->nanosleep.clockid = t.timer.base->clockid; restart->nanosleep.rmtp = rmtp; diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index a16b67859e2a..0075da74abf0 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -1334,8 +1334,7 @@ static long posix_cpu_nsleep_restart(struct restart_block *restart_block); static int posix_cpu_nsleep(const clockid_t which_clock, int flags, struct timespec *rqtp, struct timespec __user *rmtp) { - struct restart_block *restart_block = - &current_thread_info()->restart_block; + struct restart_block *restart_block = &current->restart_block; struct itimerspec it; int error; -- cgit v1.2.3 From 545a2bf742fb41f17d03486dd8a8c74ad511dec2 Mon Sep 17 00:00:00 2001 From: Cyril Bur Date: Thu, 12 Feb 2015 15:01:24 -0800 Subject: kernel/sched/clock.c: add another clock for use with the soft lockup watchdog When the hypervisor pauses a virtualised kernel, the kernel will observe a jump in timebase; this can cause spurious messages from the softlockup detector. Whilst these messages are harmless, they are accompanied by a stack trace which causes undue concern and, more problematically, the stack trace in the guest has nothing to do with the observed problem and can only be misleading. Furthermore, on POWER8 this is completely avoidable with the introduction of the Virtual Time Base (VTB) register. This patch (of 2): This permits the use of arch-specific clocks, for which virtualised kernels can use their notion of 'running' time, not the elapsed wall time which will include host execution time. Signed-off-by: Cyril Bur Cc: Michael Ellerman Cc: Andrew Jones Acked-by: Don Zickus Cc: Ingo Molnar Cc: Ulrich Obergfell Cc: chai wen Cc: Fabian Frederick Cc: Aaron Tomlin Cc: Ben Zhang Cc: Martin Schwidefsky Cc: John Stultz Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched.h | 1 + kernel/sched/clock.c | 13 +++++++++++++ kernel/watchdog.c | 2 +- 3 files changed, 15 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 22ee0d5d7f8c..048b91b983ed 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2147,6 +2147,7 @@ extern unsigned long long notrace sched_clock(void); */ extern u64 cpu_clock(int cpu); extern u64 local_clock(void); +extern u64 running_clock(void); extern u64 sched_clock_cpu(int cpu); diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index c27e4f8f4879..c0a205101c23 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -420,3 +420,16 @@ u64 local_clock(void) EXPORT_SYMBOL_GPL(cpu_clock); EXPORT_SYMBOL_GPL(local_clock); + +/* + * Running clock - returns the time that has elapsed while a guest has been + * running. + * On a guest this value should be local_clock minus the time the guest was + * suspended by the hypervisor (for any reason). + * On bare metal this function should return the same as local_clock. + * Architectures and sub-architectures can override this.
+ */ +u64 __weak running_clock(void) +{ + return local_clock(); +} diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 70bf11815f84..3174bf8e3538 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -154,7 +154,7 @@ static int get_softlockup_thresh(void) */ static unsigned long get_timestamp(void) { - return local_clock() >> 30LL; /* 2^30 ~= 10^9 */ + return running_clock() >> 30LL; /* 2^30 ~= 10^9 */ } static void set_sample_period(void) -- cgit v1.2.3 From 205bd3d23e938530cb89ff9e14afda389ac85dc3 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Thu, 12 Feb 2015 15:01:34 -0800 Subject: printk: correct timeout comment, neaten MODULE_PARM_DESC Neaten the MODULE_PARM_DESC message. Use 30 seconds in the comment for the zap console locks timeout. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/printk/printk.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 02d6b6d28796..c06df7de0963 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -935,8 +935,8 @@ static int __init ignore_loglevel_setup(char *str) early_param("ignore_loglevel", ignore_loglevel_setup); module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR); -MODULE_PARM_DESC(ignore_loglevel, "ignore loglevel setting, to" -"print all kernel messages to the console."); +MODULE_PARM_DESC(ignore_loglevel, + "ignore loglevel setting (prints all kernel messages to the console)"); #ifdef CONFIG_BOOT_PRINTK_DELAY @@ -1419,16 +1419,16 @@ static void call_console_drivers(int level, const char *text, size_t len) } /* - * Zap console related locks when oopsing. Only zap at most once - * every 10 seconds, to leave time for slow consoles to print a - * full oops. + * Zap console related locks when oopsing. + * To leave time for slow consoles to print a full oops, + * only zap at most once every 30 seconds. */ static void zap_locks(void) { static unsigned long oops_timestamp; if (time_after_eq(jiffies, oops_timestamp) && - !time_after(jiffies, oops_timestamp + 30 * HZ)) + !time_after(jiffies, oops_timestamp + 30 * HZ)) return; oops_timestamp = jiffies; -- cgit v1.2.3 From 3810631332465d967ba5e27ea2c7dff2c9afac6c Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 12 Feb 2015 23:33:15 +0100 Subject: PM / sleep: Re-implement suspend-to-idle handling In preparation for adding support for quiescing timers in the final stage of suspend-to-idle transitions, rework the freeze_enter() function so that it makes the system wait on a wakeup event, rework the freeze_wake() function so that it terminates the suspend-to-idle loop, and rework the mechanism by which deep idle states are entered during suspend-to-idle. First of all, introduce a simple state machine for suspend-to-idle and make the code in question use it. Second, prevent freeze_enter() from losing wakeup events due to race conditions and ensure that the number of online CPUs won't change while it is being executed. In addition to that, make it force all of the CPUs re-enter the idle loop in case they are in idle states already (so they can enter deeper idle states if possible). Next, drop cpuidle_use_deepest_state() and replace use_deepest_state checks in cpuidle_select() and cpuidle_reflect() with a single suspend-to-idle state check in cpuidle_idle_call(). Finally, introduce cpuidle_enter_freeze() that will simply find the deepest idle state available to the given CPU and enter it using cpuidle_enter(). Signed-off-by: Rafael J.
Wysocki Acked-by: Peter Zijlstra (Intel) --- drivers/cpuidle/cpuidle.c | 49 +++++++++++++++++++++++++---------------------- include/linux/cpuidle.h | 4 ++-- include/linux/suspend.h | 16 ++++++++++++++++ kernel/power/suspend.c | 43 ++++++++++++++++++++++++++++++++++------- kernel/sched/idle.c | 16 ++++++++++++++++ 5 files changed, 96 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 125150dc6e81..23a8d6cc8d30 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include "cpuidle.h" @@ -32,7 +33,6 @@ LIST_HEAD(cpuidle_detected_devices); static int enabled_devices; static int off __read_mostly; static int initialized __read_mostly; -static bool use_deepest_state __read_mostly; int cpuidle_disabled(void) { @@ -66,24 +66,9 @@ int cpuidle_play_dead(void) } /** - * cpuidle_use_deepest_state - Enable/disable the "deepest idle" mode. - * @enable: Whether enable or disable the feature. - * - * If the "deepest idle" mode is enabled, cpuidle will ignore the governor and - * always use the state with the greatest exit latency (out of the states that - * are not disabled). - * - * This function can only be called after cpuidle_pause() to avoid races. - */ -void cpuidle_use_deepest_state(bool enable) -{ - use_deepest_state = enable; -} - -/** - * cpuidle_find_deepest_state - Find the state of the greatest exit latency. - * @drv: cpuidle driver for a given CPU. - * @dev: cpuidle device for a given CPU. + * cpuidle_find_deepest_state - Find deepest state meeting specific conditions. + * @drv: cpuidle driver for the given CPU. + * @dev: cpuidle device for the given CPU. */ static int cpuidle_find_deepest_state(struct cpuidle_driver *drv, struct cpuidle_device *dev) @@ -104,6 +89,27 @@ static int cpuidle_find_deepest_state(struct cpuidle_driver *drv, return ret; } +/** + * cpuidle_enter_freeze - Enter an idle state suitable for suspend-to-idle. + * + * Find the deepest state available and enter it. + */ +void cpuidle_enter_freeze(void) +{ + struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev); + int index; + + index = cpuidle_find_deepest_state(drv, dev); + if (index >= 0) + cpuidle_enter(drv, dev, index); + else + arch_cpu_idle(); + + /* Interrupts are enabled again here. 
*/ + local_irq_disable(); +} + /** * cpuidle_enter_state - enter the state and update stats * @dev: cpuidle device for this cpu @@ -166,9 +172,6 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) if (!drv || !dev || !dev->enabled) return -EBUSY; - if (unlikely(use_deepest_state)) - return cpuidle_find_deepest_state(drv, dev); - return cpuidle_curr_governor->select(drv, dev); } @@ -200,7 +203,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, */ void cpuidle_reflect(struct cpuidle_device *dev, int index) { - if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state)) + if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, index); } diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index ab70f3bc44ad..f63aabf4ee90 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -141,7 +141,7 @@ extern void cpuidle_resume(void); extern int cpuidle_enable_device(struct cpuidle_device *dev); extern void cpuidle_disable_device(struct cpuidle_device *dev); extern int cpuidle_play_dead(void); -extern void cpuidle_use_deepest_state(bool enable); +extern void cpuidle_enter_freeze(void); extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev); #else @@ -174,7 +174,7 @@ static inline int cpuidle_enable_device(struct cpuidle_device *dev) {return -ENODEV; } static inline void cpuidle_disable_device(struct cpuidle_device *dev) { } static inline int cpuidle_play_dead(void) {return -ENODEV; } -static inline void cpuidle_use_deepest_state(bool enable) {} +static inline void cpuidle_enter_freeze(void) { } static inline struct cpuidle_driver *cpuidle_get_cpu_driver( struct cpuidle_device *dev) {return NULL; } #endif diff --git a/include/linux/suspend.h b/include/linux/suspend.h index 3388c1b6f7d8..5efe743ce1e8 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -201,6 +201,21 @@ struct platform_freeze_ops { */ extern void suspend_set_ops(const struct platform_suspend_ops *ops); extern int suspend_valid_only_mem(suspend_state_t state); + +/* Suspend-to-idle state machine. */ +enum freeze_state { + FREEZE_STATE_NONE, /* Not suspended/suspending. */ + FREEZE_STATE_ENTER, /* Enter suspend-to-idle. */ + FREEZE_STATE_WAKE, /* Wake up from suspend-to-idle.
*/ +}; + +extern enum freeze_state __read_mostly suspend_freeze_state; + +static inline bool idle_should_freeze(void) +{ + return unlikely(suspend_freeze_state == FREEZE_STATE_ENTER); +} + extern void freeze_set_ops(const struct platform_freeze_ops *ops); extern void freeze_wake(void); @@ -228,6 +243,7 @@ extern int pm_suspend(suspend_state_t state); static inline void suspend_set_ops(const struct platform_suspend_ops *ops) {} static inline int pm_suspend(suspend_state_t state) { return -ENOSYS; } +static inline bool idle_should_freeze(void) { return false; } static inline void freeze_set_ops(const struct platform_freeze_ops *ops) {} static inline void freeze_wake(void) {} #endif /* !CONFIG_SUSPEND */ diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c index c347e3ce3a55..b7d6b3a721b1 100644 --- a/kernel/power/suspend.c +++ b/kernel/power/suspend.c @@ -37,7 +37,9 @@ const char *pm_states[PM_SUSPEND_MAX]; static const struct platform_suspend_ops *suspend_ops; static const struct platform_freeze_ops *freeze_ops; static DECLARE_WAIT_QUEUE_HEAD(suspend_freeze_wait_head); -static bool suspend_freeze_wake; + +enum freeze_state __read_mostly suspend_freeze_state; +static DEFINE_SPINLOCK(suspend_freeze_lock); void freeze_set_ops(const struct platform_freeze_ops *ops) { @@ -48,22 +50,49 @@ void freeze_set_ops(const struct platform_freeze_ops *ops) static void freeze_begin(void) { - suspend_freeze_wake = false; + suspend_freeze_state = FREEZE_STATE_NONE; } static void freeze_enter(void) { - cpuidle_use_deepest_state(true); + spin_lock_irq(&suspend_freeze_lock); + if (pm_wakeup_pending()) + goto out; + + suspend_freeze_state = FREEZE_STATE_ENTER; + spin_unlock_irq(&suspend_freeze_lock); + + get_online_cpus(); cpuidle_resume(); - wait_event(suspend_freeze_wait_head, suspend_freeze_wake); + + /* Push all the CPUs into the idle loop. */ + wake_up_all_idle_cpus(); + pr_debug("PM: suspend-to-idle\n"); + /* Make the current CPU wait so it can enter the idle loop too. */ + wait_event(suspend_freeze_wait_head, + suspend_freeze_state == FREEZE_STATE_WAKE); + pr_debug("PM: resume from suspend-to-idle\n"); + cpuidle_pause(); - cpuidle_use_deepest_state(false); + put_online_cpus(); + + spin_lock_irq(&suspend_freeze_lock); + + out: + suspend_freeze_state = FREEZE_STATE_NONE; + spin_unlock_irq(&suspend_freeze_lock); } void freeze_wake(void) { - suspend_freeze_wake = true; - wake_up(&suspend_freeze_wait_head); + unsigned long flags; + + spin_lock_irqsave(&suspend_freeze_lock, flags); + if (suspend_freeze_state > FREEZE_STATE_NONE) { + suspend_freeze_state = FREEZE_STATE_WAKE; + wake_up(&suspend_freeze_wait_head); + } + spin_unlock_irqrestore(&suspend_freeze_lock, flags); } EXPORT_SYMBOL_GPL(freeze_wake); diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index aaf1c1d5cf5d..94b2d7b88a27 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -7,6 +7,7 @@ #include #include #include +#include #include @@ -104,6 +105,21 @@ static void cpuidle_idle_call(void) */ rcu_idle_enter(); + /* + * Suspend-to-idle ("freeze") is a system state in which all user space + * has been frozen, all I/O devices have been suspended and the only + * activity happens here and in interrupts (if any). In that case bypass + * the cpuidle governor and go straight for the deepest idle state + * available. Possibly also suspend the local tick and the entire + * timekeeping to prevent timer interrupts from kicking us out of idle + * until a proper wakeup interrupt happens.
+ */ + if (idle_should_freeze()) { + cpuidle_enter_freeze(); + local_irq_enable(); + goto exit_idle; + } + /* * Ask the cpuidle framework to choose a convenient idle state. * Fall back to the default arch idle method on errors. -- cgit v1.2.3 From affe3e85ae78507cc953f3f700e0644e50844cff Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 11 Feb 2015 05:01:52 +0100 Subject: timekeeping: Pass readout base to update_fast_timekeeper() Modify update_fast_timekeeper() to take a struct tk_read_base pointer as its argument (instead of a struct timekeeper pointer) and update its kerneldoc comment to reflect that. That will allow a struct tk_read_base that is not part of a struct timekeeper to be passed to it in the next patch. Signed-off-by: Rafael J. Wysocki Acked-by: Peter Zijlstra (Intel) Acked-by: John Stultz --- kernel/time/timekeeping.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index b124af259800..abf08f4366c1 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -230,9 +230,7 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk) /** * update_fast_timekeeper - Update the fast and NMI safe monotonic timekeeper. - * @tk: The timekeeper from which we take the update - * @tkf: The fast timekeeper to update - * @tbase: The time base for the fast timekeeper (mono/raw) + * @tkr: Timekeeping readout base from which we take the update * * We want to use this from any context including NMI and tracing / * instrumenting the timekeeping code itself. @@ -244,11 +242,11 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk) * smp_wmb(); <- Ensure that the last base[1] update is visible * tkf->seq++; * smp_wmb(); <- Ensure that the seqcount update is visible - * update(tkf->base[0], tk); + * update(tkf->base[0], tkr); * smp_wmb(); <- Ensure that the base[0] update is visible * tkf->seq++; * smp_wmb(); <- Ensure that the seqcount update is visible - * update(tkf->base[1], tk); + * update(tkf->base[1], tkr); * * The reader side does: * @@ -269,7 +267,7 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk) * slightly wrong timestamp (a few nanoseconds). See * @ktime_get_mono_fast_ns. */ -static void update_fast_timekeeper(struct timekeeper *tk) +static void update_fast_timekeeper(struct tk_read_base *tkr) { struct tk_read_base *base = tk_fast_mono.base; @@ -277,7 +275,7 @@ static void update_fast_timekeeper(struct timekeeper *tk) raw_write_seqcount_latch(&tk_fast_mono.seq); /* Update base[0] */ - memcpy(base, &tk->tkr, sizeof(*base)); + memcpy(base, tkr, sizeof(*base)); /* Force readers back to base[0] */ raw_write_seqcount_latch(&tk_fast_mono.seq); @@ -462,7 +460,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) memcpy(&shadow_timekeeper, &tk_core.timekeeper, sizeof(tk_core.timekeeper)); - update_fast_timekeeper(tk); + update_fast_timekeeper(&tk->tkr); } /** -- cgit v1.2.3 From dfeb0750b630b72b5d4fb2461bc7179eceb54666 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:36:31 -0800 Subject: kernfs: remove KERNFS_STATIC_NAME When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid making a separate copy of its name. It's currently only used for sysfs attributes whose filenames are required to stay accessible and unchanged. 
There are rare exceptions where these names are allocated and formatted dynamically but for the vast majority of cases they're consts in the rodata section. Now that kernfs is converted to use kstrdup_const() and kfree_const(), there's little point in keeping KERNFS_STATIC_NAME around. Remove it. Signed-off-by: Tejun Heo Cc: Andrzej Hajda Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/kernfs/dir.c | 20 ++++++++------------ fs/kernfs/file.c | 4 ---- fs/sysfs/file.c | 2 +- include/linux/kernfs.h | 7 ++----- kernel/cgroup.c | 2 +- 5 files changed, 12 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 35e40879860a..6acc9648f986 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -411,8 +411,9 @@ void kernfs_put(struct kernfs_node *kn) if (kernfs_type(kn) == KERNFS_LINK) kernfs_put(kn->symlink.target_kn); - if (!(kn->flags & KERNFS_STATIC_NAME)) - kfree_const(kn->name); + + kfree_const(kn->name); + if (kn->iattr) { if (kn->iattr->ia_secdata) security_release_secctx(kn->iattr->ia_secdata, @@ -506,15 +507,12 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, const char *name, umode_t mode, unsigned flags) { - const char *dup_name = NULL; struct kernfs_node *kn; int ret; - if (!(flags & KERNFS_STATIC_NAME)) { - name = dup_name = kstrdup_const(name, GFP_KERNEL); - if (!name) - return NULL; - } + name = kstrdup_const(name, GFP_KERNEL); + if (!name) + return NULL; kn = kmem_cache_zalloc(kernfs_node_cache, GFP_KERNEL); if (!kn) @@ -538,7 +536,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, err_out2: kmem_cache_free(kernfs_node_cache, kn); err_out1: - kfree_const(dup_name); + kfree_const(name); return NULL; } @@ -1285,9 +1283,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, kn->ns = new_ns; if (new_name) { - if (!(kn->flags & KERNFS_STATIC_NAME)) - old_name = kn->name; - kn->flags &= ~KERNFS_STATIC_NAME; + old_name = kn->name; kn->name = new_name; } diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index ddc9f9612f16..b684e8a132e6 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -901,7 +901,6 @@ const struct file_operations kernfs_file_fops = { * @ops: kernfs operations for the file * @priv: private data for the file * @ns: optional namespace tag of the file - * @name_is_static: don't copy file name * @key: lockdep key for the file's active_ref, %NULL to disable lockdep * * Returns the created node on success, ERR_PTR() value on error. 
@@ -911,7 +910,6 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, umode_t mode, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, - bool name_is_static, struct lock_class_key *key) { struct kernfs_node *kn; @@ -919,8 +917,6 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, int rc; flags = KERNFS_FILE; - if (name_is_static) - flags |= KERNFS_STATIC_NAME; kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG, flags); if (!kn) diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index dfe928a9540f..7c2867b44141 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -295,7 +295,7 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, key = attr->key ?: (struct lock_class_key *)&attr->skey; #endif kn = __kernfs_create_file(parent, attr->name, mode & 0777, size, ops, - (void *)attr, ns, true, key); + (void *)attr, ns, key); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index d4e01b358341..71ecdab1671b 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -43,7 +43,6 @@ enum kernfs_node_flag { KERNFS_HAS_SEQ_SHOW = 0x0040, KERNFS_HAS_MMAP = 0x0080, KERNFS_LOCKDEP = 0x0100, - KERNFS_STATIC_NAME = 0x0200, KERNFS_SUICIDAL = 0x0400, KERNFS_SUICIDED = 0x0800, }; @@ -291,7 +290,6 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, umode_t mode, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, - bool name_is_static, struct lock_class_key *key); struct kernfs_node *kernfs_create_link(struct kernfs_node *parent, const char *name, @@ -369,8 +367,7 @@ kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, static inline struct kernfs_node * __kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode, loff_t size, const struct kernfs_ops *ops, - void *priv, const void *ns, bool name_is_static, - struct lock_class_key *key) + void *priv, const void *ns, struct lock_class_key *key) { return ERR_PTR(-ENOSYS); } static inline struct kernfs_node * @@ -439,7 +436,7 @@ kernfs_create_file_ns(struct kernfs_node *parent, const char *name, key = (struct lock_class_key *)&ops->lockdep_key; #endif return __kernfs_create_file(parent, name, mode, size, ops, priv, ns, - false, key); + key); } static inline struct kernfs_node * diff --git a/kernel/cgroup.c b/kernel/cgroup.c index d5f6ec251fb2..29a7b2cc593e 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3077,7 +3077,7 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft) #endif kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name), cgroup_file_mode(cft), 0, cft->kf_ops, cft, - NULL, false, key); + NULL, key); if (IS_ERR(kn)) return PTR_ERR(kn); -- cgit v1.2.3 From e8e6d97c9bf734c42322dbed0f882ef11bfe7b58 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:37:23 -0800 Subject: cpuset: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. * kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static buffer which is protected by a dedicated spinlock. Removed. 
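For illustration, a minimal usage sketch of the new interface (not from the patch; the function and mask names here are hypothetical). cpumask_pr_args(mask) expands to the nr_cpu_ids and cpumask_bits(mask) argument pair that '%*pb'/'%*pbl' expect, which is what lets callers drop their temporary buffers:

static void example_show_mask(const struct cpumask *mask)
{
	/* list form, e.g. "0-3,8" */
	pr_info("cpus: %*pbl\n", cpumask_pr_args(mask));
	/* bitmap form, e.g. "0000010f" */
	pr_info("cpus: %*pb\n", cpumask_pr_args(mask));
}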
Signed-off-by: Tejun Heo Cc: Li Zefan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 42 +++++++++--------------------------------- 1 file changed, 9 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index c54a1dae6c11..1d1fe9361d29 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1707,40 +1707,27 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v) { struct cpuset *cs = css_cs(seq_css(sf)); cpuset_filetype_t type = seq_cft(sf)->private; - ssize_t count; - char *buf, *s; int ret = 0; - count = seq_get_buf(sf, &buf); - s = buf; - spin_lock_irq(&callback_lock); switch (type) { case FILE_CPULIST: - s += cpulist_scnprintf(s, count, cs->cpus_allowed); + seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->cpus_allowed)); break; case FILE_MEMLIST: - s += nodelist_scnprintf(s, count, cs->mems_allowed); + seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->mems_allowed)); break; case FILE_EFFECTIVE_CPULIST: - s += cpulist_scnprintf(s, count, cs->effective_cpus); + seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->effective_cpus)); break; case FILE_EFFECTIVE_MEMLIST: - s += nodelist_scnprintf(s, count, cs->effective_mems); + seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems)); break; default: ret = -EINVAL; - goto out_unlock; } - if (s < buf + count - 1) { - *s++ = '\n'; - seq_commit(sf, s - buf); - } else { - seq_commit(sf, -1); - } -out_unlock: spin_unlock_irq(&callback_lock); return ret; } @@ -2610,8 +2597,6 @@ int cpuset_mems_allowed_intersects(const struct task_struct *tsk1, return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed); } -#define CPUSET_NODELIST_LEN (256) - /** * cpuset_print_task_mems_allowed - prints task's cpuset and mems_allowed * @tsk: pointer to task_struct of some task. @@ -2621,23 +2606,16 @@ int cpuset_mems_allowed_intersects(const struct task_struct *tsk1, */ void cpuset_print_task_mems_allowed(struct task_struct *tsk) { - /* Statically allocated to prevent using excess stack. */ - static char cpuset_nodelist[CPUSET_NODELIST_LEN]; - static DEFINE_SPINLOCK(cpuset_buffer_lock); struct cgroup *cgrp; - spin_lock(&cpuset_buffer_lock); rcu_read_lock(); cgrp = task_cs(tsk)->css.cgroup; - nodelist_scnprintf(cpuset_nodelist, CPUSET_NODELIST_LEN, - tsk->mems_allowed); pr_info("%s cpuset=", tsk->comm); pr_cont_cgroup_name(cgrp); - pr_cont(" mems_allowed=%s\n", cpuset_nodelist); + pr_cont(" mems_allowed=%*pbl\n", nodemask_pr_args(&tsk->mems_allowed)); rcu_read_unlock(); - spin_unlock(&cpuset_buffer_lock); } /* @@ -2715,10 +2693,8 @@ out: /* Display task mems_allowed in /proc//status file. */ void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task) { - seq_puts(m, "Mems_allowed:\t"); - seq_nodemask(m, &task->mems_allowed); - seq_puts(m, "\n"); - seq_puts(m, "Mems_allowed_list:\t"); - seq_nodemask_list(m, &task->mems_allowed); - seq_puts(m, "\n"); + seq_printf(m, "Mems_allowed:\t%*pb\n", + nodemask_pr_args(&task->mems_allowed)); + seq_printf(m, "Mems_allowed_list:\t%*pbl\n", + nodemask_pr_args(&task->mems_allowed)); } -- cgit v1.2.3 From ad853b48cb4650285e8544eebbba5bbd9274ee15 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:37:25 -0800 Subject: rcu: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. 
cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo Cc: "Paul E. McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/rcu/tree_plugin.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2e850a51bb8f..0d7bbe3095ad 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -49,7 +49,6 @@ DEFINE_PER_CPU(char, rcu_cpu_has_work); static cpumask_var_t rcu_nocb_mask; /* CPUs to have callbacks offloaded. */ static bool have_rcu_nocb_mask; /* Was rcu_nocb_mask allocated? */ static bool __read_mostly rcu_nocb_poll; /* Offload kthread are to poll. */ -static char __initdata nocb_buf[NR_CPUS * 5]; #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ /* @@ -2386,8 +2385,8 @@ void __init rcu_init_nohz(void) cpumask_and(rcu_nocb_mask, cpu_possible_mask, rcu_nocb_mask); } - cpulist_scnprintf(nocb_buf, sizeof(nocb_buf), rcu_nocb_mask); - pr_info("\tOffload RCU callbacks from CPUs: %s.\n", nocb_buf); + pr_info("\tOffload RCU callbacks from CPUs: %*pbl.\n", + cpumask_pr_args(rcu_nocb_mask)); if (rcu_nocb_poll) pr_info("\tPoll for callbacks from no-CBs CPUs.\n"); -- cgit v1.2.3 From 333470ee46b6851477dc9392eb71ba8ffa4f3387 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:37:28 -0800 Subject: sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. 
Signed-off-by: Tejun Heo Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched/core.c | 10 ++++------ kernel/sched/stats.c | 11 ++--------- 2 files changed, 6 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1f37fe7f77a4..13049aac05a6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5462,9 +5462,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, struct cpumask *groupmask) { struct sched_group *group = sd->groups; - char str[256]; - cpulist_scnprintf(str, sizeof(str), sched_domain_span(sd)); cpumask_clear(groupmask); printk(KERN_DEBUG "%*s domain %d: ", level, "", level); @@ -5477,7 +5475,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, return -1; } - printk(KERN_CONT "span %s level %s\n", str, sd->name); + printk(KERN_CONT "span %*pbl level %s\n", + cpumask_pr_args(sched_domain_span(sd)), sd->name); if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) { printk(KERN_ERR "ERROR: domain->span does not contain " @@ -5522,9 +5521,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, cpumask_or(groupmask, groupmask, sched_group_cpus(group)); - cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group)); - - printk(KERN_CONT " %s", str); + printk(KERN_CONT " %*pbl", + cpumask_pr_args(sched_group_cpus(group))); if (group->sgc->capacity != SCHED_CAPACITY_SCALE) { printk(KERN_CONT " (cpu_capacity = %d)", group->sgc->capacity); diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c index a476bea17fbc..87e2c9f0c33e 100644 --- a/kernel/sched/stats.c +++ b/kernel/sched/stats.c @@ -15,11 +15,6 @@ static int show_schedstat(struct seq_file *seq, void *v) { int cpu; - int mask_len = DIV_ROUND_UP(NR_CPUS, 32) * 9; - char *mask_str = kmalloc(mask_len, GFP_KERNEL); - - if (mask_str == NULL) - return -ENOMEM; if (v == (void *)1) { seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION); @@ -50,9 +45,8 @@ static int show_schedstat(struct seq_file *seq, void *v) for_each_domain(cpu, sd) { enum cpu_idle_type itype; - cpumask_scnprintf(mask_str, mask_len, - sched_domain_span(sd)); - seq_printf(seq, "domain%d %s", dcount++, mask_str); + seq_printf(seq, "domain%d %*pb", dcount++, + cpumask_pr_args(sched_domain_span(sd))); for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES; itype++) { seq_printf(seq, " %u %u %u %u %u %u %u %u", @@ -76,7 +70,6 @@ static int show_schedstat(struct seq_file *seq, void *v) rcu_read_unlock(); #endif } - kfree(mask_str); return 0; } -- cgit v1.2.3 From ffda22c1f316ff9c12f5911ac7a18ec3a49c9d02 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:37:31 -0800 Subject: time: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. 
Signed-off-by: Tejun Heo Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/time/tick-sched.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 1363d58f07e9..a4c4edac4528 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -326,13 +326,6 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb, return NOTIFY_OK; } -/* - * Worst case string length in chunks of CPU range seems 2 steps - * separations: 0,2,4,6,... - * This is NR_CPUS + sizeof('\0') - */ -static char __initdata nohz_full_buf[NR_CPUS + 1]; - static int tick_nohz_init_all(void) { int err = -1; @@ -393,8 +386,8 @@ void __init tick_nohz_init(void) context_tracking_cpu_set(cpu); cpu_notifier(tick_nohz_cpu_down_callback, 0); - cpulist_scnprintf(nohz_full_buf, sizeof(nohz_full_buf), tick_nohz_full_mask); - pr_info("NO_HZ: Full dynticks CPUs: %s.\n", nohz_full_buf); + pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", + cpumask_pr_args(tick_nohz_full_mask)); } #endif -- cgit v1.2.3 From dfbcbf42dd07b649bb02b1558e2c7150c035b4dc Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:37:37 -0800 Subject: workqueue: use %*pb[l] to format bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/workqueue.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index beeeac9e0e3e..f28849394791 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3083,10 +3083,9 @@ static ssize_t wq_cpumask_show(struct device *dev, int written; mutex_lock(&wq->mutex); - written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask); + written = scnprintf(buf, PAGE_SIZE, "%*pb\n", + cpumask_pr_args(wq->unbound_attrs->cpumask)); mutex_unlock(&wq->mutex); - - written += scnprintf(buf + written, PAGE_SIZE - written, "\n"); return written; } -- cgit v1.2.3 From 1a40243bae6fa0cc09cee40d51e258c725d897e6 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:37:39 -0800 Subject: tracing: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. 
Signed-off-by: Tejun Heo Cc: Steven Rostedt Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/trace/trace.c | 6 +++--- kernel/trace/trace_seq.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 77b8dc528006..62c6506d663f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3353,12 +3353,12 @@ tracing_cpumask_read(struct file *filp, char __user *ubuf, mutex_lock(&tracing_cpumask_update_lock); - len = cpumask_scnprintf(mask_str, count, tr->tracing_cpumask); - if (count - len < 2) { + len = snprintf(mask_str, count, "%*pb\n", + cpumask_pr_args(tr->tracing_cpumask)); + if (len >= count) { count = -EINVAL; goto out_err; } - len += sprintf(mask_str + len, "\n"); count = simple_read_from_buffer(ubuf, count, ppos, mask_str, NR_CPUS+1); out_err: diff --git a/kernel/trace/trace_seq.c b/kernel/trace/trace_seq.c index f8b45d8792f9..e694c9f9efa4 100644 --- a/kernel/trace/trace_seq.c +++ b/kernel/trace/trace_seq.c @@ -120,7 +120,7 @@ void trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp, __trace_seq_init(s); - seq_buf_bitmask(&s->seq, maskp, nmaskbits); + seq_buf_printf(&s->seq, "%*pb", nmaskbits, maskp); if (unlikely(seq_buf_has_overflowed(&s->seq))) { s->seq.len = save_len; -- cgit v1.2.3 From 4497da6f950951b8819cd827bbebb8f214e8ecbe Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:38:05 -0800 Subject: padata: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo Cc: Steffen Klassert Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/padata.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/padata.c b/kernel/padata.c index 161402f0b517..b38bea9c466a 100644 --- a/kernel/padata.c +++ b/kernel/padata.c @@ -917,15 +917,10 @@ static ssize_t show_cpumask(struct padata_instance *pinst, else cpumask = pinst->cpumask.pcpu; - len = bitmap_scnprintf(buf, PAGE_SIZE, cpumask_bits(cpumask), - nr_cpu_ids); - if (PAGE_SIZE - len < 2) - len = -EINVAL; - else - len += sprintf(buf + len, "\n"); - + len = snprintf(buf, PAGE_SIZE, "%*pb\n", + nr_cpu_ids, cpumask_bits(cpumask)); mutex_unlock(&pinst->lock); - return len; + return len < PAGE_SIZE ? len : -EINVAL; } static ssize_t store_cpumask(struct padata_instance *pinst, -- cgit v1.2.3 From c1d7f03fdd0ed600b161a7f3309e45a20af89796 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:38:10 -0800 Subject: irq: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. 
Signed-off-by: Tejun Heo Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq/proc.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c index 9dc9bfd8a678..df2f4642d1e7 100644 --- a/kernel/irq/proc.c +++ b/kernel/irq/proc.c @@ -46,10 +46,9 @@ static int show_irq_affinity(int type, struct seq_file *m, void *v) mask = desc->pending_mask; #endif if (type) - seq_cpumask_list(m, mask); + seq_printf(m, "%*pbl\n", cpumask_pr_args(mask)); else - seq_cpumask(m, mask); - seq_putc(m, '\n'); + seq_printf(m, "%*pb\n", cpumask_pr_args(mask)); return 0; } @@ -67,8 +66,7 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v) cpumask_copy(mask, desc->affinity_hint); raw_spin_unlock_irqrestore(&desc->lock, flags); - seq_cpumask(m, mask); - seq_putc(m, '\n'); + seq_printf(m, "%*pb\n", cpumask_pr_args(mask)); free_cpumask_var(mask); return 0; @@ -186,8 +184,7 @@ static const struct file_operations irq_affinity_list_proc_fops = { static int default_affinity_show(struct seq_file *m, void *v) { - seq_cpumask(m, irq_default_affinity); - seq_putc(m, '\n'); + seq_printf(m, "%*pb\n", cpumask_pr_args(irq_default_affinity)); return 0; } -- cgit v1.2.3 From ccbd59c1c104d6e42e949e2588563bfe25d9d98f Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 13 Feb 2015 14:38:13 -0800 Subject: profile: use %*pb[l] to print bitmaps including cpumasks and nodemasks printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/profile.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/profile.c b/kernel/profile.c index 54bf5ba26420..a7bcd28d6e9f 100644 --- a/kernel/profile.c +++ b/kernel/profile.c @@ -422,8 +422,7 @@ void profile_tick(int type) static int prof_cpu_mask_proc_show(struct seq_file *m, void *v) { - seq_cpumask(m, prof_cpu_mask); - seq_putc(m, '\n'); + seq_printf(m, "%*pb\n", cpumask_pr_args(prof_cpu_mask)); return 0; } -- cgit v1.2.3 From bebf56a1b176c2e1c9efe44e7e6915532cc682cf Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Fri, 13 Feb 2015 14:40:17 -0800 Subject: kasan: enable instrumentation of global variables This feature lets us detect out-of-bounds accesses to global variables. It works both for globals in the kernel image and for globals in modules. Currently it won't work for symbols in user-specified sections (e.g. __init, __read_mostly, ...). The idea is simple: the compiler grows each global variable by the redzone size and adds constructors that invoke the __asan_register_globals() function. Information about each global variable (address, size, size with redzone, ...) is passed to __asan_register_globals() so we can poison the variable's redzone. This patch also forces module_alloc() to return an 8*PAGE_SIZE aligned address, making shadow memory handling (kasan_module_alloc()/kasan_module_free()) simpler. Such alignment guarantees that each shadow page backing the modules' address space corresponds to only one module_alloc() allocation.
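As a rough sketch of what the instrumentation amounts to (illustrative only; the variable, the 32-byte redzone and the constructor below are assumptions, since the real descriptors and registration calls are generated by the compiler):

static char my_buf[8];	/* the compiler reserves a redzone right after it */

static struct kasan_global my_buf_desc = {
	.beg			= my_buf,
	.size			= sizeof(my_buf),
	.size_with_redzone	= sizeof(my_buf) + 32,	/* assumed redzone size */
};

/* emitted by the compiler as a constructor of the instrumented object */
static void register_globals_ctor(void)
{
	__asan_register_globals(&my_buf_desc, 1);
}

__asan_register_globals() then poisons each descriptor's redzone, so any access past the end of my_buf is reported against the variable.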
Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kasan.txt | 2 +- arch/x86/kernel/module.c | 12 +++++++++- arch/x86/mm/kasan_init_64.c | 2 +- include/linux/compiler-gcc4.h | 4 ++++ include/linux/compiler-gcc5.h | 2 ++ include/linux/kasan.h | 10 +++++++++ kernel/module.c | 2 ++ lib/Kconfig.kasan | 1 + mm/kasan/kasan.c | 52 +++++++++++++++++++++++++++++++++++++++++++ mm/kasan/kasan.h | 25 +++++++++++++++++++++ mm/kasan/report.c | 22 ++++++++++++++++++ scripts/Makefile.kasan | 2 +- 12 files changed, 132 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt index f0645a8a992f..092fc10961fe 100644 --- a/Documentation/kasan.txt +++ b/Documentation/kasan.txt @@ -9,7 +9,7 @@ a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. KASan uses compile-time instrumentation for checking every memory access, -therefore you will need a certain version of GCC >= 4.9.2 +therefore you will need a certain version of GCC > 4.9.2 Currently KASan is supported only for x86_64 architecture and requires that the kernel be built with the SLUB allocator. diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index e830e61aae05..d1ac80b72c72 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -83,13 +84,22 @@ static unsigned long int get_module_load_offset(void) void *module_alloc(unsigned long size) { + void *p; + if (PAGE_ALIGN(size) > MODULES_LEN) return NULL; - return __vmalloc_node_range(size, 1, + + p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR + get_module_load_offset(), MODULES_END, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, __builtin_return_address(0)); + if (p && (kasan_module_alloc(p, size) < 0)) { + vfree(p); + return NULL; + } + + return p; } #ifdef CONFIG_X86_32 diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 53508708b7aa..4860906c6b9f 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -196,7 +196,7 @@ void __init kasan_init(void) (unsigned long)kasan_mem_to_shadow(_end), NUMA_NO_NODE); - populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_VADDR), + populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END), (void *)KASAN_SHADOW_END); memset(kasan_zero_page, 0, PAGE_SIZE); diff --git a/include/linux/compiler-gcc4.h b/include/linux/compiler-gcc4.h index d1a558239b1a..769e19864632 100644 --- a/include/linux/compiler-gcc4.h +++ b/include/linux/compiler-gcc4.h @@ -85,3 +85,7 @@ #define __HAVE_BUILTIN_BSWAP16__ #endif #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */ + +#if GCC_VERSION >= 40902 +#define KASAN_ABI_VERSION 3 +#endif diff --git a/include/linux/compiler-gcc5.h b/include/linux/compiler-gcc5.h index c8c565952548..efee493714eb 100644 --- a/include/linux/compiler-gcc5.h +++ b/include/linux/compiler-gcc5.h @@ -63,3 +63,5 @@ #define __HAVE_BUILTIN_BSWAP64__ #define __HAVE_BUILTIN_BSWAP16__ #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */ + +#define KASAN_ABI_VERSION 4 diff --git a/include/linux/kasan.h b/include/linux/kasan.h 
index d5310eef3e38..72ba725ddf9c 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -49,8 +49,15 @@ void kasan_krealloc(const void *object, size_t new_size); void kasan_slab_alloc(struct kmem_cache *s, void *object); void kasan_slab_free(struct kmem_cache *s, void *object); +#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) + +int kasan_module_alloc(void *addr, size_t size); +void kasan_module_free(void *addr); + #else /* CONFIG_KASAN */ +#define MODULE_ALIGN 1 + static inline void kasan_unpoison_shadow(const void *address, size_t size) {} static inline void kasan_enable_current(void) {} @@ -74,6 +81,9 @@ static inline void kasan_krealloc(const void *object, size_t new_size) {} static inline void kasan_slab_alloc(struct kmem_cache *s, void *object) {} static inline void kasan_slab_free(struct kmem_cache *s, void *object) {} +static inline int kasan_module_alloc(void *addr, size_t size) { return 0; } +static inline void kasan_module_free(void *addr) {} + #endif /* CONFIG_KASAN */ #endif /* LINUX_KASAN_H */ diff --git a/kernel/module.c b/kernel/module.c index 82dc1f899e6d..8426ad48362c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -56,6 +56,7 @@ #include #include #include +#include #include #include #include @@ -1813,6 +1814,7 @@ static void unset_module_init_ro_nx(struct module *mod) { } void __weak module_memfree(void *module_region) { vfree(module_region); + kasan_module_free(module_region); } void __weak module_arch_cleanup(struct module *mod) diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan index 4d47d874335c..4fecaedc80a2 100644 --- a/lib/Kconfig.kasan +++ b/lib/Kconfig.kasan @@ -6,6 +6,7 @@ if HAVE_ARCH_KASAN config KASAN bool "KASan: runtime memory debugger" depends on SLUB_DEBUG + select CONSTRUCTORS help Enables kernel address sanitizer - runtime memory debugger, designed to find out-of-bounds accesses and use-after-free bugs. diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 799c52b9826c..78fee632a7ee 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -395,6 +396,57 @@ void kasan_kfree_large(const void *ptr) KASAN_FREE_PAGE); } +int kasan_module_alloc(void *addr, size_t size) +{ + void *ret; + size_t shadow_size; + unsigned long shadow_start; + + shadow_start = (unsigned long)kasan_mem_to_shadow(addr); + shadow_size = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, + PAGE_SIZE); + + if (WARN_ON(!PAGE_ALIGNED(shadow_start))) + return -EINVAL; + + ret = __vmalloc_node_range(shadow_size, 1, shadow_start, + shadow_start + shadow_size, + GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, + PAGE_KERNEL, VM_NO_GUARD, NUMA_NO_NODE, + __builtin_return_address(0)); + return ret ? 
0 : -ENOMEM; +} + +void kasan_module_free(void *addr) +{ + vfree(kasan_mem_to_shadow(addr)); +} + +static void register_global(struct kasan_global *global) +{ + size_t aligned_size = round_up(global->size, KASAN_SHADOW_SCALE_SIZE); + + kasan_unpoison_shadow(global->beg, global->size); + + kasan_poison_shadow(global->beg + aligned_size, + global->size_with_redzone - aligned_size, + KASAN_GLOBAL_REDZONE); +} + +void __asan_register_globals(struct kasan_global *globals, size_t size) +{ + int i; + + for (i = 0; i < size; i++) + register_global(&globals[i]); +} +EXPORT_SYMBOL(__asan_register_globals); + +void __asan_unregister_globals(struct kasan_global *globals, size_t size) +{ +} +EXPORT_SYMBOL(__asan_unregister_globals); + #define DEFINE_ASAN_LOAD_STORE(size) \ void __asan_load##size(unsigned long addr) \ { \ diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index 1fcc1d81a9cf..4986b0acab21 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -11,6 +11,7 @@ #define KASAN_PAGE_REDZONE 0xFE /* redzone for kmalloc_large allocations */ #define KASAN_KMALLOC_REDZONE 0xFC /* redzone inside slub object */ #define KASAN_KMALLOC_FREE 0xFB /* object was freed (kmem_cache_free/kfree) */ +#define KASAN_GLOBAL_REDZONE 0xFA /* redzone for global variable */ /* * Stack redzone shadow values @@ -21,6 +22,10 @@ #define KASAN_STACK_RIGHT 0xF3 #define KASAN_STACK_PARTIAL 0xF4 +/* Don't break randconfig/all*config builds */ +#ifndef KASAN_ABI_VERSION +#define KASAN_ABI_VERSION 1 +#endif struct kasan_access_info { const void *access_addr; @@ -30,6 +35,26 @@ struct kasan_access_info { unsigned long ip; }; +/* The layout of struct dictated by compiler */ +struct kasan_source_location { + const char *filename; + int line_no; + int column_no; +}; + +/* The layout of struct dictated by compiler */ +struct kasan_global { + const void *beg; /* Address of the beginning of the global variable. */ + size_t size; /* Size of the global variable. */ + size_t size_with_redzone; /* Size of the variable + size of the red zone. 32 bytes aligned */ + const void *name; + const void *module_name; /* Name of the module where the global variable is declared. */ + unsigned long has_dynamic_init; /* This needed for C++ */ +#if KASAN_ABI_VERSION >= 4 + struct kasan_source_location *location; +#endif +}; + void kasan_report_error(struct kasan_access_info *info); void kasan_report_user_access(struct kasan_access_info *info); diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 866732ef3db3..680ceedf810a 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -23,6 +23,8 @@ #include #include +#include + #include "kasan.h" #include "../slab.h" @@ -61,6 +63,7 @@ static void print_error_description(struct kasan_access_info *info) break; case KASAN_PAGE_REDZONE: case KASAN_KMALLOC_REDZONE: + case KASAN_GLOBAL_REDZONE: case 0 ... 
KASAN_SHADOW_SCALE_SIZE - 1: bug_type = "out of bounds access"; break; @@ -80,6 +83,20 @@ static void print_error_description(struct kasan_access_info *info) info->access_size, current->comm, task_pid_nr(current)); } +static inline bool kernel_or_module_addr(const void *addr) +{ + return (addr >= (void *)_stext && addr < (void *)_end) + || (addr >= (void *)MODULES_VADDR + && addr < (void *)MODULES_END); +} + +static inline bool init_task_stack_addr(const void *addr) +{ + return addr >= (void *)&init_thread_union.stack && + (addr <= (void *)&init_thread_union.stack + + sizeof(init_thread_union.stack)); +} + static void print_address_description(struct kasan_access_info *info) { const void *addr = info->access_addr; @@ -107,6 +124,11 @@ static void print_address_description(struct kasan_access_info *info) dump_page(page, "kasan: bad access detected"); } + if (kernel_or_module_addr(addr)) { + if (!init_task_stack_addr(addr)) + pr_err("Address belongs to variable %pS\n", addr); + } + dump_stack(); } diff --git a/scripts/Makefile.kasan b/scripts/Makefile.kasan index 2163b8cc446e..631619b2b118 100644 --- a/scripts/Makefile.kasan +++ b/scripts/Makefile.kasan @@ -9,7 +9,7 @@ CFLAGS_KASAN_MINIMAL := -fsanitize=kernel-address CFLAGS_KASAN := $(call cc-option, -fsanitize=kernel-address \ -fasan-shadow-offset=$(CONFIG_KASAN_SHADOW_OFFSET) \ - --param asan-stack=1 \ + --param asan-stack=1 --param asan-globals=1 \ --param asan-instrumentation-with-call-threshold=$(call_threshold)) ifeq ($(call cc-option, $(CFLAGS_KASAN_MINIMAL) -Werror),) -- cgit v1.2.3 From 977ad481b66ca91e1f6492b3c5c4748c68fdee9c Mon Sep 17 00:00:00 2001 From: Wang Nan Date: Fri, 13 Feb 2015 14:40:24 -0800 Subject: kprobes: set kprobes_all_disarmed earlier to enable re-optimization. In the original code, the probed instruction doesn't get re-optimized after echo 0 > /sys/kernel/debug/kprobes/enabled echo 1 > /sys/kernel/debug/kprobes/enabled This is because the original code checks kprobes_all_disarmed in optimize_kprobe(), but that flag is only turned off after calling that function. Therefore, optimize_kprobe() sees kprobes_all_disarmed == true and skips the optimization. This patch simply turns off kprobes_all_disarmed earlier to enable optimization. Signed-off-by: Wang Nan Signed-off-by: Masami Hiramatsu Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kprobes.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 2ca272f8f62e..c39790001854 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2320,6 +2320,12 @@ static void arm_all_kprobes(void) if (!kprobes_all_disarmed) goto already_enabled; + /* + * optimize_kprobe() called by arm_kprobe() checks + * kprobes_all_disarmed, so set kprobes_all_disarmed before + * arm_kprobe. + */ + kprobes_all_disarmed = false; /* Arming kprobes doesn't optimize kprobe itself */ for (i = 0; i < KPROBE_TABLE_SIZE; i++) { head = &kprobe_table[i]; @@ -2328,7 +2334,6 @@ static void arm_all_kprobes(void) arm_kprobe(p); } - kprobes_all_disarmed = false; printk(KERN_INFO "Kprobes globally enabled\n"); already_enabled: -- cgit v1.2.3 From 69d54b916d83872a0a327778a01af2a096923f59 Mon Sep 17 00:00:00 2001 From: Wang Nan Date: Fri, 13 Feb 2015 14:40:26 -0800 Subject: kprobes: makes kprobes/enabled works correctly for optimized kprobes. debugfs/kprobes/enabled doesn't work correctly on optimized kprobes. 
Masami Hiramatsu has a test report on the x86_64 platform: https://lkml.org/lkml/2015/1/19/274 This patch forces it to unoptimize the kprobe if kprobes_all_disarmed is set. It also checks the flag in the unregistering path to skip the unneeded disarming process when kprobes are globally disarmed. Signed-off-by: Wang Nan Signed-off-by: Masami Hiramatsu Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kprobes.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index c39790001854..c90e417bb963 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -869,7 +869,8 @@ static void __disarm_kprobe(struct kprobe *p, bool reopt) { struct kprobe *_p; - unoptimize_kprobe(p, false); /* Try to unoptimize */ + /* Try to unoptimize */ + unoptimize_kprobe(p, kprobes_all_disarmed); if (!kprobe_queued(p)) { arch_disarm_kprobe(p); @@ -1571,7 +1572,13 @@ static struct kprobe *__disable_kprobe(struct kprobe *p) /* Try to disarm and disable this/parent probe */ if (p == orig_p || aggr_kprobe_disabled(orig_p)) { - disarm_kprobe(orig_p, true); + /* + * If kprobes_all_disarmed is set, orig_p + * should have already been disarmed, so + * skip the unneeded disarming process. + */ + if (!kprobes_all_disarmed) + disarm_kprobe(orig_p, true); orig_p->flags |= KPROBE_FLAG_DISABLED; } } -- cgit v1.2.3 From 060407aed56c00960c9b5f70f5d19b2823adffd7 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Fri, 13 Feb 2015 14:49:02 +0100 Subject: timekeeping: Make it safe to use the fast timekeeper while suspended Theoretically, ktime_get_mono_fast_ns() may be executed after timekeeping has been suspended (or before it is resumed) which in turn may lead to undefined behavior, for example, when the clocksource read from timekeeping_get_ns() called by it is not accessible at that time. Prevent that from happening by setting up a dummy readout base for the fast timekeeper during timekeeping_suspend() such that it will always return the same number of cycles. After the last timekeeping_update() in timekeeping_suspend() the clocksource is read and the result is stored as cycles_at_suspend. The readout base from the current timekeeper is copied onto the dummy and the ->read pointer of the dummy is set to a routine unconditionally returning cycles_at_suspend. Next, the dummy is passed to update_fast_timekeeper(). Then, ktime_get_mono_fast_ns() will work until the subsequent timekeeping_resume() and the proper readout base for the fast timekeeper will be restored by the timekeeping_update() called right after clearing timekeeping_suspended. Signed-off-by: Rafael J. Wysocki Acked-by: John Stultz Acked-by: Peter Zijlstra (Intel) --- kernel/time/timekeeping.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index abf08f4366c1..aef5dc722abf 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -332,6 +332,35 @@ u64 notrace ktime_get_mono_fast_ns(void) } EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns); +/* Suspend-time cycles value for halted fast timekeeper. */ +static cycle_t cycles_at_suspend; + +static cycle_t dummy_clock_read(struct clocksource *cs) +{ + return cycles_at_suspend; +} + +/** + * halt_fast_timekeeper - Prevent fast timekeeper from accessing clocksource. + * @tk: Timekeeper to snapshot. 
+ * + * It generally is unsafe to access the clocksource after timekeeping has been + * suspended, so take a snapshot of the readout base of @tk and use it as the + * fast timekeeper's readout base while suspended. It will return the same + * number of cycles every time until timekeeping is resumed at which time the + * proper readout base for the fast timekeeper will be restored automatically. + */ +static void halt_fast_timekeeper(struct timekeeper *tk) +{ + static struct tk_read_base tkr_dummy; + struct tk_read_base *tkr = &tk->tkr; + + memcpy(&tkr_dummy, tkr, sizeof(tkr_dummy)); + cycles_at_suspend = tkr->read(tkr->clock); + tkr_dummy.read = dummy_clock_read; + update_fast_timekeeper(&tkr_dummy); +} + #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD static inline void update_vsyscall(struct timekeeper *tk) @@ -1294,6 +1323,7 @@ static int timekeeping_suspend(void) } timekeeping_update(tk, TK_MIRROR); + halt_fast_timekeeper(tk); write_seqcount_end(&tk_core.seq); raw_spin_unlock_irqrestore(&timekeeper_lock, flags); -- cgit v1.2.3 From 124cf9117c5f93cc5b324530b7e105b09c729d5d Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Fri, 13 Feb 2015 23:50:43 +0100 Subject: PM / sleep: Make it possible to quiesce timers during suspend-to-idle The efficiency of suspend-to-idle depends on being able to keep CPUs in the deepest available idle states for as much time as possible. Ideally, they should only be brought out of idle by system wakeup interrupts. However, timer interrupts occurring periodically prevent that from happening and it is not practical to chase all of the "misbehaving" timers in a whack-a-mole fashion. A much more effective approach is to suspend the local ticks for all CPUs and the entire timekeeping along the lines of what is done during full suspend, which also helps to keep suspend-to-idle and full suspend reasonably similar. The idea is to suspend the local tick on each CPU executing cpuidle_enter_freeze() and to make the last of them suspend the entire timekeeping. That should prevent timer interrupts from triggering until an IO interrupt wakes up one of the CPUs. It needs to be done with interrupts disabled on all of the CPUs, though, because otherwise the suspended clocksource might be accessed by an interrupt handler which might lead to fatal consequences. Unfortunately, the existing ->enter callbacks provided by cpuidle drivers generally cannot be used for implementing that, because some of them re-enable interrupts temporarily and some idle entry methods cause interrupts to be re-enabled automatically on exit. Also some of these callbacks manipulate local clock event devices of the CPUs which really shouldn't be done after suspending their ticks. To overcome that difficulty, introduce a new cpuidle state callback, ->enter_freeze, that will be guaranteed (1) to keep interrupts disabled all the time (and return with interrupts disabled) and (2) not to touch the CPU timer devices. Modify cpuidle_enter_freeze() to look for the deepest available idle state with ->enter_freeze present and to make the CPU execute that callback with suspended tick (and the last of the online CPUs to execute it with suspended timekeeping). Suggested-by: Thomas Gleixner Signed-off-by: Rafael J. 
Wysocki Acked-by: Peter Zijlstra (Intel) --- drivers/cpuidle/cpuidle.c | 49 +++++++++++++++++++++++++++++++++++++++++----- include/linux/cpuidle.h | 9 +++++++++ include/linux/tick.h | 6 +++++- kernel/time/tick-common.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++ kernel/time/timekeeping.c | 4 ++-- kernel/time/timekeeping.h | 2 ++ 6 files changed, 112 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 23a8d6cc8d30..4d534582514e 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include "cpuidle.h" @@ -69,18 +70,20 @@ int cpuidle_play_dead(void) * cpuidle_find_deepest_state - Find deepest state meeting specific conditions. * @drv: cpuidle driver for the given CPU. * @dev: cpuidle device for the given CPU. + * @freeze: Whether or not the state should be suitable for suspend-to-idle. */ static int cpuidle_find_deepest_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, bool freeze) { unsigned int latency_req = 0; - int i, ret = CPUIDLE_DRIVER_STATE_START - 1; + int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1; for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { struct cpuidle_state *s = &drv->states[i]; struct cpuidle_state_usage *su = &dev->states_usage[i]; - if (s->disabled || su->disable || s->exit_latency <= latency_req) + if (s->disabled || su->disable || s->exit_latency <= latency_req + || (freeze && !s->enter_freeze)) continue; latency_req = s->exit_latency; @@ -89,10 +92,31 @@ static int cpuidle_find_deepest_state(struct cpuidle_driver *drv, return ret; } +static void enter_freeze_proper(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int index) +{ + tick_freeze(); + /* + * The state used here cannot be a "coupled" one, because the "coupled" + * cpuidle mechanism enables interrupts and doing that with timekeeping + * suspended is generally unsafe. + */ + drv->states[index].enter_freeze(dev, drv, index); + WARN_ON(!irqs_disabled()); + /* + * timekeeping_resume() that will be called by tick_unfreeze() for the + * last CPU executing it calls functions containing RCU read-side + * critical sections, so tell RCU about that. + */ + RCU_NONIDLE(tick_unfreeze()); +} + /** * cpuidle_enter_freeze - Enter an idle state suitable for suspend-to-idle. * - * Find the deepest state available and enter it. + * If there are states with the ->enter_freeze callback, find the deepest of + * them and enter it with frozen tick. Otherwise, find the deepest state + * available and enter it normally. */ void cpuidle_enter_freeze(void) { @@ -100,7 +124,22 @@ void cpuidle_enter_freeze(void) struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev); int index; - index = cpuidle_find_deepest_state(drv, dev); + /* + * Find the deepest state with ->enter_freeze present, which guarantees + * that interrupts won't be enabled when it exits and allows the tick to + * be frozen safely. + */ + index = cpuidle_find_deepest_state(drv, dev, true); + if (index >= 0) { + enter_freeze_proper(drv, dev, index); + return; + } + + /* + * It is not safe to freeze the tick, find the deepest state available + * at all and try to enter it normally. 
+ */ + index = cpuidle_find_deepest_state(drv, dev, false); if (index >= 0) cpuidle_enter(drv, dev, index); else diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index f63aabf4ee90..f551a9299ac9 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -50,6 +50,15 @@ struct cpuidle_state { int index); int (*enter_dead) (struct cpuidle_device *dev, int index); + + /* + * CPUs execute ->enter_freeze with the local tick or entire timekeeping + * suspended, so it must not re-enable interrupts at any point (even + * temporarily) or attempt to change states of clock event devices. + */ + void (*enter_freeze) (struct cpuidle_device *dev, + struct cpuidle_driver *drv, + int index); }; /* Idle State Flags */ diff --git a/include/linux/tick.h b/include/linux/tick.h index eda850ca757a..9c085dc12ae9 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -79,6 +79,9 @@ extern void __init tick_init(void); extern int tick_is_oneshot_available(void); extern struct tick_device *tick_get_device(int cpu); +extern void tick_freeze(void); +extern void tick_unfreeze(void); + # ifdef CONFIG_HIGH_RES_TIMERS extern int tick_init_highres(void); extern int tick_program_event(ktime_t expires, int force); @@ -119,6 +122,8 @@ static inline int tick_oneshot_mode_active(void) { return 0; } #else /* CONFIG_GENERIC_CLOCKEVENTS */ static inline void tick_init(void) { } +static inline void tick_freeze(void) { } +static inline void tick_unfreeze(void) { } static inline void tick_cancel_sched_timer(int cpu) { } static inline void tick_clock_notify(void) { } static inline int tick_check_oneshot_change(int allow_nohz) { return 0; } @@ -226,5 +231,4 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk) __tick_nohz_task_switch(tsk); } - #endif diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 7efeedf53ebd..f7c515595b42 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -394,6 +394,56 @@ void tick_resume(void) } } +static DEFINE_RAW_SPINLOCK(tick_freeze_lock); +static unsigned int tick_freeze_depth; + +/** + * tick_freeze - Suspend the local tick and (possibly) timekeeping. + * + * Check if this is the last online CPU executing the function and if so, + * suspend timekeeping. Otherwise suspend the local tick. + * + * Call with interrupts disabled. Must be balanced with %tick_unfreeze(). + * Interrupts must not be enabled before the subsequent %tick_unfreeze(). + */ +void tick_freeze(void) +{ + raw_spin_lock(&tick_freeze_lock); + + tick_freeze_depth++; + if (tick_freeze_depth == num_online_cpus()) { + timekeeping_suspend(); + } else { + tick_suspend(); + tick_suspend_broadcast(); + } + + raw_spin_unlock(&tick_freeze_lock); +} + +/** + * tick_unfreeze - Resume the local tick and (possibly) timekeeping. + * + * Check if this is the first CPU executing the function and if so, resume + * timekeeping. Otherwise resume the local tick. + * + * Call with interrupts disabled. Must be balanced with %tick_freeze(). + * Interrupts must not be enabled after the preceding %tick_freeze(). 
+ */ +void tick_unfreeze(void) +{ + raw_spin_lock(&tick_freeze_lock); + + if (tick_freeze_depth == num_online_cpus()) + timekeeping_resume(); + else + tick_resume(); + + tick_freeze_depth--; + + raw_spin_unlock(&tick_freeze_lock); +} + /** * tick_init - initialize the tick control */ diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index aef5dc722abf..91db94136c10 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1197,7 +1197,7 @@ void timekeeping_inject_sleeptime64(struct timespec64 *delta) * xtime/wall_to_monotonic/jiffies/etc are * still managed by arch specific suspend/resume code. */ -static void timekeeping_resume(void) +void timekeeping_resume(void) { struct timekeeper *tk = &tk_core.timekeeper; struct clocksource *clock = tk->tkr.clock; @@ -1278,7 +1278,7 @@ static void timekeeping_resume(void) hrtimers_resume(); } -static int timekeeping_suspend(void) +int timekeeping_suspend(void) { struct timekeeper *tk = &tk_core.timekeeper; unsigned long flags; diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h index adc1fc98bde3..1d91416055d5 100644 --- a/kernel/time/timekeeping.h +++ b/kernel/time/timekeeping.h @@ -16,5 +16,7 @@ extern int timekeeping_inject_offset(struct timespec *ts); extern s32 timekeeping_get_tai_offset(void); extern void timekeeping_set_tai_offset(s32 tai_offset); extern void timekeeping_clocktai(struct timespec *ts); +extern int timekeeping_suspend(void); +extern void timekeeping_resume(void); #endif -- cgit v1.2.3 From 1cca3385e6d556cd90cdc148c2f26af807fa3600 Mon Sep 17 00:00:00 2001 From: Fabian Frederick Date: Tue, 17 Feb 2015 13:45:39 -0800 Subject: ptrace: remove linux/compat.h inclusion under CONFIG_COMPAT Commit 84c751bd4aeb ("ptrace: add ability to retrieve signals without removing from a queue (v4)") includes linux/compat.h globally in ptrace.c. This patch removes the now-redundant inclusion under #if defined CONFIG_COMPAT. Signed-off-by: Fabian Frederick Acked-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/ptrace.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 1eb9d90c3af9..227fec36b12a 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -1077,7 +1077,6 @@ int generic_ptrace_pokedata(struct task_struct *tsk, unsigned long addr, } #if defined CONFIG_COMPAT -#include <linux/compat.h> int compat_ptrace_request(struct task_struct *child, compat_long_t request, compat_ulong_t addr, compat_ulong_t data) -- cgit v1.2.3 From 1df0135588ed4e6048c1608ec046e9a38ea91e8e Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Tue, 17 Feb 2015 13:45:41 -0800 Subject: signal: use current->state helpers Call __set_current_state() instead of assigning the new state directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping track of who changed the state. 
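The converted pattern is the classic interruptible sleep loop; as an illustrative fragment (mirroring the pause() hunk below):

	while (!signal_pending(current)) {
		__set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}

Because the helper is a single funnel point for task state changes, CONFIG_DEBUG_ATOMIC_SLEEP builds can record where the state was last changed instead of seeing an anonymous store to current->state.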
Signed-off-by: Davidlohr Bueso Acked-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 33a52759cc0e..a390499943e4 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3550,7 +3550,7 @@ SYSCALL_DEFINE2(signal, int, sig, __sighandler_t, handler) SYSCALL_DEFINE0(pause) { while (!signal_pending(current)) { - current->state = TASK_INTERRUPTIBLE; + __set_current_state(TASK_INTERRUPTIBLE); schedule(); } return -ERESTARTNOHAND; @@ -3563,7 +3563,7 @@ int sigsuspend(sigset_t *set) current->saved_sigmask = current->blocked; set_current_blocked(set); - current->state = TASK_INTERRUPTIBLE; + __set_current_state(TASK_INTERRUPTIBLE); schedule(); set_restore_sigmask(); return -ERESTARTNOHAND; -- cgit v1.2.3 From 73d7e3eac01da3cef32ab25cbc6a36a6202c4ea6 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 17 Feb 2015 13:45:44 -0800 Subject: kexec: remove never used member destination in kimage struct kimage has a member destination which is used to store the real destination address of each page when loading a segment from the user space buffer into the kernel. But we never retrieve the value stored in kimage->destination, so this member variable in kimage and its assignment operation are redundant code. for_each_kimage_entry() already does the work that kimage->destination is expected to do, so this patch simply cleans it up and removes it. Signed-off-by: Baoquan He Cc: "Eric W. Biederman" Cc: Vivek Goyal Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kexec.h | 2 -- kernel/kexec.c | 4 ---- 2 files changed, 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 9d957b7ae095..10da8e246317 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -122,8 +122,6 @@ struct kimage { kimage_entry_t *entry; kimage_entry_t *last_entry; - unsigned long destination; - unsigned long start; struct page *control_code_page; struct page *swap_page; diff --git a/kernel/kexec.c b/kernel/kexec.c index c85277639b34..35dcac4b5c1c 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -856,8 +856,6 @@ static int kimage_set_destination(struct kimage *image, destination &= PAGE_MASK; result = kimage_add_entry(image, destination | IND_DESTINATION); - if (result == 0) - image->destination = destination; return result; } @@ -869,8 +867,6 @@ static int kimage_add_page(struct kimage *image, unsigned long page) page &= PAGE_MASK; result = kimage_add_entry(image, page | IND_SOURCE); - if (result == 0) - image->destination += PAGE_SIZE; return result; } -- cgit v1.2.3 From ad69934987eb04c8c3f912b19db878f280e55c8f Mon Sep 17 00:00:00 2001 From: Alexander Kuleshov Date: Tue, 17 Feb 2015 13:45:47 -0800 Subject: kexec: fix a typo in comment Signed-off-by: Alexander Kuleshov Acked-by: "Eric W. Biederman" Acked-by: Vivek Goyal Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kexec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kexec.c b/kernel/kexec.c index 35dcac4b5c1c..e9a6be4d1ebb 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -444,7 +444,7 @@ arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs, } /* - * Free up memory used by kernel, initrd, and comand line. This is temporary + * Free up memory used by kernel, initrd, and command line. 
This is temporary * memory allocation which is not needed any more after these buffers have * been loaded into separate segments and have been copied elsewhere. */ -- cgit v1.2.3 From 518a0c716377e5f2c6d22957a5937ec5f328ead1 Mon Sep 17 00:00:00 2001 From: Geoff Levand Date: Tue, 17 Feb 2015 13:45:53 -0800 Subject: kexec: simplify conditional Simplify the code around one of the conditionals in the kexec_load syscall routine. The original code was confusing with a redundant check on KEXEC_ON_CRASH and comments outside of the conditional block. This change switches the order of the conditional check, and cleans up the comments for the conditional. There is no functional change to the code. Signed-off-by: Geoff Levand Acked-by: Vivek Goyal Cc: Arnd Bergmann Cc: Benjamin Herrenschmidt Cc: H. Peter Anvin Cc: Maximilian Attems Cc: Michal Marek Cc: Paul Bolle Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kexec.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/kexec.c b/kernel/kexec.c index e9a6be4d1ebb..38c25b1f2fd5 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -1284,19 +1284,22 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments, if (nr_segments > 0) { unsigned long i; - /* Loading another kernel to reboot into */ - if ((flags & KEXEC_ON_CRASH) == 0) - result = kimage_alloc_init(&image, entry, nr_segments, - segments, flags); - /* Loading another kernel to switch to if this one crashes */ - else if (flags & KEXEC_ON_CRASH) { - /* Free any current crash dump kernel before + if (flags & KEXEC_ON_CRASH) { + /* + * Loading another kernel to switch to if this one + * crashes. Free any current crash dump kernel before * we corrupt it. */ + kimage_free(xchg(&kexec_crash_image, NULL)); result = kimage_alloc_init(&image, entry, nr_segments, segments, flags); crash_map_reserved_pages(); + } else { + /* Loading another kernel to reboot into. */ + + result = kimage_alloc_init(&image, entry, nr_segments, + segments, flags); } if (result) goto out; -- cgit v1.2.3 From be02a1862304b126cd6ba4f347fa5db59460a776 Mon Sep 17 00:00:00 2001 From: Jan Kiszka Date: Tue, 17 Feb 2015 13:46:50 -0800 Subject: kernel/module.c: do not inline do_init_module() This provides a reliable breakpoint target, required for automatic symbol loading via the gdb helper command 'lx-symbols'. Signed-off-by: Jan Kiszka Acked-by: Rusty Russell Cc: Thomas Gleixner Cc: Jason Wessel Cc: Andi Kleen Cc: Ben Widawsky Cc: Borislav Petkov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 8426ad48362c..b34813f725e9 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3025,8 +3025,13 @@ static void do_free_init(struct rcu_head *head) kfree(m); } -/* This is where the real work happens */ -static int do_init_module(struct module *mod) +/* + * This is where the real work happens. + * + * Keep it uninlined to provide a reliable breakpoint target, e.g. for the gdb + * helper command 'lx-symbols'. 
+ */ +static noinline int do_init_module(struct module *mod) { int ret = 0; struct mod_initfree *freeinit; -- cgit v1.2.3 From 580c57f1076872ebc2427f898b927944ce170f2d Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 17 Feb 2015 13:48:00 -0800 Subject: seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO The value resulting from the SECCOMP_RET_DATA mask could exceed MAX_ERRNO when setting errno during a SECCOMP_RET_ERRNO filter action. This makes sure we have a reliable value being set, so that an invalid errno will not be ignored by userspace. Signed-off-by: Kees Cook Reported-by: Dmitry V. Levin Cc: Andy Lutomirski Cc: Will Drewry Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/seccomp.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 4ef9687ac115..4f44028943e6 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -629,7 +629,9 @@ static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd) switch (action) { case SECCOMP_RET_ERRNO: - /* Set the low-order 16-bits as a errno. */ + /* Set low-order bits as an errno, capped at MAX_ERRNO. */ + if (data > MAX_ERRNO) + data = MAX_ERRNO; syscall_set_return_value(current, task_pt_regs(current), -data, 0); goto skip; -- cgit v1.2.3 From 74b8a4cb6ce3685049ee124243a52238c5cabe55 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 17 Feb 2015 13:07:38 +0100 Subject: sched: Clarify ordering between task_rq_lock() and move_queued_task() There was a wee bit of confusion around the exact ordering here; clarify things. Reported-by: Kirill Tkhai Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paul E. McKenney Link: http://lkml.kernel.org/r/20150217121258.GM5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1f37fe7f77a4..fd0a6ed21849 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags) raw_spin_lock_irqsave(&p->pi_lock, *flags); rq = task_rq(p); raw_spin_lock(&rq->lock); + /* + * move_queued_task() task_rq_lock() + * + * ACQUIRE (rq->lock) + * [S] ->on_rq = MIGRATING [L] rq = task_rq() + * WMB (__set_task_cpu()) ACQUIRE (rq->lock); + * [S] ->cpu = new_cpu [L] task_rq() + * [L] ->on_rq + * RELEASE (rq->lock) + * + * If we observe the old cpu in task_rq_lock, the acquire of + * the old rq->lock will fully serialize against the stores. + * + * If we observe the new cpu in task_rq_lock, the acquire will + * pair with the WMB to ensure we must then also see migrating. + */ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) return rq; raw_spin_unlock(&rq->lock); -- cgit v1.2.3 From 3960c8c0c7891dfc0f7be687cbdabb0d6916d614 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 17 Feb 2015 13:22:25 +0100 Subject: sched: Make dl_task_time() use task_rq_lock() Kirill reported that a dl task can be throttled and dequeued at the same time. 
This happens when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATING, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 76 ------------------------------------------------- kernel/sched/deadline.c | 12 ++------ kernel/sched/sched.h | 76 +++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 79 insertions(+), 85 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fd0a6ed21849..f77acd98f50a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -306,82 +306,6 @@ __read_mostly int scheduler_running; */ int sysctl_sched_rt_runtime = 950000; -/* - * __task_rq_lock - lock the rq @p resides on. - */ -static inline struct rq *__task_rq_lock(struct task_struct *p) - __acquires(rq->lock) -{ - struct rq *rq; - - lockdep_assert_held(&p->pi_lock); - - for (;;) { - rq = task_rq(p); - raw_spin_lock(&rq->lock); - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) - return rq; - raw_spin_unlock(&rq->lock); - - while (unlikely(task_on_rq_migrating(p))) - cpu_relax(); - } -} - -/* - * task_rq_lock - lock p->pi_lock and lock the rq @p resides on. - */ -static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags) - __acquires(p->pi_lock) - __acquires(rq->lock) -{ - struct rq *rq; - - for (;;) { - raw_spin_lock_irqsave(&p->pi_lock, *flags); - rq = task_rq(p); - raw_spin_lock(&rq->lock); - /* - * move_queued_task() task_rq_lock() - * - * ACQUIRE (rq->lock) - * [S] ->on_rq = MIGRATING [L] rq = task_rq() - * WMB (__set_task_cpu()) ACQUIRE (rq->lock); - * [S] ->cpu = new_cpu [L] task_rq() - * [L] ->on_rq - * RELEASE (rq->lock) - * - * If we observe the old cpu in task_rq_lock, the acquire of - * the old rq->lock will fully serialize against the stores. - * - * If we observe the new cpu in task_rq_lock, the acquire will - * pair with the WMB to ensure we must then also see migrating. - */ - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) - return rq; - raw_spin_unlock(&rq->lock); - raw_spin_unlock_irqrestore(&p->pi_lock, *flags); - - while (unlikely(task_on_rq_migrating(p))) - cpu_relax(); - } -} - -static void __task_rq_unlock(struct rq *rq) - __releases(rq->lock) -{ - raw_spin_unlock(&rq->lock); -} - -static inline void -task_rq_unlock(struct rq *rq, struct task_struct *p, unsigned long *flags) - __releases(rq->lock) - __releases(p->pi_lock) -{ - raw_spin_unlock(&rq->lock); - raw_spin_unlock_irqrestore(&p->pi_lock, *flags); -} - /* * this_rq_lock - lock this runqueue and disable interrupts.
*/ diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index a027799ae130..e88847d9fc6a 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -511,16 +511,10 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) struct sched_dl_entity, dl_timer); struct task_struct *p = dl_task_of(dl_se); + unsigned long flags; struct rq *rq; -again: - rq = task_rq(p); - raw_spin_lock(&rq->lock); - if (rq != task_rq(p)) { - /* Task was moved, retrying. */ - raw_spin_unlock(&rq->lock); - goto again; - } + rq = task_rq_lock(current, &flags); /* * We need to take care of several possible races here: @@ -555,7 +549,7 @@ again: push_dl_task(rq); #endif unlock: - raw_spin_unlock(&rq->lock); + task_rq_unlock(rq, current, &flags); return HRTIMER_NORESTART; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0870db23d79c..dc0f435a2779 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1380,6 +1380,82 @@ static inline void sched_avg_update(struct rq *rq) { } extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period); +/* + * __task_rq_lock - lock the rq @p resides on. + */ +static inline struct rq *__task_rq_lock(struct task_struct *p) + __acquires(rq->lock) +{ + struct rq *rq; + + lockdep_assert_held(&p->pi_lock); + + for (;;) { + rq = task_rq(p); + raw_spin_lock(&rq->lock); + if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) + return rq; + raw_spin_unlock(&rq->lock); + + while (unlikely(task_on_rq_migrating(p))) + cpu_relax(); + } +} + +/* + * task_rq_lock - lock p->pi_lock and lock the rq @p resides on. + */ +static inline struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags) + __acquires(p->pi_lock) + __acquires(rq->lock) +{ + struct rq *rq; + + for (;;) { + raw_spin_lock_irqsave(&p->pi_lock, *flags); + rq = task_rq(p); + raw_spin_lock(&rq->lock); + /* + * move_queued_task() task_rq_lock() + * + * ACQUIRE (rq->lock) + * [S] ->on_rq = MIGRATING [L] rq = task_rq() + * WMB (__set_task_cpu()) ACQUIRE (rq->lock); + * [S] ->cpu = new_cpu [L] task_rq() + * [L] ->on_rq + * RELEASE (rq->lock) + * + * If we observe the old cpu in task_rq_lock, the acquire of + * the old rq->lock will fully serialize against the stores. + * + * If we observe the new cpu in task_rq_lock, the acquire will + * pair with the WMB to ensure we must then also see migrating. + */ + if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) + return rq; + raw_spin_unlock(&rq->lock); + raw_spin_unlock_irqrestore(&p->pi_lock, *flags); + + while (unlikely(task_on_rq_migrating(p))) + cpu_relax(); + } +} + +static inline void __task_rq_unlock(struct rq *rq) + __releases(rq->lock) +{ + raw_spin_unlock(&rq->lock); +} + +static inline void +task_rq_unlock(struct rq *rq, struct task_struct *p, unsigned long *flags) + __releases(rq->lock) + __releases(p->pi_lock) +{ + raw_spin_unlock(&rq->lock); + raw_spin_unlock_irqrestore(&p->pi_lock, *flags); +} + #ifdef CONFIG_SMP #ifdef CONFIG_PREEMPT -- cgit v1.2.3 From a79ec89fd8459f0de850898f432a2a57d60e64de Mon Sep 17 00:00:00 2001 From: Kirill Tkhai Date: Mon, 16 Feb 2015 15:38:34 +0300 Subject: sched/dl: Prevent enqueue of a sleeping task in dl_task_timer() A deadline task may be throttled and dequeued at the same time. 
This happens when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; Later the timer fires, but the task is still dequeued: dl_task_timer() enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */ Someone wakes it up: try_to_wake_up() enqueue_dl_entity() BUG_ON(on_dl_rq()) This patch fixes the problem: it prevents queueing !on_rq tasks on the dl_rq. Reported-by: Fengguang Wu Signed-off-by: Kirill Tkhai Signed-off-by: Peter Zijlstra (Intel) [ Wrote comment. ] Cc: Juri Lelli Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state") Link: http://lkml.kernel.org/r/1374601424090314@web4j.yandex.ru Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index e88847d9fc6a..9908c950d776 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -535,6 +535,26 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) sched_clock_tick(); update_rq_clock(rq); + + /* + * If the throttle happened during sched-out; like: + * + * schedule() + * deactivate_task() + * dequeue_task_dl() + * update_curr_dl() + * start_dl_timer() + * __dequeue_task_dl() + * prev->on_rq = 0; + * + * We can be both throttled and !queued. Replenish the counter + * but do not enqueue -- wait for our wakeup to do that. + */ + if (!task_on_rq_queued(p)) { + replenish_dl_entity(dl_se, dl_se); + goto unlock; + } + enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); if (dl_task(rq->curr)) check_preempt_curr_dl(rq, p, 0); -- cgit v1.2.3 From 06b1f8083d6ed379ec1207a96339f23e8f7abfcf Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Mon, 16 Feb 2015 19:20:07 +0100 Subject: sched: Fix preempt_schedule_common() triggering tracing recursion Since the function graph tracer needs to disable preemption, it might call preempt_schedule() after reenabling it if something triggered the need for rescheduling in between. Therefore we can't trace preempt_schedule() itself because we would face a function tracing recursion otherwise as the tracer is always called before PREEMPT_ACTIVE gets set to prevent that recursion. This is why preempt_schedule() is tagged as "notrace". But the same issue applies to every function called by preempt_schedule() before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is one such example. Unfortunately we forgot to tag it as notrace as well and as a result we are encountering tracing recursion since it got introduced by: a18b5d0181923 ("sched: Fix missing preemption opportunity") Let's fix that by applying the appropriate function tag to preempt_schedule_common().
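The recursion being fixed looks roughly like this (an assumed call flow, for illustration only):

	some_traced_function()
	  graph tracer entry hook      /* disables preemption            */
	  graph tracer exit hook       /* re-enables it; resched pending */
	    preempt_schedule()         /* notrace, so safe               */
	      preempt_schedule_common()  /* traced before this patch...  */
	        graph tracer hooks fire again
	          preempt_schedule()   /* ...and we recurse              */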
Reported-by: Huang Ying Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Acked-by: Steven Rostedt Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1424110807-15057-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f77acd98f50a..c314000f5e52 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2839,7 +2839,7 @@ void __sched schedule_preempt_disabled(void) preempt_disable(); } -static void preempt_schedule_common(void) +static void __sched notrace preempt_schedule_common(void) { do { __preempt_count_add(PREEMPT_ACTIVE); -- cgit v1.2.3 From bc9560155f4063bbc9be71bd69d6726d41b47653 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Thu, 12 Feb 2015 20:59:13 +0100 Subject: sched/completion: Serialize completion_done() with complete() Commit de30ec47302c "Remove unnecessary ->wait.lock serialization when reading completion state" was not correct: without the lock/unlock, code like stop_machine_from_inactive_cpu()'s while (!completion_done()) cpu_relax(); loop can return before complete() finishes the spin_unlock() that writes to this memory, hence the spin_unlock_wait() added here. While at it, change try_wait_for_completion() to use READ_ONCE(). Reported-by: Paul E. McKenney Reported-by: Davidlohr Bueso Tested-by: Paul E. McKenney Signed-off-by: Oleg Nesterov Signed-off-by: Peter Zijlstra (Intel) [ Added a comment with the barrier. ] Cc: Linus Torvalds Cc: Nicholas Mc Guire Cc: raghavendra.kt@linux.vnet.ibm.com Cc: waiman.long@hp.com Fixes: de30ec47302c ("sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state") Link: http://lkml.kernel.org/r/20150212195913.GA30430@redhat.com Signed-off-by: Ingo Molnar --- kernel/sched/completion.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c index 7052d3fd4e7b..8d0f35debf35 100644 --- a/kernel/sched/completion.c +++ b/kernel/sched/completion.c @@ -274,7 +274,7 @@ bool try_wait_for_completion(struct completion *x) * first without taking the lock so we can * return early in the blocking case. */ - if (!ACCESS_ONCE(x->done)) + if (!READ_ONCE(x->done)) return 0; spin_lock_irqsave(&x->wait.lock, flags); @@ -297,6 +297,21 @@ EXPORT_SYMBOL(try_wait_for_completion); */ bool completion_done(struct completion *x) { - return !!ACCESS_ONCE(x->done); + if (!READ_ONCE(x->done)) + return false; + + /* + * If ->done, we need to wait for complete() to release ->wait.lock + * otherwise we can end up freeing the completion before complete() + * is done referencing it. + * + * The RMB pairs with complete()'s RELEASE of ->wait.lock and orders + * the loads of ->done and ->wait.lock such that we cannot observe + * the lock before complete() acquires it while observing the ->done + * after it's acquired the lock. + */ + smp_rmb(); + spin_unlock_wait(&x->wait.lock); + return true; } EXPORT_SYMBOL(completion_done); -- cgit v1.2.3 From 9cff8adeaa34b5d2802f03f89803da57856b3b72 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 13 Feb 2015 15:49:17 +1100 Subject: sched: Prevent recursion in io_schedule() io_schedule() calls blk_flush_plug() which, depending on the contents of current->plug, can initiate arbitrary blk-io requests.
Note that this contrasts with blk_schedule_flush_plug() which requires all non-trivial work to be handed off to a separate thread. This makes it possible for io_schedule() to recurse, and initiating block requests could possibly call mempool_alloc() which, in times of memory pressure, uses io_schedule(). Apart from any stack usage issues, io_schedule() will not behave correctly when called recursively as delayacct_blkio_start() does not allow for repeated calls. So: - use ->in_iowait to detect recursion. Set it earlier, and restore it to the old value. - move the call to "raw_rq" after the call to blk_flush_plug(). As this is some sort of per-cpu thing, we want some chance that we are on the right CPU. - When io_schedule() is called recursively, use blk_schedule_flush_plug() which cannot further recurse. - as this makes io_schedule() a lot more complex and as io_schedule() must match io_schedule_timeout(), put all the changes in io_schedule_timeout() and make io_schedule() a simple wrapper for that. Signed-off-by: NeilBrown Signed-off-by: Peter Zijlstra (Intel) [ Moved the now rudimentary io_schedule() into sched.h. ] Cc: Jens Axboe Cc: Linus Torvalds Cc: Tony Battersby Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown Signed-off-by: Ingo Molnar --- include/linux/sched.h | 10 +++++++--- kernel/sched/core.c | 31 ++++++++++++------------------- 2 files changed, 19 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 8db31ef98d2f..cb5cdc777c8a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -363,9 +363,6 @@ extern void show_regs(struct pt_regs *); */ extern void show_stack(struct task_struct *task, unsigned long *sp); -void io_schedule(void); -long io_schedule_timeout(long timeout); - extern void cpu_init (void); extern void trap_init(void); extern void update_process_times(int user); @@ -422,6 +419,13 @@ extern signed long schedule_timeout_uninterruptible(signed long timeout); asmlinkage void schedule(void); extern void schedule_preempt_disabled(void); +extern long io_schedule_timeout(long timeout); + +static inline void io_schedule(void) +{ + io_schedule_timeout(MAX_SCHEDULE_TIMEOUT); +} + struct nsproxy; struct user_namespace; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c314000f5e52..daaea922f482 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4358,36 +4358,29 @@ EXPORT_SYMBOL_GPL(yield_to); * This task is about to go to sleep on IO. Increment rq->nr_iowait so * that process accounting knows that this is a task in IO wait state. */ -void __sched io_schedule(void) -{ - struct rq *rq = raw_rq(); - - delayacct_blkio_start(); - atomic_inc(&rq->nr_iowait); - blk_flush_plug(current); - current->in_iowait = 1; - schedule(); - current->in_iowait = 0; - atomic_dec(&rq->nr_iowait); - delayacct_blkio_end(); -} -EXPORT_SYMBOL(io_schedule); - long __sched io_schedule_timeout(long timeout) { - struct rq *rq = raw_rq(); + int old_iowait = current->in_iowait; + struct rq *rq; long ret; + current->in_iowait = 1; + if (old_iowait) + blk_schedule_flush_plug(current); + else + blk_flush_plug(current); + delayacct_blkio_start(); + rq = raw_rq(); atomic_inc(&rq->nr_iowait); - blk_flush_plug(current); - current->in_iowait = 1; ret = schedule_timeout(timeout); - current->in_iowait = 0; + current->in_iowait = old_iowait; atomic_dec(&rq->nr_iowait); delayacct_blkio_end(); + return ret; } +EXPORT_SYMBOL(io_schedule_timeout); /** * sys_sched_get_priority_max - return maximum RT priority.
-- cgit v1.2.3 From 29183a70b0b828500816bd794b3fe192fce89f73 Mon Sep 17 00:00:00 2001 From: John Stultz Date: Mon, 9 Feb 2015 23:30:36 -0800 Subject: ntp: Fixup adjtimex freq validation on 32-bit systems Additional validation of adjtimex freq values to avoid potential multiplication overflows was added in commit 5e5aeb4367b ("time: adjtimex: Validate the ADJ_FREQUENCY values"). Unfortunately the patch used LONG_MAX/MIN instead of LLONG_MAX/MIN, which was fine on 64-bit systems, but being much smaller on 32-bit systems caused false positives resulting in most direct frequency adjustments to fail w/ EINVAL. ntpd only does direct frequency adjustments at startup, so the issue was not as easily observed there, but other time sync applications like ptpd and chrony were more affected by the bug. See bugs: https://bugzilla.kernel.org/show_bug.cgi?id=92481 https://bugzilla.redhat.com/show_bug.cgi?id=1188074 This patch changes the checks to use LLONG_MAX for clarity, and additionally the checks are disabled on 32-bit systems since LLONG_MAX/PPM_SCALE is always larger than the 32-bit long freq value, so multiplication overflows aren't possible there. Reported-by: Josh Boyer Reported-by: George Joseph Tested-by: George Joseph Signed-off-by: John Stultz Signed-off-by: Peter Zijlstra (Intel) Cc: # v3.19+ Cc: Linus Torvalds Cc: Sasha Levin Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org [ Prettified the changelog and the comments a bit. ] Signed-off-by: Ingo Molnar --- kernel/time/ntp.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index 4b585e0fdd22..0f60b08a4f07 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -633,10 +633,14 @@ int ntp_validate_timex(struct timex *txc) if ((txc->modes & ADJ_SETOFFSET) && (!capable(CAP_SYS_TIME))) return -EPERM; - if (txc->modes & ADJ_FREQUENCY) { - if (LONG_MIN / PPM_SCALE > txc->freq) + /* + * Check for potential multiplication overflows that can + * only happen on 64-bit systems: + */ + if ((txc->modes & ADJ_FREQUENCY) && (BITS_PER_LONG == 64)) { + if (LLONG_MIN / PPM_SCALE > txc->freq) return -EINVAL; - if (LONG_MAX / PPM_SCALE < txc->freq) + if (LLONG_MAX / PPM_SCALE < txc->freq) return -EINVAL; } -- cgit v1.2.3 From 6f1607f1bdb4f9991a8123675a03c1764b2932fe Mon Sep 17 00:00:00 2001 From: Kirill Tkhai Date: Wed, 4 Feb 2015 12:09:32 +0300 Subject: sched/dl: Do update_rq_clock() in yield_task_dl() update_curr_dl() needs actual rq clock. Signed-off-by: Kirill Tkhai Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1423040972.18770.10.camel@tkhai Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9908c950d776..3fa8fa6d9403 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -912,6 +912,7 @@ static void yield_task_dl(struct rq *rq) rq->curr->dl.dl_yielded = 1; p->dl.runtime = 0; } + update_rq_clock(rq); update_curr_dl(rq); } -- cgit v1.2.3 From 1fe89e1b6d270aa0d3452c60d38461ea589594e3 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 9 Feb 2015 11:53:18 +0100 Subject: sched/autogroup: Fix failure to set cpu.rt_runtime_us Because task_group() uses a cache of autogroup_task_group(), whose output depends on sched_class, switching classes can generate problems.
In particular, when started as fair, the cache points to the autogroup, so when switching to RT the tg_rt_schedulable() test fails for every cpu.rt_{runtime,period}_us change because now the autogroup has tasks and no runtime. Furthermore, going back to the previous semantics of varying task_group() with sched_class has the down-side that the sched_debug output varies as well, even though the task really is in the autogroup. Therefore add an autogroup exception to tg_has_rt_tasks() -- such that both (all) task_group() usages in sched/core now have one. And remove all the remnants of the variable task_group() output. Reported-by: Zefan Li Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Stefan Bader Fixes: 8323f26ce342 ("sched: Fix race in task_group()") Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/auto_group.c | 6 +----- kernel/sched/core.c | 6 ++++++ 2 files changed, 7 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/auto_group.c b/kernel/sched/auto_group.c index 8a2e230fb86a..eae160dd669d 100644 --- a/kernel/sched/auto_group.c +++ b/kernel/sched/auto_group.c @@ -87,8 +87,7 @@ static inline struct autogroup *autogroup_create(void) * so we don't have to move tasks around upon policy change, * or flail around trying to allocate bandwidth on the fly. * A bandwidth exception in __sched_setscheduler() allows - * the policy change to proceed. Thereafter, task_group() - * returns &root_task_group, so zero bandwidth is required. + * the policy change to proceed. */ free_rt_sched_group(tg); tg->rt_se = root_task_group.rt_se; @@ -115,9 +114,6 @@ bool task_wants_autogroup(struct task_struct *p, struct task_group *tg) if (tg != &root_task_group) return false; - if (p->sched_class != &fair_sched_class) - return false; - /* * We can only assume the task group can't go away on us if * autogroup_move_group() can see us on ->thread_group list. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index daaea922f482..03a67f09404c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7577,6 +7577,12 @@ static inline int tg_has_rt_tasks(struct task_group *tg) { struct task_struct *g, *p; + /* + * Autogroups do not have RT tasks; see autogroup_create(). + */ + if (task_group_is_autogroup(tg)) + return 0; + for_each_process_thread(g, p) { if (rt_task(p) && task_group(p) == tg) return 1; -- cgit v1.2.3 From 2636ed5f8d15ff9395731593537b4b3fdf2af24d Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 9 Feb 2015 12:23:20 +0100 Subject: sched/rt: Avoid obvious configuration fail Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it would disallow the kernel creating RT tasks. One can of course still set it to 1, which will (likely) still wreck your kernel, but at least make it clear that setting it to 0 is not good. Collect both sanity checks into the one place while we're there. 
Suggested-by: Zefan Li Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Link: http://lkml.kernel.org/r/20150209112715.GO24151@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 03a67f09404c..a4869bd426ca 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7675,6 +7675,17 @@ static int tg_set_rt_bandwidth(struct task_group *tg, { int i, err = 0; + /* + * Disallowing the root group RT runtime is BAD, it would disallow the + * kernel creating (and or operating) RT threads. + */ + if (tg == &root_task_group && rt_runtime == 0) + return -EINVAL; + + /* No period doesn't make any sense. */ + if (rt_period == 0) + return -EINVAL; + mutex_lock(&rt_constraints_mutex); read_lock(&tasklist_lock); err = __rt_schedulable(tg, rt_period, rt_runtime); @@ -7731,9 +7742,6 @@ static int sched_group_set_rt_period(struct task_group *tg, long rt_period_us) rt_period = (u64)rt_period_us * NSEC_PER_USEC; rt_runtime = tg->rt_bandwidth.rt_runtime; - if (rt_period == 0) - return -EINVAL; - return tg_set_rt_bandwidth(tg, rt_period, rt_runtime); } -- cgit v1.2.3 From 146755923262037fc4c54abc28c04b1103f3cc51 Mon Sep 17 00:00:00 2001 From: Jay Lan Date: Mon, 29 Sep 2014 15:36:57 -0700 Subject: kdb: fix incorrect counts in KDB summary command output The output of the KDB 'summary' command should report MemTotal, MemFree and Buffers in kB. The current code reports them in units of pages. A define K(x) is provided in the code but not used. This patch applies the define to convert the values to kB. Please include me on Cc on replies. I do not subscribe to linux-kernel. Signed-off-by: Jay Lan Cc: Signed-off-by: Jason Wessel --- kernel/debug/kdb/kdb_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 7b40c5f07dce..c6d2c0b1565d 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -2583,7 +2583,7 @@ static int kdb_summary(int argc, const char **argv) #define K(x) ((x) << (PAGE_SHIFT - 10)) kdb_printf("\nMemTotal: %8lu kB\nMemFree: %8lu kB\n" "Buffers: %8lu kB\n", - val.totalram, val.freeram, val.bufferram); + K(val.totalram), K(val.freeram), K(val.bufferram)); return 0; } -- cgit v1.2.3 From df0036d117e6c9df36324e517728e33543065f9a Mon Sep 17 00:00:00 2001 From: Jason Wessel Date: Thu, 8 Jan 2015 15:46:55 -0600 Subject: kdb: Fix off by one error in kdb_cpu() There was a follow on replacement patch against the prior "kgdb: Timeout if secondary CPUs ignore the roundup". See: https://lkml.org/lkml/2015/1/7/442 This patch is the delta vs the patch that was committed upstream: * Fix an off-by-one error in kdb_cpu(). * Replace NR_CPUS with CONFIG_NR_CPUS to tell checkpatch that we really want a static limit. * Removed the "KGDB: " prefix from the pr_crit() in debug_core.c (kgdb-next contains a patch which introduced pr_fmt() to this file, so the tag will now be applied automatically).
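To spell the off-by-one out: kgdb_info[] has NR_CPUS entries, so valid indices run from 0 to NR_CPUS - 1, and the old '(cpunum > NR_CPUS)' test let cpunum == NR_CPUS through, indexing one element past the end of the array. The corrected bound (a commented reading of the fix in the diff below, not new code):

	/* valid indices: 0 .. CONFIG_NR_CPUS - 1; reject everything else
	 * before kgdb_info[cpunum] is dereferenced */
	if ((cpunum >= CONFIG_NR_CPUS) || !kgdb_info[cpunum].enter_kgdb)
		return KDB_BADCPUNUM;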
Cc: Daniel Thompson Cc: Signed-off-by: Jason Wessel --- kernel/debug/debug_core.c | 2 +- kernel/debug/kdb/kdb_main.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index 07ce18ca71e0..ac5c0f9c7a20 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -604,7 +604,7 @@ return_normal: online_cpus) cpu_relax(); if (!time_left) - pr_crit("KGDB: Timed out waiting for secondary CPUs.\n"); + pr_crit("Timed out waiting for secondary CPUs.\n"); /* * At this point the primary processor is completely diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index c6d2c0b1565d..60f6bb817f70 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -2256,7 +2256,7 @@ static int kdb_cpu(int argc, const char **argv) /* * Validate cpunum */ - if ((cpunum > NR_CPUS) || !kgdb_info[cpunum].enter_kgdb) + if ((cpunum >= CONFIG_NR_CPUS) || !kgdb_info[cpunum].enter_kgdb) return KDB_BADCPUNUM; dbg_switch_cpu = cpunum; -- cgit v1.2.3 From f7d4ca8bbfda23b4f1eae9b6757ff64166b093d5 Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Fri, 7 Nov 2014 18:37:57 +0000 Subject: kdb: Avoid printing KERN_ levels to consoles Currently when kdb traps printk messages then the raw log level prefix (consisting of '\001' followed by a numeral) does not get stripped off before the message is issued to the various I/O handlers supported by kdb. This causes annoying visual noise as well as causing problems grepping for ^. It is also a change of behaviour compared to normal printk() usage. For example -h ends up with different output to that of kdb's "sr h". This patch addresses the problem by stripping log levels from messages before they are issued to the I/O handlers. printk(), which can also act as an I/O handler in some cases, is special-cased: if the caller provided a log level then the prefix will be preserved when sent to printk(). The addition of non-printable characters to the output of kdb commands is a regression, albeit an extremely elderly one, introduced by commit 04d2c8c83d0e ("printk: convert the format for KERN_<LEVEL> to a 2 byte pattern"). Note also that this patch does *not* restore the original behaviour from v3.5. Instead it makes printk() from within a kdb command display the message without any prefix (i.e. like printk() normally does).
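The helpers this relies on come from include/linux/printk.h and behave as follows (a small usage sketch, not part of the patch):

	const char *msg = "\0016MemTotal: ...";  /* KERN_INFO-prefixed text */

	char lvl = printk_get_level(msg);         /* '6'; 0 if no prefix     */
	const char *txt = printk_skip_level(msg); /* points just past "\0016" */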
Signed-off-by: Daniel Thompson Cc: Joe Perches Cc: stable@vger.kernel.org Signed-off-by: Jason Wessel --- include/linux/kdb.h | 8 +++++++- kernel/debug/kdb/kdb_io.c | 22 +++++++++++++--------- kernel/printk/printk.c | 2 +- 3 files changed, 21 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/kdb.h b/include/linux/kdb.h index 75ae2e2631fc..a19bcf9e762e 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -156,8 +156,14 @@ typedef enum { KDB_REASON_SYSTEM_NMI, /* In NMI due to SYSTEM cmd; regs valid */ } kdb_reason_t; +enum kdb_msgsrc { + KDB_MSGSRC_INTERNAL, /* direct call to kdb_printf() */ + KDB_MSGSRC_PRINTK, /* trapped from printk() */ +}; + extern int kdb_trap_printk; -extern __printf(1, 0) int vkdb_printf(const char *fmt, va_list args); +extern __printf(2, 0) int vkdb_printf(enum kdb_msgsrc src, const char *fmt, + va_list args); extern __printf(1, 2) int kdb_printf(const char *, ...); typedef __printf(1, 2) int (*kdb_printf_t)(const char *, ...); diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c index 7c70812caea5..a550afb99ebe 100644 --- a/kernel/debug/kdb/kdb_io.c +++ b/kernel/debug/kdb/kdb_io.c @@ -548,7 +548,7 @@ static int kdb_search_string(char *searched, char *searchfor) return 0; } -int vkdb_printf(const char *fmt, va_list ap) +int vkdb_printf(enum kdb_msgsrc src, const char *fmt, va_list ap) { int diag; int linecount; @@ -691,19 +691,20 @@ kdb_printit: * Write to all consoles. */ retlen = strlen(kdb_buffer); + cp = (char *) printk_skip_level(kdb_buffer); if (!dbg_kdb_mode && kgdb_connected) { - gdbstub_msg_write(kdb_buffer, retlen); + gdbstub_msg_write(cp, retlen - (cp - kdb_buffer)); } else { if (dbg_io_ops && !dbg_io_ops->is_console) { - len = retlen; - cp = kdb_buffer; + len = retlen - (cp - kdb_buffer); + cp2 = cp; while (len--) { - dbg_io_ops->write_char(*cp); - cp++; + dbg_io_ops->write_char(*cp2); + cp2++; } } while (c) { - c->write(c, kdb_buffer, retlen); + c->write(c, cp, retlen - (cp - kdb_buffer)); touch_nmi_watchdog(); c = c->next; } @@ -711,7 +712,10 @@ kdb_printit: if (logging) { saved_loglevel = console_loglevel; console_loglevel = CONSOLE_LOGLEVEL_SILENT; - printk(KERN_INFO "%s", kdb_buffer); + if (printk_get_level(kdb_buffer) || src == KDB_MSGSRC_PRINTK) + printk("%s", kdb_buffer); + else + pr_info("%s", kdb_buffer); } if (KDB_STATE(PAGER)) { @@ -844,7 +848,7 @@ int kdb_printf(const char *fmt, ...) int r; va_start(ap, fmt); - r = vkdb_printf(fmt, ap); + r = vkdb_printf(KDB_MSGSRC_INTERNAL, fmt, ap); va_end(ap); return r; diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 02d6b6d28796..fae29e3ffbf0 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1811,7 +1811,7 @@ int vprintk_default(const char *fmt, va_list args) #ifdef CONFIG_KGDB_KDB if (unlikely(kdb_trap_printk)) { - r = vkdb_printf(fmt, args); + r = vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args); return r; } #endif -- cgit v1.2.3 From 5454388113938d9592d6a6d5424469014da4ee86 Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Thu, 6 Nov 2014 15:32:13 +0000 Subject: kdb: Remove stack dump when entering kgdb due to NMI Issuing a stack dump feels ergonomically wrong when entering due to NMI. Entering due to NMI is normally a reaction to a user request, either the NMI button on a server or a "magic knock" on a UART. Therefore the backtrace behaviour on entry due to NMI should be like SysRq-g (no stack dump) rather than like oops. 
Note also that the stack dump does not offer any information that cannot be trivially retrieved using the 'bt' command. Signed-off-by: Daniel Thompson Signed-off-by: Jason Wessel --- kernel/debug/kdb/kdb_main.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 60f6bb817f70..0a22f455060a 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -1247,7 +1247,6 @@ static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs, kdb_printf("due to NonMaskable Interrupt @ " kdb_machreg_fmt "\n", instruction_pointer(regs)); - kdb_dumpregs(regs); break; case KDB_REASON_SSTEP: case KDB_REASON_BREAK: -- cgit v1.2.3 From ab08e464a2cd8242fdc6e4f87f3480808364a97a Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Thu, 11 Sep 2014 09:58:29 +0100 Subject: kdb: Fix a prompt management bug when using | grep Currently when the "| grep" feature is used to filter the output of a command then the prompt is not displayed for the subsequent command. Likewise any characters typed by the user are also not echoed to the display. This rather disconcerting problem eventually corrects itself when the user presses Enter and the kdb_grepping_flag is cleared as kdb_parse() tries to make sense of whatever they typed. This patch resolves the problem by moving the clearing of this flag from the middle of command processing to the beginning. Signed-off-by: Daniel Thompson Signed-off-by: Jason Wessel --- kernel/debug/kdb/kdb_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 0a22f455060a..420418360b81 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -915,13 +915,12 @@ int kdb_parse(const char *cmdstr) char *cp; char *cpp, quoted; kdbtab_t *tp; - int i, escaped, ignore_errors = 0, check_grep; + int i, escaped, ignore_errors = 0, check_grep = 0; /* * First tokenize the command string. */ cp = (char *)cmdstr; - kdb_grepping_flag = check_grep = 0; if (KDB_FLAG(CMD_INTERRUPT)) { /* Previous command was interrupted, newline must not @@ -1280,6 +1279,7 @@ static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs, */ kdb_nextline = 1; KDB_STATE_CLEAR(SUPPRESS); + kdb_grepping_flag = 0; cmdbuf = cmd_cur; *cmdbuf = '\0'; -- cgit v1.2.3 From fb6daa7520f9d17a97e84a3d5a947819e0313f28 Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Thu, 11 Sep 2014 10:37:10 +0100 Subject: kdb: Provide forward search at more prompt Currently kdb allows the output of commands to be filtered using the | grep feature. This is useful but does not permit the output emitted shortly after a string match to be examined without wading through the entire unfiltered output of the command. Such a feature is particularly useful to navigate function traces because these traces often have a useful trigger string *before* the point of interest. This patch reuses the existing filtering logic to introduce a simple forward search to kdb that can be triggered from the more prompt.
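Once applied, a session looks roughly like this (illustrative; the search prompt string is taken from the SEARCHPROMPT environment variable and defaults to "search> ", as the diff below shows):

	kdb> ftdump
	...
	more> /
	search> my_trigger_func
	...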
Signed-off-by: Daniel Thompson Signed-off-by: Jason Wessel --- kernel/debug/kdb/kdb_io.c | 22 ++++++++++++++++++++-- kernel/debug/kdb/kdb_main.c | 7 ++++--- kernel/debug/kdb/kdb_private.h | 2 ++ 3 files changed, 26 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c index a550afb99ebe..ca13e6215537 100644 --- a/kernel/debug/kdb/kdb_io.c +++ b/kernel/debug/kdb/kdb_io.c @@ -680,6 +680,12 @@ int vkdb_printf(enum kdb_msgsrc src, const char *fmt, va_list ap) size_avail = sizeof(kdb_buffer) - len; goto kdb_print_out; } + if (kdb_grepping_flag >= KDB_GREPPING_FLAG_SEARCH) + /* + * This was a interactive search (using '/' at more + * prompt) and it has completed. Clear the flag. + */ + kdb_grepping_flag = 0; /* * at this point the string is a full line and * should be printed, up to the null. @@ -798,11 +804,23 @@ kdb_printit: kdb_nextline = linecount - 1; kdb_printf("\r"); suspend_grep = 1; /* for this recursion */ + } else if (buf1[0] == '/' && !kdb_grepping_flag) { + kdb_printf("\r"); + kdb_getstr(kdb_grep_string, KDB_GREP_STRLEN, + kdbgetenv("SEARCHPROMPT") ?: "search> "); + *strchrnul(kdb_grep_string, '\n') = '\0'; + kdb_grepping_flag += KDB_GREPPING_FLAG_SEARCH; + suspend_grep = 1; /* for this recursion */ } else if (buf1[0] && buf1[0] != '\n') { /* user hit something other than enter */ suspend_grep = 1; /* for this recursion */ - kdb_printf("\nOnly 'q' or 'Q' are processed at more " - "prompt, input ignored\n"); + if (buf1[0] != '/') + kdb_printf( + "\nOnly 'q', 'Q' or '/' are processed at " + "more prompt, input ignored\n"); + else + kdb_printf("\n'/' cannot be used during | " + "grep filtering, input ignored\n"); } else if (kdb_grepping_flag) { /* user hit enter */ suspend_grep = 1; /* for this recursion */ diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 420418360b81..4121345498e0 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -50,8 +50,7 @@ static int kdb_cmd_enabled = CONFIG_KDB_DEFAULT_ENABLE; module_param_named(cmd_enable, kdb_cmd_enabled, int, 0600); -#define GREP_LEN 256 -char kdb_grep_string[GREP_LEN]; +char kdb_grep_string[KDB_GREP_STRLEN]; int kdb_grepping_flag; EXPORT_SYMBOL(kdb_grepping_flag); int kdb_grep_leading; @@ -870,7 +869,7 @@ static void parse_grep(const char *str) len = strlen(cp); if (!len) return; - if (len >= GREP_LEN) { + if (len >= KDB_GREP_STRLEN) { kdb_printf("search string too long\n"); return; } @@ -1280,6 +1279,8 @@ static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs, kdb_nextline = 1; KDB_STATE_CLEAR(SUPPRESS); kdb_grepping_flag = 0; + /* ensure the old search does not leak into '/' commands */ + kdb_grep_string[0] = '\0'; cmdbuf = cmd_cur; *cmdbuf = '\0'; diff --git a/kernel/debug/kdb/kdb_private.h b/kernel/debug/kdb/kdb_private.h index eaacd1693954..da93894db7ba 100644 --- a/kernel/debug/kdb/kdb_private.h +++ b/kernel/debug/kdb/kdb_private.h @@ -196,7 +196,9 @@ extern int kdb_main_loop(kdb_reason_t, kdb_reason_t, /* Miscellaneous functions and data areas */ extern int kdb_grepping_flag; +#define KDB_GREPPING_FLAG_SEARCH 0x8000 extern char kdb_grep_string[]; +#define KDB_GREP_STRLEN 256 extern int kdb_grep_leading; extern int kdb_grep_trailing; extern char *kdb_cmds[]; -- cgit v1.2.3 From 32d375f6f24c3e4c9c235672695b4c314cf6b964 Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Thu, 11 Sep 2014 10:41:12 +0100 Subject: kdb: Const qualifier for kdb_getstr's prompt argument All current callers of kdb_getstr() 
can pass constant pointers via the prompt argument. This patch adds a const qualification to make explicit the fact that this is safe. Signed-off-by: Daniel Thompson Signed-off-by: Jason Wessel --- kernel/debug/kdb/kdb_io.c | 2 +- kernel/debug/kdb/kdb_private.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c index ca13e6215537..fc1ef736253c 100644 --- a/kernel/debug/kdb/kdb_io.c +++ b/kernel/debug/kdb/kdb_io.c @@ -439,7 +439,7 @@ poll_again: * substituted for %d, %x or %o in the prompt. */ -char *kdb_getstr(char *buffer, size_t bufsize, char *prompt) +char *kdb_getstr(char *buffer, size_t bufsize, const char *prompt) { if (prompt && kdb_prompt_str != prompt) strncpy(kdb_prompt_str, prompt, CMD_BUFLEN); diff --git a/kernel/debug/kdb/kdb_private.h b/kernel/debug/kdb/kdb_private.h index da93894db7ba..75014d7f4568 100644 --- a/kernel/debug/kdb/kdb_private.h +++ b/kernel/debug/kdb/kdb_private.h @@ -211,7 +211,7 @@ extern void kdb_ps1(const struct task_struct *p); extern void kdb_print_nameval(const char *name, unsigned long val); extern void kdb_send_sig_info(struct task_struct *p, struct siginfo *info); extern void kdb_meminfo_proc_show(void); -extern char *kdb_getstr(char *, size_t, char *); +extern char *kdb_getstr(char *, size_t, const char *); extern void kdb_gdb_state_pass(char *buf); /* Defines for kdb_symbol_print */ -- cgit v1.2.3 From 5516fd7b92a7c16b9111a88aaa1b13dd937d8ac5 Mon Sep 17 00:00:00 2001 From: Colin Cross Date: Wed, 28 Jan 2015 17:02:14 +0530 Subject: debug: prevent entering debug mode on panic/exception. On non-developer devices, kgdb prevents the device from rebooting after a panic. In case of panics and exceptions, to allow the device to reboot, prevent entering debug mode to avoid getting stuck waiting for the user to interact with the debugger. To avoid entering the debugger on panic/exception without any extra configuration, panic_timeout is being used, which can be set via /proc/sys/kernel/panic at run time while CONFIG_PANIC_TIMEOUT sets the default value. Setting panic_timeout indicates that the user requested the machine to perform an unattended reboot after a panic. We don't want to get stuck waiting for user input in case of a panic. Cc: Andrew Morton Cc: kgdb-bugreport@lists.sourceforge.net Cc: linux-kernel@vger.kernel.org Cc: Android Kernel Team Cc: John Stultz Cc: Sumit Semwal Signed-off-by: Colin Cross [Kiran: Added context to commit message. panic_timeout is used instead of break_on_panic and break_on_exception to honor CONFIG_PANIC_TIMEOUT. Modified the commit as per community feedback] Signed-off-by: Kiran Raparthy Signed-off-by: Daniel Thompson Signed-off-by: Jason Wessel --- kernel/debug/debug_core.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'kernel') diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index ac5c0f9c7a20..0874e2edd275 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -696,6 +696,14 @@ kgdb_handle_exception(int evector, int signo, int ecode, struct pt_regs *regs) if (arch_kgdb_ops.enable_nmi) arch_kgdb_ops.enable_nmi(0); + /* + * Avoid entering the debugger if we were triggered due to an oops + * but panic_timeout indicates the system should automatically + * reboot on panic. We don't want to get stuck waiting for input + * on such systems, especially if its "just" an oops.
+ */ + if (signo != SIGTRAP && panic_timeout) + return 1; memset(ks, 0, sizeof(struct kgdb_state)); ks->cpu = raw_smp_processor_id(); @@ -828,6 +836,15 @@ static int kgdb_panic_event(struct notifier_block *self, unsigned long val, void *data) { + /* + * Avoid entering the debugger if we were triggered due to a panic + * We don't want to get stuck waiting for input from user in such case. + * panic_timeout indicates the system should automatically + * reboot on panic. + */ + if (panic_timeout) + return NOTIFY_DONE; + if (dbg_kdb_mode) kdb_printf("PANIC: %s\n", (char *)data); kgdb_breakpoint(); -- cgit v1.2.3
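For completeness, the run-time knob referred to above can be set from user space roughly like so (an illustrative sketch; the 5-second value is arbitrary):

	#include <fcntl.h>
	#include <unistd.h>

	/* Ask for an unattended reboot 5 seconds after a panic; with this
	 * patch, a non-zero timeout also keeps the kernel out of kgdb/kdb
	 * on panic and oops. */
	int fd = open("/proc/sys/kernel/panic", O_WRONLY);
	if (fd >= 0) {
		write(fd, "5", 1);
		close(fd);
	}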