summaryrefslogtreecommitdiff
path: root/kernel/cpu.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-30cpu/hotplug: Prevent self deadlock on CPU hot-unplugThomas Gleixner1-1/+23
Xiongfeng reported and debugged a self deadlock of the task which initiates and controls a CPU hot-unplug operation vs. the CFS bandwidth timer. CPU1 CPU2 T1 sets cfs_quota starts hrtimer cfs_bandwidth 'period_timer' T1 is migrated to CPU2 T1 initiates offlining of CPU1 Hotplug operation starts ... 'period_timer' expires and is re-enqueued on CPU1 ... take_cpu_down() CPU1 shuts down and does not handle timers anymore. They have to be migrated in the post dead hotplug steps by the control task. T1 runs the post dead offline operation T1 is scheduled out T1 waits for 'period_timer' to expire T1 waits there forever if it is scheduled out before it can execute the hrtimer offline callback hrtimers_dead_cpu(). Cure this by delegating the hotplug control operation to a worker thread on an online CPU. This takes the initiating user space task, which might be affected by the bandwidth timer, completely out of the picture. Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Yu Liao <liaoyu15@huawei.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx
2023-07-31cpu/SMT: Fix cpu_smt_possible() commentZhang Rui1-1/+1
Commit e1572f1d08be ("cpu/SMT: create and export cpu_smt_possible()") introduces cpu_smt_possible() to represent if SMT is theoretically possible. It returns true when SMT is supported and not forcefully disabled ('nosmt=force'). But the comment of it says "Returns true if SMT is not supported of forcefully (irreversibly) disabled", which is wrong. Fix that comment accordingly. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230728155313.44170-1-rui.zhang@intel.com
2023-07-28cpu/SMT: Allow enabling partial SMT states via sysfsMichael Ellerman1-16/+44
Add support to the /sys/devices/system/cpu/smt/control interface for enabling a specified number of SMT threads per core, including partial SMT states where not all threads are brought online. The current interface accepts "on" and "off", to enable either 1 or all SMT threads per core. This commit allows writing an integer, between 1 and the number of SMT threads supported by the machine. Writing 1 is a synonym for "off", 2 or more enables SMT with the specified number of threads. When reading the file, if all threads are online "on" is returned, to avoid changing behaviour for existing users. If some other number of threads is online then the integer value is returned. Architectures like x86 only supporting 1 thread or all threads, should not define CONFIG_SMT_NUM_THREADS_DYNAMIC. Architecture supporting partial SMT states, like PowerPC, should define it. [ ldufour: Slightly reword the commit's description ] [ ldufour: Remove switch() in __store_smt_control() ] [ ldufour: Rix build issue in control_show() ] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-8-ldufour@linux.ibm.com
2023-07-28cpu/SMT: Create topology_smt_thread_allowed()Michael Ellerman1-1/+23
Some architectures allows partial SMT states, i.e. when not all SMT threads are brought online. To support that, add an architecture helper which checks whether a given CPU is allowed to be brought online depending on how many SMT threads are currently enabled. Since this is only applicable to architecture supporting partial SMT, only these architectures should select the new configuration variable CONFIG_SMT_NUM_THREADS_DYNAMIC. For the other architectures, not supporting the partial SMT states, there is no need to define topology_cpu_smt_allowed(), the generic code assumed that all the threads are allowed or only the primary ones. Call the helper from cpu_smt_enable(), and cpu_smt_allowed() when SMT is enabled, to check if the particular thread should be onlined. Notably, also call it from cpu_smt_disable() if CPU_SMT_ENABLED, to allow offlining some threads to move from a higher to lower number of threads online. [ ldufour: Slightly reword the commit's description ] [ ldufour: Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-7-ldufour@linux.ibm.com
2023-07-28cpu/SMT: Remove topology_smt_supported()Laurent Dufour1-2/+2
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(), checking that value is enough to know whether SMT is supported. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-6-ldufour@linux.ibm.com
2023-07-28cpu/SMT: Store the current/max number of threadsMichael Ellerman1-1/+20
Some architectures allow partial SMT states at boot time, ie. when not all SMT threads are brought online. To support that the SMT code needs to know the maximum number of SMT threads, and also the currently configured number. The architecture code knows the max number of threads, so have the architecture code pass that value to cpu_smt_set_num_threads(). Note that although topology_max_smt_threads() exists, it is not configured early enough to be used here. As architecture, like PowerPC, allows the threads number to be set through the kernel command line, also pass that value. [ ldufour: Slightly reword the commit message ] [ ldufour: Rename cpu_smt_check_topology and add a num_threads argument ] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-5-ldufour@linux.ibm.com
2023-07-28cpu/SMT: Move smt/control simple exit cases earlierMichael Ellerman1-6/+6
Move the simple exit cases, i.e. those which don't depend on the value written, earlier in the function. That makes it clearer that regardless of the input those states cannot be transitioned out of. That does have a user-visible effect, in that the error returned will now always be EPERM/ENODEV for those states, regardless of the value written. Previously writing an invalid value would return EINVAL even when in those states. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-4-ldufour@linux.ibm.com
2023-07-28cpu/SMT: Move SMT prototypes into cpu_smt.hMichael Ellerman1-0/+1
In order to export the cpuhp_smt_control enum as part of the interface between generic and architecture code, the architecture code needs to include asm/topology.h. But that leads to circular header dependencies. So split the enum and related declarations into a separate header. [ ldufour: Reworded the commit's description ] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-3-ldufour@linux.ibm.com
2023-07-28cpu/hotplug: Remove dependancy against cpu_primary_thread_maskLaurent Dufour1-14/+10
The commit 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") introduce a dependancy against a global variable cpu_primary_thread_mask exported by the X86 code. This variable is only used when CONFIG_HOTPLUG_PARALLEL is set. Since cpuhp_get_primary_thread_mask() and cpuhp_smt_aware() are only used when CONFIG_HOTPLUG_PARALLEL is set, don't define them when it is not set. No functional change. Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-2-ldufour@linux.ibm.com
2023-05-23cpu/hotplug: Fix off by one in cpuhp_bringup_mask()Thomas Gleixner1-3/+3
cpuhp_bringup_mask() iterates over a cpumask and starts all present CPUs up to a caller provided upper limit. The limit variable is decremented and checked for 0 before invoking cpu_up(), which is obviously off by one and prevents the bringup of the last CPU when the limit is equal to the number of present CPUs. Move the decrement and check after the cpu_up() invocation. Fixes: 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") Reported-by: Mark Brown <broonie@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/87wn10ufj9.ffs@tglx
2023-05-15cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATEThomas Gleixner1-5/+98
There is often significant latency in the early stages of CPU bringup, and time is wasted by waking each CPU (e.g. with SIPI/INIT/INIT on x86) and then waiting for it to respond before moving on to the next. Allow a platform to enable parallel setup which brings all to be onlined CPUs up to the CPUHP_BP_KICK_AP state. While this state advancement on the control CPU (BP) is single-threaded the important part is the last state CPUHP_BP_KICK_AP which wakes the to be onlined CPUs up. This allows the CPUs to run up to the first sychronization point cpuhp_ap_sync_alive() where they wait for the control CPU to release them one by one for the full onlining procedure. This parallelism depends on the CPU hotplug core sync mechanism which ensures that the parallel brought up CPUs wait for release before touching any state which would make the CPU visible to anything outside the hotplug control mechanism. To handle the SMT constraints of X86 correctly the bringup happens in two iterations when CONFIG_HOTPLUG_SMT is enabled. The control CPU brings up the primary SMT threads of each core first, which can load the microcode without the need to rendevouz with the thread siblings. Once that's completed it brings up the secondary SMT threads. Co-developed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.240231377@linutronix.de
2023-05-15cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanismThomas Gleixner1-2/+68
The bring up logic of a to be onlined CPU consists of several parts, which are considered to be a single hotplug state: 1) Control CPU issues the wake-up 2) To be onlined CPU starts up, does the minimal initialization, reports to be alive and waits for release into the complete bring-up. 3) Control CPU waits for the alive report and releases the upcoming CPU for the complete bring-up. Allow to split this into two states: 1) Control CPU issues the wake-up After that the to be onlined CPU starts up, does the minimal initialization, reports to be alive and waits for release into the full bring-up. As this can run after the control CPU dropped the hotplug locks the code which is executed on the AP before it reports alive has to be carefully audited to not violate any of the hotplug constraints, especially not modifying any of the various cpumasks. This is really only meant to avoid waiting for the AP to react on the wake-up. Of course an architecture can move strict CPU related setup functionality, e.g. microcode loading, with care before the synchronization point to save further pointless waiting time. 2) Control CPU waits for the alive report and releases the upcoming CPU for the complete bring-up. This allows that the two states can be split up to run all to be onlined CPUs up to state #1 on the control CPU and then at a later point run state #2. This spares some of the latencies of the full serialized per CPU bringup by avoiding the per CPU wakeup/wait serialization. The assumption is that the first AP already waits when the last AP has been woken up. This obvioulsy depends on the hardware latencies and depending on the timings this might still not completely eliminate all wait scenarios. This split is just a preparatory step for enabling the parallel bringup later. The boot time bringup is still fully serialized. It has a separate config switch so that architectures which want to support parallel bringup can test the split of the CPUHP_BRINGUG step separately. To enable this the architecture must support the CPU hotplug core sync mechanism and has to be audited that there are no implicit hotplug state dependencies which require a fully serialized bringup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.080801387@linutronix.de
2023-05-15cpu/hotplug: Reset task stack state in _cpu_up()David Woodhouse1-6/+6
Commit dce1ca0525bf ("sched/scs: Reset task stack state in bringup_cpu()") ensured that the shadow call stack and KASAN poisoning were removed from a CPU's stack each time that CPU is brought up, not just once. This is not incorrect. However, with parallel bringup the idle thread setup will happen at a different step. As a consequence the cleanup in bringup_cpu() would be too late. Move the SCS/KASAN cleanup to the generic _cpu_up() function instead, which already ensures that the new CPU's stack is available, purely to allow for early failure. This occurs when the CPU to be brought up is in the CPUHP_OFFLINE state, which should correctly do the cleanup any time the CPU has been taken down to the point where such is needed. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.027075560@linutronix.de
2023-05-15cpu/hotplug: Add CPU state tracking and synchronizationThomas Gleixner1-1/+192
The CPU state tracking and synchronization mechanism in smpboot.c is completely independent of the hotplug code and all logic around it is implemented in architecture specific code. Except for the state reporting of the AP there is absolutely nothing architecture specific and the sychronization and decision functions can be moved into the generic hotplug core code. Provide an integrated variant and add the core synchronization and decision points. This comes in two flavours: 1) DEAD state synchronization Updated by the architecture code once the AP reaches the point where it is ready to be torn down by the control CPU, e.g. by removing power or clocks or tear down via the hypervisor. The control CPU waits for this state to be reached with a timeout. If the state is reached an architecture specific cleanup function is invoked. 2) Full state synchronization This extends #1 with AP alive synchronization. This is new functionality, which allows to replace architecture specific wait mechanims, e.g. cpumasks, completely. It also prevents that an AP which is in a limbo state can be brought up again. This can happen when an AP failed to report dead state during a previous off-line operation. The dead synchronization is what most architectures use. Only x86 makes a bringup decision based on that state at the moment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.476305035@linutronix.de
2023-05-15cpu/hotplug: Rework sparse_irq locking in bringup_cpu()Thomas Gleixner1-10/+24
There is no harm to hold sparse_irq lock until the upcoming CPU completes in cpuhp_online_idle(). This allows to remove cpu_online() synchronization from architecture code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.263722880@linutronix.de
2023-05-15cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __initThomas Gleixner1-1/+1
No point in keeping them around. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-03-28lazy tlb: introduce lazy tlb mm refcount helper functionsNicholas Piggin1-1/+1
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-17cpu/hotplug: move to use bus_get_dev_root()Greg Kroah-Hartman1-6/+17
Direct access to the struct bus_type dev_root pointer is going away soon so replace that with a call to bus_get_dev_root() instead, which is what it is there for. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Phil Auld <pauld@redhat.com> Cc: Steven Price <steven.price@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Link: https://lore.kernel.org/r/20230313182918.1312597-7-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02cpu/hotplug: Do not bail-out in DYING/STARTING sectionsVincent Donnefort1-16/+40
The DYING/STARTING callbacks are not expected to fail. However, as reported by Derek, buggy drivers such as tboot are still free to return errors within those sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable state. As there is no rollback possible, only log the failures and proceed with the following steps. This restores the hotplug behaviour prior to commit 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()") Fixes: 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()") Reported-by: Derek Dolney <z23@posteo.net> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Derek Dolney <z23@posteo.net> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867 Link: https://lore.kernel.org/r/20220927101259.1149636-1-vdonnefort@google.com
2022-12-02cpu/hotplug: Set cpuhp target for boot cpuPhil Auld1-0/+1
Since the boot cpu does not go through the hotplug process it ends up with state == CPUHP_ONLINE but target == CPUHP_OFFLINE. So set the target to match in boot_cpu_hotplug_init(). Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20221117162329.3164999-3-pauld@redhat.com
2022-12-02cpu/hotplug: Make target_store() a nop when target == statePhil Auld1-1/+3
Writing the current state back in hotplug/target calls cpu_down() which will set cpu dying even when it isn't and then nothing will ever clear it. A stress test that reads values and writes them back for all cpu device files in sysfs will trigger the BUG() in select_fallback_rq once all cpus are marked as dying. kernel/cpu.c::target_store() ... if (st->state < target) ret = cpu_up(dev->id, target); else ret = cpu_down(dev->id, target); cpu_down() -> cpu_set_state() bool bringup = st->state < target; ... if (cpu_dying(cpu) != !bringup) set_cpu_dying(cpu, !bringup); Fix this by letting state==target fall through in the target_store() conditional. Also make sure st->target == target in that case. Fixes: 757c989b9994 ("cpu/hotplug: Make target state writeable") Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com
2022-05-23Merge tag 'x86_tdx_for_v5.19_rc1' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Intel TDX support from Borislav Petkov: "Intel Trust Domain Extensions (TDX) support. This is the Intel version of a confidential computing solution called Trust Domain Extensions (TDX). This series adds support to run the kernel as part of a TDX guest. It provides similar guest protections to AMD's SEV-SNP like guest memory and register state encryption, memory integrity protection and a lot more. Design-wise, it differs from AMD's solution considerably: it uses a software module which runs in a special CPU mode called (Secure Arbitration Mode) SEAM. As the name suggests, this module serves as sort of an arbiter which the confidential guest calls for services it needs during its lifetime. Just like AMD's SNP set, this series reworks and streamlines certain parts of x86 arch code so that this feature can be properly accomodated" * tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) x86/tdx: Fix RETs in TDX asm x86/tdx: Annotate a noreturn function x86/mm: Fix spacing within memory encryption features message x86/kaslr: Fix build warning in KASLR code in boot stub Documentation/x86: Document TDX kernel architecture ACPICA: Avoid cache flush inside virtual machines x86/tdx/ioapic: Add shared bit for IOAPIC base address x86/mm: Make DMA memory shared for TD guest x86/mm/cpa: Add support for TDX shared memory x86/tdx: Make pages shared in ioremap() x86/topology: Disable CPU online/offline control for TDX guests x86/boot: Avoid #VE during boot for TDX platforms x86/boot: Set CR0.NE early and keep it set during the boot x86/acpi/x86/boot: Add multiprocessor wake-up support x86/boot: Add a trampoline for booting APs via firmware handoff x86/tdx: Wire up KVM hypercalls x86/tdx: Port I/O: Add early boot support x86/tdx: Port I/O: Add runtime hypercalls x86/boot: Port I/O: Add decompression-time support for TDX x86/boot: Port I/O: Allow to hook up alternative helpers ...
2022-04-13cpu/hotplug: Initialise all cpuhp_cpu_state structs earlierSteven Price1-9/+13
Rather than waiting until a CPU is first brought online, do the initialisation of the cpuhp_cpu_state structure for each CPU during the __init phase. This saves a (small) amount of non-__init memory and avoids potential confusion about when the cpuhp_cpu_state struct is valid. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220411152233.474129-3-steven.price@arm.com
2022-04-13cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_stateSteven Price1-18/+18
Currently the setting of the 'cpu' member of struct cpuhp_cpu_state in cpuhp_create() is too late as it is used earlier in _cpu_up(). If kzalloc_node() in __smpboot_create_thread() fails then the rollback will be done with st->cpu==0 causing CPU0 to be erroneously set to be dying, causing the scheduler to get mightily confused and throw its toys out of the pram. However the cpu number is actually available directly, so simply remove the 'cpu' member and avoid the problem in the first place. Fixes: 2ea46c6fc945 ("cpumask/hotplug: Fix cpu_dying() state tracking") Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220411152233.474129-2-steven.price@arm.com
2022-04-07x86/topology: Disable CPU online/offline control for TDX guestsKuppuswamy Sathyanarayanan1-0/+7
Unlike regular VMs, TDX guests use the firmware hand-off wakeup method to wake up the APs during the boot process. This wakeup model uses a mailbox to communicate with firmware to bring up the APs. As per the design, this mailbox can only be used once for the given AP, which means after the APs are booted, the same mailbox cannot be used to offline/online the given AP. More details about this requirement can be found in Intel TDX Virtual Firmware Design Guide, sec titled "AP initialization in OS" and in sec titled "Hotplug Device". Since the architecture does not support any method of offlining the CPUs, disable CPU hotplug support in the kernel. Since this hotplug disable feature can be re-used by other VM guests, add a new CC attribute CC_ATTR_HOTPLUG_DISABLED and use it to disable the hotplug support. Attempt to offline CPU will fail with -EOPNOTSUPP. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-25-kirill.shutemov@linux.intel.com
2022-03-22Merge tag 'sched-core-2022-03-22' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Cleanups for SCHED_DEADLINE - Tracing updates/fixes - CPU Accounting fixes - First wave of changes to optimize the overhead of the scheduler build, from the fast-headers tree - including placeholder *_api.h headers for later header split-ups. - Preempt-dynamic using static_branch() for ARM64 - Isolation housekeeping mask rework; preperatory for further changes - NUMA-balancing: deal with CPU-less nodes - NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD) - Updates to RSEQ UAPI in preparation for glibc usage - Lots of RSEQ/selftests, for same - Add Suren as PSI co-maintainer * tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits) sched/headers: ARM needs asm/paravirt_api_clock.h too sched/numa: Fix boot crash on arm64 systems headers/prep: Fix header to build standalone: <linux/psi.h> sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y cgroup: Fix suspicious rcu_dereference_check() usage warning sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity() sched/deadline,rt: Remove unused functions for !CONFIG_SMP sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy() sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file sched/deadline: Remove unused def_dl_bandwidth sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE sched/tracing: Don't re-read p->state when emitting sched_switch event sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race sched/cpuacct: Remove redundant RCU read lock sched/cpuacct: Optimize away RCU read lock sched/cpuacct: Fix charge percpu cpuusage sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies ...
2022-02-21random: clear fast pool, crng, and batches in cpuhp bring upJason A. Donenfeld1-0/+11
For the irq randomness fast pool, rather than having to use expensive atomics, which were visibly the most expensive thing in the entire irq handler, simply take care of the extreme edge case of resetting count to zero in the cpuhp online handler, just after workqueues have been reenabled. This simplifies the code a bit and lets us use vanilla variables rather than atomics, and performance should be improved. As well, very early on when the CPU comes up, while interrupts are still disabled, we clear out the per-cpu crng and its batches, so that it always starts with fresh randomness. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Sultan Alsawaf <sultan@kerneltoast.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-02-16sched/isolation: Use single feature type while referring to housekeeping cpumaskFrederic Weisbecker1-2/+2
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2021-11-24sched/scs: Reset task stack state in bringup_cpu()Mark Rutland1-0/+7
To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames, and when shadow call stacks are in use. When shadow call stacks (SCS) are in use the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined than onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to so so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with: * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK ... offlining and onlining CPUS with: | while true; do | for C in /sys/devices/system/cpu/cpu*/online; do | echo 0 > $C; | echo 1 > $C; | done | done Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
2021-08-10cpu/hotplug: Add debug printks for hotplug callback failuresDongli Zhang1-0/+7
CPU hotplug callbacks can fail and cause a rollback to the previous state. These failures are silent and therefore hard to debug. Add pr_debug() to the up and down paths which provide information about the error code, the CPU and the failed state. The debug printks can be enabled via kernel command line or sysfs. [ tglx: Adopt to current mainline, massage printk and changelog ] Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Link: https://lore.kernel.org/r/20210409055316.1709-1-dongli.zhang@oracle.com
2021-08-10cpu/hotplug: Use DEVICE_ATTR_*() macroYueHaibing1-27/+23
Use DEVICE_ATTR_*() helper instead of plain DEVICE_ATTR, which makes the code a bit shorter and easier to read. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210527141105.2312-1-yuehaibing@huawei.com
2021-08-10cpu/hotplug: Eliminate all kernel-doc warningsRandy Dunlap1-5/+21
kernel/cpu.c:57: warning: cannot understand function prototype: 'struct cpuhp_cpu_state ' kernel/cpu.c:115: warning: cannot understand function prototype: 'struct cpuhp_step ' kernel/cpu.c:146: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * cpuhp_invoke_callback _ Invoke the callbacks for a given state kernel/cpu.c:75: warning: Function parameter or member 'fail' not described in 'cpuhp_cpu_state' kernel/cpu.c:75: warning: Function parameter or member 'cpu' not described in 'cpuhp_cpu_state' kernel/cpu.c:75: warning: Function parameter or member 'node' not described in 'cpuhp_cpu_state' kernel/cpu.c:75: warning: Function parameter or member 'last' not described in 'cpuhp_cpu_state' kernel/cpu.c:130: warning: Function parameter or member 'list' not described in 'cpuhp_step' kernel/cpu.c:130: warning: Function parameter or member 'multi_instance' not described in 'cpuhp_step' kernel/cpu.c:158: warning: No description found for return value of 'cpuhp_invoke_callback' kernel/cpu.c:1188: warning: No description found for return value of 'cpu_device_down' kernel/cpu.c:1400: warning: No description found for return value of 'cpu_device_up' kernel/cpu.c:1425: warning: No description found for return value of 'bringup_hibernate_cpu' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210809223825.24512-1-rdunlap@infradead.org
2021-08-10cpu/hotplug: Fix kernel doc warnings for __cpuhp_setup_state_cpuslocked()Baokun Li1-0/+1
Fixes the following W=1 kernel build warning(s): kernel/cpu.c:1949: warning: Function parameter or member 'name' not described in '__cpuhp_setup_state_cpuslocked' Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210605063003.681049-1-libaokun1@huawei.com
2021-06-29Merge tag 'smp-urgent-2021-06-29' of ↵Linus Torvalds1-0/+49
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug fix from Thomas Gleixner: "A fix for the CPU hotplug and cpusets interaction: cpusets delegate the hotplug work to a workqueue to prevent a lock order inversion vs. the CPU hotplug lock. The work is not flushed before the hotplug operation returns which creates user visible inconsistent state. Prevent this by flushing the work after dropping CPU hotplug lock and before releasing the outer mutex which serializes the CPU hotplug related sysfs interface operations" * tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Cure the cpusets trainwreck
2021-06-21cpu/hotplug: Cure the cpusets trainwreckThomas Gleixner1-0/+49
Alexey and Joshua tried to solve a cpusets related hotplug problem which is user space visible and results in unexpected behaviour for some time after a CPU has been plugged in and the corresponding uevent was delivered. cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a workqueue. This is done because the cpusets code has already a lock nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or waiting for the work to finish with cpu_hotplug_lock held can and will deadlock because that results in the reverse lock order. As a consequence the uevent can be delivered before cpusets have consistent state which means that a user space invocation of sched_setaffinity() to move a task to the plugged CPU fails up to the point where the scheduled work has been processed. The same is true for CPU unplug, but that does not create user observable failure (yet). It's still inconsistent to claim that an operation is finished before it actually is and that's the real issue at hand. uevents just make it reliably observable. Obviously the problem should be fixed in cpusets/cgroups, but untangling that is pretty much impossible because according to the changelog of the commit which introduced this 8 years ago: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()") the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and the whole code is built around that. So bite the bullet and invoke the relevant cpuset function, which waits for the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and only when tasks are not frozen by suspend/hibernate because that would obviously wait forever. Waiting there with cpu_add_remove_lock, which is protecting the present and possible CPU maps, held is not a problem at all because neither work queues nor cpusets/cgroups have any lockchains related to that lock. Waiting in the hotplug machinery is not problematic either because there are already state callbacks which wait for hardware queues to drain. It makes the operations slightly slower, but hotplug is slow anyway. This ensures that state is consistent before returning from a hotplug up/down operation. It's still inconsistent during the operation, but that's a different story. Add a large comment which explains why this is done and why this is not a dump ground for the hack of the day to work around half thought out locking schemes. Document also the implications vs. hotplug operations and serialization or the lack of it. Thanks to Alexy and Joshua for analyzing why this temporary sched_setaffinity() failure happened. Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()") Reported-by: Alexey Klimov <aklimov@redhat.com> Reported-by: Joshua Baker <jobaker@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Alexey Klimov <aklimov@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
2021-05-25cpu/hotplug: Simplify access to percpu cpuhp_stateYuan ZhaoXiong1-2/+2
It is unnecessary to invoke per_cpu_ptr() everytime to access cpuhp_state. Use the available pointer instead. Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/1621776690-13264-1-git-send-email-yuanzhaoxiong@baidu.com
2021-04-21cpumask/hotplug: Fix cpu_dying() state trackingPeter Zijlstra1-5/+11
Vincent reported that for states with a NULL startup/teardown function we do not call cpuhp_invoke_callback() (because there is none) and as such we'll not update the cpu_dying() state. The stale cpu_dying() can eventually lead to triggering BUG(). Rectify this by updating cpu_dying() in the exact same places the hotplug machinery tracks its directional state, namely cpuhp_set_state() and cpuhp_reset_state(). Reported-by: Vincent Donnefort <vincent.donnefort@arm.com> Suggested-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/YH7r+AoQEReSvxBI@hirez.programming.kicks-ass.net
2021-04-16cpumask: Introduce DYING maskPeter Zijlstra1-0/+6
Introduce a cpumask that indicates (for each CPU) what direction the CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when an up fails and we do a roll-back down, it will accurately reflect the direction. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210310150109.151441252@infradead.org
2021-03-06cpu/hotplug: Add cpuhp_invoke_callback_range()Vincent Donnefort1-68/+102
Factorizing and unifying cpuhp callback range invocations, especially for the hotunplug path, where two different ways of decrementing were used. The first one, decrements before the callback is called: cpuhp_thread_fun() state = st->state; st->state--; cpuhp_invoke_callback(state); The second one, after: take_down_cpu()|cpuhp_down_callbacks() cpuhp_invoke_callback(st->state); st->state--; This is problematic for rolling back the steps in case of error, as depending on the decrement, the rollback will start from N or N-1. It also makes tracing inconsistent, between steps run in the cpuhp thread and the others. Additionally, avoid useless cpuhp_thread_fun() loops by skipping empty steps. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210216103506.416286-4-vincent.donnefort@arm.com
2021-03-06cpu/hotplug: CPUHP_BRINGUP_CPU failure exceptionVincent Donnefort1-3/+16
The atomic states (between CPUHP_AP_IDLE_DEAD and CPUHP_AP_ONLINE) are triggered by the CPUHP_BRINGUP_CPU step. If the latter fails, no atomic state can be rolled back. DEAD callbacks too can't fail and disallow recovery. As a consequence, during hotunplug, the fail injection interface should prohibit all states from CPUHP_BRINGUP_CPU to CPUHP_ONLINE. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210216103506.416286-3-vincent.donnefort@arm.com
2021-03-06cpu/hotplug: Allowing to reset fail injectionVincent Donnefort1-0/+5
Currently, the only way of resetting the fail injection is to trigger a hotplug, hotunplug or both. This is rather annoying for testing and, as the default value for this file is -1, it seems pretty natural to let a user write it. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210216103506.416286-2-vincent.donnefort@arm.com
2021-01-06cpu/hotplug: Add lockdep_is_cpus_held()Frederic Weisbecker1-0/+7
This commit adds a lockdep_is_cpus_held() function to verify that the proper locks are held and that various operations are running in the correct context. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-14Merge tag 'sched-core-2020-12-14' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - migrate_disable/enable() support which originates from the RT tree and is now a prerequisite for the new preemptible kmap_local() API which aims to replace kmap_atomic(). - A fair amount of topology and NUMA related improvements - Improvements for the frequency invariant calculations - Enhanced robustness for the global CPU priority tracking and decision making - The usual small fixes and enhancements all over the place * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) sched/fair: Trivial correction of the newidle_balance() comment sched/fair: Clear SMT siblings after determining the core is not idle sched: Fix kernel-doc markup x86: Print ratio freq_max/freq_base used in frequency invariance calculations x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC x86, sched: Calculate frequency invariance for AMD systems irq_work: Optimize irq_work_single() smp: Cleanup smp_call_function*() irq_work: Cleanup sched: Limit the amount of NUMA imbalance that can exist at fork time sched/numa: Allow a floating imbalance between NUMA nodes sched: Avoid unnecessary calculation of load imbalance at clone time sched/numa: Rename nr_running and break out the magic number sched: Make migrate_disable/enable() independent of RT sched/topology: Condition EAS enablement on FIE support arm64: Rebuild sched domains on invariance status changes sched/topology,schedutil: Wrap sched domains rebuild sched/uclamp: Allow to reset a task uclamp constraint value sched/core: Fix typos in comments Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug ...
2020-11-27kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handlingNicholas Piggin1-1/+5
powerpc/64s keeps a counter in the mm which counts bits set in mm_cpumask as well as other things. This means it can't use generic code to clear bits out of the mask and doesn't adjust the arch specific counter. Add an arch override that allows powerpc/64s to use clear_tasks_mm_cpumask(). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com
2020-11-10sched/hotplug: Consolidate task migration on CPU unplugThomas Gleixner1-1/+8
With the new mechanism which kicks tasks off the outgoing CPU at the end of schedule() the situation on an outgoing CPU right before the stopper thread brings it down completely is: - All user tasks and all unbound kernel threads have either been migrated away or are not running and the next wakeup will move them to a online CPU. - All per CPU kernel threads, except cpu hotplug thread and the stopper thread have either been unbound or parked by the responsible CPU hotplug callback. That means that at the last step before the stopper thread is invoked the cpu hotplug thread is the last legitimate running task on the outgoing CPU. Add a final wait step right before the stopper thread is kicked which ensures that any still running tasks on the way to park or on the way to kick themself of the CPU are either sleeping or gone. This allows to remove the migrate_tasks() crutch in sched_cpu_dying(). If sched_cpu_dying() detects that there is still another running task aside of the stopper thread then it will explode with the appropriate fireworks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.547163969@infradead.org
2020-06-03Merge tag 'sched-core-2020-06-02' of ↵Linus Torvalds1-1/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The changes in this cycle are: - Optimize the task wakeup CPU selection logic, to improve scalability and reduce wakeup latency spikes - PELT enhancements - CFS bandwidth handling fixes - Optimize the wakeup path by remove rq->wake_list and replacing it with ->ttwu_pending - Optimize IPI cross-calls by making flush_smp_call_function_queue() process sync callbacks first. - Misc fixes and enhancements" * tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too sched/headers: Split out open-coded prototypes into kernel/sched/smp.h sched: Replace rq::wake_list sched: Add rq::ttwu_pending irq_work, smp: Allow irq_work on call_single_queue smp: Optimize send_call_function_single_ipi() smp: Move irq_work_run() out of flush_smp_call_function_queue() smp: Optimize flush_smp_call_function_queue() sched: Fix smp_call_function_single_async() usage for ILB sched/core: Offload wakee task activation if it the wakee is descheduling sched/core: Optimize ttwu() spinning on p->on_cpu sched: Defend cfs and rt bandwidth quota against overflow sched/cpuacct: Fix charge cpuacct.usage_sys sched/fair: Replace zero-length array with flexible-array sched/pelt: Sync util/runnable_sum with PELT window when propagating sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr() sched/fair: Optimize enqueue_task_fair() sched: Make scheduler_ipi inline sched: Clean up scheduler_ipi() sched/core: Simplify sched_init() ...
2020-05-07cpu/hotplug: Remove __freeze_secondary_cpus()Qais Yousef1-2/+2
The refactored function is no longer required as the codepaths that call freeze_secondary_cpus() are all suspend/resume related now. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Link: https://lkml.kernel.org/r/20200430114004.17477-2-qais.yousef@arm.com
2020-05-07cpu/hotplug: Remove disable_nonboot_cpus()Qais Yousef1-7/+7
The single user could have called freeze_secondary_cpus() directly. Since this function was a source of confusion, remove it as it's just a pointless wrapper. While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to preserve the naming symmetry. Done automatically via: git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g' Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
2020-04-30sched/core: Fix illegal RCU from offline CPUsPeter Zijlstra1-1/+17
In the CPU-offline process, it calls mmdrop() after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaining when mmdrop() uses RCU from either memcg or debugobjects below. Fix it by cleaning up the active_mm state from BP instead. Every arch which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit() from AP. The only exception is parisc because it switches them to &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()), but the patch will still work there because it calls mmgrab(&init_mm) in smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu(). WARNING: suspicious RCU usage ----------------------------- kernel/workqueue.c:710 RCU or wq_pool_mutex should be held! other info that might help us debug this: RCU used illegally from offline CPU! Call Trace: dump_stack+0xf4/0x164 (unreliable) lockdep_rcu_suspicious+0x140/0x164 get_work_pool+0x110/0x150 __queue_work+0x1bc/0xca0 queue_work_on+0x114/0x120 css_release+0x9c/0xc0 percpu_ref_put_many+0x204/0x230 free_pcp_prepare+0x264/0x570 free_unref_page+0x38/0xf0 __mmdrop+0x21c/0x2c0 idle_task_exit+0x170/0x1b0 pnv_smp_cpu_kill_self+0x38/0x2e0 cpu_die+0x48/0x64 arch_cpu_idle_dead+0x30/0x50 do_idle+0x2f4/0x470 cpu_startup_entry+0x38/0x40 start_secondary+0x7a8/0xa80 start_secondary_resume+0x10/0x14 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw