summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-08-07workqueue: Make default affinity_scope dynamically updatableTejun Heo4-13/+52
While workqueue.default_affinity_scope is writable, it only affects workqueues which are created afterwards and isn't very useful. Instead, let's introduce explicit "default" scope and update the effective scope dynamically when workqueue.default_affinity_scope is changed. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Add "Affinity Scopes and Performance" section to documentationTejun Heo1-5/+179
With affinity scopes and their strictness setting added, unbound workqueues should now be able to cover wide variety of configurations and use cases. Unfortunately, the performance picture is not entirely straight-forward due to a trade-off between efficiency and work-conservation in some situations necessitating manual configuration. This patch adds "Affinity Scopes and Performance" section to Documentation/core-api/workqueue.rst which illustrates the trade-off with a set of experiments and provides some guidelines. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Implement non-strict affinity scope for unbound workqueuesTejun Heo5-20/+132
An unbound workqueue can be served by multiple worker_pools to improve locality. The segmentation is achieved by grouping CPUs into pods. By default, the cache boundaries according to cpus_share_cache() define the CPUs are grouped. Let's a workqueue is allowed to run on all CPUs and the system has two L3 caches. The workqueue would be mapped to two worker_pools each serving one L3 cache domains. While this improves locality, because the pod boundaries are strict, it limits the total bandwidth a given issuer can consume. For example, let's say there is a thread pinned to a CPU issuing enough work items to saturate the whole machine. With the machine segmented into two pods, no matter how many work items it issues, it can only use half of the CPUs on the system. While this limitation has existed for a very long time, it wasn't very pronounced because the affinity grouping used to be always by NUMA nodes. With cache boundaries as the default and support for even finer grained scopes (smt and cpu), it is now an a lot more pressing problem. This patch implements non-strict affinity scope where the pod boundaries aren't enforced strictly. Going back to the previous example, the workqueue would still be mapped to two worker_pools; however, the affinity enforcement would be soft. The workers in both pools would have their cpus_allowed set to the whole machine thus allowing the scheduler to migrate them anywhere on the machine. However, whenever an idle worker is woken up, the workqueue code asks the scheduler to bring back the task within the pod if the worker is outside. ie. work items start executing within its affinity scope but can be migrated outside as the scheduler sees fit. This removes the hard cap on utilization while maintaining the benefits of affinity scopes. After the earlier ->__pod_cpumask changes, the implementation is pretty simple. When non-strict which is the new default: * pool_allowed_cpus() returns @pool->attrs->cpumask instead of ->__pod_cpumask so that the workers are allowed to run on any CPU that the associated workqueues allow. * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets the field to a CPU within the pod. This would be the first use of task_struct->wake_cpu outside scheduler proper, so it isn't clear whether this would be acceptable. However, other methods of migrating tasks are significantly more expensive and are likely prohibitively so if we want to do this on every work item. This needs discussion with scheduler folks. There is also a race window where setting ->wake_cpu wouldn't be effective as the target task is still on CPU. However, the window is pretty small and this being a best-effort optimization, it doesn't seem to warrant more complexity at the moment. While the non-strict cache affinity scopes seem to be the best option, the performance picture interacts with the affinity scope and is a bit complicated to fully discuss in this patch, so the behavior is made easily selectable through wqattrs and sysfs and the next patch will add documentation to discuss performance implications. v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-07workqueue: Add workqueue_attrs->__pod_cpumaskTejun Heo2-37/+53
workqueue_attrs has two uses: * to specify the required unouned workqueue properties by users * to match worker_pool's properties to workqueues by core code For example, if the user wants to restrict a workqueue to run only CPUs 0 and 2, and the two CPUs are on different affinity scopes, the workqueue's attrs->cpumask would contains CPUs 0 and 2, and the workqueue would be associated with two worker_pools, one with attrs->cpumask containing just CPU 0 and the other CPU 2. Workqueue wants to support non-strict affinity scopes where work items are started in their matching affinity scopes but the scheduler is free to migrate them outside the starting scopes, which can enable utilizing the whole machine while maintaining most of the locality benefits from affinity scopes. To enable that, worker_pools need to distinguish the strict affinity that it has to follow (because that's the restriction coming from the user) and the soft affinity that it wants to apply when dispatching work items. Note that two worker_pools with different soft dispatching requirements have to be separate; otherwise, for example, we'd be ping-ponging worker threads across NUMA boundaries constantly. This patch adds workqueue_attrs->__pod_cpumask. The new field is double underscored as it's only used internally to distinguish worker_pools. A worker_pool's ->cpumask is now always the same as the online subset of allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's subset of that ->cpumask. Going back to the example above, both worker_pools would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask would contain 0 while the other's 2. * pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask that the pool's workers must stay within. This is currently always ->__pod_cpumask as all boundaries are still strict. * As a workqueue_attrs can now track both the associated workqueues' cpumask and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external out-argument. Drop @cpumask and instead store the result in ->__pod_cpumask. * The above also simplifies apply_wqattrs_prepare() as the same workqueue_attrs can be used to create all pods associated with a workqueue. tmp_attrs is dropped. * wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq update is needed instead of only comparing ->cpumask so that ->__pod_cpumask is compared too. It can directly compare ->__pod_cpumaks but the code is easier to understand and more robust this way. The only user-visible behavior change is that two workqueues with different cpumasks no longer can share worker_pools even when their pod subsets coincide. Going back to the example, let's say there's another workqueue with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped to two worker_pools - one with CPU 0, the other with 2 and 3. The former has the same cpumask as the first pod of the earlier example and would have shared the same worker_pool but that's no longer the case after this patch. The worker_pools would have the same ->__pod_cpumask but their ->cpumask's wouldn't match. While this is necessary to support non-strict affinity scopes, there can be further optimizations to maintain sharing among strict affinity scopes. However, non-strict affinity scopes are going to be preferable for most use cases and we don't see very diverse mixture of unbound workqueue cpumasks anyway, so the additional overhead doesn't seem to justify the extra complexity. v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by using wqattrs_equal() for comparison instead. - Per-cpu worker pools weren't initializing ->__pod_cpumask which caused a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Factor out need_more_worker() check and worker wake-upTejun Heo1-50/+37
Checking need_more_worker() and calling wake_up_worker() is a repeated pattern. Let's add kick_pool(), which checks need_more_worker() and open-code wake_up_worker(), and replace wake_up_worker() uses. The following conversions aren't one-to-one: * __queue_work() was using __need_more_work() because it knows that pool->worklist isn't empty. Switching to kick_pool() adds an extra list_empty() test. * create_worker() always needs to wake up the newly minted worker whether there's more work to do or not to avoid triggering hung task check on the new task. Keep the current wake_up_process() and still add kick_pool(). This may lead to an extra wakeup which isn't harmful. * pwq_adjust_max_active() was explicitly checking whether it needs to wake up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up a worker when necessary, this explicit check is no longer necessary and dropped. * unbind_workers() now calls kick_pool() instead of wake_up_worker() adding a need_more_worker() test. This avoids spurious wakeups and shouldn't break anything. wake_up_worker() is dropped as kick_pool() replaces all its users. After this patch, all paths that wakes up a non-rescuer worker to initiate work item execution use kick_pool(). This will enable future changes to improve locality. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Factor out work to worker assignment and collision handlingTejun Heo1-28/+52
The two work execution paths in worker_thread() and rescuer_thread() use move_linked_works() to claim work items from @pool->worklist. Once claimed, process_schedule_works() is called which invokes process_one_work() on each work item. process_one_work() then uses find_worker_executing_work() to detect and handle collisions - situations where the work item to be executed is still running on another worker. This works fine, but, to improve work execution locality, we want to establish work to worker association earlier and know for sure that the worker is going to excute the work once asssigned, which requires performing collision handling earlier while trying to assign the work item to the worker. This patch introduces assign_work() which assigns a work item to a worker using move_linked_works() and then performs collision handling. As collision handling is handled earlier, process_one_work() no longer needs to worry about them. After the this patch, collision checks for linked work items are skipped, which should be fine as they can't be queued multiple times concurrently. For work items running from rescuers, the timing of collision handling may change but the invariant that the work items go through collision handling before starting execution does not. This patch shouldn't cause noticeable behavior changes, especially given that worker_thread() behavior remains the same. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Add multiple affinity scopes and interface to select themTejun Heo5-12/+193
Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE the default. The code changes to actually add the additional scopes are trivial. Also add module parameter "workqueue.default_affinity_scope" to override the default scope and "affinity_scope" sysfs file to configure it per workqueue. wq_dump.py and documentations are updated accordingly. This enables significant flexibility in configuring how unbound workqueues behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu workqueue. On the other hand, "system" removes all locality boundaries. Many modern machines have multiple L3 caches often while being mostly uniform in terms of memory access. Thus, workqueue's previous behavior of spreading work items in each NUMA node had negative performance implications from unncessarily crossing L3 boundaries between issue and execution. However, picking a finer grained affinity scope also has a downside in that an issuer in one group can't utilize CPUs in other groups. While dependent on the specifics of workload, there's usually a noticeable penalty in crossing L3 boundaries, so let's default to CACHE. This issue will be further addressed and documented with examples in future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Modularize wq_pod_type initializationTejun Heo1-34/+50
While wq_pod_type[] can now group CPUs in any aribitrary way, WQ_AFFN_NUM init is hard coded into workqueue_init_topology(). This patch modularizes the init path by introducing init_pod_type() which takes a callback to determine whether two CPUs should share a pod as an argument. init_pod_type() first scans the CPU combinations testing for sharing to assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once ->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with cpus_share_numa() which tests whether the CPU belongs to the same NUMA node. This patch may change the pod ID assigned to each NUMA node but that shouldn't cause any behavior changes as the NUMA node to use for allocations are tracked separately in pod_type->pod_node[]. This makes adding new affinty types pretty easy. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue ↵Tejun Heo2-0/+225
configuration Lack of visibility has always been a pain point for workqueues. While the recently added wq_monitor.py improved the situation, it's still difficult to understand what worker pools are active in the system, how workqueues map to them and why. The lack of visibility into how workqueues are configured is going to become more noticeable as workqueue improves locality awareness and provides more mechanisms to customize locality related behaviors. Now that the basic framework for more flexible locality support is in place, this is a good time to improve the situation. This patch adds tools/workqueues/wq_dump.py which prints out the topology configuration, worker pools and how workqueues are mapped to pools. Read the command's help message for more details. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Generalize unbound CPU podsTejun Heo2-65/+137
While renamed to pod, the code still assumes that the pods are defined by NUMA boundaries. Let's generalize it: * workqueue_attrs->affn_scope is added. Each enum represents the type of boundaries that define the pods. There are currently two scopes - WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before - one pod per NUMA node. The latter defines one global pod across the whole system. * struct wq_pod_type is added which describes how pods are configured for each affnity scope. For each pod, it lists the member CPUs and the preferred NUMA node for memory allocations. The reverse mapping from CPU to pod is also available. * wq_pod_enabled is dropped. Pod is now always enabled. The previously disabled behavior is now implemented through WQ_AFFN_SYSTEM. * get_unbound_pool() wants to determine the NUMA node to allocate memory from for the new pool. The variables are renamed from node to pod but the logic still assumes they're one and the same. Clearly distinguish them - walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's NUMA node. * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA node. Take @cpu instead and determine the cpumask to use from the pod_type matching @attrs. * apply_wqattrs_prepare() is update to return ERR_PTR() on error instead of NULL so that it can indicate -EINVAL on invalid affinity scopes. This patch allows CPUs to be grouped into pods however desired per type. While this patch causes some internal behavior changes, nothing material should change for workqueue users. v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is WQ_AFFN_NR_TYPES which indicates that the function is called with a worker_pool's attrs instead of a workqueue's. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Factor out clearing of workqueue-only attrs fieldsTejun Heo1-6/+14
workqueue_attrs can be used for both workqueues and worker_pools. However, some fields, currently only ->ordered, only apply to workqueues and should be cleared to the default / invalid values. Currently, an unbound workqueue explicitly clears attrs->ordered in get_unbound_pool() after copying the source workqueue attrs, while per-cpu workqueues rely on the fact that zeroing on allocation gives us the desired default value for pool->attrs->ordered. This is fragile. Let's add wqattrs_clear_for_pool() which clears attrs->ordered and is called from both init_worker_pool() and get_unbound_pool(). This will ease adding more workqueue-only attrs fields. In get_unbound_pool(), pool->node initialization is moved upwards for readability. This shouldn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Factor out actual cpumask calculation to reduce subtlety in ↵Tejun Heo1-20/+29
wq_update_pod() For an unbound pool, multiple cpumasks are involved. U: The user-specified cpumask (may be filtered with cpu_possible_mask). A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering leaves no CPU, wq_unbound_cpumask is used. P: Per-pod subsets of #A. wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and wq->cpu_pwq[CPU]->pool->attrs->cpumask #P. wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To calculate the new #P for each workqueue, it needs to call wq_calc_pod_cpumask() with @attrs that contains #A. Currently, wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with wq->dfl_pwq->pool->attrs. This is rather fragile because we're calling wq_calc_pod_cpumask() with @attrs of a worker_pool rather than the workqueue's actual attrs when what we want to calculate is the workqueue's cpumask on the pod. While this works fine currently, future changes will add fields which are used differently between workqueues and worker_pools and this subtlety will bite us. This patch factors out #U -> #A calculation from apply_wqattrs_prepare() into wqattrs_actualize_cpumask and updates wq_update_pod() to copy wq->unbound_attrs and use the new helper to obtain #A freshly instead of abusing wq->dfl_pwq->pool_attrs. This shouldn't cause any behavior changes in the current code. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
2023-08-07workqueue: Initialize unbound CPU pods later in the bootTejun Heo3-27/+43
During boot, to initialize unbound CPU pods, wq_pod_init() was called from workqueue_init(). This is early enough for NUMA nodes to be set up but before SMP is brought up and CPU topology information is populated. Workqueue is in the process of improving CPU locality for unbound workqueues and will need access to topology information during pod init. This adds a new init function workqueue_init_topology() which is called after CPU topology information is available and replaces wq_pod_init(). As unbound CPU pods are now initialized after workqueues are activated, we need to revisit the workqueues to apply the pod configuration. Workqueues which are created before workqueue_init_topology() are set up so that they always use the default worker pool. After pods are set up in workqueue_init_topology(), wq_update_pod() is called on all existing workqueues to update the pool associations accordingly. Note that wq_update_pod_attrs_buf allocation is moved to workqueue_init_early(). This isn't necessary right now but enables further generalization of pod handling in the future. This patch changes the initialization sequence but the end result should be the same. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Move wq_pod_init() below workqueue_init()Tejun Heo1-38/+40
wq_pod_init() is called from workqueue_init() and responsible for initializing unbound CPU pods according to NUMA node. Workqueue is in the process of improving affinity awareness and wants to use other topology information to initialize unbound CPU pods; however, unlike NUMA nodes, other topology information isn't yet available in workqueue_init(). The next patch will introduce a later stage init function for workqueue which will be responsible for initializing unbound CPU pods. Relocate wq_pod_init() below workqueue_init() where the new init function is going to be located so that the diff can show the content differences. Just a relocation. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Rename NUMA related names to use pod insteadTejun Heo1-85/+76
Workqueue is in the process of improving CPU affinity awareness. It will become more flexible and won't be tied to NUMA node boundaries. This patch renames all NUMA related names in workqueue.c to use "pod" instead. While "pod" isn't a very common term, it short and captures the grouping of CPUs well enough. These names are only going to be used within workqueue implementation proper, so the specific naming doesn't matter that much. * wq_numa_possible_cpumask -> wq_pod_cpus * wq_numa_enabled -> wq_pod_enabled * wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf * workqueue_select_cpu_near -> select_numa_node_cpu This rename is different from others. The function is only used by queue_work_node() and specifically tries to find a CPU in the specified NUMA node. As workqueue affinity will become more flexible and untied from NUMA, this function's name should specifically describe that it's for NUMA. * wq_calc_node_cpumask -> wq_calc_pod_cpumask * wq_update_unbound_numa -> wq_update_pod * wq_numa_init -> wq_pod_init * node -> pod in local variables Only renames. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Rename workqueue_attrs->no_numa to ->orderedTejun Heo2-13/+12
With the recent removal of NUMA related module param and sysfs knob, workqueue_attrs->no_numa is now only used to implement ordered workqueues. Let's rename the field so that it's less confusing especially with the planned CPU affinity awareness improvements. Just a rename. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Make unbound workqueues to use per-cpu pool_workqueuesTejun Heo3-158/+89
A pwq (pool_workqueue) represents an association between a workqueue and a worker_pool. When a work item is queued, the workqueue selects the pwq to use, which in turn determines the pool, and queues the work item to the pool through the pwq. pwq is also what implements the maximum concurrency limit - @max_active. As a per-cpu workqueue should be assocaited with a different worker_pool on each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq. However, unbound workqueues were sharing a pwq within each NUMA node by default. The sharing has several downsides: * Because @max_active is per-pwq, the meaning of @max_active changes depending on the machine configuration and whether workqueue NUMA locality support is enabled. * Makes per-cpu and unbound code deviate. * Gets in the way of making workqueue CPU locality awareness more flexible. This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu workqueues do by making the following changes: * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound workqueues. * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs the specified pwq to the target CPU's wq->cpu_pwq. * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq. This makes the return value of wq_calc_node_cpumask() unnecessary. It now returns void. * @max_active now means the same thing for both per-cpu and unbound workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer used in workqueue implementation and will be removed later. * All unbound pwq operations which used to be per-numa-node are now per-cpu. For most unbound workqueue users, this shouldn't cause noticeable changes. Work item issue and completion will be a small bit faster, flush_workqueue() would become a bit more expensive, and the total concurrency limit would likely become higher. All @max_active==1 use cases are currently being audited for conversion into alloc_ordered_workqueue() and they shouldn't be affected once the audit and conversion is complete. One area where the behavior change may be more noticeable is workqueue_congested() as the reported congestion state is now per CPU instead of NUMA node. There are only two users of this interface - drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are cc'd. Inputs on the behavior change would be very much appreciated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Karsten Graul <kgraul@linux.ibm.com> Cc: Wenjia Zhang <wenjia@linux.ibm.com> Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-07workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplugTejun Heo1-9/+25
When a CPU went online or offline, wq_update_unbound_numa() was called only on the CPU which was going up or down. This works fine because all CPUs on the same NUMA node share the same pool_workqueue slot - one CPU updating it updates it for everyone in the node. However, future changes will make each CPU use a separate pool_workqueue even when they're sharing the same worker_pool, which requires updating pool_workqueue's for all CPUs which may be sharing the same pool_workqueue on hotplug. To accommodate the planned changes, this patch updates workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for all CPUs sharing the same NUMA node as the CPU going up or down. In the current code, the second+ calls would be noops and there shouldn't be any behavior changes. * As wq_update_unbound_numa() is now called on multiple CPUs per each hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument is added. The former indicates the CPU being hot[un]plugged and the latter the CPU whose pool_workqueue is being updated. * In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency with the new @hotplug_cpu. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Make per-cpu pool_workqueues allocated and released like unbound onesTejun Heo1-34/+40
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly through a per-cpu allocation and thus, unlike unbound workqueues, not reference counted. This difference in lifetime management between the two types is a bit confusing. Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which isn't suitiable for the planned CPU locality related improvements. The plan is to unify pwq handling across per-cpu and unbound workqueues so that they're always accessed through wq->cpu_pwq. In preparation, this patch makes per-cpu pwq's to be allocated, reference counted and released the same way as unbound pwq's. wq->cpu_pwq now holds pointers to pwq's instead of containing them directly. pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now also used for per-cpu work items. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Use a kthread_worker to release pool_workqueuesTejun Heo1-17/+23
pool_workqueue release path is currently bounced to system_wq; however, this is a bit tricky because this bouncing occurs while holding a pool lock and thus has risk of causing a A-A deadlock. This is currently addressed by the fact that only unbound workqueues use this bouncing path and system_wq is a per-cpu workqueue. While this works, it's brittle and requires a work-around like setting the lockdep subclass for the lock of unbound pools. Besides, future changes will use the bouncing path for per-cpu workqueues too making the current approach unusable. Let's just use a dedicated kthread_worker to untangle the dependency. This is just one more kthread for all workqueues and makes the pwq release logic simpler and more robust. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numaTejun Heo2-82/+0
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA specific knobs won't make sense anymore. Remove them. Also, the pool_ids knob was used for debugging and not really meaningful given that there is no visibility into the pools associated with those IDs. Remove it too. A future patch will improve overall visibility. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Relocate worker and work management functionsTejun Heo1-172/+168
Collect first_idle_worker(), worker_enter/leave_idle(), find_worker_executing_work(), move_linked_works() and wake_up_worker() into one place. These functions will later be used to implement higher level worker management logic. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Rename wq->cpu_pwqs to wq->cpu_pwqTejun Heo1-7/+7
wq->cpu_pwqs is a percpu variable carraying one pointer to a pool_workqueue. The field name being plural is unusual and confusing. Rename it to singular. This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Not all work insertion needs to wake up a workerTejun Heo1-19/+19
insert_work() always tried to wake up a worker; however, the only time it needs to try to wake up a worker is when a new active work item is queued. When a work item goes on the inactive list or queueing a flush work item, there's no reason to try to wake up a worker. This patch moves the worker wakeup logic out of insert_work() and places it in the active new work item queueing path in __queue_work(). While at it: * __queue_work() is dereferencing pwq->pool repeatedly. Add local variable pool. * Every caller of insert_work() calls debug_work_activate(). Consolidate the invocations into insert_work(). * In __queue_work() pool->watchdog_ts update is relocated slightly. This is to better accommodate future changes. This makes wakeups more precise and will help the planned change to assign work items to workers before waking them up. No behavior changes intended. v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-08-07workqueue: Cleanups around process_scheduled_works()Tejun Heo1-18/+11
* Drop the trivial optimization in worker_thread() where it bypasses calling process_scheduled_works() if the first work item isn't linked. This is a mostly pointless micro optimization and gets in the way of improving the work processing path. * Consolidate pool->watchdog_ts updates in the two callers into process_scheduled_works(). Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Drop the special locking rule for worker->flags and ↵Tejun Heo2-15/+4
worker_pool->flags worker->flags used to be accessed from scheduler hooks without grabbing pool->lock for concurrency management. This is no longer true since 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock"). Also, it's unclear why worker_pool->flags was using the "X" rule. All relevant users are accessing it under the pool lock. Let's drop the special "X" rule and use the "L" rule for these flag fields instead. While at it, replace the CONTEXT comment with lockdep_assert_held(). This allows worker_set/clr_flags() to be used from context which isn't the worker itself. This will be used later to implement assinging work items to workers before waking them up so that workqueue can have better control over which worker executes which work item on which CPU. The only actual changes are sanity checks. There shouldn't be any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: Merge branch 'for-6.5-fixes' into for-6.6Tejun Heo2-2/+43
Unbound workqueue execution locality improvement patchset is about to applied which will cause merge conflicts with changes in for-6.5-fixes. Let's avoid future merge conflict by pulling in for-6.5-fixes. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07workqueue: use LIST_HEAD to initialize cull_listYang Yingliang1-5/+2
Use LIST_HEAD() to initialize cull_list instead of open-coding it. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-25workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000Tejun Heo1-1/+42
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items. Once detected, they're excluded from concurrency management to prevent them from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is enabled, repeat offenders are also reported so that the code can be updated. The default threshold is 10ms which is long enough to do fair bit of work on modern CPUs while short enough to be usually not noticeable. This unfortunately leads to a lot of, arguable spurious, detections on very slow CPUs. Using the same threshold across CPUs whose performance levels may be apart by multiple levels of magnitude doesn't make whole lot of sense. This patch scales up wq_cpu_intensive_thresh_us upto 1 second when BogoMIPS is below 4000. This is obviously very inaccurate but it doesn't have to be accurate to be useful. The mechanism is still useful when the threshold is fully scaled up and the benefits of reports are usually shared with everyone regardless of who's reporting, so as long as there are sufficient number of fast machines reporting, we don't lose much. Some (or is it all?) ARM CPUs systemtically report significantly lower BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs are, it's at least a missed opportunity and it probably would be a good idea to teach workqueue about it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
2023-07-11workqueue: Fix cpu_intensive_thresh_us name in help textGeert Uytterhoeven1-1/+1
There exists no parameter called "cpu_intensive_threshold_us". The actual parameter name is "cpu_intensive_thresh_us". Fixes: 6363845005202148 ("workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10workqueue: add cmdline parameter `workqueue.unbound_cpus` to further ↵tiozhang2-0/+24
constrain wq_unbound_cpumask at boot time Motivation of doing this is to better improve boot times for devices when we want to prevent our workqueue works from running on some specific CPUs, e,g, some CPUs are busy with interrupts. Signed-off-by: tiozhang <tiozhang@didiglobal.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10workqueue: Warn attempt to flush system-wide workqueues.Tetsuo Handa2-47/+8
Based on commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro"), all in-tree users stopped flushing system-wide workqueues. Therefore, start emitting runtime message so that all out-of-tree users will understand that they need to update their code. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10Merge tag 'linux-watchdog-6.5-rc2' of ↵Linus Torvalds1-0/+42
git://www.linux-watchdog.org/linux-watchdog Pull watchdog update from Wim Van Sebroeck: - Add Loongson-1 watchdog dt-bindings * tag 'linux-watchdog-6.5-rc2' of git://www.linux-watchdog.org/linux-watchdog: dt-bindings: watchdog: Add Loongson-1 watchdog
2023-07-10Merge tag 'v6.5-p2' of ↵Linus Torvalds3-9/+22
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "Fix a couple of regressions in af_alg and incorrect return values in crypto/asymmetric_keys/public_key" * tag 'v6.5-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: algif_hash - Fix race between MORE and non-MORE sends KEYS: asymmetric: Fix error codes crypto: af_alg - Fix merging of written data into spliced pages
2023-07-09Linux 6.5-rc1Linus Torvalds1-2/+2
2023-07-09MAINTAINERS 2: Electric BoogalooLinus Torvalds1-46/+46
We just sorted the entries and fields last release, so just out of a perverse sense of curiosity, I decided to see if we can keep things ordered for even just one release. The answer is "No. No we cannot". I suggest that all kernel developers will need weekly training sessions, involving a lot of Big Bird and Sesame Street. And at the yearly maintainer summit, we will all sing the alphabet song together. I doubt I will keep doing this. At some point "perverse sense of curiosity" turns into just a cold dark place filled with sadness and despair. Repeats: 80e62bc8487b ("MAINTAINERS: re-sort all entries and fields") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-09Merge tag 'dma-mapping-6.5-2023-07-09' of ↵Linus Torvalds1-11/+35
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - swiotlb area sizing fixes (Petr Tesarik) * tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: reduce the number of areas to match actual memory pool size swiotlb: always set the number of areas before allocating the pool
2023-07-09Merge tag 'irq_urgent_for_v6.5_rc1' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq update from Borislav Petkov: - Optimize IRQ domain's name assignment * tag 'irq_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqdomain: Use return value of strreplace()
2023-07-09Merge tag 'x86_urgent_for_v6.5_rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu fix from Borislav Petkov: - Do FPU AP initialization on Xen PV too which got missed by the recent boot reordering work * tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/xen: Fix secondary processors' FPU initialization
2023-07-09Merge tag 'x86-core-2023-07-09' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A single fix for the mechanism to park CPUs with an INIT IPI. On shutdown or kexec, the kernel tries to park the non-boot CPUs with an INIT IPI. But the same code path is also used by the crash utility. If the CPU which panics is not the boot CPU then it sends an INIT IPI to the boot CPU which resets the machine. Prevent this by validating that the CPU which runs the stop mechanism is the boot CPU. If not, leave the other CPUs in HLT" * tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Don't send INIT to boot CPU
2023-07-09Merge tag 'mips_6.5_1' of ↵Linus Torvalds10-53/+46
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fixes from Thomas Bogendoerfer: - fixes for KVM - fix for loongson build and cpu probing - DT fixes * tag 'mips_6.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: kvm: Fix build error with KVM_MIPS_DEBUG_COP0_COUNTERS enabled MIPS: dts: add missing space before { MIPS: Loongson: Fix build error when make modules_install MIPS: KVM: Fix NULL pointer dereference MIPS: Loongson: Fix cpu_probe_loongson() again
2023-07-09Merge tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+1
Pull xfs fix from Darrick Wong: "Nothing exciting here, just getting rid of a gcc warning that I got tired of seeing when I turn on gcov" * tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix uninit warning in xfs_growfs_data
2023-07-09Merge tag '6.5-rc-smb3-client-fixes-part2' of ↵Linus Torvalds4-5/+72
git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - fix potential use after free in unmount - minor cleanup - add worker to cleanup stale directory leases * tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: cifs: Add a laundromat thread for cached directories smb: client: remove redundant pointer 'server' cifs: fix session state transition to avoid use-after-free issue
2023-07-09Merge tag 'ntb-6.5' of https://github.com/jonmason/ntbLinus Torvalds10-33/+36
Pull NTB updates from Jon Mason: "Fixes for pci_clean_master, error handling in driver inits, and various other issues/bugs" * tag 'ntb-6.5' of https://github.com/jonmason/ntb: ntb: hw: amd: Fix debugfs_create_dir error checking ntb.rst: Fix copy and paste error ntb_netdev: Fix module_init problem ntb: intel: Remove redundant pci_clear_master ntb: epf: Remove redundant pci_clear_master ntb_hw_amd: Remove redundant pci_clear_master ntb: idt: drop redundant pci_enable_pcie_error_reporting() MAINTAINERS: git://github -> https://github.com for jonmason NTB: EPF: fix possible memory leak in pci_vntb_probe() NTB: ntb_tool: Add check for devm_kcalloc NTB: ntb_transport: fix possible memory leak while device_register() fails ntb: intel: Fix error handling in intel_ntb_pci_driver_init() NTB: amd: Fix error handling in amd_ntb_pci_driver_init() ntb: idt: Fix error handling in idt_pci_driver_init()
2023-07-08mm: lock newly mapped VMA with corrected orderingHugh Dickins1-2/+2
Lockdep is certainly right to complain about (&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f but task is already holding lock: (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db Invert those to the usual ordering. Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it becomes visible") Cc: stable@vger.kernel.org Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-08Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of ↵Linus Torvalds18-44/+99
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "16 hotfixes. Six are cc:stable and the remainder address post-6.4 issues" The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since it was all hopefully fixed in mainline. * tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib: dhry: fix sleeping allocations inside non-preemptable section kasan, slub: fix HW_TAGS zeroing with slub_debug kasan: fix type cast in memory_is_poisoned_n mailmap: add entries for Heiko Stuebner mailmap: update manpage link bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page MAINTAINERS: add linux-next info mailmap: add Markus Schneider-Pargmann writeback: account the number of pages written back mm: call arch_swap_restore() from do_swap_page() squashfs: fix cache race with migration mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison docs: update ocfs2-devel mailing list address MAINTAINERS: update ocfs2-devel mailing list address mm: disable CONFIG_PER_VMA_LOCK until its fixed fork: lock VMAs of the parent process when forking
2023-07-08fork: lock VMAs of the parent process when forkingSuren Baghdasaryan1-0/+1
When forking a child process, the parent write-protects anonymous pages and COW-shares them with the child being forked using copy_present_pte(). We must not take any concurrent page faults on the source vma's as they are being processed, as we expect both the vma and the pte's behind it to be stable. For example, the anon_vma_fork() expects the parents vma->anon_vma to not change during the vma copy. A concurrent page fault on a page newly marked read-only by the page copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the source vma, defeating the anon_vma_clone() that wasn't done because the parent vma originally didn't have an anon_vma, but we now might end up copying a pte entry for a page that has one. Before the per-vma lock based changes, the mmap_lock guaranteed exclusion with concurrent page faults. But now we need to do a vma_start_write() to make sure no concurrent faults happen on this vma while it is being processed. This fix can potentially regress some fork-heavy workloads. Kernel build time did not show noticeable regression on a 56-core machine while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~5% regression. If such fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic. Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Jiri Slaby <jirislaby@kernel.org> Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/ Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/ Reported-by: Jacob Young <jacobly.alt@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") Cc: stable@vger.kernel.org Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-08mm: lock newly mapped VMA which can be modified after it becomes visibleSuren Baghdasaryan1-0/+2
mmap_region adds a newly created VMA into VMA tree and might modify it afterwards before dropping the mmap_lock. This poses a problem for page faults handled under per-VMA locks because they don't take the mmap_lock and can stumble on this VMA while it's still being modified. Currently this does not pose a problem since post-addition modifications are done only for file-backed VMAs, which are not handled under per-VMA lock. However, once support for handling file-backed page faults with per-VMA locks is added, this will become a race. Fix this by write-locking the VMA before inserting it into the VMA tree. Other places where a new VMA is added into VMA tree do not modify it after the insertion, so do not need the same locking. Cc: stable@vger.kernel.org Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-08mm: lock a vma before stack expansionSuren Baghdasaryan1-0/+4
With recent changes necessitating mmap_lock to be held for write while expanding a stack, per-VMA locks should follow the same rules and be write-locked to prevent page faults into the VMA being expanded. Add the necessary locking. Cc: stable@vger.kernel.org Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-08Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds12-706/+32
Pull more SCSI updates from James Bottomley: "A few late arriving patches that missed the initial pull request. It's mostly bug fixes (the dt-bindings is a fix for the initial pull)" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ufs: core: Remove unused function declaration scsi: target: docs: Remove tcm_mod_builder.py scsi: target: iblock: Quiet bool conversion warning with pr_preempt use scsi: dt-bindings: ufs: qcom: Fix ICE phandle scsi: core: Simplify scsi_cdl_check_cmd() scsi: isci: Fix comment typo scsi: smartpqi: Replace one-element arrays with flexible-array members scsi: target: tcmu: Replace strlcpy() with strscpy() scsi: ncr53c8xx: Replace strlcpy() with strscpy() scsi: lpfc: Fix lpfc_name struct packing