summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2005-09-10[PATCH] kernel: fix-up schedule_timeout() usageNishanth Aravamudan3-20/+10
Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] add schedule_timeout_{,un}interruptible() interfacesNishanth Aravamudan1-0/+14
Add schedule_timeout_{,un}interruptible() interfaces so that schedule_timeout() callers don't have to worry about forgetting to add the set_current_state() call beforehand. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] kernel/acct: add kerneldocRandy Dunlap1-16/+27
for kernel/acct.c: - fix typos - add kerneldoc for non-static functions Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: allow the load to grow upto its cpu_powerSiddha, Suresh B1-2/+7
Don't pull tasks from a group if that would cause the group's total load to drop below its total cpu_power (ie. cause the group to start going idle). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: don't kick ALB in the presence of pinned taskSiddha, Suresh B1-1/+14
Jack Steiner brought this issue at my OLS talk. Take a scenario where two tasks are pinned to two HT threads in a physical package. Idle packages in the system will keep kicking migration_thread on the busy package with out any success. We will run into similar scenarios in the presence of CMP/NUMA. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: use cached variable in sys_sched_yield()Renaud Lienhart1-1/+1
In sys_sched_yield(), we cache current->array in the "array" variable, thus there's no need to dereference "current" again later. Signed-Off-By: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: HT optimisationNick Piggin1-6/+28
If an idle sibling of an HT queue encounters a busy sibling, then make higher level load balancing of the non-idle variety. Performance of multiprocessor HT systems with low numbers of tasks (generally < number of virtual CPUs) can be significantly worse than the exact same workloads when running in non-HT mode. The reason is largely due to poor scheduling behaviour. This patch improves the situation, making the performance gap far less significant on one problematic test case (tbench). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: less lockingNick Piggin1-7/+2
During periodic load balancing, don't hold this runqueue's lock while scanning remote runqueues, which can take a non trivial amount of time especially on very large systems. Holding the runqueue lock will only help to stabilise ->nr_running, however this doesn't do much to help because tasks being woken will simply get held up on the runqueue lock, so ->nr_running would not provide a really accurate picture of runqueue load in that case anyway. What's more, ->nr_running (and possibly the cpu_load averages) of remote runqueues won't be stable anyway, so load balancing is always an inexact operation. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: less newidle lockingNick Piggin1-7/+10
Similarly to the earlier change in load_balance, only lock the runqueue in load_balance_newidle if the busiest queue found has a nr_running > 1. This will reduce frequency of expensive remote runqueue lock aquisitions in the schedule() path on some workloads. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: fix SMT scheduler latency bugIngo Molnar1-4/+15
William Weston reported unusually high scheduling latencies on his x86 HT box, on the -RT kernel. I managed to reproduce it on my HT box and the latency tracer shows the incident in action: _------=> CPU# / _-----=> irqs-off | / _----=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth |||| / ||||| delay cmd pid ||||| time | caller \ / ||||| \ | / du-2803 3Dnh2 0us : __trace_start_sched_wakeup (try_to_wake_up) .............................................................. ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ... .............................................................. du-2803 3Dnh2 0us : __trace_start_sched_wakeup <<...>-2778> (73 1) du-2803 3Dnh2 0us : _raw_spin_unlock (try_to_wake_up) ................................................ ... still on CPU#3, we send an IPI to CPU#1: ... ................................................ du-2803 3Dnh1 0us : resched_task (try_to_wake_up) du-2803 3Dnh1 1us : smp_send_reschedule (try_to_wake_up) du-2803 3Dnh1 1us : send_IPI_mask_bitmask (smp_send_reschedule) du-2803 3Dnh1 2us : _raw_spin_unlock_irqrestore (try_to_wake_up) ............................................... ... 1 usec later, the IPI arrives on CPU#1: ... ............................................... <idle>-0 1Dnh. 2us : smp_reschedule_interrupt (c0100c5a 0 0) So far so good, this is the normal wakeup/preemption mechanism. But here comes the scheduler anomaly on CPU#1: <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 3us : __schedule (preempt_schedule_irq) <idle>-0 1Dnh. 3us : profile_hit (__schedule) <idle>-0 1Dnh1 3us : sched_clock (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irq (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irqsave (__schedule) <idle>-0 1Dnh2 5us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh1 5us : preempt_schedule (__schedule) <idle>-0 1Dnh1 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh2 6us : find_next_bit (__schedule) <idle>-0 1Dnh2 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 8us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh2 8us : preempt_schedule (__schedule) <idle>-0 1Dnh2 8us : find_next_bit (__schedule) <idle>-0 1Dnh2 9us : trace_stop_sched_switched (__schedule) <idle>-0 1Dnh2 9us : _raw_spin_lock (trace_stop_sched_switched) <idle>-0 1Dnh3 10us : trace_stop_sched_switched <<...>-2778> (73 8c) <idle>-0 1Dnh3 10us : _raw_spin_unlock (trace_stop_sched_switched) <idle>-0 1Dnh1 10us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh. 11us : local_irq_enable_noresched (preempt_schedule_irq) <idle>-0 1Dnh. 11us < (0) we didnt pick up pid 2778! It only gets scheduled much later: <...>-2778 1Dnh2 412us : __switch_to (__schedule) <...>-2778 1Dnh2 413us : __schedule <<idle>-0> (8c 73) <...>-2778 1Dnh2 413us : _raw_spin_unlock (__schedule) <...>-2778 1Dnh1 413us : trace_stop_sched_switched (__schedule) <...>-2778 1Dnh1 414us : _raw_spin_lock (trace_stop_sched_switched) <...>-2778 1Dnh2 414us : trace_stop_sched_switched <<...>-2778> (73 1) <...>-2778 1Dnh2 414us : _raw_spin_unlock (trace_stop_sched_switched) <...>-2778 1Dnh1 415us : trace_stop_sched_switched (__schedule) the reason for this anomaly is the following code in dependent_sleeper(): /* * If a user task with lower static priority than the * running task on the SMT sibling is trying to schedule, * delay it till there is proportionately less timeslice * left of the sibling task to prevent a lower priority * task from using an unfair proportion of the * physical cpu's resources. -ck */ [...] if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) > task_timeslice(p))) ret = 1; Note that in contrast to the comment above, we dont actually do the check based on static priority, we do the check based on timeslices. But timeslices go up and down, and even highprio tasks can randomly have very low timeslices (just before their next refill) and can thus be judged as 'lowprio' by the above piece of code. This condition is clearly buggy. The correct test is to check for static_prio _and_ to check for the preemption priority. Even on different static priority levels, a higher-prio interactive task should not be delayed due to a higher-static-prio CPU hog. There is a symmetric bug in the 'kick SMT sibling' code of this function as well, which can be solved in a similar way. The patch below (against the current scheduler queue in -mm) fixes both bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks still get properly throttled - so the dependent-sleeper logic is still in action. btw., these bugs pessimised the SMT scheduler because the 'delay wakeup' property was applied too liberally, so this fix is likely a throughput improvement as well. I separated out a smt_slice() function to make the code easier to read. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: TASK_NONINTERACTIVEIngo Molnar1-1/+10
This patch implements a task state bit (TASK_NONINTERACTIVE), which can be used by blocking points to mark the task's wait as "non-interactive". This does not mean the task will be considered a CPU-hog - the wait will simply not have an effect on the waiting task's priority - positive or negative alike. Right now only pipe_wait() will make use of it, because it's a common source of not-so-interactive waits (kernel compilation jobs, etc.). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched cleanupsIngo Molnar1-19/+25
whitespace cleanups. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: make idlest_group/cpu cpus_allowed-awareM.Baris Demiray1-4/+13
Add relevant checks into find_idlest_group() and find_idlest_cpu() to make them return only the groups that have allowed CPUs and allowed CPUs respectively. Signed-off-by: M.Baris Demiray <baris@labristeknoloji.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: run SCHED_NORMAL tasks with real time tasks on SMT siblingsCon Kolivas1-18/+47
The hyperthread aware nice handling currently puts to sleep any non real time task when a real time task is running on its sibling cpu. This can lead to prolonged starvation by having the non real time task pegged to the cpu with load balancing not pulling that task away. Currently we force lower priority hyperthread tasks to run a percentage of time difference based on timeslice differences which is meaningless when comparing real time tasks to SCHED_NORMAL tasks. We can allow non real time tasks to run with real time tasks on the sibling up to per_cpu_gain% if we use jiffies as a counter. Cleanups and micro-optimisations to the relevant code section should make it more understandable as well. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] cpuset semaphore depth check deadlock fixPaul Jackson1-40/+60
The cpusets-formalize-intermediate-gfp_kernel-containment patch has a deadlock problem. This patch was part of a set of four patches to make more extensive use of the cpuset 'mem_exclusive' attribute to manage kernel GFP_KERNEL memory allocations and to constrain the out-of-memory (oom) killer. A task that is changing cpusets in particular ways on a system when it is very short of free memory could double trip over the global cpuset_sem semaphore (get the lock and then deadlock trying to get it again). The second attempt to get cpuset_sem would be in the routine cpuset_zone_allowed(). This was discovered by code inspection. I can not reproduce the problem except with an artifically hacked kernel and a specialized stress test. In real life you cannot hit this unless you are manipulating cpusets, and are very unlikely to hit it unless you are rapidly modifying cpusets on a memory tight system. Even then it would be a rare occurence. If you did hit it, the task double tripping over cpuset_sem would deadlock in the kernel, and any other task also trying to manipulate cpusets would deadlock there too, on cpuset_sem. Your batch manager would be wedged solid (if it was cpuset savvy), but classic Unix shells and utilities would work well enough to reboot the system. The unusual condition that led to this bug is that unlike most semaphores, cpuset_sem _can_ be acquired while in the page allocation code, when __alloc_pages() calls cpuset_zone_allowed. So it easy to mistakenly perform the following sequence: 1) task makes system call to alter a cpuset 2) take cpuset_sem 3) try to allocate memory 4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem 5) deadlock The reason that this is not a serious bug for most users is that almost all calls to allocate memory don't require taking cpuset_sem. Only some code paths off the beaten track require taking cpuset_sem -- which is good. Taking a global semaphore on the main code path for allocating memory would not scale well. This patch fixes this deadlock by wrapping the up() and down() calls on cpuset_sem in kernel/cpuset.c with code that tracks the nesting depth of the current task on that semaphore, and only does the real down() if the task doesn't hold the lock already, and only does the real up() if the nesting depth (number of unmatched downs) is exactly one. The previous required use of refresh_mems(), anytime that the cpuset_sem semaphore was acquired and the code executed while holding that semaphore might try to allocate memory, is no longer required. Two refresh_mems() calls were removed thanks to this. This is a good change, as failing to get all the necessary refresh_mems() calls placed was a primary source of bugs in this cpuset code. The only remaining call to refresh_mems() is made while doing a memory allocation, if certain task memory placement data needs to be updated from its cpuset, due to the cpuset having been changed behind the tasks back. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] spinlock consolidationIngo Molnar3-6/+14
This patch (written by me and also containing many suggestions of Arjan van de Ven) does a major cleanup of the spinlock code. It does the following things: - consolidates and enhances the spinlock/rwlock debugging code - simplifies the asm/spinlock.h files - encapsulates the raw spinlock type and moves generic spinlock features (such as ->break_lock) into the generic code. - cleans up the spinlock code hierarchy to get rid of the spaghetti. Most notably there's now only a single variant of the debugging code, located in lib/spinlock_debug.c. (previously we had one SMP debugging variant per architecture, plus a separate generic one for UP builds) Also, i've enhanced the rwlock debugging facility, it will now track write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too. All locks have lockup detection now, which will work for both soft and hard spin/rwlock lockups. The arch-level include files now only contain the minimally necessary subset of the spinlock code - all the rest that can be generalized now lives in the generic headers: include/asm-i386/spinlock_types.h | 16 include/asm-x86_64/spinlock_types.h | 16 I have also split up the various spinlock variants into separate files, making it easier to see which does what. The new layout is: SMP | UP ----------------------------|----------------------------------- asm/spinlock_types_smp.h | linux/spinlock_types_up.h linux/spinlock_types.h | linux/spinlock_types.h asm/spinlock_smp.h | linux/spinlock_up.h linux/spinlock_api_smp.h | linux/spinlock_api_up.h linux/spinlock.h | linux/spinlock.h /* * here's the role of the various spinlock/rwlock related include files: * * on SMP builds: * * asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the * initializers * * linux/spinlock_types.h: * defines the generic type and initializers * * asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel * implementations, mostly inline assembly code * * (also included on UP-debug builds:) * * linux/spinlock_api_smp.h: * contains the prototypes for the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. * * on UP builds: * * linux/spinlock_type_up.h: * contains the generic, simplified UP spinlock type. * (which is an empty structure on non-debug builds) * * linux/spinlock_types.h: * defines the generic type and initializers * * linux/spinlock_up.h: * contains the __raw_spin_*()/etc. version of UP * builds. (which are NOPs on non-debug, non-preempt * builds) * * (included on UP-non-debug builds:) * * linux/spinlock_api_up.h: * builds the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. */ All SMP and UP architectures are converted by this patch. arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via crosscompilers. m32r, mips, sh, sparc, have not been tested yet, but should be mostly fine. From: Grant Grundler <grundler@parisc-linux.org> Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU). Builds 32-bit SMP kernel (not booted or tested). I did not try to build non-SMP kernels. That should be trivial to fix up later if necessary. I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids some ugly nesting of linux/*.h and asm/*.h files. Those particular locks are well tested and contained entirely inside arch specific code. I do NOT expect any new issues to arise with them. If someone does ever need to use debug/metrics with them, then they will need to unravel this hairball between spinlocks, atomic ops, and bit ops that exist only because parisc has exactly one atomic instruction: LDCW (load and clear word). From: "Luck, Tony" <tony.luck@intel.com> ia64 fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjanv@infradead.org> Signed-off-by: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <willy@debian.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se> Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: files struct with RCUDipankar Sarma2-13/+25
Patch to eliminate struct files_struct.file_lock spinlock on the reader side and use rcu refcounting rcuref_xxx api for the f_count refcounter. The updates to the fdtable are done by allocating a new fdtable structure and setting files->fdt to point to the new structure. The fdtable structure is protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A global list of fdtable to be freed is not scalable, so we use a per-cpu list. If keventd is already handling the current cpu's work, we use a timer to defer queueing of that work. Since the last publication, this patch has been re-written to avoid using explicit memory barriers and use rcu_assign_pointer(), rcu_dereference() premitives instead. This required that the fd information is kept in a separate structure (fdtable) and updated atomically. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: break up files structDipankar Sarma2-40/+63
In order for the RCU to work, the file table array, sets and their sizes must be updated atomically. Instead of ensuring this through too many memory barriers, we put the arrays and their sizes in a separate structure. This patch takes the first step of putting the file table elements in a separate structure fdtable that is embedded withing files_struct. It also changes all the users to refer to the file table using files_fdtable() macro. Subsequent applciation of RCU becomes easier after this. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: rcuref APIsDipankar Sarma1-0/+14
Adds a set of primitives to do reference counting for objects that are looked up without locks using RCU. Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com> Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] fix for cpusets minor problemKUROSAWA Takahiro1-0/+4
This patch fixes minor problem that the CPUSETS have when files in the cpuset filesystem are read after lseek()-ed beyond the EOF. Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] Prefetch kernel stacks to speed up context switchChen, Kenneth W1-0/+1
For architecture like ia64, the switch stack structure is fairly large (currently 528 bytes). For context switch intensive application, we found that significant amount of cache misses occurs in switch_to() function. The following patch adds a hook in the schedule() function to prefetch switch stack structure as soon as 'next' task is determined. This allows maximum overlap in prefetch cache lines for that structure. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] fix disassociate_ctty vs. fork raceJason Baron1-0/+3
Race is as follows. Process A forks process B, both being part of the same session. Then, A calls disassociate_ctty while B forks C: A B ==== ==== fork() copy_signal() dissasociate_ctty() .... attach_pid(p, PIDTYPE_SID, p->signal->session); Now, C can have current->signal->tty pointing to a freed tty structure, as it hasn't yet been added to the session group (to have its controlling tty cleared on the diassociate_ctty() call). This has shown up as an oops but could be even more serious. I haven't tried to create a test case, but a customer has verified that the patch below resolves the issue, which was occuring quite frequently. I'll try and post the test case if i can. The patch simply checks for a NULL tty *after* it has been attached to the proper session group and clears it as necessary. Alternatively, we could simply do the tty assignment after the the process is added to the proper session group. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] Clear task_struct->fs_excl on fork()Giancarlo Formicuccia1-0/+1
An oversight. We don't want to carry the IO scheduler's "we hold exclusive fs resources" hint over to the child across fork(). Acked-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-08Merge linux-2.6 with linux-acpi-2.6Len Brown28-290/+1128
2005-09-07Merge branch 'upstream' of ↵Linus Torvalds1-0/+1
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6
2005-09-07[PATCH] kprobes: fix bug when probed on task and isr functionsKeshavamurthy Anil S1-0/+22
This patch fixes a race condition where in system used to hang or sometime crash within minutes when kprobes are inserted on ISR routine and a task routine. The fix has been stress tested on i386, ia64, pp64 and on x86_64. To reproduce the problem insert kprobes on schedule() and do_IRQ() functions and you should see hang or system crash. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] Kprobes: prevent possible race conditions genericPrasanna S Panchamukhi1-29/+43
There are possible race conditions if probes are placed on routines within the kprobes files and routines used by the kprobes. For example if you put probe on get_kprobe() routines, the system can hang while inserting probes on any routine such as do_fork(). Because while inserting probes on do_fork(), register_kprobes() routine grabs the kprobes spin lock and executes get_kprobe() routine and to handle probe of get_kprobe(), kprobes_handler() gets executed and tries to grab kprobes spin lock, and spins forever. This patch avoids such possible race conditions by preventing probes on routines within the kprobes file and routines used by kprobes. I have modified the patches as per Andi Kleen's suggestion to move kprobes routines and other routines used by kprobes to a seperate section .kprobes.text. Also moved page fault and exception handlers, general protection fault to .kprobes.text section. These patches have been tested on i386, x86_64 and ppc64 architectures, also compiled on ia64 and sparc64 architectures. Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] introduce and use kzallocPekka J Enberg5-10/+6
This patch introduces a kzalloc wrapper and converts kernel/ to use it. It saves a little program text. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] remove duplicated code from proc and ptraceMiklos Szeredi1-13/+28
Extract common code used by ptrace_attach() and may_ptrace_attach() into a separate function. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <viro@parcelfarce.linux.theplanet.co.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: re-enable "dynamic sched domains"John Hawkes1-12/+0
Revert the hack introduced last week. Signed-off-by: John Hawkes <hawkes@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: fix the "dynamic sched domains" bugJohn Hawkes1-20/+69
For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive cpuset that includes only some, but not all, of the CPUs in a node will mangle the sched domain structures. Signed-off-by: John Hawkes <hawkes@sgi.com> Cc; Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: Move the ia64 domain setup code to the generic codeJohn Hawkes1-54/+236
Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: confine oom_killer to mem_exclusive cpusetPaul Jackson1-0/+33
Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: formalize intermediate GFP_KERNEL containmentPaul Jackson1-8/+72
This patch makes use of the previously underutilized cpuset flag 'mem_exclusive' to provide what amounts to another layer of memory placement resolution. With this patch, there are now the following four layers of memory placement available: 1) The whole system (interrupt and GFP_ATOMIC allocations can use this), 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use), 3) The current tasks cpuset (GFP_USER allocations constrained to here), and 4) Specific node placement, using mbind and set_mempolicy. These nest - each layer is a subset (same or within) of the previous. Layer (2) above is new, with this patch. The call used to check whether a zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is extended to take a gfp_mask argument, and its logic is extended, in the case that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if placement is allowed. The definition of GFP_USER, which used to be identical to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous cpuset_gfp_hardwall_flag patch. GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks cpuset, so long as any node therein is not too tight on memory, but will escape to the larger layer, if need be. The intended use is to allow something like a batch manager to handle several jobs, each job in its own cpuset, but using common kernel memory for caches and such. Swapper and oom_kill activity is also constrained to Layer (2). A task in or below one mem_exclusive cpuset should not cause swapping on nodes in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a task in another such cpuset. Heavy use of kernel memory for i/o caching and such by one job should not impact the memory available to jobs in other non-overlapping mem_exclusive cpusets. This patch enables providing hardwall, inescapable cpusets for memory allocations of each job, while sharing kernel memory allocations between several jobs, in an enclosing mem_exclusive cpuset. Like Dinakar's patch earlier to enable administering sched domains using the cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag that had previously done nothing much useful other than restrict what cpuset configurations were allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] futex: remove duplicate codePekka Enberg1-12/+9
This patch cleans up the error path of futex_fd() by removing duplicate code. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] fix send_sigqueue() vs thread exit raceOleg Nesterov2-23/+27
posix_timer_event() first checks that the thread (SIGEV_THREAD_ID case) does not have PF_EXITING flag, then it calls send_sigqueue() which locks task list. But if the thread exits in between the kernel will oops (->sighand == NULL after __exit_sighand). This patch moves the PF_EXITING check into the send_sigqueue(), it must be done atomically under tasklist_lock. When send_sigqueue() detects exiting thread it returns -1. In that case posix_timer_event will send the signal to thread group. Also, this patch fixes task_struct use-after-free in posix_timer_event. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] remove a redundant variable in sys_prctl()Jesper Juhl1-4/+2
The patch removes a redundant variable `sig' from sys_prctl(). For some reason, when sys_prctl is called with option == PR_SET_PDEATHSIG then the value of arg2 is assigned to an int variable named sig. Then sig is tested with valid_signal() and later used to set the value of current->pdeath_signal . There is no reason to use this intermediate variable since valid_signal() takes a unsigned long argument, so it can handle being passed arg2 directly, and if the call to valid_signal is OK, then we know the value of arg2 is in the range zero to _NSIG and thus it'll easily fit in a plain int and thus there's no problem assigning it later to current->pdeath_signal (which is an int). The patch gets rid of the pointless variable `sig'. This reduces the size of kernel/sys.o in 2.6.13-rc6-mm1 by 32 bytes on my system. Patch has been compile tested, boot tested, and just to make damn sure I didn't break anything I wrote a quick test app that calls prctl(PR_SET_PDEATHSIG ...) with the entire range of values for a unsigned long, and it behaves as expected with and without the patch. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] largefile support for accountingPeter Staubach1-1/+1
There is a problem in the accounting subsystem in the kernel can not correctly handle files larger than 2GB. The output file containing the process accounting data can grow very large if the system is large enough and active enough. If the 2GB limit is reached, then the system simply stops storing process accounting data. Another annoying problem is that once the system reaches this 2GB limit, then every process which exits will receive a signal, SIGXFSZ. This signal is generated because an attempt was made to write beyond the limit for the file descriptor. This signal makes it look like every process has exited due to a signal, when in fact, they have not. The solution is to add the O_LARGEFILE flag to the list of flags used to open the accounting file. The rest of the accounting support is already largefile safe. The changes were tested by constructing a large file (just short of 2GB), enabling accounting, and then running enough commands to cause the accounting data generated to increase the size of the file to 2GB. Without the changes, the file grows to 2GB and the last command run in the test script appears to exit due a signal when it has not. With the changes, things work as expected and quietly. There are some user level changes required so that it can deal with largefiles, but those are being handled separately. Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] do_notify_parent_cldstop() cleanupOleg Nesterov1-35/+26
This patch simplifies the usage of do_notify_parent_cldstop(), it lessens the source and .text size slightly, and makes the code (in my opinion) a bit more readable. I am sending this patch now because I'm afraid Paul will touch do_notify_parent_cldstop() really soon, It's better to cleanup first. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] CHECK_IRQ_PER_CPU() to avoid dead code in __do_IRQ()Karsten Wiese1-1/+1
IRQ_PER_CPU is not used by all architectures. This patch introduces the macros ARCH_HAS_IRQ_PER_CPU and CHECK_IRQ_PER_CPU() to avoid the generation of dead code in __do_IRQ(). ARCH_HAS_IRQ_PER_CPU is defined by architectures using IRQ_PER_CPU in their include/asm_ARCH/irq.h file. Through grepping the tree I found the following architectures currently use IRQ_PER_CPU: cris, ia64, ppc, ppc64 and parisc. Signed-off-by: Karsten Wiese <annabellesgarden@yahoo.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] create_workqueue_thread() signedness fixMika Kukkonen1-1/+1
With "-W -Wno-unused -Wno-sign-compare" I get the following compile warning: CC kernel/workqueue.o kernel/workqueue.c: In function `workqueue_cpu_callback': kernel/workqueue.c:504: warning: ordered comparison of pointer with integer zero On error create_workqueue_thread() returns NULL, not negative pointer, so following trivial patch suggests itself. Signed-off-by: Mika Kukkonen <mikukkon@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] flush icache early when loading moduleThomas Koeller1-14/+19
Change the sequence of operations performed during module loading to flush the instruction cache before module parameters are processed. If a module has parameters of an unusual type that cannot be handled using the standard accessor functions param_set_xxx and param_get_xxx, it has to to provide a set of accessor functions for this type. This requires module code to be executed during parameter processing, which is of course only possible after the icache has been flushed. Signed-off-by: Thomas Koeller <thomas@koeller.dyndns.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] optimize writer path in time_interpolator_get_counter()Alex Williamson1-4/+13
Christoph Lameter <clameter@engr.sgi.com> When using a time interpolator that is susceptible to jitter there's potentially contention over a cmpxchg used to prevent time from going backwards. This is unnecessary when the caller holds the xtime write seqlock as all readers will be blocked from returning until the write is complete. We can therefore allow writers to insert a new value and exit rather than fight with CPUs who only hold a reader lock. Signed-off-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] Provide better printk() support for SMP machinesDavid Howells1-1/+12
The attached patch prevents oopses interleaving with characters from other printks on other CPUs by only breaking the lock if the oops is happening on the machine holding the lock. It might be better if the oops generator got the lock and then called an inner vprintk routine that assumed the caller holds the lock, thus making oops reports "atomic". Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] detect soft lockupsIngo Molnar4-0/+154
This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP. When enabled then per-CPU watchdog threads are started, which try to run once per second. If they get delayed for more than 10 seconds then a callback from the timer interrupt detects this condition and prints out a warning message and a stack dump (once per lockup incident). The feature is otherwise non-intrusive, it doesnt try to unlock the box in any way, it only gets the debug info out, automatically, and on all CPUs affected by the lockup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de> Signed-off-by: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] FUTEX_WAKE_OP: pthread_cond_signal() speedupJakub Jelinek1-0/+116
ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter (which at least on UP usually means an immediate context switch to one of the waiter threads). This waiter wakes up and after a few instructions it attempts to acquire the cv internal lock, but that lock is still held by the thread calling pthread_cond_signal. So it goes to sleep and eventually the signalling thread is scheduled in, unlocks the internal lock and wakes the waiter again. Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal to avoid this performance issue, but it was removed when locks were redesigned to the 3 state scheme (unlocked, locked uncontended, locked contended). Following scenario shows why simply using FUTEX_REQUEUE in pthread_cond_signal together with using lll_mutex_unlock_force in place of lll_mutex_unlock is not enough and probably why it has been disabled at that time: The number is value in cv->__data.__lock. thr1 thr2 thr3 0 pthread_cond_wait 1 lll_mutex_lock (cv->__data.__lock) 0 lll_mutex_unlock (cv->__data.__lock) 0 lll_futex_wait (&cv->__data.__futex, futexval) 0 pthread_cond_signal 1 lll_mutex_lock (cv->__data.__lock) 1 pthread_cond_signal 2 lll_mutex_lock (cv->__data.__lock) 2 lll_futex_wait (&cv->__data.__lock, 2) 2 lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock) # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE 2 lll_mutex_unlock_force (cv->__data.__lock) 0 cv->__data.__lock = 0 0 lll_futex_wake (&cv->__data.__lock, 1) 1 lll_mutex_lock (cv->__data.__lock) 0 lll_mutex_unlock (cv->__data.__lock) # Here, lll_mutex_unlock doesn't know there are threads waiting # on the internal cv's lock Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal, but it will cost us not one, but 2 extra syscalls and, what's worse, one of these extra syscalls will be done for every single waiting loop in pthread_cond_*wait. We would need to use lll_mutex_unlock_force in pthread_cond_signal after requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait. Another alternative is to do the unlocking pthread_cond_signal needs to do (the lock can't be unlocked before lll_futex_wake, as that is racy) in the kernel. I have implemented both variants, futex-requeue-glibc.patch is the first one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel. The kernel interface allows userland to specify how exactly an unlocking operation should look like (some atomic arithmetic operation with optional constant argument and comparison of the previous futex value with another constant). It has been implemented just for ppc*, x86_64 and i?86, for other architectures I'm including just a stub header which can be used as a starting point by maintainers to write support for their arches and ATM will just return -ENOSYS for FUTEX_WAKE_OP. The requeue patch has been (lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running 32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL. With the following benchmark on UP x86-64 I get: for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \ for j in 1 2; do echo ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done time elf/ld.so --library-path .:nptl-orig /tmp/bench real 0m0.655s user 0m0.253s sys 0m0.403s real 0m0.657s user 0m0.269s sys 0m0.388s time elf/ld.so --library-path .:nptl-requeue /tmp/bench real 0m0.496s user 0m0.225s sys 0m0.271s real 0m0.531s user 0m0.242s sys 0m0.288s time elf/ld.so --library-path .:nptl-wake_op /tmp/bench real 0m0.380s user 0m0.176s sys 0m0.204s real 0m0.382s user 0m0.175s sys 0m0.207s The benchmark is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt Older futex-requeue-glibc.patch version is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt Older futex-wake_op-glibc.patch version is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt Will post a new version (just x86-64 fixes so that the patch applies against pthread_cond_signal.S) to libc-hacker ml soon. Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded testcase that will not test the atomicity of the operation, but at least check if the threads that should have been woken up are woken up and whether the arithmetic operation in the kernel gave the expected results. Acked-by: Ingo Molnar <mingo@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jamie Lokier <jamie@shareable.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] swsusp: update documentationPavel Machek1-1/+1
This updates documentation a bit (mostly removing obsolete stuff), and marks swsusp as no longer experimental in config. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] x86/x86_64: deferred handling of writes to /proc/irqxx/smp_affinityAshok Raj2-2/+16
When handling writes to /proc/irq, current code is re-programming rte entries directly. This is not recommended and could potentially cause chipset's to lockup, or cause missing interrupts. CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the interrupt is pending. The same needs to be done for /proc/irq handling as well. Otherwise user space irq balancers are really not doing the right thing. - Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for lack of a generic name. - added move_irq out of IRQ_BALANCE, and added this same to X86_64 - Added new proc handler for write, so we can do deferred write at irq handling time. - Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL, instead it now shows only active cpu masks, or exactly what was set. - Provided a common move_irq implementation, instead of duplicating when using generic irq framework. Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off. Tested UP builds as well. MSI testing: tbd: I have cards, need to look for a x-over cable, although I did test an earlier version of this patch. Will test in a couple days. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Zwane Mwaikambo <zwane@holomorphy.com> Grudgingly-acked-by: Andi Kleen <ak@muc.de> Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[kernel-doc] fix various DocBook build problems/warningsJeff Garzik1-0/+1
Most serious is fixing include/sound/pcm.h, which breaks the DocBook build. The other stuff is just filling in things that cause warnings.
2005-09-05[PATCH] UML Support - Ptrace: adds the host SYSEMU support, for UML and ↵Laurent Vivier1-0/+3
general usage Jeff Dike <jdike@addtoit.com>, Paolo 'Blaisorblade' Giarrusso <blaisorblade_spam@yahoo.it>, Bodo Stroesser <bstroesser@fujitsu-siemens.com> Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL except that the kernel does not execute the requested syscall; this is useful to improve performance for virtual environments, like UML, which want to run the syscall on their own. In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry and on exit, and each time you also have two context switches; with SYSEMU you avoid the 2nd stop and so save two context switches per syscall. Also, some architectures don't have support in the host for changing the syscall number via ptrace(), which is currently needed to skip syscall execution (UML turns any syscall into getpid() to avoid it being executed on the host). Fixing that is hard, while SYSEMU is easier to implement. * This version of the patch includes some suggestions of Jeff Dike to avoid adding any instructions to the syscall fast path, plus some other little changes, by myself, to make it work even when the syscall is executed with SYSENTER (but I'm unsure about them). It has been widely tested for quite a lot of time. * Various fixed were included to handle the various switches between various states, i.e. when for instance a syscall entry is traced with one of PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit. Basically, this is done by remembering which one of them was used even after the call to ptrace_notify(). * We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP to make do_syscall_trace() notice that the current syscall was started with SYSEMU on entry, so that no notification ought to be done in the exit path; this is a bit of a hack, so this problem is solved in another way in next patches. * Also, the effects of the patch: "Ptrace - i386: fix Syscall Audit interaction with singlestep" are cancelled; they are restored back in the last patch of this series. Detailed descriptions of the patches doing this kind of processing follow (but I've already summed everything up). * Fix behaviour when changing interception kind #1. In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag only after doing the debugger notification; but the debugger might have changed the status of this flag because he continued execution with PTRACE_SYSCALL, so this is wrong. This patch fixes it by saving the flag status before calling ptrace_notify(). * Fix behaviour when changing interception kind #2: avoid intercepting syscall on return when using SYSCALL again. A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL crashes. The problem is in arch/i386/kernel/entry.S. The current SYSEMU patch inhibits the syscall-handler to be called, but does not prevent do_syscall_trace() to be called after this for syscall completion interception. The appended patch fixes this. It reuses the flag TIF_SYSCALL_EMU to remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since the flag is unused in the depicted situation. * Fix behaviour when changing interception kind #3: avoid intercepting syscall on return when using SINGLESTEP. When testing 2.6.9 and the skas3.v6 patch, with my latest patch and had problems with singlestepping on UML in SKAS with SYSEMU. It looped receiving SIGTRAPs without moving forward. EIP of the traced process was the same for all SIGTRAPs. What's missing is to handle switching from PTRACE_SYSCALL_EMU to PTRACE_SINGLESTEP in a way very similar to what is done for the change from PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE. I.e., after calling ptrace(PTRACE_SYSEMU), on the return path, the debugger is notified and then wake ups the process; the syscall is executed (or skipped, when do_syscall_trace() returns 0, i.e. when using PTRACE_SYSEMU), and do_syscall_trace() is called again. Since we are on the return path of a SYSEMU'd syscall, if the wake up is performed through ptrace(PTRACE_SYSCALL), we must still avoid notifying the parent of the syscall exit. Now, this behaviour is extended even to resuming with PTRACE_SINGLESTEP. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>