summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2007-12-19Do dirty page accounting when removing a page from the page cacheLinus Torvalds1-0/+12
Krzysztof Oledzki noticed a dirty page accounting leak on some of his machines, causing the machine to eventually lock up when the kernel decided that there was too much dirty data, but nobody could actually write anything out to fix it. The culprit turns out to be filesystems (cough ext3 with data=journal cough) that re-dirty the page when the "->invalidatepage()" callback is called. Fix it up by doing a final dirty page accounting check when we actually remove the page from the page cache. This fixes bugzilla entry 9182: http://bugzilla.kernel.org/show_bug.cgi?id=9182 Tested-by: Ingo Molnar <mingo@elte.hu> Tested-by: Krzysztof Oledzki <olel@ans.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86Linus Torvalds12-90/+102
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: x86: fix "Kernel panic - not syncing: IO-APIC + timer doesn't work!" genirq: revert lazy irq disable for simple irqs x86: also define AT_VECTOR_SIZE_ARCH x86: kprobes bugfix x86: jprobe bugfix timer: kernel/timer.c section fixes genirq: add unlocked version of set_irq_handler() clockevents: fix reprogramming decision in oneshot broadcast oprofile: op_model_athlon.c support for AMD family 10h barcelona performance counters
2007-12-18x86: fix "Kernel panic - not syncing: IO-APIC + timer doesn't work!"Ingo Molnar2-8/+24
this is the tale of a full day spent debugging an ancient but elusive bug. after booting up thousands of random .config kernels, i finally happened to generate a .config that produced the following rare bootup failure on 32-bit x86: | ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1 | ..MP-BIOS bug: 8254 timer not connected to IO-APIC | ...trying to set up timer (IRQ0) through the 8259A ... failed. | ...trying to set up timer as Virtual Wire IRQ... failed. | ...trying to set up timer as ExtINT IRQ... failed :(. | Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug | and send a report. Then try booting with the 'noapic' option this bug has been reported many times during the years, but it was never reproduced nor fixed. the bug that i hit was extremely sensitive to .config details. First i did a .config-bisection - suspecting some .config detail. That led to CONFIG_X86_MCE: enabling X86_MCE magically made the bug disappear and the system would boot up just fine. Debugging my way through the MCE code ended up identifying two unlikely candidates: the thing that made a real difference to the hang was that X86_MCE did two printks: Intel machine check architecture supported. Intel machine check reporting enabled on CPU#1. Adding the same printks to a !CONFIG_X86_MCE kernel made the bug go away! this left timing as the main suspect: i experimented with adding various udelay()s to the arch/x86/kernel/io_apic_32.c:check_timer() function, and the race window turned out to be narrower than 30 microseconds (!). That made debugging especially funny, debugging without having printk ability before the bug hits is ... interesting ;-) eventually i started suspecting IRQ activities - those are pretty much the only thing that happen this early during bootup and have the timescale of a few dozen microseconds. Also, check_timer() changes the IRQ hardware in various creative ways, so the main candidate became IRQ0 interaction. i've added a counter to track timer irqs (on which core they arrived, at what exact time, etc.) and found that no timer IRQ would arrive after the bug condition hits - even if we re-enable IRQ0 and re-initialize the i8259A, but that we'd get a small number of timer irqs right around the time when we call the check_timer() function. Eventually i got the following backtrace triggered from debug code in the timer interrupt: ...trying to set up timer as Virtual Wire IRQ... failed. ...trying to set up timer as ExtINT IRQ... Pid: 1, comm: swapper Not tainted (2.6.24-rc5 #57) EIP: 0060:[<c044d57e>] EFLAGS: 00000246 CPU: 0 EIP is at _spin_unlock_irqrestore+0x5/0x1c EAX: c0634178 EBX: 00000000 ECX: c4947d63 EDX: 00000246 ESI: 00000002 EDI: 00010031 EBP: c04e0f2e ESP: f7c41df4 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: ffe04000 CR3: 00630000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c05f5784>] setup_IO_APIC+0x9c3/0xc5c the spin_unlock() was called from init_8259A(). Wait ... we have an IRQ0 entry while we are in the middle of setting up the local APIC, the i8259A and the PIT?? That is certainly not how it's supposed to work! check_timer() was supposed to be called with irqs turned off - but this eroded away sometime in the past. This code would still work most of the time because this code runs very quickly, but just the right timing conditions are present and IRQ0 hits in this small, ~30 usecs window, timer irqs stop and the system does not boot up. Also, given how early this is during bootup, the hang is very deterministic - but it would only occur on certain machines (and certain configs). The fix was quite simple: disable/restore interrupts properly in this function. With that in place the test-system now boots up just fine. (64-bit x86 io_apic_64.c had the same bug.) Phew! One down, only 1500 other kernel bugs are left ;-) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18genirq: revert lazy irq disable for simple irqsSteven Rostedt1-7/+2
In commit 76d2160147f43f982dfe881404cfde9fd0a9da21 lazy irq disabling was implemented, and the simple irq handler had a masking set to it. Remy Bohmer discovered that some devices in the ARM architecture would trigger the mask, but never unmask it. His patch to do the unmasking was questioned by Russell King about masking simple irqs to begin with. Looking further, it was discovered that the problems Remy was seeing was due to improper use of the simple handler by devices, and he later submitted patches to fix those. But the issue that was uncovered was that the simple handler should never mask. This patch reverts the masking in the simple handler. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-12-18x86: also define AT_VECTOR_SIZE_ARCHJan Beulich1-0/+7
The patch introducing this left out 64-bit x86 despite it also having extra entries. this solves Xen guest troubles. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18x86: kprobes bugfixMasami Hiramatsu1-23/+18
Kprobes for x86-64 may cause a kernel crash if it inserted on "iret" instruction. "call absolute" is invalid on x86-64, so we don't need treat it. - Change the processing order as same as x86-32. - Add "iret"(0xcf) case. - Remove next_rip local variable. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18x86: jprobe bugfixMasami Hiramatsu4-9/+5
jprobe for x86-64 may cause kernel page fault when the jprobe_return() is called from incorrect function. - Use jprobe_saved_regs instead getting it from stack. (Especially on x86-64, it may get incorrect data, because pt_regs can not be get by using container_of(rsp)) - Change the type of stack pointer to unsigned long *. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18timer: kernel/timer.c section fixesAdrian Bunk1-2/+2
This patch fixes the following section mismatches with CONFIG_HOTPLUG=n, CONFIG_HOTPLUG_CPU=y: ... WARNING: vmlinux.o(.text+0x41cd3): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq') WARNING: vmlinux.o(.text+0x41d67): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq') ... Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18genirq: add unlocked version of set_irq_handler()Kevin Hilman1-0/+7
Add unlocked version for use by irq_chip.set_type handlers which may wish to change handler to level or edge handler when IRQ type is changed. The normal set_irq_handler() call cannot be used because it tries to take irq_desc.lock which is already held when the irq_chip.set_type hook is called. Signed-off-by: Kevin Hilman <khilman@mvista.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18clockevents: fix reprogramming decision in oneshot broadcastThomas Gleixner1-35/+21
Resolve the following regression of a choppy, almost unusable laptop: http://lkml.org/lkml/2007/12/7/299 http://bugzilla.kernel.org/show_bug.cgi?id=9525 A previous version of the code did the reprogramming of the broadcast device in the return from idle code. This was removed, but the logic in tick_handle_oneshot_broadcast() was kept the same. When a broadcast interrupt happens we signal the expiry to all CPUs which have an expired event. If none of the CPUs has an expired event, which can happen in dyntick mode, then we reprogram the broadcast device. We do not reprogram otherwise, but this is only correct if all CPUs, which are in the idle broadcast state have been woken up. The code ignores, that there might be pending not yet expired events on other CPUs, which are in the idle broadcast state. So the delivery of those events can be delayed for quite a time. Change the tick_handle_oneshot_broadcast() function to check for CPUs, which are in broadcast state and are not woken up by the current event, and enforce the rearming of the broadcast device for those CPUs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-18oprofile: op_model_athlon.c support for AMD family 10h barcelona performance ↵Barry Kasindorf1-6/+16
counters This patch is for controlling the upper 32bits of the event ctrl msrs. This includes the upper 4 bits of the event select and the Guest Only and Host Only bits This patch is necessary to make Event Based Profiling work reliably on a Family 10h processor [akpm@linux-foundation.org: checkpatch.pl fixes] Signed-off-by: Barry Kasindorf <barry.kasindorf@amd.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds5-15/+21
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: do not hurt SCHED_BATCH on wakeup sched: touch softlockup watchdog after idling sched: sysctl, proc_dointvec_minmax() expects int values for sched: mark rwsem functions as __sched for wchan/profiling sched: fix crash on ia64, introduce task_current()
2007-12-18Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds8-173/+100
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Cleanup umem driver: fix most checkpatch warnings, conform to kernel block: let elv_register() return void as-iosched: fix write batch start point as-iosched: fix incorrect comments block: use jiffies conversion functions in scsi_ioctl.c
2007-12-18Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6Linus Torvalds5-15/+12
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: [XFS] Put the correct offset in dirent d_off [XFS] Don't wait for pending I/Os when purging blocks beyond eof.
2007-12-18Merge branch 'for-linus' of ↵Linus Torvalds4-10/+61
git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc: mmc: remove unused 'mode' from the mmc_host structure sdhci: support JMicron JMB38x chips sdhci: use PIO when DMA can't satisfy the request sdhci: don't warn about sdhci 2.0 controllers sdhci: describe quirks
2007-12-18sched: do not hurt SCHED_BATCH on wakeupIngo Molnar1-2/+1
measurements by Yanmin Zhang have shown that SCHED_BATCH tasks benefit if they run the same place_entity() logic as SCHED_OTHER tasks - so uniformize behavior in this area. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-18sched: touch softlockup watchdog after idlingIngo Molnar1-0/+1
touch softlockup watchdog after idling. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-18sched: sysctl, proc_dointvec_minmax() expects int values forEric Dumazet1-4/+4
min_sched_granularity_ns, max_sched_granularity_ns, min_wakeup_granularity_ns and max_wakeup_granularity_ns are declared "unsigned long". This is incorrect since proc_dointvec_minmax() expects plain "int" guard values. This bug only triggers on big endian 64 bit arches. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-18sched: mark rwsem functions as __sched for wchan/profilingLivio Soares2-3/+4
This following commit http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fdf8cb0909b531f9ae8f9b9d7e4eb35ba3505f07 un-inlined a low-level rwsem function, but did not mark it as __sched. The result is that it now shows up as thread wchan (which also affects /proc/profile stats). The following simple patch fixes this by properly marking rwsem_down_failed_common() as a __sched function. Also in this patch, which is up for discussion, marks down_read() and down_write() proper as __sched. For profiling, it is pretty much useless to know that a semaphore is beig help - it is necessary to know _which_ one. By going up another frame on the stack, the information becomes much more useful. In summary, the below change to lib/rwsem.c should be applied; the changes to kernel/rwsem.c could be applied if other kernel hackers agree with my proposal that down_read()/down_write() in the profile is not enough. [ akpm@linux-foundation.org: build fix ] Signed-off-by: Livio Soares <livio@eecg.toronto.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-18sched: fix crash on ia64, introduce task_current()Dmitry Adamushko1-6/+11
Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and sched_move_task()) must handle a given task differently in case it's the 'rq->curr' task on its run-queue. The task_running() interface is not suitable for determining such tasks for platforms with one of the following options: #define __ARCH_WANT_UNLOCKED_CTXSW #define __ARCH_WANT_INTERRUPTS_ON_CTXSW Due to the fact that it makes use of 'p->oncpu == 1' as a criterion but such a task is not necessarily 'rq->curr'. The detailed explanation is available here: https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2007-12-18Cleanup umem driver: fix most checkpatch warnings, conform to kernelRandy Dunlap1-156/+81
coding style. linux-2.6.24-rc5-git3> checkpatch.pl-next patches/block-umem-ckpatch.patch total: 0 errors, 5 warnings, 530 lines checked All of these are line-length warnings. Only change in generated object file is due to not initializing a static global variable to 0. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18block: let elv_register() return voidAdrian Bunk6-12/+13
elv_register() always returns 0, and there isn't anything it does where it should return an error (the only error condition is so grave that it's handled with a BUG_ON). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18as-iosched: fix write batch start pointAaron Carroll1-1/+2
New write batches currently start from where the last one completed. We have no idea where the head is after switching batches, so this makes little sense. Instead, start the next batch from the request with the earliest deadline in the hope that we avoid a deadline expiry later on. Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18as-iosched: fix incorrect commentsAaron Carroll1-2/+2
Two comments refer to deadlines applying to reads only. This is not the case. Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18block: use jiffies conversion functions in scsi_ioctl.cTejun Heo1-2/+2
Use msecs_to_jiffies() and jiffies_to_msecs() in scsi_ioctl(). Sometimes callers use very large values for e.g. vendor specific media clear command and calculation can overflow. Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18[XFS] Put the correct offset in dirent d_offLachlan McIlroy4-13/+8
The recent filldir regression fix was not putting the correct d_off in each dirent. This was resulting in incorrect cookies being passed to dmapi ioctls and the wrong offset appearing in the dirents. readdir was unaffected as the filp->f_pos was being updated with the correct offset and this was being written into the last dirent in each buffer. Fix the XFS code to do the right thing. SGI-PV: 973746 SGI-Modid: xfs-linux-melb:xfs-kern:30240a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2007-12-18[XFS] Don't wait for pending I/Os when purging blocks beyond eof.Lachlan McIlroy1-2/+4
On last close of a file we purge blocks beyond eof. The same code is used when we truncate the file size down. In this case we need to wait for any pending I/Os for dirty pages beyond the new eof. For the last close case we are not changing the file size and therefore do not need to wait for any I/Os to complete. This fixes a performance bottleneck where writes into the page cache and cache flushes can become mutually exclusive. SGI-PV: 964002 SGI-Modid: xfs-linux-melb:xfs-kern:30220a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Peter Leckie <pleckie@sgi.com>
2007-12-17Merge branch 'upstream-linus' of ↵Linus Torvalds24-91/+168
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: (23 commits) iwlwifi: fix rf_kill state inconsistent during suspend and resume b43: Fix rfkill radio LED bcm43xx_debugfs sscanf fix libertas: select WIRELESS_EXT iwlwifi3945/4965: fix rate control algo reference leak ieee80211_rate: missed unlock wireless/ipw2200.c: add __dev{init,exit} annotations zd1211rw: Fix alignment problems libertas: add Dan Williams as maintainer sis190 endianness ucc_geth: really fix section mismatch pcnet_cs: add new id ixgb: make sure jumbos stay enabled after reset Net: ibm_newemac, remove SPIN_LOCK_UNLOCKED net: smc911x: shut up compiler warnings ucc_geth: minor whitespace fix drivers/net/s2io.c section fixes drivers/net/sis190.c section fix hamachi endianness fixes e100: free IRQ to remove warningwhenrebooting ...
2007-12-17Merge branch 'upstream-linus' of ↵Linus Torvalds8-173/+419
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: libata: fix ATAPI draining libata: update atapi_eh_request_sense() such that lbam/lbah contains buffer size libata-acpi: implement _GTF command filtering libata-acpi: improve _GTF execution error handling and reporting libata-acpi: improve ACPI disabling libata-acpi: implement dev->gtf_cache and evaluate _GTF right after _STM during resume libata-acpi: implement and use ata_acpi_init_gtm() libata-acpi: add new hooks ata_acpi_dissociate() and ata_acpi_on_disable() libata: ata_dev_disable() should be called from EH context libata: add more opcodes to ata.h libata: update ata_*_printk() macros such that level can be a variable libata-acpi: adjust constness in ata_acpi_gtm/stm() parameters sata_mv: improve warnings about Highpoint RocketRAID 23xx cards libata: add ST3160023AS / 3.42 to NCQ blacklist libata: clear link->eh_info.serror from ata_std_postreset() sata_sil: fix spurious IRQ handling
2007-12-17sysctl: fix ax25 checksEric W. Biederman1-1/+6
Fix: sysctl table check failed: /net/ax25/ax0/ax25_default_mode .3.9.1.2 Unknown sysctl binary path Pid: 2936, comm: kissattach Not tainted 2.6.24-rc5 #1 [<c012ca6a>] set_fail+0x3b/0x43 [<c012ce7a>] sysctl_check_table+0x408/0x456 [<c012ce8e>] sysctl_check_table+0x41c/0x456 [<c012ce8e>] sysctl_check_table+0x41c/0x456 ... Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Bernard Pidoux <pidoux@ccr.jussieu.fr> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17quicklist: Set tlb->need_flush if pages are remaining in quicklist 0Christoph Lameter1-0/+4
This ensures that the quicklists are drained. Otherwise draining may only occur when the processor reaches an idle state. Fixes fatal leakage of pgd_t's on 2.6.22 and later. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reported-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17SLUB: remove useless masking of GFP_ZEROChristoph Lameter1-3/+0
Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already masked out in new_slab() (See how it calls allocate_slab). No need to do it twice. This reverts the SLUB parts of 7fd272550bd43cc1d7289ef0ab2fa50de137e767. Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17Fix compilation warning in dquot.cJan Kara1-2/+2
Fix compilation warning about discarded const. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17Documentation: update hugetlb informationNishanth Aravamudan2-6/+48
The hugetlb documentation has gotten a bit out of sync with the current code. Updated the sysctl file to refer to Documentation/vm/hugetlbpage.txt. Update that file to contain the current state of affairs (with the newer named sysctl in place). Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17Revert "hugetlb: Add hugetlb_dynamic_pool sysctl"Nishanth Aravamudan3-14/+0
This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb: Add hugetlb_dynamic_pool sysctl") Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool sysctl is not needed, as its semantics can be expressed by 0 in the overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl (pool enabled). (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change) Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17hugetlb: introduce nr_overcommit_hugepages sysctlNishanth Aravamudan3-6/+70
hugetlb: introduce nr_overcommit_hugepages sysctl While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I became convinced that having a boolean sysctl was insufficient: 1) To support per-node control of hugepages, I have previously submitted patches to add a sysfs attribute related to nr_hugepages. However, with a boolean global value and per-mount quota enforcement constraining the dynamic pool, adding corresponding control of the dynamic pool on a per-node basis seems inconsistent to me. 2) Administration of the hugetlb dynamic pool with multiple hugetlbfs mount points is, arguably, more arduous than it needs to be. Each quota would need to be set separately, and the sum would need to be monitored. To ease the administration, and to help make the way for per-node control of the static & dynamic hugepage pool, I added a separate sysctl, nr_overcommit_hugepages. This value serves as a high watermark for the overall hugepage pool, while nr_hugepages serves as a low watermark. The boolean sysctl can then be removed, as the condition nr_overcommit_hugepages > 0 indicates the same administrative setting as hugetlb_dynamic_pool == 1 Quotas still serve as local enforcement of the size of the pool on a per-mount basis. A few caveats: 1) There is a race whereby the global surplus huge page counter is incremented before a hugepage has allocated. Another process could then try grow the pool, and fail to convert a surplus huge page to a normal huge page and instead allocate a fresh huge page. I believe this is benign, as no memory is leaked (the actual pages are still tracked correctly) and the counters won't go out of sync. 2) Shrinking the static pool while a surplus is in effect will allow the number of surplus huge pages to exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. Successfully tested on x86_64 with the current libhugetlbfs snapshot, modified to use the new sysctl. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17ecryptfs: fix fsx data corruption problemsEric Sandeen2-17/+41
ecryptfs in 2.6.24-rc3 wasn't surviving fsx for me at all, dying after 4 ops. Generally, encountering problems with stale data and improperly zeroed pages. An extending truncate + write for example would expose stale data. With the changes below I got to a million ops and beyond with all mmap ops disabled - mmap still needs work. (A version of this patch on a RHEL5 kernel ran for over 110 million fsx ops) I added a few comments as well, to the best of my understanding as I read through the code. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17fix bloat-o-meter for ppc64Nathan Lynch1-1/+2
bloat-o-meter assumes that a '.' anywhere in a symbol's name means that it is static and prepends 'static.' to the first part of the symbol name, discarding the portion of the name that follows the '.'. However, the names of function entry points begin with '.' in the ppc64 ABI. This causes all function text size changes to be accounted to a single 'static.' entry in the output when comparing ppc64 kernels. Change getsizes() to ignore the first character of the symbol name when searching for '.'. Signed-off-by: Nathan Lynch <ntl@pobox.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17I/OAT: fix null device in call to dev_err()Shannon Nelson1-1/+1
We can't use the device in a dev_err() after a kzalloc failure or after the kfree, so simplify it to the pdev that was originally passed in. Cc: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17I/OAT: fixups from code commentsShannon Nelson2-66/+78
A few fixups from Andrew's code comments. - removed "static inline" forward-declares - changed use of min() to min_t() - removed some unnecessary NULL initializations - removed a couple of BUG() calls Fixes this: drivers/dma/ioat_dma.c: In function `ioat1_tx_submit': drivers/dma/ioat_dma.c:177: sorry, unimplemented: inlining failed in call to '__ioat1_dma_memcpy_issue_pending': function body not available drivers/dma/ioat_dma.c:268: sorry, unimplemented: called from here Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Cc: "Williams, Dan J" <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17ecryptfs: set s_blocksize from lower fs in sbEric Sandeen1-0/+1
eCryptfs wasn't setting s_blocksize in it's superblock; just pick it up from the lower FS. Having an s_blocksize of 0 made things like "filefrag" which call FIGETBSZ unhappy. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mike Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17mm: fix page allocation for larger I/O segmentsMel Gorman1-0/+11
In some cases the IO subsystem is able to merge requests if the pages are adjacent in physical memory. This was achieved in the allocator by having expand() return pages in physically contiguous order in situations were a large buddy was split. However, list-based anti-fragmentation changed the order pages were returned in to avoid searching in buffered_rmqueue() for a page of the appropriate migrate type. This patch restores behaviour of rmqueue_bulk() preserving the physical order of pages returned by the allocator without incurring increased search costs for anti-fragmentation. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Mark Lord <mlord@pobox.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17apm_event{,info}_t are userspace typesAdam Jackson1-3/+3
These types define the size of data read from /dev/apm_bios. They should not be hidden behind #ifdef __KERNEL__. This is killing my xserver compile, apm_event_t is used in the xserver source. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17drivers/cpufreq/cpufreq_stats.c section fixAdrian Bunk1-1/+1
cpufreq_stats_free_table() mustn't be __cpuexit since it's called by the __cpuinit cpufreq_stat_cpu_callback(). This patch fixes the following section mismatch reported by Chris Clayton: WARNING: vmlinux.o(.init.text+0x143dd): Section mismatch: reference to .exit.text:cpufreq_stats_free_table (between 'cpufreq_stat_cpu_callback' and 'cpufreq_stats_init') Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Chris Clayton <chris2553@googlemail.com> Acked-by: Dave Jones <davej@codemonkey.org.uk> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17drivers/macintosh/via-pmu.c: Added a missing iounmapJulia Lawall1-0/+1
The error handling code should undo the ioremap as well. The problem was detected using the following semantic match (http://www.emn.fr/x-info/coccinelle/) // <smpl> @@ type T,T1,T2; identifier E; statement S; expression x1,x2; constant C; int ret; @@ T E; ... * E = ioremap(...); if (E == NULL) S ... when != iounmap(E) when != if (E != NULL) { ... iounmap(E); ...} when != x1 = (T1)E if (...) { ... when != iounmap(E) when != if (E != NULL) { ... iounmap(E); ...} when != x2 = (T2)E ( * return; | * return C; | * return ret; ) } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Olaf Hering <olaf@aepfle.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17pktcdvd: add kobject_put when kobject register failsDave Young1-1/+3
In kobject_register, the kobject reference is get in kobject_init, and then kobject_add. If kobject_add fail, it will only cleanup the reference got by itself. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Greg KH <greg@kroah.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17mm/sparse.c: improve the error handling for sparse_add_one_section()WANG Cong1-6/+12
Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I see no reason to check 'usemap' until holding the 'pgdat_resize_lock'. [geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST] Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17mm/sparse.c: check the return value of sparse_index_alloc()WANG Cong1-0/+2
Since sparse_index_alloc() can return NULL on memory allocation failure, we must deal with the failure condition when calling it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17cpufreq: fix missing unlocks in cpufreq_add_dev error paths.Dave Jones1-3/+9
Ingo hit some BUG_ONs that were probably caused by these missing unlocks causing an unbalance. He couldn't reproduce the bug reliably, so it's unknown that it's definitly fixing the problem he hit, but it's a fairly good chance, and this fixes an obvious bug. [ Dave: "Ingo followed up that he hit some lockdep related output with this applied, so it may not be right. I'll look at it after xmas if no-one has it figured out before then." Akpm: "It looks pretty correct to me though." ] Signed-off-by: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17alpha: build fixesIvan Kokshaysky6-13/+17
This fixes some of the alpha-specific build problems, except a) modpost warning about COMMON symbol "saved_config" and b) nasty final link failure with gcc-4.x, -Os and scsi-disk driver configured built-in (due to jump table in .rodata referencing discarded .exit.text). - build failure with gcc-4.2.x: fix up casts in cia_io* routines to avoid warnings ('discards qualifiers from pointer target type'), which are failures, thanks to -Werror; - modpost warnings: add missing __init qualifier for titan and marvel; for non-generic build, move machine vectors from .data to .data.init.refok section; - unbreak CPU-specific optimization: rearrange cpuflags-y assignments so that extended -mcpu value (ev56, pca56, ev67) overrides basic one (ev5, ev6) and not vice versa. Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>