Diffstat (limited to 'Documentation')
63 files changed, 2174 insertions, 480 deletions
diff --git a/Documentation/ABI/stable/sysfs-class-backlight b/Documentation/ABI/stable/sysfs-class-backlight
new file mode 100644
index 000000000000..4d637e1c4ff7
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-class-backlight
@@ -0,0 +1,36 @@
+What: /sys/class/backlight/<backlight>/bl_power
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Control BACKLIGHT power, values are FB_BLANK_* from fb.h
+         - FB_BLANK_UNBLANK (0) : power on.
+         - FB_BLANK_POWERDOWN (4) : power off
+Users: HAL
+
+What: /sys/class/backlight/<backlight>/brightness
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Control the brightness for this <backlight>. Values
+        are between 0 and max_brightness. This file will also
+        show the brightness level stored in the driver, which
+        may not be the actual brightness (see actual_brightness).
+Users: HAL
+
+What: /sys/class/backlight/<backlight>/actual_brightness
+Date: March 2006
+KernelVersion: 2.6.17
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Show the actual brightness by querying the hardware.
+Users: HAL
+
+What: /sys/class/backlight/<backlight>/max_brightness
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Maximum brightness for <backlight>.
+Users: HAL
diff --git a/Documentation/ABI/testing/sysfs-class-lcd b/Documentation/ABI/testing/sysfs-class-lcd
new file mode 100644
index 000000000000..35906bf7aa70
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-lcd
@@ -0,0 +1,23 @@
+What: /sys/class/lcd/<lcd>/lcd_power
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Control LCD power, values are FB_BLANK_* from fb.h
+         - FB_BLANK_UNBLANK (0) : power on.
+         - FB_BLANK_POWERDOWN (4) : power off
+
+What: /sys/class/lcd/<lcd>/contrast
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Current contrast of this LCD device. Value is between 0 and
+        /sys/class/lcd/<lcd>/max_contrast.
+
+What: /sys/class/lcd/<lcd>/max_contrast
+Date: April 2005
+KernelVersion: 2.6.12
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Maximum contrast for this LCD device.
diff --git a/Documentation/ABI/testing/sysfs-class-led b/Documentation/ABI/testing/sysfs-class-led
new file mode 100644
index 000000000000..9e4541d71cb6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-led
@@ -0,0 +1,28 @@
+What: /sys/class/leds/<led>/brightness
+Date: March 2006
+KernelVersion: 2.6.17
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Set the brightness of the LED. Most LEDs don't
+        have hardware brightness support so will just be turned on for
+        non-zero brightness settings. The value is between 0 and
+        /sys/class/leds/<led>/max_brightness.
+
+What: /sys/class/leds/<led>/max_brightness
+Date: March 2006
+KernelVersion: 2.6.17
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Maximum brightness level for this led, default is 255 (LED_FULL).
+
+What: /sys/class/leds/<led>/trigger
+Date: March 2006
+KernelVersion: 2.6.17
+Contact: Richard Purdie <rpurdie@rpsys.net>
+Description:
+        Set the trigger for this LED. A trigger is a kernel based source
+        of led events.
+        You can change triggers in a similar manner to the way an IO
+        scheduler is chosen. Trigger specific parameters can appear in
+        /sys/class/leds/<led> once a given trigger is selected.
+
diff --git a/Documentation/ABI/testing/sysfs-gpio b/Documentation/ABI/testing/sysfs-gpio
index 8aab8092ad35..80f4c94c7bef 100644
--- a/Documentation/ABI/testing/sysfs-gpio
+++ b/Documentation/ABI/testing/sysfs-gpio
@@ -19,6 +19,7 @@ Description:
         /gpioN ... for each exported GPIO #N
             /value ... always readable, writes fail for input GPIOs
             /direction ... r/w as: in, out (default low); write: high, low
+            /edge ... r/w as: none, falling, rising, both
         /gpiochipN ... for each gpiochip; #N is its first GPIO
             /base ... (r/o) same as N
             /label ... (r/o) descriptive, not necessarily unique
diff --git a/Documentation/ABI/testing/sysfs-platform-asus-laptop b/Documentation/ABI/testing/sysfs-platform-asus-laptop
new file mode 100644
index 000000000000..a1cb660c50cf
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-platform-asus-laptop
@@ -0,0 +1,52 @@
+What: /sys/devices/platform/asus-laptop/display
+Date: January 2007
+KernelVersion: 2.6.20
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        This file allows display switching. The value
+        is composed by 4 bits and defined as follow:
+        4321
+        |||`- LCD
+        ||`-- CRT
+        |`--- TV
+        `---- DVI
+        Ex: - 0 (0000b) means no display
+            - 3 (0011b) CRT+LCD.
+
+What: /sys/devices/platform/asus-laptop/gps
+Date: January 2007
+KernelVersion: 2.6.20
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Control the gps device. 1 means on, 0 means off.
+Users: Lapsus
+
+What: /sys/devices/platform/asus-laptop/ledd
+Date: January 2007
+KernelVersion: 2.6.20
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Some models like the W1N have a LED display that can be
+        used to display several informations.
+        To control the LED display, use the following :
+            echo 0x0T000DDD > /sys/devices/platform/asus-laptop/
+        where T control the 3 letters display, and DDD the 3 digits display.
+        The DDD table can be found in Documentation/laptops/asus-laptop.txt
+
+What: /sys/devices/platform/asus-laptop/bluetooth
+Date: January 2007
+KernelVersion: 2.6.20
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Control the bluetooth device. 1 means on, 0 means off.
+        This may control the led, the device or both.
+Users: Lapsus
+
+What: /sys/devices/platform/asus-laptop/wlan
+Date: January 2007
+KernelVersion: 2.6.20
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Control the bluetooth device. 1 means on, 0 means off.
+        This may control the led, the device or both.
+Users: Lapsus
diff --git a/Documentation/ABI/testing/sysfs-platform-eeepc-laptop b/Documentation/ABI/testing/sysfs-platform-eeepc-laptop
new file mode 100644
index 000000000000..7445dfb321b5
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-platform-eeepc-laptop
@@ -0,0 +1,50 @@
+What: /sys/devices/platform/eeepc-laptop/disp
+Date: May 2008
+KernelVersion: 2.6.26
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        This file allows display switching.
+        - 1 = LCD
+        - 2 = CRT
+        - 3 = LCD+CRT
+        If you run X11, you should use xrandr instead.
+
+What: /sys/devices/platform/eeepc-laptop/camera
+Date: May 2008
+KernelVersion: 2.6.26
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Control the camera. 1 means on, 0 means off.
+
+What: /sys/devices/platform/eeepc-laptop/cardr
+Date: May 2008
+KernelVersion: 2.6.26
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Control the card reader. 1 means on, 0 means off.
+
+What: /sys/devices/platform/eeepc-laptop/cpufv
+Date: Jun 2009
+KernelVersion: 2.6.31
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        Change CPU clock configuration.
+        On the Eee PC 1000H there are three available clock configuration:
+            * 0 -> Super Performance Mode
+            * 1 -> High Performance Mode
+            * 2 -> Power Saving Mode
+        On Eee PC 701 there is only 2 available clock configurations.
+        Available configuration are listed in available_cpufv file.
+        Reading this file will show the raw hexadecimal value which
+        is defined as follow:
+        | 8 bit | 8 bit |
+            |       `---- Current mode
+            `------------ Availables modes
+        For example, 0x301 means: mode 1 selected, 3 available modes.
+
+What: /sys/devices/platform/eeepc-laptop/available_cpufv
+Date: Jun 2009
+KernelVersion: 2.6.31
+Contact: "Corentin Chary" <corentincj@iksaif.net>
+Description:
+        List available cpufv modes.
diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl
index 8e145857fc9d..df0d089d0fb9 100644
--- a/Documentation/DocBook/mtdnand.tmpl
+++ b/Documentation/DocBook/mtdnand.tmpl
@@ -568,7 +568,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip)
         <para>
                 The blocks in which the tables are stored are procteted against
                 accidental access by marking them bad in the memory bad block
-                table. The bad block table managment functions are allowed
+                table. The bad block table management functions are allowed
                 to circumvernt this protection.
         </para>
         <para>
diff --git a/Documentation/DocBook/scsi.tmpl b/Documentation/DocBook/scsi.tmpl
index 10a150ae2a7e..d87f4569e768 100644
--- a/Documentation/DocBook/scsi.tmpl
+++ b/Documentation/DocBook/scsi.tmpl
@@ -317,7 +317,7 @@
         <para>
                 The SAS transport class contains common code to deal with SAS HBAs,
                 an aproximated representation of SAS topologies in the driver model,
-                and various sysfs attributes to expose these topologies and managment
+                and various sysfs attributes to expose these topologies and management
                 interfaces to userspace.
         </para>
         <para>
diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index 5c555a8b39e5..b7f9d3b4bbf6 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -183,7 +183,7 @@ the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
 a man-pages patch, or at least a notification of the change,
 so that some information makes its way into the manual pages.

-Even if the maintainer did not respond in step #4, make sure to ALWAYS
+Even if the maintainer did not respond in step #5, make sure to ALWAYS
 copy the maintainer when you change their code.
 For small patches you may want to CC the Trivial Patch Monkey
diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c
index aa73e72fd793..6e25c2659e0a 100644
--- a/Documentation/accounting/getdelays.c
+++ b/Documentation/accounting/getdelays.c
@@ -116,7 +116,7 @@ error:
 }


-int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
+static int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
              __u8 genl_cmd, __u16 nla_type,
              void *nla_data, int nla_len)
 {
@@ -160,7 +160,7 @@ int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
  * Probe the controller in genetlink to find the family id
  * for the TASKSTATS family
  */
-int get_family_id(int sd)
+static int get_family_id(int sd)
 {
         struct {
                 struct nlmsghdr n;
@@ -190,7 +190,7 @@ int get_family_id(int sd)
         return id;
 }

-void print_delayacct(struct taskstats *t)
+static void print_delayacct(struct taskstats *t)
 {
         printf("\n\nCPU %15s%15s%15s%15s\n"
                " %15llu%15llu%15llu%15llu\n"
@@ -216,7 +216,7 @@ void print_delayacct(struct taskstats *t)
                (unsigned long long)t->freepages_delay_total);
 }

-void task_context_switch_counts(struct taskstats *t)
+static void task_context_switch_counts(struct taskstats *t)
 {
         printf("\n\nTask %15s%15s\n"
                " %15llu%15llu\n",
@@ -224,7 +224,7 @@ void task_context_switch_counts(struct taskstats *t)
                (unsigned long long)t->nvcsw, (unsigned long long)t->nivcsw);
 }

-void print_cgroupstats(struct cgroupstats *c)
+static void print_cgroupstats(struct cgroupstats *c)
 {
         printf("sleeping %llu, blocked %llu, running %llu, stopped %llu, "
                 "uninterruptible %llu\n", (unsigned long long)c->nr_sleeping,
@@ -235,7 +235,7 @@ void print_cgroupstats(struct cgroupstats *c)
 }


-void print_ioacct(struct taskstats *t)
+static void print_ioacct(struct taskstats *t)
 {
         printf("%s: read=%llu, write=%llu, cancelled_write=%llu\n",
                 t->ac_comm,
diff --git a/Documentation/auxdisplay/cfag12864b-example.c b/Documentation/auxdisplay/cfag12864b-example.c
index 2caeea5e4993..1d2c010bae12 100644
--- a/Documentation/auxdisplay/cfag12864b-example.c
+++ b/Documentation/auxdisplay/cfag12864b-example.c
@@ -62,7 +62,7 @@ unsigned char cfag12864b_buffer[CFAG12864B_SIZE];
  * Unable to open: return = -1
  * Unable to mmap: return = -2
  */
-int cfag12864b_init(char *path)
+static int cfag12864b_init(char *path)
 {
         cfag12864b_fd = open(path, O_RDWR);
         if (cfag12864b_fd == -1)
@@ -81,7 +81,7 @@ int cfag12864b_init(char *path)
 /*
  * exit a cfag12864b framebuffer device
  */
-void cfag12864b_exit(void)
+static void cfag12864b_exit(void)
 {
         munmap(cfag12864b_mem, CFAG12864B_SIZE);
         close(cfag12864b_fd);
@@ -90,7 +90,7 @@ void cfag12864b_exit(void)
 /*
  * set (x, y) pixel
  */
-void cfag12864b_set(unsigned char x, unsigned char y)
+static void cfag12864b_set(unsigned char x, unsigned char y)
 {
         if (CFAG12864B_CHECK(x, y))
                 cfag12864b_buffer[CFAG12864B_ADDRESS(x, y)] |=
@@ -100,7 +100,7 @@ void cfag12864b_set(unsigned char x, unsigned char y)
 /*
  * unset (x, y) pixel
  */
-void cfag12864b_unset(unsigned char x, unsigned char y)
+static void cfag12864b_unset(unsigned char x, unsigned char y)
 {
         if (CFAG12864B_CHECK(x, y))
                 cfag12864b_buffer[CFAG12864B_ADDRESS(x, y)] &=
@@ -113,7 +113,7 @@ void cfag12864b_unset(unsigned char x, unsigned char y)
  * Pixel off: return = 0
  * Pixel on: return = 1
  */
-unsigned char cfag12864b_isset(unsigned char x, unsigned char y)
+static unsigned char cfag12864b_isset(unsigned char x, unsigned char y)
 {
         if (CFAG12864B_CHECK(x, y))
                 if (cfag12864b_buffer[CFAG12864B_ADDRESS(x, y)] &
@@ -126,7 +126,7 @@ unsigned char cfag12864b_isset(unsigned char x, unsigned char y)
 /*
  * not (x, y) pixel
  */
-void cfag12864b_not(unsigned char x, unsigned char y)
+static void cfag12864b_not(unsigned char x, unsigned char y)
 {
         if (cfag12864b_isset(x, y))
                 cfag12864b_unset(x, y);
@@ -137,7 +137,7 @@ void cfag12864b_not(unsigned char x, unsigned char y)
 /*
  * fill (set all pixels)
  */
-void cfag12864b_fill(void)
+static void cfag12864b_fill(void)
 {
         unsigned short i;

@@ -148,7 +148,7 @@ void cfag12864b_fill(void)
 /*
  * clear (unset all pixels)
  */
-void cfag12864b_clear(void)
+static void cfag12864b_clear(void)
 {
         unsigned short i;

@@ -162,7 +162,7 @@ void cfag12864b_clear(void)
  * Pixel off: src[i] = 0
  * Pixel on: src[i] > 0
  */
-void cfag12864b_format(unsigned char * matrix)
+static void cfag12864b_format(unsigned char * matrix)
 {
         unsigned char i, j, n;

@@ -182,7 +182,7 @@ void cfag12864b_format(unsigned char * matrix)
 /*
  * blit buffer to lcd
  */
-void cfag12864b_blit(void)
+static void cfag12864b_blit(void)
 {
         memcpy(cfag12864b_mem, cfag12864b_buffer, CFAG12864B_SIZE);
 }
@@ -198,7 +198,7 @@ void cfag12864b_blit(void)

 #define EXAMPLES 6

-void example(unsigned char n)
+static void example(unsigned char n)
 {
         unsigned short i, j;
         unsigned char matrix[CFAG12864B_WIDTH * CFAG12864B_HEIGHT];
diff --git a/Documentation/fb/ep93xx-fb.txt b/Documentation/fb/ep93xx-fb.txt
new file mode 100644
index 000000000000..5af1bd9effae
--- /dev/null
+++ b/Documentation/fb/ep93xx-fb.txt
@@ -0,0 +1,135 @@
+================================
+Driver for EP93xx LCD controller
+================================
+
+The EP93xx LCD controller can drive both standard desktop monitors and
+embedded LCD displays. If you have a standard desktop monitor then you
+can use the standard Linux video mode database. In your board file:
+
+        static struct ep93xxfb_mach_info some_board_fb_info = {
+                .num_modes = EP93XXFB_USE_MODEDB,
+                .bpp = 16,
+        };
+
+If you have an embedded LCD display then you need to define a video
+mode for it as follows:
+
+        static struct fb_videomode some_board_video_modes[] = {
+                {
+                        .name = "some_lcd_name",
+                        /* Pixel clock, porches, etc */
+                },
+        };
+
+Note that the pixel clock value is in pico-seconds. You can use the
+KHZ2PICOS macro to convert the pixel clock value. Most other values
+are in pixel clocks. See Documentation/fb/framebuffer.txt for further
+details.
+
+The ep93xxfb_mach_info structure for your board should look like the
+following:
+
+        static struct ep93xxfb_mach_info some_board_fb_info = {
+                .num_modes = ARRAY_SIZE(some_board_video_modes),
+                .modes = some_board_video_modes,
+                .default_mode = &some_board_video_modes[0],
+                .bpp = 16,
+        };
+
+The framebuffer device can be registered by adding the following to
+your board initialisation function:
+
+        ep93xx_register_fb(&some_board_fb_info);
+
+=====================
+Video Attribute Flags
+=====================
+
+The ep93xxfb_mach_info structure has a flags field which can be used
+to configure the controller. The video attributes flags are fully
+documented in section 7 of the EP93xx users' guide. The following
+flags are available:
+
+EP93XXFB_PCLK_FALLING       Clock data on the falling edge of the
+                            pixel clock. The default is to clock
+                            data on the rising edge.
+
+EP93XXFB_SYNC_BLANK_HIGH    Blank signal is active high. By
+                            default the blank signal is active low.
+
+EP93XXFB_SYNC_HORIZ_HIGH    Horizontal sync is active high. By
+                            default the horizontal sync is active low.
+
+EP93XXFB_SYNC_VERT_HIGH     Vertical sync is active high. By
+                            default the vertical sync is active high.
+
+The physical address of the framebuffer can be controlled using the
+following flags:
+
+EP93XXFB_USE_SDCSN0         Use SDCSn[0] for the framebuffer. This
+                            is the default setting.
+
+EP93XXFB_USE_SDCSN1         Use SDCSn[1] for the framebuffer.
+
+EP93XXFB_USE_SDCSN2         Use SDCSn[2] for the framebuffer.
+
+EP93XXFB_USE_SDCSN3         Use SDCSn[3] for the framebuffer.
+
+==================
+Platform callbacks
+==================
+
+The EP93xx framebuffer driver supports three optional platform
+callbacks: setup, teardown and blank. The setup and teardown functions
+are called when the framebuffer driver is installed and removed
+respectively. The blank function is called whenever the display is
+blanked or unblanked.
+
+The setup and teardown devices pass the platform_device structure as
+an argument. The fb_info and ep93xxfb_mach_info structures can be
+obtained as follows:
+
+        static int some_board_fb_setup(struct platform_device *pdev)
+        {
+                struct ep93xxfb_mach_info *mach_info = pdev->dev.platform_data;
+                struct fb_info *fb_info = platform_get_drvdata(pdev);
+
+                /* Board specific framebuffer setup */
+        }
+
+======================
+Setting the video mode
+======================
+
+The video mode is set using the following syntax:
+
+        video=XRESxYRES[-BPP][@REFRESH]
+
+If the EP93xx video driver is built-in then the video mode is set on
+the Linux kernel command line, for example:
+
+        video=ep93xx-fb:800x600-16@60
+
+If the EP93xx video driver is built as a module then the video mode is
+set when the module is installed:
+
+        modprobe ep93xx-fb video=320x240
+
+==============
+Screenpage bug
+==============
+
+At least on the EP9315 there is a silicon bug which causes bit 27 of
+the VIDSCRNPAGE (framebuffer physical offset) to be tied low. There is
+an unofficial errata for this bug at:
+        http://marc.info/?l=linux-arm-kernel&m=110061245502000&w=2
+
+By default the EP93xx framebuffer driver checks if the allocated physical
+address has bit 27 set. If it does, then the memory is freed and an
+error is returned. The check can be disabled by adding the following
+option when loading the driver:
+
+        ep93xx-fb.check_screenpage_bug=0
+
+In some cases it may be possible to reconfigure your SDRAM layout to
+avoid this bug. See section 13 of the EP93xx users' guide for details.
diff --git a/Documentation/fb/matroxfb.txt b/Documentation/fb/matroxfb.txt
index ad7a67707d62..e5ce8a1a978b 100644
--- a/Documentation/fb/matroxfb.txt
+++ b/Documentation/fb/matroxfb.txt
@@ -186,9 +186,7 @@ noinverse - show true colors on screen. It is default.
 dev:X - bind driver to device X. Driver numbers device from 0 up to N,
         where device 0 is first `known' device found, 1 second and so on.
         lspci lists devices in this order.
-        Default is `every' known device for driver with multihead support
-        and first working device (usually dev:0) for driver without
-        multihead support.
+        Default is `every' known device.
 nohwcursor - disables hardware cursor (use software cursor instead).
 hwcursor - enables hardware cursor. It is default. If you are using
         non-accelerated mode (`noaccel' or `fbset -accel false'), software
diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt
index f12c30c93f2f..5af164f4b37b 100644
--- a/Documentation/filesystems/ncpfs.txt
+++ b/Documentation/filesystems/ncpfs.txt
@@ -7,6 +7,6 @@ ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
 will have it as well.

 Related products are linware and mars_nwe, which will give Linux partial
-NetWare server functionality. Linware's home site is
-klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
-ftp.gwdg.de/pub/linux/misc/ncpfs.
+NetWare server functionality.
+
+mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs.
diff --git a/Documentation/filesystems/nfs41-server.txt b/Documentation/filesystems/nfs41-server.txt
index 05d81cbcb2e1..5920fe26e6ff 100644
--- a/Documentation/filesystems/nfs41-server.txt
+++ b/Documentation/filesystems/nfs41-server.txt
@@ -11,6 +11,11 @@ the /proc/fs/nfsd/versions control file. Note that to write this
 control file, the nfsd service must be taken down. Use your user-mode
 nfs-utils to set this up; see rpc.nfsd(8)

+(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
+"-4", respectively. Therefore, code meant to work on both new and old
+kernels must turn 4.1 on or off *before* turning support for version 4
+on or off; rpc.nfsd does this correctly.)
+
 The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
 on the latest NFSv4.1 Internet Draft:
 http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
@@ -25,6 +30,49 @@ are still under development out of tree.
 See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
 for more information.

+The current implementation is intended for developers only: while it
+does support ordinary file operations on clients we have tested against
+(including the linux client), it is incomplete in ways which may limit
+features unexpectedly, cause known bugs in rare cases, or cause
+interoperability problems with future clients. Known issues:
+
+        - gss support is questionable: currently mounts with kerberos
+          from a linux client are possible, but we aren't really
+          conformant with the spec (for example, we don't use kerberos
+          on the backchannel correctly).
+        - no trunking support: no clients currently take advantage of
+          trunking, but this is a mandatory failure, and its use is
+          recommended to clients in a number of places. (E.g. to ensure
+          timely renewal in case an existing connection's retry timeouts
+          have gotten too long; see section 8.3 of the draft.)
+          Therefore, lack of this feature may cause future clients to
+          fail.
+        - Incomplete backchannel support: incomplete backchannel gss
+          support and no support for BACKCHANNEL_CTL mean that
+          callbacks (hence delegations and layouts) may not be
+          available and clients confused by the incomplete
+          implementation may fail.
+        - Server reboot recovery is unsupported; if the server reboots,
+          clients may fail.
+        - We do not support SSV, which provides security for shared
+          client-server state (thus preventing unauthorized tampering
+          with locks and opens, for example). It is mandatory for
+          servers to support this, though no clients use it yet.
+        - Mandatory operations which we do not support, such as
+          DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and
+          TEST_STATEID, are not currently used by clients, but will be
+          (and the spec recommends their uses in common cases), and
+          clients should not be expected to know how to recover from the
+          case where they are not supported. This will eventually cause
+          interoperability failures.
+
+In addition, some limitations are inherited from the current NFSv4
+implementation:
+
+        - Incomplete delegation enforcement: if a file is renamed or
+          unlinked, a client holding a delegation may continue to
+          indefinitely allow opens of the file under the old name.
+
 The table below, taken from the NFSv4.1 document, lists
 the operations that are mandatory to implement (REQ), optional
 (OPT), and NFSv4.0 operations that are required not to implement (MNI)
@@ -142,6 +190,12 @@ NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |

 Implementation notes:

+DELEGPURGE:
+* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
+  CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
+  persist across client reboots). Thus we need not implement this for
+  now.
+
 EXCHANGE_ID:
 * only SP4_NONE state protection supported
 * implementation ids are ignored
diff --git a/Documentation/filesystems/nfsroot.txt b/Documentation/filesystems/nfsroot.txt
index 68baddf3c3e0..3ba0b945aaf8 100644
--- a/Documentation/filesystems/nfsroot.txt
+++ b/Documentation/filesystems/nfsroot.txt
@@ -105,7 +105,7 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
         the client address and this parameter is NOT empty only
         replies from the specified server are accepted.

-        Only required for for NFS root. That is autoconfiguration
+        Only required for NFS root. That is autoconfiguration
         will not be triggered if it is missing and NFS root is not
         in operation.

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index ffead13f9443..b5aee7838a00 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -176,6 +176,7 @@ read the file /proc/PID/status:
   CapBnd: ffffffffffffffff
   voluntary_ctxt_switches: 0
   nonvoluntary_ctxt_switches: 1
+  Stack usage: 12 kB

 This shows you nearly the same information you would get if you viewed it with
 the ps command. In fact, ps uses the proc file system to obtain its
@@ -229,6 +230,7 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7)
  Mems_allowed_list Same as previous, but in "list format"
  voluntary_ctxt_switches number of voluntary context switches
  nonvoluntary_ctxt_switches number of non voluntary context switches
+ Stack usage: stack usage high water mark (round up to page size)
 ..............................................................................

 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
@@ -307,7 +309,7 @@ address perms offset dev inode pathname
 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test
 0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
 a7cb1000-a7cb2000 ---p 00000000 00:00 0
-a7cb2000-a7eb2000 rw-p 00000000 00:00 0
+a7cb2000-a7eb2000 rw-p 00000000 00:00 0 [threadstack:001ff4b4]
 a7eb2000-a7eb3000 ---p 00000000 00:00 0
 a7eb3000-a7ed5000 rw-p 00000000 00:00 0
 a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6
@@ -343,6 +345,7 @@ is not associated with a file:
  [stack] = the stack of the main process
  [vdso] = the "virtual dynamic shared object",
           the kernel system call handler
+ [threadstack:xxxxxxxx] = the stack of the thread, xxxxxxxx is the stack size

  or if empty, the mapping is anonymous.

@@ -375,6 +378,19 @@ of memory currently marked as referenced or accessed.
 This file is only present if the CONFIG_MMU kernel configuration option is
 enabled.

+The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
+bits on both physical and virtual pages associated with a process.
+To clear the bits for all the pages associated with the process
+    > echo 1 > /proc/PID/clear_refs
+
+To clear the bits for the anonymous pages associated with the process
+    > echo 2 > /proc/PID/clear_refs
+
+To clear the bits for the file mapped pages associated with the process
+    > echo 3 > /proc/PID/clear_refs
+Any other value written to /proc/PID/clear_refs will have no effect.
+
+
 1.2 Kernel data
 ---------------

@@ -1032,9 +1048,9 @@ Various pieces of information about kernel activity are available in the
 since the system first booted. For a quick look, simply cat the file:

   > cat /proc/stat
-  cpu 2255 34 2290 22625563 6290 127 456 0
-  cpu0 1132 34 1441 11311718 3675 127 438 0
-  cpu1 1123 0 849 11313845 2614 0 18 0
+  cpu 2255 34 2290 22625563 6290 127 456 0 0
+  cpu0 1132 34 1441 11311718 3675 127 438 0 0
+  cpu1 1123 0 849 11313845 2614 0 18 0 0
   intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
   ctxt 1990473
   btime 1062191376
@@ -1056,6 +1072,7 @@ second). The meanings of the columns are as follows, from left to right:
 - irq: servicing interrupts
 - softirq: servicing softirqs
 - steal: involuntary wait
+- guest: running a guest

 The "intr" line gives counts of interrupts serviced since boot time, for each
 of the possible system interrupts. The first column is the total of all
@@ -1191,7 +1208,7 @@ The following heuristics are then applied:
  * if the task was reniced, its score doubles
  * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
    or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked task does not belong
+ * if oom condition happened in one cpuset and checked process does not belong
    to it, its score is divided by 8
  * the resulting score is multiplied by two to the power of oom_adj, i.e.
    points <<= oom_adj when it is positive and
diff --git a/Documentation/gcov.txt b/Documentation/gcov.txt
index 40ec63352760..e7ca6478cd93 100644
--- a/Documentation/gcov.txt
+++ b/Documentation/gcov.txt
@@ -47,7 +47,7 @@ Possible uses:

 Configure the kernel with:

-        CONFIG_DEBUGFS=y
+        CONFIG_DEBUG_FS=y
         CONFIG_GCOV_KERNEL=y

 and to get coverage data for the entire kernel:
diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt
index e4b6985044a2..fa4dc077ae0e 100644
--- a/Documentation/gpio.txt
+++ b/Documentation/gpio.txt
@@ -524,6 +524,13 @@ and have the following read/write attributes:
                 is configured as an output, this value may be written;
                 any nonzero value is treated as high.

+        "edge" ... reads as either "none", "rising", "falling", or
+                "both". Write these strings to select the signal edge(s)
+                that will make poll(2) on the "value" file return.
+
+                This file exists only if the pin can be configured as an
+                interrupt generating input pin.
+
 GPIO controllers have paths like /sys/class/gpio/chipchip42/ (for the
 controller implementing GPIOs starting at #42) and have the following
 read-only attributes:
@@ -555,6 +562,11 @@ requested using gpio_request():
         /* reverse gpio_export() */
         void gpio_unexport();

+        /* create a sysfs link to an exported GPIO node */
+        int gpio_export_link(struct device *dev, const char *name,
+                unsigned gpio)
+
+
 After a kernel driver requests a GPIO, it may only be made available in
 the sysfs interface by gpio_export(). The driver can control whether the
 signal direction may change. This helps drivers prevent userspace code
@@ -563,3 +575,8 @@ from accidentally clobbering important system state.
 This explicit exporting can help with debugging (by making some kinds
 of experiments easier), or can provide an always-there interface that's
 suitable for documenting as part of a board support package.
+
+After the GPIO has been exported, gpio_export_link() allows creating
+symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can
+use this to provide the interface under their own device in sysfs with
+a descriptive name.
diff --git a/Documentation/hwmon/acpi_power_meter b/Documentation/hwmon/acpi_power_meter
new file mode 100644
index 000000000000..c80399a00c50
--- /dev/null
+++ b/Documentation/hwmon/acpi_power_meter
@@ -0,0 +1,51 @@
+Kernel driver power_meter
+=========================
+
+This driver talks to ACPI 4.0 power meters.
+
+Supported systems:
+  * Any recent system with ACPI 4.0.
+    Prefix: 'power_meter'
+    Datasheet: http://acpi.info/, section 10.4.
+
+Author: Darrick J. Wong
+
+Description
+-----------
+
+This driver implements sensor reading support for the power meters exposed in
+the ACPI 4.0 spec (Chapter 10.4). These devices have a simple set of
+features--a power meter that returns average power use over a configurable
+interval, an optional capping mechanism, and a couple of trip points. The
+sysfs interface conforms with the specification outlined in the "Power" section
+of Documentation/hwmon/sysfs-interface.
+
+Special Features
+----------------
+
+The power[1-*]_is_battery knob indicates if the power supply is a battery.
+Both power[1-*]_average_{min,max} must be set before the trip points will work.
+When both of them are set, an ACPI event will be broadcast on the ACPI netlink
+socket and a poll notification will be sent to the appropriate
+power[1-*]_average sysfs file.
+
+The power[1-*]_{model_number, serial_number, oem_info} fields display arbitrary
+strings that ACPI provides with the meter. The measures/ directory contains
+symlinks to the devices that this meter measures.
+
+Some computers have the ability to enforce a power cap in hardware. If this is
+the case, the power[1-*]_cap and related sysfs files will appear. When the
+average power consumption exceeds the cap, an ACPI event will be broadcast on
+the netlink event socket and a poll notification will be sent to the
+appropriate power[1-*]_alarm file to indicate that capping has begun, and the
+hardware has taken action to reduce power consumption. Most likely this will
+result in reduced performance.
+
+There are a few other ACPI notifications that can be sent by the firmware. In
+all cases the ACPI event will be broadcast on the ACPI netlink event socket as
+well as sent as a poll notification to a sysfs file. The events are as
+follows:
+
+power[1-*]_cap will be notified if the firmware changes the power cap.
+power[1-*]_interval will be notified if the firmware changes the averaging
+interval.
diff --git a/Documentation/hwmon/hpfall.c b/Documentation/hwmon/hpfall.c
index bbea1ccfd46a..681ec22b9d0e 100644
--- a/Documentation/hwmon/hpfall.c
+++ b/Documentation/hwmon/hpfall.c
@@ -16,6 +16,34 @@
 #include <stdint.h>
 #include <errno.h>
 #include <signal.h>
+#include <sys/mman.h>
+#include <sched.h>
+
+char unload_heads_path[64];
+
+int set_unload_heads_path(char *device)
+{
+        char devname[64];
+
+        if (strlen(device) <= 5 || strncmp(device, "/dev/", 5) != 0)
+                return -EINVAL;
+        strncpy(devname, device + 5, sizeof(devname));
+
+        snprintf(unload_heads_path, sizeof(unload_heads_path),
+                 "/sys/block/%s/device/unload_heads", devname);
+        return 0;
+}
+int valid_disk(void)
+{
+        int fd = open(unload_heads_path, O_RDONLY);
+        if (fd < 0) {
+                perror(unload_heads_path);
+                return 0;
+        }
+
+        close(fd);
+        return 1;
+}

 void write_int(char *path, int i)
 {
@@ -40,7 +68,7 @@ void set_led(int on)

 void protect(int seconds)
 {
-        write_int("/sys/block/sda/device/unload_heads", seconds*1000);
+        write_int(unload_heads_path, seconds*1000);
 }

 int on_ac(void)
@@ -57,45 +85,62 @@ void ignore_me(void)
 {
         protect(0);
         set_led(0);
-
 }

-int main(int argc, char* argv[])
+int main(int argc, char **argv)
 {
-        int fd, ret;
+        int fd, ret;
+        struct sched_param param;
+ + if (argc == 1) + ret = set_unload_heads_path("/dev/sda"); + else if (argc == 2) + ret = set_unload_heads_path(argv[1]); + else + ret = -EINVAL; + + if (ret || !valid_disk()) { + fprintf(stderr, "usage: %s <device> (default: /dev/sda)\n", + argv[0]); + exit(1); + } + + fd = open("/dev/freefall", O_RDONLY); + if (fd < 0) { + perror("/dev/freefall"); + return EXIT_FAILURE; + } - fd = open("/dev/freefall", O_RDONLY); - if (fd < 0) { - perror("open"); - return EXIT_FAILURE; - } + daemon(0, 0); + param.sched_priority = sched_get_priority_max(SCHED_FIFO); + sched_setscheduler(0, SCHED_FIFO, &param); + mlockall(MCL_CURRENT|MCL_FUTURE); signal(SIGALRM, ignore_me); - for (;;) { - unsigned char count; - - ret = read(fd, &count, sizeof(count)); - alarm(0); - if ((ret == -1) && (errno == EINTR)) { - /* Alarm expired, time to unpark the heads */ - continue; - } - - if (ret != sizeof(count)) { - perror("read"); - break; - } - - protect(21); - set_led(1); - if (1 || on_ac() || lid_open()) { - alarm(2); - } else { - alarm(20); - } - } - - close(fd); - return EXIT_SUCCESS; + for (;;) { + unsigned char count; + + ret = read(fd, &count, sizeof(count)); + alarm(0); + if ((ret == -1) && (errno == EINTR)) { + /* Alarm expired, time to unpark the heads */ + continue; + } + + if (ret != sizeof(count)) { + perror("read"); + break; + } + + protect(21); + set_led(1); + if (1 || on_ac() || lid_open()) + alarm(2); + else + alarm(20); + } + + close(fd); + return EXIT_SUCCESS; } diff --git a/Documentation/hwmon/pc87427 b/Documentation/hwmon/pc87427 index d1ebbe510f35..db5cc1227a83 100644 --- a/Documentation/hwmon/pc87427 +++ b/Documentation/hwmon/pc87427 @@ -34,5 +34,5 @@ Fan rotation speeds are reported as 14-bit values from a gated clock signal. Speeds down to 83 RPM can be measured. An alarm is triggered if the rotation speed drops below a programmable -limit. Another alarm is triggered if the speed is too low to to be measured +limit.
Another alarm is triggered if the speed is too low to be measured (including stalled or missing fan). diff --git a/Documentation/i2c/busses/i2c-piix4 b/Documentation/i2c/busses/i2c-piix4 index f889481762b5..c5b37c570554 100644 --- a/Documentation/i2c/busses/i2c-piix4 +++ b/Documentation/i2c/busses/i2c-piix4 @@ -8,6 +8,8 @@ Supported adapters: Datasheet: Only available via NDA from ServerWorks * ATI IXP200, IXP300, IXP400, SB600, SB700 and SB800 southbridges Datasheet: Not publicly available + * AMD SB900 + Datasheet: Not publicly available * Standard Microsystems (SMSC) SLC90E66 (Victory66) southbridge Datasheet: Publicly available at the SMSC website http://www.smsc.com diff --git a/Documentation/i2c/chips/pca9539 b/Documentation/i2c/chips/pca9539 deleted file mode 100644 index 6aff890088b1..000000000000 --- a/Documentation/i2c/chips/pca9539 +++ /dev/null @@ -1,58 +0,0 @@ -Kernel driver pca9539 -===================== - -NOTE: this driver is deprecated and will be dropped soon, use -drivers/gpio/pca9539.c instead. - -Supported chips: - * Philips PCA9539 - Prefix: 'pca9539' - Addresses scanned: none - Datasheet: - http://www.semiconductors.philips.com/acrobat/datasheets/PCA9539_2.pdf - -Author: Ben Gardner <bgardner@wabtec.com> - - -Description ------------ - -The Philips PCA9539 is a 16 bit low power I/O device. -All 16 lines can be individually configured as an input or output. -The input sense can also be inverted. -The 16 lines are split between two bytes. - - -Detection ---------- - -The PCA9539 is difficult to detect and not commonly found in PC machines, -so you have to pass the I2C bus and address of the installed PCA9539 -devices explicitly to the driver at load time via the force=... parameter. - - -Sysfs entries -------------- - -Each is a byte that maps to the 8 I/O bits. -A '0' suffix is for bits 0-7, while '1' is for bits 8-15. 
- -input[01] - read the current value -output[01] - sets the output value -direction[01] - direction of each bit: 1=input, 0=output -invert[01] - toggle the input bit sense - -input reads the actual state of the line and is always available. -The direction defaults to input for all channels. - - -General Remarks ---------------- - -Note that each output, direction, and invert entry controls 8 lines. -You should use the read, modify, write sequence. -For example. to set output bit 0 of 1. - val=$(cat output0) - val=$(( $val | 1 )) - echo $val > output0 - diff --git a/Documentation/i2c/chips/pcf8574 b/Documentation/i2c/chips/pcf8574 deleted file mode 100644 index 235815c075ff..000000000000 --- a/Documentation/i2c/chips/pcf8574 +++ /dev/null @@ -1,65 +0,0 @@ -Kernel driver pcf8574 -===================== - -Supported chips: - * Philips PCF8574 - Prefix: 'pcf8574' - Addresses scanned: none - Datasheet: Publicly available at the Philips Semiconductors website - http://www.semiconductors.philips.com/pip/PCF8574P.html - - * Philips PCF8574A - Prefix: 'pcf8574a' - Addresses scanned: none - Datasheet: Publicly available at the Philips Semiconductors website - http://www.semiconductors.philips.com/pip/PCF8574P.html - -Authors: - Frodo Looijaard <frodol@dds.nl>, - Philip Edelbrock <phil@netroedge.com>, - Dan Eaton <dan.eaton@rocketlogix.com>, - Aurelien Jarno <aurelien@aurel32.net>, - Jean Delvare <khali@linux-fr.org>, - - -Description ------------ -The PCF8574(A) is an 8-bit I/O expander for the I2C bus produced by Philips -Semiconductors. It is designed to provide a byte I2C interface to up to 16 -separate devices (8 x PCF8574 and 8 x PCF8574A). - -This device consists of a quasi-bidirectional port. Each of the eight I/Os -can be independently used as an input or output. To setup an I/O as an -input, you have to write a 1 to the corresponding output. - -For more informations see the datasheet. 
- - -Accessing PCF8574(A) via /sys interface -------------------------------------- - -The PCF8574(A) is plainly impossible to detect ! Stupid chip. -So, you have to pass the I2C bus and address of the installed PCF857A -and PCF8574A devices explicitly to the driver at load time via the -force=... parameter. - -On detection (i.e. insmod, modprobe et al.), directories are being -created for each detected PCF8574(A): - -/sys/bus/i2c/devices/<0>-<1>/ -where <0> is the bus the chip was detected on (e. g. i2c-0) -and <1> the chip address ([20..27] or [38..3f]): - -(example: /sys/bus/i2c/devices/1-0020/) - -Inside these directories, there are two files each: -read and write (and one file with chip name). - -The read file is read-only. Reading gives you the current I/O input -if the corresponding output is set as 1, otherwise the current output -value, that is to say 0. - -The write file is read/write. Writing a value outputs it on the I/O -port. Reading returns the last written value. As it is not possible -to read this value from the chip, you need to write at least once to -this file before you can read back from it. diff --git a/Documentation/i2c/chips/pcf8575 b/Documentation/i2c/chips/pcf8575 deleted file mode 100644 index 40b268eb276f..000000000000 --- a/Documentation/i2c/chips/pcf8575 +++ /dev/null @@ -1,69 +0,0 @@ -About the PCF8575 chip and the pcf8575 kernel driver -==================================================== - -The PCF8575 chip is produced by the following manufacturers: - - * Philips NXP - http://www.nxp.com/#/pip/cb=[type=product,path=50807/41735/41850,final=PCF8575_3]|pip=[pip=PCF8575_3][0] - - * Texas Instruments - http://focus.ti.com/docs/prod/folders/print/pcf8575.html - - -Some vendors sell small PCB's with the PCF8575 mounted on it. You can connect -such a board to a Linux host via e.g. an USB to I2C interface. 
Examples of -PCB boards with a PCF8575: - - * SFE Breakout Board for PCF8575 I2C Expander by RobotShop - http://www.robotshop.ca/home/products/robot-parts/electronics/adapters-converters/sfe-pcf8575-i2c-expander-board.html - - * Breakout Board for PCF8575 I2C Expander by Spark Fun Electronics - http://www.sparkfun.com/commerce/product_info.php?products_id=8130 - - -Description ------------ -The PCF8575 chip is a 16-bit I/O expander for the I2C bus. Up to eight of -these chips can be connected to the same I2C bus. You can find this -chip on some custom designed hardware, but you won't find it on PC -motherboards. - -The PCF8575 chip consists of a 16-bit quasi-bidirectional port and an I2C-bus -interface. Each of the sixteen I/O's can be independently used as an input or -an output. To set up an I/O pin as an input, you have to write a 1 to the -corresponding output. - -For more information please see the datasheet. - - -Detection ---------- - -There is no method known to detect whether a chip on a given I2C address is -a PCF8575 or whether it is any other I2C device, so you have to pass the I2C -bus and address of the installed PCF8575 devices explicitly to the driver at -load time via the force=... parameter. - -/sys interface --------------- - -For each address on which a PCF8575 chip was found or forced the following -files will be created under /sys: -* /sys/bus/i2c/devices/<bus>-<address>/read -* /sys/bus/i2c/devices/<bus>-<address>/write -where bus is the I2C bus number (0, 1, ...) and address is the four-digit -hexadecimal representation of the 7-bit I2C address of the PCF8575 -(0020 .. 0027). - -The read file is read-only. Reading it will trigger an I2C read and will hence -report the current input state for the pins configured as inputs, and the -current output value for the pins configured as outputs. - -The write file is read-write. Writing a value to it will configure all pins -as output for which the corresponding bit is zero. 
Reading the write file will -return the value last written, or -EAGAIN if no value has yet been written to -the write file. - -On module initialization the configuration of the chip is not changed -- the -chip is left in the state it was already configured in through either power-up -or through previous I2C write actions. diff --git a/Documentation/ia64/aliasing-test.c b/Documentation/ia64/aliasing-test.c index d23610fb2ff9..3dfb76ca6931 100644 --- a/Documentation/ia64/aliasing-test.c +++ b/Documentation/ia64/aliasing-test.c @@ -24,7 +24,7 @@ int sum; -int map_mem(char *path, off_t offset, size_t length, int touch) +static int map_mem(char *path, off_t offset, size_t length, int touch) { int fd, rc; void *addr; @@ -62,7 +62,7 @@ int map_mem(char *path, off_t offset, size_t length, int touch) return 0; } -int scan_tree(char *path, char *file, off_t offset, size_t length, int touch) +static int scan_tree(char *path, char *file, off_t offset, size_t length, int touch) { struct dirent **namelist; char *name, *path2; @@ -119,7 +119,7 @@ skip: char buf[1024]; -int read_rom(char *path) +static int read_rom(char *path) { int fd, rc; size_t size = 0; @@ -146,7 +146,7 @@ int read_rom(char *path) return size; } -int scan_rom(char *path, char *file) +static int scan_rom(char *path, char *file) { struct dirent **namelist; char *name, *path2; diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 0f17d16dc101..6fa7292947e5 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -671,7 +671,7 @@ and is between 256 and 4096 characters. It is defined in the file earlyprintk= [X86,SH,BLACKFIN] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] - earlyprintk=dbgp + earlyprintk=dbgp[debugController#] Append ",keep" to not disable it when the real console takes over. @@ -933,7 +933,7 @@ and is between 256 and 4096 characters. It is defined in the file 1 -- enable informational integrity auditing messages. 
ima_hash= [IMA] - Formt: { "sha1" | "md5" } + Format: { "sha1" | "md5" } default: "sha1" ima_tcb [IMA] diff --git a/Documentation/kmemcheck.txt b/Documentation/kmemcheck.txt index 363044609dad..c28f82895d6b 100644 --- a/Documentation/kmemcheck.txt +++ b/Documentation/kmemcheck.txt @@ -43,26 +43,7 @@ feature. 1. Downloading ============== -kmemcheck can only be downloaded using git. If you want to write patches -against the current code, you should use the kmemcheck development branch of -the tip tree. It is also possible to use the linux-next tree, which also -includes the latest version of kmemcheck. - -Assuming that you've already cloned the linux-2.6.git repository, all you -have to do is add the -tip tree as a remote, like this: - - $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git - -To actually download the tree, fetch the remote: - - $ git fetch tip - -And to check out a new local branch with the kmemcheck code: - - $ git checkout -b kmemcheck tip/kmemcheck - -General instructions for the -tip tree can be found here: -http://people.redhat.com/mingo/tip.git/readme.txt +As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. 2. Configuring and compiling diff --git a/Documentation/laptops/asus-laptop.txt b/Documentation/laptops/asus-laptop.txt new file mode 100644 index 000000000000..c1c5be84e4b1 --- /dev/null +++ b/Documentation/laptops/asus-laptop.txt @@ -0,0 +1,258 @@ +Asus Laptop Extras + +Version 0.1 +August 6, 2009 + +Corentin Chary <corentincj@iksaif.net> +http://acpi4asus.sf.net/ + + This driver provides support for extra features of ACPI-compatible ASUS laptops. + It may also support some MEDION, JVC or VICTOR laptops (such as MEDION 9675 or + VICTOR XP7210 for example). It makes all the extra buttons generate standard + ACPI events that go through /proc/acpi/events and input events (like keyboards). 
+ On some models adds support for changing the display brightness and output, + switching the LCD backlight on and off, and most importantly, allows you to + blink those fancy LEDs intended for reporting mail and wireless status. + +This driver supercedes the old asus_acpi driver. + +Requirements +------------ + + Kernel 2.6.X sources, configured for your computer, with ACPI support. + You also need CONFIG_INPUT and CONFIG_ACPI. + +Status +------ + + The features currently supported are the following (see below for + detailed description): + + - Fn key combinations + - Bluetooth enable and disable + - Wlan enable and disable + - GPS enable and disable + - Video output switching + - Ambient Light Sensor on and off + - LED control + - LED Display control + - LCD brightness control + - LCD on and off + + A compatibility table by model and feature is maintained on the web + site, http://acpi4asus.sf.net/. + +Usage +----- + + Try "modprobe asus_acpi". Check your dmesg (simply type dmesg). You should + see some lines like this : + + Asus Laptop Extras version 0.42 + L2D model detected. + + If it is not the output you have on your laptop, send it (and the laptop's + DSDT) to me. + + That's all, now, all the events generated by the hotkeys of your laptop + should be reported in your /proc/acpi/event entry. You can check with + "acpi_listen". + + Hotkeys are also reported as input keys (like keyboards) you can check + which key are supported using "xev" under X11. + + You can get informations on the version of your DSDT table by reading the + /sys/devices/platform/asus-laptop/infos entry. If you have a question or a + bug report to do, please include the output of this entry. + +LEDs +---- + + You can modify LEDs be echoing values to /sys/class/leds/asus::*/brightness : + echo 1 > /sys/class/leds/asus::mail/brightness + will switch the mail LED on. + You can also know if they are on/off by reading their content and use + kernel triggers like ide-disk or heartbeat. 
+ +Backlight +--------- + + You can control lcd backlight power and brightness with + /sys/class/backlight/asus-laptop/. Brightness Values are between 0 and 15. + +Wireless devices +--------------- + + You can turn the internal Bluetooth adapter on/off with the bluetooth entry + (only on models with Bluetooth). This usually controls the associated LED. + Same for Wlan adapter. + +Display switching +----------------- + + Note: the display switching code is currently considered EXPERIMENTAL. + + Switching works for the following models: + L3800C + A2500H + L5800C + M5200N + W1000N (albeit with some glitches) + M6700R + A6JC + F3J + + Switching doesn't work for the following: + M3700N + L2X00D (locks the laptop under certain conditions) + + To switch the displays, echo values from 0 to 15 to + /sys/devices/platform/asus-laptop/display. The significance of those values + is as follows: + + +-------+-----+-----+-----+-----+-----+ + | Bin | Val | DVI | TV | CRT | LCD | + +-------+-----+-----+-----+-----+-----+ + + 0000 + 0 + + + + + + +-------+-----+-----+-----+-----+-----+ + + 0001 + 1 + + + + X + + +-------+-----+-----+-----+-----+-----+ + + 0010 + 2 + + + X + + + +-------+-----+-----+-----+-----+-----+ + + 0011 + 3 + + + X + X + + +-------+-----+-----+-----+-----+-----+ + + 0100 + 4 + + X + + + + +-------+-----+-----+-----+-----+-----+ + + 0101 + 5 + + X + + X + + +-------+-----+-----+-----+-----+-----+ + + 0110 + 6 + + X + X + + + +-------+-----+-----+-----+-----+-----+ + + 0111 + 7 + + X + X + X + + +-------+-----+-----+-----+-----+-----+ + + 1000 + 8 + X + + + + + +-------+-----+-----+-----+-----+-----+ + + 1001 + 9 + X + + + X + + +-------+-----+-----+-----+-----+-----+ + + 1010 + 10 + X + + X + + + +-------+-----+-----+-----+-----+-----+ + + 1011 + 11 + X + + X + X + + +-------+-----+-----+-----+-----+-----+ + + 1100 + 12 + X + X + + + + +-------+-----+-----+-----+-----+-----+ + + 1101 + 13 + X + X + + X + + +-------+-----+-----+-----+-----+-----+ + + 1110 + 14 + 
X + X + X + + + +-------+-----+-----+-----+-----+-----+ + + 1111 + 15 + X + X + X + X + + +-------+-----+-----+-----+-----+-----+ + + In most cases, the appropriate displays must be plugged in for the above + combinations to work. TV-Out may need to be initialized at boot time. + + Debugging: + 1) Check whether the Fn+F8 key: + a) does not lock the laptop (try disabling CONFIG_X86_UP_APIC or boot with + noapic / nolapic if it does) + b) generates events (0x6n, where n is the value corresponding to the + configuration above) + c) actually works + Record the disp value at every configuration. + 2) Echo values from 0 to 15 to /sys/devices/platform/asus-laptop/display. + Record its value, note any change. If nothing changes, try a broader range, + up to 65535. + 3) Send ANY output (both positive and negative reports are needed, unless your + machine is already listed above) to the acpi4asus-user mailing list. + + Note: on some machines (e.g. L3C), after the module has been loaded, only 0x6n + events are generated and no actual switching occurs. In such a case, a line + like: + + echo $((10#$arg-60)) > /sys/devices/platform/asus-laptop/display + + will usually do the trick ($arg is the 0000006n-like event passed to acpid). + + Note: there is currently no reliable way to read display status on xxN + (Centrino) models. + +LED display +----------- + + Some models like the W1N have a LED display that can be used to display + several informations. + + LED display works for the following models: + W1000N + W1J + + To control the LED display, use the following : + + echo 0x0T000DDD > /sys/devices/platform/asus-laptop/ + + where T control the 3 letters display, and DDD the 3 digits display, + according to the tables below. 
+ + DDD (digits) + 000 to 999 = display digits + AAA = --- + BBB to FFF = turn-off + + T (type) + 0 = off + 1 = dvd + 2 = vcd + 3 = mp3 + 4 = cd + 5 = tv + 6 = cpu + 7 = vol + + For example "echo 0x01000001 >/sys/devices/platform/asus-laptop/ledd" + would display "DVD001". + +Driver options: +--------------- + + Options can be passed to the asus-laptop driver using the standard + module argument syntax (<param>=<value> when passing the option to the + module or asus-laptop.<param>=<value> on the kernel boot line when + asus-laptop is statically linked into the kernel). + + wapf: WAPF defines the behavior of the Fn+Fx wlan key + The significance of values is yet to be found, but + most of the time: + - 0x0 should do nothing + - 0x1 should allow to control the device with Fn+Fx key. + - 0x4 should send an ACPI event (0x88) while pressing the Fn+Fx key + - 0x5 like 0x1 or 0x4 + + The default value is 0x1. + +Unsupported models +------------------ + + These models will never be supported by this module, as they use a completely + different mechanism to handle LEDs and extra stuff (meaning we have no clue + how it works): + + - ASUS A1300 (A1B), A1370D + - ASUS L7300G + - ASUS L8400 + +Patches, Errors, Questions: +-------------------------- + + I appreciate any success or failure + reports, especially if they add to or correct the compatibility table. + Please include the following information in your report: + + - Asus model name + - a copy of your ACPI tables, using the "acpidump" utility + - a copy of /sys/devices/platform/asus-laptop/infos + - which driver features work and which don't + - the observed behavior of non-working features + + Any other comments or patches are also more than welcome. 
+ + acpi4asus-user@lists.sourceforge.net + http://sourceforge.net/projects/acpi4asus + diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index e2ddcdeb61b6..6d03487ef1c7 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -219,7 +219,7 @@ The following commands can be written to the /proc/acpi/ibm/hotkey file: echo 0xffffffff > /proc/acpi/ibm/hotkey -- enable all hot keys echo 0 > /proc/acpi/ibm/hotkey -- disable all possible hot keys ... any other 8-hex-digit mask ... - echo reset > /proc/acpi/ibm/hotkey -- restore the original mask + echo reset > /proc/acpi/ibm/hotkey -- restore the recommended mask The following commands have been deprecated and will cause the kernel to log a warning: @@ -240,9 +240,13 @@ sysfs notes: Returns 0. hotkey_bios_mask: + DEPRECATED, DON'T USE, WILL BE REMOVED IN THE FUTURE. + Returns the hot keys mask when thinkpad-acpi was loaded. Upon module unload, the hot keys mask will be restored - to this value. + to this value. This is always 0x80c, because those are + the hotkeys that were supported by ancient firmware + without mask support. hotkey_enable: DEPRECATED, WILL BE REMOVED SOON. diff --git a/Documentation/leds-class.txt b/Documentation/leds-class.txt index 6399557cdab3..8fd5ca2ae32d 100644 --- a/Documentation/leds-class.txt +++ b/Documentation/leds-class.txt @@ -1,3 +1,4 @@ + LED handling under Linux ======================== @@ -5,10 +6,10 @@ If you're reading this and thinking about keyboard leds, these are handled by the input subsystem and the led class is *not* needed. In its simplest form, the LED class just allows control of LEDs from -userspace. LEDs appear in /sys/class/leds/. The brightness file will -set the brightness of the LED (taking a value 0-255). Most LEDs don't -have hardware brightness support so will just be turned on for non-zero -brightness settings. +userspace. LEDs appear in /sys/class/leds/. 
The maximum brightness of the +LED is defined in max_brightness file. The brightness file will set the brightness +of the LED (taking a value 0-max_brightness). Most LEDs don't have hardware +brightness support so will just be turned on for non-zero brightness settings. The class also introduces the optional concept of an LED trigger. A trigger is a kernel based source of led events. Triggers can either be simple or diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index 950cde6d6e58..ba9373f82ab5 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -42,6 +42,7 @@ #include <signal.h> #include "linux/lguest_launcher.h" #include "linux/virtio_config.h" +#include <linux/virtio_ids.h> #include "linux/virtio_net.h" #include "linux/virtio_blk.h" #include "linux/virtio_console.h" @@ -133,6 +134,9 @@ struct device { /* Is it operational */ bool running; + /* Does Guest want an intrrupt on empty? */ + bool irq_on_empty; + /* Device-specific data. */ void *priv; }; @@ -623,10 +627,13 @@ static void trigger_irq(struct virtqueue *vq) return; vq->pending_used = 0; - /* If they don't want an interrupt, don't send one, unless empty. */ - if ((vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) - && lg_last_avail(vq) != vq->vring.avail->idx) - return; + /* If they don't want an interrupt, don't send one... */ + if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) { + /* ... unless they've asked us to force one on empty. */ + if (!vq->dev->irq_on_empty + || lg_last_avail(vq) != vq->vring.avail->idx) + return; + } /* Send the Guest an interrupt tell them we used something up. 
*/ if (write(lguest_fd, buf, sizeof(buf)) != 0) @@ -1042,6 +1049,15 @@ static void create_thread(struct virtqueue *vq) close(vq->eventfd); } +static bool accepted_feature(struct device *dev, unsigned int bit) +{ + const u8 *features = get_feature_bits(dev) + dev->feature_len; + + if (dev->feature_len < bit / CHAR_BIT) + return false; + return features[bit / CHAR_BIT] & (1 << (bit % CHAR_BIT)); +} + static void start_device(struct device *dev) { unsigned int i; @@ -1055,6 +1071,8 @@ static void start_device(struct device *dev) verbose(" %02x", get_feature_bits(dev) [dev->feature_len+i]); + dev->irq_on_empty = accepted_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY); + for (vq = dev->vq; vq; vq = vq->next) { if (vq->service) create_thread(vq); diff --git a/Documentation/memory.txt b/Documentation/memory.txt index 2b3dedd39538..802efe58647c 100644 --- a/Documentation/memory.txt +++ b/Documentation/memory.txt @@ -1,18 +1,7 @@ There are several classic problems related to memory on Linux systems. - 1) There are some buggy motherboards which cannot properly - deal with the memory above 16MB. Consider exchanging - your motherboard. - - 2) You cannot do DMA on the ISA bus to addresses above - 16M. Most device drivers under Linux allow the use - of bounce buffers which work around this problem. Drivers - that don't use bounce buffers will be unstable with - more than 16M installed. Drivers that use bounce buffers - will be OK, but may have slightly higher overhead. - - 3) There are some motherboards that will not cache above + 1) There are some motherboards that will not cache above a certain quantity of memory. If you have one of these motherboards, your system will be SLOWER, not faster as you add more memory. Consider exchanging your @@ -24,7 +13,7 @@ It can also tell Linux to use less memory than is actually installed. If you use "mem=" on a machine with PCI, consider using "memmap=" to avoid physical address space collisions. 
-See the documentation of your boot loader (LILO, loadlin, etc.) about +See the documentation of your boot loader (LILO, grub, loadlin, etc.) about how to pass options to the kernel. There are other memory problems which Linux cannot deal with. Random @@ -42,19 +31,3 @@ Try: with the vendor. Consider testing it with memtest86 yourself. * Exchanging your CPU, cache, or motherboard for one that works. - - * Disabling the cache from the BIOS. - - * Try passing the "mem=4M" option to the kernel to limit - Linux to using a very small amount of memory. Use "memmap="-option - together with "mem=" on systems with PCI to avoid physical address - space collisions. - - -Other tricks: - - * Try passing the "no-387" option to the kernel to ignore - a buggy FPU. - - * Try passing the "no-hlt" option to disable the potentially - buggy HLT instruction in your CPU. diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt index eaa1a25946c1..ee31369e9e5b 100644 --- a/Documentation/networking/regulatory.txt +++ b/Documentation/networking/regulatory.txt @@ -96,7 +96,7 @@ Example code - drivers hinting an alpha2: This example comes from the zd1211rw device driver. You can start by having a mapping of your device's EEPROM country/regulatory -domain value to to a specific alpha2 as follows: +domain value to a specific alpha2 as follows: static struct zd_reg_alpha2_map reg_alpha2_map[] = { { ZD_REGDOMAIN_FCC, "US" }, diff --git a/Documentation/numastat.txt b/Documentation/numastat.txt index 80133ace1eb2..9fcc9a608dc0 100644 --- a/Documentation/numastat.txt +++ b/Documentation/numastat.txt @@ -7,10 +7,10 @@ All units are pages. Hugepages have separate counters. numa_hit A process wanted to allocate memory from this node, and succeeded. -numa_miss A process wanted to allocate memory from this node, - but ended up with memory from another. -numa_foreign A process wanted to allocate on another node, - but ended up with memory from this one. 
+numa_miss A process wanted to allocate memory from another node, + but ended up with memory from this node. +numa_foreign A process wanted to allocate on this node, + but ended up with memory from another one. local_node A process ran on this node and got memory from it. other_node A process ran on this node and got memory from another node. interleave_hit Interleaving wanted to allocate from this node diff --git a/Documentation/pcmcia/crc32hash.c b/Documentation/pcmcia/crc32hash.c index 4210e5abab8a..44f8beea7260 100644 --- a/Documentation/pcmcia/crc32hash.c +++ b/Documentation/pcmcia/crc32hash.c @@ -8,7 +8,7 @@ $ ./crc32hash "Dual Speed" #include <ctype.h> #include <stdlib.h> -unsigned int crc32(unsigned char const *p, unsigned int len) +static unsigned int crc32(unsigned char const *p, unsigned int len) { int i; unsigned int crc = 0; diff --git a/Documentation/powerpc/dts-bindings/fsl/esdhc.txt b/Documentation/powerpc/dts-bindings/fsl/esdhc.txt index 3ed3797b5086..8a0040738969 100644 --- a/Documentation/powerpc/dts-bindings/fsl/esdhc.txt +++ b/Documentation/powerpc/dts-bindings/fsl/esdhc.txt @@ -10,6 +10,8 @@ Required properties: - interrupts : should contain eSDHC interrupt. - interrupt-parent : interrupt source phandle. - clock-frequency : specifies eSDHC base clock frequency. + - sdhci,wp-inverted : (optional) specifies that eSDHC controller + reports inverted write-protect state; - sdhci,1-bit-only : (optional) specifies that a controller can only handle 1-bit data transfers. diff --git a/Documentation/powerpc/dts-bindings/marvell.txt b/Documentation/powerpc/dts-bindings/marvell.txt index 3708a2fd4747..f1533d91953a 100644 --- a/Documentation/powerpc/dts-bindings/marvell.txt +++ b/Documentation/powerpc/dts-bindings/marvell.txt @@ -32,7 +32,7 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd. devices. 
This field represents the number of cells needed to represent the address of the memory-mapped registers of devices within the system controller chip. - - #size-cells : Size representation for for the memory-mapped + - #size-cells : Size representation for the memory-mapped registers within the system controller chip. - #interrupt-cells : Defines the width of cells used to represent interrupts. diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index 8deffcd68cb8..9104c1062084 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -135,6 +135,30 @@ a high functionality RTC is integrated into the SOC. That system might read the system clock from the discrete RTC, but use the integrated one for all other tasks, because of its greater functionality. +SYSFS INTERFACE +--------------- + +The sysfs interface under /sys/class/rtc/rtcN provides access to various +rtc attributes without requiring the use of ioctls. All dates and times +are in the RTC's timezone, rather than in system time. + +date: RTC-provided date +hctosys: 1 if the RTC provided the system time at boot via the + CONFIG_RTC_HCTOSYS kernel option, 0 otherwise +max_user_freq: The maximum interrupt rate an unprivileged user may request + from this RTC. +name: The name of the RTC corresponding to this sysfs directory +since_epoch: The number of seconds since the epoch according to the RTC +time: RTC-provided time +wakealarm: The time at which the clock will generate a system wakeup + event. This is a one shot wakeup event, so must be reset + after wake if a daily wakeup is required. Format is either + seconds since the epoch or, if there's a leading +, seconds + in the future. + +IOCTL INTERFACE +--------------- + The ioctl() calls supported by /dev/rtc are also supported by the RTC class framework. However, because the chips and systems are not standardized, some PC/AT functionality might not be provided. And in the same way, some @@ -185,6 +209,8 @@ driver returns ENOIOCTLCMD. 
Some common examples: hardware in the irq_set_freq function. If it isn't, return -EINVAL. If you cannot actually change the frequency, do not define irq_set_freq. + * RTC_PIE_ON, RTC_PIE_OFF: the irq_set_state function will be called. + If all else fails, check out the rtc-test.c driver! diff --git a/Documentation/scsi/ChangeLog.megaraid b/Documentation/scsi/ChangeLog.megaraid index eaa4801f2ce6..38e9e7cadc90 100644 --- a/Documentation/scsi/ChangeLog.megaraid +++ b/Documentation/scsi/ChangeLog.megaraid @@ -514,7 +514,7 @@ iv. Remove yield() while mailbox handshake in synchronous commands v. Remove redundant __megaraid_busywait_mbox routine -vi. Fix bug in the managment module, which causes a system lockup when the +vi. Fix bug in the management module, which causes a system lockup when the IO module is loaded and then unloaded, followed by executing any management utility. The current version of management module does not handle the adapter unregister properly. diff --git a/Documentation/scsi/scsi_fc_transport.txt b/Documentation/scsi/scsi_fc_transport.txt index d7f181701dc2..aec6549ab097 100644 --- a/Documentation/scsi/scsi_fc_transport.txt +++ b/Documentation/scsi/scsi_fc_transport.txt @@ -378,7 +378,7 @@ Vport Disable/Enable: int vport_disable(struct fc_vport *vport, bool disable) where: - vport: Is vport to to be enabled or disabled + vport: Is vport to be enabled or disabled disable: If "true", the vport is to be disabled. If "false", the vport is to be enabled. 
diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index 97eebd63bedc..f1708b79f963 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -387,7 +387,7 @@ STAC92HD73* STAC92HD83* =========== ref Reference board - mic-ref Reference board with power managment for ports + mic-ref Reference board with power management for ports dell-s14 Dell laptop auto BIOS setup (default) diff --git a/Documentation/spi/spi-summary b/Documentation/spi/spi-summary index 4a02d2508bc8..deab51ddc33e 100644 --- a/Documentation/spi/spi-summary +++ b/Documentation/spi/spi-summary @@ -350,7 +350,7 @@ SPI protocol drivers somewhat resemble platform device drivers: .resume = CHIP_resume, }; -The driver core will autmatically attempt to bind this driver to any SPI +The driver core will automatically attempt to bind this driver to any SPI device whose board_info gave a modalias of "CHIP". Your probe() code might look like this unless you're creating a device which is managing a bus (appearing under /sys/class/spi_master). 
diff --git a/Documentation/spi/spidev_test.c b/Documentation/spi/spidev_test.c index c1a5aad3c75a..10abd3773e49 100644 --- a/Documentation/spi/spidev_test.c +++ b/Documentation/spi/spidev_test.c @@ -69,7 +69,7 @@ static void transfer(int fd) puts(""); } -void print_usage(const char *prog) +static void print_usage(const char *prog) { printf("Usage: %s [-DsbdlHOLC3]\n", prog); puts(" -D --device device to use (default /dev/spidev1.1)\n" @@ -85,7 +85,7 @@ void print_usage(const char *prog) exit(1); } -void parse_opts(int argc, char *argv[]) +static void parse_opts(int argc, char *argv[]) { while (1) { static const struct option lopts[] = { diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 2dbff53369d0..b3d8b4922740 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -313,31 +313,43 @@ send before ratelimiting kicks in. ============================================================== +printk_delay: + +Delay each printk message in printk_delay milliseconds + +Value from 0 - 10000 is allowed. + +============================================================== + randomize-va-space: This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature. -0 - Turn the process address space randomization off by default. +0 - Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. 1 - Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be - loaded to random addresses. Also for PIE-linked binaries, the location - of code start is randomized. + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + CONFIG_COMPAT_BRK option is enabled. 
+ +2 - Additionally enable heap randomization. This is the default if + CONFIG_COMPAT_BRK is disabled. - With heap randomization, the situation is a little bit more - complicated. - There a few legacy applications out there (such as some ancient + There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts - just after the end of the code+bss. These applications break when - start of the brk area is randomized. There are however no known + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known non-legacy applications that would be broken this way, so for most - systems it is safe to choose full randomization. However there is - a CONFIG_COMPAT_BRK option for systems with ancient and/or broken - binaries, that makes heap non-randomized, but keeps all other - parts of process address space randomized if randomize_va_space - sysctl is turned on. + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + address space randomization. ============================================================== diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index c4de6359d440..e6fb1ec2744b 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -585,7 +585,9 @@ caching of directory and inode objects. At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 +to retain dentry and inode caches. 
When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes. ============================================================== diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt new file mode 100644 index 000000000000..6ef2a8652e17 --- /dev/null +++ b/Documentation/trace/events-kmem.txt @@ -0,0 +1,107 @@ + Subsystem Trace Points: kmem + +The tracing system kmem captures events related to object and page allocation +within the kernel. Broadly speaking there are five major subheadings. + + o Slab allocation of small objects of unknown type (kmalloc) + o Slab allocation of small objects of known type + o Page allocation + o Per-CPU Allocator Activity + o External Fragmentation + +This document will describe what each of the tracepoints is and why it +might be useful. + +1. Slab allocation of small objects of unknown type +=================================================== +kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kfree call_site=%lx ptr=%p + +Heavy activity for these events may indicate that a specific cache is +justified, particularly if kmalloc slab pages are getting significantly +internally fragmented as a result of the allocation pattern. By correlating +kmalloc with kfree, it may be possible to identify memory leaks and where +the allocation sites were. + + +2.
Slab allocation of small objects of known type +================================================= +kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kmem_cache_free call_site=%lx ptr=%p + +These events are similar in usage to the kmalloc-related events except that +it is likely easier to pin the event down to a specific cache. At the time +of writing, no information is available on what slab is being allocated from, +but the call_site can usually be used to extrapolate that information. + +3. Page allocation +================== +mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_free_direct page=%p pfn=%lu order=%d +mm_pagevec_free page=%p pfn=%lu order=%d cold=%d + +These four events deal with page allocation and freeing. mm_page_alloc is +a simple indicator of page allocator activity. Pages may be allocated from +the per-CPU allocator (high performance) or the buddy allocator. + +If pages are allocated directly from the buddy allocator, the +mm_page_alloc_zone_locked event is triggered. This event is important as high +amounts of activity imply high activity on the zone->lock. Taking this lock +impairs performance by disabling interrupts, dirtying cache lines between +CPUs and serialising many CPUs. + +When a page is freed directly by the caller, the mm_page_free_direct event +is triggered. Significant amounts of activity here could indicate that the +callers should be batching their activities. + +When pages are freed using a pagevec, the mm_pagevec_free event is +triggered. Broadly speaking, pages are taken off the LRU lock in bulk and +freed in batch with a pagevec. Significant amounts of activity here could +indicate that the system is under memory pressure and can also indicate +contention on the zone->lru_lock. + +4.
Per-CPU Allocator Activity +============================= +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d + +In front of the page allocator is a per-cpu page allocator. It exists only +for order-0 pages, reduces contention on the zone->lock and reduces the +amount of writing on struct page. + +When a per-CPU list is empty or pages of the wrong type are allocated, +the zone->lock will be taken once and the per-CPU list refilled. The event +triggered is mm_page_alloc_zone_locked for each page allocated with the +event indicating whether it is for a percpu_refill or not. + +When the per-CPU list is too full, a number of pages are freed, each of +which triggers a mm_page_pcpu_drain event. + +The events are reported individually so that pages can be tracked +between allocation and freeing. A number of drain or refill pages that occur +consecutively imply the zone->lock being taken once. Large amounts of PCP +refills and drains could imply an imbalance between CPUs where too much work +is being concentrated in one place. It could also indicate that the per-CPU +lists should be a larger size. Finally, large amounts of refills on one CPU +and drains on another could be a factor in causing large amounts of cache +line bounces due to writes between CPUs and is worth investigating if pages +can be allocated and freed on the same CPU through some algorithm change. + +5. External Fragmentation +========================= +mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d + +External fragmentation affects whether a high-order allocation will be +successful or not. For some types of hardware, this is important although +it is avoided where possible.
If the system is using huge pages and needs +to be able to resize the pool over the lifetime of the system, this value +is important. + +Large numbers of this event imply that memory is fragmenting and +high-order allocations will start failing at some time in the future. One +means of reducing the occurrence of this event is to increase the size of +min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where +pageblock_size is usually the default hugepage size. diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index 78c45a87be57..02ac6ed38b2d 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt @@ -72,7 +72,7 @@ To enable all events in sched subsystem: # echo 1 > /sys/kernel/debug/tracing/events/sched/enable -To eanble all events: +To enable all events: # echo 1 > /sys/kernel/debug/tracing/events/enable diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 1b6292bbdd6d..957b22fde2df 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -133,7 +133,7 @@ of ftrace. Here is a list of some of the key files: than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size - due to buffer managment overhead. ) + due to buffer management overhead. ) This can only be updated when the current_tracer is set to "nop". diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl new file mode 100644 index 000000000000..7df50e8cf4d9 --- /dev/null +++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl @@ -0,0 +1,418 @@ +#!/usr/bin/perl +# This is a POC (proof of concept or piece of crap, take your pick) for reading the +# text representation of trace output related to page allocation.
It makes an attempt +# to extract some high-level information on what is going on. The accuracy of the parser +# may vary considerably +# +# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --prepend-parent Report on the parent proc and PID +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_PAGE_ALLOC => 1; +use constant MM_PAGE_FREE_DIRECT => 2; +use constant MM_PAGEVEC_FREE => 3; +use constant MM_PAGE_PCPU_DRAIN => 4; +use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; +use constant MM_PAGE_ALLOC_EXTFRAG => 6; +use constant EVENT_UNKNOWN => 7; + +# Constants used to track state +use constant STATE_PCPU_PAGES_DRAINED => 8; +use constant STATE_PCPU_PAGES_REFILLED => 9; + +# High-level events extrapolated from tracepoints +use constant HIGH_PCPU_DRAINS => 10; +use constant HIGH_PCPU_REFILLS => 11; +use constant HIGH_EXT_FRAGMENT => 12; +use constant HIGH_EXT_FRAGMENT_SEVERE => 13; +use constant HIGH_EXT_FRAGMENT_MODERATE => 14; +use constant HIGH_EXT_FRAGMENT_CHANGED => 15; + +my %perprocesspid; +my %perprocess; +my $opt_ignorepid; +my $opt_read_procstat; +my $opt_prepend_parent; + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. 
Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, + 'prepend-parent' => \$opt_prepend_parent, +); + +# Defaults for dynamically discovered regexes +my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; + +# Dynamically discovered regex +my $regex_fragdetails; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + $regex = $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + if ($line =~ /^print fmt:\s"(.*)",.*/) { + $regex = $1; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected '$key' != '$expected'\n"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if
(defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} +$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", + $regex_fragdetails_default, + "page", "pfn", + "alloc_order", "fallback_order", "pageblock_order", + "alloc_migratetype", "fallback_migratetype", + "fragmenting", "change_ownership"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to match stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +sub parent_info($$) { + my $pid = $_[0]; + my $statline = $_[1]; + my $ppid; + + if ($pid == 0) { + return "NOPARENT-0"; + } + + if ($statline !~ /$regex_statppid/o) { + die("Failed to match stat line process ppid:: $statline"); + } + + # Read the ppid stat line + $ppid = $1; + return guess_process_pid($ppid, read_statline($ppid)); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $tracepoint = $4; + + if ($opt_read_procstat || $opt_prepend_parent) { + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + $statline = read_statline($pid); + + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + + if ($opt_prepend_parent) { + $process_pid = parent_info($pid, $statline) . " :: $process_pid"; + } + } + + # Unnecessary in this script.
Uncomment if required + # $cpus = $2; + # $timestamp = $3; + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_page_alloc") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; + } elsif ($tracepoint eq "mm_page_free_direct") { + $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++; + } elsif ($tracepoint eq "mm_pagevec_free") { + $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++; + } elsif ($tracepoint eq "mm_page_pcpu_drain") { + $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; + } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; + } elsif ($tracepoint eq "mm_page_alloc_extfrag") { + + # Extract the details of the event now + $details = $5; + + my ($page, $pfn); + my ($alloc_order, $fallback_order, $pageblock_order); + my ($alloc_migratetype, $fallback_migratetype); + my ($fragmenting, $change_ownership); + + if ($details !~ /$regex_fragdetails/o) { + print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; + next; + } + + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; + $page = $1; + $pfn = $2; + $alloc_order = $3; + $fallback_order = $4; + $pageblock_order = $5; + $alloc_migratetype = $6; + $fallback_migratetype = $7; + $fragmenting = $8; + $change_ownership = $9; + + if ($fragmenting) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; + if ($fallback_order <= 3) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; + } else { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; + } + } + if ($change_ownership) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + # Catch a full pcpu drain event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && + $tracepoint ne "mm_page_pcpu_drain") { + + 
$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + + # Catch a full pcpu refill event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && + $tracepoint ne "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); + + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); + + foreach $process_pid (keys %stats) { + # Dump final aggregates + if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { + $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { + $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + printf("%-" . $max_strlen . 
"s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", + $process_pid, + $stats{$process_pid}->{MM_PAGE_ALLOC}, + $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, + $stats{$process_pid}->{MM_PAGE_FREE_DIRECT}, + $stats{$process_pid}->{MM_PAGEVEC_FREE}, + $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, + $stats{$process_pid}->{HIGH_PCPU_DRAINS}, + $stats{$process_pid}->{HIGH_PCPU_REFILLS}, + $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, + $stats{$process_pid}->{EVENT_UNKNOWN}); + } +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; + $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; + $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}; + $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}; + $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; + $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; + $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; + $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; + $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; + 
$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; + $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt new file mode 100644 index 000000000000..5eb4e487e667 --- /dev/null +++ b/Documentation/trace/tracepoint-analysis.txt @@ -0,0 +1,327 @@ + Notes on Analysing Behaviour Using Events and Tracepoints + + Documentation written by Mel Gorman + PCL information heavily based on email from Ingo Molnar + +1. Introduction +=============== + +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without +creating custom kernel modules to register probe functions using the event +tracing infrastructure. + +Simplistically, tracepoints represent important events that can +be taken in conjunction with other tracepoints to build a "Big Picture" of +what is going on within the system. There are a large number of methods for +gathering and interpreting these events. Lacking any current Best Practises,
Lacking any current Best Practises, +this document describes some of the methods that can be used. + +This document assumes that debugfs is mounted on /sys/kernel/debug and that +the appropriate tracing options have been configured into the kernel. It is +assumed that the PCL tool tools/perf has been installed and is in your path. + +2. Listing Available Events +=========================== + +2.1 Standard Utilities +---------------------- + +All possible events are visible from /sys/kernel/debug/tracing/events. Simply +calling + + $ find /sys/kernel/debug/tracing/events -type d + +will give a fair indication of the number of events available. + +2.2 PCL +------- + +Discovery and enumeration of all counters and events, including tracepoints +are available with the perf tool. Getting a list of available events is a +simple case of + + $ perf list 2>&1 | grep Tracepoint + ext4:ext4_free_inode [Tracepoint event] + ext4:ext4_request_inode [Tracepoint event] + ext4:ext4_allocate_inode [Tracepoint event] + ext4:ext4_write_begin [Tracepoint event] + ext4:ext4_ordered_write_end [Tracepoint event] + [ .... remaining output snipped .... ] + + +2. Enabling Events +================== + +2.1 System-Wide Event Enabling +------------------------------ + +See Documentation/trace/events.txt for a proper description on how events +can be enabled system-wide. A short example of enabling all events related +to page allocation would look something like + + $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done + +2.2 System-Wide Event Enabling with SystemTap +--------------------------------------------- + +In SystemTap, tracepoints are accessible using the kernel.trace() function +call. The following is an example that reports every 5 seconds what processes +were allocating the pages. 
+ global page_allocs + + probe kernel.trace("mm_page_alloc") { + page_allocs[execname()]++ + } + + function print_count() { + printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") + foreach (proc in page_allocs-) + printf("%-25d %s\n", page_allocs[proc], proc) + printf ("\n") + delete page_allocs + } + + probe timer.s(5) { + print_count() + } + +2.3 System-Wide Event Enabling with PCL +--------------------------------------- + +By specifying the -a switch and analysing sleep, the system-wide events +for a duration of time can be examined. + + $ perf stat -a \ + -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free \ + sleep 10 + Performance counter stats for 'sleep 10': + + 9630 kmem:mm_page_alloc + 2143 kmem:mm_page_free_direct + 7424 kmem:mm_pagevec_free + + 10.002577764 seconds time elapsed + +Similarly, one could execute a shell and exit it as desired to get a report +at that point. + +2.4 Local Event Enabling +------------------------ + +Documentation/trace/ftrace.txt describes how to enable events on a per-thread +basis using set_ftrace_pid. + +2.5 Local Event Enablement with PCL +----------------------------------- + +Events can be activated and tracked for the duration of a process on a local +basis using PCL such as follows. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free ./hackbench 10 + Time: 0.909 + + Performance counter stats for './hackbench 10': + + 17803 kmem:mm_page_alloc + 12398 kmem:mm_page_free_direct + 4827 kmem:mm_pagevec_free + + 0.973913387 seconds time elapsed + +3. Event Filtering +================== + +Documentation/trace/ftrace.txt covers in-depth how to filter events in +ftrace. Obviously using grep and awk on trace_pipe is an option, as is +any script reading trace_pipe. + +4.
Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+      -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+
+   Performance counter stats for './hackbench 10' (5 runs):
+
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system-wide basis using -a and sleep.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+      -e kmem:mm_pagevec_free \
+      -a --repeat 10 \
+      sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+
+5. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled, the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events.
+  o Calculating latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without exiting
+and twice to exit.
+
+Simplistically, the script just reads STDIN and counts up events but it
+also can do more such as
+
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on PID or individual process number
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified
+
+6. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions within a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+  $ perf record -c 1 \
+      -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+      -e kmem:mm_pagevec_free \
+      ./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the event period to sample. The default sample
+period is quite high to minimise overhead but the information collected can be
+very coarse as a result.
+
+This wrote a file called perf.data, which can be analysed using
+perf report.
+
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so
+       2.62%  hackbench  /lib/ld-2.9.so
+       1.52%       perf  [vdso]
+       1.22%  hackbench  ./hackbench
+       0.48%  hackbench  [kernel]
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so
+       0.01%       perf  /usr/bin/perf
+       0.01%       perf  /lib/ld-2.9.so
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+According to this, the vast majority of events were triggered within the
+VDSO. With simple binaries, this will often be the case so let's take a
+slightly different example. In the course of writing this, it was noticed
+that X was generating an insane amount of page allocations so let's look
+at it
+
+  $ perf record -c 1 -f \
+      -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+      -e kmem:mm_pagevec_free \
+      -p `pidof X`
+
+This was interrupted after a few seconds and
+
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so
+       0.01%     Xorg  [kernel]
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+So, almost half of the events are occurring in a library. To get an idea of
+which symbol:
+
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.]
get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong:
+
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :     34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
+         :        *__P = __B;
+   12.40 :     34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :     34ef5:       ff
+   12.40 :     34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :     34efd:       ff
+   12.39 :     34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :     34f05:       ff
+   12.67 :     34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :     34f0d:       ff
+   12.58 :     34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :     34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :     34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :     34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card. Further investigation would be needed to determine why pixmaps
+are being copied around so much but a starting point would be to take an
+ancient build of libpixman out of the library path where it was totally
+forgotten about from months ago!
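As a rough illustration of the kind of helper described in section 5, the
following C sketch pulls the PID and event name out of one line of
trace_pipe-style text. It assumes the line layout
"comm-PID [CPU] timestamp: event: details", which is an assumption about the
ftrace text output and is not specified in this document.
parse_trace_line() is a hypothetical name and not part of any kernel tool.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the PID and event name from one trace line.
 * Assumed layout: "comm-PID [CPU] timestamp: event: details".
 * Returns 0 on success, -1 if the line does not match that layout.
 */
static int parse_trace_line(const char *line, int *pid,
			    char *event, size_t event_len)
{
	const char *bracket = strchr(line, '[');
	const char *p, *ev, *ev_end;
	size_t n;

	if (!bracket || bracket == line)
		return -1;

	/* Walk back over the blanks between the PID and "[CPU]" */
	p = bracket - 1;
	while (p > line && (*p == ' ' || *p == '\t'))
		p--;

	/* Walk back over the PID digits; p should stop on the '-' separator */
	while (p > line && isdigit((unsigned char)*p))
		p--;
	if (*p != '-' || !isdigit((unsigned char)p[1]))
		return -1;
	*pid = atoi(p + 1);

	/* The event name sits between the timestamp's ": " and the next ':' */
	ev = strstr(bracket, ": ");
	if (!ev)
		return -1;
	ev += 2;
	ev_end = strchr(ev, ':');
	if (!ev_end)
		return -1;

	n = (size_t)(ev_end - ev);
	if (n == 0 || n >= event_len)
		return -1;
	memcpy(event, ev, n);
	event[n] = '\0';
	return 0;
}
```

A caller would typically feed this one line at a time, for example from an
fgets() loop over STDIN, and keep per-PID or per-event counters. Lines that do
not match, such as header comments, are simply rejected with -1.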
diff --git a/Documentation/usb/authorization.txt b/Documentation/usb/authorization.txt index 381b22ee7834..c069b6884c77 100644 --- a/Documentation/usb/authorization.txt +++ b/Documentation/usb/authorization.txt @@ -16,20 +16,20 @@ Usage: Authorize a device to connect: -$ echo 1 > /sys/usb/devices/DEVICE/authorized +$ echo 1 > /sys/bus/usb/devices/DEVICE/authorized Deauthorize a device: -$ echo 0 > /sys/usb/devices/DEVICE/authorized +$ echo 0 > /sys/bus/usb/devices/DEVICE/authorized Set new devices connected to hostX to be deauthorized by default (ie: lock down): -$ echo 0 > /sys/bus/devices/usbX/authorized_default +$ echo 0 > /sys/bus/usb/devices/usbX/authorized_default Remove the lock down: -$ echo 1 > /sys/bus/devices/usbX/authorized_default +$ echo 1 > /sys/bus/usb/devices/usbX/authorized_default By default, Wired USB devices are authorized by default to connect. Wireless USB hosts deauthorize by default all new connected @@ -47,7 +47,7 @@ USB port): boot up rc.local -> - for host in /sys/bus/devices/usb* + for host in /sys/bus/usb/devices/usb* do echo 0 > $host/authorized_default done diff --git a/Documentation/usb/usbmon.txt b/Documentation/usb/usbmon.txt index 6c3c625b7f30..66f92d1194c1 100644 --- a/Documentation/usb/usbmon.txt +++ b/Documentation/usb/usbmon.txt @@ -33,7 +33,7 @@ if usbmon is built into the kernel. Verify that bus sockets are present. -# ls /sys/kernel/debug/usbmon +# ls /sys/kernel/debug/usb/usbmon 0s 0u 1s 1t 1u 2s 2t 2u 3s 3t 3u 4s 4t 4u # @@ -58,11 +58,11 @@ Bus=03 means it's bus 3. 3. Start 'cat' -# cat /sys/kernel/debug/usbmon/3u > /tmp/1.mon.out +# cat /sys/kernel/debug/usb/usbmon/3u > /tmp/1.mon.out to listen on a single bus, otherwise, to listen on all buses, type: -# cat /sys/kernel/debug/usbmon/0u > /tmp/1.mon.out +# cat /sys/kernel/debug/usb/usbmon/0u > /tmp/1.mon.out This process will be reading until killed. Naturally, the output can be redirected to a desirable location. 
This is preferred, because it is going @@ -305,7 +305,7 @@ Before the call, hdr, data, and alloc should be filled. Upon return, the area pointed by hdr contains the next event structure, and the data buffer contains the data, if any. The event is removed from the kernel buffer. -The MON_IOCX_GET copies 48 bytes, MON_IOCX_GETX copies 64 bytes. +The MON_IOCX_GET copies 48 bytes to hdr area, MON_IOCX_GETX copies 64 bytes. MON_IOCX_MFETCH, defined as _IOWR(MON_IOC_MAGIC, 7, struct mon_mfetch_arg) diff --git a/Documentation/video4linux/v4lgrab.c b/Documentation/video4linux/v4lgrab.c index 05769cff1009..c8ded175796e 100644 --- a/Documentation/video4linux/v4lgrab.c +++ b/Documentation/video4linux/v4lgrab.c @@ -89,7 +89,7 @@ } \ } -int get_brightness_adj(unsigned char *image, long size, int *brightness) { +static int get_brightness_adj(unsigned char *image, long size, int *brightness) { long i, tot = 0; for (i=0;i<size*3;i++) tot += image[i]; diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index 2f77ced35df7..e57d6a9dd32b 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -6,6 +6,8 @@ balance - various information on memory balancing. hugetlbpage.txt - a brief summary of hugetlbpage support in the Linux kernel. +ksm.txt + - how to use the Kernel Samepage Merging feature. locking - info on how locking and synchronization is done in the Linux vm code. numa @@ -20,3 +22,5 @@ slabinfo.c - source code for a tool to get reports about slabs. slub.txt - a short users guide for SLUB. +map_hugetlb.c + - an example program that uses the MAP_HUGETLB mmap flag. diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index ea8714fcc3ad..82a7bd1800b2 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -18,13 +18,13 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS automatically when CONFIG_HUGETLBFS is selected) configuration options. 
-The kernel built with hugepage support should show the number of configured -hugepages in the system by running the "cat /proc/meminfo" command. +The kernel built with huge page support should show the number of configured +huge pages in the system by running the "cat /proc/meminfo" command. /proc/meminfo also provides information about the total number of hugetlb pages configured in the kernel. It also displays information about the number of free hugetlb pages at any time. It also displays information about -the configured hugepage size - this is needed for generating the proper +the configured huge page size - this is needed for generating the proper alignment and size of the arguments to the above system calls. The output of "cat /proc/meminfo" will have lines like: @@ -37,25 +37,27 @@ HugePages_Surp: yyy Hugepagesize: zzz kB where: -HugePages_Total is the size of the pool of hugepages. -HugePages_Free is the number of hugepages in the pool that are not yet -allocated. -HugePages_Rsvd is short for "reserved," and is the number of hugepages -for which a commitment to allocate from the pool has been made, but no -allocation has yet been made. It's vaguely analogous to overcommit. -HugePages_Surp is short for "surplus," and is the number of hugepages in -the pool above the value in /proc/sys/vm/nr_hugepages. The maximum -number of surplus hugepages is controlled by -/proc/sys/vm/nr_overcommit_hugepages. +HugePages_Total is the size of the pool of huge pages. +HugePages_Free is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. 
+HugePages_Surp is short for "surplus," and is the number of huge pages in + the pool above the value in /proc/sys/vm/nr_hugepages. The + maximum number of surplus huge pages is controlled by + /proc/sys/vm/nr_overcommit_hugepages. /proc/filesystems should also show a filesystem of type "hugetlbfs" configured in the kernel. /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb pages in the kernel. Super user can dynamically request more (or free some -pre-configured) hugepages. +pre-configured) huge pages. The allocation (or deallocation) of hugetlb pages is possible only if there are -enough physically contiguous free pages in system (freeing of hugepages is +enough physically contiguous free pages in system (freeing of huge pages is possible only if there are enough hugetlb pages free that can be transferred back to regular memory pool). @@ -67,43 +69,82 @@ use either the mmap system call or shared memory system calls to start using the huge pages. It is required that the system administrator preallocate enough memory for huge page purposes. -Use the following command to dynamically allocate/deallocate hugepages: +The administrator can preallocate huge pages on the kernel boot command line by +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages +requested. This is the most reliable method for preallocating huge pages as +memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To preallocate huge pages +of a specific size, one must precede the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=<size>". <size> must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=<size>" boot parameter. + +/proc/sys/vm/nr_hugepages indicates the current number of configured [default +size] hugetlb pages in the kernel. 
Super user can dynamically request more +(or free some pre-configured) huge pages. + +Use the following command to dynamically allocate/deallocate default sized +huge pages: echo 20 > /proc/sys/vm/nr_hugepages -This command will try to configure 20 hugepages in the system. The success -or failure of allocation depends on the amount of physically contiguous -memory that is preset in system at this time. System administrators may want -to put this command in one of the local rc init files. This will enable the -kernel to request huge pages early in the boot process (when the possibility -of getting physical contiguous pages is still very high). In either -case, administrators will want to verify the number of hugepages actually -allocated by checking the sysctl or meminfo. - -/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of -hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are -requested by applications. echo'ing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain -hugepages from the buddy allocator, if the normal pool is exhausted. As -these surplus hugepages go out of use, they are freed back to the buddy +This command will try to configure 20 default sized huge pages in the system. +On a NUMA platform, the kernel will attempt to distribute the huge page pool +over all the on-line nodes. These huge pages, allocated when nr_hugepages +is increased, are called "persistent huge pages". + +The success or failure of huge page allocation depends on the amount of +physically contiguous memory that is present in the system at the time of the +allocation attempt. If the kernel is unable to allocate huge pages from +some nodes in a NUMA system, it will attempt to make up the difference by +allocating extra pages on other nodes with sufficient available contiguous +memory, if any. + +System administrators may want to put this command in one of the local rc init +files. 
This will enable the kernel to request huge pages early in the boot +process when the possibility of getting physical contiguous pages is still +very high. Administrators can verify the number of huge pages actually +allocated by checking the sysctl or meminfo. To check the per node +distribution of huge pages in a NUMA system, use: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge + +/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of +huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are +requested by applications. Writing any non-zero value into this file +indicates that the hugetlb subsystem is allowed to try to obtain "surplus" +huge pages from the buddy allocator, when the normal pool is exhausted. As +these surplus huge pages go out of use, they are freed back to the buddy allocator. +When increasing the huge page pool size via nr_hugepages, any surplus +pages will first be promoted to persistent huge pages. Then, additional +huge pages will be allocated, if necessary and if possible, to fulfill +the new huge page pool size. + +The administrator may shrink the pool of preallocated huge pages for +the default huge page size by setting the nr_hugepages sysctl to a +smaller value. The kernel will attempt to balance the freeing of huge pages +across all on-line nodes. Any free huge pages on the selected nodes will +be freed back to the buddy allocator. + Caveat: Shrinking the pool via nr_hugepages such that it becomes less -than the number of hugepages in use will convert the balance to surplus +than the number of huge pages in use will convert the balance to surplus huge pages even if it would exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. 
-With support for multiple hugepage pools at run-time available, much of -the hugepage userspace interface has been duplicated in sysfs. The above -information applies to the default hugepage size (which will be -controlled by the proc interfaces for backwards compatibility). The root -hugepage control directory is +With support for multiple huge page pools at run-time available, much of +the huge page userspace interface has been duplicated in sysfs. The above +information applies to the default huge page size which will be +controlled by the /proc interfaces for backwards compatibility. The root +huge page control directory in sysfs is: /sys/kernel/mm/hugepages -For each hugepage size supported by the running kernel, a subdirectory +For each huge page size supported by the running kernel, a subdirectory will exist, of the form hugepages-${size}kB @@ -116,9 +157,9 @@ Inside each of these directories, the same set of files will exist: resv_hugepages surplus_hugepages -which function as described above for the default hugepage-sized case. +which function as described above for the default huge page-sized case. -If the user applications are going to request hugepages using mmap system +If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of type hugetlbfs: @@ -127,7 +168,7 @@ type hugetlbfs: none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -/mnt/huge. Any files created on /mnt/huge uses hugepages. The uid and gid +/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid options sets the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of root of file system to value & 0777. This value is given in octal. 
@@ -146,24 +187,26 @@ Regular chown, chgrp, and chmod commands (with right permissions) could be used to change the file attributes on hugetlbfs. Also, it is important to note that no such mount command is required if the -applications are going to use only shmat/shmget system calls. Users who -wish to use hugetlb page via shared memory segment should be a member of -a supplementary group and system admin needs to configure that gid into -/proc/sys/vm/hugetlb_shm_group. It is possible for same or different -applications to use any combination of mmaps and shm* calls, though the -mount of filesystem will be required for using mmap calls. +applications are going to use only shmat/shmget system calls or mmap with +MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment +should be a member of a supplementary group and system admin needs to +configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for +same or different applications to use any combination of mmaps and shm* +calls, though the mount of filesystem will be required for using mmap calls +without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +map_hugetlb.c. ******************************************************************* /* - * Example of using hugepage memory in a user application using Sys V shared + * Example of using huge page memory in a user application using Sys V shared * memory system calls. In this example the app is requesting 256MB of * memory that is backed by huge pages. The application uses the flag * SHM_HUGETLB in the shmget system call to inform the kernel that it is - * requesting hugepages. + * requesting huge pages. * * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * hugepages. That means the addresses starting with 0x800000... will need + * huge pages. That means the addresses starting with 0x800000... will need * to be specified. Specifying a fixed address is not required on ppc64, * i386 or x86_64. 
* @@ -252,14 +295,14 @@ int main(void) ******************************************************************* /* - * Example of using hugepage memory in a user application using the mmap + * Example of using huge page memory in a user application using the mmap * system call. Before running this application, make sure that the * administrator has mounted the hugetlbfs filesystem (on some directory * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this * example, the app is requesting memory of size 256MB that is backed by * huge pages. * - * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. + * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. * That means the addresses starting with 0x800000... will need to be * specified. Specifying a fixed address is not required on ppc64, i386 * or x86_64. diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt new file mode 100644 index 000000000000..72a22f65960e --- /dev/null +++ b/Documentation/vm/ksm.txt @@ -0,0 +1,89 @@ +How to use the Kernel Samepage Merging feature +---------------------------------------------- + +KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, +added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, +and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ + +The KSM daemon ksmd periodically scans those areas of user memory which +have been registered with it, looking for pages of identical content which +can be replaced by a single write-protected page (which is automatically +copied if a process later wants to update its content). + +KSM was originally developed for use with KVM (where it was known as +Kernel Shared Memory), to fit more virtual machines into physical memory, +by sharing the data common between them. But it can be useful to any +application which generates many instances of the same data. 
+
+KSM only merges anonymous (private) pages, never pagecache (file) pages.
+KSM's merged pages are at present locked into kernel memory for as long
+as they are shared: so cannot be swapped out like the user pages they
+replace (but swapping KSM pages should follow soon in a later release).
+
+KSM only operates on those areas of address space which an application
+has advised to be likely candidates for merging, by using the madvise(2)
+system call: int madvise(addr, length, MADV_MERGEABLE).
+
+The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel
+that advice and restore unshared pages: whereupon KSM unmerges whatever
+it merged in that range. Note: this unmerging call may suddenly require
+more memory than is available - possibly failing with EAGAIN, but more
+probably arousing the Out-Of-Memory killer.
+
+If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
+and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
+built with CONFIG_KSM=y, those calls will normally succeed: even if the
+KSM daemon is not currently running, MADV_MERGEABLE still registers
+the range for whenever the KSM daemon is started; even if the range
+cannot contain any pages which KSM could actually merge; even if
+MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
+
+Like other madvise calls, they are intended for use on mapped areas of
+the user address space: they will report ENOMEM if the specified range
+includes unmapped gaps (though working on the intervening mapped areas),
+and might fail with EAGAIN if not enough memory for internal structures.
+
+Applications should be considerate in their use of MADV_MERGEABLE,
+restricting its use to areas likely to benefit. KSM's scans may use
+a lot of processing power, and its kernel-resident pages are a limited
+resource. Some installations will disable KSM for these reasons.
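A minimal sketch of an application registering a region with KSM through the
madvise calls described above, assuming a Linux system with <sys/mman.h>.
alloc_mergeable() and stop_merging() are hypothetical helper names. The
fallback values 12 and 13 for MADV_MERGEABLE and MADV_UNMERGEABLE are assumed
from the generic Linux headers, for libc headers that do not yet define them;
some architectures use different values.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS and madvise() */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed fallback values, taken from the generic Linux headers */
#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE   12
#define MADV_UNMERGEABLE 13
#endif

/*
 * Hypothetical helper: map len bytes of private anonymous memory and
 * advise KSM that the range is a candidate for merging.
 * Returns NULL if mmap() fails. *merge_err is set to 0 when the advice
 * was accepted, otherwise to the errno reported by madvise().
 */
static void *alloc_mergeable(size_t len, int *merge_err)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return NULL;

	*merge_err = 0;
	if (madvise(addr, len, MADV_MERGEABLE) != 0)
		*merge_err = errno;	/* e.g. EINVAL without CONFIG_KSM */

	return addr;
}

/* Cancel the advice for a range; returns 0 or the errno from madvise(). */
static int stop_merging(void *addr, size_t len)
{
	return madvise(addr, len, MADV_UNMERGEABLE) ? errno : 0;
}
```

The mapping stays usable even when the advice is rejected, so a caller can
treat a non-zero *merge_err as "no merging available" and carry on.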
+ +The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, +readable by all but writable only by root: + +max_kernel_pages - set to maximum number of kernel pages that KSM may use + e.g. "echo 2000 > /sys/kernel/mm/ksm/max_kernel_pages" + Value 0 imposes no limit on the kernel pages KSM may use; + but note that any process using MADV_MERGEABLE can cause + KSM to allocate these pages, unswappable until it exits. + Default: 2000 (chosen for demonstration purposes) + +pages_to_scan - how many present pages to scan before ksmd goes to sleep + e.g. "echo 200 > /sys/kernel/mm/ksm/pages_to_scan" + Default: 200 (chosen for demonstration purposes) + +sleep_millisecs - how many milliseconds ksmd should sleep before next scan + e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" + Default: 20 (chosen for demonstration purposes) + +run - set 0 to stop ksmd from running but keep merged pages, + set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", + set 2 to stop ksmd and unmerge all pages currently merged, + but leave mergeable areas registered for next run + Default: 1 (for immediate use by apps which register) + +The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: + +pages_shared - how many shared unswappable kernel pages KSM is using +pages_sharing - how many more sites are sharing them i.e. how much saved +pages_unshared - how many pages unique but repeatedly checked for merging +pages_volatile - how many pages changing too fast to be placed in a tree +full_scans - how many times all mergeable areas have been scanned + +A high ratio of pages_sharing to pages_shared indicates good sharing, but +a high ratio of pages_unshared to pages_sharing indicates wasted effort. +pages_volatile embraces several different kinds of activity, but a high +proportion there would also indicate poor use of madvise MADV_MERGEABLE. 
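The two ratios described above could be computed by a small helper along these
lines. This is only a sketch: read_counter(), ratio() and print_ksm_ratios()
are hypothetical names, and the only paths assumed are the
/sys/kernel/mm/ksm/ files listed above.

```c
#include <stdio.h>

/* Read a single unsigned counter from a sysfs-style file; 0 on success. */
static int read_counter(const char *path, unsigned long *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%lu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* num/den as a double, or 0.0 when den is zero (nothing to compare with). */
static double ratio(unsigned long num, unsigned long den)
{
	return den ? (double)num / (double)den : 0.0;
}

/* Print both ratios; returns -1 if the KSM files cannot be read. */
static int print_ksm_ratios(void)
{
	unsigned long shared, sharing, unshared;

	if (read_counter("/sys/kernel/mm/ksm/pages_shared", &shared) ||
	    read_counter("/sys/kernel/mm/ksm/pages_sharing", &sharing) ||
	    read_counter("/sys/kernel/mm/ksm/pages_unshared", &unshared))
		return -1;

	/* High is good: many sites share each kernel page */
	printf("sharing/shared:   %.2f\n", ratio(sharing, shared));
	/* High is bad: much scanning for little sharing */
	printf("unshared/sharing: %.2f\n", ratio(unshared, sharing));
	return 0;
}
```

On a kernel without CONFIG_KSM the files are absent and print_ksm_ratios()
returns -1 without printing anything.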
+ +Izik Eidus, +Hugh Dickins, 30 July 2009 diff --git a/Documentation/vm/map_hugetlb.c b/Documentation/vm/map_hugetlb.c new file mode 100644 index 000000000000..e2bdae37f499 --- /dev/null +++ b/Documentation/vm/map_hugetlb.c @@ -0,0 +1,77 @@ +/* + * Example of using hugepage memory in a user application using the mmap + * system call with MAP_HUGETLB flag. Before running this program make + * sure the administrator has allocated enough default sized huge pages + * to cover the 256 MB allocation. + * + * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. + * That means the addresses starting with 0x800000... will need to be + * specified. Specifying a fixed address is not required on ppc64, i386 + * or x86_64. + */ +#include <stdlib.h> +#include <stdio.h> +#include <unistd.h> +#include <sys/mman.h> +#include <fcntl.h> + +#define LENGTH (256UL*1024*1024) +#define PROTECTION (PROT_READ | PROT_WRITE) + +#ifndef MAP_HUGETLB +#define MAP_HUGETLB 0x40 +#endif + +/* Only ia64 requires this */ +#ifdef __ia64__ +#define ADDR (void *)(0x8000000000000000UL) +#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED) +#else +#define ADDR (void *)(0x0UL) +#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB) +#endif + +void check_bytes(char *addr) +{ + printf("First hex is %x\n", *((unsigned int *)addr)); +} + +void write_bytes(char *addr) +{ + unsigned long i; + + for (i = 0; i < LENGTH; i++) + *(addr + i) = (char)i; +} + +void read_bytes(char *addr) +{ + unsigned long i; + + check_bytes(addr); + for (i = 0; i < LENGTH; i++) + if (*(addr + i) != (char)i) { + printf("Mismatch at %lu\n", i); + break; + } +} + +int main(void) +{ + void *addr; + + addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0); + if (addr == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + printf("Returned address is %p\n", addr); + check_bytes(addr); + write_bytes(addr); + read_bytes(addr); + + munmap(addr, LENGTH); + + return 0; +} diff --git 
a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 0833f44ba16b..3eda8ea00852 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -158,12 +158,12 @@ static uint64_t page_flags[HASH_SIZE]; type __min2 = (y); \ __min1 < __min2 ? __min1 : __min2; }) -unsigned long pages2mb(unsigned long pages) +static unsigned long pages2mb(unsigned long pages) { return (pages * page_size) >> 20; } -void fatal(const char *x, ...) +static void fatal(const char *x, ...) { va_list ap; @@ -178,7 +178,7 @@ void fatal(const char *x, ...) * page flag names */ -char *page_flag_name(uint64_t flags) +static char *page_flag_name(uint64_t flags) { static char buf[65]; int present; @@ -197,7 +197,7 @@ char *page_flag_name(uint64_t flags) return buf; } -char *page_flag_longname(uint64_t flags) +static char *page_flag_longname(uint64_t flags) { static char buf[1024]; int i, n; @@ -221,7 +221,7 @@ char *page_flag_longname(uint64_t flags) * page list and summary */ -void show_page_range(unsigned long offset, uint64_t flags) +static void show_page_range(unsigned long offset, uint64_t flags) { static uint64_t flags0; static unsigned long index; @@ -241,12 +241,12 @@ void show_page_range(unsigned long offset, uint64_t flags) count = 1; } -void show_page(unsigned long offset, uint64_t flags) +static void show_page(unsigned long offset, uint64_t flags) { printf("%lu\t%s\n", offset, page_flag_name(flags)); } -void show_summary(void) +static void show_summary(void) { int i; @@ -272,7 +272,7 @@ void show_summary(void) * page flag filters */ -int bit_mask_ok(uint64_t flags) +static int bit_mask_ok(uint64_t flags) { int i; @@ -289,7 +289,7 @@ int bit_mask_ok(uint64_t flags) return 1; } -uint64_t expand_overloaded_flags(uint64_t flags) +static uint64_t expand_overloaded_flags(uint64_t flags) { /* SLOB/SLUB overload several page flags */ if (flags & BIT(SLAB)) { @@ -308,7 +308,7 @@ uint64_t expand_overloaded_flags(uint64_t flags) return flags; } -uint64_t 
well_known_flags(uint64_t flags) +static uint64_t well_known_flags(uint64_t flags) { /* hide flags intended only for kernel hacker */ flags &= ~KPF_HACKERS_BITS; @@ -325,7 +325,7 @@ uint64_t well_known_flags(uint64_t flags) * page frame walker */ -int hash_slot(uint64_t flags) +static int hash_slot(uint64_t flags) { int k = HASH_KEY(flags); int i; @@ -352,7 +352,7 @@ int hash_slot(uint64_t flags) exit(EXIT_FAILURE); } -void add_page(unsigned long offset, uint64_t flags) +static void add_page(unsigned long offset, uint64_t flags) { flags = expand_overloaded_flags(flags); @@ -371,7 +371,7 @@ void add_page(unsigned long offset, uint64_t flags) total_pages++; } -void walk_pfn(unsigned long index, unsigned long count) +static void walk_pfn(unsigned long index, unsigned long count) { unsigned long batch; unsigned long n; @@ -404,7 +404,7 @@ void walk_pfn(unsigned long index, unsigned long count) } } -void walk_addr_ranges(void) +static void walk_addr_ranges(void) { int i; @@ -428,7 +428,7 @@ void walk_addr_ranges(void) * user interface */ -const char *page_flag_type(uint64_t flag) +static const char *page_flag_type(uint64_t flag) { if (flag & KPF_HACKERS_BITS) return "(r)"; @@ -437,7 +437,7 @@ const char *page_flag_type(uint64_t flag) return " "; } -void usage(void) +static void usage(void) { int i, j; @@ -482,7 +482,7 @@ void usage(void) "(r) raw mode bits (o) overloaded bits\n"); } -unsigned long long parse_number(const char *str) +static unsigned long long parse_number(const char *str) { unsigned long long n; @@ -494,16 +494,16 @@ unsigned long long parse_number(const char *str) return n; } -void parse_pid(const char *str) +static void parse_pid(const char *str) { opt_pid = parse_number(str); } -void parse_file(const char *name) +static void parse_file(const char *name) { } -void add_addr_range(unsigned long offset, unsigned long size) +static void add_addr_range(unsigned long offset, unsigned long size) { if (nr_addr_ranges >= MAX_ADDR_RANGES) fatal("too much addr 
ranges\n"); @@ -513,7 +513,7 @@ void add_addr_range(unsigned long offset, unsigned long size) nr_addr_ranges++; } -void parse_addr_range(const char *optarg) +static void parse_addr_range(const char *optarg) { unsigned long offset; unsigned long size; @@ -547,7 +547,7 @@ void parse_addr_range(const char *optarg) add_addr_range(offset, size); } -void add_bits_filter(uint64_t mask, uint64_t bits) +static void add_bits_filter(uint64_t mask, uint64_t bits) { if (nr_bit_filters >= MAX_BIT_FILTERS) fatal("too much bit filters\n"); @@ -557,7 +557,7 @@ void add_bits_filter(uint64_t mask, uint64_t bits) nr_bit_filters++; } -uint64_t parse_flag_name(const char *str, int len) +static uint64_t parse_flag_name(const char *str, int len) { int i; @@ -577,7 +577,7 @@ uint64_t parse_flag_name(const char *str, int len) return parse_number(str); } -uint64_t parse_flag_names(const char *str, int all) +static uint64_t parse_flag_names(const char *str, int all) { const char *p = str; uint64_t flags = 0; @@ -596,7 +596,7 @@ uint64_t parse_flag_names(const char *str, int all) return flags; } -void parse_bits_mask(const char *optarg) +static void parse_bits_mask(const char *optarg) { uint64_t mask; uint64_t bits; @@ -621,7 +621,7 @@ void parse_bits_mask(const char *optarg) } -struct option opts[] = { +static struct option opts[] = { { "raw" , 0, NULL, 'r' }, { "pid" , 1, NULL, 'p' }, { "file" , 1, NULL, 'f' }, diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c index df3227605d59..92e729f4b676 100644 --- a/Documentation/vm/slabinfo.c +++ b/Documentation/vm/slabinfo.c @@ -87,7 +87,7 @@ int page_size; regex_t pattern; -void fatal(const char *x, ...) +static void fatal(const char *x, ...) { va_list ap; @@ -97,7 +97,7 @@ void fatal(const char *x, ...) exit(EXIT_FAILURE); } -void usage(void) +static void usage(void) { printf("slabinfo 5/7/2007. 
(c) 2007 sgi.\n\n" "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n" @@ -131,7 +131,7 @@ void usage(void) ); } -unsigned long read_obj(const char *name) +static unsigned long read_obj(const char *name) { FILE *f = fopen(name, "r"); @@ -151,7 +151,7 @@ unsigned long read_obj(const char *name) /* * Get the contents of an attribute */ -unsigned long get_obj(const char *name) +static unsigned long get_obj(const char *name) { if (!read_obj(name)) return 0; @@ -159,7 +159,7 @@ unsigned long get_obj(const char *name) return atol(buffer); } -unsigned long get_obj_and_str(const char *name, char **x) +static unsigned long get_obj_and_str(const char *name, char **x) { unsigned long result = 0; char *p; @@ -178,7 +178,7 @@ unsigned long get_obj_and_str(const char *name, char **x) return result; } -void set_obj(struct slabinfo *s, const char *name, int n) +static void set_obj(struct slabinfo *s, const char *name, int n) { char x[100]; FILE *f; @@ -192,7 +192,7 @@ void set_obj(struct slabinfo *s, const char *name, int n) fclose(f); } -unsigned long read_slab_obj(struct slabinfo *s, const char *name) +static unsigned long read_slab_obj(struct slabinfo *s, const char *name) { char x[100]; FILE *f; @@ -215,7 +215,7 @@ unsigned long read_slab_obj(struct slabinfo *s, const char *name) /* * Put a size string together */ -int store_size(char *buffer, unsigned long value) +static int store_size(char *buffer, unsigned long value) { unsigned long divisor = 1; char trailer = 0; @@ -247,7 +247,7 @@ int store_size(char *buffer, unsigned long value) return n; } -void decode_numa_list(int *numa, char *t) +static void decode_numa_list(int *numa, char *t) { int node; int nr; @@ -272,7 +272,7 @@ void decode_numa_list(int *numa, char *t) } } -void slab_validate(struct slabinfo *s) +static void slab_validate(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -280,7 +280,7 @@ void slab_validate(struct slabinfo *s) set_obj(s, "validate", 1); } -void slab_shrink(struct slabinfo *s) 
+static void slab_shrink(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -290,7 +290,7 @@ void slab_shrink(struct slabinfo *s) int line = 0; -void first_line(void) +static void first_line(void) { if (show_activity) printf("Name Objects Alloc Free %%Fast Fallb O\n"); @@ -302,7 +302,7 @@ void first_line(void) /* * Find the shortest alias of a slab */ -struct aliasinfo *find_one_alias(struct slabinfo *find) +static struct aliasinfo *find_one_alias(struct slabinfo *find) { struct aliasinfo *a; struct aliasinfo *best = NULL; @@ -318,18 +318,18 @@ struct aliasinfo *find_one_alias(struct slabinfo *find) return best; } -unsigned long slab_size(struct slabinfo *s) +static unsigned long slab_size(struct slabinfo *s) { return s->slabs * (page_size << s->order); } -unsigned long slab_activity(struct slabinfo *s) +static unsigned long slab_activity(struct slabinfo *s) { return s->alloc_fastpath + s->free_fastpath + s->alloc_slowpath + s->free_slowpath; } -void slab_numa(struct slabinfo *s, int mode) +static void slab_numa(struct slabinfo *s, int mode) { int node; @@ -374,7 +374,7 @@ void slab_numa(struct slabinfo *s, int mode) line++; } -void show_tracking(struct slabinfo *s) +static void show_tracking(struct slabinfo *s) { printf("\n%s: Kernel object allocation\n", s->name); printf("-----------------------------------------------------------------------\n"); @@ -392,7 +392,7 @@ void show_tracking(struct slabinfo *s) } -void ops(struct slabinfo *s) +static void ops(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -405,14 +405,14 @@ void ops(struct slabinfo *s) printf("\n%s has no kmem_cache operations\n", s->name); } -const char *onoff(int x) +static const char *onoff(int x) { if (x) return "On "; return "Off"; } -void slab_stats(struct slabinfo *s) +static void slab_stats(struct slabinfo *s) { unsigned long total_alloc; unsigned long total_free; @@ -477,7 +477,7 @@ void slab_stats(struct slabinfo *s) s->deactivate_to_tail, (s->deactivate_to_tail 
* 100) / total); } -void report(struct slabinfo *s) +static void report(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -518,7 +518,7 @@ void report(struct slabinfo *s) slab_stats(s); } -void slabcache(struct slabinfo *s) +static void slabcache(struct slabinfo *s) { char size_str[20]; char dist_str[40]; @@ -593,7 +593,7 @@ void slabcache(struct slabinfo *s) /* * Analyze debug options. Return false if something is amiss. */ -int debug_opt_scan(char *opt) +static int debug_opt_scan(char *opt) { if (!opt || !opt[0] || strcmp(opt, "-") == 0) return 1; @@ -642,7 +642,7 @@ int debug_opt_scan(char *opt) return 1; } -int slab_empty(struct slabinfo *s) +static int slab_empty(struct slabinfo *s) { if (s->objects > 0) return 0; @@ -657,7 +657,7 @@ int slab_empty(struct slabinfo *s) return 1; } -void slab_debug(struct slabinfo *s) +static void slab_debug(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -717,7 +717,7 @@ void slab_debug(struct slabinfo *s) set_obj(s, "trace", 1); } -void totals(void) +static void totals(void) { struct slabinfo *s; @@ -976,7 +976,7 @@ void totals(void) b1, b2, b3); } -void sort_slabs(void) +static void sort_slabs(void) { struct slabinfo *s1,*s2; @@ -1005,7 +1005,7 @@ void sort_slabs(void) } } -void sort_aliases(void) +static void sort_aliases(void) { struct aliasinfo *a1,*a2; @@ -1030,7 +1030,7 @@ void sort_aliases(void) } } -void link_slabs(void) +static void link_slabs(void) { struct aliasinfo *a; struct slabinfo *s; @@ -1048,7 +1048,7 @@ void link_slabs(void) } } -void alias(void) +static void alias(void) { struct aliasinfo *a; char *active = NULL; @@ -1079,7 +1079,7 @@ void alias(void) } -void rename_slabs(void) +static void rename_slabs(void) { struct slabinfo *s; struct aliasinfo *a; @@ -1102,12 +1102,12 @@ void rename_slabs(void) } } -int slab_mismatch(char *slab) +static int slab_mismatch(char *slab) { return regexec(&pattern, slab, 0, NULL, 0); } -void read_slab_dir(void) +static void read_slab_dir(void) { 
DIR *dir; struct dirent *de; @@ -1209,7 +1209,7 @@ void read_slab_dir(void) fatal("Too many aliases\n"); } -void output_slabs(void) +static void output_slabs(void) { struct slabinfo *slab; diff --git a/Documentation/watchdog/src/watchdog-test.c b/Documentation/watchdog/src/watchdog-test.c index 65f6c19cb865..a750532ffcf8 100644 --- a/Documentation/watchdog/src/watchdog-test.c +++ b/Documentation/watchdog/src/watchdog-test.c @@ -18,7 +18,7 @@ int fd; * the PC Watchdog card to reset its internal timer so it doesn't trigger * a computer reset. */ -void keep_alive(void) +static void keep_alive(void) { int dummy; diff --git a/Documentation/x86/earlyprintk.txt b/Documentation/x86/earlyprintk.txt index 607b1a016064..f19802c0f485 100644 --- a/Documentation/x86/earlyprintk.txt +++ b/Documentation/x86/earlyprintk.txt @@ -7,7 +7,7 @@ and two USB cables, connected like this: [host/target] <-------> [USB debug key] <-------> [client/console] -1. There are three specific hardware requirements: +1. There are a number of specific hardware requirements: a.) Host/target system needs to have USB debug port capability. @@ -42,7 +42,35 @@ and two USB cables, connected like this: This is a small blue plastic connector with two USB connections, it draws power from its USB connections. - c.) Thirdly, you need a second client/console system with a regular USB port. + c.) You need a second client/console system with a high speed USB 2.0 + port. + + d.) The Netchip device must be plugged directly into the physical + debug port on the "host/target" system. You cannot use a USB hub in + between the physical debug port and the "host/target" system. + + The EHCI debug controller is bound to a specific physical USB + port and the Netchip device will only work as an early printk + device in this port. The EHCI host controllers are electrically + wired such that the EHCI debug controller is hooked up to the + first physical port and there is no way to change this via software.
+ You can find the physical port through experimentation by trying + each physical port on the system and rebooting. Or you can try + and use lsusb or look at the kernel info messages emitted by the + usb stack when you plug a usb device into various ports on the + "host/target" system. + + Some hardware vendors do not expose the usb debug port with a + physical connector and if you find such a device send a complaint + to the hardware vendor, because there is no reason not to wire + this port into one of the physically accessible ports. + + e.) It is also important to note that many versions of the Netchip + device require the "client/console" system to be plugged into the + right-hand side of the device (with the product logo facing up and + readable left to right). The reason is that the 5 volt + power supply is taken from only one side of the device and it + must be the side that does not get rebooted. 2. Software requirements: @@ -56,6 +84,13 @@ and two USB cables, connected like this: (If you are using Grub, append it to the 'kernel' line in /etc/grub.conf) + On systems with more than one EHCI debug controller you must + specify the correct EHCI debug controller number. The ordering + comes from the PCI bus enumeration of the EHCI controllers. The + default with no number argument is "0", the first EHCI debug + controller. To use the second EHCI debug controller, you would + use the command line: "earlyprintk=dbgp1" + NOTE: normally earlyprintk console gets turned off once the regular console is alive - use "earlyprintk=dbgp,keep" to keep this channel open beyond early bootup. This can be useful for |