diff options
Diffstat (limited to 'Documentation/admin-guide')
45 files changed, 750 insertions, 163 deletions
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst index 8d3a2d045c0a..bb5032a99234 100644 --- a/Documentation/admin-guide/bcache.rst +++ b/Documentation/admin-guide/bcache.rst @@ -204,7 +204,7 @@ For example:: This should present your unmodified backing device data in /dev/loop0 If your cache is in writethrough mode, then you can safely discard the -cache device without loosing data. +cache device without losing data. E) Wiping a cache device diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst index 16253eda192e..dabb80cdd25a 100644 --- a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst +++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst @@ -106,7 +106,7 @@ Proportional weight policy files see Documentation/block/bfq-iosched.rst. blkio.bfq.weight_device - Specifes per cgroup per device weights, overriding the default group + Specifies per cgroup per device weights, overriding the default group weight. For more details, see Documentation/block/bfq-iosched.rst. Following is the format:: diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5db4c4dd5bb4..f67c0829350b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -624,7 +624,7 @@ and is an example of this type. Limits ------ -A child can only consume upto the configured amount of the resource. +A child can only consume up to the configured amount of the resource. Limits can be over-committed - the sum of the limits of children can exceed the amount of resource available to the parent. @@ -642,11 +642,11 @@ on an IO device and is an example of this type. Protections ----------- -A cgroup is protected upto the configured amount of the resource +A cgroup is protected up to the configured amount of the resource as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case -only upto the amount available to the parent is protected among +only up to the amount available to the parent is protected among children. Protections are in the range [0, max] and defaults to 0, which is @@ -1079,7 +1079,7 @@ All time durations are in microseconds. $MAX $PERIOD - which indicates that the group may consume upto $MAX in each + which indicates that the group may consume up to $MAX in each $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. @@ -2289,7 +2289,7 @@ Cpuset Interface Files For a valid partition root with the sibling cpu exclusivity rule enabled, changes made to "cpuset.cpus" that violate the exclusivity rule will invalidate the partition as well as its - sibiling partitions with conflicting cpuset.cpus values. So + sibling partitions with conflicting cpuset.cpus values. So care must be taking in changing "cpuset.cpus". A valid non-root parent partition may distribute out all its CPUs diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index ed3b8dc854ec..2e151cd8c2e4 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -399,7 +399,7 @@ A partial list of the supported mount options follows: sep if first mount option (after the -o), overrides the comma as the separator between the mount - parms. e.g.:: + parameters. e.g.:: -o user=myname,password=mypassword,domain=mydom @@ -765,7 +765,7 @@ cifsFYI If set to non-zero value, additional debug information Some debugging statements are not compiled into the cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the kernel configuration. cifsFYI may be set to one or - nore of the following flags (7 sets them all):: + more of the following flags (7 sets them all):: +-----------------------------------------------+------+ | log cifs informational messages | 0x01 | diff --git a/Documentation/admin-guide/device-mapper/cache-policies.rst b/Documentation/admin-guide/device-mapper/cache-policies.rst index b17fe352fc41..13da4d831d46 100644 --- a/Documentation/admin-guide/device-mapper/cache-policies.rst +++ b/Documentation/admin-guide/device-mapper/cache-policies.rst @@ -70,7 +70,7 @@ the entries (each hotspot block covers a larger area than a single cache block). All this means smq uses ~25bytes per cache block. Still a lot of -memory, but a substantial improvement nontheless. +memory, but a substantial improvement nonetheless. Level balancing ^^^^^^^^^^^^^^^ diff --git a/Documentation/admin-guide/device-mapper/dm-ebs.rst b/Documentation/admin-guide/device-mapper/dm-ebs.rst index 534fa38e8862..c09f66db5621 100644 --- a/Documentation/admin-guide/device-mapper/dm-ebs.rst +++ b/Documentation/admin-guide/device-mapper/dm-ebs.rst @@ -31,7 +31,7 @@ Mandatory parameters: Optional parameter: - <underyling sectors>: + <underlying sectors>: Number of sectors defining the logical block size of <dev path>. 2^N supported, e.g. 8 = emulate 8 sectors of 512 bytes = 4KiB. If not provided, the logical block size of <dev path> will be used. diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst index 0fac051caeac..932383fe6e88 100644 --- a/Documentation/admin-guide/device-mapper/dm-zoned.rst +++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst @@ -46,7 +46,7 @@ just like conventional zones. The zones of the device(s) are separated into 2 types: 1) Metadata zones: these are conventional zones used to store metadata. -Metadata zones are not reported as useable capacity to the user. +Metadata zones are not reported as usable capacity to the user. 2) Data zones: all remaining zones, the vast majority of which will be sequential zones used exclusively to store user data. The conventional diff --git a/Documentation/admin-guide/device-mapper/unstriped.rst b/Documentation/admin-guide/device-mapper/unstriped.rst index 0a8d3eb3f072..5772ccdd1f5f 100644 --- a/Documentation/admin-guide/device-mapper/unstriped.rst +++ b/Documentation/admin-guide/device-mapper/unstriped.rst @@ -35,7 +35,7 @@ An example of undoing an existing dm-stripe This small bash script will setup 4 loop devices and use the existing striped target to combine the 4 devices into one. It then will use -the unstriped target ontop of the striped device to access the +the unstriped target on top of the striped device to access the individual backing loop devices. We write data to the newly exposed unstriped devices and verify the data written matches the correct underlying device on the striped array:: @@ -110,8 +110,8 @@ to get a 92% reduction in read latency using this device mapper target. Example dmsetup usage ===================== -unstriped ontop of Intel NVMe device that has 2 cores ------------------------------------------------------ +unstriped on top of Intel NVMe device that has 2 cores +------------------------------------------------------ :: @@ -124,8 +124,8 @@ respectively:: /dev/mapper/nvmset0 /dev/mapper/nvmset1 -unstriped ontop of striped with 4 drives using 128K chunk size --------------------------------------------------------------- +unstriped on top of striped with 4 drives using 128K chunk size +--------------------------------------------------------------- :: diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index faa22f77847a..8dc668cc1216 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -330,7 +330,7 @@ Examples // boot-args example, with newlines and comments for readability Kernel command line: ... - // see whats going on in dyndbg=value processing + // see what's going on in dyndbg=value processing dynamic_debug.verbose=3 // enable pr_debugs in the btrfs module (can be builtin or loadable) btrfs.dyndbg="+p" diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index d8a90c81b9ee..1cc5567a4bbe 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -123,7 +123,7 @@ Each simulated GPIO chip creates a separate sysfs group under its device directory for each exposed line (e.g. ``/sys/devices/platform/gpio-sim.X/gpiochipY/``). The name of each group is of the form: ``'sim_gpioX'`` where X is the offset of the line. Inside each -group there are two attibutes: +group there are two attributes: ``pull`` - allows to read and set the current simulated pull setting for every line, when writing the value must be one of: ``'pull-up'``, diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst index 2d19c9f4c1fe..f491de74ea79 100644 --- a/Documentation/admin-guide/hw-vuln/mds.rst +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -64,8 +64,8 @@ architecture section: :ref:`Documentation/x86/mds.rst <mds>`. Attack scenarios ---------------- -Attacks against the MDS vulnerabilities can be mounted from malicious non -priviledged user space applications running on hosts or guest. Malicious +Attacks against the MDS vulnerabilities can be mounted from malicious non- +privileged user space applications running on hosts or guest. Malicious guest OSes can obviously mount attacks as well. Contrary to other speculation based vulnerabilities the MDS vulnerability diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 0571938ecdc8..0ad7e7ec0d27 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -56,6 +56,17 @@ ABI will be found here. sysfs-rules +This is the beginning of a section with information of interest to +application developers and system integrators doing analysis of the +Linux kernel for safety critical applications. Documents supporting +analysis of kernel interactions with applications, and key kernel +subsystems expectations will be found here. + +.. toctree:: + :maxdepth: 1 + + workload-tracing + The rest of this manual consists of various unordered guides on how to configure specific aspects of kernel behavior to your liking. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 5e4ee29cf393..fae4cf7e78a6 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -378,18 +378,16 @@ autoconf= [IPV6] See Documentation/networking/ipv6.rst. - show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller - Limit apic dumping. The parameter defines the maximal - number of local apics being dumped. Also it is possible - to set it to "all" by meaning -- no limit here. - Format: { 1 (default) | 2 | ... | all }. - The parameter valid if only apic=debug or - apic=verbose is specified. - Example: apic=debug show_lapic=all - apm= [APM] Advanced Power Management See header of arch/x86/kernel/apm_32.c. + apparmor= [APPARMOR] Disable or enable AppArmor at boot time + Format: { "0" | "1" } + See security/apparmor/Kconfig help text + 0 -- disable. + 1 -- enable. + Default value is set via kernel config option. + arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: <io>,<irq>,<nodeID> @@ -480,8 +478,10 @@ See Documentation/block/cmdline-partition.rst boot_delay= Milliseconds to delay each printk during boot. - Values larger than 10 seconds (10000) are changed to - no delay (0). + Only works if CONFIG_BOOT_PRINTK_DELAY is enabled, + and you may also have to specify "lpj=". Boot_delay + values larger than 10 seconds (10000) are assumed + erroneous and ignored. Format: integer bootconfig [KNL] @@ -673,7 +673,7 @@ Sets the size of kernel per-numa memory area for contiguous memory allocations. A value of 0 disables per-numa CMA altogether. And If this option is not - specificed, the default value is 0. + specified, the default value is 0. With per-numa CMA enabled, DMA users on node nid will first try to allocate buffer from the pernuma area which is located in node nid, if the allocation fails, @@ -945,7 +945,7 @@ driver code when a CPU writes to (or reads from) a random memory location. Note that there exists a class of memory corruptions problems caused by buggy H/W or - F/W or by drivers badly programing DMA (basically when + F/W or by drivers badly programming DMA (basically when memory is written at bus level and the CPU MMU is bypassed) which are not detectable by CONFIG_DEBUG_PAGEALLOC, hence this option will not help @@ -1046,26 +1046,12 @@ can be useful when debugging issues that require an SLB miss to occur. - stress_slb [PPC] - Limits the number of kernel SLB entries, and flushes - them frequently to increase the rate of SLB faults - on kernel addresses. - - stress_hpt [PPC] - Limits the number of kernel HPT entries in the hash - page table to increase the rate of hash page table - faults on kernel addresses. - disable= [IPV6] See Documentation/networking/ipv6.rst. disable_radix [PPC] Disable RADIX MMU mode on POWER9 - radix_hcall_invalidate=on [PPC/PSERIES] - Disable RADIX GTSE feature and use hcall for TLB - invalidate. - disable_tlbie [PPC] Disable TLBIE instruction. Currently does not work with KVM, with HASH MMU, or with coherent accelerators. @@ -1167,16 +1153,6 @@ Documentation/admin-guide/dynamic-debug-howto.rst for details. - nopku [X86] Disable Memory Protection Keys CPU feature found - in some Intel CPUs. - - <module>.async_probe[=<bool>] [KNL] - If no <bool> value is specified or if the value - specified is not a valid <bool>, enable asynchronous - probe on this module. Otherwise, enable/disable - asynchronous probe on this module as indicated by the - <bool> value. See also: module.async_probe - early_ioremap_debug [KNL] Enable debug messages in early_ioremap support. This is useful for tracking down temporary early mappings @@ -1753,7 +1729,7 @@ boot-time allocation of gigantic hugepages is skipped. hugetlb_free_vmemmap= - [KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP + [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP enabled. Control if HugeTLB Vmemmap Optimization (HVO) is enabled. Allows heavy hugetlb users to free up some more @@ -1792,12 +1768,6 @@ which allow the hypervisor to 'idle' the guest on lock contention. - keep_bootcon [KNL] - Do not unregister boot console at start. This is only - useful for debugging when something happens in the window - between unregistering the boot console and initializing - the real console. - i2c_bus= [HW] Override the default board specific I2C bus speed or register an additional I2C bus that is not registered from board initialization code. @@ -2367,17 +2337,18 @@ js= [HW,JOY] Analog joystick See Documentation/input/joydev/joystick.rst. - nokaslr [KNL] - When CONFIG_RANDOMIZE_BASE is set, this disables - kernel and module base offset ASLR (Address Space - Layout Randomization). - kasan_multi_shot [KNL] Enforce KASAN (Kernel Address Sanitizer) to print report on every invalid memory access. Without this parameter KASAN will print report only for the first invalid access. + keep_bootcon [KNL] + Do not unregister boot console at start. This is only + useful for debugging when something happens in the window + between unregistering the boot console and initializing + the real console. + keepinitrd [HW,ARM] kernelcore= [KNL,X86,IA-64,PPC] @@ -3326,6 +3297,13 @@ For details see: Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + <module>.async_probe[=<bool>] [KNL] + If no <bool> value is specified or if the value + specified is not a valid <bool>, enable asynchronous + probe on this module. Otherwise, enable/disable + asynchronous probe on this module as indicated by the + <bool> value. See also: module.async_probe + module.async_probe=<bool> [KNL] When set to true, modules will use async probing by default. To enable/disable async probing for a @@ -3709,7 +3687,7 @@ implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP to be effective. This is useful on platforms where the sleep(SH) or wfi(ARM,ARM64) instructions do not work - correctly or when doing power measurements to evalute + correctly or when doing power measurements to evaluate the impact of the sleep instructions. This is also useful when using JTAG debugger. @@ -3780,6 +3758,11 @@ nojitter [IA-64] Disables jitter checking for ITC timers. + nokaslr [KNL] + When CONFIG_RANDOMIZE_BASE is set, this disables + kernel and module base offset ASLR (Address Space + Layout Randomization). + no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page @@ -3825,6 +3808,19 @@ nopcid [X86-64] Disable the PCID cpu feature. + nopku [X86] Disable Memory Protection Keys CPU feature found + in some Intel CPUs. + + nopv= [X86,XEN,KVM,HYPER_V,VMWARE] + Disables the PV optimizations forcing the guest to run + as generic guest with no PV drivers. Currently support + XEN HVM, KVM, HYPER_V and VMWARE guest. + + nopvspin [X86,XEN,KVM] + Disables the qspinlock slow path using PV optimizations + which allow the hypervisor to 'idle' the guest on lock + contention. + norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space @@ -4592,6 +4588,10 @@ r128= [HW,DRM] + radix_hcall_invalidate=on [PPC/PSERIES] + Disable RADIX GTSE feature and use hcall for TLB + invalidate. + raid= [HW,RAID] See Documentation/admin-guide/md.rst. @@ -5584,13 +5584,6 @@ 1 -- enable. Default value is 1. - apparmor= [APPARMOR] Disable or enable AppArmor at boot time - Format: { "0" | "1" } - See security/apparmor/Kconfig help text - 0 -- disable. - 1 -- enable. - Default value is set via kernel config option. - serialnumber [BUGS=X86-32] sev=option[,option...] [X86-64] See Documentation/x86/x86_64/boot-options.rst @@ -5598,6 +5591,15 @@ shapers= [NET] Maximal number of shapers. + show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller + Limit apic dumping. The parameter defines the maximal + number of local apics being dumped. Also it is possible + to set it to "all" by meaning -- no limit here. + Format: { 1 (default) | 2 | ... | all }. + The parameter valid if only apic=debug or + apic=verbose is specified. + Example: apic=debug show_lapic=all + simeth= [IA-64] simscsi= @@ -6037,6 +6039,16 @@ be used to filter out binaries which have not yet been made aware of AT_MINSIGSTKSZ. + stress_hpt [PPC] + Limits the number of kernel HPT entries in the hash + page table to increase the rate of hash page table + faults on kernel addresses. + + stress_slb [PPC] + Limits the number of kernel SLB entries, and flushes + them frequently to increase the rate of SLB faults + on kernel addresses. + sunrpc.min_resvport= sunrpc.max_resvport= [NFS,SUNRPC] @@ -6290,7 +6302,7 @@ that can be enabled or disabled just as if you were to echo the option name into - /sys/kernel/debug/tracing/trace_options + /sys/kernel/tracing/trace_options For example, to enable stacktrace option (to dump the stack trace of each event), add to the command line: @@ -6323,7 +6335,7 @@ [FTRACE] enable this option to disable tracing when a warning is hit. This turns off "tracing_on". Tracing can be enabled again by echoing '1' into the "tracing_on" - file located in /sys/kernel/debug/tracing/ + file located in /sys/kernel/tracing/ This option is useful, as it disables the trace before the WARNING dump is called, which prevents the trace to @@ -6778,11 +6790,11 @@ functions are at fixed addresses, they make nice targets for exploits that can control RIP. - emulate [default] Vsyscalls turn into traps and are - emulated reasonably safely. The vsyscall - page is readable. + emulate Vsyscalls turn into traps and are emulated + reasonably safely. The vsyscall page is + readable. - xonly Vsyscalls turn into traps and are + xonly [default] Vsyscalls turn into traps and are emulated reasonably safely. The vsyscall page is not readable. @@ -6979,16 +6991,6 @@ fairer and the number of possible event channels is much higher. Default is on (use fifo events). - nopv= [X86,XEN,KVM,HYPER_V,VMWARE] - Disables the PV optimizations forcing the guest to run - as generic guest with no PV drivers. Currently support - XEN HVM, KVM, HYPER_V and VMWARE guest. - - nopvspin [X86,XEN,KVM] - Disables the qspinlock slow path using PV optimizations - which allow the hypervisor to 'idle' the guest on lock - contention. - xirc2ps_cs= [NET,PCMCIA] Format: <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index e4a5fc26f1a9..993c2a05f5ee 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -25,7 +25,7 @@ References - In order to locate kernel-generated OS jitter on CPU N: - cd /sys/kernel/debug/tracing + cd /sys/kernel/tracing echo 1 > max_graph_depth # Increase the "1" for more detail echo function_graph > current_tracer # run workload diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 475eb0e81e4a..e27a1c3f634e 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -1488,7 +1488,7 @@ Example of command to set keyboard language is mentioned below:: Text corresponding to keyboard layout to be set in sysfs are: be(Belgian), cz(Czech), da(Danish), de(German), en(English), es(Spain), et(Estonian), fr(French), fr-ch(French(Switzerland)), hu(Hungarian), it(Italy), jp (Japan), -nl(Dutch), nn(Norway), pl(Polish), pt(portugese), sl(Slovenian), sv(Sweden), +nl(Dutch), nn(Norway), pl(Polish), pt(portuguese), sl(Slovenian), sv(Sweden), tr(Turkey) WWAN Antenna type diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst index d8fc9a59c086..4ff2cc291d18 100644 --- a/Documentation/admin-guide/md.rst +++ b/Documentation/admin-guide/md.rst @@ -317,7 +317,7 @@ All md devices contain: suspended (not supported yet) All IO requests will block. The array can be reconfigured. - Writing this, if accepted, will block until array is quiessent + Writing this, if accepted, will block until array is quiescent readonly no resync can happen. no superblocks get written. diff --git a/Documentation/admin-guide/media/bttv.rst b/Documentation/admin-guide/media/bttv.rst index 125f6f47123d..58cbaf6df694 100644 --- a/Documentation/admin-guide/media/bttv.rst +++ b/Documentation/admin-guide/media/bttv.rst @@ -909,7 +909,7 @@ DE hat diverse Treiber fuer diese Modelle (Stand 09/2002): - TVPhone98 (Bt878) - AVerTV und TVCapture98 w/VCR (Bt 878) - AVerTVStudio und TVPhone98 w/VCR (Bt878) - - AVerTV GO Serie (Kein SVideo Input) + - AVerTV GO Series (Kein SVideo Input) - AVerTV98 (BT-878 chip) - AVerTV98 mit Fernbedienung (BT-878 chip) - AVerTV/FM98 (BT-878 chip) diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst index 2d660b76caea..a06473429916 100644 --- a/Documentation/admin-guide/media/building.rst +++ b/Documentation/admin-guide/media/building.rst @@ -137,7 +137,7 @@ The ``LIRC user interface`` option adds enhanced functionality when using the from remote controllers. The ``Support for eBPF programs attached to lirc devices`` option allows -the usage of special programs (called eBPF) that would allow aplications +the usage of special programs (called eBPF) that would allow applications to add extra remote controller decoding functionality to the Linux Kernel. The ``Remote controller decoders`` option allows selecting the diff --git a/Documentation/admin-guide/media/si476x.rst b/Documentation/admin-guide/media/si476x.rst index 87062301d6a1..c8882ee9f208 100644 --- a/Documentation/admin-guide/media/si476x.rst +++ b/Documentation/admin-guide/media/si476x.rst @@ -142,7 +142,7 @@ The drivers exposes following files: indicator 0x18 lassi Signed Low side adjacent Channel Strength indicator - 0x19 hassi ditto fpr High side + 0x19 hassi ditto for High side 0x20 mult Multipath indicator 0x21 dev Frequency deviation 0x24 assi Adjacent channel SSI diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst index 672a8371f6ad..58ac25b2c385 100644 --- a/Documentation/admin-guide/media/vivid.rst +++ b/Documentation/admin-guide/media/vivid.rst @@ -580,7 +580,7 @@ Metadata Capture ---------------- The Metadata capture generates UVC format metadata. The PTS and SCR are -transmitted based on the values set in vivid contols. +transmitted based on the values set in vivid controls. The Metadata device will only work for the Webcam input, it will give back an error for all other inputs. diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst index c79f1e336222..e796b0a7e4a5 100644 --- a/Documentation/admin-guide/mm/concepts.rst +++ b/Documentation/admin-guide/mm/concepts.rst @@ -1,5 +1,3 @@ -.. _mm_concepts: - ================= Concepts overview ================= @@ -86,16 +84,15 @@ memory with the huge pages. The first one is `HugeTLB filesystem`, or hugetlbfs. It is a pseudo filesystem that uses RAM as its backing store. For the files created in this filesystem the data resides in the memory and mapped using huge pages. The hugetlbfs is described at -:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`. +Documentation/admin-guide/mm/hugetlbpage.rst. Another, more recent, mechanism that enables use of the huge pages is called `Transparent HugePages`, or THP. Unlike the hugetlbfs that requires users and/or system administrators to configure what parts of the system memory should and can be mapped by the huge pages, THP manages such mappings transparently to the user and hence the -name. See -:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>` -for more details about THP. +name. See Documentation/admin-guide/mm/transhuge.rst for more details +about THP. Zones ===== @@ -125,8 +122,8 @@ processor. Each bank is referred to as a `node` and for each node Linux constructs an independent memory management subsystem. A node has its own set of zones, lists of free and used pages and various statistics counters. You can find more details about NUMA in -:ref:`Documentation/mm/numa.rst <numa>` and in -:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`. +Documentation/mm/numa.rst` and in +Documentation/admin-guide/mm/numa_memory_policy.rst. Page cache ========== diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst index c09cace80651..7b0775d281b4 100644 --- a/Documentation/admin-guide/mm/damon/lru_sort.rst +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -54,7 +54,7 @@ that is built with ``CONFIG_DAMON_LRU_SORT=y``. To let sysadmins enable or disable it and tune for the given system, DAMON_LRU_SORT utilizes module parameters. That is, you can put ``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write -proper values to ``/sys/modules/damon_lru_sort/parameters/<parameter>`` files. +proper values to ``/sys/module/damon_lru_sort/parameters/<parameter>`` files. Below are the description of each parameter. @@ -283,7 +283,7 @@ doesn't make progress and therefore the free memory rate becomes lower than 20%, it asks DAMON_LRU_SORT to do nothing again, so that we can fall back to the LRU-list based page granularity reclamation. :: - # cd /sys/modules/damon_lru_sort/parameters + # cd /sys/module/damon_lru_sort/parameters # echo 500 > hot_thres_access_freq # echo 120000000 > cold_min_age # echo 10 > quota_ms diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index 4f1479a11e63..d2ccd9c21b9a 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -46,7 +46,7 @@ that is built with ``CONFIG_DAMON_RECLAIM=y``. To let sysadmins enable or disable it and tune for the given system, DAMON_RECLAIM utilizes module parameters. That is, you can put ``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write -proper values to ``/sys/modules/damon_reclaim/parameters/<parameter>`` files. +proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files. Below are the description of each parameter. @@ -251,7 +251,7 @@ therefore the free memory rate becomes lower than 20%, it asks DAMON_RECLAIM to do nothing again, so that we can fall back to the LRU-list based page granularity reclamation. :: - # cd /sys/modules/damon_reclaim/parameters + # cd /sys/module/damon_reclaim/parameters # echo 30000000 > min_age # echo $((1 * 1024 * 1024 * 1024)) > quota_sz # echo 1000 > quota_reset_interval_ms diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 19f27c0d92e0..bca00cb6f43a 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -1,5 +1,3 @@ -.. _hugetlbpage: - ============= HugeTLB Pages ============= @@ -86,7 +84,7 @@ by increasing or decreasing the value of ``nr_hugepages``. Note: When the feature of freeing unused vmemmap pages associated with each hugetlb page is enabled, we can fail to free the huge pages triggered by -the user when ths system is under memory pressure. Please try again later. +the user when the system is under memory pressure. Please try again later. Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under @@ -313,7 +311,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: #. Regardless of mempolicy mode [see - :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`], + Documentation/admin-guide/mm/numa_memory_policy.rst], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst index df9394fb39c2..b5a285bd73fd 100644 --- a/Documentation/admin-guide/mm/idle_page_tracking.rst +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -1,5 +1,3 @@ -.. _idle_page_tracking: - ================== Idle Page Tracking ================== @@ -70,9 +68,8 @@ If the tool is run initially with the appropriate option, it will mark all the queried pages as idle. Subsequent runs of the tool can then show which pages have their idle flag cleared in the interim. -See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more -information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and -``/proc/kpagecgroup``. +See Documentation/admin-guide/mm/pagemap.rst for more information about +``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. .. _impl_details: diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index d1064e0ba34a..1f883abf3f00 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -16,8 +16,7 @@ are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html Linux memory management has its own jargon and if you are not yet -familiar with it, consider reading -:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`. +familiar with it, consider reading Documentation/admin-guide/mm/concepts.rst. Here we document in detail how to interact with various mechanisms in the Linux memory management. diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index fb6ba2002a4b..eed51a910c94 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -1,5 +1,3 @@ -.. _admin_guide_ksm: - ======================= Kernel Samepage Merging ======================= diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index a3c9e8ad8fa0..1b02fe5807cc 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -1,5 +1,3 @@ -.. _admin_guide_memory_hotplug: - ================== Memory Hot(Un)Plug ================== diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index 5a6afecbb0d0..46515ad2337f 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -1,5 +1,3 @@ -.. _numa_memory_policy: - ================== NUMA Memory Policy ================== @@ -246,7 +244,7 @@ MPOL_INTERLEAVED interleaved system default policy works in this mode. MPOL_PREFERRED_MANY - This mode specifices that the allocation should be preferrably + This mode specifies that the allocation should be preferably satisfied from the nodemask specified in the policy. If there is a memory pressure on all nodes in the nodemask, the allocation can fall back to all existing numa nodes. This is effectively @@ -360,7 +358,7 @@ and NUMA nodes. "Usage" here means one of the following: 2) examination of the policy to determine the policy mode and associated node or node lists, if any, for page allocation. This is considered a "hot path". Note that for MPOL_BIND, the "usage" extends across the entire - allocation process, which may sleep during page reclaimation, because the + allocation process, which may sleep during page reclamation, because the BIND policy nodemask is used, by reference, to filter ineligible nodes. We can avoid taking an extra reference during the usages listed above as diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst index 166697325947..24e63e740420 100644 --- a/Documentation/admin-guide/mm/numaperf.rst +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -1,5 +1,3 @@ -.. _numaperf: - ============= NUMA Locality ============= diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 6e2e416af783..1a22674ab18e 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -1,5 +1,3 @@ -.. _pagemap: - ============================= Examining Process Page Tables ============================= @@ -19,10 +17,10 @@ There are four components to pagemap: * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped * Bit 55 pte is soft-dirty (see - :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`) + Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see - :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`) + Documentation/admin-guide/mm/userfaultfd.rst) * Bits 58-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped @@ -105,8 +103,7 @@ Short descriptions to the page flags A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages - (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`), + pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. @@ -128,7 +125,7 @@ Short descriptions to the page flags Zero page for pfn_zero or huge_zero page. 25 - IDLE The page has not been accessed since it was marked idle (see - :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`). + Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst index 3887f0b294fe..c582033bd113 100644 --- a/Documentation/admin-guide/mm/shrinker_debugfs.rst +++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst @@ -1,5 +1,3 @@ -.. _shrinker_debugfs: - ========================== Shrinker Debugfs Interface ========================== diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst index cb0cfd6672fa..aeea936caa44 100644 --- a/Documentation/admin-guide/mm/soft-dirty.rst +++ b/Documentation/admin-guide/mm/soft-dirty.rst @@ -1,5 +1,3 @@ -.. _soft_dirty: - =============== Soft-Dirty PTEs =============== diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst index e0466f2db8fa..2e630627bcee 100644 --- a/Documentation/admin-guide/mm/swap_numa.rst +++ b/Documentation/admin-guide/mm/swap_numa.rst @@ -1,5 +1,3 @@ -.. _swap_numa: - =========================================== Automatically bind swap device to numa node =========================================== diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 8ee78ec232eb..b0cc8243e093 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -1,5 +1,3 @@ -.. _admin_guide_transhuge: - ============================ Transparent Hugepage Support ============================ diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 83f31919ebb3..7dc823b56ca4 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -1,5 +1,3 @@ -.. _userfaultfd: - =========== Userfaultfd =========== diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 6dd74a18268b..c5c2c7dbb155 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -1,5 +1,3 @@ -.. _zswap: - ===== zswap ===== diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst index 578407e487d6..75a40846d47f 100644 --- a/Documentation/admin-guide/perf/hns3-pmu.rst +++ b/Documentation/admin-guide/perf/hns3-pmu.rst @@ -53,7 +53,7 @@ two events have same value of bits 0~15 of config, that means they are event pair. And the bit 16 of config indicates getting counter 0 or counter 1 of hardware event. -After getting two values of event pair in usersapce, the formula of +After getting two values of event pair in userspace, the formula of computation to calculate real performance data is::: counter 0 / counter 1 diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index d143e72cf93e..6e5298b521b1 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -473,7 +473,7 @@ Unit Tests for amd-pstate * We can introduce more functional or performance tests to align the result together, it will benefit power and performance scale optimization. -1. Test case decriptions +1. Test case descriptions 1). Basic tests diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index d5043cd8d2f5..bf13ad25a32f 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -712,7 +712,7 @@ it works in the `active mode <Active Mode_>`_. The following sequence of shell commands can be used to enable them and see their output (if the kernel is generally configured to support event tracing):: - # cd /sys/kernel/debug/tracing/ + # cd /sys/kernel/tracing/ # echo 1 > events/power/pstate_sample/enable # echo 1 > events/power/cpu_frequency/enable # cat trace @@ -732,7 +732,7 @@ The ``ftrace`` interface can be used for low-level diagnostics of P-state is called, the ``ftrace`` filter can be set to :c:func:`intel_pstate_set_pstate`:: - # cd /sys/kernel/debug/tracing/ + # cd /sys/kernel/tracing/ # cat available_filter_functions | grep -i pstate intel_pstate_set_pstate intel_pstate_cpu_init diff --git a/Documentation/admin-guide/spkguide.txt b/Documentation/admin-guide/spkguide.txt index 1265c1eab31c..74ea7f391942 100644 --- a/Documentation/admin-guide/spkguide.txt +++ b/Documentation/admin-guide/spkguide.txt @@ -1105,8 +1105,8 @@ speakup load Alternatively, you can add the above line to your file ~/.bashrc or ~/.bash_profile. -If your system administrator ran himself the script, all the users will be able -to change from English to the language choosed by root and do directly +If your system administrator himself ran the script, all the users will be able +to change from English to the language chosen by root and do directly speakupconf load (or add this to the ~/.bashrc or ~/.bash_profile file). If there are several languages to handle, the administrator (or every user) will have to run the first steps until speakupconf diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 988f6a4c8084..45ba1f4dc004 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -356,7 +356,7 @@ The lowmem_reserve_ratio is an array. You can see them by reading this file:: But, these values are not used directly. The kernel calculates # of protection pages for each zones from them. These are shown as array of protection pages -in /proc/zoneinfo like followings. (This is an example of x86-64 box). +in /proc/zoneinfo like the following. (This is an example of x86-64 box). Each zone has an array of protection pages like this:: Node 0, zone DMA @@ -433,7 +433,7 @@ a 2bit error in a memory module) is detected in the background by hardware that cannot be handled by the kernel. In some cases (like the page still having a valid copy on disk) the kernel will handle the failure transparently without affecting any applications. But if there is -no other uptodate copy of the data it will kill to prevent any data +no other up-to-date copy of the data it will kill to prevent any data corruptions from propagating. 1: Kill all processes that have the corrupted and not reloadable page mapped diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index 0a178ef0111d..51906e47327b 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -138,7 +138,7 @@ Command Function ``v`` Forcefully restores framebuffer console ``v`` Causes ETM buffer dump [ARM-specific] -``w`` Dumps tasks that are in uninterruptable (blocked) state. +``w`` Dumps tasks that are in uninterruptible (blocked) state. ``x`` Used by xmon interface on ppc/powerpc platforms. Show global PMU Registers on sparc64. diff --git a/Documentation/admin-guide/thermal/intel_powerclamp.rst b/Documentation/admin-guide/thermal/intel_powerclamp.rst index 3ce96043af17..08509b978af4 100644 --- a/Documentation/admin-guide/thermal/intel_powerclamp.rst +++ b/Documentation/admin-guide/thermal/intel_powerclamp.rst @@ -87,7 +87,7 @@ migrated, unless the CPU is taken offline. In this case, threads belong to the offlined CPUs will be terminated immediately. Running as SCHED_FIFO and relatively high priority, also allows such -scheme to work for both preemptable and non-preemptable kernels. +scheme to work for both preemptible and non-preemptible kernels. Alignment of idle time around jiffies ensures scalability for HZ values. This effect can be better visualized using a Perf timechart. The following diagram shows the behavior of kernel thread diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst new file mode 100644 index 000000000000..b2e254ec8ee8 --- /dev/null +++ b/Documentation/admin-guide/workload-tracing.rst @@ -0,0 +1,606 @@ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) + +====================================================== +Discovering Linux kernel subsystems used by a workload +====================================================== + +:Authors: - Shuah Khan <skhan@linuxfoundation.org> + - Shefali Sharma <sshefali021@gmail.com> +:maintained-by: Shuah Khan <skhan@linuxfoundation.org> + +Key Points +========== + + * Understanding system resources necessary to build and run a workload + is important. + * Linux tracing and strace can be used to discover the system resources + in use by a workload. The completeness of the system usage information + depends on the completeness of coverage of a workload. + * Performance and security of the operating system can be analyzed with + the help of tools such as: + `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_, + `stress-ng <https://www.mankier.com/1/stress-ng>`_, + `paxtest <https://github.com/opntr/paxtest-freebsd>`_. + * Once we discover and understand the workload needs, we can focus on them + to avoid regressions and use it to evaluate safety considerations. + +Methodology +=========== + +`strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_ is a +diagnostic, instructional, and debugging tool and can be used to discover +the system resources in use by a workload. Once we discover and understand +the workload needs, we can focus on them to avoid regressions and use it +to evaluate safety considerations. We use strace tool to trace workloads. + +This method of tracing using strace tells us the system calls invoked by +the workload and doesn't include all the system calls that can be invoked +by it. In addition, this tracing method tells us just the code paths within +these system calls that are invoked. As an example, if a workload opens a +file and reads from it successfully, then the success path is the one that +is traced. Any error paths in that system call will not be traced. If there +is a workload that provides full coverage of a workload then the method +outlined here will trace and find all possible code paths. The completeness +of the system usage information depends on the completeness of coverage of a +workload. + +The goal is tracing a workload on a system running a default kernel without +requiring custom kernel installs. + +How do we gather fine-grained system information? +================================================= + +strace tool can be used to trace system calls made by a process and signals +it receives. System calls are the fundamental interface between an +application and the operating system kernel. They enable a program to +request services from the kernel. For instance, the open() system call in +Linux is used to provide access to a file in the file system. strace enables +us to track all the system calls made by an application. It lists all the +system calls made by a process and their resulting output. + +You can generate profiling data combining strace and perf record tools to +record the events and information associated with a process. This provides +insight into the process. "perf annotate" tool generates the statistics of +each instruction of the program. This document goes over the details of how +to gather fine-grained information on a workload's usage of system resources. + +We used strace to trace the perf, stress-ng, paxtest workloads to illustrate +our methodology to discover resources used by a workload. This process can +be applied to trace other workloads. + +Getting the system ready for tracing +==================================== + +Before we can get started we will show you how to get your system ready. +We assume that you have a Linux distribution running on a physical system +or a virtual machine. Most distributions will include strace command. Let’s +install other tools that aren’t usually included to build Linux kernel. +Please note that the following works on Debian based distributions. You +might have to find equivalent packages on other Linux distributions. + +Install tools to build Linux kernel and tools in kernel repository. +scripts/ver_linux is a good way to check if your system already has +the necessary tools:: + + sudo apt-get build-essentials flex bison yacc + sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev + +cscope is a good tool to browse kernel sources. Let's install it now:: + + sudo apt-get install cscope + +Install stress-ng and paxtest:: + + apt-get install stress-ng + apt-get install paxtest + +Workload overview +================= + +As mentioned earlier, we used strace to trace perf bench, stress-ng and +paxtest workloads to show how to analyze a workload and identify Linux +subsystems used by these workloads. Let's start with an overview of these +three workloads to get a better understanding of what they do and how to +use them. + +perf bench (all) workload +------------------------- + +The perf bench command contains multiple multi-threaded microkernel +benchmarks for executing different subsystems in the Linux kernel and +system calls. This allows us to easily measure the impact of changes, +which can help mitigate performance regressions. It also acts as a common +benchmarking framework, enabling developers to easily create test cases, +integrate transparently, and use performance-rich tooling subsystems. + +Stress-ng netdev stressor workload +---------------------------------- + +stress-ng is used for performing stress testing on the kernel. It allows +you to exercise various physical subsystems of the computer, as well as +interfaces of the OS kernel, using "stressor-s". They are available for +CPU, CPU cache, devices, I/O, interrupts, file system, memory, network, +operating system, pipelines, schedulers, and virtual machines. Please refer +to the `stress-ng man-page <https://www.mankier.com/1/stress-ng>`_ to +find the description of all the available stressor-s. The netdev stressor +starts specified number (N) of workers that exercise various netdevice +ioctl commands across all the available network devices. + +paxtest kiddie workload +----------------------- + +paxtest is a program that tests buffer overflows in the kernel. It tests +kernel enforcements over memory usage. Generally, execution in some memory +segments makes buffer overflows possible. It runs a set of programs that +attempt to subvert memory usage. It is used as a regression test suite for +PaX, but might be useful to test other memory protection patches for the +kernel. We used paxtest kiddie mode which looks for simple vulnerabilities. + +What is strace and how do we use it? +==================================== + +As mentioned earlier, strace which is a useful diagnostic, instructional, +and debugging tool and can be used to discover the system resources in use +by a workload. It can be used: + + * To see how a process interacts with the kernel. + * To see why a process is failing or hanging. + * For reverse engineering a process. + * To find the files on which a program depends. + * For analyzing the performance of an application. + * For troubleshooting various problems related to the operating system. + +In addition, strace can generate run-time statistics on times, calls, and +errors for each system call and report a summary when program exits, +suppressing the regular output. This attempts to show system time (CPU time +spent running in the kernel) independent of wall clock time. We plan to use +these features to get information on workload system usage. + +strace command supports basic, verbose, and stats modes. strace command when +run in verbose mode gives more detailed information about the system calls +invoked by a process. + +Running strace -c generates a report of the percentage of time spent in each +system call, the total time in seconds, the microseconds per call, the total +number of calls, the count of each system call that has failed with an error +and the type of system call made. + + * Usage: strace <command we want to trace> + * Verbose mode usage: strace -v <command> + * Gather statistics: strace -c <command> + +We used the “-c” option to gather fine-grained run-time statistics in use +by three workloads we have chose for this analysis. + + * perf + * stress-ng + * paxtest + +What is cscope and how do we use it? +==================================== + +Now let’s look at `cscope <https://cscope.sourceforge.net/>`_, a command +line tool for browsing C, C++ or Java code-bases. We can use it to find +all the references to a symbol, global definitions, functions called by a +function, functions calling a function, text strings, regular expression +patterns, files including a file. + +We can use cscope to find which system call belongs to which subsystem. +This way we can find the kernel subsystems used by a process when it is +executed. + +Let’s checkout the latest Linux repository and build cscope database:: + + git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux + cd linux + cscope -R -p10 # builds cscope.out database before starting browse session + cscope -d -p10 # starts browse session on cscope.out database + +Note: Run "cscope -R -p10" to build the database and c"scope -d -p10" to +enter into the browsing session. cscope by default cscope.out database. +To get out of this mode press ctrl+d. -p option is used to specify the +number of file path components to display. -p10 is optimal for browsing +kernel sources. + +What is perf and how do we use it? +================================== + +Perf is an analysis tool based on Linux 2.6+ systems, which abstracts the +CPU hardware difference in performance measurement in Linux, and provides +a simple command line interface. Perf is based on the perf_events interface +exported by the kernel. It is very useful for profiling the system and +finding performance bottlenecks in an application. + +If you haven't already checked out the Linux mainline repository, you can do +so and then build kernel and perf tool:: + + git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux + cd linux + make -j3 all + cd tools/perf + make + +Note: The perf command can be built without building the kernel in the +repository and can be run on older kernels. However matching the kernel +and perf revisions gives more accurate information on the subsystem usage. + +We used "perf stat" and "perf bench" options. For a detailed information on +the perf tool, run "perf -h". + +perf stat +--------- +The perf stat command generates a report of various hardware and software +events. It does so with the help of hardware counter registers found in +modern CPUs that keep the count of these activities. "perf stat cal" shows +stats for cal command. + +Perf bench +---------- +The perf bench command contains multiple multi-threaded microkernel +benchmarks for executing different subsystems in the Linux kernel and +system calls. This allows us to easily measure the impact of changes, +which can help mitigate performance regressions. It also acts as a common +benchmarking framework, enabling developers to easily create test cases, +integrate transparently, and use performance-rich tooling. + +"perf bench all" command runs the following benchmarks: + + * sched/messaging + * sched/pipe + * syscall/basic + * mem/memcpy + * mem/memset + +What is stress-ng and how do we use it? +======================================= + +As mentioned earlier, stress-ng is used for performing stress testing on +the kernel. It allows you to exercise various physical subsystems of the +computer, as well as interfaces of the OS kernel, using stressor-s. They +are available for CPU, CPU cache, devices, I/O, interrupts, file system, +memory, network, operating system, pipelines, schedulers, and virtual +machines. + +The netdev stressor starts N workers that exercise various netdevice ioctl +commands across all the available network devices. The following ioctls are +exercised: + + * SIOCGIFCONF, SIOCGIFINDEX, SIOCGIFNAME, SIOCGIFFLAGS + * SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFMETRIC, SIOCGIFMTU + * SIOCGIFHWADDR, SIOCGIFMAP, SIOCGIFTXQLEN + +The following command runs the stressor:: + + stress-ng --netdev 1 -t 60 --metrics command. + +We can use the perf record command to record the events and information +associated with a process. This command records the profiling data in the +perf.data file in the same directory. + +Using the following commands you can record the events associated with the +netdev stressor, view the generated report perf.data and annotate the to +view the statistics of each instruction of the program:: + + perf record stress-ng --netdev 1 -t 60 --metrics command. + perf report + perf annotate + +What is paxtest and how do we use it? +===================================== + +paxtest is a program that tests buffer overflows in the kernel. It tests +kernel enforcements over memory usage. Generally, execution in some memory +segments makes buffer overflows possible. It runs a set of programs that +attempt to subvert memory usage. It is used as a regression test suite for +PaX, and will be useful to test other memory protection patches for the +kernel. + +paxtest provides kiddie and blackhat modes. The paxtest kiddie mode runs +in normal mode, whereas the blackhat mode tries to get around the protection +of the kernel testing for vulnerabilities. We focus on the kiddie mode here +and combine "paxtest kiddie" run with "perf record" to collect CPU stack +traces for the paxtest kiddie run to see which function is calling other +functions in the performance profile. Then the "dwarf" (DWARF's Call Frame +Information) mode can be used to unwind the stack. + +The following command can be used to view resulting report in call-graph +format:: + + perf record --call-graph dwarf paxtest kiddie + perf report --stdio + +Tracing workloads +================= + +Now that we understand the workloads, let's start tracing them. + +Tracing perf bench all workload +------------------------------- + +Run the following command to trace perf bench all workload:: + + strace -c perf bench all + +**System Calls made by the workload** + +The below table shows the system calls invoked by the workload, number of +times each system call is invoked, and the corresponding Linux subsystem. + ++-------------------+-----------+-----------------+-------------------------+ +| System Call | # calls | Linux Subsystem | System Call (API) | ++===================+===========+=================+=========================+ +| getppid | 10000001 | Process Mgmt | sys_getpid() | ++-------------------+-----------+-----------------+-------------------------+ +| clone | 1077 | Process Mgmt. | sys_clone() | ++-------------------+-----------+-----------------+-------------------------+ +| prctl | 23 | Process Mgmt. | sys_prctl() | ++-------------------+-----------+-----------------+-------------------------+ +| prlimit64 | 7 | Process Mgmt. | sys_prlimit64() | ++-------------------+-----------+-----------------+-------------------------+ +| getpid | 10 | Process Mgmt. | sys_getpid() | ++-------------------+-----------+-----------------+-------------------------+ +| uname | 3 | Process Mgmt. | sys_uname() | ++-------------------+-----------+-----------------+-------------------------+ +| sysinfo | 1 | Process Mgmt. | sys_sysinfo() | ++-------------------+-----------+-----------------+-------------------------+ +| getuid | 1 | Process Mgmt. | sys_getuid() | ++-------------------+-----------+-----------------+-------------------------+ +| getgid | 1 | Process Mgmt. | sys_getgid() | ++-------------------+-----------+-----------------+-------------------------+ +| geteuid | 1 | Process Mgmt. | sys_geteuid() | ++-------------------+-----------+-----------------+-------------------------+ +| getegid | 1 | Process Mgmt. | sys_getegid | ++-------------------+-----------+-----------------+-------------------------+ +| close | 49951 | Filesystem | sys_close() | ++-------------------+-----------+-----------------+-------------------------+ +| pipe | 604 | Filesystem | sys_pipe() | ++-------------------+-----------+-----------------+-------------------------+ +| openat | 48560 | Filesystem | sys_opennat() | ++-------------------+-----------+-----------------+-------------------------+ +| fstat | 8338 | Filesystem | sys_fstat() | ++-------------------+-----------+-----------------+-------------------------+ +| stat | 1573 | Filesystem | sys_stat() | ++-------------------+-----------+-----------------+-------------------------+ +| pread64 | 9646 | Filesystem | sys_pread64() | ++-------------------+-----------+-----------------+-------------------------+ +| getdents64 | 1873 | Filesystem | sys_getdents64() | ++-------------------+-----------+-----------------+-------------------------+ +| access | 3 | Filesystem | sys_access() | ++-------------------+-----------+-----------------+-------------------------+ +| lstat | 1880 | Filesystem | sys_lstat() | ++-------------------+-----------+-----------------+-------------------------+ +| lseek | 6 | Filesystem | sys_lseek() | ++-------------------+-----------+-----------------+-------------------------+ +| ioctl | 3 | Filesystem | sys_ioctl() | ++-------------------+-----------+-----------------+-------------------------+ +| dup2 | 1 | Filesystem | sys_dup2() | ++-------------------+-----------+-----------------+-------------------------+ +| execve | 2 | Filesystem | sys_execve() | ++-------------------+-----------+-----------------+-------------------------+ +| fcntl | 8779 | Filesystem | sys_fcntl() | ++-------------------+-----------+-----------------+-------------------------+ +| statfs | 1 | Filesystem | sys_statfs() | ++-------------------+-----------+-----------------+-------------------------+ +| epoll_create | 2 | Filesystem | sys_epoll_create() | ++-------------------+-----------+-----------------+-------------------------+ +| epoll_ctl | 64 | Filesystem | sys_epoll_ctl() | ++-------------------+-----------+-----------------+-------------------------+ +| newfstatat | 8318 | Filesystem | sys_newfstatat() | ++-------------------+-----------+-----------------+-------------------------+ +| eventfd2 | 192 | Filesystem | sys_eventfd2() | ++-------------------+-----------+-----------------+-------------------------+ +| mmap | 243 | Memory Mgmt. | sys_mmap() | ++-------------------+-----------+-----------------+-------------------------+ +| mprotect | 32 | Memory Mgmt. | sys_mprotect() | ++-------------------+-----------+-----------------+-------------------------+ +| brk | 21 | Memory Mgmt. | sys_brk() | ++-------------------+-----------+-----------------+-------------------------+ +| munmap | 128 | Memory Mgmt. | sys_munmap() | ++-------------------+-----------+-----------------+-------------------------+ +| set_mempolicy | 156 | Memory Mgmt. | sys_set_mempolicy() | ++-------------------+-----------+-----------------+-------------------------+ +| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() | ++-------------------+-----------+-----------------+-------------------------+ +| set_robust_list | 1 | Futex | sys_set_robust_list() | ++-------------------+-----------+-----------------+-------------------------+ +| futex | 341 | Futex | sys_futex() | ++-------------------+-----------+-----------------+-------------------------+ +| sched_getaffinity | 79 | Scheduler | sys_sched_getaffinity() | ++-------------------+-----------+-----------------+-------------------------+ +| sched_setaffinity | 223 | Scheduler | sys_sched_setaffinity() | ++-------------------+-----------+-----------------+-------------------------+ +| socketpair | 202 | Network | sys_socketpair() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigprocmask | 21 | Signal | sys_rt_sigprocmask() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigaction | 36 | Signal | sys_rt_sigaction() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigreturn | 2 | Signal | sys_rt_sigreturn() | ++-------------------+-----------+-----------------+-------------------------+ +| wait4 | 889 | Time | sys_wait4() | ++-------------------+-----------+-----------------+-------------------------+ +| clock_nanosleep | 37 | Time | sys_clock_nanosleep() | ++-------------------+-----------+-----------------+-------------------------+ +| capget | 4 | Capability | sys_capget() | ++-------------------+-----------+-----------------+-------------------------+ + +Tracing stress-ng netdev stressor workload +------------------------------------------ + +Run the following command to trace stress-ng netdev stressor workload:: + + strace -c stress-ng --netdev 1 -t 60 --metrics + +**System Calls made by the workload** + +The below table shows the system calls invoked by the workload, number of +times each system call is invoked, and the corresponding Linux subsystem. + ++-------------------+-----------+-----------------+-------------------------+ +| System Call | # calls | Linux Subsystem | System Call (API) | ++===================+===========+=================+=========================+ +| openat | 74 | Filesystem | sys_openat() | ++-------------------+-----------+-----------------+-------------------------+ +| close | 75 | Filesystem | sys_close() | ++-------------------+-----------+-----------------+-------------------------+ +| read | 58 | Filesystem | sys_read() | ++-------------------+-----------+-----------------+-------------------------+ +| fstat | 20 | Filesystem | sys_fstat() | ++-------------------+-----------+-----------------+-------------------------+ +| flock | 10 | Filesystem | sys_flock() | ++-------------------+-----------+-----------------+-------------------------+ +| write | 7 | Filesystem | sys_write() | ++-------------------+-----------+-----------------+-------------------------+ +| getdents64 | 8 | Filesystem | sys_getdents64() | ++-------------------+-----------+-----------------+-------------------------+ +| pread64 | 8 | Filesystem | sys_pread64() | ++-------------------+-----------+-----------------+-------------------------+ +| lseek | 1 | Filesystem | sys_lseek() | ++-------------------+-----------+-----------------+-------------------------+ +| access | 2 | Filesystem | sys_access() | ++-------------------+-----------+-----------------+-------------------------+ +| getcwd | 1 | Filesystem | sys_getcwd() | ++-------------------+-----------+-----------------+-------------------------+ +| execve | 1 | Filesystem | sys_execve() | ++-------------------+-----------+-----------------+-------------------------+ +| mmap | 61 | Memory Mgmt. | sys_mmap() | ++-------------------+-----------+-----------------+-------------------------+ +| munmap | 3 | Memory Mgmt. | sys_munmap() | ++-------------------+-----------+-----------------+-------------------------+ +| mprotect | 20 | Memory Mgmt. | sys_mprotect() | ++-------------------+-----------+-----------------+-------------------------+ +| mlock | 2 | Memory Mgmt. | sys_mlock() | ++-------------------+-----------+-----------------+-------------------------+ +| brk | 3 | Memory Mgmt. | sys_brk() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigaction | 21 | Signal | sys_rt_sigaction() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigprocmask | 1 | Signal | sys_rt_sigprocmask() | ++-------------------+-----------+-----------------+-------------------------+ +| sigaltstack | 1 | Signal | sys_sigaltstack() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigreturn | 1 | Signal | sys_rt_sigreturn() | ++-------------------+-----------+-----------------+-------------------------+ +| getpid | 8 | Process Mgmt. | sys_getpid() | ++-------------------+-----------+-----------------+-------------------------+ +| prlimit64 | 5 | Process Mgmt. | sys_prlimit64() | ++-------------------+-----------+-----------------+-------------------------+ +| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() | ++-------------------+-----------+-----------------+-------------------------+ +| sysinfo | 2 | Process Mgmt. | sys_sysinfo() | ++-------------------+-----------+-----------------+-------------------------+ +| getuid | 2 | Process Mgmt. | sys_getuid() | ++-------------------+-----------+-----------------+-------------------------+ +| uname | 1 | Process Mgmt. | sys_uname() | ++-------------------+-----------+-----------------+-------------------------+ +| setpgid | 1 | Process Mgmt. | sys_setpgid() | ++-------------------+-----------+-----------------+-------------------------+ +| getrusage | 1 | Process Mgmt. | sys_getrusage() | ++-------------------+-----------+-----------------+-------------------------+ +| geteuid | 1 | Process Mgmt. | sys_geteuid() | ++-------------------+-----------+-----------------+-------------------------+ +| getppid | 1 | Process Mgmt. | sys_getppid() | ++-------------------+-----------+-----------------+-------------------------+ +| sendto | 3 | Network | sys_sendto() | ++-------------------+-----------+-----------------+-------------------------+ +| connect | 1 | Network | sys_connect() | ++-------------------+-----------+-----------------+-------------------------+ +| socket | 1 | Network | sys_socket() | ++-------------------+-----------+-----------------+-------------------------+ +| clone | 1 | Process Mgmt. | sys_clone() | ++-------------------+-----------+-----------------+-------------------------+ +| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() | ++-------------------+-----------+-----------------+-------------------------+ +| wait4 | 2 | Time | sys_wait4() | ++-------------------+-----------+-----------------+-------------------------+ +| alarm | 1 | Time | sys_alarm() | ++-------------------+-----------+-----------------+-------------------------+ +| set_robust_list | 1 | Futex | sys_set_robust_list() | ++-------------------+-----------+-----------------+-------------------------+ + +Tracing paxtest kiddie workload +------------------------------- + +Run the following command to trace paxtest kiddie workload:: + + strace -c paxtest kiddie + +**System Calls made by the workload** + +The below table shows the system calls invoked by the workload, number of +times each system call is invoked, and the corresponding Linux subsystem. + ++-------------------+-----------+-----------------+----------------------+ +| System Call | # calls | Linux Subsystem | System Call (API) | ++===================+===========+=================+======================+ +| read | 3 | Filesystem | sys_read() | ++-------------------+-----------+-----------------+----------------------+ +| write | 11 | Filesystem | sys_write() | ++-------------------+-----------+-----------------+----------------------+ +| close | 41 | Filesystem | sys_close() | ++-------------------+-----------+-----------------+----------------------+ +| stat | 24 | Filesystem | sys_stat() | ++-------------------+-----------+-----------------+----------------------+ +| fstat | 2 | Filesystem | sys_fstat() | ++-------------------+-----------+-----------------+----------------------+ +| pread64 | 6 | Filesystem | sys_pread64() | ++-------------------+-----------+-----------------+----------------------+ +| access | 1 | Filesystem | sys_access() | ++-------------------+-----------+-----------------+----------------------+ +| pipe | 1 | Filesystem | sys_pipe() | ++-------------------+-----------+-----------------+----------------------+ +| dup2 | 24 | Filesystem | sys_dup2() | ++-------------------+-----------+-----------------+----------------------+ +| execve | 1 | Filesystem | sys_execve() | ++-------------------+-----------+-----------------+----------------------+ +| fcntl | 26 | Filesystem | sys_fcntl() | ++-------------------+-----------+-----------------+----------------------+ +| openat | 14 | Filesystem | sys_openat() | ++-------------------+-----------+-----------------+----------------------+ +| rt_sigaction | 7 | Signal | sys_rt_sigaction() | ++-------------------+-----------+-----------------+----------------------+ +| rt_sigreturn | 38 | Signal | sys_rt_sigreturn() | ++-------------------+-----------+-----------------+----------------------+ +| clone | 38 | Process Mgmt. | sys_clone() | ++-------------------+-----------+-----------------+----------------------+ +| wait4 | 44 | Time | sys_wait4() | ++-------------------+-----------+-----------------+----------------------+ +| mmap | 7 | Memory Mgmt. | sys_mmap() | ++-------------------+-----------+-----------------+----------------------+ +| mprotect | 3 | Memory Mgmt. | sys_mprotect() | ++-------------------+-----------+-----------------+----------------------+ +| munmap | 1 | Memory Mgmt. | sys_munmap() | ++-------------------+-----------+-----------------+----------------------+ +| brk | 3 | Memory Mgmt. | sys_brk() | ++-------------------+-----------+-----------------+----------------------+ +| getpid | 1 | Process Mgmt. | sys_getpid() | ++-------------------+-----------+-----------------+----------------------+ +| getuid | 1 | Process Mgmt. | sys_getuid() | ++-------------------+-----------+-----------------+----------------------+ +| getgid | 1 | Process Mgmt. | sys_getgid() | ++-------------------+-----------+-----------------+----------------------+ +| geteuid | 2 | Process Mgmt. | sys_geteuid() | ++-------------------+-----------+-----------------+----------------------+ +| getegid | 1 | Process Mgmt. | sys_getegid() | ++-------------------+-----------+-----------------+----------------------+ +| getppid | 1 | Process Mgmt. | sys_getppid() | ++-------------------+-----------+-----------------+----------------------+ +| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() | ++-------------------+-----------+-----------------+----------------------+ + +Conclusion +========== + +This document is intended to be used as a guide on how to gather fine-grained +information on the resources in use by workloads using strace. + +References +========== + + * `Discovery Linux Kernel Subsystems used by OpenAPS <https://elisa.tech/blog/2022/02/02/discovery-linux-kernel-subsystems-used-by-openaps>`_ + * `ELISA-White-Papers-Discovering Linux kernel subsystems used by a workload <https://github.com/elisa-tech/ELISA-White-Papers/blob/master/Processes/Discovering_Linux_kernel_subsystems_used_by_a_workload.md>`_ + * `strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_ + * `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_ + * `paxtest README <https://github.com/opntr/paxtest-freebsd/blob/hardenedbsd/0.9.14-hbsd/README>`_ + * `stress-ng <https://www.mankier.com/1/stress-ng>`_ + * `Monitoring and managing system status and performance <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/index>`_ |