diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/sysfs-devices-system-cpu | 16 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 63 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt | 19 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt | 63 | ||||
-rw-r--r-- | Documentation/driver-api/pm/devices.rst | 54 | ||||
-rw-r--r-- | Documentation/filesystems/nilfs2.txt | 4 | ||||
-rw-r--r-- | Documentation/gpu/i915.rst | 5 | ||||
-rw-r--r-- | Documentation/kbuild/kconfig-language.txt | 23 | ||||
-rw-r--r-- | Documentation/networking/index.rst | 2 | ||||
-rw-r--r-- | Documentation/networking/msg_zerocopy.rst | 4 | ||||
-rw-r--r-- | Documentation/power/pci.txt | 11 | ||||
-rw-r--r-- | Documentation/thermal/cpu-cooling-api.txt | 115 | ||||
-rw-r--r-- | Documentation/usb/gadget-testing.txt | 2 | ||||
-rw-r--r-- | Documentation/x86/pti.txt | 186 | ||||
-rw-r--r-- | Documentation/x86/x86_64/mm.txt | 18 |
15 files changed, 431 insertions, 154 deletions
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index d6d862db3b5d..bfd29bc8d37a 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -375,3 +375,19 @@ Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> Description: information about CPUs heterogeneity. cpu_capacity: capacity of cpu#. + +What: /sys/devices/system/cpu/vulnerabilities + /sys/devices/system/cpu/vulnerabilities/meltdown + /sys/devices/system/cpu/vulnerabilities/spectre_v1 + /sys/devices/system/cpu/vulnerabilities/spectre_v2 +Date: January 2018 +Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> +Description: Information about CPU vulnerabilities + + The files are named after the code names of CPU + vulnerabilities. The output of those files reflects the + state of the CPUs in the system. Possible output values: + + "Not affected" CPU is not affected by the vulnerability + "Vulnerable" CPU is affected and no mitigation in effect + "Mitigation: $M" CPU is affected and mitigation $M is in effect diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index af7104aaffd9..281c85839c17 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -223,7 +223,7 @@ acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, - old_ordering, nonvs, sci_force_enable } + old_ordering, nonvs, sci_force_enable, nobl } See Documentation/power/video.txt for information on s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep @@ -239,6 +239,9 @@ sci_force_enable causes the kernel to set SCI_EN directly on resume from S1/S3 (which is against the ACPI spec, but some broken systems don't work without it). + nobl causes the internal blacklist of systems known to + behave incorrectly in some ways with respect to system + suspend and resume to be ignored (use wisely). acpi_use_timer_override [HW,ACPI] Use timer override. For some broken Nvidia NF5 boards @@ -713,9 +716,6 @@ It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. - crossrelease_fullstack - [KNL] Allow to record full stack trace in cross-release - cryptomgr.notests [KNL] Disable crypto self-tests @@ -2626,6 +2626,11 @@ nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1. + nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2 + (indirect branch prediction) vulnerability. System may + allow data leaks with this option, which is equivalent + to spectre_v2=off. + noxsave [BUGS=X86] Disables x86 extended register state save and restore using xsave. The kernel will fallback to enabling legacy floating-point and sse state. @@ -2712,8 +2717,6 @@ steal time is computed, but won't influence scheduler behaviour - nopti [X86-64] Disable kernel page table isolation - nolapic [X86-32,APIC] Do not enable or use the local APIC. nolapic_timer [X86-32,APIC] Do not use the local APIC timer. @@ -3100,6 +3103,12 @@ pcie_scan_all Scan all possible PCIe devices. Otherwise we only look for one device below a PCIe downstream port. + big_root_window Try to add a big 64bit memory window to the PCIe + root complex on AMD CPUs. Some GFX hardware + can resize a BAR to allow access to all VRAM. + Adding the window is slightly risky (it may + conflict with unreported devices), so this + taints the kernel. pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. @@ -3288,11 +3297,20 @@ pt. [PARIDE] See Documentation/blockdev/paride.txt. - pti= [X86_64] - Control user/kernel address space isolation: - on - enable - off - disable - auto - default setting + pti= [X86_64] Control Page Table Isolation of user and + kernel address spaces. Disabling this feature + removes hardening, but improves performance of + system calls and interrupts. + + on - unconditionally enable + off - unconditionally disable + auto - kernel detects whether your CPU model is + vulnerable to issues that PTI mitigates + + Not specifying this option is equivalent to pti=auto. + + nopti [X86_64] + Equivalent to pti=off pty.legacy_count= [KNL] Number of legacy pty's. Overwrites compiled-in @@ -3943,6 +3961,29 @@ sonypi.*= [HW] Sony Programmable I/O Control Device driver See Documentation/laptops/sonypi.txt + spectre_v2= [X86] Control mitigation of Spectre variant 2 + (indirect branch speculation) vulnerability. + + on - unconditionally enable + off - unconditionally disable + auto - kernel detects whether your CPU model is + vulnerable + + Selecting 'on' will, and 'auto' may, choose a + mitigation method at run time according to the + CPU, the available microcode, the setting of the + CONFIG_RETPOLINE configuration option, and the + compiler with which the kernel was built. + + Specific mitigations can also be selected manually: + + retpoline - replace indirect branches + retpoline,generic - google's original retpoline + retpoline,amd - AMD-specific minimal thunk + + Not specifying this option is equivalent to + spectre_v2=auto. + spia_io_base= [HW,MTD] spia_fio_base= spia_pedr= diff --git a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt index 51336e5fc761..35c3c3460d17 100644 --- a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt +++ b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt @@ -14,3 +14,22 @@ following property before the previous one: Example: compatible = "marvell,armada-3720-db", "marvell,armada3720", "marvell,armada3710"; + + +Power management +---------------- + +For power management (particularly DVFS and AVS), the North Bridge +Power Management component is needed: + +Required properties: +- compatible : should contain "marvell,armada-3700-nb-pm", "syscon"; +- reg : the register start and length for the North Bridge + Power Management + +Example: + +nb_pm: syscon@14000 { + compatible = "marvell,armada-3700-nb-pm", "syscon"; + reg = <0x14000 0x60>; +} diff --git a/Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt b/Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt new file mode 100644 index 000000000000..832346e489a3 --- /dev/null +++ b/Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt @@ -0,0 +1,63 @@ +Texas Instruments OMAP compatible OPP supply description + +OMAP5, DRA7, and AM57 family of SoCs have Class0 AVS eFuse registers which +contain data that can be used to adjust voltages programmed for some of their +supplies for more efficient operation. This binding provides the information +needed to read these values and use them to program the main regulator during +an OPP transitions. + +Also, some supplies may have an associated vbb-supply which is an Adaptive Body +Bias regulator which much be transitioned in a specific sequence with regards +to the vdd-supply and clk when making an OPP transition. By supplying two +regulators to the device that will undergo OPP transitions we can make use +of the multi regulator binding that is part of the OPP core described here [1] +to describe both regulators needed by the platform. + +[1] Documentation/devicetree/bindings/opp/opp.txt + +Required Properties for Device Node: +- vdd-supply: phandle to regulator controlling VDD supply +- vbb-supply: phandle to regulator controlling Body Bias supply + (Usually Adaptive Body Bias regulator) + +Required Properties for opp-supply node: +- compatible: Should be one of: + "ti,omap-opp-supply" - basic OPP supply controlling VDD and VBB + "ti,omap5-opp-supply" - OMAP5+ optimized voltages in efuse(class0)VDD + along with VBB + "ti,omap5-core-opp-supply" - OMAP5+ optimized voltages in efuse(class0) VDD + but no VBB. +- reg: Address and length of the efuse register set for the device (mandatory + only for "ti,omap5-opp-supply") +- ti,efuse-settings: An array of u32 tuple items providing information about + optimized efuse configuration. Each item consists of the following: + volt: voltage in uV - reference voltage (OPP voltage) + efuse_offseet: efuse offset from reg where the optimized voltage is stored. +- ti,absolute-max-voltage-uv: absolute maximum voltage for the OPP supply. + +Example: + +/* Device Node (CPU) */ +cpus { + cpu0: cpu@0 { + device_type = "cpu"; + + ... + + vdd-supply = <&vcc>; + vbb-supply = <&abb_mpu>; + }; +}; + +/* OMAP OPP Supply with Class0 registers */ +opp_supply_mpu: opp_supply@4a003b20 { + compatible = "ti,omap5-opp-supply"; + reg = <0x4a003b20 0x8>; + ti,efuse-settings = < + /* uV offset */ + 1060000 0x0 + 1160000 0x4 + 1210000 0x8 + >; + ti,absolute-max-voltage-uv = <1500000>; +}; diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 53c1b0b06da5..1128705a5731 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -777,17 +777,51 @@ The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in runtime suspend at the beginning of the ``suspend_late`` phase of system-wide suspend (or in the ``poweroff_late`` phase of hibernation), when runtime PM has been disabled for it, under the assumption that its state should not change -after that point until the system-wide transition is over. If that happens, the -driver's system-wide resume callbacks, if present, may still be invoked during -the subsequent system-wide resume transition and the device's runtime power -management status may be set to "active" before enabling runtime PM for it, -so the driver must be prepared to cope with the invocation of its system-wide -resume callbacks back-to-back with its ``->runtime_suspend`` one (without the -intervening ``->runtime_resume`` and so on) and the final state of the device -must reflect the "active" status for runtime PM in that case. +after that point until the system-wide transition is over (the PM core itself +does that for devices whose "noirq", "late" and "early" system-wide PM callbacks +are executed directly by it). If that happens, the driver's system-wide resume +callbacks, if present, may still be invoked during the subsequent system-wide +resume transition and the device's runtime power management status may be set +to "active" before enabling runtime PM for it, so the driver must be prepared to +cope with the invocation of its system-wide resume callbacks back-to-back with +its ``->runtime_suspend`` one (without the intervening ``->runtime_resume`` and +so on) and the final state of the device must reflect the "active" runtime PM +status in that case. During system-wide resume from a sleep state it's easiest to put devices into the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. -Refer to that document for more information regarding this particular issue as +[Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in -general. +general.] + +However, it often is desirable to leave devices in suspend after system +transitions to the working state, especially if those devices had been in +runtime suspend before the preceding system-wide suspend (or analogous) +transition. Device drivers can use the ``DPM_FLAG_LEAVE_SUSPENDED`` flag to +indicate to the PM core (and middle-layer code) that they prefer the specific +devices handled by them to be left suspended and they have no problems with +skipping their system-wide resume callbacks for this reason. Whether or not the +devices will actually be left in suspend may depend on their state before the +given system suspend-resume cycle and on the type of the system transition under +way. In particular, devices are not left suspended if that transition is a +restore from hibernation, as device states are not guaranteed to be reflected +by the information stored in the hibernation image in that case. + +The middle-layer code involved in the handling of the device is expected to +indicate to the PM core if the device may be left in suspend by setting its +:c:member:`power.may_skip_resume` status bit which is checked by the PM core +during the "noirq" phase of the preceding system-wide suspend (or analogous) +transition. The middle layer is then responsible for handling the device as +appropriate in its "noirq" resume callback, which is executed regardless of +whether or not the device is left suspended, but the other resume callbacks +(except for ``->complete``) will be skipped automatically by the PM core if the +device really can be left in suspend. + +For devices whose "noirq", "late" and "early" driver callbacks are invoked +directly by the PM core, all of the system-wide resume callbacks are skipped if +``DPM_FLAG_LEAVE_SUSPENDED`` is set and the device is in runtime suspend during +the ``suspend_noirq`` (or analogous) phase or the transition under way is a +proper system suspend (rather than anything related to hibernation) and the +device's wakeup settings are suitable for runtime PM (that is, it cannot +generate wakeup signals at all or it is allowed to wake up the system from +sleep). diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt index c0727dc36271..f2f3f8592a6f 100644 --- a/Documentation/filesystems/nilfs2.txt +++ b/Documentation/filesystems/nilfs2.txt @@ -25,8 +25,8 @@ available from the following download page. At least "mkfs.nilfs2", cleaner or garbage collector) are required. Details on the tools are described in the man pages included in the package. -Project web page: http://nilfs.sourceforge.net/ -Download page: http://nilfs.sourceforge.net/en/download.html +Project web page: https://nilfs.sourceforge.io/ +Download page: https://nilfs.sourceforge.io/en/download.html List info: http://vger.kernel.org/vger-lists.html#linux-nilfs Caveats diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst index 2e7ee0313c1c..e94d3ac2bdd0 100644 --- a/Documentation/gpu/i915.rst +++ b/Documentation/gpu/i915.rst @@ -341,10 +341,7 @@ GuC GuC-specific firmware loader ---------------------------- -.. kernel-doc:: drivers/gpu/drm/i915/intel_guc_loader.c - :doc: GuC-specific firmware loader - -.. kernel-doc:: drivers/gpu/drm/i915/intel_guc_loader.c +.. kernel-doc:: drivers/gpu/drm/i915/intel_guc_fw.c :internal: GuC-based command submission diff --git a/Documentation/kbuild/kconfig-language.txt b/Documentation/kbuild/kconfig-language.txt index 262722d8867b..c4a293a03c33 100644 --- a/Documentation/kbuild/kconfig-language.txt +++ b/Documentation/kbuild/kconfig-language.txt @@ -200,10 +200,14 @@ module state. Dependency expressions have the following syntax: <expr> ::= <symbol> (1) <symbol> '=' <symbol> (2) <symbol> '!=' <symbol> (3) - '(' <expr> ')' (4) - '!' <expr> (5) - <expr> '&&' <expr> (6) - <expr> '||' <expr> (7) + <symbol1> '<' <symbol2> (4) + <symbol1> '>' <symbol2> (4) + <symbol1> '<=' <symbol2> (4) + <symbol1> '>=' <symbol2> (4) + '(' <expr> ')' (5) + '!' <expr> (6) + <expr> '&&' <expr> (7) + <expr> '||' <expr> (8) Expressions are listed in decreasing order of precedence. @@ -214,10 +218,13 @@ Expressions are listed in decreasing order of precedence. otherwise 'n'. (3) If the values of both symbols are equal, it returns 'n', otherwise 'y'. -(4) Returns the value of the expression. Used to override precedence. -(5) Returns the result of (2-/expr/). -(6) Returns the result of min(/expr/, /expr/). -(7) Returns the result of max(/expr/, /expr/). +(4) If value of <symbol1> is respectively lower, greater, lower-or-equal, + or greater-or-equal than value of <symbol2>, it returns 'y', + otherwise 'n'. +(5) Returns the value of the expression. Used to override precedence. +(6) Returns the result of (2-/expr/). +(7) Returns the result of min(/expr/, /expr/). +(8) Returns the result of max(/expr/, /expr/). An expression can have a value of 'n', 'm' or 'y' (or 0, 1, 2 respectively for calculations). A menu entry becomes visible when its diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 66e620866245..7d4b15977d61 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -9,6 +9,7 @@ Contents: batman-adv kapi z8530book + msg_zerocopy .. only:: subproject @@ -16,4 +17,3 @@ Contents: ======= * :ref:`genindex` - diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst index 77f6d7e25cfd..291a01264967 100644 --- a/Documentation/networking/msg_zerocopy.rst +++ b/Documentation/networking/msg_zerocopy.rst @@ -72,6 +72,10 @@ this flag, a process must first signal intent by setting a socket option: if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) error(1, errno, "setsockopt zerocopy"); +Setting the socket option only works when the socket is in its initial +(TCP_CLOSED) state. Trying to set the option for a socket returned by accept(), +for example, will lead to an EBUSY error. In this case, the option should be set +to the listening socket and it will be inherited by the accepted sockets. Transmission ------------ diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt index 704cd36079b8..8eaf9ee24d43 100644 --- a/Documentation/power/pci.txt +++ b/Documentation/power/pci.txt @@ -994,6 +994,17 @@ into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return. +Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +device to be left in suspend after system-wide transitions to the working state. +This flag is checked by the PM core, but the PCI bus type informs the PM core +which devices may be left in suspend from its perspective (that happens during +the "noirq" phase of system-wide suspend and analogous transitions) and next it +uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +pci_pm_resume_noirq() early, as the PM core will skip the remaining resume +callbacks for the device during the transition under way and will set its +runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for +it. + 3.2. Device Runtime Power Management ------------------------------------ In addition to providing device power management callbacks PCI device drivers diff --git a/Documentation/thermal/cpu-cooling-api.txt b/Documentation/thermal/cpu-cooling-api.txt index 71653584cd03..7df567eaea1a 100644 --- a/Documentation/thermal/cpu-cooling-api.txt +++ b/Documentation/thermal/cpu-cooling-api.txt @@ -26,39 +26,16 @@ the user. The registration APIs returns the cooling device pointer. clip_cpus: cpumask of cpus where the frequency constraints will happen. 1.1.2 struct thermal_cooling_device *of_cpufreq_cooling_register( - struct device_node *np, const struct cpumask *clip_cpus) + struct cpufreq_policy *policy) This interface function registers the cpufreq cooling device with the name "thermal-cpufreq-%x" linking it with a device tree node, in order to bind it via the thermal DT code. This api can support multiple instances of cpufreq cooling devices. - np: pointer to the cooling device device tree node - clip_cpus: cpumask of cpus where the frequency constraints will happen. + policy: CPUFreq policy. -1.1.3 struct thermal_cooling_device *cpufreq_power_cooling_register( - const struct cpumask *clip_cpus, u32 capacitance, - get_static_t plat_static_func) - -Similar to cpufreq_cooling_register, this function registers a cpufreq -cooling device. Using this function, the cooling device will -implement the power extensions by using a simple cpu power model. The -cpus must have registered their OPPs using the OPP library. - -The additional parameters are needed for the power model (See 2. Power -models). "capacitance" is the dynamic power coefficient (See 2.1 -Dynamic power). "plat_static_func" is a function to calculate the -static power consumed by these cpus (See 2.2 Static power). - -1.1.4 struct thermal_cooling_device *of_cpufreq_power_cooling_register( - struct device_node *np, const struct cpumask *clip_cpus, u32 capacitance, - get_static_t plat_static_func) - -Similar to cpufreq_power_cooling_register, this function register a -cpufreq cooling device with power extensions using the device tree -information supplied by the np parameter. - -1.1.5 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) +1.1.3 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) This interface function unregisters the "thermal-cpufreq-%x" cooling device. @@ -67,20 +44,14 @@ information supplied by the np parameter. 2. Power models The power API registration functions provide a simple power model for -CPUs. The current power is calculated as dynamic + (optionally) -static power. This power model requires that the operating-points of +CPUs. The current power is calculated as dynamic power (static power isn't +supported currently). This power model requires that the operating-points of the CPUs are registered using the kernel's opp library and the `cpufreq_frequency_table` is assigned to the `struct device` of the cpu. If you are using CONFIG_CPUFREQ_DT then the `cpufreq_frequency_table` should already be assigned to the cpu device. -The `plat_static_func` parameter of `cpufreq_power_cooling_register()` -and `of_cpufreq_power_cooling_register()` is optional. If you don't -provide it, only dynamic power will be considered. - -2.1 Dynamic power - The dynamic power consumption of a processor depends on many factors. For a given processor implementation the primary factors are: @@ -119,79 +90,3 @@ mW/MHz/uVolt^2. Typical values for mobile CPUs might lie in range from 100 to 500. For reference, the approximate values for the SoC in ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and 140 for the Cortex-A53 cluster. - - -2.2 Static power - -Static leakage power consumption depends on a number of factors. For a -given circuit implementation the primary factors are: - -- Time the circuit spends in each 'power state' -- Temperature -- Operating voltage -- Process grade - -The time the circuit spends in each 'power state' for a given -evaluation period at first order means OFF or ON. However, -'retention' states can also be supported that reduce power during -inactive periods without loss of context. - -Note: The visibility of state entries to the OS can vary, according to -platform specifics, and this can then impact the accuracy of a model -based on OS state information alone. It might be possible in some -cases to extract more accurate information from system resources. - -The temperature, operating voltage and process 'grade' (slow to fast) -of the circuit are all significant factors in static leakage power -consumption. All of these have complex relationships to static power. - -Circuit implementation specific factors include the chosen silicon -process as well as the type, number and size of transistors in both -the logic gates and any RAM elements included. - -The static power consumption modelling must take into account the -power managed regions that are implemented. Taking the example of an -ARM processor cluster, the modelling would take into account whether -each CPU can be powered OFF separately or if only a single power -region is implemented for the complete cluster. - -In one view, there are others, a static power consumption model can -then start from a set of reference values for each power managed -region (e.g. CPU, Cluster/L2) in each state (e.g. ON, OFF) at an -arbitrary process grade, voltage and temperature point. These values -are then scaled for all of the following: the time in each state, the -process grade, the current temperature and the operating voltage. -However, since both implementation specific and complex relationships -dominate the estimate, the appropriate interface to the model from the -cpu cooling device is to provide a function callback that calculates -the static power in this platform. When registering the cpu cooling -device pass a function pointer that follows the `get_static_t` -prototype: - - int plat_get_static(cpumask_t *cpumask, int interval, - unsigned long voltage, u32 &power); - -`cpumask` is the cpumask of the cpus involved in the calculation. -`voltage` is the voltage at which they are operating. The function -should calculate the average static power for the last `interval` -milliseconds. It returns 0 on success, -E* on error. If it -succeeds, it should store the static power in `power`. Reading the -temperature of the cpus described by `cpumask` is left for -plat_get_static() to do as the platform knows best which thermal -sensor is closest to the cpu. - -If `plat_static_func` is NULL, static power is considered to be -negligible for this platform and only dynamic power is considered. - -The platform specific callback can then use any combination of tables -and/or equations to permute the estimated value. Process grade -information is not passed to the model since access to such data, from -on-chip measurement capability or manufacture time data, is platform -specific. - -Note: the significance of static power for CPUs in comparison to -dynamic power is highly dependent on implementation. Given the -potential complexity in implementation, the importance and accuracy of -its inclusion when using cpu cooling devices should be assessed on a -case by case basis. - diff --git a/Documentation/usb/gadget-testing.txt b/Documentation/usb/gadget-testing.txt index 441a4b9b666f..5908a21fddb6 100644 --- a/Documentation/usb/gadget-testing.txt +++ b/Documentation/usb/gadget-testing.txt @@ -693,7 +693,7 @@ such specification consists of a number of lines with an inverval value in each line. The rules stated above are best illustrated with an example: # mkdir functions/uvc.usb0/control/header/h -# cd functions/uvc.usb0/control/header/h +# cd functions/uvc.usb0/control/ # ln -s header/h class/fs # ln -s header/h class/ss # mkdir -p functions/uvc.usb0/streaming/uncompressed/u/360p diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt new file mode 100644 index 000000000000..d11eff61fc9a --- /dev/null +++ b/Documentation/x86/pti.txt @@ -0,0 +1,186 @@ +Overview +======== + +Page Table Isolation (pti, previously known as KAISER[1]) is a +countermeasure against attacks on the shared user/kernel address +space such as the "Meltdown" approach[2]. + +To mitigate this class of attacks, we create an independent set of +page tables for use only when running userspace applications. When +the kernel is entered via syscalls, interrupts or exceptions, the +page tables are switched to the full "kernel" copy. When the system +switches back to user mode, the user copy is used again. + +The userspace page tables contain only a minimal amount of kernel +data: only what is needed to enter/exit the kernel such as the +entry/exit functions themselves and the interrupt descriptor table +(IDT). There are a few strictly unnecessary things that get mapped +such as the first C function when entering an interrupt (see +comments in pti.c). + +This approach helps to ensure that side-channel attacks leveraging +the paging structures do not function when PTI is enabled. It can be +enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. +Once enabled at compile-time, it can be disabled at boot with the +'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). + +Page Table Management +===================== + +When PTI is enabled, the kernel manages two sets of page tables. +The first set is very similar to the single set which is present in +kernels without PTI. This includes a complete mapping of userspace +that the kernel can use for things like copy_to_user(). + +Although _complete_, the user portion of the kernel page tables is +crippled by setting the NX bit in the top level. This ensures +that any missed kernel->user CR3 switch will immediately crash +userspace upon executing its first instruction. + +The userspace page tables map only the kernel data needed to enter +and exit the kernel. This data is entirely contained in the 'struct +cpu_entry_area' structure which is placed in the fixmap which gives +each CPU's copy of the area a compile-time-fixed virtual address. + +For new userspace mappings, the kernel makes the entries in its +page tables like normal. The only difference is when the kernel +makes entries in the top (PGD) level. In addition to setting the +entry in the main kernel PGD, a copy of the entry is made in the +userspace page tables' PGD. + +This sharing at the PGD level also inherently shares all the lower +layers of the page tables. This leaves a single, shared set of +userspace page tables to manage. One PTE to lock, one set of +accessed bits, dirty bits, etc... + +Overhead +======== + +Protection against side-channel attacks is important. But, +this protection comes at a cost: + +1. Increased Memory Use + a. Each process now needs an order-1 PGD instead of order-0. + (Consumes an additional 4k per process). + b. The 'cpu_entry_area' structure must be 2MB in size and 2MB + aligned so that it can be mapped by setting a single PMD + entry. This consumes nearly 2MB of RAM once the kernel + is decompressed, but no space in the kernel image itself. + +2. Runtime Cost + a. CR3 manipulation to switch between the page table copies + must be done at interrupt, syscall, and exception entry + and exit (it can be skipped when the kernel is interrupted, + though.) Moves to CR3 are on the order of a hundred + cycles, and are required at every entry and exit. + b. A "trampoline" must be used for SYSCALL entry. This + trampoline depends on a smaller set of resources than the + non-PTI SYSCALL entry code, so requires mapping fewer + things into the userspace page tables. The downside is + that stacks must be switched at entry time. + d. Global pages are disabled for all kernel structures not + mapped into both kernel and userspace page tables. This + feature of the MMU allows different processes to share TLB + entries mapping the kernel. Losing the feature means more + TLB misses after a context switch. The actual loss of + performance is very small, however, never exceeding 1%. + d. Process Context IDentifiers (PCID) is a CPU feature that + allows us to skip flushing the entire TLB when switching page + tables by setting a special bit in CR3 when the page tables + are changed. This makes switching the page tables (at context + switch, or kernel entry/exit) cheaper. But, on systems with + PCID support, the context switch code must flush both the user + and kernel entries out of the TLB. The user PCID TLB flush is + deferred until the exit to userspace, minimizing the cost. + See intel.com/sdm for the gory PCID/INVPCID details. + e. The userspace page tables must be populated for each new + process. Even without PTI, the shared kernel mappings + are created by copying top-level (PGD) entries into each + new process. But, with PTI, there are now *two* kernel + mappings: one in the kernel page tables that maps everything + and one for the entry/exit structures. At fork(), we need to + copy both. + f. In addition to the fork()-time copying, there must also + be an update to the userspace PGD any time a set_pgd() is done + on a PGD used to map userspace. This ensures that the kernel + and userspace copies always map the same userspace + memory. + g. On systems without PCID support, each CR3 write flushes + the entire TLB. That means that each syscall, interrupt + or exception flushes the TLB. + h. INVPCID is a TLB-flushing instruction which allows flushing + of TLB entries for non-current PCIDs. Some systems support + PCIDs, but do not support INVPCID. On these systems, addresses + can only be flushed from the TLB for the current PCID. When + flushing a kernel address, we need to flush all PCIDs, so a + single kernel address flush will require a TLB-flushing CR3 + write upon the next use of every PCID. + +Possible Future Work +==================== +1. We can be more careful about not actually writing to CR3 + unless its value is actually changed. +2. Allow PTI to be enabled/disabled at runtime in addition to the + boot-time switching. + +Testing +======== + +To test stability of PTI, the following test procedure is recommended, +ideally doing all of these in parallel: + +1. Set CONFIG_DEBUG_ENTRY=y +2. Run several copies of all of the tools/testing/selftests/x86/ tests + (excluding MPX and protection_keys) in a loop on multiple CPUs for + several minutes. These tests frequently uncover corner cases in the + kernel entry code. In general, old kernels might cause these tests + themselves to crash, but they should never crash the kernel. +3. Run the 'perf' tool in a mode (top or record) that generates many + frequent performance monitoring non-maskable interrupts (see "NMI" + in /proc/interrupts). This exercises the NMI entry/exit code which + is known to trigger bugs in code paths that did not expect to be + interrupted, including nested NMIs. Using "-c" boosts the rate of + NMIs, and using two -c with separate counters encourages nested NMIs + and less deterministic behavior. + + while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done + +4. Launch a KVM virtual machine. +5. Run 32-bit binaries on systems supporting the SYSCALL instruction. + This has been a lightly-tested code path and needs extra scrutiny. + +Debugging +========= + +Bugs in PTI cause a few different signatures of crashes +that are worth noting here. + + * Failures of the selftests/x86 code. Usually a bug in one of the + more obscure corners of entry_64.S + * Crashes in early boot, especially around CPU bringup. Bugs + in the trampoline code or mappings cause these. + * Crashes at the first interrupt. Caused by bugs in entry_64.S, + like screwing up a page table switch. Also caused by + incorrectly mapping the IRQ handler entry code. + * Crashes at the first NMI. The NMI code is separate from main + interrupt handlers and can have bugs that do not affect + normal interrupts. Also caused by incorrectly mapping NMI + code. NMIs that interrupt the entry code must be very + careful and can be the cause of crashes that show up when + running perf. + * Kernel crashes at the first exit to userspace. entry_64.S + bugs, or failing to map some of the exit code. + * Crashes at first interrupt that interrupts userspace. The paths + in entry_64.S that return to userspace are sometimes separate + from the ones that return to the kernel. + * Double faults: overflowing the kernel stack because of page + faults upon page faults. Caused by touching non-pti-mapped + data in the entry code, or forgetting to switch to kernel + CR3 before calling into C functions which are not pti-mapped. + * Userspace segfaults early in boot, sometimes manifesting + as mount(8) failing to mount the rootfs. These have + tended to be TLB invalidation issues. Usually invalidating + the wrong PCID, or otherwise missing an invalidation. + +1. https://gruss.cc/files/kaiser.pdf +2. https://meltdownattack.com/meltdown.pdf diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index ad41b3813f0a..ea91cb61a602 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -12,8 +12,9 @@ ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB) ... unused hole ... ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB) ... unused hole ... -fffffe0000000000 - fffffe7fffffffff (=39 bits) LDT remap for PTI -fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping + vaddr_end for KASLR +fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping +fffffe8000000000 - fffffeffffffffff (=39 bits) LDT remap for PTI ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks ... unused hole ... ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space @@ -37,13 +38,15 @@ ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB) ... unused hole ... ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB) ... unused hole ... -fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping + vaddr_end for KASLR +fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping +... unused hole ... ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks ... unused hole ... ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space ... unused hole ... ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0 -ffffffffa0000000 - [fixmap start] (~1526 MB) module mapping space +ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space [fixmap start] - ffffffffff5fffff kernel-internal fixmap range ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole @@ -67,9 +70,10 @@ memory window (this size is arbitrary, it can be raised later if needed). The mappings are not part of any other kernel PGD and are only available during EFI runtime calls. -The module mapping space size changes based on the CONFIG requirements for the -following fixmap section. - Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all physical memory, vmalloc/ioremap space and virtual memory map are randomized. Their order is preserved but their base will be offset early at boot time. + +Be very careful vs. KASLR when changing anything here. The KASLR address +range must not overlap with anything except the KASAN shadow area, which is +correct as KASAN disables KASLR. |