summaryrefslogtreecommitdiff
path: root/virt/kvm
AgeCommit message (Collapse)AuthorFilesLines
2015-02-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds8-885/+2488
Pull KVM update from Paolo Bonzini: "Fairly small update, but there are some interesting new features. Common: Optional support for adding a small amount of polling on each HLT instruction executed in the guest (or equivalent for other architectures). This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This also has to be enabled manually for now, but the plan is to auto-tune this in the future. ARM/ARM64: The highlights are support for GICv3 emulation and dirty page tracking s390: Several optimizations and bugfixes. Also a first: a feature exposed by KVM (UUID and long guest name in /proc/sysinfo) before it is available in IBM's hypervisor! :) MIPS: Bugfixes. x86: Support for PML (page modification logging, a new feature in Broadwell Xeons that speeds up dirty page tracking), nested virtualization improvements (nested APICv---a nice optimization), usual round of emulation fixes. There is also a new option to reduce latency of the TSC deadline timer in the guest; this needs to be tuned manually. Some commits are common between this pull and Catalin's; I see you have already included his tree. Powerpc: Nothing yet. The KVM/PPC changes will come in through the PPC maintainers, because I haven't received them yet and I might end up being offline for some part of next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: ia64: drop kvm.h from installed user headers KVM: x86: fix build with !CONFIG_SMP KVM: x86: emulate: correct page fault error code for NoWrite instructions KVM: Disable compat ioctl for s390 KVM: s390: add cpu model support KVM: s390: use facilities and cpu_id per KVM KVM: s390/CPACF: Choose crypto control block format s390/kernel: Update /proc/sysinfo file with Extended Name and UUID KVM: s390: reenable LPP facility KVM: s390: floating irqs: fix user triggerable endless loop kvm: add halt_poll_ns module parameter kvm: remove KVM_MMIO_SIZE KVM: MIPS: Don't leak FPU/DSP to guest KVM: MIPS: Disable HTW while in guest KVM: nVMX: Enable nested posted interrupt processing KVM: nVMX: Enable nested virtual interrupt delivery KVM: nVMX: Enable nested apic register virtualization KVM: nVMX: Make nested control MSRs per-cpu KVM: nVMX: Enable nested virtualize x2apic mode KVM: nVMX: Prepare for using hardware MSR bitmap ...
2015-02-11mm: gup: kvm use get_user_pages_unlockedAndrea Arcangeli2-47/+5
Use the more generic get_user_pages_unlocked which has the additional benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault (which allows the first page fault in an unmapped area to be always able to block indefinitely by being allowed to release the mmap_sem). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-09KVM: Disable compat ioctl for s390Christian Borntraeger2-6/+10
We never had a 31bit QEMU/kuli running. We would need to review several ioctls to check if this creates holes, bugs or whatever to make it work. Lets just disable compat support for KVM on s390. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-06kvm: add halt_poll_ns module parameterPaolo Bonzini1-7/+41
This patch introduces a new module parameter for the KVM module; when it is present, KVM attempts a bit of polling on every HLT before scheduling itself out via kvm_vcpu_block. This parameter helps a lot for latency-bound workloads---in particular I tested it with O_DSYNC writes with a battery-backed disk in the host. In this case, writes are fast (because the data doesn't have to go all the way to the platters) but they cannot be merged by either the host or the guest. KVM's performance here is usually around 30% of bare metal, or 50% if you use cache=directsync or cache=writethrough (these parameters avoid that the guest sends pointless flush requests, and at the same time they are not slow because of the battery-backed cache). The bad performance happens because on every halt the host CPU decides to halt itself too. When the interrupt comes, the vCPU thread is then migrated to a new physical CPU, and in general the latency is horrible because the vCPU thread has to be scheduled back in. With this patch performance reaches 60-65% of bare metal and, more important, 99% of what you get if you use idle=poll in the guest. This means that the tunable gets rid of this particular bottleneck, and more work can be done to improve performance in the kernel or QEMU. Of course there is some price to pay; every time an otherwise idle vCPUs is interrupted by an interrupt, it will poll unnecessarily and thus impose a little load on the host. The above results were obtained with a mostly random value of the parameter (500000), and the load was around 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU. The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune the parameter. It counts how many HLT instructions received an interrupt during the polling period; each successful poll avoids that Linux schedules the VCPU thread out and back in, and may also avoid a likely trip to C1 and back for the physical CPU. While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second. Of these halts, almost all are failed polls. During the benchmark, instead, basically all halts end within the polling period, except a more or less constant stream of 50 per second coming from vCPUs that are not running the benchmark. The wasted time is thus very low. Things may be slightly different for Windows VMs, which have a ~10 ms timer tick. The effect is also visible on Marcelo's recently-introduced latency test for the TSC deadline timer. Though of course a non-RT kernel has awful latency bounds, the latency of the timer is around 8000-10000 clock cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC deadline timer, thus, the effect is both a smaller average latency and a smaller variance. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log ↵Kai Huang1-1/+1
dirty We don't have to write protect guest memory for dirty logging if architecture supports hardware dirty logging, such as PML on VMX, so rename it to be more generic. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-27kvm: update_memslots: clean flags for invalid memslotsTiejun Chen1-0/+1
Indeed, any invalid memslots should be new->npages = 0, new->base_gfn = 0 and new->flags = 0 at the same time. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-23KVM: Remove unused config symbolChristoffer Dall1-3/+0
The dirty patch logging series introduced both HAVE_KVM_ARCH_DIRTY_LOG_PROTECT and KVM_GENERIC_DIRTYLOG_READ_PROTECT config symbols, but only KVM_GENERIC_DIRTYLOG_READ_PROTECT is used. Just remove the unused one. (The config symbol was renamed during the development of the patch series and the old name just creeped in by accident.() Reported-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: force alignment of VGIC dist/CPU/redist addressesAndre Przywara1-3/+13
Although the GIC architecture requires us to map the MMIO regions only at page aligned addresses, we currently do not enforce this from the kernel side. Restrict any vGICv2 regions to be 4K aligned and any GICv3 regions to be 64K aligned. Document this requirement. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: allow userland to request a virtual GICv3Andre Przywara2-13/+36
With all of the GICv3 code in place now we allow userland to ask the kernel for using a virtual GICv3 in the guest. Also we provide the necessary support for guests setting the memory addresses for the virtual distributor and redistributors. This requires some userland code to make use of that feature and explicitly ask for a virtual GICv3. Document that KVM_CREATE_IRQCHIP only works for GICv2, but is considered legacy and using KVM_CREATE_DEVICE is preferred. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: enable kernel side of GICv3 emulationAndre Przywara3-25/+66
With all the necessary GICv3 emulation code in place, we can now connect the code to the GICv3 backend in the kernel. The LR register handling is different depending on the emulated GIC model, so provide different implementations for each. Also allow non-v2-compatible GICv3 implementations (which don't provide MMIO regions for the virtual CPU interface in the DT), but restrict those hosts to support GICv3 guests only. If the device tree provides a GICv2 compatible GICV resource entry, but that one is faulty, just disable the GICv2 emulation and let the user use at least the GICv3 emulation for guests. To provide proper support for the legacy KVM_CREATE_IRQCHIP ioctl, note virtual GICv2 compatibility in struct vgic_params and use it on creating a VGICv2. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm64: KVM: add SGI generation register emulationAndre Przywara1-0/+111
While the generation of a (virtual) inter-processor interrupt (SGI) on a GICv2 works by writing to a MMIO register, GICv3 uses the system register ICC_SGI1R_EL1 to trigger them. Add a trap handler function that calls the new SGI register handler in the GICv3 code. As ICC_SRE_EL1.SRE at this point is still always 0, this will not trap yet, but will only be used later when all the data structures have been initialized properly. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: add virtual GICv3 distributor emulationAndre Przywara3-2/+934
With everything separated and prepared, we implement a model of a GICv3 distributor and redistributors by using the existing framework to provide handler functions for each register group. Currently we limit the emulation to a model enforcing a single security state, with SRE==1 (forcing system register access) and ARE==1 (allowing more than 8 VCPUs). We share some of the functions provided for GICv2 emulation, but take the different ways of addressing (v)CPUs into account. Save and restore is currently not implemented. Similar to the split-off of the GICv2 specific code, the new emulation code goes into a new file (vgic-v3-emul.c). Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: add opaque private pointer to MMIO dataAndre Przywara1-0/+1
For a GICv2 there is always only one (v)CPU involved: the one that does the access. On a GICv3 the access to a CPU redistributor is memory-mapped, but not banked, so the (v)CPU affected is determined by looking at the MMIO address region being accessed. To allow passing the affected CPU into the accessors later, extend struct kvm_exit_mmio to add an opaque private pointer parameter. The current GICv2 emulation just does not use it. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: split GICv2 specific emulation code from vgic.cAndre Przywara2-805/+848
vgic.c is currently a mixture of generic vGIC emulation code and functions specific to emulating a GICv2. To ease the addition of GICv3, split off strictly v2 specific parts into a new file vgic-v2-emul.c. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> ------- As the diff isn't always obvious here (and to aid eventual rebases), here is a list of high-level changes done to the code: * added new file to respective arm/arm64 Makefiles * moved GICv2 specific functions to vgic-v2-emul.c: - handle_mmio_misc() - handle_mmio_set_enable_reg() - handle_mmio_clear_enable_reg() - handle_mmio_set_pending_reg() - handle_mmio_clear_pending_reg() - handle_mmio_priority_reg() - vgic_get_target_reg() - vgic_set_target_reg() - handle_mmio_target_reg() - handle_mmio_cfg_reg() - handle_mmio_sgi_reg() - vgic_v2_unqueue_sgi() - read_set_clear_sgi_pend_reg() - write_set_clear_sgi_pend_reg() - handle_mmio_sgi_set() - handle_mmio_sgi_clear() - vgic_v2_handle_mmio() - vgic_get_sgi_sources() - vgic_dispatch_sgi() - vgic_v2_queue_sgi() - vgic_v2_map_resources() - vgic_v2_init() - vgic_v2_add_sgi_source() - vgic_v2_init_model() - vgic_v2_init_emulation() - handle_cpu_mmio_misc() - handle_mmio_abpr() - handle_cpu_mmio_ident() - vgic_attr_regs_access() - vgic_create() (renamed to vgic_v2_create()) - vgic_destroy() (renamed to vgic_v2_destroy()) - vgic_has_attr() (renamed to vgic_v2_has_attr()) - vgic_set_attr() (renamed to vgic_v2_set_attr()) - vgic_get_attr() (renamed to vgic_v2_get_attr()) - struct kvm_mmio_range vgic_dist_ranges[] - struct kvm_mmio_range vgic_cpu_ranges[] - struct kvm_device_ops kvm_arm_vgic_v2_ops {} Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: add vgic.h header fileAndre Przywara2-101/+170
vgic.c is currently a mixture of generic vGIC emulation code and functions specific to emulating a GICv2. To ease the addition of GICv3 later, we create new header file vgic.h, which holds constants and prototypes of commonly used functions. Rename some identifiers to avoid name space clutter. I removed the long-standing comment about using the kvm_io_bus API to tackle the GIC register ranges, as it wouldn't be a win for us anymore. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> ------- As the diff isn't always obvious here (and to aid eventual rebases), here is a list of high-level changes done to the code: * moved definitions and prototypes from vgic.c to vgic.h: - VGIC_ADDR_UNDEF - ACCESS_{READ,WRITE}_* - vgic_init() - vgic_update_state() - vgic_kick_vcpus() - vgic_get_vmcr() - vgic_set_vmcr() - struct mmio_range {} (renamed to struct kvm_mmio_range) * removed static keyword and exported prototype in vgic.h: - vgic_bitmap_get_reg() - vgic_bitmap_set_irq_val() - vgic_bitmap_get_shared_map() - vgic_bytemap_get_reg() - vgic_dist_irq_set_pending() - vgic_dist_irq_clear_pending() - vgic_cpu_irq_clear() - vgic_reg_access() - handle_mmio_raz_wi() - vgic_handle_enable_reg() - vgic_handle_set_pending_reg() - vgic_handle_clear_pending_reg() - vgic_handle_cfg_reg() - vgic_unqueue_irqs() - find_matching_range() (renamed to vgic_find_range) - vgic_handle_mmio_range() - vgic_update_state() - vgic_get_vmcr() - vgic_set_vmcr() - vgic_queue_irq() - vgic_kick_vcpus() - vgic_init() - vgic_v2_init_emulation() - vgic_has_attr_regs() - vgic_set_common_attr() - vgic_get_common_attr() - vgic_destroy() - vgic_create() * moved functions to vgic.h (static inline): - mmio_data_read() - mmio_data_write() - is_in_range() Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: refactor/wrap vgic_set/get_attr()Andre Przywara1-24/+54
vgic_set_attr() and vgic_get_attr() contain both code specific for the emulated GIC as well as code for the userland facing, generic part of the GIC. Split the guest GIC facing code of from the generic part to allow easier splitting later. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: refactor MMIO accessorsAndre Przywara1-52/+74
The MMIO accessors for GICD_I[CS]ENABLER, GICD_I[CS]PENDR and GICD_ICFGR behave very similar for GICv2 and GICv3, although the way the affected VCPU is determined differs. Since we need them to access the registers from three different places in the future, we factor out a generic, backend-facing implementation and use small wrappers in the current GICv2 emulation. This will ease adding GICv3 accessors later. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: make the value of ICC_SRE_EL1 a per-VM variableAndre Przywara1-2/+6
ICC_SRE_EL1 is a system register allowing msr/mrs accesses to the GIC CPU interface for EL1 (guests). Currently we force it to 0, but for proper GICv3 support we have to allow guests to use it (depending on their selected virtual GIC model). So add ICC_SRE_EL1 to the list of saved/restored registers on a world switch, but actually disallow a guest to change it by only restoring a fixed, once-initialized value. This value depends on the GIC model userland has chosen for a guest. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: make the maximum number of vCPUs a per-VM valueAndre Przywara3-0/+18
Currently the maximum number of vCPUs supported is a global value limited by the used GIC model. GICv3 will lift this limit, but we still need to observe it for guests using GICv2. So the maximum number of vCPUs is per-VM value, depending on the GIC model the guest uses. Store and check the value in struct kvm_arch, but keep it down to 8 for now. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: dont rely on a valid GICH base addressAndre Przywara1-1/+1
To check whether the vGIC was already initialized, we currently check the GICH base address for not being NULL. Since with GICv3 we may get along without this address, lets use the irqchip_in_kernel() function to detect an already initialized vGIC. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: move kvm_register_device_ops() into vGIC probingAndre Przywara3-3/+5
Currently we unconditionally register the GICv2 emulation device during the host's KVM initialization. Since with GICv3 support we may end up with only v2 or only v3 or both supported, we move the registration into the GIC probing function, where we will later know which combination is valid. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: introduce per-VM opsAndre Przywara1-10/+76
Currently we only have one virtual GIC model supported, so all guests use the same emulation code. With the addition of another model we end up with different guests using potentially different vGIC models, so we have to split up some functions to be per VM. Introduce a vgic_vm_ops struct to hold function pointers for those functions that are different and provide the necessary code to initialize them. Also split up the vgic_init() function to separate out VGIC model specific functionality into a separate function, which will later be different for a GICv3 model. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit onesAndre Przywara1-3/+50
Some GICv3 registers can and will be accessed as 64 bit registers. Currently the register handling code can only deal with 32 bit accesses, so we do two consecutive calls to cover this. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: refactor vgic_handle_mmio() functionAndre Przywara1-20/+53
Currently we only need to deal with one MMIO region for the GIC emulation (the GICv2 distributor), but we soon need to extend this. Refactor the existing code to allow easier addition of different ranges without code duplication. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-20arm/arm64: KVM: pass down user space provided GIC type into vGIC codeAndre Przywara1-2/+13
With the introduction of a second emulated GIC model we need to let userspace specify the GIC model to use for each VM. Pass the userspace provided value down into the vGIC code and store it there to differentiate later. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-16KVM: Add generic support for dirty page loggingMario Smarduch2-0/+86
kvm_get_dirty_log() provides generic handling of dirty bitmap, currently reused by several architectures. Building on that we intrdoduce kvm_get_dirty_log_protect() adding write protection to mark these pages dirty for future write access, before next KVM_GET_DIRTY_LOG ioctl call from user space. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
2015-01-16KVM: Add architecture-defined TLB flush supportMario Smarduch2-0/+5
Allow architectures to override the generic kvm_flush_remote_tlbs() function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to provide its own TLB flush interface. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
2015-01-11KVM: arm/arm64: vgic: add init entry to VGIC KVM deviceEric Auger1-1/+13
Since the advent of VGIC dynamic initialization, this latter is initialized quite late on the first vcpu run or "on-demand", when injecting an IRQ or when the guest sets its registers. This initialization could be initiated explicitly much earlier by the users-space, as soon as it has provided the requested dimensioning parameters. This patch adds a new entry to the VGIC KVM device that allows the user to manually request the VGIC init: - a new KVM_DEV_ARM_VGIC_GRP_CTRL group is introduced. - Its first attribute is KVM_DEV_ARM_VGIC_CTRL_INIT The rationale behind introducing a group is to be able to add other controls later on, if needed. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-11KVM: arm/arm64: vgic: vgic_init returns -ENODEV when no online vcpuEric Auger1-1/+1
To be more explicit on vgic initialization failure, -ENODEV is returned by vgic_init when no online vcpus can be found at init. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-01-08KVM: nVMX: Add nested msr load/restore algorithmWincy Van1-0/+1
Several hypervisors need MSR auto load/restore feature. We read MSRs from VM-entry MSR load area which specified by L1, and load them via kvm_set_msr in the nested entry. When nested exit occurs, we get MSRs via kvm_get_msr, writing them to L1`s MSR store area. After this, we read MSRs from VM-exit MSR load area, and load them via kvm_set_msr. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-30timecounter: keep track of accumulated fractional nanosecondsRichard Cochran1-1/+2
The current timecounter implementation will drop a variable amount of resolution, depending on the magnitude of the time delta. In other words, reading the clock too often or too close to a time stamp conversion will introduce errors into the time values. This patch fixes the issue by introducing a fractional nanosecond field that accumulates the low order bits. Reported-by: Janusz Użycki <j.uzycki@elproma.com.pl> Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-28kvm: warn on more invariant breakagePaolo Bonzini1-1/+3
Modifying a non-existent slot is not allowed. Also check that the first loop doesn't move a deleted slot beyond the used part of the mslots array. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-28kvm: fix sorting of memslots with base_gfn == 0Paolo Bonzini1-5/+17
Before commit 0e60b0799fed (kvm: change memslot sorting rule from size to GFN, 2014-12-01), the memslots' sorting key was npages, meaning that a valid memslot couldn't have its sorting key equal to zero. On the other hand, a valid memslot can have base_gfn == 0, and invalid memslots are identified by base_gfn == npages == 0. Because of this, commit 0e60b0799fed broke the invariant that invalid memslots are at the end of the mslots array. When a memslot with base_gfn == 0 was created, any invalid memslot before it were left in place. This can be fixed by changing the insertion to use a ">=" comparison instead of "<=", but some care is needed to avoid breaking the case of deleting a memslot; see the comment in update_memslots. Thanks to Tiejun Chen for posting an initial patch for this bug. Reported-by: Jamie Heilman <jamie@audible.transient.net> Reported-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Jamie Heilman <jamie@audible.transient.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds9-2690/+140
Pull KVM update from Paolo Bonzini: "3.19 changes for KVM: - spring cleaning: removed support for IA64, and for hardware- assisted virtualization on the PPC970 - ARM, PPC, s390 all had only small fixes For x86: - small performance improvements (though only on weird guests) - usual round of hardware-compliancy fixes from Nadav - APICv fixes - XSAVES support for hosts and guests. XSAVES hosts were broken because the (non-KVM) XSAVES patches inadvertently changed the KVM userspace ABI whenever XSAVES was enabled; hence, this part is going to stable. Guest support is just a matter of exposing the feature and CPUID leaves support" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits) KVM: move APIC types to arch/x86/ KVM: PPC: Book3S: Enable in-kernel XICS emulation by default KVM: PPC: Book3S HV: Improve H_CONFER implementation KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register KVM: PPC: Book3S HV: Remove code for PPC970 processors KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions KVM: PPC: Book3S HV: Simplify locking around stolen time calculations arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function arch: powerpc: kvm: book3s_pr.c: Remove unused function arch: powerpc: kvm: book3s.c: Remove some unused functions arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked KVM: PPC: Book3S HV: ptes are big endian KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI KVM: PPC: Book3S HV: Fix KSM memory corruption KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI KVM: PPC: Book3S HV: Fix computation of tlbie operand KVM: PPC: Book3S HV: Add missing HPTE unlock KVM: PPC: BookE: Improve irq inject tracepoint arm/arm64: KVM: Require in-kernel vgic for the arch timers ...
2014-12-15Merge tag 'kvm-arm-for-3.19-take2' of ↵Paolo Bonzini3-72/+90
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot problems, clarifies VCPU init, and fixes a regression concerning the VGIC init flow. Conflicts: arch/ia64/kvm/kvm-ia64.c [deleted in HEAD and modified in kvmarm]
2014-12-15arm/arm64: KVM: Require in-kernel vgic for the arch timersChristoffer Dall1-8/+22
It is curently possible to run a VM with architected timers support without creating an in-kernel VGIC, which will result in interrupts from the virtual timer going nowhere. To address this issue, move the architected timers initialization to the time when we run a VCPU for the first time, and then only initialize (and enable) the architected timers if we have a properly created and initialized in-kernel VGIC. When injecting interrupts from the virtual timer to the vgic, the current setup should ensure that this never calls an on-demand init of the VGIC, which is the only call path that could return an error from kvm_vgic_inject_irq(), so capture the return value and raise a warning if there's an error there. We also change the kvm_timer_init() function from returning an int to be a void function, since the function always succeeds. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-12-15arm/arm64: KVM: Initialize the vgic on-demand when injecting IRQsChristoffer Dall1-6/+16
Userspace assumes that it can wire up IRQ injections after having created all VCPUs and after having created the VGIC, but potentially before starting the first VCPU. This can currently lead to lost IRQs because the state of that IRQ injection is not stored anywhere and we don't return an error to userspace. We haven't seen this problem manifest itself yet, presumably because guests reset the devices on boot, but this could cause issues with migration and other non-standard startup configurations. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-12-13arm/arm64: KVM: Add (new) vgic_initialized macroChristoffer Dall1-1/+2
Some code paths will need to check to see if the internal state of the vgic has been initialized (such as when creating new VCPUs), so introduce such a macro that checks the nr_cpus field which is set when the vgic has been initialized. Also set nr_cpus = 0 in kvm_vgic_destroy, because the error path in vgic_init() will call this function, and code should never errornously assume the vgic to be properly initialized after an error. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-12-13arm/arm64: KVM: Rename vgic_initialized to vgic_readyChristoffer Dall1-3/+3
The vgic_initialized() macro currently returns the state of the vgic->ready flag, which indicates if the vgic is ready to be used when running a VM, not specifically if its internal state has been initialized. Rename the macro accordingly in preparation for a more nuanced initialization flow. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-12-13arm/arm64: KVM: vgic: move reset initialization into vgic_init_maps()Peter Maydell1-45/+32
VGIC initialization currently happens in three phases: (1) kvm_vgic_create() (triggered by userspace GIC creation) (2) vgic_init_maps() (triggered by userspace GIC register read/write requests, or from kvm_vgic_init() if not already run) (3) kvm_vgic_init() (triggered by first VM run) We were doing initialization of some state to correspond with the state of a freshly-reset GIC in kvm_vgic_init(); this is too late, since it will overwrite changes made by userspace using the register access APIs before the VM is run. Move this initialization earlier, into the vgic_init_maps() phase. This fixes a bug where QEMU could successfully restore a saved VM state snapshot into a VM that had already been run, but could not restore it "from cold" using the -loadvm command line option (the symptoms being that the restored VM would run but interrupts were ignored). Finally rename vgic_init_maps to vgic_init and renamed kvm_vgic_init to kvm_vgic_map_resources. [ This patch is originally written by Peter Maydell, but I have modified it somewhat heavily, renaming various bits and moving code around. If something is broken, I am to be blamed. - Christoffer ] Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-12-04KVM: track pid for VCPU only on KVM_RUN ioctlChristian Borntraeger1-9/+9
We currently track the pid of the task that runs the VCPU in vcpu_load. If a yield to that VCPU is triggered while the PID of the wrong thread is active, the wrong thread might receive a yield, but this will most likely not help the executing thread at all. Instead, if we only track the pid on the KVM_RUN ioctl, there are two possibilities: 1) the thread that did a non-KVM_RUN ioctl is holding a mutex that the VCPU thread is waiting for. In this case, the VCPU thread is not runnable, but we also do not do a wrong yield. 2) the thread that did a non-KVM_RUN ioctl is sleeping, or doing something that does not block the VCPU thread. In this case, the VCPU thread can receive the directed yield correctly. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> CC: Michael Mueller <mimu@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-04KVM: don't check for PF_VCPU when yieldingDavid Hildenbrand1-4/+0
kvm_enter_guest() has to be called with preemption disabled and will set PF_VCPU. Current code takes PF_VCPU as a hint that the VCPU thread is running and therefore needs no yield. However, the check on PF_VCPU is wrong on s390, where preemption has to stay enabled in order to correctly process page faults. Thus, s390 reenables preemption and starts to execute the guest. The thread might be scheduled out between kvm_enter_guest() and kvm_exit_guest(), resulting in PF_VCPU being set but not being run. When this happens, the opportunity for directed yield is missed. However, this check is done already in kvm_vcpu_on_spin before calling kvm_vcpu_yield_loop: if (!ACCESS_ONCE(vcpu->preempted)) continue; so the check on PF_VCPU is superfluous in general, and this patch removes it. Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-04kvm: optimize GFN to memslot lookup with large slots amountIgor Mammedov1-1/+7
Current linear search doesn't scale well when large amount of memslots is used and looked up slot is not in the beginning memslots array. Taking in account that memslots don't overlap, it's possible to switch sorting order of memslots array from 'npages' to 'base_gfn' and use binary search for memslot lookup by GFN. As result of switching to binary search lookup times are reduced with large amount of memslots. Following is a table of search_memslot() cycles during WS2008R2 guest boot. boot, boot + ~10 min mostly same of using it, slot lookup randomized lookup max average average cycles cycles cycles 13 slots : 1450 28 30 13 slots : 1400 30 40 binary search 117 slots : 13000 30 460 117 slots : 2000 35 180 binary search Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-04kvm: change memslot sorting rule from size to GFNIgor Mammedov1-6/+11
it will allow to use binary search for GFN -> memslot lookups, reducing lookup cost with large slots amount. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-04kvm: update_memslots: drop not needed check for the same slotIgor Mammedov1-14/+11
UP/DOWN shift loops will shift array in needed direction and stop at place where new slot should be placed regardless of old slot size. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-04kvm: update_memslots: drop not needed check for the same number of pagesIgor Mammedov1-15/+13
if number of pages haven't changed sorting algorithm will do nothing, so there is no need to do extra check to avoid entering sorting logic. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-26kvm: fix kvm_is_mmio_pfn() and rename to kvm_is_reserved_pfn()Ard Biesheuvel1-8/+8
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn. The problem being addressed by the patch above was that some ARM code based the memory mapping attributes of a pfn on the return value of kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should be mapped as device memory. However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin, and the existing non-ARM users were already using it in a way which suggests that its name should probably have been 'kvm_is_reserved_pfn' from the beginning, e.g., whether or not to call get_page/put_page on it etc. This means that returning false for the zero page is a mistake and the patch above should be reverted. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-26arm/arm64: KVM: vgic: Fix error code in kvm_vgic_create()Christoffer Dall1-4/+4
If we detect another vCPU is running we just exit and return 0 as if we succesfully created the VGIC, but the VGIC wouldn't actual be created. This shouldn't break in-kernel behavior because the kernel will not observe the failed the attempt to create the VGIC, but userspace could be rightfully confused. Cc: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-26arm/arm64: KVM: vgic: kick the specific vcpu instead of iterating through allShannon Zhao1-5/+10
When call kvm_vgic_inject_irq to inject interrupt, we can known which vcpu the interrupt for by the irq_num and the cpuid. So we should just kick this vcpu to avoid iterating through all. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <zhaoshenglong@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2014-11-25arm/arm64: vgic: Remove unreachable irq_clear_pendingChristoffer Dall1-2/+0
When 'injecting' an edge-triggered interrupt with a falling edge we shouldn't clear the pending state on the distributor. In fact, we don't, because the check in vgic_validate_injection would prevent us from ever reaching this bit of code. Remove the unreachable snippet. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>