summaryrefslogtreecommitdiff
path: root/arch/powerpc/kvm/book3s_64_mmu_hv.c
AgeCommit message (Collapse)AuthorFilesLines
2017-11-23KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hostsPaul Mackerras1-14/+23
This fixes two errors that prevent a guest using the HPT MMU from successfully migrating to a POWER9 host in radix MMU mode, or resizing its HPT when running on a radix host. The first bug was that commit 8dc6cca556e4 ("KVM: PPC: Book3S HV: Don't rely on host's page size information", 2017-09-11) missed two uses of hpte_base_page_size(), one in the HPT rehashing code and one in kvm_htab_write() (which is used on the destination side in migrating a HPT guest). Instead we use kvmppc_hpte_base_page_shift(). Having the shift count means that we can use left and right shifts instead of multiplication and division in a few places. Along the way, this adds a check in kvm_htab_write() to ensure that the page size encoding in the incoming HPTEs is recognized, and if not return an EINVAL error to userspace. The second bug was that kvm_htab_write was performing some but not all of the functions of kvmhv_setup_mmu(), resulting in the destination VM being left in radix mode as far as the hardware is concerned. The simplest fix for now is make kvm_htab_write() call kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does. In future it would be better to refactor the code more extensively to remove the duplication. Fixes: 8dc6cca556e4 ("KVM: PPC: Book3S HV: Don't rely on host's page size information") Fixes: 7a84084c6054 ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9") Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-09Merge branch 'kvm-ppc-fixes' into kvm-ppc-nextPaul Mackerras1-0/+10
This merges in a couple of fixes from the kvm-ppc-fixes branch that modify the same areas of code as some commits from the kvm-ppc-next branch, in order to resolve the conflicts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-08KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updatesPaul Mackerras1-0/+10
Commit 5e9859699aba ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation", 2016-12-20) added code that tries to exclude any use or update of the hashed page table (HPT) while the HPT resizing code is iterating through all the entries in the HPT. It does this by taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done flag and then sending an IPI to all CPUs in the host. The idea is that any VCPU task that tries to enter the guest will see that the hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma, which also takes the kvm->lock mutex and will therefore block until we release kvm->lock. However, any VCPU that is already in the guest, or is handling a hypervisor page fault or hypercall, can re-enter the guest without rechecking the hpte_setup_done flag. The IPI will cause a guest exit of any VCPUs that are currently in the guest, but does not prevent those VCPU tasks from immediately re-entering the guest. The result is that after resize_hpt_rehash_hpte() has made a HPTE absent, a hypervisor page fault can occur and make that HPTE present again. This includes updating the rmap array for the guest real page, meaning that we now have a pointer in the rmap array which connects with pointers in the old rev array but not the new rev array. In fact, if the HPT is being reduced in size, the pointer in the rmap array could point outside the bounds of the new rev array. If that happens, we can get a host crash later on such as this one: [91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c [91652.628668] Faulting instruction address: 0xc0000000000e2640 [91652.628736] Oops: Kernel access of bad area, sig: 11 [#1] [91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV [91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259] [91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G O 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1 [91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000 [91652.629750] NIP: c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0 [91652.629829] REGS: c0000000028db5d0 TRAP: 0300 Tainted: G O (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le) [91652.629932] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022422 XER: 00000000 [91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1 [91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000 [91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000 [91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70 [91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000 [91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10 [91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000 [91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000 [91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100 [91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120 [91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv] [91652.630884] Call Trace: [91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable) [91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv] [91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv] [91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm] [91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm] [91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0 [91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130 [91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c [91652.631635] Instruction dump: [91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110 [91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008 [91652.631959] ---[ end trace ac85ba6db72e5b2e ]--- To fix this, we tighten up the way that the hpte_setup_done flag is checked to ensure that it does provide the guarantee that the resizing code needs. In kvmppc_run_core(), we check the hpte_setup_done flag after disabling interrupts and refuse to enter the guest if it is clear (for a HPT guest). The code that checks hpte_setup_done and calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv() to a point inside the main loop in kvmppc_run_vcpu(), ensuring that we don't just spin endlessly calling kvmppc_run_core() while hpte_setup_done is clear, but instead have a chance to block on the kvm->lock mutex. Finally we also check hpte_setup_done inside the region in kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about to update the HPTE, and bail out if it is clear. If another CPU is inside kvm_vm_ioctl_resize_hpt_commit) and has cleared hpte_setup_done, then we know that either we are looking at a HPTE that resize_hpt_rehash_hpte() has not yet processed, which is OK, or else we will see hpte_setup_done clear and refuse to update it, because of the full barrier formed by the unlock of the HPTE in resize_hpt_rehash_hpte() combined with the locking of the HPTE in kvmppc_book3s_hv_page_fault(). Fixes: 5e9859699aba ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation") Cc: stable@vger.kernel.org # v4.10+ Reported-by: Satheesh Rajendran <satheera@in.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix hostPaul Mackerras1-10/+12
This sets up the machinery for switching a guest between HPT (hashed page table) and radix MMU modes, so that in future we can run a HPT guest on a radix host on POWER9 machines. * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or radix mode, on a radix host. * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9 with HV KVM on a radix host. * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a radix host. * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the guest to HPT mode and allocate a HPT. * For simplicity, we now allocate the rmap array for each memslot, even on a radix host, since it will be needed if the guest switches to HPT mode. * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN ioctl will return an EINVAL error in that case. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Unify dirty page map between HPT and radixPaul Mackerras1-28/+11
Currently, the HPT code in HV KVM maintains a dirty bit per guest page in the rmap array, whether or not dirty page tracking has been enabled for the memory slot. In contrast, the radix code maintains a dirty bit per guest page in memslot->dirty_bitmap, and only does so when dirty page tracking has been enabled. This changes the HPT code to maintain the dirty bits in the memslot dirty_bitmap like radix does. This results in slightly less code overall, and will mean that we do not lose the dirty bits when transitioning between HPT and radix mode in future. There is one minor change to behaviour as a result. With HPT, when dirty tracking was enabled for a memslot, we would previously clear all the dirty bits at that point (both in the HPT entries and in the rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately following would show no pages as dirty (assuming no vcpus have run in the meantime). With this change, the dirty bits on HPT entries are not cleared at the point where dirty tracking is enabled, so KVM_GET_DIRTY_LOG would show as dirty any guest pages that are resident in the HPT and dirty. This is consistent with what happens on radix. This also fixes a bug in the mark_pages_dirty() function for radix (in the sense that the function no longer exists). In the case where a large page of 64 normal pages or more is marked dirty, the addressing of the dirty bitmap was incorrect and could write past the end of the bitmap. Fortunately this case was never hit in practice because a 2MB large page is only 32 x 64kB pages, and we don't support backing the guest with 1GB huge pages at this point. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_readyPaul Mackerras1-18/+18
This renames the kvm->arch.hpte_setup_done field to mmu_ready because we will want to use it for radix guests too -- both for setting things up before vcpu execution, and for excluding vcpus from executing while MMU-related things get changed, such as in future switching the MMU from radix to HPT mode or vice-versa. This also moves the call to kvmppc_setup_partition_table() that was done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting of mmu_ready, into the caller in kvmppc_vcpu_run_hv(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Don't rely on host's page size informationPaul Mackerras1-6/+7
This removes the dependence of KVM on the mmu_psize_defs array (which stores information about hardware support for various page sizes) and the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(), hpte_actual_page_size() and get_sllp_encoding(). We also no longer rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature bit. The reason for doing this is so we can support a HPT guest on a radix host. In a radix host, the mmu_psize_defs array contains information about page sizes supported by the MMU in radix mode rather than the page sizes supported by the MMU in HPT mode. Similarly, mmu_slb_size and the MMU_FTR_1T_SEGMENTS bit are not set. Instead we hard-code knowledge of the behaviour of the HPT MMU in the POWER7, POWER8 and POWER9 processors (which are the only processors supported by HV KVM) - specifically the encoding of the LP fields in the HPT and SLB entries, and the fact that they have 32 SLB entries and support 1TB segments. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-16KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guestsPaul Mackerras1-3/+10
This adds code to make sure that we don't try to access the non-existent HPT for a radix guest using the htab file for the VM in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls. At present nothing bad happens if userspace does access these interfaces on a radix guest, mostly because kvmppc_hpt_npte() gives 0 for a radix guest, which in turn is because 1 << -4 comes out as 0 on POWER processors. However, that relies on undefined behaviour, so it is better to be explicit about not accessing the HPT for a radix guest. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S HV: Delete an error message for a failed memory allocation ↵Markus Elfring1-1/+0
in kvmppc_allocate_hpt() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-09-01KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fdnixiaoming1-0/+1
We do ctx = kzalloc(sizeof(*ctx), GFP_KERNEL) and then later on call anon_inode_getfd(), but if that fails we don't free ctx, so that memory gets leaked. To fix it, this adds kfree(ctx) in the failure path. Signed-off-by: nixiaoming <nixiaoming@huawei.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras1-2/+3
This merges in the 'ppc-kvm' topic branch from the powerpc tree in order to bring in some fixes which touch both powerpc and KVM code. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-17powerpc/mm: Rename find_linux_pte_or_hugepte()Aneesh Kumar K.V1-2/+3
Add newer helpers to make the function usage simpler. It is always recommended to use find_current_mm_pte() for walking the page table. If we cannot use find_current_mm_pte(), it should be documented why the said usage of __find_linux_pte() is safe against a parallel THP split. For now we have KVM code using __find_linux_pte(). This is because kvm code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but with PACA soft_enabled = 1. We may want to fix that later and make sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that we can switch kvm to use find_linux_pte(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-24KVM: PPC: Book3S HV: Fix host crash on changing HPT sizePaul Mackerras1-1/+3
Commit f98a8bf9ee20 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size", 2016-12-20) changed the behaviour of the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT and new revmap array if there was a previously-allocated HPT of a different size from the size being requested. In this case, we need to reset the rmap arrays of the memslots, because the rmap arrays will contain references to HPTEs which are no longer valid. Worse, these references are also references to slots in the new revmap array (which parallels the HPT), and the new revmap array contains random contents, since it doesn't get zeroed on allocation. The effect of having these stale references to slots in the revmap array that contain random contents is that subsequent calls to functions such as kvmppc_add_revmap_chain will crash because they will interpret the non-zero contents of the revmap array as HPTE indexes and thus index outside of the revmap array. This leads to host crashes such as the following. [ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8 [ 7072.862218] Faulting instruction address: 0xc0000000000e1c78 [ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1] [ 7072.862286] SMP NR_CPUS=1024 [ 7072.862286] NUMA [ 7072.862325] PowerNV [ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry [ 7072.863085] nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv] [ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59 [ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000 [ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0 [ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300 Not tainted (4.12.0-kvm+) [ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]> [ 7072.863667] CR: 44082882 XER: 20000000 [ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1 GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0 GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000 GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78 GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000 GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190 GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000 GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048 [ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130 [ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0 [ 7072.864566] Call Trace: [ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable) [ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0 [ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv] [ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv] [ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm] [ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0 [ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130 [ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c [ 7072.865644] Instruction dump: [ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14 [ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000 [ 7072.866001] ---[ end trace 627b6e4bf8080edc ]--- Note that to trigger this, it is necessary to use a recent upstream QEMU (or other userspace that resizes the HPT at CAS time), specify a maximum memory size substantially larger than the current memory size, and boot a guest kernel that does not support HPT resizing. This fixes the problem by resetting the rmap arrays when the old HPT is freed. Fixes: f98a8bf9ee20 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size") Cc: stable@vger.kernel.org # v4.11+ Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-12mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko1-1/+1
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-06KVM: PPC: Book3S HV: Check for kmalloc errors in ioctlDan Carpenter1-0/+4
kzalloc() won't actually fail because sizeof(*resize) is small, but static checkers complain. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-03-09power/mm: update pte_write and pte_wrprotect to handle savedwriteAneesh Kumar K.V1-1/+1
We use pte_write() to check whethwer the pte entry is writable. This is mostly used to later mark the pte read only if it is writable. The other use of pte_write() is to check whether the pte_entry is writable so that hardware page table entry can be marked accordingly. This is used in kvm where we look at qemu page table entry and update hardware hash page table for the guest with correct write enable bit. With the above, for the first usage we should also check the savedwrite bit so that we can correctly clear the savedwite bit. For the later, we add a new variant __pte_write(). With this we can revert write_protect_page part of 595cd8f256d2 ("mm/ksm: handle protnone saved writes when making page write protect"). But I left it as it is as an example code for savedwrite check. Fixes: c137a2757b886 ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write") Link: http://lkml.kernel.org/r/1488203787-17849-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-18KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for nowPaul Mackerras1-0/+6
The new HPT resizing code added in commit b5baa6877315 ("KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation", 2016-12-20) doesn't have code to handle the new HPTE format which POWER9 uses. Thus it would be best not to advertise it to userspace on POWER9 systems until it works properly. Also, since resize_hpt_rehash_hpte() contains BUG_ON() calls that could be hit on POWER9, let's prevent it from being called on POWER9 for now. Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-17KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug messageThomas Huth1-2/+2
The average user likely does not know what a "htab" or "LPID" is, and it's annoying that these messages are quickly filling the dmesg log when you're doing boot cycle tests, so let's turn it into a debug message instead. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-16KVM: PPC: Book3S HV: Prevent double-free on HPT resize commit pathDavid Gibson1-0/+3
resize_hpt_release(), called once the HPT resize of a KVM guest is completed (successfully or unsuccessfully) frees the state structure for the resize. It is currently not safe to call with a NULL pointer. However, one of the error paths in kvm_vm_ioctl_resize_hpt_commit() can invoke it with a NULL pointer. This will occur if userspace improperly invokes KVM_PPC_RESIZE_HPT_COMMIT without previously calling KVM_PPC_RESIZE_HPT_PREPARE, or if it calls COMMIT twice without an intervening PREPARE. To fix this potential crash bug - and maybe others like it, make it safe (and a no-op) to call resize_hpt_release() with a NULL resize pointer. Found by Dan Carpenter with a static checker. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: KVM-HV HPT resizing implementationDavid Gibson1-1/+187
This adds the "guts" of the implementation for the HPT resizing PAPR extension. It has the code to allocate and clear a new HPT, rehash an existing HPT's entries into it, and accomplish the switchover for a KVM guest from the old HPT to the new one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementationDavid Gibson1-0/+191
This adds a not yet working outline of the HPT resizing PAPR extension. Specifically it adds the necessary ioctl() functions, their basic steps, the work function which will handle preparation for the resize, and synchronization between these, the guest page fault path and guest HPT update path. The actual guts of the implementation isn't here yet, so for now the calls will always fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Create kvmppc_unmap_hpte_helper()David Gibson1-33/+44
The kvm_unmap_rmapp() function, called from certain MMU notifiers, is used to force all guest mappings of a particular host page to be set ABSENT, and removed from the reverse mappings. For HPT resizing, we will have some cases where we want to set just a single guest HPTE ABSENT and remove its reverse mappings. To prepare with this, we split out the logic from kvm_unmap_rmapp() to evict a single HPTE, moving it to a new helper function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT sizeDavid Gibson1-13/+16
The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page table (HPT) that userspace expects a guest VM to have, and is also used to clear that HPT when necessary (e.g. guest reboot). At present, once the ioctl() is called for the first time, the HPT size can never be changed thereafter - it will be cleared but always sized as from the first call. With upcoming HPT resize implementation, we're going to need to allow userspace to resize the HPT at reset (to change it back to the default size if the guest changed it). So, we need to allow this ioctl() to change the HPT size. This patch also updates Documentation/virtual/kvm/api.txt to reflect the new behaviour. In fact the documentation was already slightly incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation ioctl" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Split HPT allocation from activationDavid Gibson1-50/+47
Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT) and sets it up as the active page table for a VM. For the upcoming HPT resize implementation we're going to want to allocate HPTs separately from activating them. So, split the allocation itself out into kvmppc_allocate_hpt() and perform the activation with a new kvmppc_set_hpt() function. Likewise we split kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt() which unsets it as an active HPT, then frees it. We also move the logic to fall back to smaller HPT sizes if the first try fails into the single caller which used that behaviour, kvmppc_hv_setup_htab_rma(). This introduces a slight semantic change, in that previously if the initial attempt at CMA allocation failed, we would fall back to attempting smaller sizes with the page allocator. Now, we try first CMA, then the page allocator at each size. As far as I can tell this change should be harmless. To match, we make kvmppc_free_hpt() just free the actual HPT itself. The call to kvmppc_free_lpid() that was there, we move to the single caller. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Don't store values derivable from HPT orderDavid Gibson1-15/+13
Currently the kvm_hpt_info structure stores the hashed page table's order, and also the number of HPTEs it contains and a mask for its size. The last two can be easily derived from the order, so remove them and just calculate them as necessary with a couple of helper inlines. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Gather HPT related variables into sub-structureDavid Gibson1-46/+46
Currently, the powerpc kvm_arch structure contains a number of variables tracking the state of the guest's hashed page table (HPT) in KVM HV. This patch gathers them all together into a single kvm_hpt_info substructure. This makes life more convenient for the upcoming HPT resizing implementation. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarityDavid Gibson1-4/+4
The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at all obvious from the name. In practice kvmppc_alloc_hpt() allocates an HPT by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate it with CMA only. To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma(). Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31KVM: PPC: Book3S HV: Enable radix guest supportPaul Mackerras1-1/+0
This adds a few last pieces of the support for radix guests: * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and KVM_PPC_GET_RMMU_INFO ioctls for radix guests * On POWER9, allow secondary threads to be on/off-lined while guests are running. * Set up LPCR and the partition table entry for radix guests. * Don't allocate the rmap array in the kvm_memory_slot structure on radix. * Don't try to initialize the HPT for radix guests, since they don't have an HPT. * Take out the code that prevents the HV KVM module from initializing on radix hosts. At this stage, we only support radix guests if the host is running in radix mode, and only support HPT guests if the host is running in HPT mode. Thus a guest cannot switch from one mode to the other, which enables some simplifications. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31KVM: PPC: Book3S HV: Implement dirty page logging for radix guestsPaul Mackerras1-17/+11
This adds code to keep track of dirty pages when requested (that is, when memslot->dirty_bitmap is non-NULL) for radix guests. We use the dirty bits in the PTEs in the second-level (partition-scoped) page tables, together with a bitmap of pages that were dirty when their PTE was invalidated (e.g., when the page was paged out). This bitmap is stored in the first half of the memslot->dirty_bitmap area, and kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the bitmap that gets returned to userspace. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31KVM: PPC: Book3S HV: MMU notifier callbacks for radix guestsPaul Mackerras1-21/+43
This adapts our implementations of the MMU notifier callbacks (unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva) to call radix functions when the guest is using radix. These implementations are much simpler than for HPT guests because we have only one PTE to deal with, so we don't need to traverse rmap chains. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31KVM: PPC: Book3S HV: Page table construction and page faults for radix guestsPaul Mackerras1-2/+5
This adds the code to construct the second-level ("partition-scoped" in architecturese) page tables for guests using the radix MMU. Apart from the PGD level, which is allocated when the guest is created, the rest of the tree is all constructed in response to hypervisor page faults. As well as hypervisor page faults for missing pages, we also get faults for reference/change (RC) bits needing to be set, as well as various other error conditions. For now, we only set the R or C bit in the guest page table if the same bit is set in the host PTE for the backing page. This code can take advantage of the guest being backed with either transparent or ordinary 2MB huge pages, and insert 2MB page entries into the guest page tables. There is no support for 1GB huge pages yet. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31KVM: PPC: Book3S HV: Add basic infrastructure for radix guestsPaul Mackerras1-2/+8
This adds a field in struct kvm_arch and an inline helper to indicate whether a guest is a radix guest or not, plus a new file to contain the radix MMU code, which currently contains just a translate function which knows how to traverse the guest page tables to translate an address. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-24KVM: PPC: Book3S HV: Adapt to new HPTE format on POWER9Paul Mackerras1-9/+30
This adapts the KVM-HV hashed page table (HPT) code to read and write HPT entries in the new format defined in Power ISA v3.00 on POWER9 machines. The new format moves the B (segment size) field from the first doubleword to the second, and trims some bits from the AVA (abbreviated virtual address) and ARPN (abbreviated real page number) fields. As far as possible, the conversion is done when reading or writing the HPT entries, and the rest of the code continues to use the old format. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entriesYongji Xie1-0/+16
This keeps a per vcpu cache for recently page faulted MMIO entries. On a page fault, if the entry exists in the cache, we can avoid some time-consuming paths, for example, looking up HPT, locking HPTE twice and searching mmio gfn from memslots, then directly call kvmppc_hv_emulate_mmio(). In current implenment, we limit the size of cache to four. We think it's enough to cover the high-frequency MMIO HPTEs in most case. For example, considering the case of using virtio device, for virtio legacy devices, one HPTE could handle notifications from up to 1024 (64K page / 64 byte Port IO register) devices, so one cache entry is enough; for virtio modern devices, we always need one HPTE to handle notification for each device because modern device would use a 8M MMIO register to notify host instead of Port IO register, typically the system's configuration should not exceed four virtio devices per vcpu, four cache entry is also enough in this case. Of course, if needed, we could also modify the macro to a module parameter in the future. Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21KVM: PPC: Book3S HV: Clear the key field of HPTE when the page is paged outYongji Xie1-1/+3
Currently we mark a HPTE for emulated MMIO with HPTE_V_ABSENT bit set as well as key 0x1f. However, those HPTEs may be conflicted with the HPTE for real guest RAM page HPTE with key 0x1f when the page get paged out. This patch clears the key field of HPTE when the page is paged out, then recover it when HPTE is re-established. Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21KVM: PPC: Book3S HV: Fix sparse static warningDaniel Axtens1-2/+2
Squash a couple of sparse warnings by making things static. Build tested. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-05-01powerpc/mm: Drop WIMG in favour of new constantsAneesh Kumar K.V1-6/+5
PowerISA 3.0 introduces two pte bits with the below meaning for radix: 00 -> Normal Memory 01 -> Strong Access Order (SAO) 10 -> Non idempotent I/O (Cache inhibited and guarded) 11 -> Tolerant I/O (Cache inhibited) We drop the existing WIMG bits in the Linux page table in favour of the above constants. We loose _PAGE_WRITETHRU with this conversion. We only use writethru via pgprot_cached_wthru() which is used by fbdev/controlfb.c which is Apple control display and also PPC32. With respect to _PAGE_COHERENCE, we have been marking hpte always coherent for some time now. htab_convert_pte_flags() always added HPTE_R_M. NOTE: KVM changes need closer review. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-03powerpc/mm: Move hash related mmu-*.h headers to book3s/Aneesh Kumar K.V1-1/+1
No code changes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-11-05Merge tag 'powerpc-4.4-1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng - Refresh ps3_defconfig from Geoff Levand - Emit GNU & SysV hashes for the vdso from Michael Ellerman - Define an enum for the bolted SLB indexes from Anshuman Khandual - Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman - Add gettimeofday() benchmark from Michael Neuling - Avoid link stack corruption in __get_datapage() from Michael Neuling - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V - Add ppc64le_defconfig from Michael Ellerman - pseries: extract of_helpers module from Andy Shevchenko - Correct string length in pseries_of_derive_parent() from Nathan Fontenot - Free the MSI bitmap if it was slab allocated from Denis Kirjanov - Shorten irq_chip name for the SIU from Christophe Leroy - Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King - Disable hugepd for 64K page size, from Aneesh Kumar K.V - Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V - Make PCI non-optional for pseries from Michael Ellerman - Individual System V IPC system calls from Sam bobroff - Add selftest of unmuxed IPC calls from Michael Ellerman - discard .exit.data at runtime from Stephen Rothwell - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul Gortmaker - Use of_get_next_parent to simplify code from Christophe Jaillet - Paginate some xmon output from Sam bobroff - Add some more elements to the xmon PACA dump from Michael Ellerman - Allow the tm-syscall selftest to build with old headers from Michael Ellerman - Run EBB selftests only on POWER8 from Denis Kirjanov - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman - Avoid reference to potentially freed memory in prom.c from Christophe Jaillet - Quieten boot wrapper output with run_cmd from Geoff Levand - EEH fixes and cleanups from Gavin Shan - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman - Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov - Fix ps3-lpm white space from Rudhresh Kumar J - Fix ps3-vuart null dereference from Colin King - nvram: Add missing kfree in error path from Christophe Jaillet - nvram: Fix function name in some errors messages, from Christophe Jaillet - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov - cxl: Free virtual PHB when removing from Andrew Donnellan - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman - Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump support, a rework of the qoriq clock driver, device tree changes including qoriq fman nodes, support for a new 85xx board, and some fixes. - MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x LocalPlus Bus FIFO with its device tree binding documentation, mpc512x device tree updates and some minor fixes. * tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits) powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc() powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id() powerpc/pseries: Correct string length in pseries_of_derive_parent() powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s) powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes powerpc: handle error case in cpm_muram_alloc() powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake powerpc/book3e-64: Enable kexec powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32 powerpc/book3e-64/kexec: Enable SMP release powerpc/book3e-64/kexec: create an identity TLB mapping powerpc/book3e-64: Don't limit paca to 256 MiB powerpc/book3e/kdump: Enable crash_kexec_wait_realmode powerpc/book3e: support CONFIG_RELOCATABLE powerpc/booke64: Fix args to copy_and_flush powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts powerpc/e6500: kexec: Handle hardware threads ...
2015-10-21KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation ioctlPaul Mackerras1-1/+2
Currently the KVM_PPC_ALLOCATE_HTAB will try to allocate the requested size of HPT, and if that is not possible, then try to allocate smaller sizes (by factors of 2) until either a minimum is reached or the allocation succeeds. This is not ideal for userspace, particularly in migration scenarios, where the destination VM really does require the size requested. Also, the minimum HPT size of 256kB may be insufficient for the guest to run successfully. This removes the fallback to smaller sizes on allocation failure for the KVM_PPC_ALLOCATE_HTAB ioctl. The fallback still exists for the case where the HPT is allocated at the time the first VCPU is run, if no HPT has been allocated by ioctl by that time. Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-12powerpc/mm: Differentiate between hugetlb and THP during page walkAneesh Kumar K.V1-1/+1
We need to properly identify whether a hugepage is an explicit or a transparent hugepage in follow_huge_addr(). We used to depend on hugepage shift argument to do that. But in some case that can result in wrong results. For ex: On finding a transparent hugepage we set hugepage shift to PMD_SHIFT. But we can end up clearing the thp pte, via pmdp_huge_get_and_clear. We do prevent reusing the pfn page via the usage of kick_all_cpus_sync(). But that happens after we updated the pte to 0. Hence in follow_huge_addr() we can find hugepage shift set, but transparent huge page check fail for a thp pte. NOTE: We fixed a variant of this race against thp split in commit 691e95fd7396905a38d98919e9c150dbc3ea21a3 ("powerpc/mm/thp: Make page table walk safe against thp split/collapse") Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in follow_page_mask occasionally. In the long term, we may want to switch ppc64 64k page size config to enable CONFIG_ARCH_WANT_GENERAL_HUGETLB Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-22KVM: PPC: Book3S HV: Fix bug in dirty page trackingPaul Mackerras1-1/+7
This fixes a bug in the tracking of pages that get modified by the guest. If the guest creates a large-page HPTE, writes to memory somewhere within the large page, and then removes the HPTE, we only record the modified state for the first normal page within the large page, when in fact the guest might have modified some other normal page within the large page. To fix this we use some unused bits in the rmap entry to record the order (log base 2) of the size of the page that was modified, when removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that order to return the correct number of modified pages. The same thing could in principle happen when removing a HPTE at the host's request, i.e. when paging out a page, except that we never page out large pages, and the guest can only create large-page HPTEs if the guest RAM is backed by large pages. However, we also fix this case for the sake of future-proofing. The reference bit is also subject to the same loss of information. We don't make the same fix here for the reference bit because there isn't an interface for userspace to find out which pages the guest has referenced, whereas there is one for userspace to find out which pages the guest has modified. Because of this loss of information, the kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly say that a page has not been referenced when it has, but that doesn't matter greatly because we never page or swap out large pages. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2015-05-26KVM: use kvm_memslots whenever possiblePaolo Bonzini1-1/+1
kvm_memslots provides lockdep checking. Use it consistently instead of explicit dereferencing of kvm->memslots. Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-26Merge tag 'powerpc-4.1-2' of ↵Linus Torvalds1-8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc fixes from Michael Ellerman: - fix for mm_dec_nr_pmds() from Scott. - fixes for oopses seen with KVM + THP from Aneesh. - build fixes from Aneesh & Shreyas. * tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabled powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error powerpc/mm/thp: Return pte address if we find trans_splitting. powerpc/mm/thp: Make page table walk safe against thp split/collapse KVM: PPC: Remove page table walk helpers KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()
2015-04-21KVM: PPC: Book3S HV: Create debugfs file for each guest's HPTPaul Mackerras1-0/+136
This creates a debugfs directory for each HV guest (assuming debugfs is enabled in the kernel config), and within that directory, a file by which the contents of the guest's HPT (hashed page table) can be read. The directory is named vmnnnn, where nnnn is the PID of the process that created the guest. The file is named "htab". This is intended to help in debugging problems in the host's management of guest memory. The contents of the file consist of a series of lines like this: 3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196 The first field is the index of the entry in the HPT, the second and third are the HPT entry, so the third entry contains the real page number that is mapped by the entry if the entry's valid bit is set. The fourth field is the guest's view of the second doubleword of the entry, so it contains the guest physical address. (The format of the second through fourth fields are described in the Power ISA and also in arch/powerpc/include/asm/mmu-hash64.h.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21KVM: PPC: Book3S HV: Add helpers for lock/unlock hpteAneesh Kumar K.V1-15/+10
This adds helper routines for locking and unlocking HPTEs, and uses them in the rest of the code. We don't change any locking rules in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21KVM: PPC: Book3S HV: Remove RMA-related variables from codeAneesh Kumar K.V1-14/+14
We don't support real-mode areas now that 970 support is removed. Remove the remaining details of rma from the code. Also rename rma_setup_done to hpte_setup_done to better reflect the changes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-17powerpc/mm/thp: Return pte address if we find trans_splitting.Aneesh Kumar K.V1-6/+3
For THP that is marked trans splitting, we return the pte. This require the callers to handle the pmd_trans_splitting scenario, if they care. All the current callers are either looking at pfn or write_ok, hence we don't need to update them. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17powerpc/mm/thp: Make page table walk safe against thp split/collapseAneesh Kumar K.V1-2/+3
We can disable a THP split or a hugepage collapse by disabling irq. We do send IPI to all the cpus in the early part of split/collapse, and disabling local irq ensure we don't make progress with split/collapse. If the THP is getting split we return NULL from find_linux_pte_or_hugepte(). For all the current callers it should be ok. We need to be careful if we want to use returned pte_t pointer outside the irq disabled region. W.r.t to THP split, the pfn remains the same, but then a hugepage collapse will result in a pfn change. There are few steps we can take to avoid a hugepage collapse.One way is to take page reference inside the irq disable region. Other option is to take mmap_sem so that a parallel collapse will not happen. We can also disable collapse by taking pmd_lock. Another method used by kvm subsystem is to check whether we had a mmu_notifer update in between using mmu_notifier_retry(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-17KVM: PPC: Book3S HV: Remove code for PPC970 processorsPaul Mackerras1-186/+14
This removes the code that was added to enable HV KVM to work on PPC970 processors. The PPC970 is an old CPU that doesn't support virtualizing guest memory. Removing PPC970 support also lets us remove the code for allocating and managing contiguous real-mode areas, the code for the !kvm->arch.using_mmu_notifiers case, the code for pinning pages of guest memory when first accessed and keeping track of which pages have been pinned, and the code for handling H_ENTER hypercalls in virtual mode. Book3S HV KVM is now supported only on POWER7 and POWER8 processors. The KVM_CAP_PPC_RMA capability now always returns 0. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>