Merge branch 'mmu_notifier_fixes'

Merge mmu_notifier fixes from Jérôme Glisse: "The invalidate_page callback suffered from 2 pitfalls. First it used to happen after page table lock was release and thus a new page might have been setup for the virtual address before the call to invalidate_page(). This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()") which moved the callback under the page table lock. Which also broke several existing user of the mmu_notifier API that assumed they could sleep inside this callback. The second pitfall was invalidate_page being the only callback not taking a range of address in respect to invalidation but was giving an address and a page. Lot of the callback implementer assumed this could never be THP and thus failed to invalidate the appropriate range for THP pages. By killing this callback we unify the mmu_notifier callback API to always take a virtual address range as input. There is now two clear API (I am not mentioning the youngess API which is seldomly used): - invalidate_range_start()/end() callback (which allow you to sleep) - invalidate_range() where you can not sleep but happen right after page table update under page table lock Note that a lot of existing user feels broken in respect to range_start/ range_end. Many user only have range_start() callback but there is nothing preventing them to undo what was invalidated in their range_start() callback after it returns but before any CPU page table update take place. The code pattern use in kvm or umem odp is an example on how to properly avoid such race. In a nutshell use some kind of sequence number and active range invalidation counter to block anything that might undo what the range_start() callback did. If you do not care about keeping fully in sync with CPU page table (ie you can live with CPU page table pointing to new different page for a given virtual address) then you can take a reference on the pages inside the range_start callback and drop it in range_end or when your driver is done with those pages. Last alternative is to use invalidate_range() if you can do invalidation without sleeping as invalidate_range() callback happens under the CPU page table spinlock right after the page table is updated. The first two patches convert existing mmu_notifier_invalidate_page() calls to mmu_notifier_invalidate_range() and bracket those call with call to mmu_notifier_invalidate_range_start()/end(). The next ten patches remove existing invalidate_page() callback as it can no longer happen. Finally the last page remove the invalidate_page() callback completely so it can RIP. Changes since v1: - remove more dead code in kvm (no testing impact) - more accurate end address computation (patch 2) in page_mkclean_one and try_to_unmap_one - added tested-by/reviewed-by gotten so far" * emailed patches from Jérôme Glisse <jglisse@redhat.com>: mm/mmu_notifier: kill invalidate_page KVM: update to new mmu_notifier semantic v2 xen/gntdev: update to new mmu_notifier semantic sgi-gru: update to new mmu_notifier semantic misc/mic/scif: update to new mmu_notifier semantic iommu/intel: update to new mmu_notifier semantic iommu/amd: update to new mmu_notifier semantic IB/hfi1: update to new mmu_notifier semantic IB/umem: update to new mmu_notifier semantic drm/amdgpu: update to new mmu_notifier semantic powerpc/powernv: update to new mmu_notifier semantic mm/rmap: update to new mmu_notifier semantic v2 dax: update to new mmu_notifier semantic
author: Linus Torvalds <torvalds@linux-foundation.org> 2017-08-31 17:30:01 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2017-08-31 17:30:01 -0700
commit: ea25c43179462e342d4a0e66c3f6a5f53514da05 (patch)
tree: ba4ab1aef8f166de65da4ad9e9162ff277f318a4 /fs
parent: c227390c91a355300f47f9bef0aefbdfaaca1500 (diff)
parent: 5f32b265400de723ab0db23101a75ac073bdd980 (diff)
1 files changed, 11 insertions, 8 deletions
diff --git a/fs/dax.c b/fs/dax.c
index 865d42c63e23..ab925dc6647a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -646,11 +646,10 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	pte_t pte, *ptep = NULL;
 	pmd_t *pmdp = NULL;
 	spinlock_t *ptl;
-	bool changed;
 
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
-		unsigned long address;
+		unsigned long address, start, end;
 
 		cond_resched();
 
@@ -658,8 +657,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			continue;
 
 		address = pgoff_address(index, vma);
-		changed = false;
-		if (follow_pte_pmd(vma->vm_mm, address, &ptep, &pmdp, &ptl))
+
+		/*
+		 * Note because we provide start/end to follow_pte_pmd it will
+		 * call mmu_notifier_invalidate_range_start() on our behalf
+		 * before taking any lock.
+		 */
+		if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
 			continue;
 
 		if (pmdp) {
@@ -676,7 +680,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pmd = pmd_wrprotect(pmd);
 			pmd = pmd_mkclean(pmd);
 			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-			changed = true;
+			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pmd:
 			spin_unlock(ptl);
 #endif
@@ -691,13 +695,12 @@ unlock_pmd:
 			pte = pte_wrprotect(pte);
 			pte = pte_mkclean(pte);
 			set_pte_at(vma->vm_mm, address, ptep, pte);
-			changed = true;
+			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pte:
 			pte_unmap_unlock(ptep, ptl);
 		}
 
-		if (changed)
-			mmu_notifier_invalidate_page(vma->vm_mm, address);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 	}
 	i_mmap_unlock_read(mapping);
 }
author	Linus Torvalds <torvalds@linux-foundation.org>	2017-08-31 17:30:01 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2017-08-31 17:30:01 -0700
commit	ea25c43179462e342d4a0e66c3f6a5f53514da05 (patch)
tree	ba4ab1aef8f166de65da4ad9e9162ff277f318a4 /fs
parent	c227390c91a355300f47f9bef0aefbdfaaca1500 (diff)
parent	5f32b265400de723ab0db23101a75ac073bdd980 (diff)