author     Jérôme Glisse <jglisse@redhat.com>   2018-12-10 11:35:07 -0500
committer  Jérôme Glisse <jglisse@redhat.com>   2018-12-11 16:56:33 -0500
commit     1ebb3f8075a6b8f21353ec2b0e5869507b4cb215 (patch)
tree       9aa49cad8ca4e0c34171d81152910ade40be2e7f
parent     41d4d0979f392ef9aafc6346f9b6f4376094adbf (diff)
NOPOST HMM 4.21 update cover letter (hmm-4.21)
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
-rw-r--r--   cover-letter   357
1 file changed, 20 insertions(+), 337 deletions(-)
diff --git a/cover-letter b/cover-letter
index 4fa4bae8ca40..fa34030a3a14 100644
--- a/cover-letter
+++ b/cover-letter
@@ -1,343 +1,26 @@
-Heterogeneous memory systems are becoming more and more the norm; in
-those systems there is not only the main system memory for each node,
-but also device memory and/or a memory hierarchy to consider. Device
-memory can come from a device like a GPU, FPGA, ... or from a memory
-only device (persistent memory, or a high density memory device).
+From d98404cdf488d9521debba9bbf5825c5199309e6 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
+Date: Mon, 10 Dec 2018 11:30:47 -0500
+Subject: [PATCH 0/2] HMM use new mmu notifier struct and simplify life time tracking
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
-A memory hierarchy is when you not only have the main memory but also
-other types of memory like HBM (High Bandwidth Memory, often stacked
-on the CPU die or GPU die), persistent memory or high density memory
-(ie something slower than a regular DDR DIMM but much bigger).
+The first patch converts the code to use a kref for the HMM core
+structure lifetime; this is easier and simpler than the existing code
+that relies on the mm struct lifetime and locking.
-On top of this diversity of memories you also have to account for the
-system bus topology, ie how all CPUs and devices are connected to each
-other. Userspace does not care about the exact physical topology but
-cares about the topology from a behavioral point of view, ie what are
-all the paths between an initiator (anything that can initiate memory
-access like a CPU, GPU, FPGA, network controller, ...) and a target
-memory, and what are all the properties of each of those paths
-(bandwidth, latency, granularity, ...).
+The second patch leverages the new mmu notifier range structure,
+removing a bit of HMM code in the process.
-This means that it is no longer sufficient to consider a flat view
-for each node in a system; for maximum performance we need to account
-not only for all of this new memory but also for the system topology.
-This is why this proposal is unlike the HMAT proposal [1], which
-tries to extend the existing NUMA for new types of memory. Here we
-are tackling a much more profound change that departs from NUMA.
+Jérôme Glisse (2):
+  mm/hmm: use reference counting for HMM struct
+  mm/hmm: for HMM mirror use mmu_notifier_range range directly
+
+ include/linux/hmm.h |  27 +---------
+ mm/hmm.c            | 120 ++++++++++++++++++++++----------------
+ 2 files changed, 62 insertions(+), 85 deletions(-)
-One of the reasons for a radical change is that the advance of
-accelerators like GPUs or FPGAs means that the CPU is no longer the
-only place where computation happens. It is becoming more and more
-common for an application to mix and match different accelerators to
-perform its computation. So we can no longer satisfy ourselves with a
-CPU centric and flat view of a system like NUMA and NUMA distance.
+--
+2.17.2
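
As a rough illustration of the kref conversion described in the new
cover letter above, the lifetime handling could look like the sketch
below. The structure layout and the helper names (hmm_example,
hmm_example_get/put) are invented for this example and are not the
actual code from the two patches.

#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical stand-in for the HMM core structure; the fields are a
 * guess, only the kref based lifetime pattern is the point here. */
struct hmm_example {
        struct kref kref;
        /* ... mirror list, lock, back pointer to the mm, ... */
};

static struct hmm_example *hmm_example_alloc(void)
{
        struct hmm_example *hmm = kzalloc(sizeof(*hmm), GFP_KERNEL);

        if (hmm)
                kref_init(&hmm->kref);  /* refcount starts at 1 */
        return hmm;
}

static void hmm_example_free(struct kref *kref)
{
        kfree(container_of(kref, struct hmm_example, kref));
}

/* Every mirror/user takes its own reference instead of relying on the
 * mm struct lifetime and locking. */
static struct hmm_example *hmm_example_get(struct hmm_example *hmm)
{
        kref_get(&hmm->kref);
        return hmm;
}

/* The structure is freed only when the last reference is dropped. */
static void hmm_example_put(struct hmm_example *hmm)
{
        kref_put(&hmm->kref, hmm_example_free);
}
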
-
-This patchset is a proposal to tackle these problems through three
-aspects:
-    1 - Expose the complex system topology and the various kinds of
-        memory to user space so that applications have a standard way
-        and a single place to get all the information they care about.
-    2 - A new API for user space to bind/provide hints to the kernel
-        on which memory to use for a range of virtual addresses (a new
-        mbind()-like syscall).
-    3 - Kernel-side changes to the vm policy code to handle these
-        changes.
-
-This patchset is not an end to end solution but it provides enough
-pieces to be useful against nouveau (the upstream open source driver
-for NVidia GPUs). It is intended as a starting point for discussion so
-that we can figure out what to do. To avoid having too many topics to
-discuss I am not considering memory cgroups for now, but they are
-definitely something we will want to integrate with.
-
-The rest of this email is split in 3 sections. The first section
-talks about complex system topology: what it is, how it is used today
-and how to describe it tomorrow. The second section talks about the
-new API to bind/provide hints to the kernel for a range of virtual
-addresses. The third section talks about the new mechanism to track
-the bindings/hints provided by user space or device drivers inside
-the kernel.
-
-
-1) Complex system topology and representing it
------------------------------------------------
-
-Inside a node you can have a complex topology of memory; for instance
-you can have multiple HBM memories in a node, each HBM memory tied to
-a set of CPUs (all of which are in the same node). This means that
-you have a hierarchy of memory for the CPUs: the local fast HBM,
-which is expected to be relatively small compared to main memory, and
-then the main memory itself. New memory technologies might also
-deepen this hierarchy with another level of yet slower but gigantic
-memory (some persistent memory technologies might fall into that
-category). Another example is device memory, and devices themselves
-can have a hierarchy, like HBM on top of the device cores plus main
-device memory.
-
-On top of that you can have multiple paths to access each memory and
-each path can have different properties (latency, bandwidth, ...).
-Also there is not always symmetry, ie some memory might only be
-accessible by some devices or CPUs, ie not accessible by everyone.
-
-So a flat hierarchy for each node is not capable of representing this
-kind of complexity. To simplify the discussion, and because we do not
-want to single out CPUs from devices, from here on out we will use
-initiator to refer to either a CPU or a device. An initiator is any
-kind of CPU or device that can access memory (ie initiate memory
-access).
-
-At this point an example of such a system might help:
-    2 nodes and for each node:
-        - 1 CPU per node with 2 complexes of CPU cores per CPU
-        - one HBM memory for each complex of CPU cores (200GB/s)
-        - CPU core complexes are linked to each other (100GB/s)
-        - main memory (90GB/s)
-        - 4 GPUs each with:
-            - HBM memory for each GPU (1000GB/s) (not CPU accessible)
-            - GDDR memory for each GPU (500GB/s) (CPU accessible)
-            - connected to the CPU root controller (60GB/s)
-            - connected to other GPUs (even GPUs from the second
-              node) with a GPU link (400GB/s)
-
-In this example we restrict ourselves to bandwidth and ignore bus
-width and latency; this is just to simplify the discussion but
-obviously they also factor in.
-
-
-Userspace very much would like to know this information; for instance
-HPC folks have developed complex libraries to manage it and there is
-wide research on the topic [2] [3] [4] [5]. Today most of the work is
-done by hardcoding things for a specific platform, which is somewhat
-acceptable for HPC folks where the platform stays the same for a long
-period of time. But if we want more ubiquitous support we should aim
-to provide the information needed through a standard kernel API such
-as the one presented in this patchset.
-
-Roughly speaking I see two broad use cases for topology information.
-The first is virtualization and VMs, where you want to segment your
-hardware properly for each VM (binding memory, CPUs and GPUs that are
-all close to each other). The second is applications, many of which
-can partition their workload to minimize exchanges between partitions,
-allowing each partition to be bound to a subset of devices and CPUs
-that are close to each other (for maximum locality). Here it is much
-more than just NUMA distance: you can leverage the memory hierarchy
-and the system topology all together (see [2] [3] [4] [5] for more
-references and details).
-
-So this is not exposing topology just for the sake of cool graphs in
-userspace. There are active users of such information today, and if
-we want to grow and broaden its usage we should provide a unified API
-that standardizes how that information is accessible to everyone.
-
-
-One proposal so far to handle new types of memory is to use CPU-less
-nodes for them [6]. While the same idea can apply to device memory,
-it is still hard to describe multiple paths with different properties
-in such a scheme. While it is backward compatible and requires
-minimal changes, it simply can not convey a complex topology (think
-any kind of random graph, not just a tree-like graph).
-
-Thus far this kind of system has been used through device-specific
-APIs and relies on all kinds of system-specific quirks. To avoid this
-getting out of hand and growing into a bigger mess than it already
-is, this patchset tries to provide a common generic API that should
-fit various devices (GPU, FPGA, ...).
-
-So this patchset proposes a new way to expose the system topology to
-userspace. It relies on 4 types of objects:
-    - target: any kind of memory (main memory, HBM, device, ...)
-    - initiator: a CPU or device (anything that can access memory)
-    - link: anything that links initiators and targets
-    - bridge: anything that allows a group of initiators to access a
-      remote target (ie a target they are not connected to directly
-      through a link)
-
-Properties like bandwidth, latency, ... are all set per bridge and
-per link. All initiators connected to a link can access any target
-memory also connected to the same link, all with the same link
-properties.
-
-Links do not need to match the physical hardware, ie a single
-physical link can match one or multiple software-exposed links. This
-allows modeling devices connected to the same physical link (PCIE for
-instance) but not with the same characteristics (like the number of
-lanes or the lane speed in PCIE). The reverse is also true, ie a
-single software-exposed link can match multiple physical links.
-
-Bridges allow initiators to access a remote link. A bridge connects
-two links to each other and is also specific to a list of initiators
-(ie not all initiators connected to each of the links can use the
-bridge). Bridges have their own properties (bandwidth, latency, ...),
-so the actual value for each property is the lowest common
-denominator between the bridge and each of the links.
-
-
-This model allows describing any kind of directed graph and thus any
-kind of topology we might see in the future. It also makes it easier
-to add new properties to each object type.
-
-Moreover it can be used to expose devices capable of doing peer to
-peer between themselves. For that, simply have all devices capable of
-peer to peer share a common link, or use the bridge object if the
-peer to peer capability is only one way for instance.
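
To make the four object types more concrete, here is a minimal
kernel-side sketch of such a directed graph. Every structure and
field name below is made up for illustration; the patchset's actual
internal representation is not shown here.

#include <linux/types.h>

/* Illustrative only: one structure per object type described above. */
struct hms_target {                        /* any kind of memory */
        unsigned int uid;
        u64 size;
};

struct hms_initiator {                     /* CPU or device */
        unsigned int uid;
};

struct hms_link {                          /* connects initiators and targets */
        unsigned int uid;
        u64 bandwidth;                     /* properties are set per link */
        u64 latency;
        unsigned int ntargets, ninitiators;
        struct hms_target **targets;       /* everything on the link shares */
        struct hms_initiator **initiators; /* the same link properties      */
};

struct hms_bridge {                        /* lets initiators reach a remote link */
        unsigned int uid;
        u64 bandwidth;                     /* effective values are the lowest   */
        u64 latency;                       /* common denominator with the links */
        struct hms_link *from;             /* bridges are directed, so one-way  */
        struct hms_link *to;               /* peer to peer can be described     */
        unsigned int ninitiators;
        struct hms_initiator **initiators; /* only these may use the bridge */
};

Because a bridge is directed and carries its own initiator list, any
directed graph can be described, not just a per-node tree.
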
-
-
-This patchset uses the above scheme to expose the system topology
-through sysfs under /sys/bus/hms/ with:
-    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory;
-      each has a UID and you find the usual values in that folder
-      (node id, size, ...)
-
-    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
-      (CPU or device); each has an HMS UID but also a CPU id for CPUs
-      (which matches the CPU id in /sys/bus/cpu/). For devices you
-      have a path that can be the PCIE bus ID for instance.
-
-    - /sys/bus/hms/devices/v%version-%id-link : a link; each has a
-      UID and a file per property (bandwidth, latency, ...). You also
-      find a symlink to every target and initiator connected to that
-      link.
-
-    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge; each has
-      a UID and a file per property (bandwidth, latency, ...). You
-      also find a symlink to all initiators that can use that bridge.
-
-To help with forward compatibility each object has a version value,
-and it is mandatory for user space to only use targets or initiators
-with a version supported by the user space. For instance if user
-space only knows what version 1 means and sees a target with version
-2, then the user space must ignore that target as if it does not
-exist.
-
-Mandating that allows the addition of new properties that break
-backward compatibility, ie user space must know how a new property
-affects the object to be able to use it safely.
-
-This patchset exposes the main memory of each node under a common
-target. For now device drivers are responsible for registering the
-memory they want to expose through that scheme, but in the future
-that information might come from the system firmware (this is a
-different discussion).
-
-
-2) hbind(): bind a range of virtual addresses to heterogeneous memory
-----------------------------------------------------------------------
-
-With this new topology description the mbind() API is too limited to
-express which memory to pick. This is why this patchset introduces a
-new API: hbind(), for heterogeneous bind. The hbind() API allows
-binding any kind of target memory (using the HMS target uid); this
-can be any memory exposed through HMS, ie main memory, HBM, device
-memory ...
-
-So instead of using a bitmap, hbind() takes an array of uids, where
-each uid is a unique memory target inside the new memory topology
-description. User space also provides an array of modifiers. This
-patchset only defines some modifiers. A modifier can be seen as the
-flags parameter of mbind(), but here we use an array so that user
-space can supply not only a modifier but also a value with it. This
-should allow the API to grow more features in the future. The kernel
-should return -EINVAL if it is provided with an unknown modifier and
-just ignore the call altogether, forcing user space to restrict
-itself to the modifiers supported by the kernel it is running on (I
-know, I am dreaming about well behaved user space).
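
A rough user-space sketch of the flow just described: scan
/sys/bus/hms/devices for a target whose version we understand, then
hand its uid to hbind(). Only the v%version-%id-target directory
naming comes from the text above; the hbind() prototype, the
assumption that %id is the HMS uid, and the absence of modifiers are
all invented for this example.

#include <dirent.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper; the real syscall name, argument order and
 * modifier encoding are defined by the patchset, not reproduced here. */
int hbind(void *addr, size_t len, const uint32_t *targets, uint32_t ntargets,
          const uint64_t *modifiers, uint32_t nmodifiers);

/* Find the uid of the first version-1 target exposed by HMS, per the
 * forward compatibility rule: ignore objects with an unknown version. */
static int find_v1_target(uint32_t *uid)
{
        DIR *dir = opendir("/sys/bus/hms/devices");
        struct dirent *de;

        if (!dir)
                return -1;
        while ((de = readdir(dir)) != NULL) {
                unsigned int version, id;
                char kind[16];

                if (sscanf(de->d_name, "v%u-%u-%15s", &version, &id, kind) == 3 &&
                    version == 1 && strcmp(kind, "target") == 0) {
                        *uid = id;
                        closedir(dir);
                        return 0;
                }
        }
        closedir(dir);
        return -1;
}

int bind_buffer(void *buf, size_t size)
{
        uint32_t target;

        if (find_v1_target(&target))
                return -1;
        /* No modifiers here; a kernel that sees an unknown modifier is
         * expected to reject the whole call with -EINVAL. */
        return hbind(buf, size, &target, 1, NULL, 0);
}
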
-
-Note that none of this is exclusive of automatic memory placement
-like autonuma. I also believe that we will see something similar to
-autonuma for device memory. This patchset is just there to provide a
-new API for processes that wish to have fine control over their
-memory placement, because a process should know better than the
-kernel where to place things.
-
-This patchset also adds the necessary bits to the nouveau open source
-driver for it to expose its memory and to allow processes to bind
-some ranges to the GPU memory. Note that on x86 the GPU memory is not
-accessible by the CPU because PCIE does not allow cache coherent
-access to device memory. Thus when using PCIE device memory on x86 it
-is mapped as swapped out from the CPU's point of view, and any CPU
-access will trigger a migration back to main memory (this is all part
-of HMM and nouveau, not of this patchset).
-
-This is all done under staging so that we can experiment with the
-userspace API for a while before committing to anything. Getting this
-right is hard and it might not happen on the first try, so instead of
-having to support an API forever I would rather have it live in
-staging for people to experiment with, and once we feel confident we
-have something we can live with, convert it to a syscall.
-
-
-3) Tracking and applying heterogeneous memory policies
--------------------------------------------------------
-
-The current memory policy infrastructure is node oriented; instead of
-changing that and risking breakage and regressions, this patchset
-adds a new heterogeneous policy tracking infrastructure. The
-expectation is that existing applications can keep using mbind() with
-all the existing infrastructure undisturbed and unaffected, while new
-applications will use the new API and should avoid mixing and
-matching both (as they can achieve the same thing with the new API).
-
-Also the policy is not directly tied to the vma structure, for a few
-reasons:
-    - avoid having to split a vma for a policy that does not cover
-      the full vma
-    - avoid changing too much vma code
-    - avoid growing the vma structure with an extra pointer
-So instead this patchset uses the mmu_notifier API to track vma
-liveness (munmap(), mremap(), ...).
-
-This patchset is not tied to process memory allocation either (as
-said at the beginning, this is not an end to end patchset but a
-starting point). It does however demonstrate how migration to device
-memory can work under this scheme (using nouveau as a demonstration
-vehicle).
-
-The overall design is simple: on an hbind() call an hms policy
-structure is created for the supplied range, and hms uses the
-callback associated with the target memory. This callback is provided
-by the device driver for device memory, or by core HMS for regular
-main memory. The callback can decide to migrate the range to the
-target memories or do nothing (this can be influenced by the flags
-provided to hbind() too).
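
As a sketch of that design, the tracking side could look roughly like
the code below: a per-mm container registers an mmu_notifier and
drops the policies covering any range that gets invalidated. All
names are invented for illustration, and the callback signature
assumed here is the mmu_notifier_range based one from the 4.21 mmu
notifier rework; the actual patches may differ.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

/* Hypothetical per-mm container for the policies created by hbind(). */
struct hms_policy_tracker {
        struct mmu_notifier notifier;
        /* ... interval tree or list of policy ranges, lock, ... */
};

/* Placeholder: drop the policy ranges overlapping [start, end). */
static void hms_policy_forget(struct hms_policy_tracker *tracker,
                              unsigned long start, unsigned long end)
{
}

/* Called for munmap()/mremap()/etc. through the mmu notifier. */
static int hms_invalidate_range_start(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range)
{
        struct hms_policy_tracker *tracker =
                container_of(mn, struct hms_policy_tracker, notifier);

        hms_policy_forget(tracker, range->start, range->end);
        return 0;
}

/* The whole address space is going away: drop every policy. */
static void hms_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct hms_policy_tracker *tracker =
                container_of(mn, struct hms_policy_tracker, notifier);

        hms_policy_forget(tracker, 0, ULONG_MAX);
}

static const struct mmu_notifier_ops hms_notifier_ops = {
        .invalidate_range_start = hms_invalidate_range_start,
        .release                = hms_release,
};

/* Register against an mm so vma liveness is tracked without touching
 * the vma structure itself. */
static int hms_track_mm(struct hms_policy_tracker *tracker,
                        struct mm_struct *mm)
{
        tracker->notifier.ops = &hms_notifier_ops;
        return mmu_notifier_register(&tracker->notifier, mm);
}
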
-
-Later patches can tie page faults to the HMS policy to direct memory
-allocation to the right target. For now I would rather postpone that
-discussion until a consensus is reached on how to move forward on all
-the topics presented in this email. Start small, grow big ;)
-
-Cheers,
-Jérôme Glisse
-
-https://cgit.freedesktop.org/~glisse/linux/log/?h=hms-hbind-v01
-git://people.freedesktop.org/~glisse/linux hms-hbind-v01
-
-[1] https://lkml.org/lkml/2018/11/15/331
-[2] https://arxiv.org/pdf/1704.08273.pdf
-[3] https://csmd.ornl.gov/highlight/sharp-unified-memory-allocator-intent-based-memory-allocator-extreme-scale-systems
-[4] https://cfwebprod.sandia.gov/cfdocs/CompResearch/docs/Trott-white-paper.pdf
-    http://cacs.usc.edu/education/cs653/Edwards-Kokkos-JPDC14.pdf
-[5] https://github.com/LLNL/Umpire
-    https://umpire.readthedocs.io/en/develop/
-[6] https://www.spinics.net/lists/hotplug/msg06171.html
-
-Cc: Rafael J. Wysocki <rafael@kernel.org>
-Cc: Matthew Wilcox <willy@infradead.org>
-Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
-Cc: Keith Busch <keith.busch@intel.com>
-Cc: Dan Williams <dan.j.williams@intel.com>
-Cc: Dave Hansen <dave.hansen@intel.com>
-Cc: Haggai Eran <haggaie@mellanox.com>
-Cc: Balbir Singh <bsingharora@gmail.com>
-Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
-Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-Cc: Felix Kuehling <felix.kuehling@amd.com>
-Cc: Philip Yang <Philip.Yang@amd.com>
-Cc: Christian König <christian.koenig@amd.com>
-Cc: Paul Blinzer <Paul.Blinzer@amd.com>
-Cc: Logan Gunthorpe <logang@deltatee.com>
-Cc: John Hubbard <jhubbard@nvidia.com>
-Cc: Ralph Campbell <rcampbell@nvidia.com>
-Cc: Michal Hocko <mhocko@kernel.org>
-Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
-Cc: Mark Hairgrove <mhairgrove@nvidia.com>
-Cc: Vivek Kini <vkini@nvidia.com>
-Cc: Mel Gorman <mgorman@techsingularity.net>
-Cc: Dave Airlie <airlied@redhat.com>
-Cc: Ben Skeggs <bskeggs@redhat.com>
-Cc: Andrea Arcangeli <aarcange@redhat.com>
-Cc: Rik van Riel <riel@surriel.com>
-Cc: Ben Woodard <woodard@redhat.com>
-Cc: linux-acpi@vger.kernel.org