Move I915ScalarSSA to I965ScalarSSA

author: Jason Ekstrand <jason.ekstrand@intel.com> 2015-04-24 13:44:38 -0700
committer: Jason Ekstrand <jason.ekstrand@intel.com> 2015-04-24 13:44:38 -0700
commit: a63d361a41892280d43236681396a702ff795648 (patch)
tree: 2376785bd940cbe8d5cc9326c8ef4e1523f51a47 /I965ScalarSSA.mdwn
parent: 10553e4319941d59ae070331445a99ab404870a6 (diff)
1 files changed, 213 insertions, 0 deletions
diff --git a/I965ScalarSSA.mdwn b/I965ScalarSSA.mdwn
new file mode 100644
index 0000000..afbe61b
--- /dev/null
+++ b/I965ScalarSSA.mdwn
@@ -0,0 +1,213 @@
+This page is a collection of thoughts on how to best do SSA form in the
+scalar backend IR on i965.  There are a variety of different issues
+discussed here.  I've tried my best to put them into a reasonable order.
+
+# General Considerations
+
+There are a number of general things that will probably need to change in
+order to nicely handle SSA form.
+
+ 1. Register numbers.  Right now, we "allocate" an integer name for each
+    register and treat each value as if it comes out of a magical
+    variable-length register file.  For SSA values, we will probably want
+    to do something similar to NIR where we just point back to the
+    instruction that generates the value.
+
+# Validation
+
+One of the nicest things in all of NIR is the validator.  A validation
+layer cannot check that the shader is correct, but it can check that it is
+in the right form.  Having a good validator has been a massive help in NIR
+and we really should repeat it in the backend.
+
+# Registers
+
+Most of the interesting bits of SSA have to do with how you handle
+registers/values.
+
+## Register sizes
+
+Every register has a concept of size.  Hardware registers (physical or
+otherwise) and messages will be specified in terms of a number of physical
+registers.  Other registers have a three-dimensional size; a number of
+bits, a width, and a length.  The number of bits is fairly obvious: 32 for
+float/int, 16 for hf, 64 for double, etc.  The width is the number of SIMD
+channels stored in the register.
+
+## Execution masks and interference
+
+Most values, in particular anything in a GLSL variable, exist in a
+per-SIMD-channel way.  These registers can be treated with the normal
+SSA-based interference rules.  Consider the following example:
+
+    float a = 1.0;
+    if (foo) {
+        b = 0.5;
+        use(b);
+    } else {
+        use(a);
+    }
+
+If `foo` has different values for different SIMD channels, then the only
+channels of `b` that are set to 0.5 and used will be the channels where
+`foo` is true.  Similarly, the only channels of `a` that will be used are
+the channels where `foo` is false.  Therefore, we can freely assign `a` and
+`b` to the same register without any harm.
+
+The situation is quite different if the assignment to `b` happens to all
+channels regardless of the execution mask.  In this case, assigning `a` and
+`b` to the same register will cause the values we want in `a` to be
+squashed by the assignment to be so they have to be in separate registers.
+
+For the sake of clearer discussion, will use the term **SIMD-local** to
+refer to a register that is only used in a per-SIMD-channel way and
+**SIMD-global** to refer to anything else.
+
+### Bit sizes and exec offsets.
+
+These issues become even more interesting when you throw in quarter control
+and non-32-bit types.  With these added complexities, the execution masks
+do not always apply to the same 32-bit slice of any given register.
+Instead, it may apply to 16-bit or 64-bit slices.  For alternate quarter
+controls, you can have the second half of the mask apply to the given set
+of 32-bit slices instead of the first.
+
+## Register files
+
+In the current backend IR's, we have a variety of different register files
+that indicate different things.  This has worked pretty well, and we
+probably want to do the same type of thing in a new IR.  I think we want
+the following register files:
+
+ * **ATTR**: Push attribute registers.  These registers will be pushed by
+   the vertex fetching hardware and are considered to be "defined" at the
+   start of the program.  These are per-SIMD-channel and as such, have a
+   width equal to the dispatch width.
+
+ * **UNIFORM**: Push uniform registers.  These registers store uniform push
+   constants that are loaded by the hardware for use and are considered to
+   be "defined" at the start of the program.  These are uniform values and
+   as such always have a width of 1.
+
+ * **HW_REG**: An actual physical hardware register.  These are needed for
+   a variety of setup tasks.  Usually, we only use g0 and g1, but there are
+   other times when we need a particular physical register.  Hardware
+   registers don't really have a width per se but we do need some way of
+   specifying what hardware registers get touched.  Hardware registers will
+   allow full register regioning.  Reads/writes to physcal hardware
+   registers cannot be reordered.
+
+ * **Allocated HW_REG**: An allocated hardware register is similar to a
+   hardware register in that it allows multiple writes, partial writes,
+   full register regioning, and are SIMD-global.  The primary difference
+   between allocated and physical hardware registers is that allocated
+   hardware registers get their actual register number assigned by the
+   allocator.  In order to allow for SSA-based allocation, any given
+   allocated hardware register may only be used in a single basic block.
+   (It will be considered to be defined at the first use/def and all other
+   use/defs will be considered uses.)  We may implement this by simply
+   using a register number of -1 to indicate that it should be allocated.
+
+ * **MRF**: Message registers.  These are only ever defined by a
+   LOAD_PAYLOAD type operation and used by send messages.  On SNB-, these
+   will be allocated out of a special pool of 16 registers (24 on SNB?).
+   On IVB+, these will be allocated out of the same pool as the rest of the
+   registers.  Messages are special in that they are partially SIMD-local
+   and partially SIMD-global.  How will we handle this in the allocator?  I
+   have no idea.
+
+ * **IMM**: Immediate values.  We could do what NIR does and handle
+   immediates by using a constant load instruction.  This may be a good
+   idea, but it doesn't map very well to the hardware.  It's probably
+   better to use a special register file/type for these.
+
+ * **FLAG**: Boolean flag values.  These store boolean-only SSA values and
+   get allocated into one of the available flag registers.  Since we have
+   very few flag registers, we may have to "spill" a flag value to a
+   regular SSA value by inserting a move-to-flag shortly before the uses of
+   the flag value.  However, we do want the ability to track flag values
+   explicitly rather than the current "writes flag" boolean.  There is a
+   tricky bit here because the CMP instruction *always* writes the flag
+   whether you want it to or not so we can't quite use SSA interference for
+   this.
+
+ * **SSA**: These are the regular values.  They have a number of bits,
+   a width (1, 8, 16, or 32), a quarter (first half, second quarter, etc.)
+   and a length.  In general, the width will be 1 for values that are
+   uniform and 8, 16, or 32 for non-uniform values.  Non-uniform value
+   with a width less than the dispatch width will also have a specified
+   quarter to denote which SIMD channels it corresponds to.  Unlike
+   hardware registers, partial writes and interesting regions are *not*
+   allowed on SSA values.  Also, SSA values have exactly one definition and
+   this definition must dominate its entire live-range.
+
+One more we *may* want, but I'm not sure what to do with it:
+
+ * **ACCUM**: We also have an accumulator register.  Some instructions just
+   write to this register by default.  If we use the accumulator value, we
+   need to be very careful with re-ordering of instructions.  I don't know
+   if we want to track this as an SSA value or not, but we do need to track
+   it.
+
+# Register Allocation
+
+One of the big advantages to SSA in the backend IR is the ability to use
+SSA for doing register allocation.  More than the allocation itself
+SSA-based allocation schemes gives you the ability to know the exact
+register pressure at any point in the program and make decisions based on
+it.  The benefits of this apply to many areas of the backend compiler such
+as value numbering, scheduling, spilling, and many more.  If we can, we
+really want an SSA-based scheme.
+
+## SSA-based allocation
+
+The problem here is that  most SSA-based register allocation schemes are
+fairly thoroughly CPU-focused.  In our setting, there are some additional
+constraints:
+
+ 1. Register sizes.  In the usual CPU setting, each register is a single
+    atomic entity.  In the GPU setting, however, we have to deal with
+    allocating blocks of contiguous registers.  This means that we have to
+    solve a packing problem.
+
+ 2. SIMD masking.  In the usual CPU setting, if two registers don't
+    interfere in the SSA sense of interference, you can assign them to the
+    same register.  In the SIMD setting,  SIMD-global registers interfere
+    with more than just what the dominance tree tells you.  Also, registers
+    with different bit-sizes and different quarters can interfere even if
+    they don't interfere in the usual SSA sense.
+
+Connor and I (Jason) have discussed this quite a bit and have come up with
+two different schemes that seem to at least start to address these
+problems.  However, the problem is very far from being solved.
+
+## In the mean time...
+
+Before we actually solve the SSA-based register allocation problem, we can
+probably do better than we are today by using a hybrid approach.  This
+would be to combine both SSA-based interference with the interval-based
+interference we use today.  One way to do this would be as follows:
+
+    bool
+    regs_interfere(reg *a, reg *b)
+    {
+        if (regs_ssa_interfere(a, b))
+            return true;
+
+        if (!regs_interval_interfere(a, b))
+            return false;
+
+        return a->width != b->width ||
+               a->bits != b->bits ||
+               a->quarter != b->quarter ||
+               a->simd_global || b->simd_global;
+    }
+
+This would get us an interference graph with fewer edges than the purely
+interval-based approach but would also allow us to avoid the sharp edges
+given by working on a SIMD architecture.  We would then use the same
+register allocator that we have today.
+
+The problem with this approach is, of course, that we wouldn't get the nice
+register pressure properties of *real* SSA-based register allocation.
+However, it may save us a few registers and gain us some SIMD16 programs.
author	Jason Ekstrand <jason.ekstrand@intel.com>	2015-04-24 13:44:38 -0700
committer	Jason Ekstrand <jason.ekstrand@intel.com>	2015-04-24 13:44:38 -0700
commit	a63d361a41892280d43236681396a702ff795648 (patch)
tree	2376785bd940cbe8d5cc9326c8ef4e1523f51a47 /I965ScalarSSA.mdwn
parent	10553e4319941d59ae070331445a99ab404870a6 (diff)