1 files changed, 55 insertions, 5 deletions
diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml
index c1377b63..60a966b1 100644
--- a/cachegrind/docs/cg-manual.xml
+++ b/cachegrind/docs/cg-manual.xml
@@ -1198,9 +1198,8 @@ fail these checks.</para>
        xreflabel="Acting on Cachegrind's information">
 <title>Acting on Cachegrind's information</title>
 <para>
-So, you've managed to profile your program with Cachegrind.  Now what?
-What's the best way to actually act on the information it provides to speed
-up your program?  Here are some rules of thumb that we have found to be
+Cachegrind gives you lots of information, but acting on that information
+isn't always easy.  Here are some rules of thumb that we have found to be
 useful.</para>
 
 <para>
@@ -1210,6 +1209,17 @@ might identify if any are outliers and worthy of closer investigation.
 Otherwise, they're not enough to act on.</para>
 
 <para>
+The function-by-function counts are more useful to look at, as they pinpoint
+which functions are causing large numbers of counts.  However, beware that
+inlining can make these counts misleading.  If a function
+<function>f</function> is always inlined, its counts will be attributed to
+the functions it is inlined into, rather than to itself.  But if you look at
+the line-by-line annotations for <function>f</function> you'll see the
+counts that belong to <function>f</function>.  (This is hard to avoid; it's
+how the debug info is structured.)  So it's worth looking for large numbers
+in the line-by-line annotations.</para>
+
+<para>
 The line-by-line source code annotations are much more useful.  In our
 experience, the best place to start is by looking at the
 <computeroutput>Ir</computeroutput> numbers.  They simply measure how many
@@ -1220,13 +1230,53 @@ bottlenecks.</para>
 <para>
 After that, we have found that L2 misses are typically a much bigger source
 of slow-downs than L1 misses.  So it's worth looking for any snippets of
-code that cause a high proportion of the L2 misses.  If you find any, it's
-still not always easy to work out how to improve things.  You need to have a
+code with high <computeroutput>D2mr</computeroutput> or
+<computeroutput>D2mw</computeroutput> counts.  (You can use
+<option>--show=D2mr --sort=D2mr</option> with cg_annotate
+to focus just on <literal>D2mr</literal> counts, for example.)
+If you find any, it's still
+not always easy to work out how to improve things.  You need to have a
 reasonable understanding of how caches work, the principles of locality, and
 your program's data access patterns.  Improving things may require
 redesigning a data structure, for example.</para>
 
 <para>
+Looking at the <computeroutput>Bcm</computeroutput> and
+<computeroutput>Bim</computeroutput> mispredictions can also be helpful.
+In particular, <computeroutput>Bim</computeroutput> mispredictions are often
+caused by <literal>switch</literal> statements, and in some cases these
+<literal>switch</literal> statements can be replaced with table-driven code.
+For example, you might replace code like this:</para>
+
+<programlisting><![CDATA[
+enum E { A, B, C };
+enum E e;
+int i;
+...
+switch (e)
+{
+    case A: i += 1; break;
+    case B: i += 2; break;
+    case C: i += 3; break;
+}
+]]></programlisting>
+
+<para>with code like this:</para>
+
+<programlisting><![CDATA[
+enum E { A, B, C };
+enum E e;
+int table[] = { 1, 2, 3 };
+int i;
+...
+i += table[e];
+]]></programlisting>
+
+<para>
+This is obviously a contrived example, but the basic principle applies in a
+wide variety of situations.</para>
+
+<para>
 In short, Cachegrind can tell you where some of the bottlenecks in your code
 are, but it can't tell you how to fix them.  You have to work that out for
 yourself.  But at least you have the information!