cachegrind/docs/cg-manual.xml | 60
1 file changed, 55 insertions, 5 deletions
diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml
index c1377b63..60a966b1 100644
--- a/cachegrind/docs/cg-manual.xml
+++ b/cachegrind/docs/cg-manual.xml
@@ -1198,9 +1198,8 @@ fail these checks.</para>
 xreflabel="Acting on Cachegrind's information">
 <title>Acting on Cachegrind's information</title>
 <para>
-So, you've managed to profile your program with Cachegrind. Now what?
-What's the best way to actually act on the information it provides to speed
-up your program? Here are some rules of thumb that we have found to be
+Cachegrind gives you lots of information, but acting on that information
+isn't always easy. Here are some rules of thumb that we have found to be
 useful.</para>
 
 <para>
@@ -1210,6 +1209,17 @@ might identify if any are outliers and worthy of closer investigation.
 Otherwise, they're not enough to act on.</para>
 
 <para>
+The function-by-function counts are more useful to look at, as they pinpoint
+which functions are causing large numbers of counts. However, beware that
+inlining can make these counts misleading. If a function
+<function>f</function> is always inlined, counts will be attributed to the
+functions it is inlined into, rather than itself. However, if you look at
+the line-by-line annotations for <function>f</function> you'll see the
+counts that belong to <function>f</function>. (This is hard to avoid; it's
+how the debug info is structured.) So it's worth looking for large numbers
+in the line-by-line annotations.</para>
+
+<para>
 The line-by-line source code annotations are much more useful. In our
 experience, the best place to start is by looking at the
 <computeroutput>Ir</computeroutput> numbers. They simply measure how many
@@ -1220,13 +1230,53 @@ bottlenecks.</para>
 <para>
 After that, we have found that L2 misses are typically a much bigger source
 of slow-downs than L1 misses. So it's worth looking for any snippets of
-code that cause a high proportion of the L2 misses. If you find any, it's
-still not always easy to work out how to improve things. You need to have a
+code with high <computeroutput>D2mr</computeroutput> or
+<computeroutput>D2mw</computeroutput> counts. (You can use
+<option>--show=D2mr
+--sort=D2mr</option> with cg_annotate to focus just on
+<literal>D2mr</literal> counts, for example.) If you find any, it's still
+not always easy to work out how to improve things. You need to have a
 reasonable understanding of how caches work, the principles of locality, and
 your program's data access patterns. Improving things may require
 redesigning a data structure, for example.</para>
 
 <para>
+Looking at the <computeroutput>Bcm</computeroutput> and
+<computeroutput>Bim</computeroutput> misses can also be helpful.
+In particular, <computeroutput>Bim</computeroutput> misses are often caused
+by <literal>switch</literal> statements, and in some cases these
+<literal>switch</literal> statements can be replaced with table-driven code.
+For example, you might replace code like this:</para>
+
+<programlisting><![CDATA[
+enum E { A, B, C };
+enum E e;
+int i;
+...
+switch (e)
+{
+    case A: i += 1; break;
+    case B: i += 2; break;
+    case C: i += 3; break;
+}
+]]></programlisting>
+
+<para>with code like this:</para>
+
+<programlisting><![CDATA[
+enum E { A, B, C };
+enum E e;
+int table[] = { 1, 2, 3 };
+int i;
+...
+i += table[e];
+]]></programlisting>
+
+<para>
+This is obviously a contrived example, but the basic principle applies in a
+wide variety of situations.</para>
+
+<para>
 In short, Cachegrind can tell you where some of the bottlenecks in your
 code are, but it can't tell you how to fix them. You have to work that out
 for yourself. But at least you have the information!
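A side note on the inlining caveat added above: the sketch below is not part of the patch, and the function names and bodies are made up purely for illustration. It shows the situation the new paragraph describes: when <function>f</function> is always inlined, the function-by-function table attributes its counts to its callers, while the line-by-line annotation of <function>f</function>'s own source lines still shows them.

<programlisting><![CDATA[
/* Hypothetical example: f is small enough that the compiler inlines it
   into every caller when optimising. */
static inline int f(int x)
{
    return x * x + 1;   /* in the line-by-line annotation, counts for the
                           inlined work still appear on this line */
}

int g(int x) { return f(x) + 2; }   /* function-level counts for f land here... */
int h(int x) { return f(x) + 3; }   /* ...and here, not under f itself */
]]></programlisting>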
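The advice about <computeroutput>D2mr</computeroutput>/<computeroutput>D2mw</computeroutput> counts, locality, and data access patterns is necessarily abstract, so here is a minimal sketch of the kind of change it has in mind (again illustrative only, not taken from the manual): touching a large array in the order it is laid out in memory instead of with a large stride. Running both versions under Cachegrind and annotating with <computeroutput>D2mr</computeroutput> shown should make the difference on the inner loops obvious.

<programlisting><![CDATA[
#define N 1024
static double m[N][N];   /* row-major: m[i][0..N-1] are contiguous */

/* Strided traversal: consecutive reads are N*sizeof(double) bytes apart,
   so most of them miss in the data caches (high D2mr on the inner loop). */
double sum_by_column(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* Sequential traversal: the same additions, but consecutive reads fall in
   the same cache line, so the D2mr count drops sharply. */
double sum_by_row(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}
]]></programlisting>

Redesigning a data structure for locality, as the paragraph suggests, is usually a bigger version of the same idea: place data that is used together next to each other in memory.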