From 1283d8627de251933c7d28a9fb15385e59eaf87f Mon Sep 17 00:00:00 2001
From: faith <faith>
Date: Wed, 25 Sep 2002 23:33:32 +0000
Subject: Update OProfile results section

---
 xc/programs/Xserver/hw/dmx/doc/dmx.sgml | 96 +++++++++++++++++++++++----------
 1 file changed, 67 insertions(+), 29 deletions(-)
diff --git a/xc/programs/Xserver/hw/dmx/doc/dmx.sgml b/xc/programs/Xserver/hw/dmx/doc/dmx.sgml
index 70ab04300..9590b99b3 100644
--- a/xc/programs/Xserver/hw/dmx/doc/dmx.sgml
+++ b/xc/programs/Xserver/hw/dmx/doc/dmx.sgml
@@ -1492,7 +1492,7 @@ server specifically needs to make a call to guarantee interactivity.
 With this new system, X11 buffers protocol as much as possible during a
 100mS interval, and many unnecessary XSync() calls are avoided.
 
-<p>Out of more than 300 x11perf tests, 8 tests became more than 100
+<p>Out of more than 300 <tt/x11perf/ tests, 8 tests became more than 100
 times faster, with 68 more than 50X faster, 114 more than 10X faster,
 and 181 more than 2X faster.  See table below for summary.
 
@@ -1517,7 +1517,7 @@ XSync() calls.  The performance tests were run on a DMX system with only
 two back-end servers.  Greater performance gains will be had as the
 number of back-end servers increases.
 
-<p>Out of more than 300 x11perf tests, 3 tests were at least twice as
+<p>Out of more than 300 <tt/x11perf/ tests, 3 tests were at least twice as
 fast, and 146 tests were at least 10% faster.  Two tests were more than
 10% slower with the offscreen optimization:
             <verb>
@@ -1567,8 +1567,8 @@ resized, which is common in many window managers.
 servers.  Greater performance gains will be had as the number of
 back-end servers increases.
 
-<p>This optimization improved the following x11perf tests by more than
-10%:
+<p>This optimization improved the following <tt/x11perf/ tests by more
+than 10%:
             <verb>
 1.10   500x500 rectangle outline 
 1.12   Fill 100x100 stippled trapezoid (161x145 stipple) 
@@ -1603,8 +1603,8 @@ this optimization was rejected for the other rendering primitives.
 back-end servers.  Greater performance gains will be had as the number
 of back-end servers increases.
 
-<p>This optimization improved the following x11perf tests by more than
-10%:
+<p>This optimization improved the following <tt/x11perf/ tests by more
+than 10%:
             <verb>
 1.12   Fill 100x100 stippled trapezoid (161x145 stipple) 
 1.26   PutImage 10x10 square 
@@ -1625,17 +1625,17 @@ optimization:
 
 <sect2>Summary of x11perf Data
 
-<p>With all of the optimizations on, 53 x11perf tests are more than 100X
-faster than the unoptimized Phase II deliverable, with 69 more than 50X
-faster, 73 more than 10X faster, and 199 more than twice as fast.  No
-tests were more than 10% slower than the unoptimized Phase II
+<p>With all of the optimizations on, 53 <tt/x11perf/ tests are more than
+100X faster than the unoptimized Phase II deliverable, with 69 more than
+50X faster, 73 more than 10X faster, and 199 more than twice as fast.
+No tests were more than 10% slower than the unoptimized Phase II
 deliverable.  (Compared with the Phase I deliverable, only Circulate
 Unmapped window (100 kids) was more than 10% slower than the Phase II
 deliverable.  As noted above, this test seems to have wider variability
-than other x11perf tests.)
+than other <tt/x11perf/ tests.)
 
-<p>The following table summarizes relative x11perf test changes for all
-optimizations individually and collectively.  Note that some of the
+<p>The following table summarizes relative <tt/x11perf/ test changes for
+all optimizations individually and collectively.  Note that some of the
 optimizations have a synergistic effect when used together.
             <verb>
 
@@ -1984,22 +1984,60 @@ that is similar to that provided by <tt/gprof/, but without the
 necessity of recompiling the program with special instrumentation (i.e.,
 OProfile can collect statistical profiling information about optimized
 programs).  A test harness was developed to collect OProfile data for
-each x11perf test.  The results were examined by hand and were found to
-correlate well with gprof data.  However, they failed to reveal any
-information that was helpful for optimization of Xdmx.
-
-The OProfile results for x11perf tests showed drawing, text, copying,
-and image tests to be dominated (> 30%) by calls to Hash(),
-SecurityLookupIDByClass(), SecurityLookupIDByType(), and
-StandardReadRequestFromClient().  Some of these tests also spent
-significant time in WaitForSomething().  In contrast, the window tests
-spent significant time in SecurityLookupIDByType(), Hash(),
-StandardReadRequestFromClient(), but also spent significant time in
-other routines, such as ConfigureWindow().  Some time was spent looking
-at Hash() and the LookupID functions, but optimizations in these
-routines do not lead to a dramatic increase in <tt/x11perf/ performance.
-Since these routines are in the dix layer and are not specific to DMX,
-work based on the OProfile results has been deferred.
+each <tt/x11perf/ test.
+
+<p>Test runs were performed using the RETIRED_INSNS counter on the AMD
+Athlon and the CPU_CLK_HALTED counter on the Intel Pentium III (with a
+test configuration different from the one described above).  We are
+continuing to examine OProfile output and to compare it with <tt/gprof/
+output.  This investigation is ongoing and has not yet produced results
+that yield performance increases in <tt/x11perf/ numbers.  However, we
+will continue this investigation and provide additional information as
+necessary.
+
+%<sect3>Retired Instructions
+
+%<p>The initial tests using OProfile were done using the RETIRED_INSNS
+%counter with DMX running on the dual-processor AMD Athlon machine -- the
+%same test configuration that was described above and that was used for
+%other tests.  The RETIRED_INSNS counter counts retired instructions and
+%showed drawing, text, copying, and image tests to be dominated (&gt;
+%30%) by calls to Hash(), SecurityLookupIDByClass(),
+%SecurityLookupIDByType(), and StandardReadRequestFromClient().  Some of
+%these tests also executed a significant number of instructions in
+%WaitForSomething().
+
+%<p>In contrast, the window tests executed a significant number of
+%instructions in SecurityLookupIDByType(), Hash(), and
+%StandardReadRequestFromClient(), but also in other routines, such as
+%ConfigureWindow().  Some time was spent looking at the Hash()
+%function, but optimizations in this routine did not lead to a dramatic
+%increase in <tt/x11perf/ performance.
+
+%<sect3>Clock Cycles
+
+%<p>Retired instructions can be misleading because Intel/AMD instructions
+%execute in variable amounts of time.  The OProfile tests were repeated
+%using the Intel CPU_CLK_HALTED counter with DMX running on the second
+%back-end machine.  Note that this is a different test configuration than
+%the one described above.  However, these tests show the amount of time
+%(as measured in CPU cycles) that is spent in each routine.  Because
+%<tt/x11perf/ was running on the first back-end machine and because
+%window optimizations were on, the load on the second back-end machine
+%was not significant.
+
+%<p>Using CPU_CLK_HALTED, DMX showed simple drawing
+%tests spending more than 10% of their time in
+%StandardReadRequestFromClient(), with significant time (&gt; 20% total)
+%spent in SecurityLookupIDByClass(), WaitForSomething(), and Dispatch().
+%For these tests, &lt; 5% of the time was spent in Hash(), which explains
+%why optimizing the Hash() routine did not impact <tt/x11perf/ results.
+
+%<p>The trapezoid, text, scrolling, copying, and image tests were
+%dominated by time in ProcFillPoly(), PanoramiXFillPoly(), dmxFillPolygon(),
+%SecurityLookupIDByClass(), SecurityLookupIDByType(), and
+%StandardReadRequestFromClient().  Hash() time was generally above 5% but
+%less than 10% of total time.
 
 <sect2>X Test Suite
 
-- 
cgit v1.2.3