BIB-VERSION:: CS-TR-v2.0
ID:: STAN//CSL-TR-94-602
ENTRY:: April 24, 1995
ORGANIZATION:: Stanford University, Computer Systems Laboratory
TITLE:: Analyzing and Tuning Memory Performance in Sequential and Parallel Programs
TYPE:: Thesis
TYPE:: Technical Report
AUTHOR:: Martonosi, Margaret Rose
DATE:: January 1994
PAGES:: 188
ABSTRACT:: Recent architecture and technology trends have led to a significant gap between processor and main memory speeds. When cache misses are common, memory stalls can significantly degrade execution time. To help identify and fix such memory bottlenecks, this work presents techniques to efficiently collect detailed information about program memory performance and effectively organize the data collected. These techniques help guide programmers or compilers to memory bottlenecks. They apply to both sequential and parallel applications and are embodied in the MemSpy performance monitoring system. This thesis contends that the natural interrelationship between program memory bottlenecks and program data structures mandates the use of data-oriented statistics, a novel approach that associates program performance information with application data structures. Data-oriented statistics, viewed alone or paired with traditional code-oriented statistics, offer a powerful new dimension for performance analysis. I develop techniques for aggregating statistics on similarly used data structures and for extracting intuitive source-code names for statistics. The thesis also argues that MemSpy's detailed statistics on the frequency and causes of cache misses are crucial to understanding memory bottlenecks. Common memory performance bugs are often most easily distinguished by noting the causes of their resulting cache misses. Since collecting such detailed information seems, at first glance, to require large execution time slowdowns, this dissertation also evaluates techniques to improve the performance of MemSpy's simulation-based monitoring. The first optimization, hit bypassing, improves simulation performance by specializing the processing of cache hits. The second optimization, reference trace sampling, improves performance by simulating only sampled portions of the full reference trace. Together, these optimizations reduce simulation time by nearly an order of magnitude. Overall, experience using MemSpy to tune several applications demonstrates that MemSpy generates effective memory performance profiles at speeds competitive with previous, less detailed approaches.
NOTES:: [Adminitrivia V1/Prg/19950424]
END:: STAN//CSL-TR-94-602