perfex man page on IRIX


PERFEX(1)							     PERFEX(1)

NAME
     perfex - Command line interface to processor event counters

SYNOPSIS
     perfex [-a | -e event0 [-e event1]] [-mp |-s | -p] [-pp [tid]] [-x] [-k]
     [-y] [-t] [-T] [-o file] [-c file] command

DESCRIPTION
     The given command is executed; after it is complete, perfex prints the
     values of various hardware performance counters.  The counts returned are
     aggregated over all processes that are descendants of the target command,
     as long as their parent process controls the child through wait (see
     wait(2)).

     The R10000 event counters are different from R12000 event counters.  See
     the r10k_counters(5) man page for differences.  For R10000 CPUs, the
     integers event0 and event1 index the following table:
	  0 = Cycles
	  1 = Issued instructions
	  2 = Issued loads
	  3 = Issued stores
	  4 = Issued store conditionals
	  5 = Failed store conditionals
	  6 = Decoded branches.	 (This changes meaning in 3.x
		 versions of R10000.  It becomes resolved branches).
	  7 = Quadwords written back from secondary cache
	  8 = Correctable secondary cache data array ECC errors
	  9 = Primary (L1) instruction cache misses
	  10 = Secondary (L2) instruction cache misses
	  11 = Instruction misprediction from secondary cache way prediction table
	  12 = External interventions
	  13 = External invalidations
	  14 = Virtual coherency conditions.  (This changes meaning in 3.x
		 versions of R10000.  It becomes ALU/FPU forward progress
		 cycles.  On the R12000, this counter is always 0).
	  15 = Graduated instructions
	  16 = Cycles
	  17 = Graduated instructions
	  18 = Graduated loads
	  19 = Graduated stores
	  20 = Graduated store conditionals
	  21 = Graduated floating point instructions
	  22 = Quadwords written back from primary data cache
	  23 = TLB misses
	  24 = Mispredicted branches
	  25 = Primary (L1) data cache misses
	  26 = Secondary (L2) data cache misses
	  27 = Data misprediction from secondary cache way prediction table
	  28 = External intervention hits in secondary cache (L2)
	  29 = External invalidation hits in secondary cache
	  30 = Store/prefetch exclusive to clean block in secondary cache
	  31 = Store/prefetch exclusive to shared block in secondary cache

     For R12000 CPUs, the integers event0 and event1 index the following
     table:
	  0 = Cycles
	  1 = Decoded instructions
	  2 = Decoded loads
	  3 = Decoded stores
	  4 = Miss handling table occupancy
	  5 = Failed store conditionals
	  6 = Resolved conditional branches
	  7 = Quadwords written back from secondary cache
	  8 = Correctable secondary cache data array ECC errors
	  9 = Primary (L1) instruction cache misses
	  10 = Secondary (L2) instruction cache misses
	  11 = Instruction misprediction from secondary cache way prediction table
	  12 = External interventions
	  13 = External invalidations
	  14 = ALU/FPU progress cycles.	 (This counter in current versions of R12000
		 is always 0).
	  15 = Graduated instructions
	  16 = Executed prefetch instructions
	  17 = Prefetch primary data cache misses
	  18 = Graduated loads
	  19 = Graduated stores
	  20 = Graduated store conditionals
	  21 = Graduated floating-point instructions
	  22 = Quadwords written back from primary data cache
	  23 = TLB misses
	  24 = Mispredicted branches
	  25 = Primary data cache misses
	  26 = Secondary data cache misses
	  27 = Data misprediction from secondary cache way prediction table
	  28 = State of intervention hits in secondary cache (L2)
	  29 = State of invalidation hits in secondary cache
	  30 = Store/prefetch exclusive to clean block in secondary cache
	  31 = Store/prefetch exclusive to shared block in secondary cache

BASIC OPTIONS
     -e event	       Specify an event to be counted.

		       Zero, one, or two event specifiers may be given; by
		       default, cycles are counted.  Events may also be
		       specified by setting one or both of the environment
		       variables T5_EVENT0 and T5_EVENT1. Command line event
		       specifiers, if present, override the environment
		       variables. The order of events specified is not
		       important.  The counts, together with an event
		       description, are written to stderr unless redirected
		       with the -o option. Two events that must be counted on
		       the same hardware counter (see r10k_counters(5)) will
		       cause a conflicting counters error.
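
		       The conflict rule can be sketched as follows.  This is
		       a minimal sketch, not perfex's own logic: it assumes
		       the R10000 layout in which events 0-15 are counted on
		       counter 0 and events 16-31 on counter 1 (see
		       r10k_counters(5) for the authoritative mapping).  The
		       function names are hypothetical.

```python
def r10000_counter(event: int) -> int:
    """Return the hardware counter (0 or 1) assumed to host an R10000 event.

    Assumption: events 0-15 live on counter 0 and events 16-31 on counter 1.
    """
    if not 0 <= event <= 31:
        raise ValueError(f"event must be in 0..31, got {event}")
    return 0 if event < 16 else 1


def events_conflict(event0: int, event1: int) -> bool:
    """True if both events would need the same hardware counter."""
    return r10000_counter(event0) == r10000_counter(event1)


if __name__ == "__main__":
    # Events 26 and 10 sit on different counters under the assumed layout.
    print(events_conflict(26, 10))
    # Events 0 and 15 would both need counter 0.
    print(events_conflict(0, 15))
```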

     -a		       Multiplexes over all events, projecting totals. Ignores
		       event specifiers.

		       The option -a produces counts for all events by
		       multiplexing over 16 events per counter. The OS does
		       the switching round robin at clock interrupt
		       boundaries. The resulting counts are normalized by
		       multiplying by 16 to give an estimate of the values
		       they would have had for exclusive counting. Due to the
		       equal-time nature of the multiplexing, events present
		       in large enough numbers to contribute significantly to
		       the execution time will be fairly represented. Events
		       concentrated in a few short regions (for instance,
		       instruction cache misses) may not be projected very
		       accurately.
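
		       The normalization described above amounts to a single
		       multiplication.  The sketch below only illustrates the
		       arithmetic; the function name and the sample count are
		       hypothetical, and the real projection is done by perfex
		       itself.

```python
EVENTS_PER_COUNTER = 16  # each counter is time-shared among 16 events with -a


def project_total(sampled_count: int,
                  events_per_counter: int = EVENTS_PER_COUNTER) -> int:
    """Scale a count gathered during 1/16 of the run to a full-run estimate."""
    return sampled_count * events_per_counter


if __name__ == "__main__":
    # A hypothetical event observed 1000 times while its slot was active.
    print(project_total(1000))
```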

     -mp	       Report per-thread counts for multiprocessing programs
		       as well as (default) totals.

		       By default, perfex aggregates the counts of all the
		       child threads and reports this number for each selected
		       event. The -mp option causes the counters for each
		       thread to be collected at thread exit time and printed
		       out; the counts aggregated across all threads are
		       printed next.  The per-thread counts are labeled by
		       process ID (pid).

     -pp	       Report per-pthread counts for multiprocessing programs.

		       perfex -pp tid  displays the counts of the pthread with
		       thread id tid.  perfex -mp -pp displays the counts of
		       all pthreads associated with the process.  If pthread 0
		       is chosen, all of the pthread counts will be displayed.
		       The -pp option causes the counters for the thread to be
		       collected at thread exit time and printed out.  The
		       per-pthread counts are labeled by thread ID (tid).

     -o file	       Redirects perfex output to the specified file.

		       In the -mp case, the file name includes the pid of the
		       sproc child thread.

     -p period	       Profiles (samples) the counters with the given period.

     -s		       Starts (or stops) counting when a SIGUSR1 (or SIGUSR2)
		       signal is received by a perfex process.

		       This option causes perfex to wait until it (i.e., the
		       perfex process) receives a SIGUSR1 before it starts
		       counting (for the child process, the target). It will
		       stop counting if it receives a SIGUSR2. Repeated cycles
		       of this will aggregate counts. If no SIGUSR2 is
		       received (the usual case), the counting will continue
		       until the child exits.  Note that counting for
		       descendants of the child will not be affected, meaning
		       counting for mp programs cannot be controlled with this
		       option.
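
		       A controlling script could bracket a region of interest
		       with the two signals described above.  This is a
		       minimal sketch under stated assumptions: perfex_pid is
		       the process ID of an already running perfex process
		       started with signal control, and region is any callable
		       that blocks while the interesting phase of the target
		       runs.  Both names are hypothetical.

```python
import os
import signal
from typing import Callable


def count_only_during(perfex_pid: int, region: Callable[[], None]) -> None:
    """Send SIGUSR1 before and SIGUSR2 after the given region."""
    os.kill(perfex_pid, signal.SIGUSR1)      # ask perfex to start counting
    try:
        region()                             # wait out the phase of interest
    finally:
        os.kill(perfex_pid, signal.SIGUSR2)  # ask perfex to stop counting
```

		       Calling the function several times accumulates counts
		       over the bracketed regions, matching the aggregation
		       behavior described above.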

     -x		       Counts at exception level (as well as the default user
		       level).

		       Exception level includes time spent on behalf of the
		       user during, for example, TLB refill exceptions.	 Other
		       counting modes (kernel, supervisor) are available
		       through the OS ioctl interface (see r10k_counters(5) ).

     -k		       Counts at kernel level (as well as user and exception
		       level, if set).  Requires superuser privileges.

EXAMPLE
     To collect instruction and data secondary cache miss counts for a
     program normally executed as

	% bar < bar.in > bar.out

     run it under perfex as follows:

	% perfex -e 26 -e 10 bar < bar.in > bar.out

COST ESTIMATE OPTIONS
     -y	  Report statistics and ranges of estimated times per event.

	  Without the -y option, perfex reports the counts recorded by the
	  event counters for the events requested. Since they are simply raw
	  counts, it is difficult to know by inspection which events are
	  responsible for significant portions of the job's run time. The -y
	  option associates time cost with some of the event counts.

	  The reported times are approximate.  Due to the superscalar nature
	  of the R10000 and R12000 CPUs, and their ability to hide latency,
	  stating a precise cost for a single occurrence of many of the events
	  is not possible. Cache misses, for example, can be overlapped with
	  other operations, so there is a wide range of times possible for any
	  cache miss.

	  To account for the fact that the cost of many events cannot be known
	  precisely, perfex -y reports a range of time costs for each event.
	  "Maximum," "minimum," and "typical" time costs are reported. Each is
	  obtained by consulting an internal table that holds the maximum,
	  minimum, and typical costs for each event, and multiplying this cost
	  by the count for the event. Event costs are usually measured in
	  terms of machine cycles, and so the cost of an event generally
	  depends on the clock speed of the processor, which is also reported
	  in the output.
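
	  The calculation can be sketched as count times per-event cost,
	  converted from cycles to seconds with the clock speed.  This is an
	  illustration only: the function names are hypothetical, and the
	  cost values in the example are made up rather than taken from the
	  built-in table.

```python
def estimated_seconds(count: int, cost_clks: float, clock_mhz: float) -> float:
    """Time attributed to an event: count * cycles-per-event / clock rate."""
    return count * cost_clks / (clock_mhz * 1e6)


def cost_range(count: int, min_clks: float, typical_clks: float,
               max_clks: float, clock_mhz: float) -> tuple:
    """Return (minimum, typical, maximum) estimated seconds for one event."""
    return tuple(estimated_seconds(count, clks, clock_mhz)
                 for clks in (min_clks, typical_clks, max_clks))


if __name__ == "__main__":
    # Hypothetical: 195 million occurrences on a 195 MHz CPU, with
    # made-up costs of 0.5, 1.0 and 52.0 cycles per occurrence.
    print(cost_range(195_000_000, 0.5, 1.0, 52.0, 195.0))
```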

	  The maximum value contained in the table corresponds to the worst
	  case cost of a single occurrence of the event. Sometimes this can be
	  a very pessimistic estimate. For example, the maximum cost for
	  graduated floating-point instructions assumes that all such
	  instructions are double precision reciprocal square roots, since
	  that is the most costly floating-point instruction.

	  Due to the latency-hiding capabilities of the CPUs, the minimum cost
	  of virtually any event could be zero, since most events can be
	  overlapped with other operations. To avoid simply reporting minimum
	  costs of 0, which would be of no practical use, the minimum time
	  reported by perfex -y corresponds to the "best case" cost of a
	  single occurrence of the event. The best case cost is obtained by
	  running the maximum number of simultaneous occurrences of that event
	  and averaging the cost. For example, two floating-point instructions
	  can complete per cycle, so the best case cost on the R10000 is 0.5
	  cycles per floating-point instruction.

	  The typical cost falls somewhere between minimum and maximum and is
	  meant to correspond to the cost one would expect to see in average
	  programs. For example, to measure the typical cost of a cache miss,
	  stride-1 accesses to an array too big to fit in cache were timed,
	  and the number of cache misses generated was counted. The same
	  number of stride-1 accesses to an in-cache array were then timed.
	  The difference in times corresponds to the cost of the cache misses,
	  and this was used to calculate the average cost of a cache miss.
	  This typical cost is lower than the worst case in which each cache
	  miss cannot be overlapped, and it is higher than the best case, in
	  which several independent, and hence, overlapping, cache misses are
	  generated.  (Note that on Origin systems, this methodology yields
	  the time for secondary cache misses to local memory only.)
	  Naturally, these typical costs are somewhat arbitrary.  If they do
	  not seem right for the application being measured by perfex, they
	  can be replaced by user-supplied values. See the -c option below.

	  perfex -y prints the event counts and associated cost estimates
	  sorted from most costly to least costly. While resembling a
	  profiling output, it is not a true profile. The event costs reported
	  are only estimates. Furthermore, since events do overlap with each
	  other, the sum of the estimated times will usually exceed the
	  program's run time.  This output should only be used to identify
	  which events are responsible for significant portions of the
	  program's run time and to get a rough idea of what those costs might
	  be.

	  With this in mind, the built-in cost table does not make an attempt
	  to provide detailed costs for all events. Some events provide
	  summary or redundant information. These events are assigned minimum
	  and typical costs of 0, so that they sort to the bottom of the
	  output.  The maximum costs are set to 1 cycle, so that you can get
	  an indication of the time corresponding to these events.  Issued
	  instructions and graduated instructions are examples of such events.
	  In addition to these summary or redundant events, detailed cost
	  information has not been provided for a few other events, such as
	  external interventions and external invalidations, since it is
	  difficult to assign costs to these asynchronous events. The built-in
	  cost values may be overridden by user-supplied values using the -c
	  option.

	  In addition to the event counts and cost estimates, perfex -y also
	  reports a number of statistics derived from the typical costs. The
	  meaning of many of the statistics is self-evident (for example,
	  graduated instructions/cycle). The following are statistics whose
	  definitions require more explanation.	 These are available with both
	  R10000 and R12000 CPUs.

     Data mispredict/Data secondary cache hits

	  This is the ratio of the counts for data misprediction from
	  secondary cache way prediction table and secondary data cache
	  misses.

     Instruction mispredict/Instruction secondary cache hits

	  This is the ratio of the counts for instruction misprediction from
	  secondary cache way prediction table and secondary instruction cache
	  misses.

     Primary cache line reuse

	  This is the number of times, on average, that a primary data cache
	  line is used after it has been moved into the cache. It is
	  calculated as graduated loads plus graduated stores minus primary
	  data cache misses, all divided by primary data cache misses.

     Secondary Cache Line Reuse

	  This is the number of times, on average, that a secondary data cache
	  line is used after it has been moved into the cache. It is
	  calculated as primary data cache misses minus secondary data cache
	  misses, all divided by secondary data cache misses.
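
	  The two line reuse formulas can be written out as below.  This is
	  a sketch of the stated arithmetic only; the function and parameter
	  names are hypothetical, and the sample counts are invented.

```python
def primary_line_reuse(grad_loads: int, grad_stores: int,
                       l1d_misses: int) -> float:
    """(graduated loads + graduated stores - L1 D misses) / L1 D misses."""
    return (grad_loads + grad_stores - l1d_misses) / l1d_misses


def secondary_line_reuse(l1d_misses: int, l2d_misses: int) -> float:
    """(L1 D misses - L2 D misses) / L2 D misses."""
    return (l1d_misses - l2d_misses) / l2d_misses


if __name__ == "__main__":
    # Invented counts: 800 loads, 200 stores, 100 L1 misses, 20 L2 misses.
    print(primary_line_reuse(800, 200, 100))
    print(secondary_line_reuse(100, 20))
```

	  Both functions divide by a miss count, so a caller would need to
	  handle a run with zero misses separately.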

     Primary Data Cache Hit Rate

	  This is the fraction of data accesses that are satisfied from a
	  cache line already resident in the primary data cache. It is
	  calculated as 1.0 - (primary data cache misses divided by the sum of
	  graduated loads and graduated stores).

     Secondary Data Cache Hit Rate

	  This is the fraction of data accesses that are satisfied from a
	  cache line already resident in the secondary data cache. It is
	  calculated as 1.0 - (secondary data cache misses divided by primary
	  data cache misses).
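
	  The two hit rate formulas above can be sketched the same way.  The
	  names and sample counts are hypothetical.

```python
def primary_hit_rate(grad_loads: int, grad_stores: int,
                     l1d_misses: int) -> float:
    """1.0 - L1 D misses / (graduated loads + graduated stores)."""
    return 1.0 - l1d_misses / (grad_loads + grad_stores)


def secondary_hit_rate(l1d_misses: int, l2d_misses: int) -> float:
    """1.0 - L2 D misses / L1 D misses."""
    return 1.0 - l2d_misses / l1d_misses


if __name__ == "__main__":
    # Invented counts: 800 loads, 200 stores, 100 L1 misses, 20 L2 misses.
    print(primary_hit_rate(800, 200, 100))
    print(secondary_hit_rate(100, 20))
```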

     Time accessing memory/Total time

	  This is the sum of the typical costs of graduated loads, graduated
	  stores, primary data cache misses, secondary data cache misses, and
	  TLB misses, divided by the total program run time. The total program
	  run time is calculated by multiplying cycles by the time per cycle
	  (the inverse of the processor's clock speed).
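
	  As a sketch of this ratio, assume the typical costs (in seconds)
	  of the five memory-related events are already known.  The
	  dictionary keys, function names, and numbers below are
	  hypothetical and serve only to show the arithmetic.

```python
from typing import Mapping

MEMORY_EVENTS = (
    "graduated loads",
    "graduated stores",
    "primary data cache misses",
    "secondary data cache misses",
    "TLB misses",
)


def total_run_time_s(cycles: int, clock_mhz: float) -> float:
    """Cycles multiplied by the time per cycle."""
    return cycles / (clock_mhz * 1e6)


def memory_time_fraction(typical_cost_s: Mapping[str, float],
                         cycles: int, clock_mhz: float) -> float:
    """Sum of typical memory-event costs divided by total run time."""
    memory_s = sum(typical_cost_s[name] for name in MEMORY_EVENTS)
    return memory_s / total_run_time_s(cycles, clock_mhz)
```

	  Because the typical costs are estimates and events overlap, a
	  result computed this way is only a rough indicator.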

     Primary-to-secondary bandwidth used (MB/s, average per process)

	  This is the amount of data moved between the primary and secondary
	  data caches, divided by the total program run time. The amount of
	  data moved is calculated as the sum of the number of primary data
	  cache misses multiplied by the primary cache line size and the
	  number of quadwords written back from primary data cache multiplied
	  by the size of a quadword (16 bytes).	 For multiprocess programs,
	  the resulting figure is a per-process average, since the counts
	  measured by perfex are aggregates of the counts for all the threads.
	  You must multiply by the number of threads to get the total program
	  bandwidth.

     Memory bandwidth used (MB/s, average per process)

	  This is the amount of data moved between the secondary data cache
	  and main memory, divided by the total program run time. The amount
	  of data moved is calculated as the sum of the number of secondary
	  data cache misses multiplied by the secondary cache line size and
	  the number of quadwords written back from secondary data cache
	  multiplied by the size of a quadword (16 bytes).  For multiprocess
	  programs, the resulting figure is a per-process average, since the
	  counts measured by perfex are aggregates of the counts for all the
	  threads. You must multiply by the number of threads to get the total
	  program bandwidth.
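
	  Both bandwidth statistics follow the same pattern and can be
	  sketched with one function.  Assumptions, none of which come from
	  this page: the cache line sizes in the example (32 and 128 bytes)
	  are placeholders to be replaced with the values for the actual
	  system, and 1 MB is taken to be 10^6 bytes.  All names are
	  hypothetical.

```python
QUADWORD_BYTES = 16  # size of a quadword, as stated above


def bandwidth_mb_per_s(misses: int, line_bytes: int,
                       quadwords_written_back: int,
                       run_time_s: float) -> float:
    """(misses * line size + written-back quadwords * 16) / run time, in MB/s."""
    bytes_moved = misses * line_bytes + quadwords_written_back * QUADWORD_BYTES
    return bytes_moved / run_time_s / 1e6  # assumes 1 MB = 10**6 bytes


def total_program_bandwidth(per_process_mb_s: float, n_threads: int) -> float:
    """Scale the per-process average up to the whole program."""
    return per_process_mb_s * n_threads


if __name__ == "__main__":
    # Primary-to-secondary: placeholder 32-byte lines, invented counts.
    print(bandwidth_mb_per_s(1_000_000, 32, 500_000, 2.0))
    # Secondary-to-memory: placeholder 128-byte lines, invented counts.
    print(bandwidth_mb_per_s(100_000, 128, 50_000, 2.0))
```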

     MFLOPS (average per process)

	  This is the ratio of the graduated floating-point instructions and
	  the total program run time. Note that while a multiply-add carries
	  out two floating-point operations, it only counts as one
	  instruction, so this statistic may underestimate the number of
	  floating-point operations per second. For multiprocess programs, the
	  resulting figure is a per-process average, since the counts measured
	  by perfex are aggregates of the counts for all the threads. You must
	  multiply by the number of threads to get the total program rate.
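
	  The MFLOPS figure can be sketched as follows.  The function names
	  and sample values are hypothetical; as noted above, multiply-adds
	  count as one instruction, so the result may understate the true
	  floating-point operation rate.

```python
def mflops(grad_fp_instructions: int, cycles: int, clock_mhz: float) -> float:
    """Graduated floating-point instructions per second, in millions."""
    run_time_s = cycles / (clock_mhz * 1e6)
    return grad_fp_instructions / run_time_s / 1e6


def total_program_mflops(per_process_mflops: float, n_threads: int) -> float:
    """Scale the per-process average up to the whole program."""
    return per_process_mflops * n_threads


if __name__ == "__main__":
    # Invented: 50 million FP instructions in 390 million cycles at 195 MHz.
    print(mflops(50_000_000, 390_000_000, 195.0))
```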

     The following statistics are computed only on R12000 CPUs:

     Cache misses in flight per cycle (average)
	  This is the count of event 4 (Miss Handling Table (MHT) population)
	  divided by cycles.  It can range between 0 and 5 and represents the
	  average number of cache misses of any kind that are outstanding per
	  cycle.

     Prefetch miss rate
	  This is the count of event 17 (prefetch primary data cache misses)
	  divided by the count of event 16 (executed prefetch instructions).
	  A high prefetch miss rate (about 1) is desirable, since prefetch
	  hits waste instruction bandwidth.
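
	  The two R12000-only statistics are simple ratios of raw counts.
	  The sketch below uses hypothetical function names and invented
	  counts.

```python
def misses_in_flight_per_cycle(mht_occupancy: int, cycles: int) -> float:
    """Event 4 (miss handling table occupancy) divided by cycles."""
    return mht_occupancy / cycles


def prefetch_miss_rate(prefetch_l1d_misses: int,
                       executed_prefetches: int) -> float:
    """Event 17 divided by event 16."""
    return prefetch_l1d_misses / executed_prefetches


if __name__ == "__main__":
    print(misses_in_flight_per_cycle(1_000_000, 4_000_000))
    print(prefetch_miss_rate(900, 1000))
```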

     A statistic is only printed if counts for the events which define it have
     been gathered.

     -c file
	  Load a cost table from file (requires that -y is specified).

	  This option allows you to override the internal event costs used by
	  the -y option. file contains the list of event costs that are to be
	  overridden. This file must be in the same format as the output
	  produced by the -t option. Costs may be specified in units of "clks"
	  (machine cycles) or "nsec" (nanoseconds). You can override all or
	  only a subset of the default costs.

	  You can also use the file /etc/perfex.costs to override event costs.
	  If this file exists, any costs listed in it will override those
	  built into perfex. Costs supplied with the -c option will override
	  those provided by the /etc/perfex.costs file.

     -t	  Print the cost table used for perfex -y cost estimates to stdout.

	  These internal costs can be overridden by specifying different
	  values in the file /etc/perfex.costs or by using the -c file option.

	  Both file and /etc/perfex.costs must use the format as provided by
	  the -t option. It is recommended that you capture this output to a
	  file and edit it to create a suitable file for /etc/perfex.costs or
	  the -c option. You do not have to specify costs for every event,
	  however.  Lines corresponding to events with values you do not wish
	  to override may simply be deleted from the file.

MIXED CPU OPTION
     The following is an option for systems with both R10000 and R12000 CPUs.

     -T	  Allows experienced users to use perfex on a system of mixed CPUs.

     Although perfex cannot verify it, the specification of this option means
     that you have used either dplace(1) or some other means to ensure that
     the program is using either all R10000 CPUs or all R12000 CPUs.

     When used with this option, the -y option will not produce cost estimates
     because the cost estimation cannot know which type of CPU is actually
     targeted.  Nothing prevents you, however, from loading a cost
     table with -c.  This cost table could be directly dumped from a pure-
     R10000 or pure-R12000 system, depending on which CPU flavor the program
     is running.

CHANGE IN BEHAVIOR OF DEFAULT EVENTS
     Because of limitations of ABI/API compliance with IRIX version 6.5/R10000
     in the operating system counter interface, it is only possible to count
     cycles and graduated instructions on counter 0.  Accordingly, when the
     R12000 user specifies an event in the range 0-15 to perfex, either
     through a -e argument or environment variables, cycles cannot be counted
     simultaneously with that event as they can on the R10000.	(perfex only
     multiplexes events for the -a option, never for individually specified
     events).  In these cases perfex will count event 16 (executed prefetch
     instructions) as the second event.

     For similar reasons, perfex no longer remaps events 0, 15, 16, and 17 to
     fit them on two (R10000) counters, since that would induce a different
     behavior for identical arguments on R10000 and R12000 systems. It would
     create problems when mixed-CPU systems are supported.  To be specific,
     prior to 6.5.3 a user could specify:
     % perfex -e 0 -e 15 a.out

     This would execute as if the user had specified:
     % perfex -e 0 -e 17 a.out

     or
     % perfex -e 15 -e 16 a.out

     After IRIX version 6.5.3, this argument combination is an error, and the
     user must decide which of the equivalent (for R10000 only) forms to use.
     It is the lack of equivalence for R12000 that makes this regression
     necessary.

FILES
     /etc/perfex.costs

DEPENDENCIES
     perfex only works on an R10000 or R12000 system.  Programs running on
     mixed R10000 and R12000 CPUs are not supported, although specifying the -T
     option will permit you to verify that only CPUs of the same type are
     being used.  Usually, perfex prints an informative message and fails on
     mixed CPU systems.

     For the -mp option, only binaries linked-shared are currently supported;
     this is due to a dependency on libperfex.so.  The options -s and -mp are
     currently mutually exclusive.

LIMITATIONS
     The signal control interface (-s) can control only the immediate target
     process, not any of its descendants.  This makes it unusable with
     multiprocess targets in their parallel regions.

SEE ALSO
     r10k_counters(5), libperfex(3C), time(1), timex(1)
