prof man page on IRIX

prof man page on IRIX
Man page or keyword search:
man Server 31559 pages
apropos Keyword Search (all sections)
Output format


PROF(1)								       PROF(1)

NAME
     prof - analyze SpeedShop performance data

SYNOPSIS
     prof [options] [speedshop_data_file  | pixie_counts_file] ...
     prof [options] executable-name [speedshop_data_file  | pixie_counts_file]
     ...

DESCRIPTION
     prof analyzes one or more data files generated by the SpeedShop
     performance tools and produces a report. (Note that most reports are
     formatted with long lines, and should be viewed in a window that is 135
     characters wide, and printed in wide format.)  The second form is used to
     analyze data files generated by the SpeedShop performance tools if the
     target program is not in the same directory as the data files (in which
     case <executable-name> should be the path to the target program).

     Multiple files can be included only if they are recorded from the same
     executable with the same experiment type.	The default listing for all
     experiments lists functions in descending order of the appropriate
     exclusive (meaning from within the function, rather than included from
     calls it makes) performance metric.  (See below, FUNCTION LIST.)  Options
     allow sorting by calls or by inclusive metrics, for those experiments
     where the recorded data supports them.  Where applicable, the
     -b[utterfly] option also produces a listing of callers and callees for
     each function, with attribution percentages and time or counts.  (See
     below, BUTTERFLY LIST.)  Addition listings may also be generated; see
     below.

     The current implementation supports the following SpeedShop experiments:

     usertime (callstack profiling, user+system time trigger) causes the
     program to be interrupted every 30 milliseconds during its running time
     (user or system mode, but not including any wait time), and to record the
     callstack at each interrupt.  It can show both inclusive and exclusive
     time.  usertime data is statistical in nature, and will show some
     variance from run to run.

     totaltime (callstack profiling, wall-clock time trigger) causes the
     program to be interrupted every 30 milliseconds of wall-clock time during
     the run, and to record the callstack at each interrupt.  It can show both
     inclusive and exclusive time.  totaltime data is statistical in nature,
     and will show some variance from run to run.

     [f]pcsamp[x] asks the kernel to look at the user PC every 10
     milliseconds, and record a histogram of the value of the program counter
     at each clock tick, using 16-bit bins, one for each PC value.  It can
     only show exclusive data, that is data about where the program counter
     was, not the callstack to get there.  The [f] prefix specifies 1
     millisecond profiling, instead of 10 milliseconds.	 The [x] suffix
     specifies 32-bit count bins, instead of 16-bit.  *pcsamp* data is
     statistical in nature, and will show some variance from run to run.

									Page 1

PROF(1)								       PROF(1)

     *_hwc asks the kernel to look at the user PC every time the hardware
     performance counter specified by the experiment overflows, and record a
     histogram of the value of the program counter at overflow.	 These
     experiments can only be run on R10000 machines; other machines do not
     have the hardware performance counters.  There are a number of these
     experiments defined; see speedshop(1).  They can only show exclusive
     data, that is data about where the program counter was, not the callstack
     to get there.  The particular counter used and its overflow value are
     specified in the experiment.  Some of the Hardware Performance counter
     sampling experiments are statistical in nature, and will show some
     variance from run to run; others are exact, provided that the program
     executes the exact same sequence of instructions.

     Among the interesting counter prefixes to use are:
      cy_ for cycle counting;
      gi_ for graduated instructions;
      ic_ for primary instruction cache misses;
      sic_ for secondary instruction cache misses;
      dc_ for primary data cache misses;
      sc_ for secondary data cache misses;
      tlb_ for TLB misses;
      gfp_ for graduated floating-point instructions; and
      fsc_ for failed store-conditional instructions.

     *_hwctime (callstack profiling, R10K hardware counter overflow trigger)
     causes the program to be interrupted at every N overflows of the
     particular counter and to record the callstack at each interrupt.	The
     value of N depends on the particular counter chosen.  It can take the
     same set of prefixes as above, but since callstacks are recorded, it can
     show both inclusive and exclusive time.  *_hwctime data is statistical in
     nature, and will show some variance from run to run.

     ideal causes the code to be instrumented to count the number of times
     each basic block is executed.  (A basic block is a region of the program
     that can be entered only at the beginning and exited only at the end.)
     The data recorded also contains counts for all function pointer calls.
     From this data, a machine model is used to compute the exclusive time
     (cycles) spent in each function.

     Inclusive time computations, performed when the -b[utterfly] flag is
     specified to prof, calculates the exclusive time for each routine as
     above, and then propagates the time to each caller of each routine in
     proportion to the number of calls.	 For example, if sin(x) takes 1000
     cycles, and its callers - procedures foo() and bar() - call it  25 and 75
     times respectively, 250 cycles are attributed  to foo() and 750 to bar().
     By propagating cycles this way, __start() should end up with all the
     cycles counted in the program.

     Note that the propagation according to the number of calls may not be
     reasonable for some routines, and may lead to misleading reports.	For
     example, if a matrix-multiply routine is substituted for sin(x) in the
     above example, and bar's calls are for 2X2 matrices, while foo's calls

									Page 2

PROF(1)								       PROF(1)

     are for 100X100 matrices, the attribution distributes 3/4 of the time to
     bar, whereas nearly all the time really should be attributed to foo.

     fpe asks the floating-point exception library to trace all floating-point
     exceptions, with their callstacks.	 These experiments show listings
     similar to those for usertime experiments, except the data reported is a
     count of floating-point exceptions, rather than projected CPU time.

     io asks for a trace of all IO calls made by the program.  These
     experiments will show calls attributed to functions, and with
     -b[utterfly], can show which functions made which calls.

OUTPUT REPORTS
     prof writes an analysis of the performance data to stdout.	 The first
     thing in the list is a summary of the experiment, and description of the
     environment in which it was recorded.  That is followed by a header that
     summarized the particular data recorded.  Following that is the function
     list;  if -b[utterfly] is given, and the data supports it, a butterfly
     list will be presented.

     If -h[eavy] or -l[ines] is given, for *pcsamp*, *hwc, and ideal
     experiments, a report of data at the source line level is appended,
     sorted by the performance metric computed on a line basis, or by
     functions, and then by line numbers within a function, respectively.  For
     other experiments, these options are ignored.

     If -basicblocks is given, for ideal experiments only, A report of data at
     the basic-block level is appended.	 If -archinfo is given, also for ideal
     experiments only, a summary report of register usage, instruction usage,
     and various other statistics is appended.	For other experiments, these
     options are ignored.

     If -dsolist is given, a list of the DSOs used by the program is appended.

     If -usage is given, a summary of the resources used by the program is
     appended.

FUNCTION LIST
     The default output for any experiment is a function list, sorted in the
     order of exclusive values of the primary metric--the performance cost of
     the function, as computed from the data recorded.	For many experiments,
     this metric is a time, printed in the report in seconds; for others, it
     is a count of events-FPEs, IO calls, or R10K counter overflow counts.

     The report begins with a legend line, naming each of the columns of data.
     Each line in the list has an index; if -b[utterfly] is not specified, the
     index will be in numerical order; if it is specified, the index will be
     in order of functions as sorted by the inclusive metric.  The index
     serves as a poor-person's hyperlink through the butterfly, and between
     the butterfly and the function list.

									Page 3

PROF(1)								       PROF(1)

     To the left of the index, is the primary metric, followed by its
     representation as a percentage of the whole program, followed by the
     cumulative percentage up to that point in the list.  The next two columns
     are inclusive value of the metric., and its representation as a
     percentage.

     Following that may be additional columns of data, as appropriate to the
     particular data recorded.	Finally, rightmost on the line will be the
     name of the function, and the DSO and source file basenames.

BUTTERFLY LIST
     The butterfly list is a set of records that show the callers and callees
     of each function.	The list is sorted in the order of inclusive values of
     the primary metric.  For each function, its callers are shown above it,
     and its callees below.  The center line is for the function itself, and
     shows the index at the beginning and end of the line.  The second and
     third columns are the inclusive percentages of the primary metric and its
     absolute value.  The next two columns of the center line are the
     exclusive percentage and the exclusive value of the metric, followed by
     the function identification.

     Callees are shown below the function, with attribution percentages and
     values lined up below the self percentages and values of the center
     function.	In absolute mode (the default), the percentages, including the
     center node's self value, should add up to the inclusive percentage of
     the center node;  in relative mode, specified with -rel[ative] flag, the
     percentages should add up to 100 %.  In either mode, the attribution
     values should add up to the inclusive time of the central function.  The
     attribution data for callees is followed by the inclusive value for that
     caller and callsite, followed by the callsite identification, with an
     address, a source file, and a line number.	 For callees, the source file
     will be that of the central function.

     The callers are shown above the central function, also with attribution
     percentages and values, but these have a different meaning, and are
     aligned with the central nodes inclusive values to so indicate.  The
     attribution percentage and value in a caller's line represents the
     percentage and value, respectively, of the central function's metric that
     was attributed to that callsite.  In absolute mode, the percentages
     should add up to the central function's percentage; in relative mode,
     they should add up to 100%.  The attribution values should add to the
     central function's values.	 The attribution data for callers is followed
     by the inclusive value for that caller and callsite, followed by the
     callsite identification, with an address, a source file, and a line
     number.  For callers, the source file will be that of the calling
     function.

LINE-LEVEL LISTS
     When invoked with the -h[eavy] or -l[ines] arguments, a list of line-
     level data is produced.  When -h[eavy] is used, the line list is sorted
     by the primary metric associated with each line, with lines from the
     various DSOs and source files intermixed.	When -l[ines] is used, the

									Page 4

PROF(1)								       PROF(1)

     lines are sorted based on the function they came from, with all lines
     with non-zero values from each function printed before any lines from the
     next most important function.

     For each line, the leftmost column contains the value for the primary
     metric computed from the particular experiment, followed by its
     representation as a percentage, followed by the cumulative percentage
     over all lines printed thus far.  These data are followed by any
     additional metrics for the particular experiment, and then by the name of
     the function, its DSO, file and line number.

BASIC BLOCK LIST
     When invoked with the -basicblocks argument (applicable ideal experiments
     only), a list of all the basic blocks in the program that were executed
     is generated.  It is preceded by a header, and a column header line, and
     then the list of basic blocks, in order of total cycles.

     Each basic block is printed with its index, the number of cycles per
     execution of the block, the count of executions, the total cycles, and
     the total cycles represented as a percentage.  Trailing that on each
     line, is the function, and the address of the beginning of the line, with
     the DSO, source file, and source line number.

ARCHITECTURAL INFORMATION REPORT
     When invoked with the -archinfo argument (applicable ideal experiments
     only), a list of various metrics concerning execution of the program is
     printed.  The report consists of a header, and a number of subsections,
     each with the appropriate headers.

     For integer registers, counts are printed of the number of times each
     register was used, and its percentage, the number of times each register
     was used as a base register, and its percentage, and the number of times
     each register is used as a destination register, and its percentage.  For
     floating point registers, the same data, less the base count statistics,
     is printed.

     Following the register usage statistics, a number of counts of
     instruction types or sequences are printed, each with a description,
     followed by a list of all the different instructions used, sorted by the
     number of times each was executed.	 For each instruction, the dynamic
     count of executions, and its representation as a percentage is printed.
     That is followed by the cumulative execution percentage, and the count of
     the number of distinct instructions of that type that were executed one
     or more times, and that count represented as a percentage.

DSO LIST
     When invoked with the -dsolist argument, a report of summary information
     about the DSOs in the execution of the program is printed.	 For each DSO,
     it gives the name, a count of instructions, functions, source-files, and
     source-lines, the high and low addresses, and the full pathname to the
     DSO.  If the DSO is ignored in computation, either because it belongs to
     the SpeedShop runtime, and the -showss flag is not provided, or because

									Page 5

PROF(1)								       PROF(1)

     the DSO was excluded by being listed as with a -xdso argument, or the DSO
     was excluded by virtue of its not being listed with a -dso argument, the
     DSO is flagged as ignored.

RESOURCE USAGE
     When invoked with the -usage argument, a report of summary usage data, as
     measured by the kernel during the run, is produced.  The data consists of
     both per-process (per-file) and per-system metrics.  Included are real-
     time (wall clock time), user, system and wait times.  The per-process
     data is shown as a sum of the data from the experiment files; the real-
     time, and the system-wide statistics are printed as maxima over the data
     in each file.  Included are the accounting timers, giving the time spent
     in each of the various process states.  The sum of the accounting timers
     should be approximately equal to the elapsed real-time, since the process
     must always be in one of the states.  There is some skew in the reading
     of the data, so some discrepancy should be expected.  Other summary
     statistics include bytes read and written, page faults, context switches,
     system calls, and process-size statistics.

DISASSEMBLY AND SOURCE LISTS
     When invoked with the -dis argument, an assembly listing of the program
     is generated.  If the -dis argument is given as well, the source code for
     the disassmebly is intermixed with it.

OPTIONS
     -calipers	[n1,]n2
	  flag causes prof to compute the data between caliper points n1 and
	  n2, rather than for the entire experiment.  If n1 >= n2, an error is
	  reported, otherwise if n1 is negative it is set to the beginning of
	  the experiment, and if n2 is greater than the maximum recorded, it
	  is set to the maximum.  If n1 is omitted, zero is assumed.

     -b[utterfly]
	  or -gprof causes prof to print a report showing the callers and
	  callees of each function, with inclusive time attributed to each.
	  For ideal experiments, the attribution is based on a heuristic,
	  while for the various callstack sampling/tracing experiments the
	  attribution is precise (although usertime and totaltime, as well as
	  some *_hwctime experiments are statistical in nature).  This flag is
	  ignored for experiments where the data does not support inclusive
	  calculations.

     -p[threads] <pthread_id1>,<pthread_id2>,...,<pthread_idn>
	  For usertime, totaltime, *_hwctime, io, and fpe experiments on
	  applications that use pthreads (on Irix 6.5 or later), analyze data
	  only for the specified pthreads.  This flag is ignored for other
	  experiments.

     -u[sage]
	  Print a report on system statistics and timers.

									Page 6

PROF(1)								       PROF(1)

     -dis[assemble]
	  Disassemble and annotate the analyzed object code with cycle times
	  or number of PC samples.  This option can be used when generating
	  reports for ideal, pcsamp, or prof_hwc experiments.

     -S(-source)
	  Disassemble and annotate the analyzed object code with cycle times
	  (or PC samples) and interleave the source code.  This option can be
	  used when generating reports for ideal, pcsamp, or *_hwc
	  experiments.

     -h[eavy]
	  Reports the most heavily used lines in descending order of use.
	  This option can be used when generating reports for ideal, pcsamp,
	  or prof_hwc experiments.  It is ignored for other experiments.

     -l[ines]
	  Like -h[eavy], but group lines by procedure, with procedures sorted
	  in descending order of use.  Within a procedure, lines are listed in
	  source order.	 This option can be used when generating reports for
	  ideal, pcsamp, or prof_hwc experiments.  It is ignored for other
	  experiments.

     -[no]cordfb
	  Disables or enables cord feedback file generation for the executable
	  only. Cord feedback is used to arrange procedures in the binary in
	  an optimal ordering, to improve both paging and instruction-cache
	  performance.	Users can use cord(1) (or, in the near-term future,
	  ld(1)) to actually do the procedure-ordering.

     -cordfball
	  Enables cord feedback for the executable and all DSOs.

     -feedback
	  Produces files with information that can be used to (a) tell the
	  compiler how to optimize compilation of the program next time and
	  (b) arrange procedures in the binary in an optimal ordering.	Users
	  can invoke "cc(1) -fb <cfb-filename>" to use the compilation-
	  optimization feedback file for subsequent compilations.  To disable
	  cord feedback, while producing compiler feedback, use the options
	  -feedback -nocordfb together.
	  Procedures are normally ordered by their measured invocation counts;
	  if -gprof is also specified, procedures are ordered using call graph
	  counts (that capture caller-callee relationships as well), rather
	  than invocation counts.  The cord feedback file is named <a.out>.fb
	  or <lib*so*>.fb.  The -feedback option produces cord feedback for
	  the executable only. To get cord feedback for all the DSOs as well,
	  use the options -feedback -cordfball together.
	  This option also produces a file with information that the compiler
	  system can use to recompile, optimizing by using measured branch
	  frequencies, etc.  The feedback file is produced for the executable
	  only.	 It is named <a.out>.cfb and is a binary file.	It may be

									Page 7

PROF(1)								       PROF(1)

	  dumped using the fbdump(1) routine.  This option can only be used
	  with pixie (ideal-time) data files, and the data should be recorded
	  on a binary that was compiled -O0.  Recording data on binaries with
	  higher optimization will generate a feedback file that does not have
	  the appropriate correspondence between source lines and machine
	  code.

     -ws  Generate a working-set file for the current caliper setting, for the
	  executable only.

     -wsall
	  Generate a working-set file for the current caliper setting, for the
	  executable and all the non-ignored DSOs.

     -showss
	  Enables display of functions from the SpeedShop runtime.  Normally
	  such functions are suppressed from the reports and computations.  In
	  addition, some statistics for prof's own memory usage will be
	  printed.

     -dsolist
	  List all the DSO's in the program and their start and end text
	  addresses.

OUTPUT CONTROLS
     -rel[ative]
	  Show percentage attribution in a butterfly report relative to the
	  central function.  The default is to show percentages as absolute
	  percentages over the whole run.

     -inclusive
	  Sort function list by inclusive data, rather than by exclusive data.
	  This option can only be used when generating reports for those
	  experiments which have inclusive data; it is ignored for others.

     -calls
	  Sort function list by procedure calls rather than by time.  This
	  option can only be used when generating reports for ideal
	  experiments, or for basic block counting data obtained with pixie.

     -q[uit]  n
	  Truncates the	 -h[eavy], -l[ines] and -b[utterfly] listings after
	  the first n procedures or lines have been listed.

     -q[uit] n%
	  Truncates the	 -h[eavy], -l[ines] and -b[utterfly] listings after
	  those procedures or lines up to the one which takes more than n
	  percent of the total.

     -q[uit] ncum%
	  Truncates the	 -h[eavy], -l[ines] and -b[utterfly] listings after
	  those procedures or lines up to the one which brings the cumulative

									Page 8

PROF(1)								       PROF(1)

	  total to n percent.  (For -b[utterfly], it behaves the same as -quit
	  n%.)	For example, "-q[uit] 15" truncates each part of the listing
	  after 15 lines of text, "-q[uit] 15%" truncates each part after the
	  first line that represents less than 15 percent of the whole, and
	  "-q[uit] 15cum%" truncates each part after the line that brought the
	  cumulative percentage above 15 percent.

     -dislimit n
	  Disassemble only those basic blocks with frequency >= n.

     -nh  Suppress various header blocks from the output.

     SELECTIVITY OPTIONS

     -dso dso_name
	  Report only on the named DSO.	 Only the basename of the DSO need be
	  specified not the full pathname to the DSO; the .so suffix is
	  required.  Multiple instances of the -dso flag can be given, and the
	  executable is considered a DSO, like any other.  All the DSOs from
	  an experiment can be listed with the -dsolist flag.

     -xdso dso_name
	  Exclude the named DSO from any reports.  Only the basename of the
	  DSO need be specified not the full pathname to the DSO; the .so
	  suffix is required.  Multiple instances of the -xdso flag can be
	  given.

     -o[nly] procedure_name
	  If you use one or more -o[nly] options, the profile listing includes
	  only the named procedures, rather than the entire program. If any
	  option uses an uppercase "O" for -O[nly], prof uses only the named
	  procedures, rather than the entire program, as the base upon which
	  it calculates percentages.

     -e[xclude] procedure_name
	  If you use one or more -e[xclude] options, the profiler omits the
	  specified procedure from the listing.	 If any option uses an
	  uppercase "E" for -E[xclude] , prof also omits that procedure from
	  the base upon which it calculates percentages.

     CPU OPTIONS

     prof normally uses the scheduling model for the processor on which it is
     being run to perform the analysis.	 The user can override the default
     with any of the following options:

     {-r10000 | -r8000 | -r5000 | -r4000 | -r3000}
	  Note that these options are only meaningful for ideal time and
	  pixie-counts data.

									Page 9

PROF(1)								       PROF(1)

     -clock megahertz
	  Set the CPU clock speed to  megahertz	 MHz.  Alters the appropriate
	  parts of the listing to reflect the clock speed.  The default value
	  is the clock speed of the machine on which the experiment was
	  performed.

     -cycle nanosecond
	  Set cycle time to nanosecond ns. This is the same as -clock
	  1000/nanosecond.

DEBUGGING OPTIONS
     -debug:debug_flags
	  Where debug_flags can be combinations of the following.

	   GPROF_FLAG	   0x00000001

	   COUNTS_FLAG	   0x00000002

	   SAMPLE_FLAG	   0x00000004

	   MISS_FLAG	   0x00000008

	   FEEDBACK_FLAG   0x00000010

	   CORD_FLAG	   0x00000020

	   USERPC_FLAG	   0x00000040

	   MDEBUG_FLAG	   0x00000080

	   BEAD_FLAG	   0x00000100

DIAGNOSTICS
     prof prints warnings and fatal errors on stderr.  With inclusive cycle
     counting, prof prints a list of functions at the end which are called but
     not defined. Any functions starting with _rld listed here is normal
     behavior. They appear because rld is not instrumented.

     One way to sanity-check inclusive cycle counts is to look at the
     percentage cycles for __start().  If the value is anything less than 98-
     99%, the inclusive report is suspect.  Look for other warnings which
     indicate that prof didn't take into account certain procedures.

     There are a number of known cases when prof fails to list cycles of a
     procedure in the inclusive listing.  The reasons can be one of the
     following:

	- init & fini sections, mips stubs are not part of any procedure.

	- calls to procedures which don't use a jump and link are
	  not recognized as procedure calls.

								       Page 10

PROF(1)								       PROF(1)

	- executions of global procedures with same name in different DSOs.
	  In this case, only one of them is listed.

     All these exceptions are listed at the end of the -gprof listing, under a
     separate section.

LIMITATIONS
     With both pc-sampling and basic block counting, prof may report times for
     procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF
     stands for Dummy Function, and indicates cycles spent in parts of text
     which are not in any function, for example, init and fini sections, and
     MIPS.stubs sections.  For pc-sampling, and hardware counter overflow pc-
     sampling, a dummy function, *DF*_Other is used to report sampling hits in
     an overflow bin; these hits include time spent in rld, and any other text
     regions not listed in any DSO.

     If any of the object files linked into the application have been stripped
     of line-number information (with ld -x for example), prof warns about the
     affected procedures.  The instruction counts for such procedures are
     shown as a procedure total, not on a per-basic-block basis.  Where a line
     number would normally appear in a report on a function without line
     numbers question marks appear instead.

     prof does not take into account interactions between basic blocks in
     ideal time runs. prof computes the cycles for one execution of a basic
     block, assuming all registers are free at entry to the block, and it
     multiplies this count by the number of times that basic block is
     executed.	In real programs, a block may be entered with a result not yet
     ready in a register, or with a function unit busy, so the cycle count
     computed could be either higher or lower than the correct value.
     Extending the computations to include inter-block state would be
     prohibitively expensive.

     When run on a program using shared libraries, prof sometimes combines the
     times of real but anonymous procedures in a shared library into the
     preceding (according the library memory layout) externally visible
     function.	The times reported for your procedures are not affected by
     this attribution-error, which is normally minor.

     prof cannot be run on programs which have been stripped.

     Compiler optimization level 3 does procedure inlining.  This can result
     in extremely misleading profiles since the time spent in the inlined
     procedure shows up in the profile as time spent in the procedure into
     which it was inlined.  It is generally better to use compiler
     optimization level 2 or less when gathering an execution profile.

     Fortran alternate entry point times are attributed to the main
     function/subroutine, since there is no general way for prof to separate
     the times for the alternate entries.

								       Page 11

PROF(1)								       PROF(1)

SEE ALSO
     speedshop(1), ssrun(1), ssdump(1), sscord(1), ssorder(1), pixie(1),
     fbdump(1)

								       Page 12
[top]

List of man pages available for IRIX

Copyright (c) for man pages and the logo by the respective OS vendor.

For those who want to learn more, the polarhome community provides shell access and support.

[legal] [privacy] [GNU] [policy] [cookies] [netiquette] [sponsors] [FAQ]
Polarhome, production since 1999.
Member of Polarhome portal.
Based on Fawad Halim's script.
....................................................................
Vote for polarhome