MP(3C)									MP(3C)

NAME
     mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum,
     mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
     mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop,
     mp_is_master, mp_set_slave_stacksize - C multiprocessing utility
     functions

SYNOPSIS
     void mp_block()

     void mp_unblock()

     void mp_blocktime(iters)
     int iters

     void mp_setup()

     void mp_create(num)
     int num

     void mp_destroy()

     int mp_numthreads()

     void mp_set_numthreads(num)
     int num

     int mp_my_threadnum()

     int mp_is_master()

     void mp_setlock()

     void mp_unsetlock()

     void mp_barrier()

     int mp_in_doacross_loop()

     void mp_set_slave_stacksize(size)
     int size

     unsigned int mp_suggested_numthreads(num)
     unsigned int num

DESCRIPTION
     These routines give some measure of control over the parallelism used in
     C programs.  They should not be needed by most users, but will help to
     tune specific applications.

     mp_block puts all slave threads to sleep via blockproc(2).	 This frees
     the processors for use by other jobs.  This is useful if it is known that
     the slaves will not be needed for some time, and the machine is being
     shared by several users.  Calls to mp_block may not be nested; a warning
     is issued if an attempt to do so is made.

     mp_unblock wakes up the slave threads that were previously blocked via
     mp_block.	It is an error to unblock threads that are not currently
     blocked; a warning is issued if an attempt is made to do so.

     It is not necessary to explicitly call mp_unblock.	 When a parallel
     region is entered, a check is made, and if the slaves are currently
     blocked, a call is made to mp_unblock automatically.

     mp_blocktime controls the amount of time a slave thread waits for work
     before giving up.	When enough time has elapsed, the slave thread blocks
     itself.  This automatic blocking is independent of the user level
     blocking provided by the mp_block/mp_unblock calls.  Slave threads that
     have blocked themselves will be automatically unblocked upon entering a
     parallel region.  The argument to mp_blocktime is the number of times to
     spin in the wait loop.  By default, it is set to 10,000,000.  This takes
     about 0.25 seconds on a 200 MHz processor.  As a special case, an argument
     of 0 disables the automatic blocking, and the slaves will spin wait
     without limit.  The environment variable MP_BLOCKTIME may be set to an
     integer value.  It acts like an implicit call to mp_blocktime during
     program startup.
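
     As an illustrative sketch (the serial routine is hypothetical, and the
     program is assumed to be built as a multiprocessing program), code that
     is about to do a long stretch of serial work might block the slaves
     explicitly and shorten the spin-wait:

          void mp_block(), mp_blocktime();
          extern void long_serial_phase();     /* hypothetical */

          void run_serial_part()
          {
              mp_blocktime(1000000);   /* give up after ~1M spins */
              mp_block();              /* slaves sleep; cpus freed for others */
              long_serial_phase();
              /* no explicit mp_unblock() is needed: the next parallel
                 region unblocks the slaves automatically */
          }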

     mp_destroy deletes the slave threads.  They are stopped by forcing them
     to call exit(2).  In general, doing this is discouraged.  mp_block can be
     used in most cases.

     mp_create creates and initializes threads.  It creates enough threads so
     that the total number is equal to the argument.  Since the calling thread
     already counts as one, mp_create creates one fewer new slave thread than
     its argument specifies.

     mp_setup also creates and initializes threads.  It takes no arguments.
     It simply calls mp_create using the current default number of threads.
     Normally the default number is equal to the number of cpus currently on
     the machine.  If the user has not called either of the thread creation
     routines already, then mp_setup is invoked automatically when the first
     parallel region is entered.  If the environment variable MP_SETUP is set,
     then mp_setup is called during initialization, before any user code is
     executed.
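
     As an illustrative sketch, a program that wants its slave threads
     created up front (rather than at the first parallel region) can call
     mp_create directly:

          void mp_create();

          int main()
          {
              mp_create(4);    /* one master plus three new slave threads */
              /* ... parallel regions now start without creation cost ... */
              return 0;
          }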

     mp_numthreads returns the number of threads that would participate in an
     immediately following parallel region.  If the threads have already been
     created, then it returns the current number of threads.  If the threads
     have not been created, then it returns the current default number of
     threads.  The count includes the master thread. Knowing this count can be
     useful in optimizing certain kinds of parallel loops by hand, but this
     function has the side-effect of freezing the number of threads to the
     returned value.  As a result, this routine should be used sparingly. To
     determine the number of threads without this side-effect, see the
     description of mp_suggested_numthreads below.

     mp_set_numthreads sets the current default number of threads to the
     specified value.  Note that this call does not directly create the
     threads, it only specifies the number that a subsequent mp_setup call
     should use.  If the environment variable MP_SET_NUMTHREADS is set, it
     acts like an implicit call to mp_set_numthreads during program startup.
     For convenience when operating among several machines with different
     numbers of cpus, MP_SET_NUMTHREADS may be set to an expression involving
     integer literals, the binary operators + and -, the binary functions min
     and max, and the special symbolic value ALL which stands for "the total
     number of available cpus on the current machine."	Thus, something simple
     like
		 setenv MP_SET_NUMTHREADS 7
     would set the number of threads to seven.	This may be a fine choice on
     an 8 cpu machine, but would be very bad on a 4 cpu machine.  Instead, use
     something like
		 setenv MP_SET_NUMTHREADS "max(1,all-1)"
     which sets the number of threads to be one less than the number of cpus
     on the current machine (but always at least one).	If your configuration
     includes some machines with large numbers of cpus, setting an upper bound
     is a good idea.  Something like:
		 setenv MP_SET_NUMTHREADS "min(all,4)"
     will request (no more than) 4 cpus.

     For compatibility with earlier releases, NUM_THREADS is supported as a
     synonym for MP_SET_NUMTHREADS.

     mp_my_threadnum returns an integer between 0 and n-1 where n is the value
     returned by mp_numthreads.	 The master process is always thread 0.	 This
     is occasionally useful for optimizing certain kinds of loops by hand.

     mp_is_master returns 1 if called by the master process, 0 otherwise.
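
     As an illustrative sketch (the array, its length, and the scale factor
     are hypothetical), a loop can be partitioned by hand inside a parallel
     region (see the parallel pragma described below) using these calls:

          int mp_numthreads(), mp_my_threadnum();

          void scale(double *a, int n, double s)
          {
              int i, me, nt;

          #pragma parallel shared(a, n, s) local(i, me, nt)
              {
                  nt = mp_numthreads();    /* total threads, master included */
                  me = mp_my_threadnum();  /* 0 .. nt-1; master is thread 0  */
                  for (i = me; i < n; i += nt)
                      a[i] *= s;
              }
          }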

     mp_setlock provides convenient (though limited) access to the locking
     routines.  The convenience is that no setup need be done; it may be
     called directly without any preliminaries.	 The limitation is that there
     is only one lock.	It is analogous to the ussetlock(3P) routine, but it
     takes no arguments and does not return a value.  This is useful for
     serializing access to shared variables (e.g.  counters) in a parallel
     region.  Note that it will frequently be necessary to declare those
     variables as volatile to ensure that the optimizer does not assign them
     to a register.

     mp_unsetlock is the companion routine for mp_setlock.  It also takes no
     arguments and does not return a value.
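
     As an illustrative sketch (the counter is hypothetical), a shared
     counter can be updated safely from parallel code as follows; note the
     volatile qualifier, as recommended above:

          void mp_setlock(), mp_unsetlock();

          volatile int hits = 0;       /* shared counter */

          void count_hit()
          {
              mp_setlock();            /* the single program-wide lock */
              hits++;                  /* only one thread at a time gets here */
              mp_unsetlock();
          }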

     mp_barrier provides a simple interface to a single barrier(3P).  It may
     be used inside a parallel loop to force a barrier synchronization to
     occur among the parallel threads.  The routine takes no arguments,
     returns no value, and does not require any initialization.
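
     As an illustrative sketch (the two phase routines are hypothetical),
     mp_barrier can separate two phases of work inside a parallel region:

          void mp_barrier();
          int mp_my_threadnum();
          extern void phase_one(int), phase_two(int);   /* hypothetical */

          void two_phases()
          {
              int me;

          #pragma parallel local(me)
              {
                  me = mp_my_threadnum();
                  phase_one(me);
                  mp_barrier();     /* every thread finishes phase_one
                                       before any thread starts phase_two */
                  phase_two(me);
              }
          }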

     mp_in_doacross_loop answers the question "am I currently executing inside
     a parallel loop?"  This is needed in certain rare situations where you
     have an external routine that can be called both from inside and from
     outside a parallel loop, and the routine must do different things
     depending on whether or not it is being called in parallel.
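
     As an illustrative sketch (the logging routine is hypothetical), such
     an external routine might take the single lock only when it is actually
     running in parallel:

          #include <stdio.h>

          int mp_in_doacross_loop();
          void mp_setlock(), mp_unsetlock();

          void log_event(const char *msg)
          {
              if (mp_in_doacross_loop()) {
                  mp_setlock();          /* serialize output when parallel */
                  fputs(msg, stderr);
                  mp_unsetlock();
              } else {
                  fputs(msg, stderr);    /* serial caller: no lock needed */
              }
          }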

     mp_set_slave_stacksize sets the stacksize (in bytes) to be used by the
     slave processes when they are created (via sprocsp(2)).  The default size
     is 16MB.  Note that slave processes allocate only their local data on
     their stack; shared data (even if allocated on the master's stack) is not
     counted.
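
     As an illustrative sketch, a larger slave stack can be requested before
     the slaves are created (for example, before mp_setup is called):

          void mp_set_slave_stacksize(), mp_setup();

          int main()
          {
              mp_set_slave_stacksize(64*1024*1024);  /* 64MB instead of 16MB */
              mp_setup();      /* slaves are created with the new stack size */
              /* ... */
              return 0;
          }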

     mp_suggested_numthreads uses the supplied value as a hint about how many
     threads to use in subsequent parallel regions, and returns the previous
     value of the number of threads to be employed in parallel regions. It
     does not affect currently executing parallel regions, if any. The
     implementation may ignore this hint depending on factors such as overall
     system load.  This routine may also be called with the value 0, in which
     case it simply returns the number of threads to be employed in parallel
     regions without the side-effect present in mp_numthreads.
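
     As an illustrative sketch, the routine can be used both to hint at a
     thread count and to query the current value without side effects:

          unsigned int mp_suggested_numthreads();

          void tune()
          {
              unsigned int prev, now;

              prev = mp_suggested_numthreads(2);  /* hint: two threads for
                                                     subsequent regions */
              now  = mp_suggested_numthreads(0);  /* query only; no hint and
                                                     no freezing side effect */
          }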

     Pragmas or directives

     The MIPSpro C (and C++) compiler allows you to apply the capabilities of
     a Silicon Graphics multiprocessor computer to the execution of a single
     job. By coding a few simple directives, the compiler splits the job into
     concurrently executing pieces, thereby decreasing the wall-clock run time
     of the job.

     Directives enable, disable, or modify a feature of the compiler.
     Essentially, directives are command line options specified within the
     input file instead of on the command line. Unlike command line options,
     directives have no default setting. To invoke a directive, you must
     either toggle it on or set a desired value for its level.	The following
     directives can be used in C (and C++) programs when compiled with the -mp
     option.

     #pragma parallel

	 This pragma denotes the start of a parallel region. The syntax for
	 this pragma has a number of modifiers, but to run a single loop in
	 parallel, the only modifiers you usually use are shared, and local.
	 These options tell the multiprocessing compiler which variables to
	 share between all threads of execution and which variables should be
	 treated as local.

	 In C, the code that comprises the parallel region is delimited by
	 curly braces ({ }) and immediately follows the parallel pragma and
	 its modifiers.

	 The syntax for this pragma is:

	 #pragma parallel shared (variables)
	 #pragma local (variables) optional modifiers
	 {code}

	 The parallel pragma has four modifiers: shared, local, if, and
	 numthreads.

	 Their definitions are:

	     shared ( variable_names )

	     Tells the multiprocessing C compiler the names of all the
	     variables that the threads must share.

	     local ( variable_names )

	     Tells the multiprocessing C compiler the names of all the
	     variables that must be private to each thread. (When PCA sets up
	     a parallel region, it does this for you.)

	     if ( integer_valued_expr )

	     Lets you set up a condition that is evaluated at run time to
	     determine whether to run the statement(s) serially or in
	     parallel. At compile time, it is not always possible to judge how
	     much work a parallel region does (for example, loop indices are
	     often calculated from data supplied at run time). Avoid running
	     trivial amounts of code in parallel because you cannot make up
	     the overhead associated with running code in parallel. PCA will
	     also generate this condition as appropriate.  If the if condition
	     is false (equal to zero), then the statement(s) runs serially.
	     Otherwise, the statement(s) run in parallel.

	     numthreads(expr)

	     Tells the multiprocessing C compiler the number of available
	     threads to use when running this region in parallel. (The default
	     is all the available threads.)

	     In general, you should never have more threads of execution than
	     you have processors, and you should specify numthreads with the
	     MP_SET_NUMTHREADS environment variable at run time.  If you want
	     to run a loop in parallel while you run some other code, you can
	     use this option to tell the multiprocessing C compiler to use
	     only some of the available threads.

	     The expression expr should evaluate to a positive integer.

	     For example, to start a parallel region in which to run the
	     following code in parallel:

	     for (idx=n; idx; idx--) {

		a[idx] = b[idx] + c[idx];

	     }

	     you must write:

	     #pragma parallel shared( a, b, c ) shared(n) local( idx )

	     or:

	     #pragma parallel

	     #pragma shared( a, b, c )

	     #pragma shared(n)

	     #pragma local(idx)

	     before the statement or compound statement (code in curly braces,
	     { }) that comprises the parallel region.

	     Any code within a parallel region but not within any of the
	     explicit parallel constructs ( pfor, independent, one processor,
	     and critical ) is termed local code. Local code typically
	     modifies only local data and is run by all threads.

     #pragma pfor

	 The pfor is contained within a parallel region.  Use #pragma pfor to
	 run a for loop in parallel only if the loop meets all of these
	 conditions:

	     All the values of the index variable can be computed
	     independently of the iterations.

	     All iterations are independent of each other - that is, data used
	     in one iteration does not depend on data created by another
	     iteration. A quick test for independence: if the loop can be run
	     backwards, then chances are good the iterations are independent.

	     The loop control variable cannot be a field within a
	     class/struct/union or an array element.

	     The number of times the loop must be executed is determined once,
	     upon entry to the loop, and is based on the loop initialization,
	     loop test, and loop increment statements.

	     If the number of times the loop is actually executed is different
	     from what is computed above, the results are unpredictable. This
	     can happen if the loop test and increment change during the
	     execution of the loop, or if there is an early exit from within
	     the for loop. An early exit or a change to the loop test and
	     increment during execution may have serious performance
	     implications.

	     The test or the increment should not contain expressions with
	     side effects.

	     The chunksize, if specified, is computed before the loop is
	     executed, and the behavior is unpredictable if its value changes
	     within the loop.

	     If you are writing a pfor loop for the multiprocessing C++
	     compiler, the index variable i can be declared within the for
	     statement via

	     int i = 0;

	     The draft for the C++ standard states that the scope of the index
	     variable declared in a for statement extends to the end of the
	     for statement, as in this example:

	     #pragma pfor for (int i = 0, ...)

	     The C++ compiler doesn't enforce this; in fact, with this
	     compiler the scope extends to the end of the enclosing block. Use
	     care when writing code so that the subsequent change in scope
	     rules for i (in later compiler releases) do not affect the user
	     code.

	 If the code after a pfor is not dependent on the calculations made in
	 the pfor loop, there is no reason to synchronize the threads of
	 execution before they continue. So, if one thread from the pfor
	 finishes early, it can go on to execute the serial code without
	 waiting for the other threads to finish their part of the loop.

	 The #pragma pfor directive takes several modifiers; the only one that
	 is required is iterate. #pragma pfor tells the compiler that each
	 iteration of the loop is unique.  It also partitions the iterations
	 among the threads for execution.

	 The syntax for #pragma pfor is:

	 #pragma pfor iterate ( ) optional_modifiers
	 for ...
	    { code ... }

	 The pfor pragma has several modifiers. Their syntax is:

	 iterate (index variable=expr1; expr2; expr3 )
	 local(variable list)
	 lastlocal (variable list)
	 reduction (variable list)
	 affinity (variable) = thread (expression)
	 schedtype (type)
	 chunksize (expr)

	 Where:

	     iterate (index variable=expr1; expr2; expr3 )

	     Gives the multiprocessing C compiler the information it needs to
	     identify the unique iterations of the loop and partition them to
	     particular threads of execution.

		 index variable is the index variable of the for loop you want
		 to run in parallel.

		 expr1 is the starting value for the loop index.

		 expr2 is the number of iterations for the loop you want to
		 run in parallel.

		 expr3 is the increment of the for loop you want to run in
		 parallel.

	     local (variable list)

	     Specifies variables that are local to each process. If a variable
	     is declared as local, each iteration of the loop is given its own
	     uninitialized copy of the variable. You can declare a variable as
	     local if its value does not depend on any other iteration of the
	     loop and if its value is used only within a single iteration. In
	     effect the local variable is just temporary; a new copy can be
	     created in each loop iteration without changing the final answer.

	     lastlocal (variable list)

	     Specifies variables that are local to each process. Unlike with
	     the local clause, the compiler saves only the value of the
	     logically last iteration of the loop when it exits.

	     reduction (variable list)

	     Specifies variables involved in a reduction operation. In a
	     reduction operation, the compiler keeps local copies of the
	     variables and combines them when it exits the loop. An element of
	     the reduction list must be an individual variable (also called a
	     scalar variable) and cannot be an array or struct. However, it
	     can be an individual element of an array. When the reduction
	     modifier is used, it appears in the list with the correct
	     subscripts.

	     One element of an array can be used in a reduction operation,
	     while other elements of the array are used in other ways. To
	     allow for this, if an element of an array appears in the
	     reduction list, the entire array can also appear in the share
	     list.

	     The two types of reductions supported are sum(+) and product(*).

	     The compiler confirms that the reduction expression is legal by
	     making some simple checks. The compiler does not, however, check
	     all statements in the do loop for illegal reductions. You must
	     ensure that the reduction variable is used correctly in a
	     reduction operation.

	     affinity (variable) = thread (expression)

	     The effect of thread-affinity is to execute iteration "i" on the
	     thread number given by the user-supplied expression (modulo the
	     number of threads). Since the threads may need to evaluate this
	     expression in each iteration of the loop, the variables used in
	     the expression (other than the loop induction variable) must be
	     declared shared and must not be modified during the execution of
	     the loop. Violating these rules may lead to incorrect results.

	     If the expression does not depend on the loop induction variable,
	     then all iterations will execute on the same thread, and will not
	     benefit from parallel execution.

	     schedtype (type)

	     Tells the multiprocessing C compiler how to share the loop
	     iterations among the processors. The schedtype chosen depends on
	     the type of system you are using and the number of programs
	     executing.	 You can use the following valid types to modify
	     schedtype:

		 simple (the default)

		 Tells the run time scheduler to partition the iterations
		 evenly among all the available threads.

		 runtime

		 Tells the compiler that the real schedule type will be
		 specified at run time.

		 dynamic

		 Tells the run time scheduler to give each thread chunksize
		 iterations of the loop. chunksize should be smaller than
		 (number of total iterations)/(number of threads). The
		 advantage of dynamic over simple is that dynamic helps
		 distribute the work more evenly than simple.

		 Depending on the data, some iterations of a loop can take
		 longer to compute than others, so some threads may finish
		 long before the others.  In this situation, if the iterations
		 are distributed by simple, then the thread waits for the
		 others. But if the iterations are distributed by dynamic, the
		 thread doesn't wait, but goes back to get another chunksize
		 iteration until the threads of execution have run all the
		 iterations of the loop.

		 interleave

		 Tells the run time scheduler to give each thread chunksize
		 iterations (described below) of the loop, which are then
		 assigned to the threads in an interleaved way.

		 gss (guided self-scheduling)

		 Tells the run time scheduler to give each processor a varied
		 number of iterations of the loop. This is like dynamic, but
		 instead of a fixed chunksize, the chunks start out large and
		 shrink as the loop progresses.

		 If I iterations remain and P threads are working on them, the
		 piece size is roughly:	 I/(2P) + 1

		 Programs with triangular matrices should use gss.

		 chunksize (expr)

		 Tells the multiprocessing C/C++ compiler how many iterations
		 to define as a chunk when you use the dynamic or interleave
		 modifier (described above).

		 expr should be a positive integer, and should evaluate to the
		 following formula:

		      number of iterations / X

		 where X is between twice and ten times the number of threads.
		 Select twice the number of threads when iterations vary
		 slightly. Reduce the chunk size to reflect the increasing
		 variance in the iterations.  Performance gains may diminish
		 after increasing X to ten times the number of threads.
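
	 As an illustrative sketch combining several of the modifiers above
	 (the dot product and the array names are hypothetical; a chunksize
	 modifier could be added as described above):

	      double dot(double *a, double *b, int n)
	      {
	          double sum = 0.0;
	          int i;

	      #pragma parallel shared(a, b, n, sum) local(i)
	          {
	      #pragma pfor iterate(i=0; n; 1) reduction(sum) schedtype(dynamic)
	              for (i = 0; i < n; i++)
	                  sum += a[i] * b[i];   /* local copies of sum are
	                                           combined at loop exit */
	          }
	          return sum;
	      }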

     #pragma one processor

	 A #pragma one processor directive causes the statement that follows
	 it to be executed by exactly one thread.

	 The syntax of this pragma is:

	 #pragma one processor

	 { code }
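
	 As an illustrative sketch (the message is hypothetical), exactly one
	 thread of a parallel region prints a banner; the remaining threads
	 skip the protected statement:

	      #include <stdio.h>

	      void report(int n)
	      {
	      #pragma parallel shared(n)
	          {
	      #pragma one processor
	              { printf("starting %d iterations\n", n); }
	          }
	      }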

     #pragma critical

	 Sometimes the bulk of the work done by a loop can be done in
	 parallel, but the entire loop cannot run in parallel because of a
	 single data-dependent statement. Often, you can move such a statement
	 out of the parallel region.  When that is not possible, you can
	 sometimes use a lock on the statement to preserve the integrity of
	 the data.

	 In the multiprocessing C/C++ compiler, use the critical pragma to put
	 a lock on a critical statement (or compound statement using { }).
	 When you put a lock on a statement, only one thread at a time can
	 execute that statement.  If one thread is already working on a
	 critical protected statement, any other thread that wants to execute
	 that statement must wait until that thread has finished executing it.

	 The syntax of the critical pragma is:

	 #pragma critical (lock_variable)

	 { code }

	 The statement(s) after the critical pragma will be executed by all
	 threads, one at a time. The lock variable lock_variable is an
	 optional integer variable that must be initialized to zero. The
	 parentheses are required. If you don't specify a lock variable, the
	 compiler automatically supplies one.  Multiple critical constructs
	 inside the same parallel region are considered to be independent of
	 each other unless they use the same explicit lock variable.
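
	 As an illustrative sketch (the histogram update is hypothetical),
	 the single data-dependent statement of a loop can be protected while
	 the rest of the loop runs in parallel:

	      int histlock = 0;        /* lock variable, initialized to zero */

	      void histogram(int *hist, int *data, int n)
	      {
	          int i;

	      #pragma parallel shared(hist, data, n) local(i)
	          {
	      #pragma pfor iterate(i=0; n; 1)
	              for (i = 0; i < n; i++) {
	      #pragma critical (histlock)
	                  { hist[data[i]]++; }   /* one thread at a time */
	              }
	          }
	      }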

     #pragma independent

	 Running a loop in parallel is a class of parallelism sometimes called
	 fine-grained parallelism or homogeneous parallelism. It is called
	 homogeneous because all the threads execute the same code on
	 different data.  Another class of parallelism is called coarse-
	 grained parallelism or heterogeneous parallelism. As the name
	 suggests, the code in each thread of execution is different.

	 Ensuring data independence for heterogeneous code executed in
	 parallel is not always as easy as it is for homogeneous code executed
	 in parallel.  (Ensuring data independence for homogeneous code is not
	 a trivial task.)

	 The independent pragma has no modifiers. Use this pragma to tell the
	 multiprocessing C/C++ compiler to run code in parallel with the rest
	 of the code in the parallel region.

	 The syntax for #pragma independent is:

	 #pragma independent

	 { code }
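
	 As an illustrative sketch (both routines are hypothetical), one
	 thread runs an independent task while the other threads go on to the
	 local code of the region:

	      int mp_my_threadnum();
	      extern void audit_log();         /* hypothetical side task */
	      extern void local_work(int);     /* hypothetical per-thread work */

	      void region()
	      {
	      #pragma parallel
	          {
	      #pragma independent
	              { audit_log(); }         /* executed by one thread, in
	                                          parallel with the rest */

	              local_work(mp_my_threadnum());  /* local code: all threads */
	          }
	      }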

     Synchronization Directives

     To account for data dependencies, it is sometimes necessary for threads
     to wait for all other threads to complete executing an earlier section of
     code.  Two sets of directives implement this coordination: #pragma
     synchronize and #pragma enter/exit gate.

     #pragma synchronize

	  A #pragma synchronize tells the multiprocessing C/C++ compiler that
	  within a parallel region, no thread can execute the statements that
	  follow this pragma until all threads have reached it. This
	  directive is a classic barrier construct.

	  The syntax for this pragma is:

	  #pragma synchronize
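
	  As an illustrative sketch (the work arrays and the function f are
	  hypothetical), a barrier between two work-sharing loops guarantees
	  that the first loop is complete before the second begins:

	       extern double f(int);           /* hypothetical */

	       void two_pass(double *a, double *b, int n)
	       {
	           int i;

	       #pragma parallel shared(a, b, n) local(i)
	           {
	       #pragma pfor iterate(i=0; n; 1)
	               for (i = 0; i < n; i++)
	                   a[i] = f(i);

	       #pragma synchronize             /* classic barrier */

	       #pragma pfor iterate(i=0; n; 1)
	               for (i = 0; i < n; i++)
	                   b[i] = a[n-1-i];    /* safe: first loop is done */
	           }
	       }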

     #pragma enter gate
     #pragma exit gate

	  You can use two additional pragmas to coordinate the processing of
	  code within a parallel region. These additional pragmas work as a
	  matched set.	They are #pragma enter gate and #pragma exit gate.

	  A gate is a special barrier. No thread can exit the gate until all
	  threads have entered it. This construct gives you more flexibility
	  when managing dependencies between the work-sharing constructs
	  within a parallel region.

	  The syntax of the enter gate pragma is:

	  #pragma enter gate

	  For example, construct D may be dependent on construct A, and
	  construct F may be dependent on construct B. However, you do not
	  want to stop at construct D because all the threads have not cleared
	  B. By using enter/exit gate pairs, you can make subtle distinctions
	  about which construct is dependent on which other construct.

	  Put this pragma after the work-sharing construct that all threads
	  must clear before the #pragma exit gate of the same name.

	  The syntax of the exit gate pragma is:

	  #pragma exit gate

	  Put this pragma before the work-sharing construct that is dependent
	  on the preceding #pragma enter gate. No thread enters this work-
	  sharing construct until all threads have cleared the work-sharing
	  construct controlled by the corresponding #pragma enter gate.
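
	  As an illustrative sketch (the three loops stand in for the
	  constructs discussed above), construct D below depends on A but not
	  on B, so the gate lets threads start B without waiting, while still
	  holding D until every thread has cleared A:

	       void staged(double *a, double *c, double *d, int n)
	       {
	           int i;

	       #pragma parallel shared(a, c, d, n) local(i)
	           {
	       #pragma pfor iterate(i=0; n; 1)    /* construct A */
	               for (i = 0; i < n; i++)
	                   a[i] = i;
	       #pragma enter gate                 /* threads check in after A */

	       #pragma pfor iterate(i=0; n; 1)    /* construct B: independent
	                                             of A, so no wait here */
	               for (i = 0; i < n; i++)
	                   c[i] = 2.0 * i;

	       #pragma exit gate                  /* no thread passes until all
	                                             have entered the gate */
	       #pragma pfor iterate(i=0; n; 1)    /* construct D: needs A */
	               for (i = 0; i < n; i++)
	                   d[i] = a[n-1-i];
	           }
	       }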

     #pragma page_place

	  The syntax of this pragma is:

	  #pragma page_place (addr, size, threadnum)

	  where addr is the starting address, size is the size in bytes, and
	  threadnum is the thread.

	  On a system with physically distributed shared memory (for example,
	  Origin2000), you can explicitly place all data pages spanned by the
	  virtual address range [addr, addr + size-1] in the physical memory
	  of the processor corresponding to the specified thread.
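
	  As an illustrative sketch (the array size and thread number are
	  hypothetical), the pages of a shared array can be placed in the
	  memory of the node running thread 0:

	       double a[100000];                  /* shared data */

	       void place_a()
	       {
	           int i;

	       #pragma page_place (a, sizeof(a), 0)
	           for (i = 0; i < 100000; i++)   /* use of a after placement */
	               a[i] = 0.0;
	       }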

SEE ALSO
     cc(1), f77(1), mp(3f), sync(3c), sync(3f), MIPSpro Power C Programmer's
     Guide, MIPSpro C Language Reference Manual, MIPSpro FORTRAN 77
     Programmer's Guide
