regexp man page on SmartOS


REGEXP(5)							     REGEXP(5)

NAME
       regexp, compile, step, advance - simple regular expression compile and
       match routines

SYNOPSIS
       #define INIT declarations
       #define GETC(void) getc code
       #define PEEKC(void) peekc code
       #define UNGETC(void) ungetc code
       #define RETURN(ptr) return code
       #define ERROR(val) error code

       extern char *loc1, *loc2, *locs;

       #include <regexp.h>

       char *compile(char *instring, char *expbuf, const char *endbuf, int eof);

       int step(const char *string, const char *expbuf);

       int advance(const char *string, const char *expbuf);

DESCRIPTION
       Regular Expressions (REs) provide a mechanism to select specific
       strings from a set of character strings. The Simple Regular Expressions
       described below differ from the	Internationalized Regular  Expressions
       described on the regex(5) manual page in the following ways:

	   o	  only Basic Regular Expressions are supported

           o      the Internationalization features (character class,
                  equivalence class, and multi-character collation) are not
                  supported.

       The functions step(), advance(), and compile() are general purpose
       regular expression matching routines to be used in programs that
       perform regular expression matching. These functions are defined by
       the <regexp.h> header.

       The functions step() and advance() do pattern matching given a  charac‐
       ter string and a compiled regular expression as input.

       The  function  compile() takes as input a regular expression as defined
       below and produces a compiled expression that can be used  with	step()
       or advance().

   Basic Regular Expressions
       A  regular expression specifies a set of character strings. A member of
       this set of strings is said to be matched by  the  regular  expression.
       Some characters have special meaning when used in a regular expression;
       other characters stand for themselves.

       The following one-character REs match a single character:

       1.1    An ordinary character (not one of those discussed in 1.2 below)
              is a one-character RE that matches itself.

       1.2    A backslash (\) followed by any special character is a
              one-character RE that matches the special character itself.
              The special characters are:

              a.    ., *, [, and \ (period, asterisk, left square bracket, and
                    backslash, respectively), which are always special, except
                    when they appear within square brackets ([]; see 1.4
                    below).

              b.    ^ (caret or circumflex), which is special at the beginning
                    of an entire RE (see 4.1 and 4.3 below), or when it
                    immediately follows the left of a pair of square brackets
                    ([]) (see 1.4 below).

              c.    $ (dollar sign), which is special at the end of an entire
                    RE (see 4.2 below).

              d.    The character used to bound (that is, delimit) an entire
                    RE, which is special for that RE (for example, see how
                    slash (/) is used in the g command, below.)

       1.3    A period (.) is a one-character RE that matches any character
              except new-line.

       1.4    A non-empty string of characters enclosed in square brackets
	      ([]) is a one-character RE that matches  any  one	 character  in
	      that string. If, however, the first character of the string is a
	      circumflex (^),  the  one-character  RE  matches	any  character
	      except  new-line and the remaining characters in the string. The
	      ^ has this special meaning  only	if  it	occurs	first  in  the
	      string. The minus (-) may be used to indicate a range of consec‐
	      utive  characters;  for  example,	  [0-9]	  is   equivalent   to
	      [0123456789].  The  -  loses  this  special meaning if it occurs
	      first (after an initial ^, if any) or last in  the  string.  The
	      right  square  bracket (]) does not terminate such a string when
	      it is the first character within it  (after  an  initial	^,  if
	      any);  for example, []a-f] matches either a right square bracket
	      (]) or one of the ASCII letters a through f inclusive. The  four
	      characters  listed  in  1.2.a  above stand for themselves within
	      such a string of characters.
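
       The bracket rules above can be sketched in C. This is an
       illustration only and is not how <regexp.h> implements them; the
       names bracket_end() and bracket_match() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regexp.h>): given p pointing just
 * past the opening '[', return a pointer to the closing ']', or NULL
 * if the bracket expression is unterminated.
 */
const char *bracket_end(const char *p)
{
    assert(p != NULL);
    if (*p == '^')              /* optional leading complement marker */
        p++;
    if (*p == ']')              /* a leading ] is an ordinary member */
        p++;
    while (*p != '\0' && *p != ']')
        p++;
    return *p == ']' ? p : NULL;
}

/*
 * Hypothetical helper: return 1 if character c is matched by the
 * bracket expression whose body (the n bytes between '[' and the
 * closing ']') starts at set, otherwise 0.
 */
int bracket_match(const char *set, size_t n, int c)
{
    size_t i = 0;
    int negate = 0;
    int found = 0;
    unsigned char uc = (unsigned char)c;

    assert(set != NULL);
    if (n > 0 && set[0] == '^') {   /* ^ is special only when first */
        negate = 1;
        i = 1;
    }
    while (i < n) {
        unsigned char lo = (unsigned char)set[i];

        /* A '-' with a character on each side denotes a range. */
        if (i + 2 < n && set[i + 1] == '-') {
            unsigned char hi = (unsigned char)set[i + 2];

            if (lo <= uc && uc <= hi)
                found = 1;
            i += 3;
        } else {
            if (lo == uc)
                found = 1;
            i++;
        }
    }
    if (negate)                     /* complement never matches new-line */
        return uc != '\n' && !found;
    return found;
}
```

       For the expression []a-f], bracket_end() locates the second right
       square bracket, and the body passed to bracket_match() is the four
       characters ]a-f.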

       The following rules may be used to construct REs from one-character
       REs:

       2.1    A one-character RE is a RE that matches whatever the
              one-character RE matches.

       2.2    A one-character RE followed by an asterisk (*) is a RE that
              matches 0 or more occurrences of the one-character RE. If there
              is any choice, the longest leftmost string that permits a match
              is chosen.

       2.3    A one-character RE followed by \{m\}, \{m,\}, or \{m,n\} is a RE
              that matches a range of occurrences of the one-character RE. The
              values of m and n must be non-negative integers less than 256;
              \{m\} matches exactly m occurrences; \{m,\} matches at least m
              occurrences; \{m,n\} matches any number of occurrences between m
              and n inclusive. Whenever a choice exists, the RE matches as
              many occurrences as possible.

       2.4    The concatenation of REs is a RE that matches the concatenation
              of the strings matched by each component of the RE.

       2.5    A RE enclosed between the character sequences \( and \) is a RE
              that matches whatever the unadorned RE matches.

       2.6    The expression \n matches the same string of characters as was
	      matched by an expression enclosed between \( and \)  earlier  in
	      the  same RE. Here n is a digit; the sub-expression specified is
	      that beginning with the n-th occurrence of \( counting from  the
	      left. For example, the expression ^\(.*\)\1$ matches a line con‐
	      sisting of two repeated appearances of the same string.
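
       The interval and back-reference constructs can be tried out with the
       POSIX regcomp() and regexec() functions from <regex.h>. That is a
       different interface from the routines on this page and is used
       here only because it accepts the same basic RE syntax for these
       constructs. The helper name bre_matches() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper built on POSIX <regex.h>, not on <regexp.h>:
 * returns 1 if line matches the basic RE in pattern, 0 if it does
 * not, and -1 if the pattern cannot be compiled.
 */
int bre_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && line != NULL);
    /* Omitting REG_EXTENDED from cflags selects basic REs. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, line, 1, m, 0);
    regfree(&re);
    return rc == 0;
}
```

       With this helper, the pattern ^\(.*\)\1$ accepts the line abcabc
       and rejects abcabd, and ^ab\{2,3\}c$ accepts abbc but rejects
       abbbbc.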

       An RE may be constrained to match words.

       3.1    \< constrains a RE to match the beginning of a string or to
              follow a character that is not a digit, underscore, or letter.
              The first character matching the RE must be a digit,
              underscore, or letter.

       3.2    \> constrains a RE to match the end of a string or to precede a
              character that is not a digit, underscore, or letter.
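
       The two word constraints amount to simple tests on neighboring
       characters. The following sketch assumes that "letter" means
       whatever isalpha() accepts in the current locale; the function
       names are hypothetical and are not part of <regexp.h>.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A word character is a digit, underscore, or letter. */
static int is_word_char(int c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/*
 * Hypothetical helper: return 1 if a match beginning at p, inside
 * the string that begins at start, would satisfy the \< constraint.
 */
int at_word_start(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p >= start);
    if (!is_word_char(*p))      /* first matched character must be a word character */
        return 0;
    return p == start || !is_word_char(p[-1]);
}

/*
 * Hypothetical helper: return 1 if a match ending just before p
 * would satisfy the \> constraint.
 */
int at_word_end(const char *p)
{
    assert(p != NULL);
    return *p == '\0' || !is_word_char(*p);
}
```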

       An entire RE may be constrained to match only an initial segment or
       final segment of a line (or both).

       4.1    A circumflex (^) at the beginning of an entire RE constrains
              that RE to match an initial segment of a line.

       4.2    A dollar sign ($) at the end of an entire RE constrains that RE
              to match a final segment of a line.

       4.3    The construction ^entire RE$ constrains the entire RE to match
              the entire line.

       The null RE (for example, //) is equivalent to the last RE encountered.

   Addressing with REs
       Addresses are constructed as follows:

	   1.	  The character "." addresses the current line.

	   2.	  The character "$" addresses the last line of the buffer.

	   3.	  A decimal number n addresses the n-th line of the buffer.

	   4.	  'x addresses the line marked with the mark name character x,
		  which	 must  be  an ASCII lower-case letter (a-z). Lines are
		  marked with the k command described below.

	   5.	  A RE enclosed by slashes (/) addresses the first line	 found
		  by  searching	 forward  from	the line following the current
		  line toward the end of the buffer and stopping at the	 first
		  line	containing a string matching the RE. If necessary, the
		  search wraps around to the beginning of the buffer and  con‐
		  tinues  up  to  and  including the current line, so that the
		  entire buffer is searched.

	   6.	  A RE enclosed in question marks (?) addresses the first line
		  found by searching backward from the line preceding the cur‐
		  rent line toward the beginning of the buffer and stopping at
		  the  first line containing a string matching the RE. If nec‐
		  essary, the search wraps around to the end of the buffer and
		  continues up to and including the current line.

	   7.	  An  address  followed by a plus sign (+) or a minus sign (-)
		  followed by a decimal number	specifies  that	 address  plus
		  (respectively minus) the indicated number of lines. A short‐
		  hand for .+5 is .5.

	   8.	  If an address begins with + or -, the addition  or  subtrac‐
		  tion is taken with respect to the current line; for example,
		  -5 is understood to mean .-5.

	   9.	  If an address ends with + or -, then 1 is added to  or  sub‐
		  tracted  from the address, respectively. As a consequence of
		  this rule and of Rule 8, immediately above,  the  address  -
		  refers  to the line preceding the current line. (To maintain
		  compatibility with earlier versions of the editor, the char‐
		  acter ^ in addresses is entirely equivalent to -.) Moreover,
		  trailing + and - characters have a cumulative effect, so  --
		  refers to the current line less 2.

	   10.	  For  convenience,  a	comma  (,) stands for the address pair
		  1,$, while a semicolon (;) stands for the pair .,$.
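
       The arithmetic in rules 1-3 and 7-9 can be sketched as a small
       resolver. It is an illustration, not the editor's own code: the
       name resolve_address() is hypothetical, marks and RE searches
       (rules 4-6) and the address pairs of rule 10 are left out, and the
       result is not checked against the buffer bounds.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Read a run of decimal digits at *pp, advancing *pp past them. */
static long read_number(const char **pp)
{
    long n = 0;

    while (isdigit((unsigned char)**pp))
        n = n * 10 + (*(*pp)++ - '0');
    return n;
}

/*
 * Hypothetical helper: resolve addr given the current line cur and
 * the last line of the buffer. Stores the result in *out and returns
 * 1, or returns 0 if addr uses syntax this sketch does not handle.
 */
int resolve_address(const char *addr, long cur, long last, long *out)
{
    const char *p = addr;
    long line;

    assert(addr != NULL && out != NULL);
    if (*p == '.') {                        /* rule 1 */
        line = cur;
        p++;
    } else if (*p == '$') {                 /* rule 2 */
        line = last;
        p++;
    } else if (isdigit((unsigned char)*p)) {
        line = read_number(&p);             /* rule 3 */
    } else if (*p == '+' || *p == '-' || *p == '^') {
        line = cur;                         /* rule 8 */
    } else {
        return 0;
    }

    while (*p != '\0') {
        long sign = 1;

        if (*p == '+') {
            p++;
        } else if (*p == '-' || *p == '^') { /* ^ is equivalent to - */
            sign = -1;
            p++;
        } else if (!isdigit((unsigned char)*p)) {
            return 0;
        }
        /* Rule 7 if a number follows; rule 9 (offset of 1) if not. */
        if (isdigit((unsigned char)*p))
            line += sign * read_number(&p);
        else
            line += sign;
    }
    *out = line;
    return 1;
}
```

       With the current line at 10 and the last line at 20, the
       addresses -5, --, .5, and $-1 resolve to lines 5, 8, 15, and 19.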

   Characters With Special Meaning
       Characters that have special meaning except when they appear within
       square brackets ([]) or are preceded by \ are: ., *, [, \. Other
       special characters, such as $, have special meaning in more restricted
       contexts.

       The  character ^ at the beginning of an expression permits a successful
       match only immediately after a newline, and the character $ at the  end
       of an expression requires a trailing newline.

       Two characters have special meaning only when used within square brack‐
       ets. The character - denotes a range, [c-c], unless it  is  just	 after
       the  open  bracket or before the closing bracket, [-c] or [c-] in which
       case it has no special meaning. When used within brackets, the
       character ^ has the meaning "complement of" if it immediately follows
       the open bracket (example: [^c]); elsewhere between brackets (example:
       [c^]) it stands for the ordinary character ^.

       The  special meaning of the \ operator can be escaped only by preceding
       it with another \, for example \\.

       Programs must have  the	following  five	 macros	 declared  before  the
       #include	 <regexp.h>  statement. These macros are used by the compile()
       routine.	 The macros GETC, PEEKC, and UNGETC  operate  on  the  regular
       expression given as input to compile().

       GETC           This macro returns the value of the next character
                      (byte) in the regular expression pattern. Successive
                      calls to GETC should return successive characters of
                      the regular expression.

       PEEKC          This macro returns the next character (byte) in the
                      regular expression. Immediately successive calls to
                      PEEKC should return the same character, which should
                      also be the next character returned by GETC.

       UNGETC(c)      This macro causes the argument c to be returned by the
                      next call to GETC and PEEKC. No more than one character
                      of pushback is ever needed and this character is
                      guaranteed to be the last character read by GETC. The
                      return value of the macro UNGETC(c) is always ignored.

       RETURN(ptr)    This macro is used on normal exit of the compile()
                      routine. The value of the argument ptr is a pointer to
                      the character after the last character of the compiled
                      regular expression. This is useful to programs which
                      have memory allocation to manage.

       ERROR(val)     This macro is the abnormal return from the compile()
                      routine. The argument val is an error number (see
                      ERRORS below for meanings). This call should never
                      return.
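
       The contract between GETC, PEEKC, and UNGETC can be exercised
       without <regexp.h> by a stand-in reader. In the sketch below the
       macros have the shape a caller would supply, but read_pattern() is
       a hypothetical function that only imitates the way compile()
       consumes its input; it is not compile() itself.

```c
#include <assert.h>
#include <stddef.h>

/* Input macros over a string pointer, in the shape a caller supplies. */
#define INIT        const char *sp = instring;
#define GETC()      (*sp++)
#define PEEKC()     (*sp)
#define UNGETC(c)   (--sp)

/*
 * Hypothetical stand-in for the input side of compile(): copies the
 * pattern text up to the eof delimiter into out, treating a
 * backslash-escaped delimiter as an ordinary character, and returns
 * a pointer to the first character it left unread.
 */
const char *read_pattern(const char *instring, int eof,
                         char *out, size_t outsz)
{
    INIT
    size_t n = 0;
    int c;

    assert(instring != NULL && out != NULL && outsz > 0);
    while (n + 1 < outsz) {
        c = GETC();
        if (c == eof || c == '\0') {
            UNGETC(c);          /* push back the character just read */
            break;
        }
        if (c == '\\' && PEEKC() == eof)
            c = GETC();         /* returns the character PEEKC() reported */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return sp;
}
```

       Reading ab\/c/rest with / as the delimiter stores ab/c in the
       output buffer and returns a pointer to the unescaped /.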

       The syntax of the compile() routine is as follows:

	 compile(instring, expbuf, endbuf, eof)

       The  first  parameter,  instring,  is never used explicitly by the com‐
       pile() routine but is useful for	 programs  that	 pass  down  different
       pointers to input characters. It is sometimes used in the INIT declara‐
       tion (see below). Programs which call functions to input characters  or
       have characters in an external array can pass down a value of (char *)0
       for this parameter.

       The next parameter, expbuf, is a character pointer. It  points  to  the
       place where the compiled regular expression will be placed.

       The  parameter  endbuf  is  one more than the highest address where the
       compiled regular expression may be placed. If the  compiled  expression
       cannot fit in (endbuf-expbuf) bytes, a call to ERROR(50) is made.

       The  parameter  eof is the character which marks the end of the regular
       expression. This character is usually a /.

       Each program that includes the  <regexp.h>  header  file	 must  have  a
       #define	statement  for INIT. It is used for dependent declarations and
       initializations. Most often it is used to set a	register  variable  to
       point  to the beginning of the regular expression so that this register
       variable can be used in the declarations for GETC, PEEKC,  and  UNGETC.
       Otherwise  it  can  be used to declare external variables that might be
       used by GETC, PEEKC and UNGETC. (See EXAMPLES below.)

   step(), advance()
       The first parameter to the step() and advance() functions is a  pointer
       to a string of characters to be checked for a match. This string should
       be null terminated.

       The second parameter, expbuf, is the compiled regular expression	 which
       was obtained by a call to the function compile().

       The  function  step()  returns  non-zero	 if  some  substring of string
       matches the regular expression in expbuf and  0 if there is  no	match.
       If  there is a match, two external character pointers are set as a side
       effect to the call to step(). The variable loc1	points	to  the	 first
       character that matched the regular expression; the variable loc2 points
       to the character after the last	character  that	 matches  the  regular
       expression.   Thus  if  the regular expression matches the entire input
       string, loc1 will point to the first character of string and loc2  will
       point to the null at the end of string.
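
       Where <regexp.h> is not available, the same "first character
       matched" and "character after the match" positions can be obtained
       from the POSIX regexec() function in <regex.h>. The sketch below
       uses that different interface; find_match() is a hypothetical
       name, and its two output pointers play the roles of loc1 and loc2.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper built on POSIX <regex.h>: searches string for
 * the basic RE in pattern. On a match it sets *first and *past in
 * the way step() sets loc1 and loc2, and returns 1; otherwise it
 * returns 0.
 */
int find_match(const char *pattern, const char *string,
               const char **first, const char **past)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && string != NULL);
    assert(first != NULL && past != NULL);
    if (regcomp(&re, pattern, 0) != 0)  /* cflags 0 selects basic REs */
        return 0;
    rc = regexec(&re, string, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    *first = string + m[0].rm_so;       /* counterpart of loc1 */
    *past = string + m[0].rm_eo;        /* counterpart of loc2 */
    return 1;
}
```

       When the pattern matches the whole input, as a.*c does for the
       string abbbbc, the first pointer refers to the start of the string
       and the second to its terminating null.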

       The  function  advance()	 returns  non-zero if the initial substring of
       string matches the regular expression in expbuf. If there is  a	match,
       an external character pointer, loc2, is set as a side effect. The vari‐
       able loc2 points to the next character in string after the last charac‐
       ter that matched.

       When  advance() encounters a * or \{ \} sequence in the regular expres‐
       sion, it will advance its pointer to the string to be matched as far as
       possible	 and  will recursively call itself trying to match the rest of
       the string to the rest of the regular expression. As long as  there  is
       no  match,  advance()  will  back  up along the string until it finds a
       match or reaches the point in the string that initially matched the   *
       or \{ \}.  It is sometimes desirable to stop this backing up before the
       initial point in the string  is	reached.  If  the  external  character
       pointer locs is equal to the point in the string at some time during the
       backing up process, advance() will break out of the loop that backs  up
       and will return zero.
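
       The backing-up behavior can be illustrated with a toy matcher. It
       is a sketch under stated assumptions, not the <regexp.h> code: it
       works on pattern text rather than a compiled expression, supports
       only ordinary characters, the period, and *, and the names
       toy_advance(), toy_loc2, and toy_locs are hypothetical stand-ins
       for advance(), loc2, and locs.

```c
#include <assert.h>
#include <stddef.h>

const char *toy_loc2;   /* stand-in for loc2 */
const char *toy_locs;   /* stand-in for locs; NULL disables the check */

/* Does the one-character RE pc match the string character sc? */
static int one_char(char pc, char sc)
{
    if (sc == '\0')
        return 0;
    return pc == '.' ? sc != '\n' : pc == sc;
}

/* Return 1 if an initial substring of s matches pat, otherwise 0. */
int toy_advance(const char *s, const char *pat)
{
    assert(s != NULL && pat != NULL);
    while (*pat != '\0') {
        if (pat[1] == '*') {
            const char *start = s;

            while (one_char(*pat, *s))
                s++;                    /* advance as far as possible */
            pat += 2;
            for (;;) {                  /* then back up along the string */
                if (s == toy_locs)
                    return 0;           /* stop early at the locs position */
                if (toy_advance(s, pat))
                    return 1;           /* rest of string matched rest of RE */
                if (s == start)
                    return 0;           /* back at the initial point */
                s--;
            }
        }
        if (!one_char(*pat, *s))
            return 0;
        pat++;
        s++;
    }
    toy_loc2 = s;                       /* character after the match */
    return 1;
}
```

       Matching a*ab against aaabX first lets a* consume three
       characters, fails on the remainder, backs up one position, and
       succeeds. If toy_locs is set to the third character of the string
       beforehand, the loop stops at that position and the call returns
       zero.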

       The external variables circf, sed, and nbra are reserved.

EXAMPLES
       Example 1 Using Regular Expression Macros and Calls

       The  following  is  an example of how the regular expression macros and
       calls might be defined by an application program:

	 #define INIT	    register char *sp = instring;
	 #define GETC()	    (*sp++)
	 #define PEEKC()    (*sp)
	 #define UNGETC(c)  (--sp)
	 #define RETURN(c)  return (c);
	 #define ERROR(c)   regerr()

	 #include <regexp.h>
	  . . .
	       (void) compile(*argv, expbuf, &expbuf[ESIZE],'\0');
	  . . .
	       if (step(linebuf, expbuf))

       The function compile() uses the macro RETURN on success and the macro
       ERROR on failure (see above). The functions step() and advance() return
       non-zero on a successful match and zero if there is no match. Errors
       are:

	     range endpoint too large.

	     bad number.

	     \ digit out of range.

	     illegal or missing delimiter.

	     no remembered search string.

	     \( \) imbalance.

	     too many \(.

	     more than 2 numbers given in \{ \}.

	     } expected after \.

	     first number exceeds second in \{ \}.

	     [ ] imbalance.

	     regular expression overflow.


				 May 20, 2002			     REGEXP(5)
