PCREPARTIAL(3)					   PCREPARTIAL(3)

NAME
       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

       In  normal  use	of  PCRE,  if  the  subject  string  that is passed to
       pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
       short  to  match	 the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
       There are circumstances where it might be helpful to  distinguish  this
       case from other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

	 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error	 as  soon  as  a  mistake  is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi-
       ate  feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered.  Partial  matching
       can  also  sometimes be useful when the subject string is very long and
       is not all available at once.
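
       The three outcomes described above (complete match, partial  match,
       no match) can be sketched without PCRE at all. The following is a
       hand-coded checker for the ddmmmyy pattern only, written to
       illustrate the idea; it is not how PCRE works internally, and the
       names date_status, check_tail and check_date are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. PCRE itself reports these
   cases through the return value of pcre_exec() or pcre_dfa_exec(). */
typedef enum { DATE_NOMATCH, DATE_PARTIAL, DATE_COMPLETE } date_status;

static const char *const MONTHS[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Classify the text that follows the day digits: mmm then yy. */
static date_status check_tail(const char *s)
{
    size_t len = strlen(s);
    size_t n = len < 3 ? len : 3;    /* how much of the month is typed */

    for (int i = 0; i < 12; i++) {
        if (strncmp(s, MONTHS[i], n) != 0)
            continue;                /* not this month, try the next */
        if (len < 3)
            return DATE_PARTIAL;     /* month name still being typed */

        /* Full month name seen; at most two year digits may follow. */
        size_t ylen = len - 3;
        if (ylen > 2)
            return DATE_NOMATCH;
        for (size_t k = 0; k < ylen; k++)
            if (!isdigit((unsigned char)s[3 + k]))
                return DATE_NOMATCH;
        return ylen == 2 ? DATE_COMPLETE : DATE_PARTIAL;
    }
    return DATE_NOMATCH;
}

/* Classify what has been typed so far against ddmmmyy. An empty
   string is reported as no match, mirroring the rule that a partial
   match can never be empty. */
date_status check_date(const char *s)
{
    assert(s != NULL);
    if (!isdigit((unsigned char)s[0]))
        return DATE_NOMATCH;

    /* The day is one or two digits, so try both readings and keep
       the better result (complete beats partial beats no match). */
    date_status best = check_tail(s + 1);
    if (isdigit((unsigned char)s[1])) {
        date_status two = check_tail(s + 2);
        if (two > best)
            best = two;
    }
    return best;
}
```

       An application could call check_date() after every keystroke and
       reject the keystroke whenever the result is DATE_NOMATCH. With the
       inputs "25jun04", "25dec3", "3ju", "3juj" and "j", this sketch
       reports complete, partial, partial, no match and no match.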

       PCRE supports partial matching by means of  the	PCRE_PARTIAL_SOFT  and
       PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
       pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
       for PCRE_PARTIAL_SOFT. The essential difference between the two options
       is whether or not a partial match is preferred to an  alternative  com-
       plete  match,  though the details differ between the two matching func-
       tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.

       Setting a partial matching option disables two of PCRE's optimizations.
       PCRE  remembers the last literal byte in a pattern, and abandons match-
       ing immediately if such a byte is not present in	 the  subject  string.
       This  optimization cannot be used for a subject string that might match
       only partially. If the pattern was  studied,  PCRE  knows  the  minimum
       length  of  a  matching string, and does not bother to run the matching
       function on shorter strings. This optimization  is  also	 disabled  for
       partial matching.

PARTIAL MATCHING USING pcre_exec()

       A partial match occurs during a call to pcre_exec() whenever the end of
       the subject string is reached successfully, but	matching  cannot  con-
       tinue because more characters are needed. However, at least one charac-
       ter must have been matched. (In other words, a partial match can	 never
       be an empty string.)

       If  PCRE_PARTIAL_SOFT  is  set,	the  partial  match is remembered, but
       matching continues as normal, and other alternatives in the pattern are
       tried.  If  no  complete  match  can  be  found,  pcre_exec()   returns
       PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
       two slots in the offsets vector, the first of them is set to the offset
       of the earliest character that was inspected when the partial match was
       found.  For  convenience,  the  second  offset points to the end of the
       string so that a substring can easily be identified.

       For the majority of patterns, the first offset identifies the start  of
       the  partially matched string. However, for patterns that contain look-
       behind assertions, or \K, or begin with \b or  \B,  earlier  characters
       have been inspected while carrying out the match. For example:

	 /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the offsets after a partial match are for
       the  substring  "abc12",	 because  all  these  characters are needed if
       another match is tried with extra characters added.

       If there is more than one partial match, the first one that  was	 found
       provides the data that is returned. Consider this pattern:

	 /123\w+X|dogY/

       If  this is matched against the subject string "abc123dog", both alter-
       natives fail to match, but the end of the  subject  is  reached	during
       matching,    so	  PCRE_ERROR_PARTIAL	is    returned	  instead   of
       PCRE_ERROR_NOMATCH. The	offsets	 are  set  to  3  and  9,  identifying
       "123dog"	 as  the first partial match that was found. (In this example,
       there are two partial matches,  because	"dog"  on  its	own  partially
       matches the second alternative.)
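
       Once the two offsets are available, extracting the substring they
       identify is plain pointer arithmetic. The helper below is a minimal
       sketch: copy_partial is a hypothetical name, it makes no PCRE calls,
       and it assumes the caller has already received PCRE_ERROR_PARTIAL
       with at least two slots in the offsets vector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy subject[ovector[0] .. ovector[1]) into out as a C string.
   Returns the number of bytes copied, or -1 if the offsets do not
   form a valid range or out is too small to hold the result. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t outsize)
{
    assert(subject != NULL && ovector != NULL && out != NULL);

    int start = ovector[0];   /* earliest character inspected */
    int end = ovector[1];     /* end of the subject string */
    if (start < 0 || end < start)
        return -1;

    size_t len = (size_t)(end - start);
    if (len + 1 > outsize)    /* need room for the terminating NUL */
        return -1;

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

       With the subject "abc123dog" and offsets 3 and 9, this yields
       "123dog"; with "xyzabc12" and offsets 3 and 8 (the lookbehind
       example), it yields "abc12".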

       If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
       TIAL as soon as a partial match is found, without continuing to	search
       for  possible  complete matches. The difference between the two options
       can be illustrated by a pattern such as:

	 /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if	possible). If it is matched against the string
       "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for	"dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On the other hand, if the pattern is made ungreedy the result  is  dif-
       ferent:

	 /dog(sbody)??/

       In  this case the result is always a complete match because pcre_exec()
       finds that first, and it never continues	 after	finding	 a  match.  It
       might  be easier to follow this explanation by thinking of the two pat-
       terns like this:

	 /dog(sbody)?/	  is the same as  /dogsbody|dog/
	 /dog(sbody)??/	  is the same as  /dog|dogsbody/

       The second pattern will never  match  "dogsbody"	 when  pcre_exec()  is
       used, because it will always find the shorter match first.

PARTIAL MATCHING USING pcre_dfa_exec()

       The  pcre_dfa_exec()  function moves along the subject string character
       by character, without backtracking, searching for all possible  matches
       simultaneously.	If the end of the subject is reached before the end of
       the pattern, there is the possibility of a partial  match,  again  pro-
       vided that at least one character has matched.

       When  PCRE_PARTIAL_SOFT	is set, PCRE_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise,	the  complete  matches
       are  returned.	However,  if PCRE_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of  the	string
       that  was  inspected when the longest partial match was found is set as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because	pcre_dfa_exec()	 always searches for all possible matches, and
       there is no difference between  greedy  and  ungreedy  repetition,  its
       behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is  set.
       Consider the string "dog" matched against the  ungreedy	pattern	 shown
       above:

	 /dog(sbody)??/

       Whereas	pcre_exec()  stops  as soon as it finds the complete match for
       "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
       so returns that when PCRE_PARTIAL_HARD is set.
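
       The precedence rule for pcre_dfa_exec() can be written down as a
       small decision function. This is only a model of the rule as stated
       in this section, not code taken from PCRE; the names outcome and
       dfa_outcome are hypothetical.

```c
#include <assert.h>

/* Hypothetical names for this sketch; they are not PCRE identifiers. */
typedef enum { OUT_NOMATCH, OUT_PARTIAL, OUT_COMPLETE } outcome;

/* found_complete: 1 if at least one complete match was seen.
   found_partial:  1 if the end of the subject was reached before the
                   end of the pattern, after at least one character
                   had matched.
   hard:           1 models PCRE_PARTIAL_HARD, 0 models
                   PCRE_PARTIAL_SOFT. */
outcome dfa_outcome(int found_complete, int found_partial, int hard)
{
    assert(hard == 0 || hard == 1);

    if (hard && found_partial)
        return OUT_PARTIAL;      /* partial takes precedence */
    if (found_complete)
        return OUT_COMPLETE;     /* soft: complete matches win */
    return found_partial ? OUT_PARTIAL : OUT_NOMATCH;
}
```

       For "dog" matched against /dog(sbody)??/, both a complete and a
       partial match exist, so the model gives OUT_COMPLETE for the soft
       option and OUT_PARTIAL for the hard option.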

PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which  test  for  word
       boundaries, partial matching with PCRE_PARTIAL_SOFT can	give  counter-
       intuitive results. Consider this pattern:

	 /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following  character  cannot  take  place, so a partial match is found.
       However, pcre_exec() carries on with normal matching, which matches  \b
       at  the	end  of	 the subject when the last character is a letter, thus
       finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
       TIAL.  The  same	 thing	happens	 with pcre_dfa_exec(), because it also
       finds the complete match.

       Using PCRE_PARTIAL_HARD in this	case  does  yield  PCRE_ERROR_PARTIAL,
       because then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations  were  implemented  in  the  pcre_exec()  function,   the
       PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be
       used with all patterns. From release 8.00 onwards, the restrictions  no
       longer  apply,  and  partial matching with pcre_exec() can be requested
       for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated	 metasequences. If PCRE_PARTIAL was set for a pattern that did
       not conform to the restrictions, pcre_exec() returned  the  error  code
       PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The
       PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled
       pattern can be used for partial matching now always returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If  the	escape	sequence  \P  is  present in a pcretest data line, the
       PCRE_PARTIAL_SOFT option is used for  the  match.  Here	is  a  run  of
       pcretest that uses the date example quoted above:

	   re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
	 data> 25jun04\P
	  0: 25jun04
	  1: jun
	 data> 25dec3\P
	 Partial match: 25dec3
	 data> 3ju\P
	 Partial match: 3ju
	 data> 3juj\P
	 No match
	 data> j\P
	 No match

       The  first  data	 string	 is  matched completely, so pcretest shows the
       matched substrings. The remaining four strings do not  match  the  com-
       plete pattern, but the first two are partial matches. Similar output is
       obtained when pcre_dfa_exec() is used.

       If the escape sequence \P is present more than once in a pcretest  data
       line, the PCRE_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()

       When a partial match has been found using pcre_dfa_exec(), it is possi-
       ble to continue the match by  providing	additional  subject  data  and
       calling	pcre_dfa_exec()	 again	with the same compiled regular expres-
       sion, this time setting the PCRE_DFA_RESTART option. You must pass  the
       same working space as before, because this is where details of the pre-
       vious partial match are stored. Here  is	 an  example  using  pcretest,
       using  the  \R  escape  sequence to set the PCRE_DFA_RESTART option (\D
       specifies the use of pcre_dfa_exec()):

	   re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
	 data> 23ja\P\D
	 Partial match: 23ja
	 data> n05\R\D
	  0: n05

       The first call has "23ja" as the subject, and requests  partial	match-
       ing;  the  second  call	has  "n05"  as	the  subject for the continued
       (restarted) match.  Notice that when the match is  complete,  only  the
       last  part  is  shown;  PCRE  does not retain the previously partially-
       matched string. It is up to the calling program to do that if it	 needs
       to.

       You  can	 set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
       PCRE_DFA_RESTART to continue partial matching over  multiple  segments.
       This  facility  can  be	used  to  pass	very  long  subject strings to
       pcre_dfa_exec().

MULTI-SEGMENT MATCHING WITH pcre_exec()

       From release 8.00, pcre_exec() can also be  used	 to  do	 multi-segment
       matching.  Unlike  pcre_dfa_exec(),  it	is not possible to restart the
       previous match with a new segment of data. Instead, new	data  must  be
       added  to  the  previous	 subject  string, and the entire match re-run,
       starting from the point where the partial match occurred. Earlier  data
       can be discarded.  Consider an unanchored pattern that matches dates:

	   re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
	 data> The date is 23ja\P
	 Partial match: 23ja

       At  this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call pcre_exec()	again.	Unlike
       pcre_dfa_exec(),	 the  entire matching string must always be available,
       and the complete matching process occurs for each call, so more	memory
       and more processing time are needed.

       Note:  If  the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for  a  partial  match  will
       include	characters  that  precede the partially matched string itself,
       because these must be retained when adding on  more  characters	for  a
       subsequent matching attempt.
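
       The buffer handling that this implies (discard the text before the
       partial match, then append the next segment) might look like the
       following. This is a sketch under assumptions: carry_over is a
       hypothetical helper, it does not call PCRE, and keep_from is
       assumed to be the first offset returned with PCRE_ERROR_PARTIAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Keep buf[keep_from .. *len), move it to the front of buf, and
   append the next segment after it. On success, *len is updated and
   0 is returned. Returns -1, leaving buf untouched, if keep_from is
   out of range or the result would not fit in cap bytes. */
int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
               const char *next, size_t next_len)
{
    assert(buf != NULL && len != NULL && next != NULL);

    if (keep_from > *len)
        return -1;
    size_t kept = *len - keep_from;
    if (kept + next_len > cap)
        return -1;

    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    *len = kept + next_len;
    return 0;
}
```

       For the example above, the buffer holds "The date is 23ja" and the
       partial match starts at offset 12. After carry_over() with a next
       segment that begins "n05", the buffer begins "23jan05", and the
       whole buffer is what would be passed to the next pcre_exec() call.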

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains tests for the beginning or end  of  a	 line,
       you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
       ate, when the subject string for any call does not contain  the	begin-
       ning or end of a line.

       2.  Lookbehind  assertions at the start of a pattern are catered for in
       the offsets that are returned for a partial match. However, in  theory,
       a  lookbehind assertion later in the pattern could require even earlier
       characters to be inspected, and it might not have been reached  when  a
       partial  match occurs. This is probably an extremely unlikely case; you
       could guard against it to a certain extent by  always  including	 extra
       characters at the start.

       3.  Matching  a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one	single
       long  string,  especially  when	PCRE_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes  an  issue  that
       arises  if  the	pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple  matching  possibilities,  because  a
       partial match result is given only when there are no completed matches.
       This means that as soon as the shortest match has been found, continua-
       tion  to	 a  new subject segment is no longer possible.	Consider again
       this pcretest example:

	   re> /dog(sbody)?/
	 data> dogsb\P
	  0: dog
	 data> do\P\D
	 Partial match: do
	 data> gsb\R\P\D
	  0: g
	 data> dogsbody\D
	  0: dogsbody
	  1: dog

       The first data line passes the string "dogsb" to	 pcre_exec(),  setting
       the  PCRE_PARTIAL_SOFT  option.	Although the string is a partial match
       for "dogsbody", the  result  is	not  PCRE_ERROR_PARTIAL,  because  the
       shorter	string	"dog" is a complete match. Similarly, when the subject
       is presented to pcre_dfa_exec() in several parts ("do" and "gsb"	 being
       the first two) the match stops when "dog" has been found, and it is not
       possible to continue. On the other hand, if "dogsbody" is presented  as
       a single string, pcre_dfa_exec() finds both matches.

       Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
       when matching multi-segment data. The example above then	 behaves  dif-
       ferently:

	   re> /dog(sbody)?/
	 data> dogsb\P\P
	 Partial match: dogsb
	 data> do\P\D
	 Partial match: do
	 data> gsb\R\P\P\D
	 Partial match: gsb

       4. Patterns that contain alternatives at the top level which do not all
       start with the  same  pattern  item  may	 not  work  as	expected  when
       PCRE_DFA_RESTART	 is  used  with pcre_dfa_exec(). For example, consider
       this pattern:

	 1234|3789

       If the first part of the subject is "ABC123", a partial	match  of  the
       first  alternative  is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point  in  the  subject	string. Attempting to continue with the string
       "7890" does not yield a match  because  only  those  alternatives  that
       match  at  one  point in the subject are remembered. The problem arises
       because the start of the second alternative matches  within  the	 first
       alternative.  There  is	no  problem with anchored patterns or patterns
       such as:

	 1234|ABCD

       where no string can be a partial match for both alternatives.  This  is
       not  a  problem if pcre_exec() is used, because the entire match has to
       be rerun each time:

	   re> /1234|3789/
	 data> ABC123\P
	 Partial match: 123
	 data> 1237890
	  0: 3789

       Of course, instead of using PCRE_DFA_RESTART,  the  same  technique  of
       re-running the entire match can also be used with pcre_dfa_exec(). Another
       possibility is to work with two buffers. If a partial match at offset n
       in  the first buffer is followed by "no match" when PCRE_DFA_RESTART is
       used on the second buffer, you can then try a  new  match  starting  at
       offset n+1 in the first buffer.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 19 October 2009
       Copyright (c) 1997-2009 University of Cambridge.
