PCREPARTIAL(3)							PCREPARTIAL(3)

NAME
       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

       In  normal  use	of  PCRE,  if  the  subject  string  that is passed to
       pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
       short  to  match	 the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
       There are circumstances where it might be helpful to  distinguish  this
       case from other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements.  An  example
       might be a date in the form ddmmmyy, defined by this pattern:

	 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid,  it  is  able  to
       raise  an  error	 as  soon  as  a  mistake  is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi‐
       ate  feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered.  Partial  matching
       can  also be useful when the subject string is very long and is not all
       available at once.

       PCRE supports partial matching by means of  the	PCRE_PARTIAL_SOFT  and
       PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
       pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
       for PCRE_PARTIAL_SOFT. The essential difference between the two options
       is whether or not a partial match is preferred to an  alternative  com‐
       plete  match,  though the details differ between the two matching func‐
       tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.

       Setting a partial matching option disables two of PCRE's optimizations.
       PCRE  remembers the last literal byte in a pattern, and abandons match‐
       ing immediately if such a byte is not present in	 the  subject  string.
       This  optimization cannot be used for a subject string that might match
       only partially. If the pattern was  studied,  PCRE  knows  the  minimum
       length  of  a  matching string, and does not bother to run the matching
       function on shorter strings. This optimization  is  also	 disabled  for
       partial matching.

PARTIAL MATCHING USING pcre_exec()

       A partial match occurs during a call to pcre_exec() when the end of the
       subject string is reached successfully, but  matching  cannot  continue
       because	more characters are needed. However, at least one character in
       the subject must have been inspected. This character need not form part
       of  the	final  matched string; lookbehind assertions and the \K escape
       sequence provide ways of inspecting characters before the  start	 of  a
       matched	substring. The requirement for inspecting at least one charac‐
       ter exists because an empty string can always be matched; without  such
       a  restriction there would always be a partial match of an empty string
       at the end of the subject.

       If there are at least two slots in the offsets vector when  pcre_exec()
       returns	with  a	 partial match, the first slot is set to the offset of
       the earliest character that was inspected when the  partial  match  was
       found. For convenience, the second offset points to the end of the sub‐
       ject so that a substring can easily be identified.

       For the majority of patterns, the first offset identifies the start  of
       the  partially matched string. However, for patterns that contain look‐
       behind assertions, or \K, or begin with \b or  \B,  earlier  characters
       have been inspected while carrying out the match. For example:

	 /(?<=abc)123/

       This pattern matches "123", but only if it is preceded by "abc". If the
       subject string is "xyzabc12", the offsets after a partial match are for
       the  substring  "abc12",	 because  all  these  characters are needed if
       another match is tried with extra characters added to the subject.
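
       As a minimal sketch of how the two offsets might be used, the helper
       below copies the inspected span out of the subject. The function name
       copy_partial_span() and its buffer handling are illustrative
       assumptions and not part of the PCRE API; ovector stands for the
       offsets vector that was passed to pcre_exec().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy subject[ovector[0] .. ovector[1]) into out as a NUL-terminated
 * string. Returns the number of bytes copied, or -1 if out is too small.
 * Hypothetical helper: not part of the PCRE API.
 */
int copy_partial_span(const char *subject, const int *ovector,
                      char *out, size_t outsize)
{
    int start = ovector[0];   /* earliest character that was inspected */
    int end = ovector[1];     /* end of the subject */
    size_t len;

    assert(start >= 0 && end >= start);
    len = (size_t)(end - start);
    if (len + 1 > outsize)
        return -1;
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

       With the subject "xyzabc12" and offsets 3 and 8, as in the lookbehind
       example, the copied string is "abc12".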

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE_PARTIAL_SOFT with pcre_exec()

       If  PCRE_PARTIAL_SOFT  is  set  when  pcre_exec()  identifies a partial
       match, the partial match is remembered, but matching continues as  nor‐
       mal,  and  other	 alternatives in the pattern are tried. If no complete
       match can be found, pcre_exec() returns PCRE_ERROR_PARTIAL  instead  of
       PCRE_ERROR_NOMATCH.

       This  option  is "soft" because it prefers a complete match over a par‐
       tial match.  All the various matching items in a pattern behave	as  if
       the  subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B  the  end
       of the subject is treated as a non-alphanumeric.

       If  there  is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

	 /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both	alter‐
       natives	fail  to  match,  but the end of the subject is reached during
       matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set	 to  3
       and  9, identifying "123dog" as the first partial match that was found.
       (In this example, there are two partial matches, because "dog"  on  its
       own partially matches the second alternative.)

   PCRE_PARTIAL_HARD with pcre_exec()

       If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR‐
       TIAL as soon as a partial match is found, without continuing to	search
       for possible complete matches. This option is "hard" because it prefers
       an earlier partial match over a later complete match. For this  reason,
       the  assumption is made that the end of the supplied subject string may
       not be the true end of the available data, and so, if \z, \Z,  \b,  \B,
       or  $  are  encountered	at  the	 end  of  the  subject,	 the result is
       PCRE_ERROR_PARTIAL.

       Setting PCRE_PARTIAL_HARD also affects the way pcre_exec() checks UTF-8
       subject	strings	 for  validity.	 Normally,  an	invalid UTF-8 sequence
       causes the error PCRE_ERROR_BADUTF8. However, in the special case of  a
       truncated  UTF-8 character at the end of the subject, PCRE_ERROR_SHORT‐
       UTF8 is returned when PCRE_PARTIAL_HARD is set.

   Comparing hard and soft partial matching

       The difference between the two partial matching options can  be	illus‐
       trated by a pattern such as:

	 /dog(sbody)?/

       This  matches either "dog" or "dogsbody", greedily (that is, it prefers
       the longer string if possible). If it is	 matched  against  the	string
       "dog"  with  PCRE_PARTIAL_SOFT,	it  yields a complete match for "dog".
       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
       On  the	other hand, if the pattern is made ungreedy the result is dif‐
       ferent:

	 /dog(sbody)??/

       In this case the result is always a complete match because  pcre_exec()
       finds  that  first,  and	 it  never continues after finding a match. It
       might be easier to follow this explanation by thinking of the two  pat‐
       terns like this:

	 /dog(sbody)?/	  is the same as  /dogsbody|dog/
	 /dog(sbody)??/	  is the same as  /dog|dogsbody/

       The  second  pattern  will  never  match "dogsbody" when pcre_exec() is
       used, because it will always find the shorter match first.

PARTIAL MATCHING USING pcre_dfa_exec()

       The pcre_dfa_exec() function moves along the subject  string  character
       by  character, without backtracking, searching for all possible matches
       simultaneously. If the end of the subject is reached before the end  of
       the  pattern,  there  is the possibility of a partial match, again pro‐
       vided that at least one character has been inspected.

       When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned  only  if
       there  have  been  no complete matches. Otherwise, the complete matches
       are returned.  However, if PCRE_PARTIAL_HARD is set,  a	partial	 match
       takes  precedence  over any complete matches. The portion of the string
       that was inspected when the longest partial match was found is  set  as
       the first matching string, provided there are at least two slots in the
       offsets vector.

       Because pcre_dfa_exec() always searches for all possible matches, and
       there is no difference between greedy and ungreedy repetition, its
       behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is set.
       Consider the string "dog" matched against the ungreedy pattern shown
       above:

	 /dog(sbody)??/

       Whereas pcre_exec() stops as soon as it finds the  complete  match  for
       "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
       so returns that when PCRE_PARTIAL_HARD is set.
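
       The precedence rule for pcre_dfa_exec() can be summarized as a small
       decision function. This is only a model of the behaviour described in
       this section, not PCRE code; the names dfa_outcome, partial_mode, and
       the OUT_ and MODE_ constants are invented for the illustration.

```c
#include <assert.h>

/* Possible results of one pcre_dfa_exec() call, reduced to three cases. */
enum outcome { OUT_NOMATCH, OUT_PARTIAL, OUT_COMPLETE };

/* Which partial matching option, if any, was requested. */
enum partial_mode { MODE_NONE, MODE_SOFT, MODE_HARD };

/*
 * Model of the documented rule (hypothetical, not part of PCRE):
 *   n_complete    number of complete matches that were found
 *   have_partial  non-zero if the subject ended part-way through the pattern
 */
enum outcome dfa_outcome(int n_complete, int have_partial,
                         enum partial_mode mode)
{
    assert(n_complete >= 0);
    if (mode == MODE_HARD && have_partial)
        return OUT_PARTIAL;     /* hard: partial beats any complete match */
    if (n_complete > 0)
        return OUT_COMPLETE;
    if (mode == MODE_SOFT && have_partial)
        return OUT_PARTIAL;     /* soft: only when nothing completed */
    return OUT_NOMATCH;
}
```

       For "dog" matched against /dog(sbody)??/ there is one complete match
       and one partial match, so the model gives a complete match for the
       soft option and a partial match for the hard option.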

PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

	 /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a  partial  match  is	found.
       However,	 pcre_exec() carries on with normal matching, which matches \b
       at the end of the subject when the last character  is  a	 letter,  thus
       finding a complete match. The result, therefore, is not PCRE_ERROR_PAR‐
       TIAL. The same thing happens  with  pcre_dfa_exec(),  because  it  also
       finds the complete match.

       Using  PCRE_PARTIAL_HARD	 in  this  case does yield PCRE_ERROR_PARTIAL,
       because then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

       For releases of PCRE prior to 8.00, because of the way certain internal
       optimizations   were  implemented  in  the  pcre_exec()	function,  the
       PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)	could  not  be
       used  with all patterns. From release 8.00 onwards, the restrictions no
       longer apply, and partial matching with pcre_exec()  can	 be  requested
       for any pattern.

       Items that were formerly restricted were repeated single characters and
       repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
       not  conform  to	 the restrictions, pcre_exec() returned the error code
       PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
       PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
       pattern can be used for partial matching now always returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If the escape sequence \P is present  in	 a  pcretest  data  line,  the
       PCRE_PARTIAL_SOFT  option  is  used  for	 the  match.  Here is a run of
       pcretest that uses the date example quoted above:

	   re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
	 data> 25jun04\P
	  0: 25jun04
	  1: jun
	 data> 25dec3\P
	 Partial match: 25dec3
	 data> 3ju\P
	 Partial match: 3ju
	 data> 3juj\P
	 No match
	 data> j\P
	 No match

       The first data string is matched	 completely,  so  pcretest  shows  the
       matched	substrings.  The  remaining four strings do not match the com‐
       plete pattern, but the first two are partial matches. Similar output is
       obtained when pcre_dfa_exec() is used.

       If  the escape sequence \P is present more than once in a pcretest data
       line, the PCRE_PARTIAL_HARD option is set for the match.
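
       In a C program, the three kinds of result seen in the pcretest run
       above can be told apart by the value returned from pcre_exec(). The
       sketch below shows one way a data-entry field might classify that
       value. The names field_state and classify_result() are invented for
       the illustration. The two ERR_ constants are stand-ins that carry the
       values of PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL from pcre.h; a
       real program would include pcre.h and use those names directly.

```c
/* Stand-ins for PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL (assumption:
   values copied from pcre.h so that the sketch compiles on its own). */
enum { ERR_NOMATCH = -1, ERR_PARTIAL = -12 };

/* What the application should conclude about the text typed so far. */
enum field_state {
    FIELD_COMPLETE,     /* the whole pattern matched */
    FIELD_INCOMPLETE,   /* valid so far, more input is needed */
    FIELD_INVALID,      /* cannot match, reject the keystroke */
    FIELD_ERROR         /* some other error code was returned */
};

/* Classify a pcre_exec()-style return value (hypothetical helper). */
enum field_state classify_result(int rc)
{
    if (rc >= 0)
        return FIELD_COMPLETE;
    if (rc == ERR_PARTIAL)
        return FIELD_INCOMPLETE;
    if (rc == ERR_NOMATCH)
        return FIELD_INVALID;
    return FIELD_ERROR;
}
```

       Applied to the run above, "25jun04" would be classified as complete,
       "25dec3" and "3ju" as incomplete, and "3juj" and "j" as invalid.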

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()

       When a partial match has been found using pcre_dfa_exec(), it is possi‐
       ble  to	continue  the  match  by providing additional subject data and
       calling pcre_dfa_exec() again with the same  compiled  regular  expres‐
       sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre‐
       vious  partial  match  are  stored.  Here is an example using pcretest,
       using the \R escape sequence to set  the	 PCRE_DFA_RESTART  option  (\D
       specifies the use of pcre_dfa_exec()):

	   re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
	 data> 23ja\P\D
	 Partial match: 23ja
	 data> n05\R\D
	  0: n05

       The  first  call has "23ja" as the subject, and requests partial match‐
       ing; the second call  has  "n05"	 as  the  subject  for	the  continued
       (restarted)  match.   Notice  that when the match is complete, only the
       last part is shown; PCRE does  not  retain  the	previously  partially-
       matched	string. It is up to the calling program to do that if it needs
       to.

       You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD	 options  with
       PCRE_DFA_RESTART	 to  continue partial matching over multiple segments.
       This facility can  be  used  to	pass  very  long  subject  strings  to
       pcre_dfa_exec().

MULTI-SEGMENT MATCHING WITH pcre_exec()

       From  release  8.00,  pcre_exec()  can also be used to do multi-segment
       matching. Unlike pcre_dfa_exec(), it is not  possible  to  restart  the
       previous	 match	with  a new segment of data. Instead, new data must be
       added to the previous subject string,  and  the	entire	match  re-run,
       starting	 from the point where the partial match occurred. Earlier data
       can be discarded. It is best to use PCRE_PARTIAL_HARD  in  this	situa‐
       tion,  because it does not treat the end of a segment as the end of the
       subject when matching \z, \Z, \b, \B, and  $.  Consider	an  unanchored
       pattern that matches dates:

	   re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
	 data> The date is 23ja\P\P
	 Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call pcre_exec() again. Unlike
       pcre_dfa_exec(), the entire matching string must always be available,
       and the complete matching process occurs for each call, so more memory
       and more processing time are needed.

       Note:  If  the pattern contains lookbehind assertions, or \K, or starts
       with \b or \B, the string that is returned for  a  partial  match  will
       include	characters  that  precede the partially matched string itself,
       because these must be retained when adding on  more  characters	for  a
       subsequent matching attempt.
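
       The buffer handling between two calls to pcre_exec() might look like
       the following sketch. The helper carry_over() and its parameters are
       assumptions made for the illustration and are not part of PCRE;
       keep_from stands for the first offset returned with the partial match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * After a partial match, keep only buf[keep_from .. *len), move it to the
 * front of buf, and append the next segment. Returns 0 on success, or -1
 * (leaving buf unchanged) if the result would not fit in cap bytes.
 * Hypothetical helper: not part of the PCRE API.
 */
int carry_over(char *buf, size_t cap, size_t *len, size_t keep_from,
               const char *segment, size_t seglen)
{
    size_t kept;

    assert(keep_from <= *len);
    kept = *len - keep_from;
    if (kept + seglen > cap)
        return -1;
    memmove(buf, buf + keep_from, kept);   /* discard the earlier data */
    memcpy(buf + kept, segment, seglen);   /* add the new segment */
    *len = kept + seglen;
    return 0;
}
```

       For the example above, the buffer holds "The date is 23ja" and the
       partial match starts at offset 12. After carrying over and appending
       a next segment such as "n05", the buffer holds "23jan05", and
       pcre_exec() can be called again on it from start offset 0.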

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

       2. Lookbehind assertions at the start of a pattern are catered  for  in
       the  offsets that are returned for a partial match. However, in theory,
       a lookbehind assertion later in the pattern could require even  earlier
       characters  to  be inspected, and it might not have been reached when a
       partial match occurs. This is probably an extremely unlikely case;  you
       could  guard  against  it to a certain extent by always including extra
       characters at the start.

       3. Matching a subject string that is split into multiple	 segments  may
       not  always produce exactly the same result as matching over one single
       long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
       "Partial	 Matching  and	Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another  kind	of  difference
       may  occur when there are multiple matching possibilities, because (for
       PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
       no completed matches. This means that as soon as the shortest match has
       been found, continuation to a new subject segment is no	longer	possi‐
       ble. Consider again this pcretest example:

	   re> /dog(sbody)?/
	 data> dogsb\P
	  0: dog
	 data> do\P\D
	 Partial match: do
	 data> gsb\R\P\D
	  0: g
	 data> dogsbody\D
	  0: dogsbody
	  1: dog

       The  first  data line passes the string "dogsb" to pcre_exec(), setting
       the PCRE_PARTIAL_SOFT option. Although the string is  a	partial	 match
       for  "dogsbody",	 the  result  is  not  PCRE_ERROR_PARTIAL, because the
       shorter string "dog" is a complete match. Similarly, when  the  subject
       is  presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
       the first two) the match stops when "dog" has been found, and it is not
       possible	 to continue. On the other hand, if "dogsbody" is presented as
       a single string, pcre_dfa_exec() finds both matches.

       Because of these problems, it is best  to  use  PCRE_PARTIAL_HARD  when
       matching	 multi-segment	data.  The  example above then behaves differ‐
       ently:

	   re> /dog(sbody)?/
	 data> dogsb\P\P
	 Partial match: dogsb
	 data> do\P\D
	 Partial match: do
	 data> gsb\R\P\P\D
	 Partial match: gsb

       4. Patterns that contain alternatives at the top level which do not all
       start  with  the	 same  pattern	item  may  not	work  as expected when
       PCRE_DFA_RESTART is used with pcre_dfa_exec().  For  example,  consider
       this pattern:

	 1234|3789

       If  the	first  part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial  match  for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to  continue  with  the	string
       "7890"  does  not  yield	 a  match because only those alternatives that
       match at one point in the subject are remembered.  The  problem	arises
       because	the  start  of the second alternative matches within the first
       alternative. There is no problem with  anchored	patterns  or  patterns
       such as:

	 1234|ABCD

       where  no  string can be a partial match for both alternatives. This is
       not a problem if pcre_exec() is used, because the entire match  has  to
       be rerun each time:

	   re> /1234|3789/
	 data> ABC123\P\P
	 Partial match: 123
	 data> 1237890
	  0: 3789

       Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
       running the entire match can also be used with pcre_dfa_exec(). Another
       possibility is to work with two buffers. If a partial match at offset n
       in the first buffer is followed by "no match" when PCRE_DFA_RESTART  is
       used  on	 the  second  buffer, you can then try a new match starting at
       offset n+1 in the first buffer.

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

REVISION

       Last updated: 07 November 2010
       Copyright (c) 1997-2010 University of Cambridge.

								PCREPARTIAL(3)