This document describes the POSIX 1003.2 extended RE (ERE) syntax and the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style notation is used to describe the grammar.
extended-regexp ::= branch | extended-regexp "|" branch |
An extended regexp (ERE) is one or more branches, separated by |. An ERE matches anything that matches one or more of the branches.
branch ::= piece | branch piece |
A branch is one or more pieces concatenated. It matches a match for the first piece, followed by a match for the second piece, and so on.
piece ::= atom | atom repeat-operator | atom approx-settings |
A piece is an atom possibly followed by a repeat operator or an expression controlling approximate matching parameters for the atom.
atom ::= "(" extended-regexp ")" | bracket-expression | "." | assertion | literal | back-reference | "(?#" comment-text ")" | "(?" options ")" extended-regexp | "(?" options ":" extended-regexp ")" |
An atom is either an ERE enclosed in parenthesis, a bracket expression, a . (period), an assertion, or a literal.
The dot (.) matches any single character.
If the REG_NEWLINE
compilation flag (see API manual) is specified, the newline
character is not matched.
Comment-text can contain any characters except for a closing parenthesis ). The text in the comment is completely ignored by the regex parser and it used solely for readability purposes.
repeat-operator ::= "*" | "+" | "?" | bound | "*?" | "+?" | "??" | bound ? |
An atom followed by * matches a sequence of 0 or more matches of the atom. + is similar to *, matching a sequence of 1 or more matches of the atom. An atom followed by ? matches a sequence of 0 or 1 matches of the atom.
A bound is one of the following, where m and m are unsigned decimal integers between 0 and RE_DUP_MAX:
An atom followed by [1] matches a sequence of m through n (inclusive) matches of the atom. An atom followed by [2] matches a sequence of m or more matches of the atom. An atom followed by [3] matches a sequence of exactly m matches of the atom.
Adding a ? to a repeat operator makes the subexpression minimal, or non-greedy. Normally a repeated expression is greedy, that is, it matches as many characters as possible. A non-greedy subexpression matches as few characters as possible. Note that this does not (always) mean the same thing as matching as many or few repetitions as possible. Also note that minimal repetitions are not currently supported for approximate matching.
approx-settings ::= "{" count-limits* ","? cost-equation? "}" count-limits ::= "+" number? | "-" number? | "#" number? | "~" number? cost-equation ::= ( cost-term "+"? " "? )+ "<" number cost-term ::= number "i" | number "d" | number "s" |
The approximate matching settings for a subpattern can be changed by appending approx-settings to the subpattern. Limits for the number of errors can be set and an expression for specifying and limiting the costs can be given.
The count-limits can be used to set limits for the number of insertions (+), deletions (-), substitutions (#), and total number of errors (~). If the number part is omitted, the specified error count will be unlimited.
The cost-equation can be thought of as a mathematical equation, where i, d, and s stand for the number of insertions, deletions, and substitutions, respectively. The equation can have a multiplier for each of i, d, and s. The multiplier is the cost of the error, and the number after < is the maximum allowed cost of a match. Spaces and pluses can be inserted to make the equation readable. In fact, when specifying only a cost equation, adding a space after the opening { is required.
Examples:
bracket-expression ::= "[" item+ "]" | "[^" item+ "]" |
A bracket expression specifies a set of characters by enclosing a nonempty list of items in brackets. Normally anything matching any item in the list is matched. If the list begins with ^ the meaning is negated; any character matching no item in the list is matched.
An item is any of the following:
LC_CTYPE
category of the
current locale, but the following names are valid in all locales:
To include a literal - in the list, make it either the first or last item, the second endpoint of a range, or enclose it in [. and .] to make it a collating element. To include a literal ] in the list, make it either the first item, the second endpoint of a range, or enclose it in [. and .]. To use a literal - as the first endpoint of a range, enclose it in [. and .].
assertion ::= "^" | "$" | "\" assertion-character |
The expressions ^ and $ are called "left anchor" and "right anchor", respectively. The left anchor matches the empty string at the beginning of the string. The right anchor matches the empty string at the end of the string. The behaviour of both anchors can be varied by specifying certain execution and compilation flags; see the API manual.
An assertion-character can be any of the following:
literal ::= ordinary-character | "\x" ["1"-"9" "a"-"f" "A"-"F"]{0,2} | "\x{" ["1"-"9" "a"-"f" "A"-"F"]* "}" | "\" character |
A literal is either an ordinary character (a character that has no other significance in the context), an 8 bit hexadecimal encoded character (e.g. \x1B), a wide hexadecimal encoded character (e.g. \x{263a}), or an escaped character. An escaped character is a \ followed by any character, and matches that character. Escaping can be used to match characters which have a special meaning in regexp syntax. A \ cannot be the last character of an ERE. Escaping also allows you to include a few non-printable characters in the regular expression. These special escape sequences include:
An ordinary character is just a single character with no other significance, and matches that character. A { followed by something else than a digit is considered an ordinary character.
back-reference ::= "\" ["1"-"9"] |
A back reference is a backslash followed by a single non-zero decimal digit d. It matches the same sequence of characters matched by the dth parenthesized subexpression.
Back references are not defined for POSIX EREs (for BREs they are), but many matchers, including TRE, implement back references for both EREs and BREs.
options ::= ["i" "n" "r" "U"]* ("-" ["i" "n" "r" "U"]*)? |
The obsolete basic regexp (BRE) syntax differs from the ERE syntax as follows: