#include <tre/regex.h>
int regcomp(regex_t *preg, const char *regex, int cflags);
int regncomp(regex_t *preg, const char *regex, size_t len, int cflags);
int regwcomp(regex_t *preg, const wchar_t *regex, int cflags);
int regwncomp(regex_t *preg, const wchar_t *regex, size_t len, int cflags);
void regfree(regex_t *preg);
The regcomp() function compiles the regex string pointed to by regex to an internal representation and stores the result in the pattern buffer structure pointed to by preg. The regncomp() function is like regcomp(), but regex is not terminated with the null byte. Instead, the len argument is used to give the length of the string, and the string may contain null bytes. The regwcomp() and regwncomp() functions work like regcomp() and regncomp(), respectively, but take a wide-character (wchar_t) string instead of a byte string.
The cflags argument is the bitwise inclusive OR of zero or more of the following flags (defined in the header <tre/regex.h>):
- REG_EXTENDED
- Use POSIX Extended Regular Expression (ERE) compatible syntax when compiling regex. The default syntax is the POSIX Basic Regular Expression (BRE) syntax, but it is considered obsolete.
- REG_ICASE
- Ignore case. Subsequent searches with the regexec family of functions using this pattern buffer will be case insensitive.
- REG_NOSUB
- Do not report submatches. Subsequent searches with the regexec family of functions will only report whether a match was found or not and will not fill the submatch array.
- REG_NEWLINE
- Normally the newline character is treated as an ordinary character. When this flag is used, the newline character ('\n', ASCII code 10) is treated specially as follows:
- The match-any-character operator (dot "." outside a bracket expression) does not match a newline.
- A non-matching list ([^...]) not containing a newline does not match a newline.
- The match-beginning-of-line operator ^ matches the empty string immediately after a newline as well as the empty string at the beginning of the string (but see the REG_NOTBOL regexec() flag below).
- The match-end-of-line operator $ matches the empty string immediately before a newline as well as the empty string at the end of the string (but see the REG_NOTEOL regexec() flag below).
- REG_LITERAL
- Interpret the entire regex argument as a literal string, that is, all characters will be considered ordinary. This is a nonstandard extension, compatible with but not specified by POSIX.
- REG_NOSPEC
- Same as REG_LITERAL. This flag is provided for compatibility with BSD.
- REG_RIGHT_ASSOC
- By default, concatenation is left associative in TRE, as per the grammar given in the base specifications on regular expressions of Std 1003.1-2001 (POSIX). This flag flips associativity of concatenation to right associative. Associativity can have an effect on how a match is divided into submatches, but does not change what is matched by the entire regexp.
- REG_UNGREEDY
- By default, repetition operators are greedy in TRE as per Std 1003.1-2001 (POSIX) and can be forced to be non-greedy by appending a ? character. This flag reverses this behavior by making the operators non-greedy by default and greedy when a ? is specified.
After a successful call to regcomp it is possible to use the preg pattern buffer for searching for matches in strings (see below). Once the pattern buffer is no longer needed, it should be freed with regfree to free the memory allocated for it.
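The effect of the compile flags can be seen with a small yes/no matcher. This is a minimal sketch, not part of TRE: the helper name matches() is hypothetical, and the block assumes the system's POSIX <regex.h>, which offers the same regcomp(), regexec() and regfree() calls; when building against TRE, include <tre/regex.h> instead and link with the TRE library.

```c
#include <regex.h>   /* assumption: POSIX header; with TRE use <tre/regex.h> */
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches pattern under cflags,
 * 0 if it does not, and -1 on a compile or matching error. */
int matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, so skip submatch reporting. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);               /* release the pattern buffer once done */

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

For example, matches("^b$", REG_EXTENDED | REG_NEWLINE, "a\nb\nc") finds the middle line, while the same call without REG_NEWLINE does not, because ^ and $ then only anchor at the ends of the whole string.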
The regex_t structure has the following fields that the application can read:
- size_t re_nsub
- Number of parenthesized subexpressions in regex.
The regcomp function returns zero if the compilation was successful, or one of the following error codes if there was an error:
- REG_BADPAT
- Invalid regexp. TRE returns this only if a multibyte character set is used in the current locale, and regex contained an invalid multibyte sequence.
- REG_ECOLLATE
- Invalid collating element referenced. TRE returns this whenever equivalence classes or multicharacter collating elements are used in bracket expressions (they are not supported yet).
- REG_ECTYPE
- Unknown character class name in [[:name:]].
- REG_EESCAPE
- The last character of regex was a backslash (\).
- REG_ESUBREG
- Invalid back reference; number in \digit invalid.
- REG_EBRACK
- [] imbalance.
- REG_EPAREN
- \(\) or () imbalance.
- REG_EBRACE
- \{\} or {} imbalance.
- REG_BADBR
- {} content invalid: not a number, more than two numbers, first larger than second, or number too large.
- REG_ERANGE
- Invalid character range, e.g. ending point is earlier in the collating order than the starting point.
- REG_ESPACE
- Out of memory, or an internal limit exceeded.
- REG_BADRPT
- Invalid use of repetition operators: two or more repetition operators have been chained in an undefined way.
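A compile step that checks the return value and reads re_nsub might look like the following sketch. The helper name compile_check() is hypothetical, and the block assumes the system's POSIX <regex.h> so that it builds without TRE installed; with TRE, include <tre/regex.h> instead. Which error code a given malformed pattern produces can differ between implementations, so the sketch only passes the code back to the caller.

```c
#include <regex.h>   /* assumption: POSIX header; with TRE use <tre/regex.h> */
#include <stddef.h>

/* Hypothetical helper: compiles pattern, stores the number of parenthesized
 * subexpressions in *nsub on success, and returns the regcomp() result
 * (zero on success, one of the REG_* error codes otherwise). */
int compile_check(const char *pattern, int cflags, size_t *nsub)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc != 0)
        return rc;          /* nothing to free: compilation did not succeed */

    *nsub = re.re_nsub;     /* readable field of regex_t */
    regfree(&re);
    return 0;
}
```

A caller that needs all submatches can size its pmatch array as re_nsub + 1, since element 0 describes the whole match.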
#include <tre/regex.h>
int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
int regnexec(const regex_t *preg, const char *string, size_t len, size_t nmatch, regmatch_t pmatch[], int eflags);
int regwexec(const regex_t *preg, const wchar_t *string, size_t nmatch, regmatch_t pmatch[], int eflags);
int regwnexec(const regex_t *preg, const wchar_t *string, size_t len, size_t nmatch, regmatch_t pmatch[], int eflags);
The regexec() function matches the null-terminated string against the compiled regexp preg, initialized by a previous call to any one of the regcomp functions. The regnexec() function is like regexec(), but string is not terminated with a null byte. Instead, the len argument is used to give the length of the string, and the string may contain null bytes. The regwexec() and regwnexec() functions work like regexec() and regnexec(), respectively, but take a wide character (wchar_t) string instead of a byte string. The eflags argument is a bitwise OR of zero or more of the following flags:
- REG_NOTBOL
- When this flag is used, the match-beginning-of-line operator ^ does not match the empty string at the beginning of string. If REG_NEWLINE was used when compiling preg the empty string immediately after a newline character will still be matched.
- REG_NOTEOL
- When this flag is used, the match-end-of-line operator $ does not match the empty string at the end of string. If REG_NEWLINE was used when compiling preg the empty string immediately before a newline character will still be matched.
These flags are useful when different portions of a string are passed to regexec and the beginning or end of the partial string should not be interpreted as the beginning or end of a line.
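One common case for REG_NOTBOL is scanning a string for successive matches by passing the remainder of the string to regexec() after each hit. The sketch below is illustrative only: count_matches() is a hypothetical helper, and it assumes the system's POSIX <regex.h> (with TRE, include <tre/regex.h> instead).

```c
#include <regex.h>   /* assumption: POSIX header; with TRE use <tre/regex.h> */
#include <stddef.h>

/* Hypothetical helper: counts non-overlapping matches of an ERE pattern in
 * text. Returns -1 if the pattern does not compile. Any regexec() result
 * other than zero (REG_NOMATCH or an error) ends the scan. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int count = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        int empty = (m[0].rm_so == m[0].rm_eo);

        count++;
        p += m[0].rm_eo;          /* continue after the end of this match */
        if (empty) {
            if (*p == '\0')
                break;            /* empty match at the end: nothing left */
            p++;                  /* step one byte to guarantee progress */
        }
        /* p now points into the middle of the original string, so ^ must
         * not match at p as if it were the start of a line. */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```

With this loop, count_matches("^a", "aaa") reports a single match; without REG_NOTBOL on the later calls, each remaining "a" would wrongly count as a line start.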
If REG_NOSUB was used when compiling preg, nmatch is zero, or pmatch is NULL, then the pmatch argument is ignored. Otherwise, the submatches corresponding to the parenthesized subexpressions are filled in the elements of pmatch, which must be dimensioned to have at least nmatch elements.
The regmatch_t structure contains at least the following fields:
- regoff_t rm_so
- Offset from start of string to start of substring.
- regoff_t rm_eo
- Offset from start of string to the first character after the substring.
The length of a submatch can be computed by subtracting rm_so from rm_eo. If a parenthesized subexpression did not participate in a match, the rm_so and rm_eo fields for the corresponding pmatch element are set to -1. Note that when a multibyte character set is in effect, the submatch offsets are given as byte offsets, not character offsets.
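The offset arithmetic can be sketched as a helper that copies one submatch out of the subject string. Everything named here is an assumption made for illustration: extract_group() and MAX_GROUPS are hypothetical, and the block uses the system's POSIX <regex.h> (with TRE, include <tre/regex.h> instead).

```c
#include <regex.h>   /* assumption: POSIX header; with TRE use <tre/regex.h> */
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* hypothetical upper bound for this sketch */

/* Hypothetical helper: copies submatch number `group` (0 is the whole match)
 * into out as a null-terminated string. Returns the submatch length, or -1
 * if there is no match, the group did not participate in the match, the
 * group number is out of range, or out is too small. */
long extract_group(const char *pattern, const char *text, size_t group,
                   char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    long len = -1;

    if (group >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (group <= re.re_nsub
        && regexec(&re, text, MAX_GROUPS, pm, 0) == 0
        && pm[group].rm_so != -1) {
        /* Length is the end offset minus the start offset, in bytes. */
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);

        if (n < outsz) {
            memcpy(out, text + pm[group].rm_so, n);
            out[n] = '\0';
            len = (long)n;
        }
    }

    regfree(&re);
    return len;
}
```

Matching "(a)|(b)" against "b" shows the -1 case: group 2 is reported with valid offsets, while group 1 did not participate and the helper returns -1 for it.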
The regexec() functions return zero if a match was found, otherwise they return REG_NOMATCH to indicate no match, or REG_ESPACE to indicate that enough temporary memory could not be allocated to complete the matching operation.
#include <tre/regex.h>
typedef struct {
int (*get_next_char)(tre_char_t *c, unsigned int *pos_add,
void *context);
void (*rewind)(size_t pos, void *context);
int (*compare)(size_t pos1, size_t pos2, size_t len, void *context);
void *context;
} tre_str_source;
int reguexec(const regex_t *preg, const tre_str_source *string, size_t nmatch, regmatch_t pmatch[], int eflags);
The reguexec() function works just like the other regexec() functions, except that the input string is read from user specified callback functions instead of a character array. This makes it possible, for example, to match regexps over arbitrary user specified data structures.
The tre_str_source structure contains the following fields:
- get_next_char
- This function must retrieve the next available character. If a character is not available, the space pointed to by c must be set to zero and it must return a nonzero value. If a character is available, it must be stored to the space pointed to by c, and the integer pointed to by pos_add must be set to the number of units advanced in the input (the value must be >=1), and zero must be returned.
- rewind
- This function must rewind the input stream to the position specified by pos. Unless the regexp uses back references, rewind is not needed and can be set to NULL.
- compare
- This function compares two substrings in the input streams starting at the positions specified by pos1 and pos2 of length len. If the substrings are equal, compare must return zero, otherwise a nonzero value must be returned. Unless the regexp uses back references, compare is not needed and can be set to NULL.
- context
- This is a context variable, passed as the last argument to all of the above functions for keeping track of the internal state of the user's code.
The position in the input stream is measured in size_t units. The current position is the sum of the increments gotten from pos_add (plus the position of the last rewind, if any). The starting position is zero. Submatch positions filled in the pmatch[] array are, of course, given using positions computed in this way.
For an example of how to use reguexec(), see the tests/test-str-source.c file in the TRE source code distribution.
#include <tre/regex.h>
typedef struct {
  int cost_ins;
  int cost_del;
  int cost_subst;
  int max_cost;
  int max_ins;
  int max_del;
  int max_subst;
  int max_err;
} regaparams_t;
typedef struct {
  size_t nmatch;
  regmatch_t *pmatch;
  int cost;
  int num_ins;
  int num_del;
  int num_subst;
} regamatch_t;
int regaexec(const regex_t *preg, const char *string, regamatch_t *match, regaparams_t params, int eflags);
int reganexec(const regex_t *preg, const char *string, size_t len, regamatch_t *match, regaparams_t params, int eflags);
int regawexec(const regex_t *preg, const wchar_t *string, regamatch_t *match, regaparams_t params, int eflags);
int regawnexec(const regex_t *preg, const wchar_t *string, size_t len, regamatch_t *match, regaparams_t params, int eflags);
The regaexec() function searches for the best match in string against the compiled regexp preg, initialized by a previous call to any one of the regcomp functions.
The reganexec() function is like regaexec(), but string is not terminated by a null byte. Instead, the len argument is used to tell the length of the string, and the string may contain null bytes. The regawexec() and regawnexec() functions work like regaexec() and reganexec(), respectively, but take a wide character (wchar_t) string instead of a byte string.
The eflags argument has the same meaning as for the regexec() functions.
The params struct controls the approximate matching parameters:
- int cost_ins
- The default cost of an inserted character, that is, an extra character in string.
- int cost_del
- The default cost of a deleted character, that is, a character missing from string.
- int cost_subst
- The default cost of a substituted character.
- int max_cost
- The maximum allowed cost of a match. If this is set to zero, an exact matching is searched for, and results equivalent to those returned by the regexec() functions are returned.
- int max_ins
- Maximum allowed number of inserted characters.
- int max_del
- Maximum allowed number of deleted characters.
- int max_subst
- Maximum allowed number of substituted characters.
- int max_err
- Maximum allowed number of errors (inserts + deletes + substitutes).
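How these limits combine can be illustrated with a small stand-alone calculation. The sketch below does not call TRE and is not its implementation: the struct and function names are hypothetical stand-ins that mirror the regaparams_t fields, and it assumes that the cost of a match is the sum of the default per-edit costs and that every limit acts as an inclusive upper bound.

```c
#include <limits.h>

/* Hypothetical mirror of the regaparams_t fields, so that the sketch
 * compiles without the TRE header. */
struct approx_limits {
    int cost_ins, cost_del, cost_subst;
    int max_cost, max_ins, max_del, max_subst, max_err;
};

/* Hypothetical record of the edits used by one candidate match. */
struct edit_counts {
    int num_ins, num_del, num_subst;
};

/* Assumed cost model: each edit contributes its default cost. */
int edit_cost(const struct approx_limits *p, const struct edit_counts *e)
{
    return p->cost_ins * e->num_ins
         + p->cost_del * e->num_del
         + p->cost_subst * e->num_subst;
}

/* Returns 1 if the candidate stays within every limit, 0 otherwise. */
int within_limits(const struct approx_limits *p, const struct edit_counts *e)
{
    return edit_cost(p, e) <= p->max_cost
        && e->num_ins <= p->max_ins
        && e->num_del <= p->max_del
        && e->num_subst <= p->max_subst
        && e->num_ins + e->num_del + e->num_subst <= p->max_err;
}
```

With costs of 1 for inserts and deletes, 2 for substitutions, and max_cost set to 3, one insert plus one substitution (cost 3) is accepted under these assumptions, while two substitutions (cost 4) are not. Setting a limit to INT_MAX effectively disables it.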
The match argument points to a regamatch_t structure. The nmatch and pmatch fields must be filled by the caller. If REG_NOSUB was used when compiling the regexp, or match->nmatch is zero, or match->pmatch is NULL, the match->pmatch argument is ignored. Otherwise, the submatches corresponding to the parenthesized subexpressions are filled in the elements of match->pmatch, which must be dimensioned to have at least match->nmatch elements.
The match->cost field is set to the cost of the match found, and the match->num_ins, match->num_del, and match->num_subst fields are set to the number of inserts, deletes, and substitutes in the match, respectively.
The regaexec() functions return zero if a match with cost no greater than params.max_cost was found; otherwise they return REG_NOMATCH to indicate no match, or REG_ESPACE to indicate that enough temporary memory could not be allocated to complete the matching operation.
#include <tre/regex.h>
int tre_have_backrefs(const regex_t *preg);
int tre_have_approx(const regex_t *preg);
The tre_have_backrefs() and tre_have_approx() functions return 1 if the compiled pattern has back references or uses approximate matching, respectively, and 0 if not.
#include <tre/regex.h>
char *tre_version(void);
int tre_config(int query, void *result);
The tre_config() function can be used to retrieve information about which optional features have been compiled into the TRE library, and about other parameters that may change between releases.
The query argument is an integer that selects which information is requested. The result argument is a pointer to a variable where the information is returned. The return value of a call to tre_config() is zero if query was recognized, REG_NOMATCH otherwise.
The following values are recognized for query:
- TRE_CONFIG_APPROX
- The result is an integer that is set to one if approximate matching support is available, zero if not.
- TRE_CONFIG_WCHAR
- The result is an integer that is set to one if wide character support is available, zero if not.
- TRE_CONFIG_MULTIBYTE
- The result is an integer that is set to one if multibyte character set support is available, zero if not.
- TRE_CONFIG_SYSTEM_ABI
- The result is an integer that is set to one if TRE has been compiled to be compatible with the system regex ABI, zero if not.
- TRE_CONFIG_VERSION
- The result is a pointer to a static character string that gives the version of the TRE library.
The tre_version() function returns a short human readable character string which shows the software name, version, and license.
The header <tre/regex.h> defines certain C preprocessor symbols.
The following definitions may be useful for checking whether a new enough version is being used. Note that it is recommended to use the pkg-config tool for version and other checks in Autoconf scripts.
- TRE_VERSION
- The version string.
- TRE_VERSION_1
- The major version number (first part of version string).
- TRE_VERSION_2
- The minor version number (second part of version string).
- TRE_VERSION_3
- The micro version number (third part of version string).
The following definitions may be useful for checking whether all necessary features are enabled. Use these only if compile-time checking suffices (linking statically with TRE). When linking dynamically, tre_config() should be used instead.
- TRE_APPROX
- This is defined if approximate matching support is enabled. The prototypes for approximate matching functions are defined only if TRE_APPROX is defined.
- TRE_WCHAR
- This is defined if wide character support is enabled. The prototypes for wide character matching functions are defined only if TRE_WCHAR is defined.
- TRE_MULTIBYTE
- This is defined if multibyte character set support is enabled. If this is not defined, any locale settings are ignored, and the default locale is used when parsing regexps and matching strings.