rex
Synopsis
cheia.load "rex"

start, finish, captures = rex.find("text", "regular expression")
matches = rex.gsub("text", "regular expression",
                   function(match, captures) ... end)

regex = rex.comp("regular expression")
start, finish, captures = regex:match("text")
matches = regex:gmatch("text",
                       function(match, captures) ... end)
Description
This module provides facilities for pattern matching using POSIX
extended regular expressions (‘regexes’). Regular expressions are
templates which match certain classes of string. In this respect, they
are much like shell ‘globs’ (i.e. pathnames containing the wildcard
characters ‘*’ and ‘?’). However, they are much more powerful. I
will not attempt to describe them completely here, but to summarise
the main facilities:
- If no special characters are used, a regular expression is simply a
string which may be searched for inside another, e.g. the regex “is”
matches twice in the string “this is a test”: first on the last two
characters of ‘this’, and then again on the word ‘is’.
- The ‘.’ character matches any character except a newline, thus
“t.” also matches twice: the first two letters of ‘this’, and the
first two letters of ‘test’. The final ‘t’ is not matched,
because there is no following character for the ‘.’ to match.
- The ‘*’ character causes the preceding atom (for the moment an atom
is either a single letter or a dot) to be matched by zero or more
repetitions, so “s.*t” matches once: “s is a test”. Notice that
there are a number of shorter substrings that this expression could
also match. These are not matched because regexes are ‘greedy’: they
take the first match they find and try to make it as long as possible.
- The ‘+’ character is like ‘*’, but causes the preceding atom to
match one or more times. The ‘?’ character causes it to match zero
or one times.
- Two expressions can be made alternatives using ‘|’,
e.g. “this|test” matches ‘this’ and ‘test’.
- A set of characters can be specified as alternatives using square
brackets. Such a grouping is treated as a single atom, e.g. “[his]+”
means ‘one or more occurrences of h, i or s in a row’, and matches 3
times: the last three letters of ‘this’, the word ‘is’, and
the ‘s’ from ‘test’. If the first character of a bracketed group
is a ‘^’, then the group matches any character that is not
listed. Character ranges can also be specified, e.g. “[0-9]” matches
any digit.
- Parts of a regex can become ‘captures’ by placing them in
parentheses. A parenthesised group is also treated as a single
atom. Captured substrings are returned separately by the functions
below.
- Any special character can be escaped by preceding it with a
backslash. Be aware that the backslash itself will need escaping
within quoted strings. For this reason the [[...]] string notation
may be preferable.
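The facilities summarised above can be tried out with rex.find. The following is a minimal sketch, assuming a LuaCheia 5.0 interpreter in which cheia.load makes the rex table available; the helper name show is hypothetical and exists only for this illustration. The expected substrings in the comments follow from the matching rules described above.

```lua
cheia.load "rex"

local text = "this is a test"

-- Hypothetical helper: print the first substring of text matched by pattern.
local function show(pattern)
  local start, finish = rex.find(text, pattern)
  if start then
    print(pattern, string.sub(text, start, finish))
  else
    print(pattern, "(no match)")
  end
end

show("is")         -- expected match: is           (the end of 'this')
show("t.")         -- expected match: th
show("s.*t")       -- expected match: s is a test  (greedy)
show("[his]+")     -- expected match: his
show("this|test")  -- expected match: this

-- A backslash escapes a special character. The [[...]] notation avoids
-- having to double the backslash inside a quoted string.
local s, f = rex.find("1+1=2", [[1\+1]])
print(s, f)        -- expected: 1 and 3
```

Only the first match is shown for each pattern, since rex.find stops at the first matching substring.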
Examples
Compile a regex to match identifiers, then print each identifier found
in a string:
ids = rex.comp("[A-Za-z_][A-Za-z0-9_]*")
ids:gmatch("_X6 = 5 * sin(theta)",
           function(s,t) print(s) end)
» _X6
» sin
» theta
Reference
- rex.comp(regex)
Compiles the supplied regular expression. Once it has been
compiled, it may be used repeatedly without incurring the overhead of
recompiling.
Parameters:
  regex : string
    The regular expression string.
Returns:
  userdata
    The compiled form.
- rex.find(text, regex)
Searches for regex within text, and returns the start and end
of the first matching substring followed by a table of
captures. Returns nils if no match is found. Largely equivalent to
the string.find function included with Lua, but uses the more
standard POSIX notation. Avoid using this function repeatedly with the
same regular expression and consider compiling the expression in
advance instead.
Note: string.find returns captures as additional results.
rex.find returns a table of captures.
Parameters:
  text : string
    The text to search.
  regex : string
    The regular expression to find.
Returns:
  number
    The 1-based index of the first character of the first matching substring.
  number
    The 1-based index of the last character of the first matching substring.
  table
    An array of captured strings.
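The difference in how captures are returned is easiest to see side by side. This is a sketch only, assuming a LuaCheia 5.0 interpreter with the rex module loaded and assuming the captures array is indexed from 1; the sample text is made up for the illustration.

```lua
cheia.load "rex"

local text = "width=640"

-- rex.find: the captures arrive in a single table after the two indices.
local s, f, caps = rex.find(text, "([a-z]+)=([0-9]+)")
if s then
  print(s, f, caps[1], caps[2])   -- expected: 1, 9, width, 640
end

-- string.find: the captures arrive as extra return values, and the
-- pattern uses Lua's own notation rather than POSIX notation.
local s2, f2, key, value = string.find(text, "(%a+)=(%d+)")
print(s2, f2, key, value)         -- expected: 1, 9, width, 640
```

Testing the first result before using the others guards against the no-match case, in which rex.find returns nils.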
- rex.gsub(text, regex, fn, [n])
Searches for regex within text, and for each matching substring
calls fn, passing the whole matching substring as the first
argument, and a table (array) of captures as the second. Provides some
of the functionality of the string.gsub function included with Lua,
but uses the more standard POSIX notation. Avoid using this function
repeatedly with the same regular expression and consider compiling the
expression in advance instead.
Note: Despite the name, this function does not do substitution. It
does not accept a string as the third argument, and it does not return
a modified string. The supplied function is also called with different
parameters from those used by string.gsub.
Parameters:
  text : string
    The text to search.
  regex : string
    The regular expression to find.
  fn : function
    The function to call when a matching substring is found.
  n : number
    Optional. The maximum number of matches to process.
Returns:
  number
    The number of matches found.
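Because rex.gsub only reports matches, any result has to be collected by the callback itself. The sketch below assumes a LuaCheia 5.0 interpreter with the rex module loaded; the sample text and the words table are made up for the illustration. The callbacks deliberately return nothing.

```lua
cheia.load "rex"

-- Collect every lower-case word; the callback stores each whole match.
local words = {}
local count = rex.gsub("one two three", "[a-z]+",
                       function(match, captures)
                         table.insert(words, match)
                       end)
print(count, table.concat(words, ","))  -- expected: 3 and one,two,three

-- With the optional fourth argument, no more than that many matches
-- should be reported.
local limited = rex.gsub("one two three", "[a-z]+", function() end, 2)
print(limited)
```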
Compiled Regular Expressions
The following methods are available for the compiled regular
expressions returned by rex.comp:
- regex:match(text)
Searches for regex within text, and returns the start and end
of the first matching substring followed by a table of captures.
Parameters:
  text : string
    The text to search.
Returns:
  number
    The 1-based index of the first character of the first matching substring.
  number
    The 1-based index of the last character of the first matching substring.
  table
    An array of captured strings.
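A compiled expression is most useful when the same pattern is applied to many strings. The following sketch assumes a LuaCheia 5.0 interpreter with the rex module loaded, and assumes that regex:match, like rex.find, returns nil as its first result when nothing matches; the pattern and sample lines are made up for the illustration.

```lua
cheia.load "rex"

-- Compile once, then reuse the compiled form for every line.
local assignment = rex.comp("([A-Za-z_]+) *= *([0-9]+)")

local lines = { "x = 10", "no assignment here", "count=3" }
for i, line in ipairs(lines) do
  local s, f, caps = assignment:match(line)
  if s then
    -- caps[1] is the name, caps[2] the digits on the right-hand side.
    print(line, caps[1], caps[2])
  else
    print(line, "(no match)")
  end
end
```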
- regex:gmatch(text, fn, [n])
Searches for regex within text, and for each matching substring
calls fn, passing the whole matching substring as the first
argument, and a table (array) of captures as the second.
Parameters:
  text : string
    The text to search.
  fn : function
    The function to call when a matching substring is found.
  n : number
    Optional. The maximum number of matches to process.
Returns:
  number
    The number of matches found.
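The captures table passed to the callback can be used to build a data structure from the text. This is a minimal sketch, assuming a LuaCheia 5.0 interpreter with the rex module loaded and a captures array indexed from 1; the settings table and the sample text are made up for the illustration.

```lua
cheia.load "rex"

local pair = rex.comp("([a-z]+)=([0-9]+)")

-- Turn each name=number pair into an entry of the settings table.
local settings = {}
local n = pair:gmatch("width=640 height=480 depth=24",
                      function(match, captures)
                        settings[captures[1]] = tonumber(captures[2])
                      end)

print(n, settings.width, settings.height, settings.depth)
-- expected: 3, 640, 480, 24
```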
Issues
The differences between the rex and string functions of the same
name could be confusing.
See also
pcre.
The UNIX manual page regex(7).
Revision history
Added in LuaCheia 5.0.
Credits
Based on the regex library by Henry Spencer.
Lua module by Reuben Thomas.
Modifications for LuaCheia by Thatcher Ulrich.
Documentation for LuaCheia by Jamie Webb.