regexp, regexpi (MATLAB Functions)

Match regular expression

Syntax

Each of these syntaxes applies to both regexp and regexpi. The regexp function is case sensitive when matching a regular expression to a string; regexpi is case insensitive:

regexp('str', 'expr')
[start end extents match tokens names] = regexp('str', 'expr')
[v1 v2 ...] = regexp('str', 'expr', 'q1', 'q2', ...)
[v1 v2 ...] = regexp('str', 'expr', 'q1', 'q2', ..., 'once')
regexp 'str' 'expr' 'q1' 'q2' ... 'once'
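The case-sensitivity difference between the two functions can be sketched as follows. The sample string is made up for illustration and is not part of the original reference page:

```matlab
% Hypothetical sample string, chosen only to show the difference.
str = 'MATLAB Matlab matlab';

% regexp is case sensitive: only the all-lowercase word matches.
sensitive = regexp(str, 'matlab');       % 15

% regexpi ignores case: all three words match.
insensitive = regexpi(str, 'matlab');    % 1 8 15
```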

Description

The following descriptions apply to both regexp and regexpi:

regexp('str', 'expr') returns a row vector containing the starting index of each substring of str that matches the regular expression string expr. If no matches are found, regexp returns an empty array. The str and expr arguments can also be cell arrays of strings. See the guidelines listed below under Multiple Strings and Expressions.
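A minimal sketch of the single-output form, using an illustrative string that does not appear in the original page. It shows the returned start indices and the empty result when nothing matches:

```matlab
% Illustrative input, not from the reference page.
str = 'hello world';

% Starting index of every 'o' in str.
idx = regexp(str, 'o');      % 5 8

% No 'z' in str, so the result is an empty array.
none = regexp(str, 'z');
noMatch = isempty(none);     % true
```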

[start end extents match tokens names] = regexp('str', 'expr') returns up to six values, one for each output variable you specify, in the default order shown in the table below.

[v1 v2 ...] = regexp('str', 'expr', 'q1', 'q2', ...) returns up to six values, one for each output variable you specify, in the order of the qualifier arguments 'q1', 'q2', and so on. The valid qualifiers are listed in the table below.

**Return Values for Regular Expressions**
Default Order	Description	Qualifier
1	Row vector containing the starting index of each substring of `str` that matches `expr`	`start`
2	Row vector containing the ending index of each substring of `str` that matches `expr`	`end`
3	Cell array containing the starting and ending indices of each substring of `str` that matches a token in `expr`	`tokenExtents`
4	Cell array containing the text of each substring of `str` that matches `expr`	`match`
5	Cell array containing the text of each token captured by `regexp`.	`tokens`
6	Structure array containing the name and text of each named token captured by `regexp`. If there are no named tokens in `expr`, `regexp` returns a structure array with no fields. Field names of the returned structure are set to the token names, and field values are the text of those tokens. Named tokens are generated by the expression `(?<tokenname>)`.	`names`
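The effect of the default order versus qualifier order can be sketched as below. The input string, the expression, and all variable names are illustrative assumptions rather than part of the original page. Note that end is a MATLAB keyword, so the second output needs a different variable name:

```matlab
% Illustrative input: two tokens per match (letters, then digits).
str  = 'ab12 cd345';
expr = '([a-z]+)(\d+)';

% No qualifiers: outputs follow the default order, so the first two
% outputs are the start and end indices of each match.
[s, e] = regexp(str, expr);                      % s = [1 6], e = [4 10]

% With qualifiers: outputs follow the order of the qualifiers.
[m, s2] = regexp(str, expr, 'match', 'start');   % m = {'ab12','cd345'}

% Token text and token extents, one cell per match.
[tok, te] = regexp(str, expr, 'tokens', 'tokenExtents');
% tok{1} = {'ab','12'}, te{1} = [1 2; 3 4]
```

Requesting only the qualifiers you need keeps the output list short and avoids depending on the default positions.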

[v1 v2 ...] = regexp('str', 'expr', 'q1', 'q2', ..., 'once') returns just the first match found. The keyword once must come last in the argument list. Output and qualifier arguments are not required.
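As a rough sketch of the 'once' keyword, compare a full search with a first-match-only search. The input string and variable names are hypothetical:

```matlab
% Illustrative input, not from the reference page.
str = 'alpha beta gamma';

% Without 'once': every match is returned, wrapped in a cell array.
allWords = regexp(str, '\w+', 'match');    % {'alpha','beta','gamma'}

% With 'once' (last argument): only the first match is returned,
% here as a plain string and a scalar index.
[firstWord, firstIdx] = regexp(str, '\w+', 'match', 'start', 'once');
% firstWord = 'alpha', firstIdx = 1
```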

regexp 'str' 'expr' 'q1' 'q2' ... 'once' is the command syntax for this function. Only the 'str' and 'expr' arguments are required.

Remarks

Multiple Strings and Expressions

Either the str or expr argument, or both, can be a cell array of strings, according to the following guidelines:

If str is a cell array of strings, then each of the regexp outputs is a cell array having the same dimensions as str.
If str is a single string but expr is a cell array of strings, then each of the regexp outputs is a cell array having the same dimensions as expr.
If both str and expr are cell arrays of strings, these two cell arrays must contain the same number of elements.
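The three cases above can be sketched as follows, using made-up strings. When both arguments are cell arrays, this sketch assumes each string is paired with the expression at the same position:

```matlab
% Case 1: str is a cell array, expr is a single string.
% The output has the same dimensions as str.
idx1 = regexp({'cat hat', 'dog'}, 'a');     % {[2 6], (empty)}

% Case 2: str is a single string, expr is a cell array.
% The output has the same dimensions as expr.
idx2 = regexp('banana', {'a', 'n'});        % {[2 4 6], [3 5]}

% Case 3: both are cell arrays with the same number of elements.
idx3 = regexp({'abc', 'xyz'}, {'c', 'x'});  % {3, 1}
```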

See Regular Expressions in the MATLAB documentation for a listing of all regular expression elements supported by MATLAB.

regexp does not support international character sets.

Examples

Example 1

Return a row vector containing the starting indices of the substrings that start with c, end with t, and contain one or more vowels in between. Make the matches insensitive to letter case by using regexpi:

str = 'bat cat can car COAT court cut ct CAT-scan';
regexpi(str, 'c[aeiou]+t')
ans =
     5    17    28    35

Example 2

Return cell arrays of row vectors containing the indices of the capital letters and of the white-space characters in the cell array of strings str:

str = {'Madrid, Spain' 'Romeo and Juliet' 'MATLAB is great'};
s1 = regexp(str, '[A-Z]');
s2 = regexp(str, '\s');

Capital letters, '[A-Z]', were found at these str indices:

s1{:}
ans =
     1     9
ans =
     1    11
ans =
     1     2     3     4     5     6

Space characters, '\s', were found at these str indices:

s2{:}
ans =
     8
ans =
     6    10
ans =
     7    10

Example 3

Return the text and the starting and ending indices of words containing the letter x:

str = 'regexp helps you relax';
[m s e] = regexp(str, '\w*x\w*', 'match', 'start', 'end')
m = 
    'regexp'    'relax'
s =
     1    18
e =
     6    22

Example 4

Search a string for opening and closing HTML tags. Use the expression <(\w+) to find the opening tag (e.g., '<tagname') and to create a token for it. Use the expression </\1> to find another occurrence of the same token, but formatted as a closing tag (e.g., '</tagname>'). The lazy quantifier .*? matches as few characters as possible, so each match stops at the first closing tag for its token:

str = 'if <code>A</code> == x<sup>2</sup>, <em>disp(x)</em>';
expr = '<(\w+).*?>.*?</\1>';

[tok mat] = regexp(str, expr, 'tokens', 'match');

tok{:}
ans = 
    'code'
ans = 
    'sup'
ans = 
    'em'

mat{:}
ans =
    <code>A</code>
ans =
    <sup>2</sup>
ans =
    <em>disp(x)</em>

See "Tokens" in the MATLAB Programming documentation for information on using tokens.

Example 5

Enter a string containing two names, with the first and last names in a different order on each line:

str = sprintf('John Davis\nRogers, James')
str =
    John Davis
    Rogers, James

Create an expression that generates first and last name tokens, assigning the names first and last to the tokens. Call regexp to get the text and names of each token found:

expr = ...
   '(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)';

[tokens names] = regexp(str, expr, 'tokens', 'names');

Examine the tokens cell array that was returned. The first and last name tokens appear in the order in which they were generated: first name-last name, then last name-first name:

tokens{:}
ans = 
    'John'    'Davis'
ans = 
    'Rogers'    'James'

Now examine the names structure that was returned. First and last names appear in a more usable order:

names(:,1)
ans = 
    first: 'John'
     last: 'Davis'

names(:,2)
ans = 
    first: 'James'
     last: 'Rogers'

See Also

regexprep, strfind, findstr, strmatch, strcmp, strcmpi, strncmp, strncmpi
