Tokens

Parentheses used in a regular expression not only group elements of that expression together, but also designate any matches found for that group as tokens. You can use tokens to match other parts of the same string. One advantage of using tokens is that they remember what they matched, so you can recall and reuse matched text in the process of searching or replacing.

Introduction to Using Tokens

You can turn any pattern being matched into a token by enclosing the pattern in parentheses within the expression. For example, to create a token for a dollar amount, you could use '(\$\d+)'. Each token in the expression is assigned a number from 1 to 255 going from left to right. To make a reference to a token later in the expression, refer to it using a backslash followed by the token number. For example, when referencing a token generated by the third set of parentheses in the expression, use \3.
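As a minimal sketch of the dollar-amount pattern described above, the following calls capture each amount as a token and then reuse a token with \1. The input strings and variable names are invented for illustration.

```matlab
% Hypothetical input string, invented for this sketch.
str = 'Lunch was $12 and dinner was $45';

% 'tokens' returns one cell per match; each cell holds that match's tokens.
tok = regexp(str, '(\$\d+)', 'tokens');
first  = tok{1}{1};   % '$12'
second = tok{2}{1};   % '$45'

% \1 refers back to whatever token 1 captured, so this pattern only
% matches when the same amount appears on both sides of ' or '.
dup = regexp('$12 or $12 or $45', '(\$\d+) or \1', 'match');
% dup{1} is '$12 or $12'
```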

As a simple example, if you wanted to search for identical sequential letters in a string, you could capture the first letter as a token and then search for a matching character immediately afterwards. In the expression shown below, the (\S) phrase creates a token whenever regexp matches any non-white-space character in the string. The second part of the expression, '\1', looks for a second instance of the same character immediately following the first:

poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];

[mat tok ext] = regexp(poestr, '(\S)\1', 'match', ...
   'tokens', 'tokenExtents');
mat
mat = 
    'dd'    'pp'    'dd'    'pp'

The tokens returned in cell array tok are:

```
'd', 'p', 'd', 'p'
```

Starting and ending indices for each token in the input string poestr are:

```
11 11,  26 26,  35 35,  57 57
```
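One way to use the token extents is to index back into the original string. The sketch below assumes the same poestr and expression as above; the loop and the variable names idx and fromExtents are illustrative only.

```matlab
poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];

[tok, ext] = regexp(poestr, '(\S)\1', 'tokens', 'tokenExtents');

% ext{k} holds one [start end] row per token in the k-th match.
for k = 1:numel(ext)
    idx = ext{k}(1, :);                      % [start end] of token 1
    fromExtents = poestr(idx(1):idx(2));     % text recovered by indexing
    fprintf('match %d: token ''%s'' at %d-%d\n', ...
            k, fromExtents, idx(1), idx(2));
end
```

The text recovered through the extents should agree with the corresponding entry in tok.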

Using the tokens Parameter

You can have regexp and regexpi return the actual tokens rather than token indices by specifying the optional 'tokens' parameter in the command. The following example uses the same expression as the one above, but requests only the text of the tokens captured by (\S).

tok = regexp(poestr, '(\S)\1', 'tokens')
tok = 
    {1x1 cell}    {1x1 cell}    {1x1 cell}    {1x1 cell}

tok{:}
ans = 
    'd'
ans = 
    'p'
ans = 
    'd'
ans = 
    'p'
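Because the tokens come back as a cell array of cell arrays, it is often convenient to flatten them. The sketch below shows two possible approaches; the 'once' option is not discussed in this section and is included here only as an illustration.

```matlab
poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];

tok = regexp(poestr, '(\S)\1', 'tokens');

% Each element of tok is itself a 1x1 cell; concatenating them
% gives a flat 1x4 cell array.
flat = [tok{:}];          % {'d', 'p', 'd', 'p'}

% With 'once', only the first match is processed, and its tokens are
% returned without the outer per-match cell.
firstTok = regexp(poestr, '(\S)\1', 'tokens', 'once');   % {'d'}
```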

Operators Used with Tokens

Here are the operators you can use with tokens in MATLAB.

Operator	Usage
`(expr)`	Capture in a token all characters matched by the expression within the parentheses.
`\N`	Match the `N`th token generated by this command. That is, use `\1` to match the first token, `\2` to match the second, and so on.
`$N`	Insert the match for the `N`th token in a replacement string. Used only by the `regexprep` function.
`(?<name>expr)`	Capture in a token all characters matched by the expression within the parentheses. Assign a `name` to the token.
`\k<name>`	Match the token referred to by `name`.
`(?(tok)expr)`	If token `tok` is generated, then match expression `expr`.
`(?(tok)expr₁|expr₂)`	If token `tok` is generated, then match expression `expr₁`. Otherwise, match expression `expr₂`.

Using Tokens -- Example 1

Here is an example of how tokens are assigned values. Suppose that you are going to search the following text:

```
andy ted bob jim andrew andy ted mark
```

You choose to search the above text with the following search pattern:

```
and(y|rew)|(t)e(d)
```

This pattern has three parenthetical expressions that generate tokens. When you perform the search, the following tokens are generated for each match.

Match	Token 1	Token 2
`andy`	`y`
`ted`	`t`	`d`
`andrew`	`rew`
`andy`	`y`
`ted`	`t`	`d`

Only the highest level parentheses are used. For example, if the search pattern and(y|rew) finds the text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is used, token 1 is assigned the value andrew.
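To try this example yourself, you could run something like the following sketch. The printing loop and the variable names are illustrative; the comments on the last two lines restate the values given in the paragraph above.

```matlab
str = 'andy ted bob jim andrew andy ted mark';

[mat, tok] = regexp(str, 'and(y|rew)|(t)e(d)', 'match', 'tokens');

% List each match alongside the tokens it generated.
for k = 1:numel(mat)
    fprintf('%-8s %s\n', mat{k}, sprintf('%s ', tok{k}{:}));
end

% Token 1 depends on where the outermost parentheses sit.
inner = regexp('andrew', 'and(y|rew)',   'tokens', 'once');   % inner{1} is 'rew'
outer = regexp('andrew', '(and(y|rew))', 'tokens', 'once');   % outer{1} is 'andrew'
```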

Using Tokens -- Example 2

Use (expr) and \N to capture pairs of matching HTML tags (e.g., <a> and </a>) and the text between them. The expression used for this example is

```
expr = '<(\w+).*?>.*?</\1>';
```

The first part of the expression, '<(\w+)', matches an opening bracket (<) followed by one or more alphabetic, numeric, or underscore characters. The enclosing parentheses capture token characters following the opening bracket.

The second part of the expression, '.*?>.*?', matches the remainder of this HTML tag (characters up to the >), and any characters that may precede the next opening bracket.

The last part, '</\1>', matches all characters in the ending HTML tag. This tag is composed of the sequence </tag>, where tag is whatever characters were captured as a token.

hstr = '<!comment><a name="752507"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';

[mat tok] = regexp(hstr, expr, 'match', 'tokens');
mat{:}
ans =
    <a name="752507"></a>
ans =
    <b>Default</b>

tok{:}
ans = 
    'a'
ans = 
    'b'
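A possible extension, sketched here rather than taken from the example above, is to add a second set of parentheses so that the text between the tags is captured as well. The input string is invented for illustration.

```matlab
% Hypothetical input; token 1 is the tag name, token 2 the enclosed text.
hstr = '<b>Default</b><i>text</i>';
expr = '<(\w+).*?>(.*?)</\1>';

tok = regexp(hstr, expr, 'tokens');

for k = 1:numel(tok)
    fprintf('tag <%s> contains "%s"\n', tok{k}{1}, tok{k}{2});
end
```

The lazy quantifier in (.*?) keeps token 2 from running past the first closing tag that matches \1.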

Using Tokens in a Replacement String

When using tokens in a replacement string, reference them using $1, $2, etc. instead of \1, \2, etc. This example captures two tokens and reverses their order. The first, $1, is 'Norma Jean' and the second, $2, is 'Baker'. Note that regexprep returns the modified string, not a vector of starting indices, by default:

regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')
ans =
    Baker, Norma Jean
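The same technique works with more than two tokens. As a small sketch, assuming a date written as year-month-day (the input value is invented), you could reorder its parts like this:

```matlab
isoDate = '2024-01-15';   % hypothetical input

% $1 is the year, $2 the month, and $3 the day.
dmy = regexprep(isoDate, '(\d+)-(\d+)-(\d+)', '$3/$2/$1');
% dmy is '15/01/2024'
```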

Named Capture -- (?<name>expr)

If you use a lot of tokens in your expressions, it may be helpful to assign them names rather than having to keep track of which token number is assigned to which token. Use the operator (?<name>expr) to assign name to the token matching expression expr.

When referencing a named token within the expression, use the syntax \k<name> instead of the numeric \1, \2, etc.:

poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];

regexp(poestr, '(?<anychar>.)\k<anychar>', 'match')
ans = 
    'dd'    'pp'    'dd'    'pp'
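Named tokens can also be read back by name. The sketch below relies on the 'names' output option of regexp, which this section does not describe; the input string and the token names hour and minute are invented for illustration.

```matlab
str  = 'Meeting at 14:35';                 % hypothetical input
expr = '(?<hour>\d+):(?<minute>\d+)';

% 'names' returns a structure with one field per named token.
t = regexp(str, expr, 'names');

fprintf('hour = %s, minute = %s\n', t.hour, t.minute);
```

Accessing t.hour is arguably easier to read than tracking which numbered token holds which value.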

Conditional Expressions -- (?(token)expr1|expr2)

With conditional regular expressions, you can select which pattern to match, depending on whether a token elsewhere in the string is found. The expression appears as

```
(?(token)expr₁|expr₂)
```

This expression can be translated as an if-then-else statement, as follows:

if the specified token is found
   then match expression expr₁
   else match expression expr₂

The next example uses a conditional expression, stored in the variable expr, to match the string whether it refers to a man or a woman. The expression creates a token if Mr is followed by the letter s. It later matches either her or his, depending on whether this token was found. The phrase (?(1)her|his) means that if token 1 is found, then match her, else match his:

expr = 'Mr(s?)\..*?(?(1)her|his) son';

[mat tok] = regexp('Mr. Clark went to see his son', ...
   expr, 'match', 'tokens')
mat = 
    'Mr. Clark went to see his son'
tok = 
    {1x2 cell}

tok{:}
ans = 
     ''    'his'

In the second part of the example, the token s is found and MATLAB matches the word her:

[mat tok] = regexp('Mrs. Clark went to see her son', ...
   expr, 'match', 'tokens')
mat = 
    'Mrs. Clark went to see her son'
tok = 
    {1x2 cell}

tok{:}
ans = 
    's'    'her'

Note: The MATLAB regular expression functions support both if-then and if-then-else statements.
