-
Notifications
You must be signed in to change notification settings - Fork 88
/
Copy pathregular-expression-guide.txt
477 lines (348 loc) · 17.5 KB
/
regular-expression-guide.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
==============================================================================
Regex Guide
==============================================================================
This is a small summary of most of the regex operators and how to use them.
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line
| Alternation (or)
[abc] Match only a, b or c
[^abc] Match anything except a, b and c
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
You compose regular expressions from operators. Most operators have more than
one representation as characters.
====================
1. Operator Overview
====================
Match-self Operator Ordinary characters.
Match-any-character Operator .
Concatenation Operator Juxtaposition.
Repetition Operators * + ? {}
Alternation Operator |
List Operators [...] [^...]
Grouping Operators (...)
Back-reference Operator \digit
Anchoring Operators ^ $
The Match-self Operator (ORDINARY CHARACTER)
============================================
This operator matches the character itself. All ordinary characters
represent this operator. For example, `f' is always an ordinary character, so
the regular expression `f' matches only the string `f'. In particular, it
does *not* match the string `ff'.
The Match-any-character Operator (`.')
======================================
This operator matches any single printing or nonprinting character
except it won't match a newline.
The `.' (period) character represents this operator. For example,
`a.b' matches any three-character string beginning with `a' and ending
with `b'.
The Concatenation Operator
==========================
This operator concatenates two regular expressions A and B. No character
represents this operator; you simply put B after A. The result is a regular
expression that will match a string if A matches its first part and B matches
the rest. For example, `xy' (two match-self operators) matches `xy'.
Repetition Operators
====================
Repetition operators repeat the preceding regular expression a specified
number of times.
Match-zero-or-more *
Match-one-or-more +
Match-zero-or-one ?
Interval {}
The Match-zero-or-more Operator (`*')
-------------------------------------
This operator repeats the smallest possible preceding regular
expression as many times as necessary (including zero) to match the
pattern. `*' represents this operator. For example, `o*' matches any
string made up of zero or more `o's. Since this operator operates on
the smallest preceding regular expression, `fo*' has a repeating `o',
not a repeating `fo'. So, `fo*' matches `f', `fo', `foo', and so on.
Since the match-zero-or-more operator is a suffix operator, it may
be useless as such when no regular expression precedes it. This is
the case when it:
* is first in a regular expression, or
* follows a match-beginning-of-line, open-group, or alternation
operator.
The matcher processes a match-zero-or-more operator by first
matching as many repetitions of the smallest preceding regular
expression as it can. Then it continues to match the rest of the
pattern.
If it can't match the rest of the pattern, it backtracks (as many
times as necessary), each time discarding one of the matches until it
can either match the entire pattern or be certain that it cannot get a
match. For example, when matching `ca*ar' against `caaar', the
matcher first matches all three `a's of the string with the `a*' of
the regular expression. However, it cannot then match the final `ar'
of the regular expression against the final `r' of the string. So it
backtracks, discarding the match of the last `a' in the string. It
can then match the remaining `ar'.
The Match-one-or-more Operator (`+')
------------------------------------
This operator is similar to the match-zero-or-more operator except
that it repeats the preceding regular expression at least once.
For example, `ca+r' matches, e.g., `car' and `caaaar', but not `cr'.
The Match-zero-or-one Operator (`?')
------------------------------------
This operator is similar to the match-zero-or-more operator except
that it repeats the preceding regular expression once or not at all.
For example, `ca?r' matches both `car' and `cr', but nothing else.
Interval Operators (`{' ... `}'
-------------------------------
`{COUNT}'
matches exactly COUNT occurrences of the preceding regular
expression.
`{MIN,}'
matches MIN or more occurrences of the preceding regular
expression.
`{MIN, MAX}'
matches at least MIN but no more than MAX occurrences of the
preceding regular expression.
The interval expression (but not necessarily the regular expression
that contains it) is invalid if:
* MIN is greater than MAX, or
* any of COUNT, MIN, or MAX are outside the range zero to
`RE_DUP_MAX' (which symbol `regex.h' defines).
The Alternation Operator (`|')
==============================
Alternatives match one of a choice of regular expressions: if you put the
character(s) representing the alternation operator between any two regular
expressions A and B, the result matches the union of the strings that A and B
match. For example, supposing that `|' is the alternation operator, then
`foo|bar|quux' would match any of `foo', `bar' or `quux'.
The alternation operator operates on the *largest* possible surrounding
regular expressions. (Put another way, it has the lowest precedence of any
regular expression operator.) Thus, the only way you can delimit its arguments
is to use grouping. For example, if `(' and `)' are the open and close-group
operators, then `fo(o|b)ar' would match either `fooar' or `fobar'. (`foo|bar'
would match `foo' or `bar'.)
The matcher usually tries all combinations of alternatives so as to match
the longest possible string. For example, when matching
`(fooq|foo)*(qbarquux|bar)' against `fooqbarquux', it cannot take, say, the
first ("depth-first") combination it could match, since then it would be
content to match just `fooqbar'.
List Operators (`[' ... `]')
============================
"Lists", also called "bracket expressions", are a set of one or more items.
An "item" is a character, a character class expression, or a range expression.
The syntax bits affect which kinds of items you can put in a list. We explain
the last two items in subsections below. Empty lists are invalid.
A "matching list" matches a single character represented by one of the list
items. You form a matching list by enclosing one or more items within an
"open-matching-list operator" (represented by `[') and a "close-list operator"
(represented by `]').
For example, `[ab]' matches either `a' or `b'. `[ad]*' matches the empty
string and any string composed of just `a's and `d's in any order. Regex
considers invalid a regular expression with a `[' but no matching `]'.
"Nonmatching lists" are similar to matching lists except that they match a
single character *not* represented by one of the list items. You use an
"open-nonmatching-list operator" (represented by `[^'(1)) instead of an
open-matching-list operator to start a nonmatching list.
For example, `[^ab]' matches any character except `a' or `b'.
Most characters lose any special meaning inside a list. The special
characters inside a list follow.
`]'
ends the list if it's not the first list item. So, if you want to
make the `]' character a list item, you must put it first.
`[:'
represents the open-character-class operator if what follows is a valid
character class expression.
`:]'
represents the close-character-class operator if what precedes it is an
open-character-class operator followed by a valid character class name.
`-'
represents the range operator if it's not first or last in a list or the
ending point of a range.
All other characters are ordinary. For example, `[.*]' matches `.' and
`*'.
Character Class [:class:]
Range start-end
---------- Footnotes ----------
(1) Regex therefore doesn't consider the `^' to be the first
character in the list. If you put a `^' character first in (what you
think is) a matching list, you'll turn it into a nonmatching list.
Character Class Operators (`[:' ... `:]')
-----------------------------------------
A "character class expression" matches one character from a given
class. You form a character class expression by putting a character
class name between an "open-character-class operator" (represented by
`[:') and a "close-character-class operator" (represented by `:]').
The character class names and their meanings are:
`alnum'
letters and digits
`alpha'
letters
`blank'
system-dependent; for GNU, a space or tab
`cntrl'
control characters (in the ASCII encoding, code 0177 and codes
less than 040)
`digit'
digits
`graph'
same as `print' except omits space
`lower'
lowercase letters
`print'
printable characters (in the ASCII encoding, space tilde--codes
040 through 0176)
`punct'
neither control nor alphanumeric characters
`space'
space, carriage return, newline, vertical tab, and form feed
`upper'
uppercase letters
`xdigit'
hexadecimal digits: `0'-`9', `a'-`f', `A'-`F'
These correspond to the definitions in the C library's `<ctype.h>'
facility. For example, `[:alpha:]' corresponds to the standard
facility `isalpha'. Regex recognizes character class expressions only
inside of lists; so `[[:alpha:]]' matches any letter, but `[:alpha:]'
outside of a bracket expression and not followed by a repetition
operator matches just itself.
The Range Operator (`-')
------------------------
Regex recognizes "range expressions" inside a list. They represent
those characters that fall between two elements in the current
collating sequence. You form a range expression by putting a "range
operator" between two characters.(1) `-' represents the range
operator. For example, `a-f' within a list represents all the
characters from `a' through `f' inclusively.
Since `-' represents the range operator, if you want to make a `-'
character itself a list item, you must do one of the following:
* Put the `-' either first or last in the list.
* Include a range whose starting point collates strictly lower than
`-' and whose ending point collates equal or higher. Unless a
range is the first item in a list, a `-' can't be its starting
point, but *can* be its ending point. That is because Regex
considers `-' to be the range operator unless it is preceded by
another `-'. For example, in the ASCII encoding, `)', `*', `+',
`,', `-', `.', and `/' are contiguous characters in the collating
sequence. You might think that `[)-+--/]' has two ranges: `)-+'
and `--/'. Rather, it has the ranges `)-+' and `+--', plus the
character `/', so it matches, e.g., `,', not `.'.
* Put a range whose starting point is `-' first in the list.
For example, `[-a-z]' matches a lowercase letter or a hyphen (in
English, in ASCII).
---------- Footnotes ----------
(1) You can't use a character class for the starting or ending point
of a range, since a character class is not a single character.
Grouping Operators (`(' ... `)')
================================
A "group", also known as a "subexpression", consists of an "open-group
operator", any number of other operators, and a "close-group operator". Regex
treats this sequence as a unit, just as mathematics and programming languages
treat a parenthesized expression as a unit.
Therefore, using "groups", you can:
* delimit the argument(s) to an alternation operator or a repetition
operator
* keep track of the indices of the substring that matched a given
group. for a precise explanation. This lets you:
* use the back-reference operator
* use registers
The Back-reference Operator ("\"DIGIT)
======================================
A back reference matches a specified preceding group. The back reference
operator is represented by `\DIGIT' anywhere after the end of a regular
expression's DIGIT-th group.
DIGIT must be between `1' and `9'. The matcher assigns numbers 1 through 9
to the first nine groups it encounters. By using one of `\1' through `\9'
after the corresponding group's close-group operator, you can match a
substring identical to the one that the group does.
Back references match according to the following (in all examples below, `('
represents the open-group, `)' the close-group, `{' the open-interval and `}'
the close-interval operator):
* If the group matches a substring, the back reference matches an
identical substring. For example, `(a)\1' matches `aa' and
`(bana)na\1bo\1' matches `bananabanabobana'. Likewise, `(.*)\1' matches
any newline-free string that is composed of two identical halves; the
`(.*)' matches the first half and the `\1' matches the second half.
* If the group matches more than once (as it might if followed by,
e.g., a repetition operator), then the back reference matches the
substring the group *last* matched. For example, `((a*)b)*\1\2' matches
`aabababa'; first group 1 (the outer one) matches `aab' and group 2 (the
inner one) matches `aa'. Then group 1 matches `ab' and group 2 matches
`a'. So, `\1' matches `ab' and `\2' matches `a'.
* If the group doesn't participate in a match, i.e., it is part of an
alternative not taken or a repetition operator allows zero repetitions of
it, then the back reference makes the whole match fail. For example,
`(one()|two())-and-(three\2|four\3)' matches `one-and-three' and
`two-and-four', but not `one-and-four' or `two-and-three'. For example,
if the pattern matches `one-and-', then its group 2 matches the empty
string and its group 3 doesn't participate in the match. So, if it then
matches `four', then when it tries to back reference group 3--which it
will attempt to do because `\3' follows the `four'--the match will fail
because group 3 didn't participate in the match.
You can use a back reference as an argument to a repetition operator. For
example, `(a(b))\2*' matches `a' followed by two or more `b's. Similarly,
`(a(b))\2{3}' matches `abbbb'.
If there is no preceding DIGIT-th subexpression, the regular expression is
invalid.
Anchoring Operators
===================
These operators can constrain a pattern to match only at the
beginning or end of the entire string or at the beginning or end of a
line.
Match-beginning-of-line ^
Match-end-of-line $
The Match-beginning-of-line Operator (`^')
------------------------------------------
This operator can match the empty string either at the beginning of
the string or after a newline character. Thus, it is said to "anchor"
the pattern to the beginning of a line.
In the cases following, `^' represents this operator. (Otherwise,
`^' is ordinary.)
* It (the `^') is first in the pattern, as in `^foo'.
* It follows an open-group or alternation operator, as in `a\(^b\)'
and `a\|^b'.
The Match-end-of-line Operator (`$')
------------------------------------
This operator can match the empty string either at the end of the
string or before a newline character in the string. Thus, it is said
to "anchor" the pattern to the end of a line.
It is always represented by `$'. For example, `foo$' usually
matches, e.g., `foo' and, e.g., the first three characters of
`foo\nbar'.
============================
2. Word and Buffer Operators
============================
Word Operators
Match-word-boundary \b
This operator (represented by `\b') matches the empty string at
either the beginning or the end of a word. For example, `\brat\b'
matches the separate word `rat'.
Match-within-word \B
This operator (represented by `\B') matches the empty string within
a word. For example, `c\Brat\Be' matches `crate', but `dirty \Brat'
doesn't match `dirty rat'.
Match-beginning-of-word \<
This operator (represented by `\<') matches the empty string at the
beginning of a word.
Match-end-of-word \>
This operator (represented by `\>') matches the empty string at the
end of a word.
Match-word-constituent \w
This operator (represented by `\w') matches any word-constituent
character.
Match-non-word-constituent \W
This operator (represented by `\W') matches any character that is not
word-constituent.
Buffer Operators
("buffers" in mail2sms are the whole input strings that you can match.)
Match-beginning-of-buffer \`
This operator (represented by `\`') matches the empty string at the
beginning of the buffer.
Match-end-of-buffer \'
This operator (represented by `\'') matches the empty string at the
end of the buffer.
Shamelessly stolen from http://daniel.haxx.se/projects/mail2sms/regex.shtml