Book review – Introducing Regular Expressions · guh.me

Regular expressions are scary, especially if you don’t understand them. When I started programming, I thought of them as black magic, evil and complicated (unsavvy kid, uhm?). As I got more skilled, I realized that they would help me solve problems fast (and sometimes would double the problems ;), and I knew it was time to read a book on them. I chose Introducing Regular Expressions, by Michael Fitzgerald, because it is very short and goes straight to the point - perfect for beginners who like fast-paced books. I’ve greatly improved my ~~witchcraft~~ regex skills, and now people even seek advice from me, lol. Next I’m reading Mastering Regular Expressions, which is pretty hardcore.

Book on Amazon: http://www.amazon.com/gp/product/1449392687

My Personal Notes

Chapter 1: What’s a regular expression?

Regular expressions are specially encoded text strings used as patterns for matching sets of strings.
String literal: literal representation of a string. It always matches itself, character for character.
- [0-9] matches any digit in the range 0-9.
Metacharacter: has special meaning in a regular expression and is reserved (e.g.: \, *, [, etc).
- [0143] is a limited range/character class (character set between brackets).
- \d matches all Arabic digits (\d == character shorthand).
- \D matches any character that’s not a digit.
- . matches any character (“wildcard”).
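
To see these building blocks in action, here is a small sketch. It assumes Python's re module (the notes themselves are not tied to any engine) and reuses the phone number from the examples further down.

```python
import re

text = "707-827-7019"

# A string literal matches itself.
literal = re.findall(r"707", text)

# [0-9] and \d both match a single digit in this ASCII text.
range_digits = re.findall(r"[0-9]", text)
shorthand_digits = re.findall(r"\d", text)

# [0143] only matches the digits listed between the brackets.
limited = re.findall(r"[0143]", text)

# \D matches anything that is not a digit; . matches any character.
non_digits = re.findall(r"\D", text)
wildcard = re.findall(r".", text)

print(literal, limited, non_digits, len(wildcard))
```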

Capturing groups and back references

(\d)\d\1          707-827-7019

(\d) => capturing group, matches 7
\d => matches 0
\1 => back reference (matches a previous captured string, 7 in this case, which in turn was captured by (\d)).
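
A quick way to check the walkthrough above, sketched with Python's re module (my choice of engine, not the book's):

```python
import re

# (\d) captures a digit, \d matches the next digit,
# and \1 must repeat whatever (\d) captured.
match = re.search(r"(\d)\d\1", "707-827-7019")

whole_match = match.group(0)  # the text matched by the full pattern
captured = match.group(1)     # the text captured by (\d)

print(whole_match, captured)
```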

Quantifiers

Numbers in curly braces tell the number of occurrences that you want to look for.

\d{3}-?\d{3}-?\d{4}

? : zero or one occurrences quantifier.
+ : one or more occurrences.
* : zero or more occurrences.

\d{3,4} : minimum and maximum quantity to match.

Expression analysis: (\d{3,4}[.-]?)+

( ) => capturing group
\d => digit shorthand
[ ] => character class
? => zero or one quantifier
+ => one or more quantifier
{3,4} => range quantifier
[.-] => literal dot and hyphen
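
The two quantifier expressions above can be tried out like this. It is only a sketch, assuming Python's re module; fullmatch is used so that the whole input has to fit the pattern.

```python
import re

# \d{3}-?\d{3}-?\d{4}: the hyphens are optional thanks to ?.
phone = re.compile(r"\d{3}-?\d{3}-?\d{4}")
with_hyphens = phone.fullmatch("707-827-7019") is not None
without_hyphens = phone.fullmatch("7078277019") is not None

# (\d{3,4}[.-]?)+: chunks of 3 or 4 digits, each optionally
# followed by a dot or hyphen, repeated one or more times.
grouped = re.compile(r"(\d{3,4}[.-]?)+")
m = grouped.fullmatch("707.827.7019")

# A repeated capturing group only keeps its last repetition.
last_chunk = m.group(1)

print(with_hyphens, without_hyphens, last_chunk)
```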

Quoting literals

^ : beginning of a line or string
$ : end of a line or string
| : alternation, given choice of alternatives
\X : literal X

Final expression (that matches an American phone number)

^(\(\d{3}\)|^\d{3}[.-]?)\d{3}[.-]?\d{4}
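
Here is a sketch that runs the final expression against a few inputs, assuming Python's re module. The sample numbers are my own. Note that the pattern as written has no $ at the end, so it only checks how the string starts.

```python
import re

PHONE = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)\d{3}[.-]?\d{4}")

samples = [
    "(707)827-7019",  # area code in parentheses
    "707-827-7019",   # hyphens
    "707.827.7019",   # dots
    "7078277019",     # no separators
    "827-7019",       # no area code
]

# Map each sample to whether the pattern matches it.
results = {s: bool(PHONE.search(s)) for s in samples}

for sample, ok in results.items():
    print(sample, "=>", ok)
```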

Chapter 2: Simple pattern matching

Regex test tools

Pattern matching

Matching string literals: use string literals, doh!
Matching digits: [0-9], [059], \d
Matching non-digits: \D, [^0-9], [^\d]
Matching word and non word characters:
- Word characters (letters, numbers and the underscore): \w
- Non-word characters: \W
- Word boundary: \b
- Non-word boundary: \B
- Carriage return: \r
- Newline: \n
- Whitespace character (space, tab, newline and friends): \s
- Horizontal tab: \t
- Null character: \0
- Horizontal space: \h
- Non-horizontal space: \H
Match any character: . (except newlines, unless the dotall modifier is set)
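
A short sketch of these shorthands, assuming Python's re module. Python's re has no \h or \H, so the [ \t] class below is my own stand-in for "horizontal space", not something from the book.

```python
import re

text = "Tab\there, line\nbreak_1"

words = re.findall(r"\w+", text)      # runs of word characters
non_words = re.findall(r"\W", text)   # everything else
whitespace = re.findall(r"\s", text)  # tab, space, newline

# Stand-in for \h (an assumption): space or tab only.
horizontal = re.findall(r"[ \t]", text)

# \b sits at the edge of a word, \B anywhere that is not an edge.
whole_word = re.findall(r"\bline\b", text)
inside_word = re.findall(r"\Bin\B", text)

# The dot skips newlines unless the DOTALL modifier is set.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)

print(words, whole_word, inside_word)
```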

Chapter 3: Boundaries

Assertions mark boundaries, but they don’t consume characters, i.e., they won’t be returned in the results.
Boundaries don’t match characters, but locations in strings.
The beginning and end of a line:
- ^ matches line or string beginning
- $ matches line or string end
Word and non-word boundaries
- \b matches word boundary
- \B matches non-word boundary
Other anchors
- \A : matches the start of a subject (the whole string, never an inner line)
- \Z : matches the end of a subject (the whole string, never an inner line)
Quoting a group of characters as literals
- Reserved characters can be escaped in a pattern: \QXXXXXXXX\E
- Where each X is a character. E.g.: \Q$-\E will match a literal $- in a string.
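
The difference between line anchors and subject anchors shows up once multiline mode is on. The sketch below assumes Python's re module, where \Z matches only at the very end of the string. Python has no \Q...\E, so re.escape is used as the closest equivalent; that substitution is mine.

```python
import re

poem = "first line\nsecond line"

# With MULTILINE, ^ and $ match at every line; \A and \Z do not.
line_starts = re.findall(r"^\w+", poem, flags=re.MULTILINE)
string_start = re.findall(r"\A\w+", poem, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", poem, flags=re.MULTILINE)
string_end = re.findall(r"\w+\Z", poem, flags=re.MULTILINE)

# re.escape quotes reserved characters, much like \Q$-\E would.
quoted = re.escape("$-")
found = re.findall(quoted, "cost: $-5")

print(line_starts, string_start, line_ends, string_end, found)
```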

Chapter 4: Alternation, groups and back references

Character classes help you match specific characters, or a sequence of specific characters.
- \b[1][0-9]\b matches 10 to 19
Negated character class
- [^a-z] matches any character except a-z (lowercase)
Union and difference: character classes can act like sets, therefore they can be combined (Java supports this syntax; many other engines don’t).
- [0-3[6-9]] = {0, 1, 2, 3} U {6, 7, 8, 9} = {0, 1, 2, 3, 6, 7, 8, 9}
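
A sketch of these character classes, assuming Python's re module. Python does not support the nested [0-3[6-9]] union syntax, so the union is written as the flat class [0-36-9], which describes the same set.

```python
import re

# \b[1][0-9]\b: a 1 followed by one digit, as a whole word.
teens = re.findall(r"\b[1][0-9]\b", "9 10 15 19 20 110")

# [^a-z]: anything that is not a lowercase letter.
not_lower = re.findall(r"[^a-z]", "abC-9")

# Flat equivalent of the union {0..3} U {6..9}.
union = re.findall(r"[0-36-9]", "0123456789")

print(teens, not_lower, union)
```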

Chapter 5: Matching Unicode and other characters

Matching unicode characters:
- \u00e9 => matches é, written as its hex code point
\0 : null character
\e : escape character
[\b] : backspace
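
These escapes can be sketched with Python's re module (my choice of engine). Python's re rejects \e, so the escape character is written as \x1b below; that substitution is an assumption on my side, not the book's notation.

```python
import re

# The sample text is built with escapes so the é is unambiguous.
text = "caf\u00e9 r\u00e9sum\u00e9"
accents = re.findall(r"\u00e9", text)

# \0 matches the null character.
has_null = re.search(r"\0", "a\x00b") is not None

# Inside a character class, \b means backspace, not word boundary.
has_backspace = re.search(r"[\b]", "a\x08b") is not None

# Stand-in for \e: the escape character by its hex value.
has_escape = re.search(r"\x1b", "\x1b[0m") is not None

print(len(accents), has_null, has_backspace, has_escape)
```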

Chapter 6: Quantifiers

Quantifiers are greedy by default: a greedy quantifier first tries to match as much of the string as it can, then backs off one character at a time until the rest of the pattern matches. All that backtracking can take a lot of resources.
A lazy quantifier starts at the beginning of the target and tries to match as little as possible. It grows the match one character at a time, and only tries to match the whole string as a last resort. A lazy quantifier must have a question mark ? appended.
A possessive quantifier grabs as much of the target as it can and tries to find a match in a single attempt, with no backtracking. It must have a plus sign + appended.
* is called the Kleene star (LOL).
Examples:
- 9* : matches zero or more 9s
- 9+ : matches one or more 9s
- 9? : matches one or zero 9s
Matching a specific number of times:
- {x} : matches exactly x occurrences of the preceding item.
- {x,y} : matches x to y occurrences of the preceding item.
- {x,} : matches at least x occurrences of the preceding item.
Lazy quantifiers: they match as few characters as possible (including zero/no characters).
- lazy zero or more => matches 0
- lazy zero or one => matches 0
- lazy one or more => matches 1
- lazy {n} or more => matches n
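
The lazy list above can be checked against a run of 9s. This is a sketch assuming Python's re module, with n = 2 picked arbitrarily for the last case.

```python
import re

nines = "999"

greedy_star = re.match(r"9*", nines).group()      # takes all it can
lazy_star = re.match(r"9*?", nines).group()       # zero characters
lazy_optional = re.match(r"9??", nines).group()   # zero characters
lazy_plus = re.match(r"9+?", nines).group()       # one character
lazy_at_least = re.match(r"9{2,}?", nines).group()  # n characters

print(repr(greedy_star), repr(lazy_star), repr(lazy_optional),
      repr(lazy_plus), repr(lazy_at_least))
```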
Possessive quantifiers: they grab as much as they can, but they don’t backtrack, i.e., they don’t give up anything they find. They can be FASTER, since a failing match is abandoned right away!
- Input: 00000000 => 0.*+ : matches all
- steps 0 - input[0]
- 00 - input[0, 1]
- 000 - input[0, 1, 2]
- … - input[0, 1, … , n]
- It has no backtracking: .*+0 matches nothing, because .*+ swallows every character and never gives one back for the final 0
- steps 0 - input[0]
- 0 - input[1]
- 0 - input[2]
- … - input[n]
Possessive quantifiers
- ?+ : possessive zero or one
- ++ : possessive one or more
- *+ : possessive zero or more
- {n}+ : possessive n
- {n,}+ : possessive n or more
- {m,n}+ : possessive {m,n}
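
To watch the lack of backtracking on the 00000000 input, here is a sketch in Python. It assumes Python 3.11 or newer, the first version whose re module accepts possessive quantifiers; on older versions the possessive part is skipped. The pattern .*+0 is my example of a possessive quantifier followed by something it has already swallowed.

```python
import re
import sys

zeros = "00000000"

# Greedy .* backtracks one character so the final 0 can match.
greedy_backtracks = re.search(r".*0", zeros) is not None

# Possessive quantifiers need Python 3.11+ (older versions raise
# an error when compiling them).
SUPPORTS_POSSESSIVE = sys.version_info >= (3, 11)

if SUPPORTS_POSSESSIVE:
    # 0.*+ : one 0, then everything else, kept for good.
    matches_all = re.fullmatch(r"0.*+", zeros) is not None
    # .*+0 : .*+ keeps every character, so the final 0 never matches.
    matches_nothing = re.search(r".*+0", zeros) is None
    print(greedy_backtracks, matches_all, matches_nothing)
else:
    print(greedy_backtracks, "possessive quantifiers not supported here")
```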

Chapter 7: Lookarounds

Lookarounds are non-capturing groups that match patterns based on what they find before or after a pattern, without including it in the match.
Positive lookahead: finds an occurrence of a string that is followed by a certain string.
- (?i)ancyent (?=marinere)
- “Find the occurrences of ‘ancyent’ (case insensitive) that are followed by the word ‘marinere’”.
Negative lookahead: it tries to find a match that’s not followed by the lookahead.
- (?i)ancyent (?!marinere)
- “Find the occurrences of ‘ancyent’ (case insensitive) that are not followed by the word ‘marinere’”.
Positive lookbehind: finds an occurrence of a string that is preceded by a certain string.
- (?i)(?<=ancyent) marinere
- “Find the occurrences of ‘marinere’ (case insensitive) that are preceded by the word ‘ancyent’”.
Negative lookbehind: finds an occurrence of a string that is not preceded by a certain string.
- (?i)(?<!ancyent) marinere
- “Find the occurrences of ‘marinere’ (case insensitive) that are not preceded by the word ‘ancyent’”.
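
The four lookarounds can be compared side by side. This is a sketch assuming Python's re module; the sample sentence is made up for the test and is not a quote from the poem.

```python
import re

text = "The ancyent Marinere told an ancyent tale to a young marinere."

first = text.index("ancyent")    # followed by "Marinere"
second = text.rindex("ancyent")  # followed by "tale"

# Lookaheads: which "ancyent " is (not) followed by "marinere"?
followed = [m.start() for m in re.finditer(r"(?i)ancyent (?=marinere)", text)]
not_followed = [m.start() for m in re.finditer(r"(?i)ancyent (?!marinere)", text)]

# Lookbehinds: which " marinere" is (not) preceded by "ancyent"?
preceded = re.findall(r"(?i)(?<=ancyent) marinere", text)
not_preceded = re.findall(r"(?i)(?<!ancyent) marinere", text)

print(followed, not_followed, preceded, not_preceded)
```

The lookaround text never shows up in the results: the lookahead matches stop at "ancyent ", and the lookbehind matches start at the space before the word.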

Chapter 8: Marking up a document with HTML

Regex modifiers and flags
- /g: global matching
- /m: multiline strings
- /i: case insensitive
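
Python has no /g, /m or /i suffixes, so the sketch below uses what I take to be the closest equivalents in its re module: re.sub replaces every match by default (like /g) unless count is set, re.MULTILINE stands in for /m, and re.IGNORECASE for /i. The sample text and the p tags are my own illustration of marking up lines with HTML.

```python
import re

poem = "THE RIME\nIt is an ancyent Marinere"

# Without "global": only the first line gets wrapped.
first_only = re.sub(r"^(.+)$", r"<p>\1</p>", poem, count=1,
                    flags=re.MULTILINE)

# "Global" plus multiline: every line gets wrapped.
every_line = re.sub(r"^(.+)$", r"<p>\1</p>", poem, flags=re.MULTILINE)

# Case insensitive: a lowercase pattern finds the capitalized word.
any_case = re.findall(r"marinere", poem, flags=re.IGNORECASE)

print(first_only)
print(every_line)
print(any_case)
```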