Regular expressions are scary, especially if you don’t understand them. When I started programming, I thought of them as black magic, evil and complicated (unsavvy kid, huh?). As I got more skilled, I realized that they would help me solve problems fast (and sometimes double the problems ;), so I knew it was time to read a book on them. I chose Introducing Regular Expressions, by Michael Fitzgerald, because it is very short and goes straight to the point - perfect for beginners who like fast-paced books. I’ve greatly improved my regex witchcraft skills, and now people even seek advice from me, lol. Next I’m reading Mastering Regular Expressions, which is pretty hardcore.

Book on Amazon: http://www.amazon.com/gp/product/1449392687

My Personal Notes

Chapter 1: What’s a regular expression?

  • Regular expressions are specially encoded text strings used as patterns for matching sets of strings.
  • String literal: literal representation of a string. It matches itself, character by character.
  • Metacharacter: has special meaning in a regular expression and is reserved (e.g.: \, *, [, etc).
    • [0-9] matches any digit in the range 0-9.
    • [0143] is a character class (a character set between brackets) limited to the digits 0, 1, 4 and 3.
    • \d matches all Arabic digits (\d == character shorthand).
    • \D matches any character that’s not a digit.
    • . matches any character (“wildcard”).

Capturing groups and back references

(\d)\d\1          707-827-7019
  • (\d) => capturing group, matches 7
  • \d => matches 0
  • \1 => back reference (matches a previous captured string, 7 in this case, which in turn was captured by (\d)).
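
To double-check the walkthrough above, here is a minimal sketch using Python's built-in re module. Python is my own pick for illustration (the notes don't name a language), and the variable names are made up.

```python
import re

phone = "707-827-7019"

# (\d) captures one digit, \d matches any digit,
# and \1 must repeat whatever the group captured.
backref_match = re.search(r"(\d)\d\1", phone)

print(backref_match.group())   # the whole match
print(backref_match.group(1))  # the text held by the capturing group
```

The first three characters, 707, are the only place in this number where a digit repeats with exactly one digit in between.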

Quantifiers

Numbers in curly braces tell the number of occurrences that you want to look for.

\d{3}-?\d{3}-?\d{4}
  • ? : zero or one occurrences quantifier.
  • + : one or more occurrences.
  • * : zero or more occurrences.

\d{3,4} : minimum and maximum quantity to match.
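
A small sketch of these quantifiers in Python's re module (my choice of language; the sample strings and variable names are invented for the example):

```python
import re

# Three digits, optional hyphen, three digits, optional hyphen, four digits.
phone_pattern = re.compile(r"\d{3}-?\d{3}-?\d{4}")

with_hyphens = phone_pattern.fullmatch("707-827-7019")
without_hyphens = phone_pattern.fullmatch("7078277019")
too_short = phone_pattern.fullmatch("707-827-701")

# {3,4} accepts runs of three or four digits, nothing shorter.
range_matches = re.findall(r"\d{3,4}", "12 123 1234 12345")

print(bool(with_hyphens), bool(without_hyphens), bool(too_short))
print(range_matches)
```

Note that the five-digit run still yields a four-digit match: the quantifier stops at its maximum and the leftover digit is too short to match on its own.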

Expression analysis: (\d{3,4}[.-]?)+
  • ( ) => capturing group
  • \d => digit shorthand
  • [ ] => character class
  • ? => zero or one quantifier
  • + => one or more quantifier
  • {3,4} => range quantifier
  • [.-] => literal dot and hyphen
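
The expression analyzed above can be tried out like this. It is only a sketch in Python's re module, with a made-up input string:

```python
import re

# One or more groups of 3-4 digits, each optionally followed by a dot or hyphen.
segments = re.compile(r"(\d{3,4}[.-]?)+")

segment_match = segments.search("call 707.827-7019 now")

print(segment_match.group())   # everything the repeated group covered
print(segment_match.group(1))  # a repeated group keeps only its last iteration
```

In Python, a capturing group that repeats keeps only the text of its final repetition, which is why group 1 holds just the last block of digits.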

Quoting literals

  • ^ : beginning of a line or string
  • $ : end of a line or string
  • | : alternation, given choice of alternatives
  • \X : literal X

Final expression (that matches an American phone number)

^(\(\d{3}\)|^\d{3}[.-]?)\d{3}[.-]?\d{4}
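
Here is a hedged sketch that runs the final expression, copied exactly as written above, against a few sample numbers in Python. The candidate strings are my own test data, not from the book.

```python
import re

us_phone = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)\d{3}[.-]?\d{4}")

candidates = [
    "(707)827-7019",
    "707-827-7019",
    "707.827.7019",
    "7078277019",
    "827-7019",      # no area code
]

# Map each candidate to True/False depending on whether the pattern matches.
phone_results = {c: us_phone.search(c) is not None for c in candidates}

for candidate, matched in phone_results.items():
    print(candidate, "=>", matched)
```

As written, the area-code group is mandatory, so the seven-digit number is rejected. The pattern also has no $ at the end, so trailing text after the number is not checked.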

Chapter 2: Simple pattern matching

Regex test tools

Pattern matching

  • Matching string literals: use string literals, doh!
  • Matching digits: [0-9], [059], \d
  • Matching non-digits: \D, [^0-9], [^\d]
  • Matching word and non-word characters:
    • All letters, digits and the underscore: \w
    • Non-word characters: \W
    • Word boundary: \b
    • Non-word boundary: \B
    • Carriage return: \r
    • Newline: \n
    • Whitespace character: \s
    • Horizontal tab: \t
    • Null character: \0
    • Horizontal whitespace: \h
    • Anything but horizontal whitespace: \H
  • Match any character: . (except newlines, unless a dotall modifier is set)
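
A few of the shorthands above, sketched with Python's re module (my choice of language). Python's re does not recognize \h and \H, so they are left out; the sample sentence is invented.

```python
import re

sample = "The ancyent Marinere, 1798"

digits = re.findall(r"\d", sample)            # every single digit
only_digits = re.sub(r"\D", "", sample)       # drop everything that is not a digit
words = re.findall(r"\w+", sample)            # runs of word characters
whitespace_count = len(re.findall(r"\s", sample))
m_words = re.findall(r"\bM\w*", sample)       # words starting with a capital M

# The dot skips newlines unless the DOTALL flag is set.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)

print(digits, only_digits, words, whitespace_count, m_words)
print(dot_default, dot_all)
```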

Chapter 3: Boundaries

  • Assertions mark boundaries, but they don’t consume characters, i.e., they won’t be returned in the results.
  • Boundaries don’t match characters, but locations in strings.
  • The beginning and end of a line:
    • ^ matches line or string beginning
    • $ matches line or string end
  • Word and non-word boundaries
    • \b matches word boundary
    • \B matches non-word boundary
  • Other anchors
    • \A : matches the start of a subject (the whole string, not each line)
    • \Z : matches the end of a subject (the whole string, not each line)
  • Quoting a group of characters as literals
    • Reserved characters can be quoted as literals in a pattern: \QXXXXXXXX\E
    • Where X are characters. E.g.: \Q$-\E will match $- in a string.
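
The anchors above can be sketched in Python like this. Two assumptions on my side: Python is just the language I picked, and since its re module has no \Q...\E, the sketch uses re.escape as the nearest equivalent for quoting literals.

```python
import re

two_lines = "first line\nsecond line"

# Without a flag, ^ only matches at the start of the whole string.
string_start = re.findall(r"^\w+", two_lines)
# With MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^\w+", two_lines, flags=re.MULTILINE)
# \A ignores MULTILINE and always means "start of the subject".
subject_start = re.findall(r"\A\w+", two_lines, flags=re.MULTILINE)

# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# re.escape quotes the reserved characters, so "$" is taken literally.
quoted = re.search(re.escape("$-"), "cost: $-5").group()

print(string_start, line_starts, subject_start, whole_words, quoted)
```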

Chapter 4: Alternation, groups and back references

  • Character classes help you match specific characters, or a sequence of specific characters.
    • \b[1][0-9]\b matches 10 to 19
  • Negated character class
    • [^a-z] matches everything except a-z (lowercase letters)
  • Union and difference: character classes can act like sets, so they can be combined (the nested-class syntax below works in Java, but not in every flavor).
    • [0-3[6-9]] = {0, 1, 2, 3} U {6, 7, 8, 9} = {0, 1, 2, 3, 6, 7, 8, 9}
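
A quick sketch of these character classes in Python (my choice of language). Python's re module does not support the nested union syntax, so the sketch writes the same set as two ranges in one class, which is an equivalent I substituted.

```python
import re

# 1 followed by any digit, as a whole word: 10 to 19.
teens = re.findall(r"\b[1][0-9]\b", "9 10 15 19 20 100")

# Negated class: anything that is not a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", "aB3-c")

# Same set as {0, 1, 2, 3} U {6, 7, 8, 9}, written as two ranges.
union = re.findall(r"[0-36-9]", "0123456789")

print(teens, not_lowercase, union)
```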

Chapter 5: Matching Unicode and other characters

  • Matching Unicode characters:
    • \u00e9 => matches é by its hexadecimal code point
  • \0 : null character
  • \e : escape character
  • [\b] : backspace
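
These escapes can be sketched in Python as follows. One assumption to flag: Python's re module does not accept \e, so the sketch uses \x1b (the same escape character written as a hex code) instead.

```python
import re

# \u00e9 in the pattern stands for the character é.
e_acute = re.findall(r"\u00e9", "caf\u00e9")

# \0 matches the null character.
null_position = re.search(r"\0", "a\x00b").start()

# Inside a character class, \b means backspace instead of word boundary.
backspace_found = re.search(r"[\b]", "a\bc") is not None

# Escape character, written as \x1b because Python's re has no \e.
escape_position = re.search(r"\x1b", "\x1b[0m").start()

print(e_acute, null_position, backspace_found, escape_position)
```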

Chapter 6: Quantifiers

  • Quantifiers are greedy by default: a greedy quantifier first tries to match the whole string, then backtracks one character at a time until the rest of the pattern matches. This backtracking can take a lot of resources.
  • A lazy quantifier starts at the beginning of the target and tries to match as little as possible. It grows the match one character at a time, and only as a last resort tries to match the whole string. A lazy quantifier must have a question mark ? appended.
  • A possessive quantifier grabs the whole target and tries to find a match in a single attempt, without backtracking. It must have a plus sign + appended.
  • * is called the Kleene star (LOL).
  • Examples:
    • 9* : matches zero or more 9s
    • 9+ : matches one or more 9s
    • 9? : matches one or zero 9s
  • Matching a specific number of times:
    • {x} : matches exactly x occurrences of the preceding item.
    • {x,y} : matches x to y occurrences of the preceding item.
    • {x,} : matches at least x occurrences of the preceding item.
  • Lazy quantifiers: they match as few characters as possible (including zero/no characters).
    • *? : lazy zero or more => tries 0 occurrences first
    • ?? : lazy zero or one => tries 0 occurrences first
    • +? : lazy one or more => tries 1 occurrence first
    • {n,}? : lazy n or more => tries n occurrences first
  • Possessive quantifiers: they grab as much as they can, but they don’t backtrack, i.e., they don’t give up anything they find. They are FASTER!
    • Input: 00000000 => 0.*+ : matches all
    • Steps: 0 - input[0]
    • 00 - input[0, 1]
    • 000 - input[0, 1, 2]
    • … - input[0, 1, … , n]
    • It has no backtracking: .*+0 matches nothing, because .*+ has already consumed every 0 and gives none back for the trailing 0.
    • Steps: 0 - input[0]
    • 0 - input[1]
    • 0 - input[2]
    • … - input[n]
  • Possessive quantifier syntax
    • ?+ : possessive zero or one
    • ++ : possessive one or more
    • *+ : possessive zero or more
    • {n}+ : possessive n
    • {n,}+ : possessive n or more
    • {m,n}+ : possessive m to n
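
To see the three kinds of quantifiers side by side, here is a sketch in Python (my choice of language; the sample strings are made up). Python's re module only accepts possessive quantifiers from version 3.11 on, so that part sits behind a version check.

```python
import re
import sys

markup = "<b>bold</b> and <i>italic</i>"
zeros = "00000000"

# Greedy: grabs as much as possible, then backtracks to the last ">".
greedy = re.search(r"<.+>", markup).group()
# Lazy: stops at the first ">" it can.
lazy = re.search(r"<.+?>", markup).group()

# Lazy quantifiers settle for the minimum they are allowed.
lazy_star = re.search(r"0*?", zeros).group()
lazy_optional = re.search(r"0??", zeros).group()
lazy_plus = re.search(r"0+?", zeros).group()
lazy_at_least_3 = re.search(r"0{3,}?", zeros).group()

# Greedy .* backtracks one character so the final 0 can match.
with_backtracking = re.search(r".*0", zeros).group()

possessive_results = None
if sys.version_info >= (3, 11):
    # Possessive .*+ keeps everything it grabbed.
    matches_all = re.search(r"0.*+", zeros).group()
    matches_nothing = re.search(r".*+0", zeros)  # nothing left for the last 0
    possessive_results = (matches_all, matches_nothing)

print(greedy, lazy, with_backtracking, possessive_results)
```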

Chapter 7: Lookarounds

  • Lookarounds are non-capturing groups that match patterns based on what they find before (lookbehind) or after (lookahead) a pattern.
  • Positive lookahead: finds an occurrence of a string that is followed by a certain string.
    • (?i)ancyent (?=marinere)
    • “Find the occurrences of ‘ancyent’ (case insensitive) that are followed by the word ‘marinere’”.
  • Negative lookahead: it tries to find a match that’s not followed by the lookahead.
    • (?i)ancyent (?!marinere)
    • “Find the occurrences of ‘ancyent’ (case insensitive) that are not followed by the word ‘marinere’”.
  • Positive lookbehind: finds an occurrence of a string that is preceded by a certain string.
    • (?i)(?<=ancyent) marinere
    • “Find the occurrences of ‘marinere’ (case insensitive) that are preceded by the word ‘ancyent’”.
  • Negative lookbehind: finds an occurrence of a string that is not preceded by a certain string.
    • (?i)(?<!ancyent) marinere
    • “Find the occurrences of ‘marinere’ (case insensitive) that are not preceded by the word ‘ancyent’”.
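
The four lookarounds above, sketched in Python (my choice of language). The sample sentence is invented, with mixed case so the matches are easy to tell apart.

```python
import re

poem = "An ANCYENT Marinere met an ancyent man and a young marinere"

# Positive lookahead: "ancyent " only when "marinere" comes next.
followed = re.findall(r"(?i)ancyent (?=marinere)", poem)
# Negative lookahead: "ancyent " only when "marinere" does not come next.
not_followed = re.findall(r"(?i)ancyent (?!marinere)", poem)
# Positive lookbehind: " marinere" only right after "ancyent".
preceded = re.findall(r"(?i)(?<=ancyent) marinere", poem)
# Negative lookbehind: " marinere" only when "ancyent" is not right before it.
not_preceded = re.findall(r"(?i)(?<!ancyent) marinere", poem)

print(followed, not_followed, preceded, not_preceded)
```

The text inside each lookaround is checked but never becomes part of the match, which is why the results contain only the word outside the parentheses.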

Chapter 8: Marking up a document with HTML

  • Regex modifiers and flags
    • /g : global matching (every match, not just the first)
    • /m : multiline strings (^ and $ also match at line breaks)
    • /i : case insensitive
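
The /g, /m and /i notation is Perl/sed style. As a rough sketch of the same ideas in Python (my choice of language, with a made-up two-line text), re.sub replaces every match by default, count=1 gives the non-global behaviour, and the other two modifiers become flags.

```python
import re

text = "THE RIME\nIt is an ancyent Marinere"

# Global + multiline: wrap every line in a paragraph tag.
all_lines = re.sub(r"^(.+)$", r"<p>\1</p>", text, flags=re.MULTILINE)
# Not global: stop after the first replacement.
first_line_only = re.sub(r"^(.+)$", r"<p>\1</p>", text, count=1, flags=re.MULTILINE)
# Without multiline, ^ and $ only see the ends of the whole string,
# and .+ cannot cross the newline, so nothing is replaced.
no_multiline = re.sub(r"^(.+)$", r"<p>\1</p>", text)

# Case insensitive matching.
any_case = re.findall(r"marinere", text, flags=re.IGNORECASE)

print(all_lines)
print(first_line_only)
print(any_case)
```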