Regular Expression (Regex) Tutorial

A regular expression (regex or regexp for short) is extremely powerful for searching and manipulating text strings, particularly in processing text files. One line of regex can easily replace several dozen lines of programming code.

Regex is supported in all the scripting languages (such as Perl, Python, PHP and JavaScript), in general-purpose programming languages such as Java, and even in word processors such as Word for searching text. Getting started with regex may not be easy due to its geeky syntax, but it is certainly worth the investment of your time.

Regex By Examples

This section is intended for those who need to refresh their memory. For beginners, go to the next section to learn the syntax, before looking at these examples.

Regex Syntax Summary

  • Character: All characters, except those that have a special meaning in regex, match themselves. For example, regex x matches the substring “x”; regex 9 matches “9”; regex = matches “=”; and regex @ matches “@”.
  • Special regex characters: These characters have a special meaning in regex (discussed later): ., +, *, ?, ^, $, (, ), [, ], {, }, |, \.
  • Escape sequences (\char):
    • To match a character that has a special meaning in regex, prefix it with a backslash (\) to form an escape sequence. For example, regex \. matches “.”; regex \+ matches “+”; and regex \( matches “(”.
    • You must also use regex \\ to match “\” (backslash).
    • Regex recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hexadecimal code, \uhhhh for a 4-digit Unicode, and \Uhhhhhhhh for an 8-digit Unicode.
  • A sequence of characters (or string): Strings can be matched by combining a sequence of characters (called subexpressions). For example, regex Saturday matches “Saturday”. Matching is case-sensitive by default, but can be set to case-insensitive using a modifier.
  • OR operator (|): For example, regex four|4 accepts the strings “four” or “4”.
  • Character class (or bracket list):
    • […]: Accept ANY ONE of the characters inside the brackets; for example, [aeiou] matches “a”, “e”, “i”, “o” or “u”.
    • [.-.] (range expression): Accept ANY ONE of the characters in the range; for example, [0-9] matches any digit; [A-Za-z] matches any uppercase or lowercase letter.
    • [^…]: Accept NONE of the characters listed; for example, [^0-9] matches any non-digit.
    • Only these four characters require an escape sequence within the bracket list: ^, -, ], \.
  • Occurrence indicators (or repeat operators):
    • +: one or more (1+); for example, [0-9]+ matches one or more digits such as “123”, “000”.
    • *: zero or more (0+); for example, [0-9]* matches zero or more digits. It accepts everything that [0-9]+ accepts, plus the empty string.
    • ?: zero or one (optional); for example, [+-]? matches an optional “+”, “-” or an empty string.
    • {m,n}: m to n (both inclusive)
    • {m}: exactly m times
    • {m,}: m or more (m+)
  • Metacharacters: each matches a single character
    • . (period): ANY character except newline. Same as [^\n]
    • \d, \D: ANY digit/non-digit character. The digits are [0-9]
    • \w, \W: ANY word/non-word character. For ASCII, the word characters are [a-zA-Z0-9_]
    • \s, \S: ANY space/non-space character. For ASCII, the whitespace characters are [ \n\r\t\f]

  • Position anchors: match not a character but a position, such as start of line, end of line, start of word and end of word.
    • ^, $: start of line and end of line respectively. For example, ^[0-9]+$ matches a non-empty numeric string.
    • \b: word boundary, i.e., start of word or end of word. For example, \bcat\b matches the word “cat” in the input string.
    • \B: inverse of \b, i.e., not a start of word or end of word.
    • \<, \>: start of word and end of word respectively, similar to \b. For example, \<cat\> matches the word “cat” in the input string.
    • \A, \Z: start of input and end of input respectively.
  • Parenthesized back references:
    • Use parentheses ( ) to create a back reference.
    • Use $1, $2, … (Java, Perl, JavaScript) or \1, \2, … (Python) to retrieve the back references in sequential order.
  • Laziness (curb greediness for repeat operators): *?, +?, ??, {m,n}?, {m,}?

Example: Numbers [0-9]+ or \d+

  1. A regex (regular expression) consists of a sequence of subexpressions. In this example, [0-9] and +.
  2. The […], known as a character class (or bracket list), encloses a list of characters. It matches any SINGLE character in the list. In this example, [0-9] matches any SINGLE character between 0 and 9 (i.e., one digit), where the hyphen (-) denotes the range.
  3. The +, known as an occurrence indicator (or repeat operator), indicates one or more occurrences (1+) of the previous subexpression. In this case, [0-9]+ matches one or more digits.
  4. A regex can match a part of the input (i.e., a substring) or the entire input. In fact, it could match zero or more substrings of the input (with the global modifier).
  5. This regex matches any numeric substring (of digits 0 to 9) of the input. For example,
    1. If the input is “abc123xyz”, it matches the substring “123”.
    2. If the input is “abcxyz”, it matches nothing.
    3. If the input is “abc00123xyz456_0”, it matches the substrings “00123”, “456” and “0” (three matches).

    Note that this regex matches numbers with leading zeros, such as “000”, “0123” and “0001”, which may not be desirable.

  6. You can also write \d+, where \d is a metacharacter that matches any digit (same as [0-9]). There is more than one way to write a regex! Note that many programming languages (C, Java, JavaScript, Python) use the backslash \ as the prefix for escape sequences in string literals (for example, \n for newline), so there you need to write “\\d+” instead.

Code examples (Python, Java, JavaScript, Perl, PHP)

Python code example

See “Python’s re module for Regular Expression” for complete coverage.

Python supports regex via the re module. Python also uses the backslash (\) for escape sequences (i.e., you need to write \\ for \, \\d for \d), but it supports raw strings in the form of r'…', which ignore the interpretation of escape sequences and are ideal for writing regex.

    # Try under the Python command-line interpreter
    $ python3
    ……
    >>> import re   # Need module 're' for regular expression

    # Try find: re.findall(regexStr, inStr) -> matchedSubstringsList
    # r'...' denotes raw strings which ignore escape code, i.e., r'\n' is '\'+'n'
    >>> re.findall(r'[0-9]+', 'abc123xyz')
    ['123']   # Return a list of matched substrings
    >>> re.findall(r'[0-9]+', 'abcxyz')
    []
    >>> re.findall(r'[0-9]+', 'abc00123xyz456_0')
    ['00123', '456', '0']
    >>> re.findall(r'\d+', 'abc00123xyz456_0')
    ['00123', '456', '0']

    # Try substitute: re.sub(regexStr, replacementStr, inStr) -> outStr
    >>> re.sub(r'[0-9]+', r'*', 'abc00123xyz456_0')
    'abc*xyz*_*'

    # Try substitute with count: re.subn(regexStr, replacementStr, inStr) -> (outStr, count)
    >>> re.subn(r'[0-9]+', r'*', 'abc00123xyz456_0')
    ('abc*xyz*_*', 3)   # Return a tuple of output string and count

Java code example

See “Regular Expressions (Regex) in Java” for complete coverage.

Java supports regex in the java.util.regex package.

    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    public class TestRegexNumbers {
       public static void main(String[] args) {

          String inputStr = "abc00123xyz456_0";  // Input string to match
          String regexStr = "[0-9]+";            // Regex to be matched

          // Step 1: Compile a regex via the static method Pattern.compile(); default is case-sensitive
          Pattern pattern = Pattern.compile(regexStr);
          // Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching

          // Step 2: Allocate a matching engine from the compiled regex pattern,
          //         and bind it to the input string
          Matcher matcher = pattern.matcher(inputStr);

          // Step 3: Perform matching and process the matching results
          // Try Matcher.find(), which finds the next match
          while (matcher.find()) {
             System.out.println("find() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
          }

          // Try Matcher.matches(), which tries to match the ENTIRE input (^...$)
          if (matcher.matches()) {
             System.out.println("matches() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
          } else {
             System.out.println("matches() found nothing");
          }

          // Try Matcher.lookingAt(), which tries to match from the START of the input (^...)
          if (matcher.lookingAt()) {
             System.out.println("lookingAt() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
          } else {
             System.out.println("lookingAt() found nothing");
          }

          // Try Matcher.replaceFirst(), which replaces the first match
          String replacementStr = "**";
          String outputStr = matcher.replaceFirst(replacementStr);  // first match only
          System.out.println(outputStr);

          // Try Matcher.replaceAll(), which replaces all matches
          replacementStr = "++";
          outputStr = matcher.replaceAll(replacementStr);  // all matches
          System.out.println(outputStr);
       }
    }

The output is:

    find() found substring "00123" starting at index 3 and ending at index 8
    find() found substring "456" starting at index 11 and ending at index 14
    find() found substring "0" starting at index 15 and ending at index 16
    matches() found nothing
    lookingAt() found nothing
    abc**xyz456_0
    abc++xyz++_++

Perl code example

See “Regular Expression (Regex) in Perl” for complete coverage.

Perl makes extensive use of regular expressions, with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (by default), in the form of /regex/. You can use the built-in operators:

  • m/regex/modifier or /regex/modifier: Match against the regex. The m is optional.
  • s/regex/replacement/modifier: Substitute the matched substring(s) with the replacement.

In Perl, you can write the regex in a single-quoted, non-interpolating string '…' to disable Perl's interpretation of the backslash (\).

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $inStr = 'abc00123xyz456_0';  # input string
    my $regex = '[0-9]+';            # regex pattern string in non-interpolating string

    # Try match /regex/modifiers (or m/regex/modifiers)
    my @matches = ($inStr =~ /$regex/g);  # Match $inStr against the regex with the global modifier
                                          # Store all matches in an array
    print "@matches\n";   # Output: 00123 456 0

    while ($inStr =~ /$regex/g) {
       # The built-in array variables @- and @+ keep the start and end positions
       # of the matches, where $-[0] and $+[0] is the full match, and
       # $-[n] and $+[n] for the back references $1, $2, etc.
       print substr($inStr, $-[0], $+[0] - $-[0]), ', ';   # Output: 00123, 456, 0,
    }
    print "\n";

    # Try substitute s/regex/replacement/modifiers
    $inStr =~ s/$regex/**/g;   # with global modifier
    print "$inStr\n";          # Output: abc**xyz**_**

JavaScript code example

See “Regular expression in JavaScript” for complete coverage.

In JavaScript (and Perl), a regex is delimited by a pair of forward slashes, in the form of /…/. There are two sets of methods, invoked via a RegExp object or a String object.

    <!DOCTYPE html>
    <!-- JSRegexNumbers.html -->
    <html lang="en">
    <head>
    <meta charset="utf-8">
    <title>JavaScript Example: Regex</title>
    <script>
    var inStr = "abc123xyz456_7_00";

    // Use RegExp.test(inStr) to check if inStr contains the pattern
    console.log(/[0-9]+/.test(inStr));  // true

    // Use String.search(regex) to check if the string contains the pattern
    // Returns the start position of the matched substring, or -1 if there is no match
    console.log(inStr.search(/[0-9]+/));  // 3

    // Use String.match() or RegExp.exec() to find the matched substring,
    //   back references, and string index
    console.log(inStr.match(/[0-9]+/));  // ["123", input:"abc123xyz456_7_00", index:3, length:"1"]
    console.log(/[0-9]+/.exec(inStr));   // ["123", input:"abc123xyz456_7_00", index:3, length:"1"]

    // With g (global) option
    console.log(inStr.match(/[0-9]+/g));  // ["123", "456", "7", "00", length:4]

    // RegExp.exec() with the g flag can be issued repeatedly.
    // The search resumes after the last-found position (maintained in the property RegExp.lastIndex).
    var pattern = /[0-9]+/g;
    var result;
    while (result = pattern.exec(inStr)) {
       console.log(result);
       console.log(pattern.lastIndex);
          // ["123"], 6
          // ["456"], 12
          // ["7"], 14
          // ["00"], 17
    }

    // String.replace(regex, replacement):
    console.log(inStr.replace(/\d+/, "**"));   // abc**xyz456_7_00
    console.log(inStr.replace(/\d+/g, "**"));  // abc**xyz**_**_**
    </script>
    </head>
    <body>
      <h1>Hello,</h1>
    </body>
    </html>

PHP code example [TODO]

Example: Full numeric strings ^[0-9]+$ or ^\d+$

  1. The leading ^ and the trailing $ are known as position anchors, which match the start and end positions of the line, respectively. As a result, the entire input string must be matched fully, instead of a part of the input string (substring).
  2. This regex matches any non-empty numeric string (comprising the digits 0 to 9), for example, “0” and “12345”. It does not match “” (empty string), “abc”, “a123”, “abc123xyz”, etc. However, it also matches “000”, “0123” and “0001” with leading zeros.
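As a minimal sketch of the anchored pattern in Python (the helper name is_numeric_string is made up for illustration), the check below accepts only strings made up entirely of digits:

```python
import re

# Anchored pattern from the example: one or more digits, nothing else.
NUMERIC = re.compile(r'^[0-9]+$')

def is_numeric_string(text):
    """Return True if text consists solely of one or more digits 0-9."""
    return NUMERIC.search(text) is not None

for sample in ['0', '12345', '000', '', 'abc', 'a123', 'abc123xyz']:
    print(repr(sample), is_numeric_string(sample))
```

Note that in Python, $ also matches just before a single trailing newline, so '123\n' would be accepted; re.fullmatch(r'[0-9]+', text) is a stricter alternative if that matters.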

Example: Positive integer literals [1-9][0-9]*|0 or [1-9]\d*|0

  1. [1-9] matches any character between 1 and 9; [0-9]* matches zero or more digits. The * is an occurrence indicator representing zero or more occurrences. Together, [1-9][0-9]* matches any number without a leading zero.
  2. | represents the OR operator, which is used to include the number 0.
  3. This expression matches “0” and “123”, but does not match “000” and “0123” (but see below).
  4. You can replace [0-9] with the metacharacter \d, but not [1-9].
  5. We did not use the position anchors ^ and $ in this regex. Hence, it can match any part of the input string. For example,
    1. If the input string is “abc123xyz”, it matches the substring “123”.
    2. If the input string is “abcxyz”, it matches nothing.
    3. If the input string is “abc123xyz456_0”, it matches the substrings “123”, “456” and “0” (three matches).
    4. If the input string is “0012300”, it matches the substrings “0”, “0” and “12300” (three matches)!
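The substring behavior described above can be checked with a short Python sketch (find_integers is a hypothetical helper name, not part of any library):

```python
import re

# Unanchored pattern from the example: a number without leading zero, or a lone 0.
POSITIVE_INT = re.compile(r'[1-9][0-9]*|0')

def find_integers(text):
    """Return every non-overlapping substring matched by the pattern."""
    return POSITIVE_INT.findall(text)

print(find_integers('abc123xyz456_0'))  # ['123', '456', '0']
print(find_integers('abcxyz'))          # []
print(find_integers('0012300'))         # ['0', '0', '12300']
```

The last call shows why the unanchored form can surprise: each leading zero is matched on its own by the second alternative before the first alternative picks up “12300”.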

Example: Full integer literals ^[+-]?[1-9][0-9]*|0$ or ^[+-]?[1-9]\d*|0$

  1. This regex matches an integer literal (for the entire string, with the position anchors): positive, negative or zero.
  2. [+-] matches either the + or the - sign. ? is an occurrence indicator denoting 0 or 1 occurrence, i.e., optional. Hence, [+-]? matches an optional leading + or - sign.
  3. Note that | has the lowest precedence, so as written the regex reads as “^[+-]?[1-9][0-9]* OR 0$”. To anchor both alternatives to the entire string, group them in parentheses: ^[+-]?([1-9][0-9]*|0)$.
  4. We have covered three occurrence indicators: + for one or more, * for zero or more, and ? for zero or one.
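One point worth checking in code is operator precedence: the OR operator binds more loosely than the anchors, so the pattern as written is read as two separate alternatives. The sketch below compares it with a grouped variant; the grouped pattern and the helper name are additions for illustration, not taken from the original example.

```python
import re

# Pattern as given in the example heading.
UNGROUPED = re.compile(r'^[+-]?[1-9][0-9]*|0$')
# Assumed fix: parentheses make both anchors apply to both alternatives.
GROUPED = re.compile(r'^[+-]?([1-9][0-9]*|0)$')

def accepts(pattern, text):
    """Return True if the pattern finds a match anywhere in text."""
    return pattern.search(text) is not None

for sample in ['+123', '-45', '0', '123abc', 'abc0', '007']:
    print(sample, accepts(UNGROUPED, sample), accepts(GROUPED, sample))
```

Both patterns accept “+123”, “-45” and “0” and reject “007”, but only the ungrouped one also accepts “123abc” (via the first alternative) and “abc0” (via the second).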

Example: Identifiers (or names) [a-zA-Z_][0-9a-zA-Z_]* or [a-zA-Z_]\w*

  1. Begin with one letter or underscore, followed by zero or more digits, letters and underscores.
  2. You can use the metacharacter \w for a word character [a-zA-Z0-9_]. Recall that the metacharacter \d can be used for a digit [0-9].
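A small Python sketch of the identifier check follows; is_identifier is a made-up helper name. In Python 3, \w also matches non-ASCII letters and digits in str patterns by default, so the re.ASCII flag is used here to keep \w equal to [a-zA-Z0-9_] as described above.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_].
IDENTIFIER = re.compile(r'[a-zA-Z_]\w*', re.ASCII)

def is_identifier(name):
    """Return True if the whole string is a valid identifier per the pattern."""
    return IDENTIFIER.fullmatch(name) is not None

for sample in ['count', '_tmp1', '2fast', 'my-var', '']:
    print(repr(sample), is_identifier(sample))
```

fullmatch is used instead of adding ^ and $ because the pattern in the example is unanchored and would otherwise match the “fast” inside “2fast”.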

Example: Image filenames ^\w+\.(gif|png|jpg|jpeg)$

  1. The position anchors ^ and $ match the beginning and the end of the input string, respectively. That is, this regex must match the entire input string, instead of a part of the input string (substring).
  2. \w+ matches one or more word characters (same as [a-zA-Z0-9_]+).
  3. \. matches the dot (.) character. We need to use \. to represent . because . has a special meaning in regex. The \ is known as the escape code, which restores the original literal meaning of the following character. Similarly, *, +, ? (occurrence indicators), ^, $ (position anchors) have special meanings in regex. You need to use an escape code to match these characters.
  4. (gif|png|jpg|jpeg) matches “gif”, “png”, “jpg” or “jpeg”. The | denotes the OR operator. The parentheses are used for grouping the alternatives.
  5. The modifier i after the regex specifies case-insensitive matching (applicable to some languages such as Perl and JavaScript only). That is, it accepts “test.GIF” and “TesT.Gif”.
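In Python there is no trailing i modifier; a rough equivalent, assuming the re.IGNORECASE flag stands in for it, looks like this (is_image_filename is a hypothetical helper name):

```python
import re

# re.IGNORECASE plays the role of the i modifier.
IMAGE_FILE = re.compile(r'^\w+\.(gif|png|jpg|jpeg)$', re.IGNORECASE)

def is_image_filename(name):
    """Return True if name is word characters followed by an image extension."""
    return IMAGE_FILE.search(name) is not None

for sample in ['test.GIF', 'TesT.Gif', 'photo.jpeg', 'notes.txt',
               'my photo.png', 'archive.png.zip']:
    print(sample, is_image_filename(sample))
```

The last two samples are rejected because a space is not a word character and because the extension must be the very end of the string.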

Example: E-mail addresses ^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$

  1. The position anchors ^ and $ match the beginning and the end of the input string, respectively. That is, this regex must match the entire input string, instead of a part of the input string (substring).
  2. \w+ matches 1 or more word characters (same as [a-zA-Z0-9_]+).
  3. [.-]? matches an optional character . or -. Although the period (.) has a special meaning in regex, in a character class (square brackets) any character except ^, -, ] or \ is a literal and does not require an escape sequence (a - placed last, as here, is also taken literally).
  4. ([.-]?\w+)* matches 0 or more occurrences of [.-]?\w+.
  5. The subexpression \w+([.-]?\w+)* is used to match the username in the email, before the @ sign. It begins with at least one word character [a-zA-Z0-9_], followed by more word characters or . or -. However, a . or - must be followed by a word character [a-zA-Z0-9_]. That is, the input string cannot begin with . or -, and cannot contain “..”, “--”, “.-” or “-.”. An example of a valid string is “a.1-2-3”.
  6. The @ matches itself. In regex, all characters other than those that have special meanings match themselves, for example, a matches a, b matches b, etc.
  7. Again, the subexpression \w+([.-]?\w+)* is used to match the email domain name, with the same pattern as the username described above.
  8. The subexpression \.\w{2,3} matches a . followed by two or three word characters, for example, “.com”, “.edu”, “.us”, “.uk”, “.co”.
  9. (\.\w{2,3})+ specifies that the above subexpression could occur one or more times, for example, “.com”, “.co.uk”, “.edu.sg”, and so on.
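To see which strings this pattern accepts, here is a brief Python sketch. The sample addresses and the helper name looks_like_email are invented for illustration, and the pattern is the tutorial's teaching example rather than a complete email validator.

```python
import re

# Pattern copied from the example heading.
EMAIL = re.compile(r'^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$')

def looks_like_email(address):
    """Return True if address matches the example pattern."""
    return EMAIL.search(address) is not None

for sample in ['a.1-2-3@example.com', 'john@mail.co.uk', '.john@example.com',
               'john..doe@example.com', 'john@example', 'john@example.info']:
    print(sample, looks_like_email(sample))
```

Only the first two samples are accepted. The last one fails because \w{2,3} allows at most three characters after the final dot, which is one of the limits of this simplified pattern.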

Exercise: Interpret this regex, which provides another representation of the email address: ^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-z0-9]{2,4}$.

Example: Swapping words using parenthesized back references ^(\S+)\s+(\S+)$ and $2 $1

  1. The ^ and $ match the beginning and the end of the input string, respectively.
  2. The \s (lowercase s) matches a whitespace (blank, tab \t, and newline \r or \n). On the other hand, the \S+ (uppercase S) matches anything that is NOT matched by \s, i.e., non-whitespace. In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart, for example, \w for word character and \W for non-word character; \d for digit and \D for non-digit.
  3. The above regex matches two words (without whitespace) separated by one or more whitespaces.
  4. Parentheses () have two meanings in regex:
    1. to group subexpressions, for example, (abc)*
    2. to provide a so-called back reference for capturing and extracting matches.
  5. The parentheses in (\S+), called a parenthesized back reference, are used to extract the matched substring from the input string. In this regex, there are two (\S+), matching the first two words, separated by one or more whitespaces \s+. The two matched words are extracted from the input string and typically kept in special variables $1 and $2 (or \1 and \2 in Python), respectively.
  6. To swap the two words, you can access the special variables and print “$2 $1” (via a programming language), or use the substitution operator “s/(\S+)\s+(\S+)/$2 $1/” (in Perl).
Python code example

Python keeps the parenthesized back references in \1, \2, …. In a replacement string, \g<0> refers to the entire match.

    $ python3
    >>> import re
    >>> re.findall(r'^(\S+)\s+(\S+)$', 'apple orange')
    [('apple', 'orange')]   # A list of tuples if the pattern has more than one back reference
    # Back references are kept in \1, \2, \3, etc.
    >>> re.sub(r'^(\S+)\s+(\S+)$', r'\2 \1', 'apple orange')   # Prefix r for raw string, which ignores escape
    'orange apple'
    >>> re.sub(r'^(\S+)\s+(\S+)$', '\\2 \\1', 'apple orange')  # Need to use \\ for \ in a regular string
    'orange apple'

Java code example

Java keeps the parenthesized back references in $1, $2, ….

    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    public class TestRegexSwapWords {
       public static void main(String[] args) {
          String inputStr = "apple orange";
          String regexStr = "^(\\S+)\\s+(\\S+)$";   // Regex pattern to be matched
          String replacementStr = "$2 $1";          // Replacement pattern with back references

          // Step 1: Allocate a Pattern object to compile the regex
          Pattern pattern = Pattern.compile(regexStr);

          // Step 2: Allocate a Matcher object from the Pattern, and provide the input
          Matcher matcher = pattern.matcher(inputStr);

          // Step 3: Perform the matching and process the matching result
          String outputStr = matcher.replaceFirst(replacementStr);  // first match only
          System.out.println(outputStr);   // Output: orange apple
       }
    }

Example: HTTP addresses ^http:\/\/\S+(\/\S+)*(\/)?$

  1. Begin with http://. Note that you may need to write / as \/ with an escape code in some languages (JavaScript, Perl).
  2. Followed by \S+, one or more non-whitespace characters, for the domain name.
  3. Followed by (\/\S+)*, zero or more “/…”, for the subdirectories.
  4. Followed by (\/)?, an optional (0 or 1) trailing /, for a directory request.
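The HTTP address pattern can be tried in Python as sketched below. Python does not use / as a regex delimiter, so the slashes are written without escapes; the sample URLs and the helper name are made up for illustration.

```python
import re

# Same pattern as the example, with / left unescaped for Python.
HTTP_URL = re.compile(r'^http://\S+(/\S+)*(/)?$')

def looks_like_http_url(url):
    """Return True if url matches the example pattern."""
    return HTTP_URL.search(url) is not None

for sample in ['http://example.com', 'http://example.com/a/b/',
               'https://example.com', 'http://exa mple.com', 'http://']:
    print(sample, looks_like_http_url(sample))
```

Because \S also matches “/”, the first \S+ can already consume the subdirectories; the later groups mainly document the intended structure.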

Example: Regex patterns in AngularJS

The following rather complex regex patterns are used by AngularJS, in JavaScript syntax:

    var ISO_DATE_REGEXP = /^\d{4,}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+(?:[+-][0-2]\d:[0-5]\d|Z)$/;
    var URL_REGEXP = /^[a-z][a-z\d.+-]*:\/*(?:[^:@]+(?::[^@]+)?@)?(?:[^\s:/?#]+|\[[a-f\d:]+])(?::\d+)?(?:\/[^?#]*)?(?:\?[^#]*)?(?:#.*)?$/i;
    var EMAIL_REGEXP = /^(?=.{1,254}$)(?=.{1,64}@)[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$/;
       // Matches uppercase and lowercase letters and single quotes, but not double quotes
    var NUMBER_REGEXP = /^\s*(-|\+)?(\d+|(\d*(\.\d*)))([eE][+-]?\d+)?\s*$/;
    var DATE_REGEXP = /^(\d{4,})-(\d{2})-(\d{2})$/;
    var DATETIMELOCAL_REGEXP = /^(\d{4,})-(\d\d)-(\d\d)T(\d\d):(\d\d)(?::(\d\d)(\.\d{1,3})?)?$/;
    var WEEK_REGEXP = /^(\d{4,})-W(\d\d)$/;
    var MONTH_REGEXP = /^(\d{4,})-(\d\d)$/;
    var TIME_REGEXP = /^(\d\d):(\d\d)(?::(\d\d)(\.\d{1,3})?)?$/;

Example: Sample regexes in Perl

    s/^\s+//        # Remove leading whitespace (substitute with empty string)
    s/\s+$//        # Remove trailing whitespace
    s/^\s+|\s+$//g  # Remove leading and trailing whitespace

Regular Expression (Regex) Syntax

A regular expression (or regex) is a pattern (or filter) that describes a set of strings that match the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.

A regex consists of a sequence of characters, metacharacters (such as ., \d, \D, \s, \S, \w, \W) and operators (such as +, *, ?, |, ^). It is constructed by combining many smaller subexpressions.

Single character matching

The fundamental building blocks of a regex are patterns that match a single character. Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. For example, regex x matches the substring “x”; z matches “z”; and 9 matches “9”.

Non-alphanumeric characters with no special meaning in regex also match themselves. For example, = matches “=”; @ matches “@”.

Regex Special Characters and Escape Sequences

Regex Special Characters

These characters have a special meaning in regex (discussed in detail in later sections):

  • metacharacter: period (.)
  • bracket list: [ ]
  • position anchors: ^, $
  • occurrence indicators: +, *, ?, { }
  • parentheses: ( )
  • or: |
  • escape and metacharacter: backslash (\)

Escape Sequences

The characters listed above have special meanings in regex. To match one of these characters, we need to prepend it with a backslash (\), known as an escape sequence. For example, \+ matches “+”; \[ matches “[”; and \. matches “.”.

Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hexadecimal code, \uhhhh for a 4-digit Unicode, and \Uhhhhhhhh for an 8-digit Unicode.

Python code example

    $ python3
    >>> import re   # Need module 're' for regular expression

    # Try find: re.findall(regexStr, inStr) -> matchedStrList
    # r'...' denotes raw strings which ignore escape code, i.e., r'\n' is '\'+'n'
    >>> re.findall(r'a', 'abcabc')
    ['a', 'a']

    >>> re.findall(r'=', 'abc=abc')    # '=' is not a special regex character
    ['=']

    >>> re.findall(r'\.', 'abc.com')   # '.' is a special regex character, need regex escape sequence
    ['.']

    >>> re.findall('\\.', 'abc.com')   # Need to write \\ for \ in a regular Python string
    ['.']

JavaScript code example [TODO]

Java code example [TODO]

Matching a character sequence (string or text)

Subexpressions

A regex is constructed by combining many smaller subexpressions or atoms. For example, regex Friday matches the string “Friday”. Matching is case-sensitive by default, but can be set to case-insensitive using a modifier.

OR (|) Operator

You can provide alternatives using the OR operator, denoted by a vertical bar '|'. For example, regex four|for|floor|4 accepts the strings “four”, “for”, “floor” or “4”.

Bracket list (character class) […], [^…], [.-.]

A bracket expression is a list of characters enclosed by [ ], also called a character class. It matches ANY ONE character in the list. However, if the first character of the list is the caret (^), then it matches ANY ONE character NOT in the list. For example, the regex [02468] matches a single digit 0, 2, 4, 6 or 8; the regex [^02468] matches any single character other than 0, 2, 4, 6 or 8.

Instead of listing all characters, you can use a range expression inside the brackets. A range expression consists of two characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You can include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].

Most of the special regex characters lose their meaning inside the bracket list and can be used as they are, except ^, -, ] or \.

  • To include a ], place it first in the list, or use the escape \].
  • To include a ^, place it anywhere but first, or use the escape \^.
  • To include a -, place it last, or use the escape \-.
  • To include a \, use the escape \\.
  • No escape is needed for the other characters such as ., +, *, ?, (, ), {, }, etc., inside the bracket list.
  • You can also include metacharacters (explained in the next section), such as \w, \W, \d, \D, \s, \S, inside the bracket list.
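The bracket-list rules above can be exercised with a few re.findall calls in Python. This is a small illustrative sketch; the sample text and variable names are invented for the demonstration.

```python
import re

text = 'a1-b2^c3]d4'

even_digits = re.findall(r'[02468]', text)     # any one of the listed digits
odd_of_1234 = re.findall(r'[^02468]', '1234')  # any character NOT in the list
a_to_d = re.findall(r'[a-d]', text)            # range expression
# ] is escaped, ^ is not first, and - is last, so all three are literals.
specials = re.findall(r'[\]^-]', text)
# . and + need no escape inside a bracket list.
dot_plus = re.findall(r'[.+*?]', 'a.b+c')

print(even_digits, odd_of_1234, a_to_d, specials, dot_plus)
```

Each result is a list of single characters in the order they appear in the input, for example ['-', '^', ']'] for the specials.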
Named character classes in bracket list (Perl only?)

Named (POSIX) character classes are predefined within bracket expressions. They are:

  • [:alnum:], [:alpha:], [:digit:]: letters+digits, letters, digits.
  • [:xdigit:]: hexadecimal digits.
  • [:lower:], [:upper:]: lowercase/uppercase letters.
  • [:cntrl:]: control characters.
  • [:graph:]: printable characters, except space.
  • [:print:]: printable characters, including space.
  • [:punct:]: printable characters, excluding letters and digits.
  • [:space:]: whitespace.

For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)

Metacharacters ., \w, \W, \d, \D, \s, \S

A metacharacter is a symbol with a special meaning inside a regex.

  • The metacharacter dot (.) matches any single character except newline \n (same as [^\n]). For example, ... matches any 3 characters (including alphabets, numbers, whitespaces, but except newline); the.. matches “there”, “these”, “the  ”, and so on.
  • \w (word character) matches any single letter, number or underscore (same as [a-zA-Z0-9_]). The uppercase counterpart \W (non-word character) matches any single character that is not matched by \w (same as [^a-zA-Z0-9_]).
  • In regex, the uppercase metacharacter is always the inverse of the lowercase counterpart.
  • \d (digit) matches any single digit (same as [0-9]). The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]).
  • \s (space) matches any single whitespace (same as [ \t\n\r\f]: blank, tab, newline, carriage return and form feed). The uppercase counterpart \S (non-space) matches any single character that is not matched by \s (same as [^ \t\n\r\f]).

Examples:

    \s\s      # Matches two spaces
    \S\S\s    # Two non-spaces followed by a space
    \s+       # One or more spaces
    \S+\s\S+  # Two words (non-spaces) separated by a space

Backslash (\) and Regex Escape Sequences

Regex uses the backslash (\) for two purposes:

  1. for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word).
  2. to escape special regex characters, for example, \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ for \ in regex to avoid ambiguity.

Regex also recognizes \n for newline, \t for tab, etc.

Note that in many programming languages (C, Java, Python), the backslash (\) is also used for escape sequences in strings, for example, “\n” for newline, “\t” for tab, and you also need to write “\\” for \. Consequently, to write the regex pattern \\ (which matches one \) in these languages, you need to write “\\\\” (two levels of escape!). Similarly, you need to write “\\d” for the regex metacharacter \d. This is cumbersome and error-prone!
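The two levels of escaping can be seen side by side in Python, where raw strings remove the string-level layer. This is a minimal sketch; the sample path is made up.

```python
import re

path = r'C:\temp'   # the actual string contains exactly one backslash

# Regex \\ (match one backslash) written as a normal string: four backslashes.
normal_hits = re.findall('\\\\', path)
# The same regex written as a raw string: two backslashes.
raw_hits = re.findall(r'\\', path)

# Regex \d+ written both ways.
normal_digits = re.findall('\\d+', 'abc123')
raw_digits = re.findall(r'\d+', 'abc123')

print(len(normal_hits), len(raw_hits), normal_digits, raw_digits)
```

Both spellings compile to the same regex, so each pair of calls returns the same result; the raw-string form is simply easier to read.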

Occurrence indicators (repeat operators): +, *, ?, {m}, {m,n}, {m,}

A regex subexpression can be followed by an occurrence indicator (also known as a repeat operator):

  • ?: The preceding item is optional and matched at most once (i.e., occurs 0 or 1 times).
  • *: The preceding item will be matched zero or more times, i.e., 0+
  • +: The preceding item will be matched one or more times, i.e., 1+
  • {m}: The preceding item is matched exactly m times.
  • {m,}: The preceding item is matched m or more times, i.e., m+
  • {m,n}: The preceding item is matched at least m times, but not more than n times.

For example, the regex xy{2,4} accepts “xyy”, “xyyy” and “xyyyy”.

Modifiers

You can apply modifiers to a regex to tailor its behavior, such as global, case-insensitive, multiline, etc. The ways to apply modifiers differ among languages.

In Perl, you can attach modifiers after a regex, in the form of /…/modifiers. For example:

    m/abc/i   # case-insensitive matching
    m/abc/g   # global (match ALL instead of the first match)

In Java, modifiers are applied when compiling the regex Pattern. For example,

    Pattern p1 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching
    Pattern p2 = Pattern.compile(regex, Pattern.MULTILINE);         // for multiline input string
    Pattern p3 = Pattern.compile(regex, Pattern.DOTALL);            // dot (.) matches all characters including newline

The most commonly used modifier modes are:

  • Case-insensitive mode (or i): Letters are matched regardless of case.
  • Global (or g): Matches all occurrences instead of only the first match.
  • Multiline mode (or m): Affects ^ and $. In multiline mode, ^ matches at the start of each line as well as the start of the input; $ matches at the end of each line as well as the end of the input. \A and \Z are not affected: \A matches only the start of the input, and \Z only the end of the input.
  • Single-line mode (or s): Dot (.) matches all characters, including the newline.
  • Comment mode (or x): Allows and ignores embedded comments starting with # up to the end of line (EOL).
  • more…
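The modes listed above can be tried out in Python, where they are passed as flags of the re module. This is a minimal sketch; the sample text and variable names are made up for illustration. Note that Python has no g modifier: functions such as re.findall() return all matches instead.

```python
import re

text = "Hello\nhello world"  # hypothetical sample input

# i: case-insensitive matching
ci = re.findall(r"hello", text, re.IGNORECASE)

# m: ^ and $ also match at the start and end of each line
ml = re.findall(r"^\w+$", text, re.MULTILINE)

# s: without DOTALL the dot does not match the newline; with it, it does
dot_default = re.search(r"Hello.hello", text)
dot_all = re.search(r"Hello.hello", text, re.DOTALL)

# x: whitespace and # comments inside the pattern are ignored
verbose = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)

print(ci)                                      # ['Hello', 'hello']
print(ml)                                      # ['Hello']
print(dot_default, dot_all.group())
print(verbose.search("call 555-1234").group())  # 555-1234
```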

Greediness of Repetition Operators *, +, ?, {m,n}: Repetition operators are greedy operators and, by default, grab as many characters as possible for a match. For example, the regex xy{2,4} tries to match “xyyyy”, then “xyyy”, and then “xyy”.
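A quick way to observe this greedy behavior is to run xy{2,4} against an input with more y's than the upper bound. The sketch below uses Python's re module; the input strings are chosen only for illustration.

```python
import re

# Six y's are available, but {2,4} takes at most four of them.
greedy = re.search(r"xy{2,4}", "xyyyyyy")
print(greedy.group())   # xyyyy

# Only one y is available, so the minimum of two cannot be met.
too_few = re.search(r"xy{2,4}", "xy")
print(too_few)          # None
```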

Lazy quantifiers *?, +?, ??, {m,n}?, {m,}?: You can put an extra ? after the repetition operators to curb their greediness (i.e., stop at the shortest match). For example,

input = “The instances <code>first</code> and <code>second</code>”
regex = <code>.*</code> matches “<code>first</code> and <code>second</code>”
But
regex = <code>.*?</code> produces two matches: “<code>first</code>” and “<code>second</code>”
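The same comparison can be reproduced in Python. This is a small sketch using re.findall(), which returns every non-overlapping match:

```python
import re

text = "The instances <code>first</code> and <code>second</code>"

# Greedy: .* runs to the LAST </code>, giving one long match
greedy = re.findall(r"<code>.*</code>", text)

# Lazy: .*? stops at the FIRST </code>, giving two short matches
lazy = re.findall(r"<code>.*?</code>", text)

print(greedy)  # ['<code>first</code> and <code>second</code>']
print(lazy)    # ['<code>first</code>', '<code>second</code>']
```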

Backtracking: If a regex reaches a state where a match cannot be completed, it backtracks by giving up one character of a greedy match. For example, if the regex z*zzz is matched against the string “zzzz”, the z* first matches “zzzz”; backtracks to match “zzz”; backtracks to match “zz”; and finally backtracks to match “z”, so that the rest of the pattern can find a match.
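The end result of this backtracking can be made visible by wrapping z* in parentheses and checking what it finally holds. A minimal sketch in Python (the capture group is added here only for inspection):

```python
import re

# z* starts by taking all four z's, then gives them back one at a time
# until the trailing zzz can match.
m = re.fullmatch(r"(z*)zzz", "zzzz")
print(m.group(1))  # z
```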

Possessive quantifiers *+, ++, ?+, {m,n}+, {m,}+: You can append an extra + to the repetition operators to disable backtracking, even if it may result in a match failure. For example, z++z will not match “zzzz”. This feature may not be supported in some languages.
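Python's re module only gained possessive quantifiers in version 3.11, so the sketch below does not use z++ directly. Instead it emulates it with a different, commonly used technique: a lookahead that captures the run of z's, followed by a back reference that consumes exactly that run. This is an illustration of the idea, not the native syntax.

```python
import re

# Ordinary greedy quantifier: z+ gives back one z so the final z can match.
greedy = re.search(r"z+z", "zzzz")
print(greedy.group())    # zzzz

# Emulated z++z: the lookahead captures the longest run of z's and
# \1 consumes it as a whole, leaving nothing for the final z.
possessive_like = re.search(r"(?=(z+))\1z", "zzzz")
print(possessive_like)   # None
```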

Position anchors ^, $, \b, \B, \<, \>, \A, \Z

Position anchors do NOT match actual characters; they match a position in the string, such as the start of line, end of line, start of word, and end of word.

  • ^ and $: The ^ matches the start of the line. The $ matches the end of the line excluding the newline, or the end of the input (for input that does not end with a newline). These are the most commonly used position anchors. For example,
    ing$            # ending with 'ing'
    ^testing 123$   # matches only one string; use an equality comparison instead
    ^[0-9]+$        # numeric string
  • \b and \B: The \b matches a word boundary (i.e., the start or the end of a word); \B is the inverse of \b and matches a non-word boundary. For example,
    \bcat\b   # matches the word "cat" in the input "This is a cat."
              # but does not match the input "This is a catalog."
  • \< and \>: The \< and \> match the start of a word and the end of a word, respectively (compared with \b, which can match both the start and the end of a word).
  • \A and \Z: The \A matches the start of the input. The \Z matches the end of the input. They differ from ^ and $ when matching input with multiple lines. In multiline mode, ^ matches at the start of the string and after each line break, while \A only matches at the start of the string; $ matches at the end of the string and before each line break, while \Z only matches at the end of the string. For example,
    $ python3
    >>> import re
    # Using ^ and $ in multiline mode
    >>> p1 = re.compile(r'^.+$', re.MULTILINE)  # . for any character except newline
    >>> p1.findall('testing\ntesting')
    ['testing', 'testing']
    >>> p1.findall('testing\ntesting\n')
    ['testing', 'testing']
    # ^ matches at the start of the input or after each line break
    # $ matches at the end of the input or before each line break
    # The newlines are NOT included in the matches

    # Using \A and \Z in multiline mode
    >>> p2 = re.compile(r'\A.+\Z', re.MULTILINE)
    >>> p2.findall('testing\ntesting')
    []   # This pattern does not match the internal \n
    >>> p3 = re.compile(r'\A.+\n.+\Z', re.MULTILINE)  # to match the internal \n
    >>> p3.findall('testing\ntesting')
    ['testing\ntesting']
    >>> p3.findall('testing\ntesting\n')
    []   # This pattern does not match the trailing \n
    # \A matches the start of the input and \Z matches the end of the input
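The word-boundary anchors can be checked the same way. The sketch below uses Python's re module, which does not treat \< and \> as word anchors, so only \b and \B are shown; the input “concatenate” is an extra example chosen for illustration.

```python
import re

# \b: "cat" must stand alone as a word
hits = re.findall(r"\bcat\b", "This is a cat.")
misses = re.findall(r"\bcat\b", "This is a catalog.")

# \B: "cat" must be surrounded by word characters on both sides
inside = re.findall(r"\Bcat\B", "concatenate")

print(hits)    # ['cat']
print(misses)  # []
print(inside)  # ['cat']
```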

Capturing matches via parenthesized back references and matched variables $1, $2, …

Parentheses ( ) serve two purposes in regex:

  1. First, parentheses ( ) can be used to group subexpressions to override precedence or to apply a repetition operator. For example, (abc)+ (accepts abc, abcabc, abcabcabc, …) is different from abc+ (accepts abc, abcc, abccc, …).
  2. Secondly, parentheses are used to provide so-called back references (or capturing groups). A back reference contains the matched substring. For example, the regex (\S+) creates one back reference (\S+), which contains the first word (consecutive non-spaces) in the input string; the regex (\S+)\s+(\S+) creates two back references, (\S+) and another (\S+), which contain the first two words, separated by one or more spaces \s+.

These back references (or capturing groups) are stored in special variables $1, $2, … (or \1, \2, … in Python), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, (\S+)\s+(\S+) creates two back references that match the first two words. The matched words are stored in $1 and $2 (or \1 and \2), respectively.
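In Python, the captured substrings can also be read from the match object by group number. The following is a minimal sketch with made-up input strings; group(0) is the whole match, and \1 inside a pattern refers back to the first group.

```python
import re

# Two capturing groups: the first two words
m = re.search(r"(\S+)\s+(\S+)", "apple   orange banana")
first, second = m.group(1), m.group(2)
whole = m.group(0)
print(first, second)   # apple orange

# \1 inside the pattern must repeat whatever group 1 matched
repeated = re.search(r"(\w+) \1", "hello hello world")
print(repeated.group(0))  # hello hello
```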

Back references are important for manipulating strings. They can be used in the substitution string as well as in the pattern. For example,

# Swap the first and second words separated by one space
s/(\S+) (\S+)/$2 $1/;                      # Perl
re.sub(r'(\S+) (\S+)', r'\2 \1', inStr)    # Python

# Remove duplicate word
s/(\w+) \1/$1/;                            # Perl (\1 inside the pattern, $1 in the replacement)
re.sub(r'(\w+) \1', r'\1', inStr)          # Python

(Advanced) Lookahead/Lookbehind, groupings and conditional

These features may not be supported in some languages.

Positive lookahead (?=pattern)

The (?=pattern) is known as positive lookahead. It performs the match but does not capture it, returning only the result: match or no match. It is also called an assertion, as it does not consume any characters in matching. For example, AngularJS uses the following complex regex to match email addresses:

^(?=.{1,254}$)(?=.{1,64}@)[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$

The first positive lookahead, (?=.{1,254}$), sets the maximum length to 254 characters. The second positive lookahead, (?=.{1,64}@), sets a maximum of 64 characters before the ‘@’ sign for the username.
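The way the two lookaheads impose length limits without consuming input can be seen on a much smaller pattern. The sketch below is a hypothetical, simplified stand-in (not the AngularJS regex): it allows at most 12 characters overall and at most 5 characters before the ‘@’.

```python
import re

# Hypothetical simplified pattern, for illustration only:
#   (?=.{1,12}$)  overall length of 1 to 12 characters
#   (?=.{1,5}@)   1 to 5 characters before the '@'
pattern = re.compile(r"^(?=.{1,12}$)(?=.{1,5}@)[a-z]+@[a-z]+$")

ok = pattern.match("bob@site")
too_long = pattern.match("bob@averylongsite")   # 17 characters overall
long_user = pattern.match("robert@site")        # 6 characters before '@'

print(ok is not None, too_long, long_user)  # True None None
```

Both lookaheads start at the same position (the start of the input), because neither of them advances the match position.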

Negative lookahead (?!pattern)

Inverse of (?=pattern). It matches if the pattern is absent. For example, a(?=b) matches ‘a’ in ‘abc’ (without consuming ‘b’), but not in ‘acc’; whereas a(?!b) matches ‘a’ in ‘acc’, but not in ‘abc’.
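The four cases above can be verified directly in Python. A minimal sketch:

```python
import re

pos_abc = re.search(r"a(?=b)", "abc")   # 'a' followed by 'b': match
pos_acc = re.search(r"a(?=b)", "acc")   # 'a' not followed by 'b': no match
neg_acc = re.search(r"a(?!b)", "acc")   # 'a' not followed by 'b': match
neg_abc = re.search(r"a(?!b)", "abc")   # 'a' followed by 'b': no match

# The lookahead consumes nothing: only 'a' is part of the match
print(pos_abc.group(), pos_abc.span())  # a (0, 1)
print(pos_acc, neg_abc)                 # None None
print(neg_acc.group())                  # a
```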

Positive lookbehind (?<=pattern)

[TODO]

Negative lookbehind (?<!pattern)

[TODO]
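Lookbehind works like lookahead, but checks the text before the current position. As a rough sketch (the sample text and patterns are assumptions made up for illustration), the following Python code picks out numbers that are, or are not, preceded by a dollar sign. Python's re module requires the lookbehind pattern to have a fixed width.

```python
import re

text = "cost $30, code 45, fee $7"  # hypothetical sample input

# Positive lookbehind: digits immediately preceded by '$'
prices = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits NOT preceded by '$' or by another digit
# (the digit is excluded so that a match cannot start in the middle of "30")
non_prices = re.findall(r"(?<![$\d])\d+", text)

print(prices)      # ['30', '7']
print(non_prices)  # ['45']
```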

Non-capturing group (?:pattern)

Recall that you can use parenthesized back references to capture matches. To disable capturing, use ?: inside the parentheses, in the form of (?:pattern). In other words, ?: disables the creation of a capturing group, so that no unnecessary capturing group is created.

Example: [TODO]
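As one possible illustration (the pattern and input are made up here), the Python sketch below groups abc for repetition in both cases, but only the first version stores it as a capturing group:

```python
import re

# (abc)+ both groups and captures: two groups in total
capturing = re.search(r"(abc)+(\d)", "abcabc7")

# (?:abc)+ groups without capturing: the digit becomes group 1
non_capturing = re.search(r"(?:abc)+(\d)", "abcabc7")

print(capturing.groups())      # ('abc', '7')
print(non_capturing.groups())  # ('7',)
```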

Named capture group (?<name>pattern)

The capture group can be referred to later by name.
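The syntax for named groups varies between languages. Python's re module, for instance, writes them as (?P<name>pattern) rather than (?<name>pattern). The sketch below uses that Python form; the group names year and month and the input strings are chosen only for illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Refer to the groups by name on the match object
m = date.search("released 2018-03")
print(m.group("year"), m.group("month"))  # 2018 03
print(m.groupdict())                      # {'year': '2018', 'month': '03'}

# Refer to the groups by name in the substitution string with \g<name>
swapped = date.sub(r"\g<month>/\g<year>", "2018-03")
print(swapped)                            # 03/2018
```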

Atomic grouping (?>pattern)

Disables backtracking inside the group, even if this may lead to a match failure.

Conditional (?(Cond)then|else)

[TODO]

Unicode

The metacharacters \w, \W (word and non-word character) and \b, \B (word and non-word boundary) recognize Unicode characters.

[TODO]
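How far Unicode is recognized depends on the language and its settings. As one example, in Python 3 the \w class matches Unicode letters by default for str patterns, and the re.ASCII flag restricts it to ASCII. A small sketch, with a sample string chosen for illustration:

```python
import re

text = "na\u00efve caf\u00e9"   # "naïve café", written with escapes

# Default for str patterns: \w matches accented letters too
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the words are split
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```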

Regex in programming languages

  • Python: See “Python re module for Regular Expression”
  • Java: See “Regular expressions in Java”
  • JavaScript: See “Regular expression in JavaScript”
  • Perl: See “Regular expressions in Perl”
  • PHP: [Link]
  • C/C++: [Link]

REFERENCES AND RESOURCES

  1. (Python) Python regular expression HOWTO @ https://docs.python.org/3/howto/regex.html (Python 3).
  2. (Python) Python’s re – Regular expression operations @ https://docs.python.org/3/library/re.html (Python 3).
  3. (Java) Online Java Tutorial’s Trail on “Regular Expressions” @ https://docs.oracle.com/javase/tutorial/essential/regex/index.html.
  4. (Java) JavaDoc for java.util.regex Package @ https://docs.oracle.com/javase/10/docs/api/java/util/regex/package-summary.html (JDK 10).
  5. (Perl) perlrequick – Perl regular expressions quick start @ https://perldoc.perl.org/perlrequick.html.
  6. (Perl) perlre – Perl regular expressions @ https://perldoc.perl.org/perlre.html.
  7. (JavaScript) Regular expressions @ https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions.