Regular expressions, or "regexes" for short, are patterns of characters and metacharacters that can be used for matching strings. For example, the pattern "gr[ae]y" matches both "gray" and "grey".

While regular expressions have long been an integral part of other popular languages, they arrived in the Java world quite late: in 2002, with the release of Java 1.4. Perl, certainly the mother language of modern regexes, already turned 15 that year.

Regexes can be hard to read at first, but once you get the hang of them, they will soon become your weapon of choice for dealing with text.

In this article, I focus on Java code patterns for common scenarios. If you have never heard of regular expressions before, the Wikipedia article and the Pattern JavaDoc are good starting points. The Regex Crossword site is a great place to practice your regex skills.

Matching

The primary thing you can do with a regular expression is check whether a string matches a pattern.

boolean match = Pattern.matches(".*Cream.*", "Ice Cream Sandwich");
assertThat(match, is(true));

The Pattern.matches() method compiles the pattern every time before matching the string. When the pattern is used repeatedly, it's better to precompile it once and reuse the Pattern instance.

Pattern p = Pattern.compile(".*Cream.*");
boolean match = p.matcher("Ice Cream Sandwich").matches();
assertThat(match, is(true));
boolean match2 = p.matcher("Jelly Bean").matches();
assertThat(match2, is(false));
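The examples above wrap the word in ".*" because matches() only succeeds when the pattern covers the entire input. The following sketch contrasts it with find(), which looks for the pattern anywhere in the input. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class MatchModes {
    // Compiled once and reused; note that there is no ".*" around the word.
    private static final Pattern CREAM = Pattern.compile("Cream");

    // True only if the pattern matches the entire input.
    static boolean matchesWhole(String input) {
        return CREAM.matcher(input).matches();
    }

    // True if the pattern occurs anywhere in the input.
    static boolean containsMatch(String input) {
        return CREAM.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("Ice Cream Sandwich"));  // false
        System.out.println(containsMatch("Ice Cream Sandwich")); // true
        System.out.println(matchesWhole("Cream"));               // true
    }
}
```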

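Since Java 8, a pattern can also be used as a predicate (e.g. for filtering). Note that asPredicate() tests with find(), so the pattern only needs to occur somewhere in the string: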

List<String> result = Stream.of("Pear", "Plum", "Honey")
        .filter(Pattern.compile("P.*").asPredicate())
        .collect(Collectors.toList());
assertThat(result, contains("Pear", "Plum"));
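Because asPredicate() uses find() internally, the filter above would also accept a string that merely contains a "P" somewhere. Here is a small sketch of the difference, with made-up class and method names. The whole-string variant is written as a plain lambda so that it also runs on Java 8; as far as I know, Java 11 added Pattern.asMatchPredicate() for the same purpose.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PredicateModes {
    private static final Pattern P_WORD = Pattern.compile("P.*");

    private static List<String> filter(List<String> input, Predicate<String> predicate) {
        return input.stream().filter(predicate).collect(Collectors.toList());
    }

    // asPredicate() tests with find(): a "P" anywhere in the string is enough.
    static List<String> findSemantics(List<String> input) {
        return filter(input, P_WORD.asPredicate());
    }

    // Whole-string semantics: the pattern has to cover the entire string.
    static List<String> matchSemantics(List<String> input) {
        return filter(input, s -> P_WORD.matcher(s).matches());
    }

    public static void main(String[] args) {
        List<String> fruits = Arrays.asList("Pear", "Snap Pea", "Honey");
        System.out.println(findSemantics(fruits));  // [Pear, Snap Pea]
        System.out.println(matchSemantics(fruits)); // [Pear]
    }
}
```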

Splitting

Texts can be split by a delimiter using regular expressions. The following example splits a CSV line, accepting both comma and semicolon as delimiter characters:

Pattern p = Pattern.compile("[;,]");
String[] result = p.split("123,abc;foo");
assertThat(result, arrayContaining("123", "abc", "foo"));

Since Java 8, it is possible to split straight into a Stream:

Pattern p = Pattern.compile("[;,]");
List<String> result = p.splitAsStream("123,abc;foo")
        .collect(Collectors.toList());
assertThat(result, contains("123", "abc", "foo"));
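One detail that matters for CSV-like data: by default, split() silently drops trailing empty strings, so empty columns at the end of a line disappear. Passing a negative limit keeps them. A minimal sketch, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

class SplitLimits {
    private static final Pattern DELIMITER = Pattern.compile("[;,]");

    // Default behaviour: trailing empty strings are removed from the result.
    static String[] splitDefault(String line) {
        return DELIMITER.split(line);
    }

    // A negative limit keeps every field, including trailing empty ones.
    static String[] splitKeepTrailing(String line) {
        return DELIMITER.split(line, -1);
    }

    public static void main(String[] args) {
        String line = "123,abc;;";
        System.out.println(splitDefault(line).length);      // 2
        System.out.println(splitKeepTrailing(line).length); // 4
    }
}
```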

Extracting

Regular expressions are extremely useful for locating and extracting certain parts of a string. For example, let's say we have an ISO date string and we would like to extract the year, month and day. We would use parentheses for marking the desired groups in the pattern, and then read these groups:

Pattern p = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})T.*");
Matcher m = p.matcher("2014-08-27T21:33:11Z");
if (m.matches()) {
    String year = m.group(1);
    String month = m.group(2);
    String day = m.group(3);
    assertThat(year, is("2014"));
    assertThat(month, is("08"));
    assertThat(day, is("27"));
}

Groups are counted by their left parenthesis, starting from 1. Group number 0 always refers to the entire match. Note that a match attempt with matches() (or find() or lookingAt()) must have succeeded before you invoke group(), even when you are confident that the text matches; otherwise group() throws an IllegalStateException.
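Counting parentheses gets tedious once a pattern grows. Since Java 7, groups can also be given names with the (?<name>...) syntax and read with group(String). The following sketch is a variant of the date example; the class name and the choice to return null for non-matching input are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsoDateParts {
    private static final Pattern ISO_DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})T.*");

    // Returns {year, month, day}, or null if the input does not match.
    static String[] parse(String timestamp) {
        Matcher m = ISO_DATE.matcher(timestamp);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("year"), m.group("month"), m.group("day") };
    }

    public static void main(String[] args) {
        String[] parts = parse("2014-08-27T21:33:11Z");
        System.out.println(parts[0] + "/" + parts[1] + "/" + parts[2]); // 2014/08/27
    }
}
```

Named groups still have numbers, so m.group(1) keeps working for the year in this pattern.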

Replacing

Let's replace some text! The next example replaces the word "apple" with the word "cherry":

Pattern p = Pattern.compile("apple");
Matcher m = p.matcher("sweet apple pie");
String result = m.replaceAll("cherry");
assertThat(result, is("sweet cherry pie"));

This was simple. However, this example would also convert a "sweet pineapple pie" to a "sweet pinecherry pie". Can you find a way to match only the word "apple"?
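In case you would like to compare notes: one possible answer uses the word boundary anchor \b on both sides of the word. This is only a sketch (the class and method names are made up), and other solutions exist.

```java
import java.util.regex.Pattern;

class WholeWordReplace {
    // \b matches at the position between a word character and a non-word character.
    private static final Pattern APPLE = Pattern.compile("\\bapple\\b");

    static String appleToCherry(String text) {
        return APPLE.matcher(text).replaceAll("cherry");
    }

    public static void main(String[] args) {
        System.out.println(appleToCherry("sweet apple pie"));     // sweet cherry pie
        System.out.println(appleToCherry("sweet pineapple pie")); // sweet pineapple pie
    }
}
```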

Let's make it more challenging and replace period decimal separators with commas, while leaving periods that end a sentence unchanged. We will match decimal numbers and use group references in the replacement string:

Pattern p = Pattern.compile("(\\d+)\\.(\\d+)");
Matcher m = p.matcher("This is a book. It costs €35.71.");
String result = m.replaceAll("$1,$2");
assertThat(result, is("This is a book. It costs €35,71."));
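If the pattern uses named groups, the replacement string can refer to them as ${name} instead of by number. The sketch below is a variant of the decimal separator example; the class name, the group names and the ASCII-only sample text are my own choices.

```java
import java.util.regex.Pattern;

class DecimalComma {
    private static final Pattern DECIMAL =
            Pattern.compile("(?<whole>\\d+)\\.(?<frac>\\d+)");

    // ${whole} and ${frac} refer to the named groups of the pattern.
    static String toCommaDecimal(String text) {
        return DECIMAL.matcher(text).replaceAll("${whole},${frac}");
    }

    public static void main(String[] args) {
        System.out.println(toCommaDecimal("Pi is roughly 3.14."));
        // Pi is roughly 3,14.
    }
}
```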

What if we would like to compute the replacement string at runtime? In the next example, the name of a special ingredient in a famous Monty Python quote is converted to upper case. For the sake of this example, String.toUpperCase() is used instead of just replacing the lower case word by the upper case word.

Pattern p = Pattern.compile("spam");
Matcher m = p.matcher("spam, egg, spam, spam, bacon and spam");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String match = m.group();
    m.appendReplacement(sb, match.toUpperCase());
}
m.appendTail(sb);
assertThat(sb.toString(), is("SPAM, egg, SPAM, SPAM, bacon and SPAM"));

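Up to Java 8, the Matcher API only accepts the synchronized StringBuffer here, but thanks to some JIT magic the performance penalty is negligible. Java 9 added overloads of appendReplacement() and appendTail() that take a StringBuilder.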

Maybe you wonder if Java 8 offers something like replaceAll(String::toUpperCase). Well, I have bad news for you: in Java 8, the Matcher class does not support lambda expressions, so a helper method needs to do the job instead:

public static String replaceAll(Matcher m, Function<String, String> replacer) {
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        // Quote the result so that "$" and "\" in it are taken literally.
        m.appendReplacement(sb, Matcher.quoteReplacement(replacer.apply(m.group())));
    }
    m.appendTail(sb);
    return sb.toString();
}

Now we can just use lambdas for computing the replacement string at runtime:

Pattern p = Pattern.compile("spam");
Matcher m = p.matcher("spam, egg, spam, spam, bacon and spam");
String result = replaceAll(m, String::toUpperCase);
assertThat(result, is("SPAM, egg, SPAM, SPAM, bacon and SPAM"));
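On Java 9 and later, a helper like this should no longer be necessary, because Matcher gained a replaceAll() overload that takes a Function<MatchResult, String>. The following sketch assumes such a runtime; the class and method names are made up. The function's result is still interpreted as a replacement string, so I wrap it in Matcher.quoteReplacement() to be safe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FunctionalReplace {
    private static final Pattern SPAM = Pattern.compile("spam");

    // Requires Java 9+: replaceAll(Function<MatchResult, String>).
    static String shout(String text) {
        return SPAM.matcher(text).replaceAll(
                match -> Matcher.quoteReplacement(match.group().toUpperCase()));
    }

    public static void main(String[] args) {
        System.out.println(shout("spam, egg, spam, spam, bacon and spam"));
        // SPAM, egg, SPAM, SPAM, bacon and SPAM
    }
}
```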

Quoting

To be honest, writing regular expressions in Java can be a real pain sometimes. Besides lacking special regex operators, Java also lacks a "raw string" in which escape sequences like "\n" are not interpreted. However, regular expressions also use the backslash for some of their metacharacters, so the backslashes need to be doubled in a Java string:

Pattern p = Pattern.compile("\\d+"); // regex: \d+

Even worse, if you want to match a backslash character, you have to actually write it four times (twice for the regular expression and twice again for the Java string):

Pattern p = Pattern.compile("C:\\\\"); // regex: C:\\ , matches C:\

Quoting is needed when the search string contains a metacharacter. For example, when we would like to match the ASCII representation of the copyright symbol, "(c)", a regular expression of "(c)" would just match the character c (and capture it in a group). We have to use backslashes to escape the special meaning of the parentheses: "\(c\)" (and then double the backslashes in the Java string, giving "\\(c\\)").

The Pattern.quote() method helps us quote fixed strings:

Pattern p = Pattern.compile(".*" + Pattern.quote("(c)") + ".*");
boolean copyrighted = p.matcher("Material is (c) 2014").matches();
assertThat(copyrighted, is(true));
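For the curious: Pattern.quote() does not escape each character individually but wraps the string in a \Q...\E block. When the whole pattern is a fixed string, the Pattern.LITERAL flag is an alternative. A small sketch, with class and method names made up for illustration:

```java
import java.util.regex.Pattern;

class LiteralMatching {
    // Returns the quoted form that Pattern.quote() produces.
    static String quoted(String literal) {
        return Pattern.quote(literal);
    }

    // With Pattern.LITERAL, metacharacters in the pattern have no special meaning.
    static boolean containsLiteral(String literal, String text) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(quoted("(c)"));                                  // \Q(c)\E
        System.out.println(containsLiteral("(c)", "Material is (c) 2014")); // true
        System.out.println(containsLiteral("(c)", "abc"));                  // false
    }
}
```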

In one of the examples above, the group references "$1" and "$2" were used in the replaceAll() call. To use arbitrary strings as replacements, we must escape these special characters as well. This is what Matcher.quoteReplacement() does for us.

Pattern p = Pattern.compile("PRICETAG");
Matcher m = p.matcher("This book is PRICETAG.");
String result = m.replaceAll(Matcher.quoteReplacement("$12"));
assertThat(result, is("This book is $12."));
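To see why the quoting matters, here is what happens without it. An unquoted "$12" is read as a group reference, and since the pattern has no capturing groups, replaceAll() fails with an IndexOutOfBoundsException. The class and method names in this sketch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementQuoting {
    private static final Pattern PRICETAG = Pattern.compile("PRICETAG");

    // The price is passed through as-is, so "$" starts a group reference.
    static String unquoted(String text, String price) {
        return PRICETAG.matcher(text).replaceAll(price);
    }

    // The price is quoted first, so "$" and "\" are taken literally.
    static String quoted(String text, String price) {
        return PRICETAG.matcher(text).replaceAll(Matcher.quoteReplacement(price));
    }

    public static void main(String[] args) {
        System.out.println(quoted("This book is PRICETAG.", "$12")); // This book is $12.
        try {
            unquoted("This book is PRICETAG.", "$12");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unquoted replacement failed: " + e.getMessage());
        }
    }
}
```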