Little Java Regex Cookbook

Regular expressions, or “regexes” for short, are patterns of characters and metacharacters that can be used for matching strings. For example, the pattern “gr[ae]y” matches both the strings “gray” and “grey”.

While regular expressions are an integral part of other popular languages, they were introduced to the Java world rather late, with the release of Java 1.4 in 2002. Perl, certainly the mother language of modern regexes, already turned 15 that year.

Regexes are sometimes hard to understand, but once you get the hang of them, they will soon become your weapon of choice when you have to deal with text.

In this article, I focus on Java code patterns for common scenarios. If you have never heard of regular expressions before, the Wikipedia article and the Pattern JavaDoc are good starting points. The Regex Crossword site is a great place for working out your regex muscles.

Matching

The primary thing you can do with regular expressions is to check if a string matches a pattern.

boolean match = Pattern.matches(".*Cream.*", "Ice Cream Sandwich");
assertThat(match, is(true));

The Pattern.matches() method compiles the pattern every time before matching the string. When the pattern is used repeatedly, it’s better to compile it once, reuse the Pattern instance, and use Pattern.matcher() to create a Matcher object:

Pattern p = Pattern.compile(".*Cream.*");

boolean match = p.matcher("Ice Cream Sandwich").matches();
assertThat(match, is(true));

boolean match2 = p.matcher("Jelly Bean").matches();
assertThat(match2, is(false));

Matcher.matches() returns true only if the entire string matches the regular expression. To find out whether the regular expression matches anywhere within the string, use Matcher.find() instead:

Pattern p = Pattern.compile("Cream");

boolean find = p.matcher("Ice Cream Sandwich").find();
assertThat(find, is(true));     // a part of the string has matched

boolean match = p.matcher("Ice Cream Sandwich").matches();
assertThat(match, is(false));   // but the regex does not match the entire string

A pattern can also be used as a find predicate (e.g. for filtering):

List<String> result = Stream.of("Pear", "Plum", "Honey", "Cherry Pie")
        .filter(Pattern.compile("P.*").asPredicate())
        .collect(Collectors.toList());
assertThat(result, contains("Pear", "Plum", "Cherry Pie"));

Note that “Cherry Pie” matches as well because this is a find predicate, so the pattern only needs to match a part of the string. Since Java 11, there are also match predicates, which must match the entire string:

List<String> result = Stream.of("Pear", "Plum", "Honey", "Cherry Pie")
        .filter(Pattern.compile("P.*").asMatchPredicate())
        .collect(Collectors.toList());
assertThat(result, contains("Pear", "Plum"));

Can you find a way to get the same result with a find predicate?
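
One possible answer to the question above is to anchor the pattern. This is only a sketch: the class and method names are made up for illustration, and it assumes that the input strings contain no line breaks.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class AnchoredFindPredicate {

    // Filters the names with a find predicate whose pattern is anchored at both ends.
    static List<String> startingWithP(String... names) {
        // ^ and $ pin the pattern to the start and the end of the input,
        // so find() can only succeed if the whole string matches.
        Pattern p = Pattern.compile("^P.*$");
        return Stream.of(names)
                .filter(p.asPredicate())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "Cherry Pie" is rejected now because it does not start with a "P"
        System.out.println(startingWithP("Pear", "Plum", "Honey", "Cherry Pie"));
    }
}
```

Unlike asMatchPredicate(), this variant also works on Java 8.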

Splitting

Texts can be split at a delimiter using regular expressions. The following example splits a CSV line, accepting both comma and semicolon as delimiter characters:

Pattern p = Pattern.compile("[;,]");
String[] result = p.split("123,abc;foo");
assertThat(result, arrayContaining("123", "abc", "foo"));

It is also possible to split straight into a Stream:

Pattern p = Pattern.compile("[;,]");
List<String> result = p.splitAsStream("123,abc;foo")
        .collect(Collectors.toList());
assertThat(result, contains("123", "abc", "foo"));
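
Real CSV lines often have spaces around the delimiters, or empty fields at the end. The following sketch shows one way to deal with both. The class and method names are hypothetical. It relies on the documented behavior of Pattern.split(): trailing empty strings are dropped by default and kept when a negative limit is passed.

```java
import java.util.regex.Pattern;

class CsvSplit {

    // Delimiter: a comma or a semicolon, with optional whitespace around it.
    private static final Pattern DELIMITER = Pattern.compile("\\s*[;,]\\s*");

    // Default behavior: trailing empty strings are removed from the result.
    static String[] split(String line) {
        return DELIMITER.split(line);
    }

    // A negative limit keeps the trailing empty strings.
    static String[] splitKeepingEmpty(String line) {
        return DELIMITER.split(line, -1);
    }
}
```

For the line "123 , abc;foo;;", split() returns three fields, while splitKeepingEmpty() returns five, the last two being empty strings.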

Extracting

Regular expressions are extremely useful for locating and extracting certain parts of a string. For example, let’s say we have an ISO date string and we would like to extract the year, month, and day. Parentheses are used for marking the desired groups in the pattern. The matching part of each group can then be read by its positional number:

Pattern p = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})T.*");
Matcher m = p.matcher("2014-08-27T21:33:11Z");
if (m.matches()) {
    String year = m.group(1);
    String month = m.group(2);
    String day = m.group(3);
    assertThat(year, is("2014"));
    assertThat(month, is("08"));
    assertThat(day, is("27"));
}

Groups are counted by their left parenthesis, starting from 1. Group number 0 always refers to the entire match. It’s even better to use group names, so you won’t need to care about their positions:

Pattern p = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})T.*");
Matcher m = p.matcher("2014-08-27T21:33:11Z");
if (m.matches()) {
    String day = m.group("day");
    String month = m.group("month");
    String year = m.group("year");
    assertThat(day, is("27"));
    assertThat(month, is("08"));
    assertThat(year, is("2014"));
}

Note that a match attempt with matches() (or find() or lookingAt()) must have succeeded before you invoke group(), even when you are absolutely sure that the text matches. Otherwise, group() throws an IllegalStateException.
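
The same rule applies when a text contains several dates. Here is a minimal sketch that uses find() in a loop instead of matches(). The class and its method names are my own invention for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractor {

    private static final Pattern ISO_DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Collects the year of every ISO date found in the text.
    static List<String> years(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = ISO_DATE.matcher(text);
        while (m.find()) {                  // each successful find() sets the groups
            result.add(m.group("year"));
        }
        return result;
    }

    // Returns true if group() is rejected on a matcher that has not matched anything yet.
    static boolean groupFailsBeforeMatch(String text) {
        Matcher m = ISO_DATE.matcher(text);
        try {
            m.group("year");                // no matches() or find() was invoked before
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```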

Replacing

Let’s replace some text! The next example replaces the word “apple” with the word “cherry”:

Pattern p = Pattern.compile("apple");
Matcher m = p.matcher("sweet apple pie");
String result = m.replaceAll("cherry");
assertThat(result, is("sweet cherry pie"));

This was simple. However, this example would also convert a “sweet pineapple pie” to a “sweet pinecherry pie”. Can you find a way to match only the word “apple”?
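
One possible answer is to use the word boundary \b. The sketch below is just one way to do it, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class WholeWordReplace {

    // \b matches at a word boundary, so the "apple" inside "pineapple" is skipped.
    private static final Pattern APPLE = Pattern.compile("\\bapple\\b");

    static String appleToCherry(String text) {
        return APPLE.matcher(text).replaceAll("cherry");
    }
}
```

With this pattern, a “sweet apple pie” still becomes a “sweet cherry pie”, while the “sweet pineapple pie” stays untouched.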

Let’s make it more challenging and replace period decimal separators with commas, but leave punctuation marks unchanged. We will match decimal numbers and use group references $1 and $2 in the replacement string:

Pattern p = Pattern.compile("(\\d+)\\.(\\d+)");
Matcher m = p.matcher("This is a book. It costs €35.71.");
String result = m.replaceAll("$1,$2");
assertThat(result, is("This is a book. It costs €35,71."));

What if we want to compute the replacement string at runtime? In the next example, the name of a special ingredient in a famous Monty Python quote is converted to upper case. For the sake of this example, String.toUpperCase() is used instead of just replacing the lower case word with the upper case word.

Pattern p = Pattern.compile("spam");
Matcher m = p.matcher("spam, egg, spam, spam, bacon and spam");
String result = m.replaceAll(r -> r.group().toUpperCase());
assertThat(result, is("SPAM, egg, SPAM, SPAM, bacon and SPAM"));

The example above requires Java 9 or higher. If you need to use Java 8, you can simulate replaceAll() with this helper method:

public static String replaceAll(Matcher m, Function<MatchResult, String> replacer) {
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        m.appendReplacement(sb, replacer.apply(m));
    }
    m.appendTail(sb);
    return sb.toString();
}
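
Here is a small, self-contained sketch of how such a helper could be used. The class name and the two example methods are made up for illustration. Note that appendReplacement() still interprets $ and \ in the computed string, so a replacement that is meant literally should go through Matcher.quoteReplacement().

```java
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Java8Replace {

    // The helper from above, repeated here to keep the example self-contained.
    static String replaceAll(Matcher m, Function<MatchResult, String> replacer) {
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, replacer.apply(m));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Converts every "spam" to upper case.
    static String shout(String text) {
        Matcher m = Pattern.compile("spam").matcher(text);
        return replaceAll(m, r -> r.group().toUpperCase(Locale.ROOT));
    }

    // The computed replacement starts with a literal dollar sign, so it is quoted.
    static String priceTags(String text) {
        Matcher m = Pattern.compile("(\\d+) dollars").matcher(text);
        return replaceAll(m, r -> Matcher.quoteReplacement("$" + r.group(1)));
    }
}
```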

Quoting

To be honest, writing regular expressions in Java can be a real pain sometimes. Other languages offer regex literals, like /\d+/. With Java, we’re not that lucky. We only have plain string literals, so we need to escape each regex backslash with another backslash:

Pattern p = Pattern.compile("\\d+"); // regex: \d+

Even worse, if we want to match a backslash character, we have to actually write it four times (twice for the regular expression and twice again for the Java string):

Pattern p = Pattern.compile("C:\\\\"); // regex: C:\\ , matches C:\

Java 12 was supposed to bring raw string literals, which would have cleaned up the backslash mess a bit. Sadly, this feature was dropped before the final release.

Quoting is used when the search string contains regex metacharacters. For example, when we would like to match the ASCII representation of the copyright symbol “(c)”, a regular expression of “(c)” would actually match any “c” character. We have to use backslashes to escape the parentheses: “\(c\)” (and then double the backslashes in the Java string).

The Pattern.quote() method helps us quote fixed strings:

Pattern p = Pattern.compile(".*" + Pattern.quote("(c)") + ".*");
boolean copyrighted = p.matcher("Material is (c) 2014").matches();
assertThat(copyrighted, is(true));
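
If the entire pattern is a fixed string, the Pattern.LITERAL flag is an alternative to Pattern.quote(). The following is only a sketch with hypothetical class and method names. It uses find() so that no surrounding “.*” is needed.

```java
import java.util.regex.Pattern;

class LiteralSearch {

    // Treats the needle as a plain string, not as a regular expression.
    static boolean containsLiteral(String needle, String haystack) {
        Pattern p = Pattern.compile(needle, Pattern.LITERAL);
        return p.matcher(haystack).find();
    }
}
```

containsLiteral("(c)", "Material is (c) 2014") is true, while containsLiteral("(c)", "Music") is false, even though “Music” contains a “c”.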

In one of the examples above, the group references “$1” and “$2” were used in the replaceAll() call. To use arbitrary strings as a replacement, we must escape the special characters as well. This is what Matcher.quoteReplacement() does for us. In the next example, the replacement string is supposed to be $12, instead of a reference to the content of group 12:

Pattern p = Pattern.compile("PRICETAG");
Matcher m = p.matcher("This book is PRICETAG.");
String result = m.replaceAll(Matcher.quoteReplacement("$12"));
assertThat(result, is("This book is $12."));

How to feed DDMS with gpsbabel

The Android Device Monitor is not just an aid for debugging applications. It also lets you simulate GPS positions, so you won’t need to actually run around in the countryside to test your GPS app. But where do you get test data from?

I have recorded some of my hiking trips with my Garmin GPS 60, and saved them in Garmin’s proprietary gdb file format. These files contain waypoints, routes and also recorded tracks.

The Swiss Army Knife for GPS files, gpsbabel, comes in handy for converting a gdb file into the GPX file format that can be read by DDMS. This is the line I used for conversion:

gpsbabel -i gdb -f hike-track.gdb -o gpx,gpxver=1.1 -F hike-track.gpx

Note the gpxver=1.1 option, as DDMS is unable to read GPX 1.0 files.

After converting and loading the GPX file into DDMS, I can now send single waypoints as GPS events to the emulated device. Beyond that, I can also play back a recorded track, and simulate carrying the emulated device along that track. This is very useful for testing GPS apps.

Validating the Android 4.2.2 RSA fingerprint

Android 4.2.2 comes with a new security feature. If you try to connect to your smartphone via adb and USB debugging, you will notice that your device is marked as “offline”. Additionally, a dialog shows up on your device, presenting an RSA fingerprint of your computer and asking for confirmation to accept a connection.

The rationale is that if your device is lost or stolen, there is no way to read its content even if USB debugging was enabled.

Now, presenting an RSA fingerprint surely is a nice idea to avoid man-in-the-middle attacks. But how do you get that fingerprint in order to compare it with the one shown on the device? At first I thought there must be a command (or an adb option) that prints out the fingerprint, but I wasn’t able to locate one. After spending some time with my favourite search engine, I finally dug up a rather complicated command line that prints out the fingerprint:

awk '{print $1}' < adbkey.pub | openssl base64 -A -d -a | openssl md5 -c | \
  awk '{print $2}' | tr '[:lower:]' '[:upper:]'

The command must be executed in the directory where adb stores the adb key, which usually is ~/.android (or /root/.android if adb runs as root).
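
For those who prefer not to rely on awk and openssl, the same computation can be sketched in Java. This is only my reading of the command line above: take the first field of adbkey.pub, decode it as Base64, and print the MD5 digest as colon-separated upper case hex. The class name AdbFingerprint and its method are made up for illustration, and the key is assumed to be in ~/.android.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class AdbFingerprint {

    // Computes the fingerprint from the content of an adbkey.pub file.
    static String fingerprint(String publicKeyLine) {
        // The first whitespace-separated field is the Base64 encoded key,
        // which corresponds to the awk '{print $1}' step.
        String base64 = publicKeyLine.trim().split("\\s+")[0];
        byte[] key = Base64.getDecoder().decode(base64);

        byte[] md5;
        try {
            md5 = MessageDigest.getInstance("MD5").digest(key);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }

        // Format the digest as colon-separated upper case hex bytes.
        StringBuilder sb = new StringBuilder();
        for (byte b : md5) {
            if (sb.length() > 0) {
                sb.append(':');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: adb stores its key in ~/.android
        Path file = Paths.get(System.getProperty("user.home"), ".android", "adbkey.pub");
        String content = new String(Files.readAllBytes(file), StandardCharsets.US_ASCII);
        System.out.println(fingerprint(content));
    }
}
```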

If you are somewhat security paranoid, you surely wonder why, on the one hand, Google shows a fingerprint on the device, but on the other hand makes it difficult to find out if that fingerprint actually belongs to your computer.

maven-release-plugin and git fix

After hours of trying and wondering why my release scripts suddenly stopped working, I found out that maven-release-plugin seems to have an issue with git on recent systems. If you invoke mvn release:prepare and find out that the release process just runs against the current SNAPSHOT instead of the release version, you have likely stumbled upon bug MRELEASE-812.

The reason for this issue seems to be that mvn release:prepare parses the output of git status. However, the status is localized in recent versions of git, and maven-release-plugin fails to parse the localized output.

The coming fix will probably use git status --porcelain, which produces machine-readable output. However, for the time being

LANG='en_US.UTF-8'
mvn release:prepare

is a valid workaround.

Cilla source code released

Finally, after almost three years of development, I have published the source code of Cilla. Cilla is the software that runs this blog.

I started working on a new blog software on June 3, 2009. It was meant to replace my old home page made with PHP. I decided to write my own blog software in Java, as there was no open source Java blog software that suited my needs. However, I never expected that this project would grow so large. The core modules alone consist of 27,000 lines of code in 295 classes.

The core modules of Cilla are now available on my development site shredzone.org. The source code is published on GitHub. The documentation, a few plugins, and a simple example web frontend are still missing. I will publish them later.

Cilla is published under the GNU Affero General Public License.