Using Regular Expressions in Java

pradeep's Avatar author of Using Regular Expressions in Java

JDK versions 1.4.0 and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many 3rd party regex packages available for Java. I will only discuss Sun's regex library that is now part of the JDK. Its quality is excellent, better than most of the 3rd party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set (i.e. by pattern matching). They can be used as a tool to search, edit or manipulate text or data. One common use is validation of data entry strings.

All classes related to regular expressions are found in the java.util.regex package, which must be imported.

Java regular expression patterns use a syntax similar to the one used by Perl. The most complete reference is the API documentation of the java.util.regex.Pattern class, originally published at sun.com.

Simple examples of the use of regular expressions are:

Code: Java
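import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches(); // true: the whole input matches a*b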
As a convenience for one-time use, the static Pattern.matches() method simplifies the syntax, but it compiles the pattern again on every call, so nothing is saved for reuse.

Code: Java
boolean b = Pattern.matches("a*b", "aaaaab");

Using The Pattern Class


In Java, you compile a regular expression with the static factory method Pattern.compile(), which returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can pass matching options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US-ASCII characters, causes the dot to match line breaks, and causes the ^ and $ anchors to match at embedded line breaks as well as at the start and end of the string. When working with Unicode strings, add Pattern.UNICODE_CASE to Pattern.CASE_INSENSITIVE if you want case-insensitive matching to follow Unicode rules rather than US-ASCII only. Specify Pattern.CANON_EQ if canonically equivalent Unicode sequences (such as a precomposed accented letter and the same letter followed by a combining accent) should match each other. Both Unicode flags may cost performance, so leave them out when you are sure your strings contain only US-ASCII characters.
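To see what the flags do in practice, here is a minimal sketch. The class name FlagsDemo, its helper methods and the sample subject strings are made up for illustration; only Pattern, Matcher and the flag constants come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Counts how often "^hello" is found in a fixed two-line subject
    // when the pattern is compiled with the given flags.
    static int countHello(int flags) {
        Pattern p = Pattern.compile("^hello", flags);
        Matcher m = p.matcher("Hello world\nhello again");
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Reports whether "a.b" matches the whole of "a\nb" with the given flags.
    static boolean dotMatchesLineBreak(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(countHello(0));                        // 0
        System.out.println(countHello(Pattern.CASE_INSENSITIVE)); // 1
        System.out.println(countHello(Pattern.MULTILINE));        // 1
        System.out.println(countHello(Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)); // 2
        System.out.println(dotMatchesLineBreak(0));               // false
        System.out.println(dotMatchesLineBreak(Pattern.DOTALL));  // true
    }
}
```

The flags are plain int bit masks, which is why they are combined with the | operator.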

If you will be using the same regular expression often in your code, create a Pattern object once and reuse it to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to Pattern.compile(). If you use one of the String convenience methods instead (matches(), split(), replaceAll() or replaceFirst()), the only way to specify options is to embed a mode modifier in the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
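The embedded modifiers can be tried out with the String methods directly. This is only a sketch: the class EmbeddedFlagsDemo and its method names are hypothetical, and the sample regexes were chosen just to make each modifier visible.

```java
class EmbeddedFlagsDemo {
    // (?i): case-insensitive comparison against the word "hello".
    static boolean isHello(String s) {
        return s.matches("(?i)hello");
    }

    // (?s): the dot may also match a line break.
    static boolean dotAllMatch(String s) {
        return s.matches("(?s)a.b");
    }

    // (?m): ^ and $ match at every line, so each line is cut down to its first word.
    static String firstWordOfEachLine(String text) {
        return text.replaceAll("(?m)^(\\w+).*$", "$1");
    }

    public static void main(String[] args) {
        System.out.println(isHello("HeLLo"));                             // true
        System.out.println(dotAllMatch("a\nb"));                          // true
        System.out.println(firstWordOfEachLine("one two\nthree four"));   // one, then three on the next line
    }
}
```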

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as "subject".split("regex"). The difference is that the former is faster when the pattern is reused, since the regex was already compiled.
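A small sketch of splitting with a precompiled pattern follows. The class SplitDemo, the COMMA constant and the comma-separated sample input are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused for every line: a comma with optional whitespace around it.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c")));               // [a, b, c]
        // The String method gives the same result but compiles the regex on each call.
        System.out.println(Arrays.toString("a , b,c".split("\\s*,\\s*")));    // [a, b, c]
    }
}
```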

Using The Matcher Class

Except for splitting a string, you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously. A Pattern is immutable and safe to share between threads; a Matcher is not, so give each thread its own Matcher.

To create a Matcher object, call the matcher() method on your Pattern object like this: Matcher myMatcher = myPattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new Matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.
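Reusing one Matcher for several subjects could look roughly like this. ResetDemo and countNumbers are hypothetical names, and the digit regex is just a stand-in for whatever pattern you use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Counts the runs of digits across all subjects using a single Matcher.
    static int countNumbers(String... subjects) {
        Pattern p = Pattern.compile("[0-9]+");
        Matcher m = p.matcher(""); // created once, with a dummy subject
        int total = 0;
        for (String subject : subjects) {
            m.reset(subject); // point the same Matcher at the next subject
            while (m.find()) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countNumbers("a1b22", "333", "none")); // 3
    }
}
```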

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, there are no further matches. To search the same subject again from the beginning, call myMatcher.reset() explicitly; do not rely on the matcher rewinding by itself after a failed find().
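The usual way to walk through all matches is a while loop around find(). The sketch below uses made-up names (FindDemo, allMatches, firstAfterReset); calling reset() explicitly before searching again keeps the restart unambiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects every match of the regex in the subject, in order.
    static List<String> allMatches(String regex, String subject) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Runs through all matches, rewinds explicitly, and returns the first match again.
    static String firstAfterReset(String regex, String subject) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        while (m.find()) {
            // skip over every match
        }
        m.reset(); // start over at the beginning of the subject
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("[a-z]+", "one two three"));      // [one, two, three]
        System.out.println(firstAfterReset("[a-z]+", "one two three")); // one
    }
}
```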

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the capturing group; groups are numbered from 1, counting their opening parentheses from left to right. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string, so the length of the match is end() - start(). group() returns the string matched by the regular expression or by the given pair of capturing parentheses.
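As an illustration of start(), end() and group(), here is a sketch with a deliberately simplistic address-like pattern. GroupDemo, the pattern and the sample sentence are invented for the example and are not meant as real e-mail validation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Group 1 is the part before the @, group 2 the name before ".com".
    static final Pattern ADDRESS = Pattern.compile("(\\w+)@(\\w+)\\.com");

    static String describe(String subject) {
        Matcher m = ADDRESS.matcher(subject);
        if (!m.find()) {
            return "no match";
        }
        return m.group() + " [" + m.start() + "," + m.end() + ")"
                + " user=" + m.group(1)
                + " host=" + m.group(2)
                + " hostStart=" + m.start(2);
    }

    public static void main(String[] args) {
        // prints: bob@example.com [5,20) user=bob host=example hostStart=9
        System.out.println(describe("mail bob@example.com now"));
    }
}
```

Note that these methods throw IllegalStateException if no match has been attempted or the last attempt failed, hence the check on find() first.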

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.
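A quick sketch of both forms side by side; ReplaceAllDemo, WHITESPACE and normalize are names chosen for the example.

```java
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Compiled once; matches any run of whitespace.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collapses every run of whitespace into a single space.
    static String normalize(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("a  b\t\nc"));                 // a b c
        System.out.println("a  b\t\nc".replaceAll("\\s+", " "));    // a b c (recompiles the regex)
    }
}
```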

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:
Code: Java
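StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
  if (checkIfThisMatchShouldBeReplaced()) {
    // Appends the not yet copied text before this match, then the replacement
    myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
  }
}
// Appends whatever is left after the last replaced match
myMatcher.appendTail(myStringBuffer);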
Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is way faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.
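To make the placeholders concrete, here is one possible version of that loop. The task (doubling every number of 10 or more), the class AppendReplacementDemo and the method doubleLargeNumbers are all invented for illustration. Matcher.quoteReplacement() is used because $ and \ have a special meaning in replacement strings; it is not strictly needed for plain digits but is a safe habit for computed text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    static final Pattern NUMBER = Pattern.compile("[0-9]+");

    // Doubles every number that is 10 or more and leaves smaller numbers untouched.
    static String doubleLargeNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group());
            if (value >= 10) { // plays the role of checkIfThisMatchShouldBeReplaced()
                String replacement = String.valueOf(value * 2); // computeReplacementString()
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
        }
        m.appendTail(out); // copy everything after the last replaced match
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: 5 apples, 24 pears, 80 plums
        System.out.println(doubleLargeNumbers("5 apples, 12 pears, 40 plums"));
    }
}
```

Skipped matches are not lost: their text is copied along with the surrounding text by the next appendReplacement() or by appendTail().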
Krolik's Avatar, Join Date: Feb 2007
Light Poster
The article is a little too short. There is too little information on how to write regular expressions in Java (with examples). What if I want to match any '\'? Should I write "\\" or "\\\" or "\\\\"? This is difficult for beginners.
pradeep's Avatar, Join Date: Apr 2005
Team Leader
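If you want to match a single '\' then the string literal in your Java source must be "\\\\". The regex itself is \\ (an escaped backslash), and each of those two backslashes has to be escaped once more inside a Java string literal.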
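A short sketch of the backslash case; BackslashDemo and its methods are example names. Pattern.quote() is shown as an alternative that builds a literal pattern without regex-level escaping, although the Java string literal still needs its own escape.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // Four backslashes in source = two in the regex = one literal backslash matched.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    // Same meaning, built by quoting the one-character string "\".
    static final Pattern QUOTED = Pattern.compile(Pattern.quote("\\"));

    static boolean containsBackslash(String s) {
        return BACKSLASH.matcher(s).find();
    }

    static String toForwardSlashes(String path) {
        return QUOTED.matcher(path).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(containsBackslash("C:\\temp"));            // true
        System.out.println(containsBackslash("C:/temp"));             // false
        System.out.println(toForwardSlashes("C:\\temp\\file.txt"));   // C:/temp/file.txt
    }
}
```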
jimmius's Avatar, Join Date: Mar 2007
Newbie Member
Greetings,

I would like a little help about regex matter in Java.

Suppose I want to check that a TextField contains only letters, not digits. What code would accomplish this?

I tried :
String str = new String();
if (str.matches("[a-zA-Z]")) {}
else {}


When the user enters more than one character in the TextField, it doesn't work. It works only for one character!

Any help appreciated,

jimmius
pradeep's Avatar, Join Date: Apr 2005
Team Leader
Code: Java
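String str = "Hello"; // e.g. the text read from the TextField
if (str.matches("^[a-zA-Z]+$")) {
  // str consists of one or more ASCII letters and nothing else
} else {
  // str is empty or contains some other character
}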
jimmius's Avatar, Join Date: Mar 2007
Newbie Member
Sorry for my ignorance..

About the "+$": what does it stand for?

pradeep's Avatar, Join Date: Apr 2005
Team Leader
+ matches the preceding token (here the character class [a-zA-Z]) one or more times
$ anchors the match at the end of the string, just as ^ anchors it at the start
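Here is a small sketch of why the original attempt only worked for one character. LettersOnlyDemo and its methods are hypothetical names. String.matches() always tests the whole string, so the ^ and $ in the pattern above are harmless but not required with that method; with find() they would matter.

```java
import java.util.regex.Pattern;

class LettersOnlyDemo {
    // Without +, the class stands for exactly one character.
    static boolean singleLetter(String s) {
        return s.matches("[a-zA-Z]");
    }

    // With +, one or more letters are allowed; matches() covers the whole string.
    static boolean lettersOnly(String s) {
        return s.matches("[a-zA-Z]+");
    }

    // find() only needs the pattern to occur somewhere in the string.
    static boolean containsLetters(String s) {
        return Pattern.compile("[a-zA-Z]+").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(singleLetter("abc"));      // false
        System.out.println(lettersOnly("abc"));       // true
        System.out.println(lettersOnly("abc1"));      // false
        System.out.println(lettersOnly(""));          // false
        System.out.println(containsLetters("abc1"));  // true
    }
}
```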
pradeep's Avatar, Join Date: Apr 2005
Team Leader
Repetition
The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches colour or color.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
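The repetition operators above can be tried out with a sketch like this one. RepetitionDemo and its method names are made up; the regexes are the ones from the text, with backslashes doubled for Java string literals.

```java
import java.util.regex.Pattern;

class RepetitionDemo {
    // ? makes the u optional.
    static boolean isColor(String s) {
        return Pattern.matches("colou?r", s);
    }

    // {3} demands exactly three more digits after the leading 1-9.
    static boolean hasFourDigitNumber(String s) {
        return Pattern.compile("\\b[1-9][0-9]{3}\\b").matcher(s).find();
    }

    // * allows zero or more letters or digits after a first letter.
    static boolean isStrictTag(String s) {
        return Pattern.matches("<[A-Za-z][A-Za-z0-9]*>", s);
    }

    // + alone is shorter to write but also accepts <1>.
    static boolean isLooseTag(String s) {
        return Pattern.matches("<[A-Za-z0-9]+>", s);
    }

    public static void main(String[] args) {
        System.out.println(isColor("color") + " " + isColor("colour"));      // true true
        System.out.println(hasFourDigitNumber("in 2007 it began"));          // true
        System.out.println(hasFourDigitNumber("12345"));                     // false
        System.out.println(isStrictTag("<h1>") + " " + isStrictTag("<1>"));  // true false
        System.out.println(isLooseTag("<1>"));                               // true
    }
}
```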
pradeep's Avatar, Join Date: Apr 2005
Team Leader
Anchors
Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.
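In Java the multi-line mode is Pattern.MULTILINE. The following sketch (AnchorDemo and its methods are example names) counts matches of ^b with and without it, and shows that an anchor match has length zero.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts the matches of "^b" in the subject with the given flags.
    static int countLeadingB(String subject, int flags) {
        Matcher m = Pattern.compile("^b", flags).matcher(subject);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Length of the first match of "$": an anchor matches a position, not characters.
    static int endAnchorLength(String subject) {
        Matcher m = Pattern.compile("$").matcher(subject);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(countLeadingB("bob", 0));                       // 1
        System.out.println(countLeadingB("bob\nbob", 0));                  // 1
        System.out.println(countLeadingB("bob\nbob", Pattern.MULTILINE));  // 2
        System.out.println(endAnchorLength("bob"));                        // 0
    }
}
```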
jimmius's Avatar, Join Date: Mar 2007
Newbie Member
It was more than I expected! Thank you very much! It was very helpful

Greetings from Hellas!