Using Regular Expressions in Java

pradeep · Jul 27, 2006

JDK versions 1.4.0 and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many third-party regex packages available for Java. I will only discuss Sun's regex library, which is now part of the JDK. Its quality is excellent, better than most of the third-party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set (i.e. by pattern matching). They can be used as a tool to search, edit or manipulate text or data. One common use is the validation of data entry strings.

All classes related to regular expressions are found in the java.util.regex package which must be imported.

Java regular expression patterns use a syntax similar to the one used by Perl. The best reference is found at sun.com.

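A simple example of the use of regular expressions is:
Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("a*b");  // compile the pattern once
Matcher m = p.matcher("aaaaab");     // bind it to a subject string
boolean b = m.matches();             // true: the whole subject matches a*b

As a convenience for one-time use, the static Pattern.matches() method simplifies the syntax. It compiles the pattern on every call, so it does not give you a precompiled pattern to reuse.
Code:
boolean b = Pattern.matches("a*b", "aaaaab");

Using The Pattern Class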

In Java, you compile a regular expression by using the static factory method Pattern.compile(), which returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US-ASCII characters, causes the dot to match line breaks, and causes the ^ and $ anchors to match at embedded line breaks as well as at the start and end of the string. When working with Unicode strings, add Pattern.UNICODE_CASE to Pattern.CASE_INSENSITIVE if you want to make the regex case insensitive for all characters in all languages. Specify Pattern.CANON_EQ to ignore differences in Unicode encodings (for example, a precomposed accented letter versus a base letter plus a combining accent). This flag costs performance, so leave it out when you are sure your strings contain only US-ASCII characters.
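
The effect of these flags can be checked with a small sketch like the one below. The class name PatternFlagsSketch and its helper method are made up for illustration; only Pattern.compile() and the flag constants come from java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK or of the original article.
class PatternFlagsSketch {
    // Returns true when the whole subject matches the regex under the given flags.
    static boolean matches(String regex, int flags, String subject) {
        return Pattern.compile(regex, flags).matcher(subject).matches();
    }

    // Returns true when the regex is found anywhere in the subject.
    static boolean found(String regex, int flags, String subject) {
        return Pattern.compile(regex, flags).matcher(subject).find();
    }

    public static void main(String[] args) {
        String subject = "HELLO\nWORLD";
        // No flags: case sensitive, and the dot does not match the line break.
        System.out.println(matches("hello.world", 0, subject));
        // CASE_INSENSITIVE plus DOTALL lets the same regex match.
        System.out.println(matches("hello.world",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL, subject));
        // MULTILINE lets ^ and $ match around the embedded line break.
        System.out.println(found("^WORLD$", Pattern.MULTILINE, subject));
        // Non-ASCII letters need UNICODE_CASE in addition to CASE_INSENSITIVE.
        System.out.println(matches("\u00e9",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00c9"));
    }
}
```

Flags are plain int bit masks, which is why they are combined with the | operator.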

If you will be using the same regular expression often in your source code, you should create a Pattern object once and reuse it to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to Pattern.compile(). If you use one of the String convenience methods instead, such as myString.matches(), myString.replaceAll() or myString.split(), the only way to specify options is to embed a mode modifier into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
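
As a rough illustration of embedded mode modifiers used with the String methods, consider the sketch below. The class and method names are invented for this example.

```java
// Hypothetical demo class showing mode modifiers embedded in the regex.
class EmbeddedFlagsSketch {
    // Case-insensitive whole-string match using the (?i) modifier.
    static boolean matchesIgnoringCase(String subject, String regex) {
        return subject.matches("(?i)" + regex);
    }

    // Prefixes every line with "> "; (?m) makes ^ match after each line break.
    static String quoteLines(String text) {
        return text.replaceAll("(?m)^", "> ");
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringCase("JAVA", "java"));
        // (?s) lets the dot match the line break between the two words.
        System.out.println("one\ntwo".matches("(?s)one.two"));
        System.out.println(quoteLines("one\ntwo"));
    }
}
```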

Use myPattern.split("subject") to split the subject string using the compiled regular expression. If myPattern was compiled from "regex", this call has exactly the same results as "subject".split("regex"). The difference is that the former is generally faster when the pattern is reused, since the regex is compiled only once.
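
A minimal sketch of splitting with a precompiled pattern follows. The class name, the COMMA constant and the sample data are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class SplitSketch {
    // Compiled once: a comma with optional whitespace on either side.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        String line = "red , green,blue";
        // Both calls produce the same array; the first reuses the compiled pattern.
        System.out.println(Arrays.toString(fields(line)));
        System.out.println(Arrays.toString(line.split("\\s*,\\s*")));
    }
}
```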

Using The Matcher Class

Except for splitting a string, you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously. A Pattern is immutable and safe to share between threads; a Matcher is not, so give each thread its own.

To create a Matcher object, simply call matcher() on your Pattern object like this: Matcher myMatcher = myPattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.
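
The reuse of one Matcher across several subject strings might look like the following sketch. The class name, the DIGITS pattern and the method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class MatcherReuseSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Counts the subjects that consist only of digits, reusing a single Matcher.
    static int countNumeric(String[] subjects) {
        Matcher m = DIGITS.matcher("");   // created once with a dummy subject
        int count = 0;
        for (String s : subjects) {
            m.reset(s);                   // point the same matcher at the next subject
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumeric(new String[] {"123", "abc", "42", "4a"}));
    }
}
```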

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, there are no further matches. To search the same subject again from the start, call myMatcher.reset() explicitly rather than relying on the matcher to rewind itself after find() fails.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the capturing group, counted from 1 by opening parenthesis. Omit the parameter, or pass 0, to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string, so the length of the match is end() - start(). group() returns the string matched by the regular expression or by the pair of capturing parentheses.
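
Here is a short sketch of a find() loop that reads start(), end() and group(). The key=value pattern, the class name and the output format are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class FindLoopSketch {
    // Group 1 captures the key, group 2 the numeric value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    static List<String> describeMatches(String subject) {
        Matcher m = PAIR.matcher(subject);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // start() and end() without arguments refer to the whole match.
            out.add(m.group(1) + " -> " + m.group(2)
                    + " at [" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("a=1, bb=22")) {
            System.out.println(line);
        }
    }
}
```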

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.
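
As a sketch of replaceAll() with a precompiled pattern, the example below rearranges a date using group references ($1, $2, $3) in the replacement text. The date format and the names are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ReplaceAllSketch {
    // Three capturing groups: year, month, day.
    private static final Pattern DATE =
            Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    static String toDayFirst(String text) {
        // $1..$3 in the replacement refer to the capturing groups.
        return DATE.matcher(text).replaceAll("$3/$2/$1");
    }

    public static void main(String[] args) {
        String text = "Posted on 2006-07-27.";
        System.out.println(toDayFirst(text));
        // Same result through the String method, which recompiles the regex each time.
        System.out.println(text.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1"));
    }
}
```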

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:
Code:
StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
  if (checkIfThisMatchShouldBeReplaced()) {
    // Copies the text since the previous replacement, then the new replacement.
    myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
  }
}
// Copies whatever follows the last replaced match.
myMatcher.appendTail(myStringBuffer);

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is much faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string. Keep in mind that appendReplacement() gives $ and \ a special meaning in the replacement string; if the computed text should be inserted literally, pass it through Matcher.quoteReplacement() first (available since Java 5).
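
To make the placeholders concrete, here is one possible sketch: it doubles every number of 10 or more and leaves smaller numbers alone. The rule, the class name and the method name are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AppendReplacementSketch {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String doubleLargeNumbers(String subject) {
        Matcher m = NUMBER.matcher(subject);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group());
            if (value >= 10) {               // stands in for checkIfThisMatchShouldBeReplaced()
                String replacement = String.valueOf(value * 2);
                // quoteReplacement() keeps $ and \ literal, should they ever occur.
                m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
            }
            // Skipped matches are copied unchanged by the next append call.
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleLargeNumbers("3 apples, 12 pears, 40 plums"));
    }
}
```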

Krolik · Feb 16, 2007

The article is a little too short. There is too little information how to format regular expressions in Java (with examples). What if I want to match any '\'? Should I write "\\" or "\\\" or "\\\\"? This is difficult for beginners.

pradeep · Feb 16, 2007

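If you want to match a single '\' then you need to write "\\\\" in your Java source. The Java compiler turns the string literal "\\\\" into the two characters \\, and the regex engine reads \\ as one literal backslash.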
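
A small sketch of this double escaping, with a made-up Windows-style path as the subject (class and method names are illustrative only):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class BackslashSketch {
    // Java literal "\\\\" is the regex \\ which matches one backslash.
    private static final Pattern BACKSLASH = Pattern.compile("\\\\");

    static int countBackslashes(String text) {
        Matcher m = BACKSLASH.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    static String toForwardSlashes(String path) {
        return BACKSLASH.matcher(path).replaceAll("/");
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\file.txt";   // the actual text is C:\temp\file.txt
        System.out.println(countBackslashes(path));
        System.out.println(toForwardSlashes(path));
    }
}
```

If the escaping gets hard to read, Pattern.quote("\\") builds a pattern that matches the given text literally.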

jimmius · Mar 8, 2007

Greetings,

I would like a little help about regex matter in Java.

Supposing that I want to check if a TextField contains only Strings not Numbers, which is the code to accomplish this?

I tried :
String str = new String();
if (str.matches("[a-zA-Z]")) {}
else {}

When it comes for the user to enter more than one character in the TextField, it doesn't work. It works only for one character!

Any help appreciated,

jimmius

pradeep · Mar 8, 2007

Code:
String str = new String();  // placeholder: use the text read from your TextField
if (str.matches("^[a-zA-Z]+$")) // true only when str consists solely of the letters a-z and A-Z
{
}
else {}

jimmius · Mar 8, 2007

Sorry for my ignorance..

About the "+$": what does it stand for?

pradeep · Mar 8, 2007

+ matches one or more repetitions of the preceding token, here the character class [a-zA-Z]
$ anchors the match at the end of the string
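
For what it is worth, here is a small sketch of the difference, with an invented helper name. Since String.matches() always has to match the whole string, the ^ and $ are optional there; they matter when you search with Matcher.find().

```java
// Hypothetical demo class, for illustration only.
class LettersOnlySketch {
    // True when the input is one or more ASCII letters and nothing else.
    static boolean isLettersOnly(String input) {
        return input.matches("[a-zA-Z]+");
    }

    public static void main(String[] args) {
        // Without the +, the character class matches exactly one character.
        System.out.println("hello".matches("[a-zA-Z]"));
        System.out.println(isLettersOnly("hello"));
        System.out.println(isLettersOnly("hello1"));
        System.out.println(isLettersOnly(""));
    }
}
```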

pradeep · Mar 8, 2007

Repetition
The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches colour or color.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
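
These quantifiers can be tried out with a sketch like this one. The class and method names are made up; the regexes are the ones from the text above, written as Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class QuantifierSketch {
    // ? makes the u optional.
    static boolean isColour(String s) {
        return s.matches("colou?r");
    }

    // * allows zero or more letters or digits after the first letter.
    static boolean isSimpleTag(String s) {
        return s.matches("<[A-Za-z][A-Za-z0-9]*>");
    }

    // {3} demands exactly three more digits; \b keeps longer numbers out.
    static String firstFourDigitNumber(String text) {
        Matcher m = Pattern.compile("\\b[1-9][0-9]{3}\\b").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isColour("color") + " " + isColour("colour"));
        System.out.println(isSimpleTag("<h1>") + " " + isSimpleTag("<1>"));
        System.out.println(firstFourDigitNumber("ids 12345 and 2006"));
    }
}
```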

pradeep · Mar 8, 2007

Anchors
Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.
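
A rough sketch of how anchors restrict where a match may occur is shown below; the counting helper and class name are assumptions for illustration. In Java the multi-line mode is Pattern.MULTILINE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AnchorSketch {
    // Counts how many times the pattern is found in the subject.
    static int count(Pattern p, String subject) {
        Matcher m = p.matcher(subject);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Unanchored: both b's in bob are found.
        System.out.println(count(Pattern.compile("b"), "bob"));
        // ^b: only the b at the start of the string.
        System.out.println(count(Pattern.compile("^b"), "bob"));
        // b$: only the b at the end of the string.
        System.out.println(count(Pattern.compile("b$"), "bob"));
        // MULTILINE: ^ also matches right after the line break.
        System.out.println(count(Pattern.compile("^b", Pattern.MULTILINE), "bob\nbob"));
    }
}
```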

jimmius · Mar 8, 2007

It was more than I expected! Thank you very much! It was very helpful

Greetings from Hellas!

pradeep · Mar 8, 2007

The pleasure is all mine! ;-)

Namrata84 · Mar 10, 2007

Hi,
in my project I have to change the status of a combo box in an HTML page at runtime, according to the selected option button. How can I do that?

pradeep · Mar 14, 2007

You'll need to write some client-side script, such as JavaScript or VBScript, to do so!

elec.shabnam · Feb 20, 2008

pradeep said:

Anchors
Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.

What if we want to match the second b?
