Using Regular Expressions in Java

Discussion in 'Java' started by pradeep, Jul 27, 2006.

  1. pradeep

    pradeep Team Leader

    Joined:
    Apr 4, 2005
    Messages:
    1,645
    Likes Received:
    87
    Trophy Points:
    0
    Occupation:
    Programmer
    Location:
    Kolkata, India
    Home Page:
    http://blog.pradeep.net.in

    JDK versions 1.4.0 and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many 3rd party regex packages available for Java. I will only discuss Sun's regex library that is now part of the JDK. Its quality is excellent, better than most of the 3rd party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

    Regular expressions are a way to describe a set of strings based on characteristics shared by every string in the set (i.e. by pattern matching). They can be used to search, edit or manipulate text or data. One common use is validating strings entered by a user.

    All classes related to regular expressions are found in the java.util.regex package which must be imported.

    Java regular expression patterns use a syntax similar to the one used by Perl. The best reference is the documentation of the Pattern class, found at sun.com.

    Simple examples of the use of regular expressions are:

    Code:
    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
     
    For one-off use, the static Pattern.matches() method shortens the syntax. It compiles the pattern on every call, so it does not give you a precompiled pattern to reuse.

    Code:
    boolean b = Pattern.matches("a*b", "aaaaab");

    Using The Pattern Class


    In Java, you compile a regular expression with the static factory method Pattern.compile(), which returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex");

    You can specify options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US-ASCII characters, causes the dot to match line breaks, and causes the start and end of string anchors (^ and $) to match at embedded line breaks as well. When working with Unicode strings, add Pattern.UNICODE_CASE to Pattern.CASE_INSENSITIVE if you want the regex to be case insensitive for all characters in all languages. You should specify Pattern.CANON_EQ so that canonically equivalent Unicode sequences match each other (for example, a precomposed accented letter and the same letter followed by a combining accent), unless you are sure your strings contain only US-ASCII characters and you want to increase performance.
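
    As a rough illustration of the second parameter, the sketch below compiles the same regex with and without flags and counts the matches. The class name PatternFlagsSketch, the helper countMatches() and the sample text are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample text are illustrative only.
class PatternFlagsSketch {

    // Counts how many times the compiled pattern matches in the input.
    static int countMatches(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Error: disk full\nerror: retrying\nWarning: ok";

        // No flags: ^ matches only at the very start of the input,
        // and the comparison is case sensitive.
        Pattern plain = Pattern.compile("^error");

        // MULTILINE lets ^ match after each line break;
        // CASE_INSENSITIVE ignores US-ASCII case.
        Pattern flagged = Pattern.compile("^error",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

        System.out.println(countMatches(plain, text));   // 0
        System.out.println(countMatches(flagged, text)); // 2
    }
}
```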

    If you will be using the same regular expression often in your source code, you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to Pattern.compile(). If you use one of the String convenience methods instead, such as myString.matches(), myString.split() or myString.replaceAll(), the only way to specify options is to embed a mode modifier in the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
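
    A small sketch of embedded mode modifiers used with String.matches(), assuming made-up method names and sample strings:

```java
// Hypothetical demo class showing embedded mode modifiers with String.matches().
class EmbeddedFlagsSketch {

    // (?i) turns on case-insensitive matching from inside the regex.
    static boolean isHello(String input) {
        return input.matches("(?i)hello");
    }

    // (?s) lets the dot match line breaks, like Pattern.DOTALL.
    static boolean spansLines(String input) {
        return input.matches("(?s)start.*end");
    }

    public static void main(String[] args) {
        System.out.println(isHello("HeLLo"));                             // true
        System.out.println(spansLines("start\nmiddle\nend"));             // true
        // Without (?s) the dot stops at the line break.
        System.out.println("start\nmiddle\nend".matches("start.*end"));   // false
    }
}
```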

    Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.
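
    For instance, a precompiled separator pattern could be reused like this. The class name, the COMMA constant and the input line are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; splits on commas with optional surrounding whitespace.
class SplitSketch {

    // Compiled once and reused for every call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] splitFields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFields("red , green,blue")));
        // [red, green, blue]
    }
}
```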

    Using The Matcher Class

    Except for splitting a string, you need to create a Matcher object from the Pattern object. The Matcher does the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the same regular expression to many subject strings at once.

    To create a Matcher object, call the matcher() method on your Pattern object like this: myMatcher = myPattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new Matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.
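
    The reuse described above might look roughly like this. The class and method names are invented, and the \d+ regex is only a stand-in.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; reuses one Matcher for several subject strings.
class MatcherReuseSketch {

    // Counts the inputs that consist entirely of digits.
    static int countAllDigitInputs(String[] inputs) {
        Pattern digits = Pattern.compile("\\d+");
        Matcher matcher = digits.matcher("");   // created once, with a dummy subject
        int count = 0;
        for (String input : inputs) {
            matcher.reset(input);               // point the same Matcher at the next subject
            if (matcher.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"123", "abc", "42", "4a"};
        System.out.println(countAllDigitInputs(inputs)); // 2
    }
}
```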

    To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, there are no further matches. Do not rely on the next call to find() starting over by itself; call myMatcher.reset() explicitly if you want to search from the start of the string again.
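
    A minimal sketch of iterating over matches with find(), with an explicit reset() before searching from the start again. All names and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; walks through all matches with find().
class FindLoopSketch {

    // Collects every match of the regex in the subject, in order.
    static List<String> findAll(String regex, String subject) {
        Matcher matcher = Pattern.compile(regex).matcher(subject);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // Walks through all matches, then resets explicitly and returns the first one again.
    static String firstMatchAfterReset(String regex, String subject) {
        Matcher matcher = Pattern.compile(regex).matcher(subject);
        while (matcher.find()) {
            // nothing to do, just exhaust the matches
        }
        matcher.reset();
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String subject = "10 apples, 20 pears, 30 plums";
        System.out.println(findAll("\\d+", subject));              // [10, 20, 30]
        System.out.println(firstMatchAfterReset("\\d+", subject)); // 10
    }
}
```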

    The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and about the text matched by each pair of capturing parentheses. Each of these methods accepts a single int parameter giving the number of the capturing group; groups are numbered from 1 by their opening parenthesis, left to right. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string, so the length of the match is end() - start(). group() returns the string matched by the regular expression or by the given pair of capturing parentheses.
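
    To make start(), end() and group() concrete, here is a small sketch. The key=value regex and the sample string are assumptions chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; reports details of the first key=value match.
class GroupSketch {

    static String describe(String subject) {
        Matcher m = Pattern.compile("(\\w+)=(\\d+)").matcher(subject);
        if (!m.find()) {
            return "no match";
        }
        // No argument: the whole match. 1 and 2: the capturing groups.
        return m.group() + " at " + m.start() + "-" + m.end()
                + ", key=" + m.group(1)
                + ", value=" + m.group(2)
                + ", value starts at " + m.start(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("set width=640 now"));
        // width=640 at 4-13, key=width, value=640, value starts at 10
    }
}
```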

    myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.
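
    Both forms side by side, as a sketch with invented method names. $1 and $2 in the replacement text refer to the capturing groups.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing Matcher.replaceAll() and String.replaceAll().
class ReplaceAllSketch {

    // Precompiled once; suitable for repeated use.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Collapses every run of whitespace into a single space.
    static String collapseSpaces(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    // Turns "last, first" into "first last"; the regex is compiled on each call.
    static String swapNames(String input) {
        return input.replaceAll("(\\w+),\\s*(\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("a  b\t\tc")); // a b c
        System.out.println(swapNames("Doe, John"));      // John Doe
    }
}
```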

    The Matcher class also allows you to do a search-and-replace in which your own code computes the replacement text for each regex match. You can do this with the appendReplacement() and appendTail() methods. Here is how:
    Code:
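    // Collects the subject text with the replacements applied.
    StringBuffer myStringBuffer = new StringBuffer();
    myMatcher = myPattern.matcher("subject");
    while (myMatcher.find()) {
      if (checkIfThisMatchShouldBeReplaced()) {
        // Appends the text since the last replacement, then the replacement itself.
        myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
      }
    }
    // Appends whatever follows the last replaced match.
    myMatcher.appendTail(myStringBuffer);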
    Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false, indicating whether a replacement should be made at all. Note that skipping a replacement is much faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string. Because $ and \ have a special meaning in the replacement string, pass literal text through Matcher.quoteReplacement() first.
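
    A self-contained variation on the loop above, given only as a sketch: the rule "double every number below 100" and all names are invented to stand in for the two placeholder methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; doubles small numbers and leaves the rest untouched.
class AppendReplacementSketch {

    static String doubleSmallNumbers(String input) {
        Matcher matcher = Pattern.compile("\\d+").matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Assumes the digit runs are short enough to fit in an int.
            int value = Integer.parseInt(matcher.group());
            if (value < 100) {   // stands in for checkIfThisMatchShouldBeReplaced()
                String replacement = String.valueOf(value * 2);
                // quoteReplacement() keeps $ and \ in the replacement literal.
                matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
            }
            // Skipped matches are copied unchanged by a later append call.
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleSmallNumbers("3 apples, 250 pears, 40 plums"));
        // 6 apples, 250 pears, 80 plums
    }
}
```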
     
  2. Krolik

    Krolik New Member

    Joined:
    Feb 16, 2007
    Messages:
    5
    Likes Received:
    0
    Trophy Points:
    0
    Home Page:
    http://home.elka.pw.edu.pl/~pkolaczk/
    The article is a little too short. There is too little information on how to write regular expressions in Java (with examples). What if I want to match any '\'? Should I write "\\" or "\\\" or "\\\\"? This is difficult for beginners.
     
  3. pradeep

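    To match a single '\', the regex itself has to be \\ (an escaped backslash). Inside a Java string literal every backslash must be escaped once more, so in your source code you write "\\\\".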
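
    A short sketch of this double escaping; the class name, method names and the sample path are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; matches and replaces literal backslashes.
class BackslashSketch {

    // The Java literal "\\\\" is the two-character regex for one literal backslash.
    private static final Pattern BACKSLASH = Pattern.compile("\\\\");

    static boolean containsBackslash(String input) {
        return BACKSLASH.matcher(input).find();
    }

    // Replaces every backslash with a forward slash.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\file.txt";              // the string C:\temp\file.txt
        System.out.println(containsBackslash(path));      // true
        System.out.println(toForwardSlashes(path));       // C:/temp/file.txt
    }
}
```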
     
  4. jimmius

    jimmius New Member

    Joined:
    Mar 8, 2007
    Messages:
    3
    Likes Received:
    0
    Trophy Points:
    0
    Greetings,

    I would like a little help with regular expressions in Java.

    Suppose I want to check that a TextField contains only letters, not numbers. What is the code to accomplish this?

    I tried :
    String str = new String();
    if (str.matches("[a-zA-Z]")) {}
    else {}


    When the user enters more than one character in the TextField, it doesn't work. It works only for one character!

    Any help appreciated,

    jimmius
     
  5. pradeep

    Code:
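    String str = "";  // replace with the text read from the TextField
    // matches() succeeds only if the whole string is one or more ASCII letters
    if (str.matches("^[a-zA-Z]+$")) {
      // letters only
    } else {
      // empty, or contains digits or other characters
    }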
    
     
  6. jimmius

    Sorry for my ignorance..

    About the "+$": what does it stand for?

     
  7. pradeep

    + matches one or more occurrences of the preceding token, here the character class [a-zA-Z]
    $ anchors the match at the end of the string
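
    To see the difference the + makes, here is a small sketch with made-up method names. Since String.matches() always tests the whole string, the ^ and $ anchors are optional in this particular call.

```java
// Hypothetical demo class; compares the character class with and without +.
class LettersOnlySketch {

    // Without a quantifier the class matches exactly one character.
    static boolean isSingleLetter(String input) {
        return input.matches("[a-zA-Z]");
    }

    // With + the class may repeat one or more times.
    static boolean isLettersOnly(String input) {
        return input.matches("^[a-zA-Z]+$");
    }

    public static void main(String[] args) {
        System.out.println(isSingleLetter("a"));   // true
        System.out.println(isSingleLetter("ab"));  // false
        System.out.println(isLettersOnly("ab"));   // true
        System.out.println(isLettersOnly("ab1"));  // false
        System.out.println(isLettersOnly(""));     // false
    }
}
```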
     
  8. pradeep

    Repetition
    The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches colour or color.

    The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

    Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
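
    The patterns above can be tried out with a sketch like the following. The helper firstMatch() and the sample strings are assumptions; note the doubled backslashes needed for \b in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; exercises the ?, *, + and {n,m} quantifiers.
class RepetitionSketch {

    // Returns the first match of the regex in the subject, or null if there is none.
    static String firstMatch(String regex, String subject) {
        Matcher matcher = Pattern.compile(regex).matcher(subject);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("color".matches("colou?r"));                  // true
        System.out.println("colour".matches("colou?r"));                 // true
        System.out.println("<h1>".matches("<[A-Za-z][A-Za-z0-9]*>"));    // true
        System.out.println("<1>".matches("<[A-Za-z][A-Za-z0-9]*>"));     // false
        System.out.println("<1>".matches("<[A-Za-z0-9]+>"));             // true
        System.out.println(
                firstMatch("\\b[1-9][0-9]{3}\\b", "In 2006 we sold 12345 units"));  // 2006
        System.out.println(
                firstMatch("\\b[1-9][0-9]{2,4}\\b", "7 of 12345"));                 // 12345
    }
}
```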
     
  9. pradeep

    Anchors
    Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.
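
    A sketch of how the anchors behave on bob, with and without Java's multi-line mode (Pattern.MULTILINE). The helper matchStarts() and the second sample string are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; lists the start index of every match.
class AnchorSketch {

    // flags is 0 for none, or a Pattern constant such as Pattern.MULTILINE.
    static List<Integer> matchStarts(String regex, int flags, String subject) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(subject);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("b", 0, "bob"));    // [0, 2]
        System.out.println(matchStarts("^b", 0, "bob"));   // [0]
        System.out.println(matchStarts("b$", 0, "bob"));   // [2]
        // In multi-line mode ^ also matches right after a line break.
        System.out.println(matchStarts("^b", 0, "bob\nbill"));                  // [0]
        System.out.println(matchStarts("^b", Pattern.MULTILINE, "bob\nbill"));  // [0, 4]
    }
}
```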
     
  10. jimmius

    It was more than I expected! Thank you very much! It was very helpful

    Greetings from Hellas!
     
  11. pradeep

    The pleasure is all mine! ;-)
     
  12. Namrata84

    Namrata84 New Member

    Joined:
    Mar 10, 2007
    Messages:
    1
    Likes Received:
    0
    Trophy Points:
    0
    Hi,
    In my project I have to change the status of a combo box in an HTML page at runtime, depending on which option button is selected. How can I do that?
     
  13. pradeep

    You'll need to write some client-side script, for example in JavaScript or VBScript, to do so!
     
  14. elec.shabnam

    elec.shabnam New Member

    Joined:
    Feb 13, 2008
    Messages:
    102
    Likes Received:
    0
    Trophy Points:
    0
    What if we want to match the second b?
     
