RegEx to match Java Comments

Most (if not all) of the regular expressions a developer needs are already available. There are numerous resources online, ranging from blogs to an ever-growing library of ready-made expressions, not to mention the endless discussions on Stack Overflow.

A few days back, I lost the sources to a project and had to decompile them from the original build. I did not expect the decompiler to throw in so many comments.

The most obvious way to remove comments would be to use the following RegEx.

/\*.*\*/

This seems the natural way to do it. /\* finds the start of the comment (note that the literal * needs to be escaped because * has a special meaning in regular expressions), .* finds any number of any character, and \*/ finds the end of the comment.
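As a rough illustration, here is how this first expression might be applied from Java itself rather than from a text editor. The class and method names are made up for this sketch, and each backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class SingleLineCommentDemo {
    // The expression /\*.*\*/ written as a Java string literal.
    static final Pattern COMMENT = Pattern.compile("/\\*.*\\*/");

    // Replaces every match of the pattern with the empty string.
    static String strip(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "int x = 1; /* decompiler note */\nint y = 2;\n";
        System.out.print(strip(source));
    }
}
```

The comment on the first line is removed, while the code and the line break around it are left alone.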

This will remove all the comments the decompiler put in. But I wanted to go further and handle any kind of comment.

The first problem with this approach is that . does not match newline characters, so a comment that spans more than one line is missed. In the example below, only the second comment is matched.

/* First comment 
 first comment—line two*/
/* Second comment */

This can be overcome easily by replacing the . with (.|[\r\n]), where \r and \n represent the carriage return and newline characters:

/\*(.|[\r\n])*\*/
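A small sketch of the difference, assuming Java's java.util.regex, where . skips line terminators by default. The class and method names are only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class MultiLineCommentDemo {
    static final Pattern SINGLE_LINE = Pattern.compile("/\\*.*\\*/");
    static final Pattern MULTI_LINE = Pattern.compile("/\\*(.|[\\r\\n])*\\*/");

    // True when the pattern matches somewhere in the source.
    static boolean findsComment(Pattern pattern, String source) {
        return pattern.matcher(source).find();
    }

    public static void main(String[] args) {
        String source = "/* First comment\n first comment, line two */";
        System.out.println(findsComment(SINGLE_LINE, source)); // false
        System.out.println(findsComment(MULTI_LINE, source));  // true
    }
}
```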

This reveals a second, more serious problem: the expression matches too much. Regular expressions are greedy; they take in as much as they can. Consider the case in which your file has two comments. This regular expression will match them both, along with anything in between:

start_code();
/* First comment */
more_code(); 
/* Second comment */
end_code();
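To see the greedy behaviour concretely, the sketch below runs the expression over the snippet above and returns the first match. The helper name firstMatch is an assumption made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class GreedyMatchDemo {
    static final Pattern GREEDY = Pattern.compile("/\\*(.|[\\r\\n])*\\*/");

    // Returns the first match, or null when nothing matches.
    static String firstMatch(String source) {
        Matcher m = GREEDY.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "start_code();",
                "/* First comment */",
                "more_code();",
                "/* Second comment */",
                "end_code();");
        // The single match runs from the first /* to the last */,
        // so more_code(); is caught inside it.
        System.out.println(firstMatch(source));
    }
}
```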

To fix this, the regular expression must accept less. We cannot accept just any character with a .; we need to limit the types of characters that can be in our expressions. The new RegEx does not allow a * inside the comment, so it stops parsing at the first */ it sees and will begin the next match only when it sees another /*.

/\*([^*]|[\r\n])*\*/

The disadvantage is that this simplistic approach doesn’t accept any comment with a * in it, which rules out a standard Java comment style. In the example below, only the second comment is matched.

/* 
* Common multi-line comment style. 
*/ 
/* Second comment */
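The sketch below, with made-up class and method names, collects every match of this expression. It shows both sides: two plain comments are now matched separately, but a comment with a * on an inner line is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class NoStarBodyDemo {
    static final Pattern NO_STAR_BODY = Pattern.compile("/\\*([^*]|[\\r\\n])*\\*/");

    // Collects every non-overlapping match in order of appearance.
    static List<String> findAll(String source) {
        List<String> matches = new ArrayList<>();
        Matcher m = NO_STAR_BODY.matcher(source);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String twoPlain = "/* First comment */\nmore_code();\n/* Second comment */";
        String starred = "/*\n * Common multi-line comment style.\n */\n/* Second comment */";
        System.out.println(findAll(twoPlain).size()); // 2
        System.out.println(findAll(starred));         // [/* Second comment */]
    }
}
```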

This is where it gets tricky. How do we accept a * without accepting the * that is part of the end comment? The solution is to still accept any character that is not *, but also accept a * and anything that follows it provided that it isn’t followed by a /. The parse will only stop if it sees a */ and will continue when it sees a *.

/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/

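However, the sub-expression (\*([^/]|[\r\n])) always consumes a * together with the character that follows it, so a run of * is accepted two characters at a time. When a comment closes with an even number of *, as in ****/, the pairs use up the * that is supposed to end the comment, and the match runs on into the code that follows: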

start_code();
/****
 * Common multi-line comment style.
 ****/
more_code(); 
/*
 * Another common multi-line comment style.
 */
end_code();
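Running the expression over this snippet shows the effect. This is only a sketch with illustrative names; the first match does not stop at ****/ but carries on to the end of the second comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class PairedStarDemo {
    static final Pattern PAIRED =
            Pattern.compile("/\\*([^*]|[\\r\\n]|(\\*([^/]|[\\r\\n])))*\\*/");

    // Returns the first match, or null when nothing matches.
    static String firstMatch(String source) {
        Matcher m = PAIRED.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "start_code();",
                "/****",
                " * Common multi-line comment style.",
                " ****/",
                "more_code();",
                "/*",
                " * Another common multi-line comment style.",
                " */",
                "end_code();");
        // The closing ****/ is consumed as two pairs of *, so the match
        // only ends at the */ of the second comment.
        System.out.println(firstMatch(source));
    }
}
```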

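To fix this, we accept a whole run of * at once, provided the character after the run is neither * nor /. We should also make sure that the comment can end in multiple *s.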

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

This RegEx can be used to find and replace Java comments using a standard text editor.
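The same expression can also be used from Java code. The sketch below is one possible wrapper; BlockCommentStripper and strip are names invented for this example. It works on plain text only, so a /* that appears inside a string literal would be treated as a comment too.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class BlockCommentStripper {
    // The final expression from above, with backslashes doubled for Java.
    static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/");

    // Removes every block comment; line breaks around comments are kept.
    static String strip(String source) {
        return BLOCK_COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "start_code();",
                "/****",
                " * Common multi-line comment style.",
                " ****/",
                "more_code();",
                "/*",
                " * Another common multi-line comment style.",
                " */",
                "end_code();",
                "");
        System.out.print(strip(source));
    }
}
```

Both comments are removed separately, and all three code lines are kept.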

An Easier Method

Most regular expression packages support non-greedy matching. A non-greedy quantifier matches as little as possible and takes in more characters only when the rest of the pattern cannot match otherwise. We can modify our second try, /\*(.|[\r\n])*\*/, to use the non-greedy matcher *? instead of the greedy matcher *. With this change, the middle of our comment will only match if it doesn’t match the end:

/\*(.|[\r\n])*?\*/
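A sketch of the non-greedy version in Java follows. The second pattern is my own variation rather than something from the steps above: it assumes Java's Pattern.DOTALL flag, which lets . match line terminators so the (.|[\r\n]) group is not needed. Class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class LazyCommentStripper {
    // The non-greedy expression from the article.
    static final Pattern LAZY = Pattern.compile("/\\*(.|[\\r\\n])*?\\*/");

    // Assumed alternative: with DOTALL, . also matches \r and \n.
    static final Pattern LAZY_DOTALL = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    // Removes every match of the given pattern from the source.
    static String strip(Pattern pattern, String source) {
        return pattern.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "start_code();",
                "/****",
                " * Common multi-line comment style.",
                " ****/",
                "more_code();",
                "/* Second comment */",
                "end_code();",
                "");
        System.out.print(strip(LAZY, source));
        System.out.print(strip(LAZY_DOTALL, source));
    }
}
```

Both patterns give the same result on this input: each comment is removed on its own and the code between them is kept.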