RegEx to match Java Comments

Most(if not all) of the regular expressions a developer needs is already available. There are numerous resources online ranging from blogs and even an ever growing library. Not to mention the endless discussions on stackoverflow.

Few days back, I had lost the sources to a project and had to decompile them from the original build. Did not expect the decompiler to throw in so much comments.

Capture

The most obivious way to remove comments would be to use the following regEx.

/\*.*\*/

This seems the natural way to do it. /\* finds the start of the comment (note that the literal * needs to be escaped because * has a special meaning in regular expressions), .* finds any number of any character, and \*/ finds the end of the expression.

This will remove all the comments the decompiler put it. But, wanted to pursue more to handle any kind of comments.

The first problem with this approach is that .* does not match new lines. (Note, the text in green color is matched)

/* First comment 
 first comment—line two*/
/* Second comment */

This can be overcome easily by replacing the . with (.|[\r\n]):(\r and \n represent the return and newline characters)

/\*(.|[\r\n])*\*/

This reveals a second, more serious, problem—the expression matches too much. Regular expressions are greedy, they take in as much as they can. Consider the case in which your file has two comments. This regular expression will match them both along with anything in between:

start_code();
/* First comment */
more_code(); 
/* Second comment */
end_code();

To fix this, the regular expression must accept less. We cannot accept just any character with a ., we need to limit the types of characters that can be in our expressions. The new RegEx stops parsing when it sees a */ and will begin the next match only when it sees another /*.

/\*([^*]|[\r\n])*\*/

The disadvantage is that this simplistic approach doesn’t accept any comments with a * in them, which a standard java comment.

/* 
* Common multi-line comment style. 
*/ 
/* Second comment */

This is where it gets tricky. How do we accept a * without accepting the * that is part of the end comment? The solution is to still accept any character that is not *, but also accept a * and anything that follows it provided that it isn’t followed by a /. The parse will only stop if it sees a */ and will continue when it sees a *.

/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/

Also, the section in bold makes sure * comes only in pairs. It will accept any even number of *. It might even accept the * that is supposed to end the comment.

start_code();
/****
 * Common multi-line comment style.
 ****/
more_code(); 
/*
 * Another common multi-line comment style.
 */
end_code();

We should also make sure that the comment can end in multiple * s.

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

This RegEx can be used to find/replace java comments using a standard text editor.

An easier Method

Most regular expression packages support non-greedy matching. This means that the pattern will only be matched if there is no other choice. We can modify our second try to use the non-greedy matcher *? instead of the greedy matcher *. With this change, the middle of our comment will only match if it doesn’t match the end:

/\*(.|[\r\n])*?\*/
Advertisements

On the beauty of Algorithms.. Part 1

As software engineers, our job is not just to find solution to problems, but to find the best solution to a problem.

You don’t aspire to be just another monkey with a fancy title..

Monkey-typing

Had the chance to comeupon a few beautifully crafted courses online. Highly reccomend to checkout this, this and this..

To demonstate the power of a good algorithms, let’s borrow a case study from Sedgewick and Wayne’s beautifully crafted book.

The Dynamic Connectivity problem 

We start with the following problem specification: The input is a sequence of pairs of integers, where each integer represents an object of some type and we are to interpret the pair p q as meaning “p is connected to q.”

We assume that “is connected to” is an equivalence relation, which means that it is
■ Reflexive : p is connected to p.
■ Symmetric : If p is connected to q, then q is connected to p.
■ Transitive : If p is connected to q and q is connected to r, then p is connected to r.

An equivalence relation partitions the objects into equivalence classes. In this case, two objects are in the same equivalence class if and only if they are connected. Our goal is to write a program to filter out extraneous pairs (pairs where both objects are in the same equivalence class) from the sequence.

See the illustration below,

union find

Assume the universe consists of 10 points numbered from 0-9. Initially none of the points are connected. The inputs are the pair of numbers on the left.

The first pair (4,3). We check if there exists a connection between the pair. So, we create  a connection between the two nodes and prints out the connection. Same for the next 4 inputs of (3,8), (6,5), (9,4) and (2,1).

But the next input, (8,9). Even though there is no direct connection between 8 and 9, they are part of a connected component, [3,4,8,9]. So, no new connection is to be made and nothing is printed. And so on..

Given a grid with numerous connected components, identifying the connected components or if two nodes are connected pose a computational problem. See the universe below,

union find2

You can quickly identify component consisting of five sites at the bottom left, but you might have difficulty verifying that all of the other sites are connected to one another. For a program, the task is
even more difficult, because it has to work just with site names and connections and has no access to the geometric placement of sites in the diagram. How can we tell quickly whether or not any given two sites in such a network are connected?

A solution to the problem provides an effective implementation for,

void union(int p, int q) – add connection between p and q
int find(int p) – component identifier for p (0 to N-1)
boolean connected(int p, int q) – return true if p and q are in the same component

The union() operation merges two components if the two sites are in different components, the find() operation returns an integer component identifier for a given site, the connected() operation determines whether two sites are in the same component.

As we shall soon see, the development of an algorithmic solution for dynamic connectivity thus reduces to the task of developing an implementation of this API. Every implementation has to
■ Define a data structure to represent the known connections
■ Develop efficient union(), find(), connected() implementations that are based on that data structure

As usual, the nature of the data structure has a direct impact on the efficiency of the algorithms, so data structure and algorithm design go hand in hand. The API already specifies the convention that both sites and components will be identified by int values between 0 and N-1, so it makes sense to use a site-indexed array id[] as our basic data structure to represent the components.

Initially, none of the sites are connected and each site corresponds to a component. Initially, we start with N components, each site in its own component, so we initialize id[i] to i for all i from 0 to N-1.

All of our implementa-tions use a one-line implementation of connected() that returns the boolean value find(p) == find(q). The find(int p) methods returns the identified for the component. So, it should make perfect sense to assert that, if two sites correspond to the same component, they are connected.

Implementations 

We shall consider three different implementations, all based on using the site-indexed id[] array, to determine whether two sites are in the same connected component.

Quick-find

One approach is to maintain the invariant that p and q are connected if and only if id[p] is equal to id[q]. In other words, all sites in a component must have the same value in id[]. This method is called quick-find because find(p) just returns id[p], which immediately implies that connected(p, q) reduces to just the test id[p] == id[q] and returns true if and only if p and q are in the same component.

For implementing union, we first check whether they are already in the same component, in which case there is nothing to do. Otherwise, we are faced with the situation that all of the id[] entries corresponding to sites in the same component as p have one value and all of the id[] entries corresponding to sites in the same component as q have another value. To combine the two components into one, we have to make all of the id[] entries corresponding to both sets of sites the same value, as shown in the example. To do so, we go through the array, changing all the entries with values equal to id[p] to the value id[q].

We could have decided to change all the entries equal to id[q] to the value id[p]—the choice between these two alternatives is arbitrary.

union find3

The implementation is pretty simple,

public int find(int p)
{ return id[p]; }
public void union(int p, int q)
{  // Put p and q into the same component.
   int pID = find(p);
   int qID = find(q);
   // Nothing to do if p and q are already in the same component.
   if (pID == qID) return;
      // Rename p’s component to q’s name.
      for (int i = 0; i < id.length; i++){
         if (id[i] == pID) {
           id[i] = qID;
         }
         count--;
      }
}

The find() operation is certainly quick, as it only accesses the id[] array once in order to complete the operation. But quick-find is typically not useful for large problems because union() needs to scan through the whole id[] array for each input pair.

In particular, suppose that we use quick-find for the dynamic connectivity problem and wind up with a single component. This requires at least (N-1) calls to union(). union() iterates through all the n elements at least once, which clearly gives us a N 2 complexity. This analysis generalizes to say that quick-find is quadratic for typical applications where we end up with a small number of components. Modern computers can execute hundreds of millions or billions of instructions per second, so this cost is not noticeable if N is small, but we also might find ourselves with millions or billions of sites and connections to process in a modern application. Trying to solve a problem with millions or billions of sites will not give us an acceptable time.

Most defenitely Quick-Find can be impoved..

 

String intern() function and String Pool

Let’s start with the  JVM spec..

The Java programming language requires that identical string literals (that is, literals that contain the same sequence of code points) must refer to the same instance of class String. In addition, if the method String.intern is called on any string, the result is a reference to the same class instance that would be returned if that string appeared as a literal. Thus, the following expression must have the value true:

Elucidating, if there are more than one string literal with the same value, they share the memory.

Consider the example. An application X processes the results from a questionnaire which expects an YES/NO response. Adding on, there are more than a billion replies. If you store  all those ‘YES’ and ‘NO’ values separately, you’ll end up clogging up your precious memory. So, why don’t we have a single string for ‘YES’ and let the trizillion literals point to it. This is exactly what JAVA does and the table where they store all such string literals is called the String Pool

String sampleOne = "Dylan";
String sampleTwo = "Dylan";
String sampleThree = new String("Dylan");

if(sampleOne == sampleTwo){
System.out.println("One and Two are same");
}

if(sampleOne == sampleThree){
System.out.println("One and Threeo are same");
}

In line 1, we create a String with value “Dylan”. In this process, Java automatically adds a reference of the String “Dylan” to the String Pool. In line 2, when we try to create another string, Java sees if this is already available in the String Pool and just points the literal to the value. But in line 3, we use the new operator to create a string. This is a bit off the mark, the new operator forces Java to ignore the String Pool and create a new instance. This’s the reason why line 6 is printed and line 10 is not… This very well explains why it’s not advisable to use a new operator while creating a String.

intern() method

The docs tells us that the intern() method returns a canonical representation of the string. In other words, using the intern() method, we are forcing Java to check the String pool and return the value of the String. Let me modify the above example

String sampleOne = "Dylan";
String sampleTwo = "Dylan";
String sampleThree = new String("Dylan").intern();
if(sampleOne == sampleTwo){
System.out.println("One and Two are same");
}
if(sampleOne == sampleThree){
System.out.println("One and Threeo are same");
}

Here, both line 6 and line 10 is printed.
Instead of creating a new instance in line 3, the intern() forces Java to return the String from String Pool.
Enough Said

renderSnake – Better Way to create HTML

We live in an age where everything is open and innovations spring out every second. All those algorithms, all those concepts and all those frameworks are easily avalable to he who is willing to look. Unless you are not staring down from the tip of the cutting edge, you can pretty much get help on any problem. If someone had a similar problem before and you have the solution available, why re-invent the wheel..

When I was in the task of creating an HTML report generating tool, I did not have to look much. Found the perfect framework – renderSnake. renderSnake is a component based HTML generation framework. Its purpose is to support the creation of Web applications that are better maintainable, allows for easier reuse, have testable UI components and produces compact HTML in an efficient way. Let’s start off with an example.

So, let’s create a plain old HTML.

</pre>
<h4>Two rows and three columns:</h4>
<table border="1">
<tbody>
<tr>
<td>Haruki Murakami</td>
<td>Gregory Roberts</td>
</tr>
<tr>
<td>A Wild Sheep Chase</td>
<td>Shantaram</td>
</tr>
</tbody>
</table>
<pre>

And the java code to create the HTML 

HtmlCanvas html = new HtmlCanvas();
html
 .html()
  .body()
   .h4()
    .content("Two rows and three columns:")
     .table(border("1"))
      .tr()
       .td()
        .content("Haruki Murakami")
       .td()
        .content("Gregory Roberts")
      ._tr()
      .tr()
       .td()
        .content("A Wild Sheep Chase")
       .td()
        .content("Shantaram")
      ._tr()
     ._table()
    ._body()
  ._html();

HTMLCanvas

Using a Canvas, you can write code that produces HTML. The Canvas has methods for all HTML tags. There are two methods for writing an opening tag, one without and one with attributes. The Canvas maintains a stack of nested tags. Proper closing of tags is required, according to the rules of HTML (some end tags are required, some are optional and some are forbidden).

Each tag method returns the Canvas instance, allowing for a fluent programming interface. Using source indentation you can visually verify the correct structure. In the above example, the variable html is an instance of HtmlCanvas.

HtmlAttributes

Using an HtmlAttributes instance, you can add attribtues to tags such as href, class and id. The Java classes HtmlAttributes and HtmlAttributesFactory provides convenience methods for creating instances. This allows for a fluent programming interface.

//This one is using HtmlAttributes :

html.body(new HtmlAttributes("key","value")); 

//This one is using the HTMLAttributesFactory

.a(href("http://goto.com")) 
                .img(src("picture.png").alt("some picture")) 
                ._a(); 

All the attributes for a given tag can be set using the corresponding method. In the above example, the href, img and alt properties are set.

I am yet to find a limitation for renderSnake. You can get more examples here and here..