RegEx to match Java Comments

Most (if not all) of the regular expressions a developer needs are already available. There are numerous resources online, ranging from blogs to ever-growing pattern libraries, not to mention the endless discussions on Stack Overflow.

A few days back, I lost the sources to a project and had to decompile them from the original build. I did not expect the decompiler to throw in so many comments.

Capture

The most obvious way to remove comments would be to use the following regex.

/\*.*\*/

This seems the natural way to do it. /\* finds the start of the comment (note that the literal * needs to be escaped because * has a special meaning in regular expressions), .* finds any number of any character, and \*/ finds the end of the comment.

This will remove all the comments the decompiler put in. But I wanted to go further and handle any kind of comment.

The first problem with this approach is that .* does not match newlines, so the multi-line first comment below is never matched; only the one-line second comment is.

/* First comment 
 first comment—line two*/
/* Second comment */

This can be overcome easily by replacing the . with (.|[\r\n]), where \r and \n represent the carriage return and newline characters.

/\*(.|[\r\n])*\*/

This reveals a second, more serious, problem—the expression matches too much. Regular expressions are greedy, they take in as much as they can. Consider the case in which your file has two comments. This regular expression will match them both along with anything in between:

start_code();
/* First comment */
more_code(); 
/* Second comment */
end_code();

To fix this, the regular expression must accept less. We cannot accept just any character with a .; we need to limit the types of characters that can appear inside the comment. The new regex stops matching when it sees a */ and will begin the next match only when it sees another /*.

/\*([^*]|[\r\n])*\*/

The disadvantage is that this simplistic approach doesn’t accept any comments with a * in them, which is standard in Java comments.

/* 
* Common multi-line comment style. 
*/ 
/* Second comment */

This is where it gets tricky. How do we accept a * without accepting the * that is part of the end comment? The solution is to still accept any character that is not *, but also accept a * and anything that follows it provided that it isn’t followed by a /. The parse will only stop if it sees a */ and will continue when it sees a *.

/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/

Also, the (\*([^/]|[\r\n])) alternative makes sure *s come only in pairs: each * is consumed together with the character that follows it. It will accept any even number of *s, and it might even swallow the * that is supposed to end the comment.

start_code();
/****
 * Common multi-line comment style.
 ****/
more_code(); 
/*
 * Another common multi-line comment style.
 */
end_code();

We should also make sure that the comment can end in multiple *s.

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
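As a quick sanity check in Java (a minimal sketch using the test strings from the examples above):

```java
import java.util.regex.Pattern;

public class CommentRegexCheck {
    public static void main(String[] args) {
        // The final pattern, with backslashes escaped for a Java string literal
        Pattern comment = Pattern.compile("/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/");

        // A comment that both contains and ends in multiple *s
        String tricky = "/****\n * Common multi-line comment style.\n ****/";
        System.out.println(comment.matcher(tricky).matches());  // true

        // A single match must not run across two separate comments
        String two = "/* First */ code(); /* Second */";
        System.out.println(comment.matcher(two).matches());     // false
    }
}
```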

This regex can be used to find/replace Java comments in a standard text editor.

An easier Method

Most regular expression packages support non-greedy matching. This means that the pattern will only be matched if there is no other choice. We can modify our second try to use the non-greedy matcher *? instead of the greedy matcher *. With this change, the middle of our comment will only match if it doesn’t match the end:

/\*(.|[\r\n])*?\*/
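For instance, in Java the non-greedy pattern strips both comments from the earlier example while leaving the code between them intact (a minimal sketch):

```java
import java.util.regex.Pattern;

public class StripComments {
    public static void main(String[] args) {
        // Non-greedy: each match stops at the first */ it sees
        Pattern comment = Pattern.compile("/\\*(.|[\\r\\n])*?\\*/");

        String code = "start_code();\n"
                    + "/* First comment */\n"
                    + "more_code();\n"
                    + "/* Second comment */\n"
                    + "end_code();";

        // Each comment is matched and removed independently
        System.out.println(comment.matcher(code).replaceAll(""));
    }
}
```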

On the beauty of Algorithms.. Part 1

As software engineers, our job is not just to find a solution to a problem, but to find the best solution to a problem.

You don’t aspire to be just another monkey with a fancy title..

Monkey-typing

I had the chance to come upon a few beautifully crafted courses online. I highly recommend checking out this, this and this..

To demonstrate the power of a good algorithm, let’s borrow a case study from Sedgewick and Wayne’s beautifully crafted book.

The Dynamic Connectivity problem 

We start with the following problem specification: The input is a sequence of pairs of integers, where each integer represents an object of some type and we are to interpret the pair p q as meaning “p is connected to q.”

We assume that “is connected to” is an equivalence relation, which means that it is
■ Reflexive : p is connected to p.
■ Symmetric : If p is connected to q, then q is connected to p.
■ Transitive : If p is connected to q and q is connected to r, then p is connected to r.

An equivalence relation partitions the objects into equivalence classes. In this case, two objects are in the same equivalence class if and only if they are connected. Our goal is to write a program to filter out extraneous pairs (pairs where both objects are in the same equivalence class) from the sequence.

See the illustration below,

union find

Assume the universe consists of 10 points numbered from 0-9. Initially none of the points are connected. The inputs are the pairs of numbers on the left.

The first pair is (4,3). We check if there already exists a connection between the pair; there is none, so we create a connection between the two nodes and print out the pair. The same goes for the next four inputs: (3,8), (6,5), (9,4) and (2,1).

But consider the next input, (8,9). Even though there is no direct connection between 8 and 9, they are already part of a connected component, [3,4,8,9]. So no new connection is to be made and nothing is printed. And so on..

Given a grid with numerous connected components, identifying the connected components, or telling whether two nodes are connected, poses a computational problem. See the universe below,

union find2

You can quickly identify the component consisting of five sites at the bottom left, but you might have difficulty verifying that all of the other sites are connected to one another. For a program, the task is even more difficult, because it has to work just with site names and connections and has no access to the geometric placement of sites in the diagram. How can we tell quickly whether or not any two given sites in such a network are connected?

A solution to the problem provides an effective implementation for,

void union(int p, int q) – add connection between p and q
int find(int p) – component identifier for p (0 to N-1)
boolean connected(int p, int q) – return true if p and q are in the same component

The union() operation merges two components if the two sites are in different components, the find() operation returns an integer component identifier for a given site, the connected() operation determines whether two sites are in the same component.

As we shall soon see, the development of an algorithmic solution for dynamic connectivity thus reduces to the task of developing an implementation of this API. Every implementation has to
■ Define a data structure to represent the known connections
■ Develop efficient union(), find(), connected() implementations that are based on that data structure

As usual, the nature of the data structure has a direct impact on the efficiency of the algorithms, so data structure and algorithm design go hand in hand. The API already specifies the convention that both sites and components will be identified by int values between 0 and N-1, so it makes sense to use a site-indexed array id[] as our basic data structure to represent the components.

Initially, none of the sites are connected and each site corresponds to a component. In other words, we start with N components, each site in its own component, so we initialize id[i] to i for all i from 0 to N-1.

All of our implementations use a one-line implementation of connected() that returns the boolean value find(p) == find(q). The find(int p) method returns the identifier for the component containing p. So it should make perfect sense to assert that, if two sites correspond to the same component, they are connected.
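As a sketch following the API and conventions above, the constructor and the one-line connected() could look like this (the interesting part, union(), is what the implementations below differ on):

```java
public class UF {
    private int[] id;   // id[i] = component identifier of site i
    private int count;  // number of components

    public UF(int N) {
        // Initially each site is in its own component
        count = N;
        id = new int[N];
        for (int i = 0; i < N; i++) {
            id[i] = i;
        }
    }

    public int count() {
        return count;
    }

    public boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    public int find(int p) {
        return id[p];  // quick-find: the identifier is stored directly
    }
}
```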

Implementations 

We shall consider three different implementations, all based on using the site-indexed id[] array, to determine whether two sites are in the same connected component.

Quick-find

One approach is to maintain the invariant that p and q are connected if and only if id[p] is equal to id[q]. In other words, all sites in a component must have the same value in id[]. This method is called quick-find because find(p) just returns id[p], which immediately implies that connected(p, q) reduces to just the test id[p] == id[q] and returns true if and only if p and q are in the same component.

For implementing union, we first check whether they are already in the same component, in which case there is nothing to do. Otherwise, we are faced with the situation that all of the id[] entries corresponding to sites in the same component as p have one value and all of the id[] entries corresponding to sites in the same component as q have another value. To combine the two components into one, we have to make all of the id[] entries corresponding to both sets of sites the same value, as shown in the example. To do so, we go through the array, changing all the entries with values equal to id[p] to the value id[q].

We could have decided to change all the entries equal to id[q] to the value id[p]—the choice between these two alternatives is arbitrary.

union find3

The implementation is pretty simple,

public int find(int p)
{  return id[p];  }

public void union(int p, int q)
{  // Put p and q into the same component.
   int pID = find(p);
   int qID = find(q);
   // Nothing to do if p and q are already in the same component.
   if (pID == qID) return;
   // Rename p’s component to q’s name.
   for (int i = 0; i < id.length; i++) {
      if (id[i] == pID) id[i] = qID;
   }
   count--;  // one fewer component after the merge
}
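Wiring this up as a tiny client reproduces the trace from earlier (a self-contained sketch; only pairs that create a new connection are printed, so (8,9) produces no output):

```java
public class QuickFindDemo {
    static int[] id;

    static int find(int p) { return id[p]; }

    static void union(int p, int q) {
        int pID = find(p);
        int qID = find(q);
        if (pID == qID) return;
        // Rename p's component to q's name
        for (int i = 0; i < id.length; i++) {
            if (id[i] == pID) id[i] = qID;
        }
    }

    public static void main(String[] args) {
        id = new int[10];
        for (int i = 0; i < 10; i++) id[i] = i;

        int[][] pairs = {{4, 3}, {3, 8}, {6, 5}, {9, 4}, {2, 1}, {8, 9}};
        for (int[] pq : pairs) {
            if (find(pq[0]) == find(pq[1])) continue;  // extraneous pair: skip
            union(pq[0], pq[1]);
            System.out.println(pq[0] + " " + pq[1]);   // a brand-new connection
        }
    }
}
```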

The find() operation is certainly quick, as it only accesses the id[] array once in order to complete the operation. But quick-find is typically not useful for large problems because union() needs to scan through the whole id[] array for each input pair.

In particular, suppose that we use quick-find for the dynamic connectivity problem and wind up with a single component. This requires at least (N-1) calls to union(). Each union() call iterates through all N entries of the id[] array at least once, which clearly gives us N² complexity. This analysis generalizes to say that quick-find is quadratic for typical applications where we end up with a small number of components. Modern computers can execute hundreds of millions or billions of instructions per second, so this cost is not noticeable if N is small, but we also might find ourselves with millions or billions of sites and connections to process in a modern application. Trying to solve a problem with millions or billions of sites will not give us an acceptable time.

Most definitely, Quick-Find can be improved..

 

Visualizing an xml as a graph – Neo4j 101

Neo4j is quite awesome for pretty much everything. Here is an extract I found from this blog

Whenever someone gives you a problem, think graphs. They are the most fundamental and flexible way of representing any kind of a relationship, so it’s about a 50-50 shot that any interesting design problem has a graph involved in it. Make absolutely sure you can’t think of a way to solve it using graphs before moving on to other solution types.

As a baby step, let us try to visualize an xml as a graph.

We can start with a simple xml here,

<?xml version="1.0" encoding="UTF-8"?>
 <breakfast_menu>
  <food>
   <name attr="one">Belgian Waffles</name>
   <price>$5.95</price>
   <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
   <calories>650</calories>
  </food>
 </breakfast_menu>

It is a simple ‘Breakfast Menu’ with just one item, ‘Belgian Waffles’.

Neo4j has a neat interface which lets us visualize the graph we created. The graph for the above xml looks slick..

small-graph-xml

 

The above XML has 6 tags and we have 6 nodes in our graph. A nested tag/node has a ‘CHILD_OF’ relationship with its parent tag/node.

In the graph above,

  • Node numbered 0 is the root node – Corresponds to the <breakfast_menu> tag
  • Node numbered 1 is the immediate child of the root – Corresponds to the <food> tag
  • Nodes numbered from 2-5 are the tags which come under the <food> tag, viz. <name>, <price>, <description> and <calories>

Transforming XML..

Let us parse the xml into a data structure which can be easily persisted using Neo4j. Personally, I would prefer a SAX parser, as there are serious memory constraints when creating a DOM object (really painful if you’re talking big data as xml). Additionally, SAX parsing gives you unlimited freedom to do whatever you want with the XML.

This is how I represent a node, using the object XmlElement. Each XmlElement is identified by an id, which is basically an integer or a long. The only thing to make sure is that the number is unique across all XmlElements.


import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class XmlElement {

    private static final Logger LOGGER = Logger.getLogger(XmlElement.class.getName());

    private String tagName;
    private String tagValue;
    private Map<String, String> attributes = new HashMap<String, String>();
    private HierarchyIdentifier hierarchyIdentifier;
    private int parentId;
    private boolean persisted;

    // Setters and Getters for the members

    public String getAttributeString() {
        ObjectMapper jsonMapper = new ObjectMapper();
        try {
            return jsonMapper.writeValueAsString(this.attributes);
        } catch (JsonProcessingException e) {
            LOGGER.severe(e.getMessage());
        }
        return null;
    }
}

The XML tag name and value are stored as strings, and the attributes are stored as a Map. There is also another member object, HierarchyIdentifier.

public class HierarchyIdentifier {
    private int depth;
    private int width;
    private int id;

    // Getters and Setters for the members
}

The HierarchyIdentifier class contains the id field, which is used to identify an XmlElement; this translates to identifying an XML tag. Each XmlElement also stores the id of its parent.

After the parse, the XML is represented as a Map<Integer, XmlElement>, with each XmlElement identified by the corresponding Integer id. You can see the SAX parser here..
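As an illustration (my own simplification, not the actual parser from the repo), a minimal SAX pass can hand out unique ids and record each tag’s parent:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlToMap {

    // id -> tag name, and id -> parent id (-1 for the root)
    static final Map<Integer, String> tags = new HashMap<>();
    static final Map<Integer, Integer> parents = new HashMap<>();

    public static void main(String[] args) throws Exception {
        String xml = "<breakfast_menu><food><name>Belgian Waffles</name></food></breakfast_menu>";

        DefaultHandler handler = new DefaultHandler() {
            private final Deque<Integer> stack = new ArrayDeque<>();
            private int nextId = 0;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                int id = nextId++;
                tags.put(id, qName);
                parents.put(id, stack.isEmpty() ? -1 : stack.peek());
                stack.push(id);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                stack.pop();
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);

        System.out.println(tags);     // {0=breakfast_menu, 1=food, 2=name}
        System.out.println(parents);  // {0=-1, 1=0, 2=1}
    }
}
```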

So we pass the Map to the GraphWriter object.

We create a node with the following properties

  • Value – The Value of the XML Tag
  • ID – The ID of the current node
  • Parent – The ID of the parent Tag
  • Attributes – The String representation of the Tag attributes

Also, the node has the following Labels

  • NODE – For all nodes
  • PARENT – For the seed ( highest parent ) node
  • XML tag Name – Say, the node for the tag <food> will have label FOOD

Currently, there is only a single type of relationship, ‘CHILD_OF’ , between immediate nested tags.
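As an illustration of what gets persisted, here is a hypothetical helper that builds the Cypher CREATE statement for a single tag node; the label and property names follow the lists above, but the actual GraphWriter code may differ.

```java
public class CypherSketch {

    // Builds a CREATE statement for one tag node (hypothetical helper)
    static String cypherFor(String tagName, String value, int id, int parentId, String attributes) {
        return String.format(
            "CREATE (n:NODE:%s {Value:'%s', ID:%d, Parent:%d, Attributes:'%s'})",
            tagName.toUpperCase(), value, id, parentId, attributes);
    }

    public static void main(String[] args) {
        System.out.println(cypherFor("food", "", 1, 0, "{}"));
        // CREATE (n:NODE:FOOD {Value:'', ID:1, Parent:0, Attributes:'{}'})
    }
}
```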

See the example below for a bigger XML

<breakfast_menu>
 <food>
  <name attr="one">Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
 </food>
 <food>
  <name attr="two">Masala Dosa</name>
  <price>$10.95</price>
  <description>South India's famous slim pancake with mashed potatoes</description>
  <calories>650</calories>
  <eaten>
   <name first="nikhil"/>
   <age>25</age>
  </eaten>
 </food>
 </breakfast_menu>

I have added a second item to the ‘breakfast_menu’.. the legendary masala dosa.

Also, the second food item has a new child tag, <eaten>. And the resultant graph looks prettier

xml-med

 

The sets of tags, relationships and properties can be seen in the Neo4j local server interface.

xml-med-tags

The entire codebase can be found here in github..

Visualizing as a GraphGist

An easy representation can be done in GraphGist. I have created the graph of this xml.

Here is the link to the gist. Also, the gist script can be found in the github gist here..

 

Comparing .csv files and generating sweet HTML reports with rendersnake

Here is a simple library to perform a simple operation: compare two delimiter-separated files and output an .html file.

Simple and straightforward. It can be used to output a ‘Comparison Result’ object, or it can be used to print out an .html report

Features

  1. Compare methodology can be customized using a properties file
  2. Report can be customized using a properties file

It is built using rendersnake, which is pretty awesome.. It is available on github, and v1.0 is somewhere between the first cry of the baby and kindergarten.

Here is the gitlab link.

See a report sample,

csvCompare


enjoy..

Git Internals – The commit tree

Git stores commits as trees and blobs, and a tree object can have both trees and blobs as its children. An example of storage as a blob can be seen here.

See the file structure of a project below


nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ tree
.
|-- about.html
|-- css
| `-- simple.css
|-- index.html
`-- list.html

1 directory, 4 files

It is pretty straightforward, with a couple of .html files and one .css file in a separate directory. Please note that all these files are committed, which means they have already been written out to the underlying storage. To make sure, we can use the really useful git log command,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git log
commit 300f5c42a5aed68268547a95db4f40b6b122fb5b
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 16:52:54 2014 +0530

initial commit

It shows that only one commit has been made, along with the hash value associated with the commit.

The cat-file command we overused and abused in the previous post can be used to see the tree corresponding to the last commit. We pass the hash value of the commit as a parameter.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 300f5c42a5aed68268547a95db4f40b6b122fb5b
tree 5d0d7785b65180c195f7a1bf3cf02218b56f6f0a
author nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394968974 +0530
committer nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394968974 +0530

initial commit

So, the commit hash points to a tree, whose hash is displayed. Let us see what the tree contains,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 5d0d7785b65180c195f7a1bf3cf02218b56f6f0a
100644 blob 09e16f36b3c4993ba924b1074629283a49869be9 about.html
040000 tree 02ff2e2946f969bc640886861ff8c7039e1a2339 css
100644 blob 9015a7a32ca0681be64471d3ac2f8c1f24c1040d index.html
100644 blob b92b8b70267846c8b21b5ad412666cb99f9c9211 list.html

This tree contains three blobs for the three .html files and another tree for the css directory. Let us go into this tree,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 02ff2e2946f969bc640886861ff8c7039e1a2339
100644 blob dac138d9e013a2e9a10e67d793bd4703c1b86bd1 simple.css

It contains the .css file. So, the entire structure looks something like this.

commit one

Now, let’s make a small change to the index.html file and do a second commit. After this, this is how the log looks,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git log
commit ec8b103771588498923711c036ff3280c863f713
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 17:36:33 2014 +0530

second commit

commit 300f5c42a5aed68268547a95db4f40b6b122fb5b
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 16:52:54 2014 +0530

initial commit

There are two commits and both of them have two different hash values,

First Commit : 300f5c42a5aed68268547a95db4f40b6b122fb5b (Initial Commit)

Second Commit : ec8b103771588498923711c036ff3280c863f713 (Second Commit)

Let us follow the second commit tree like we did before,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p ec8b103771588498923711c036ff3280c863f713
tree 455e56fbce4fe35ee64d9e7af572e5b0adef14f6
parent 300f5c42a5aed68268547a95db4f40b6b122fb5b
author nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394971593 +0530
committer nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394971593 +0530

second commit

The second commit is a child of the first commit, which is interesting, as it is exactly how we see it in tools like gitk. A commit contains a reference to its parent commits. While there is usually just a single parent (for a linear history), a commit can have any number of parents, in which case it is usually called a merge commit. Most workflows will only ever make you do merges with two parents, but you can really have any other number too.

Going deeper into the tree,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 455e56fbce4fe35ee64d9e7af572e5b0adef14f6
100644 blob 09e16f36b3c4993ba924b1074629283a49869be9 about.html
040000 tree 02ff2e2946f969bc640886861ff8c7039e1a2339 css
100644 blob b110c44fd08f191062636f18cfeeaeccd5be1b73 index.html
100644 blob b92b8b70267846c8b21b5ad412666cb99f9c9211 list.html

The most interesting thing to note here is that all the hash values, except the one for index.html, remain the same.

The two commit trees look something like this.

commit two

In the second commit tree, the hash values for the unchanged files are the same as in the previous commit tree. Just think of the hash values as pointers to files: the second tree points to the unchanged files.

Well, that’s it.. commit trees…

Working with paths and files in Java 7

Well, there is a lot happening in the latest Java upgrade, and ironically most of us are still comfortable with Java 5 or 6. Enough said.

The changes in NIO are quite impressive. Find below a quick reference for paths. The new classes are found in the java.nio.file package. Let’s look through the most common and useful entry points.

  • java.nio.file.Files;
  • java.nio.file.Path;
  • java.nio.file.Paths;

Starting with Path,

The Path class includes various methods that can be used to obtain information about the path, access elements of the path, convert the path to other forms, or extract portions of a path. Let’s see how to create a path

// Microsoft Windows syntax
Path path = Paths.get("C:\\home\\joe\\foo");

// Solaris syntax
Path path = Paths.get("/home/joe/foo");

Path contains all sorts of useful methods to get information about the file system. For instance, the relativize method can be used to construct a relative path between two paths. Paths can also be compared and tested against each other using the startsWith and endsWith methods.
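A quick sketch of the comparison methods (note that endsWith matches whole path elements, not arbitrary suffixes):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathComparisons {
    public static void main(String[] args) {
        Path path = Paths.get("/home/joe/foo");

        System.out.println(path.startsWith("/home"));  // true
        System.out.println(path.endsWith("foo"));      // true
        System.out.println(path.endsWith("oo"));       // false: "oo" is not a whole element
    }
}
```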

See the example for a Windows file system

Path path = Paths.get("C:\\Temp\\data\\git-new.png");
System.out.format("The path is %s%n", path.toString());
System.out.format("The path's parent is %s%n", path.getParent().toString());
System.out.format("The path's root is %s%n", path.getRoot().toString());

Path relativeRoot = Paths.get("C:\\Temp\\");

System.out.format("%nRelative root is %s%n", relativeRoot.toString());
Path relativePath = relativeRoot.relativize(path);

System.out.format("%nRelative path is %s%n", relativePath.toString());

And the output


The path is C:\Temp\data\git-new.png
The path's parent is C:\Temp\data
The path's root is C:\

Relative root is C:\Temp

Relative path is data\git-new.png

Moving onto Files,

This class consists exclusively of static methods that operate on files, directories, or other types of files. Files works like a breeze when used together with Path.

Let’s see a couple of uses

Reading and writing from/to a file 

Let us see two methods: one which reads from a file into a byte array, and another which writes a byte array to a file. Both of them work with Paths.


// The below methods are for small files

byte[] readSmallBinaryFile(String aFileName) throws IOException {
    Path path = Paths.get(aFileName);
    return Files.readAllBytes(path);
}

void writeSmallBinaryFile(byte[] aBytes, String aFileName) throws IOException {
    Path path = Paths.get(aFileName);
    Files.write(path, aBytes); // creates, overwrites
}

// In case of large files, use a BufferedWriter

Path path = Paths.get("large-out.txt"); // the target file (example path)
Charset charset = Charset.forName("US-ASCII");
String s = "";
try (BufferedWriter writer = Files.newBufferedWriter(path, charset)) {
    writer.write(s, 0, s.length());
} catch (IOException x) {
    System.err.format("IOException: %s%n", x);
}

Most beautiful way to move files,

Moving files used to be a bit cumbersome, considering the amount of code needed for so trivial an operation; see here.

The Files class has a static method move,

public static Path move(Path source,
        Path target,
        CopyOption... options)
                 throws IOException

This takes in two path parameters, one for the source and one for the target, with the third parameter being a varargs list of CopyOptions. Right now, there are three copy options, and there could be more.

See the example below,

import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;
import static java.nio.file.StandardCopyOption.REPLACE_EXISTING;

Path source = Paths.get("/var/tmp/source/out.log");
Path target = Paths.get("/var/tmp/destination/out.log");
Files.move(source,
           target,
           REPLACE_EXISTING,
           ATOMIC_MOVE);

And the copy options are pretty cool,

ATOMIC_MOVE – Move the file as an atomic file system operation.

COPY_ATTRIBUTES – Copy attributes to the new file.

REPLACE_EXISTING – Replace an existing file if it exists.

String intern() function and String Pool

Let’s start with the JVM spec..

The Java programming language requires that identical string literals (that is, literals that contain the same sequence of code points) must refer to the same instance of class String. In addition, if the method String.intern is called on any string, the result is a reference to the same class instance that would be returned if that string appeared as a literal. Thus, the following expression must have the value true:

In other words, if there is more than one string literal with the same value, they share the same memory.

Consider an example. An application X processes the results from a questionnaire which expects a YES/NO response, and there are more than a billion replies. If you store all those ‘YES’ and ‘NO’ values separately, you’ll end up clogging your precious memory. So why don’t we keep a single string for ‘YES’ and let the billion literals point to it? This is exactly what Java does, and the table where all such string literals are stored is called the String Pool.

String sampleOne = "Dylan";
String sampleTwo = "Dylan";
String sampleThree = new String("Dylan");

if (sampleOne == sampleTwo) {
    System.out.println("One and Two are same");
}

if (sampleOne == sampleThree) {
    System.out.println("One and Three are same");
}

In line 1, we create a String with the value “Dylan”. In this process, Java automatically adds a reference to the String “Dylan” in the String Pool. In line 2, when we try to create another string, Java sees that it is already available in the String Pool and just points the literal to that value. But in line 3, we use the new operator to create a string. This is a bit off the mark: the new operator forces Java to ignore the String Pool and create a new instance. This is the reason why line 6 is printed and line 10 is not… It also explains very well why it’s not advisable to use the new operator when creating a String.

intern() method

The docs tell us that the intern() method returns a canonical representation of the string. In other words, by using the intern() method, we are forcing Java to check the String Pool and return the reference from there. Let me modify the above example

String sampleOne = "Dylan";
String sampleTwo = "Dylan";
String sampleThree = new String("Dylan").intern();

if (sampleOne == sampleTwo) {
    System.out.println("One and Two are same");
}

if (sampleOne == sampleThree) {
    System.out.println("One and Three are same");
}

Here, both messages are printed.
Instead of creating a new instance in line 3, the intern() call forces Java to return the String from the String Pool.
Enough Said