Comparing .csv files and generating sweet HTML reports with rendersnake

Here is a simple library to perform a simple operation: compare two delimiter-separated files and output an .html report.

Simple and straightforward. It can be used to output a ‘Comparison Result’ object, or it can be used to print out an .html report.

Features

  1. Compare methodology can be customized using a properties file
  2. Report can be customized using a properties file

It uses rendersnake, which is pretty awesome. It is available on github, and v1.0 is somewhere between the first cry of the baby and kindergarten.

Here is the gitlab link.

See a report sample,

csvCompare

enjoy..

Git Internals – The commit tree

Git stores a commit as trees and blobs, and a tree object can have both trees and blobs as its children. An example of storage as a blob can be seen here.

See the file structure of a project below


nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ tree
.
|-- about.html
|-- css
| `-- simple.css
|-- index.html
`-- list.html

1 directory, 4 files

It is pretty straightforward, with three .html files and one .css file in a separate directory. Please note that all these files are committed, which means they have already been written out to the underlying storage. To make sure, we can use the really useful git log command,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git log
commit 300f5c42a5aed68268547a95db4f40b6b122fb5b
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 16:52:54 2014 +0530

initial commit

It shows only one commit has been made. It also returns a hash value that is associated with the commit.

The cat-file command we overused and abused in the previous post can be used to see the tree corresponding to the last commit. We pass the hash value of the commit as a parameter.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 300f5c42a5aed68268547a95db4f40b6b122fb5b
tree 5d0d7785b65180c195f7a1bf3cf02218b56f6f0a
author nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394968974 +0530
committer nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394968974 +0530

initial commit

So, the commit hash points to a tree, whose hash is displayed. Let us see what the tree contains,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 5d0d7785b65180c195f7a1bf3cf02218b56f6f0a
100644 blob 09e16f36b3c4993ba924b1074629283a49869be9 about.html
040000 tree 02ff2e2946f969bc640886861ff8c7039e1a2339 css
100644 blob 9015a7a32ca0681be64471d3ac2f8c1f24c1040d index.html
100644 blob b92b8b70267846c8b21b5ad412666cb99f9c9211 list.html

This tree contains three blobs for the three .html files and another tree for the css directory. Let us go into this tree,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 02ff2e2946f969bc640886861ff8c7039e1a2339
100644 blob dac138d9e013a2e9a10e67d793bd4703c1b86bd1 simple.css

It contains the .css file. So, the entire structure looks something like this.

commit one

Now, let’s make a small change to the index.html file and do a second commit. After this, this is what the log looks like,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git log
commit ec8b103771588498923711c036ff3280c863f713
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 17:36:33 2014 +0530

second commit

commit 300f5c42a5aed68268547a95db4f40b6b122fb5b
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 16:52:54 2014 +0530

initial commit

There are two commits, and they have different hash values,

First Commit : 300f5c42a5aed68268547a95db4f40b6b122fb5b (Initial Commit)

Second Commit : ec8b103771588498923711c036ff3280c863f713 (Second Commit)

Let us follow the second commit tree like we did before,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p ec8b103771588498923711c036ff3280c863f713
tree 455e56fbce4fe35ee64d9e7af572e5b0adef14f6
parent 300f5c42a5aed68268547a95db4f40b6b122fb5b
author nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394971593 +0530
committer nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394971593 +0530

second commit

The second commit is a child of the first commit, which is exactly what tools like gitk show. A commit contains a reference to its parent commits. While there is usually just a single parent (for a linear history), a commit can have any number of parents, in which case it’s usually called a merge commit. Most workflows will only ever make you do merges with two parents, but you can really have any other number too.

Going deeper into the tree,
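The parent link can be pictured with a small Java sketch. The Commit class and the in-memory map below are hypothetical stand-ins for Git's object store, not how Git is actually implemented; only the hashes and messages come from the listings in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommitWalk {

    // Hypothetical model of a commit object: a tree hash, parent hashes and a message.
    static final class Commit {
        final String tree;
        final List<String> parents;
        final String message;

        Commit(String tree, List<String> parents, String message) {
            this.tree = tree;
            this.parents = parents;
            this.message = message;
        }
    }

    // The two commits from this post, keyed by their hashes.
    static Map<String, Commit> sampleStore() {
        Map<String, Commit> store = new HashMap<>();
        store.put("300f5c42a5aed68268547a95db4f40b6b122fb5b",
                new Commit("5d0d7785b65180c195f7a1bf3cf02218b56f6f0a",
                        Collections.<String>emptyList(), "initial commit"));
        store.put("ec8b103771588498923711c036ff3280c863f713",
                new Commit("455e56fbce4fe35ee64d9e7af572e5b0adef14f6",
                        Arrays.asList("300f5c42a5aed68268547a95db4f40b6b122fb5b"),
                        "second commit"));
        return store;
    }

    // Follows the first parent of each commit until a commit without parents is reached.
    static List<String> history(Map<String, Commit> store, String head) {
        List<String> messages = new ArrayList<>();
        String current = head;
        while (current != null) {
            Commit commit = store.get(current);
            if (commit == null) {
                break; // unknown hash: stop walking
            }
            messages.add(commit.message);
            current = commit.parents.isEmpty() ? null : commit.parents.get(0);
        }
        return messages;
    }

    public static void main(String[] args) {
        // Prints [second commit, initial commit]
        System.out.println(history(sampleStore(), "ec8b103771588498923711c036ff3280c863f713"));
    }
}
```

Walking from the newest commit back through the parent links is roughly what git log does for a linear history.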


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 455e56fbce4fe35ee64d9e7af572e5b0adef14f6
100644 blob 09e16f36b3c4993ba924b1074629283a49869be9 about.html
040000 tree 02ff2e2946f969bc640886861ff8c7039e1a2339 css
100644 blob b110c44fd08f191062636f18cfeeaeccd5be1b73 index.html
100644 blob b92b8b70267846c8b21b5ad412666cb99f9c9211 list.html

The most interesting thing to note here is that all the hash values, except the one for index.html, remain the same.

The two commit trees look something like this.

commit two

In the second commit tree, the hash values for the unchanged files are the same as in the previous commit tree. Now, think of the hash values as pointers to file contents. The second tree simply points to the same objects for the unchanged files.
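To make the comparison concrete, here is a minimal Java sketch that treats each tree listing as a map from name to hash and reports the entries whose hash differs. The TreeDiff class and its method names are made up for illustration; the hashes are copied from the two cat-file listings above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TreeDiff {

    // Entries of the first commit's tree (name -> object hash), from the listing above.
    static Map<String, String> firstTree() {
        Map<String, String> tree = new LinkedHashMap<>();
        tree.put("about.html", "09e16f36b3c4993ba924b1074629283a49869be9");
        tree.put("css", "02ff2e2946f969bc640886861ff8c7039e1a2339");
        tree.put("index.html", "9015a7a32ca0681be64471d3ac2f8c1f24c1040d");
        tree.put("list.html", "b92b8b70267846c8b21b5ad412666cb99f9c9211");
        return tree;
    }

    // Entries of the second commit's tree: only index.html has a new hash.
    static Map<String, String> secondTree() {
        Map<String, String> tree = new LinkedHashMap<>(firstTree());
        tree.put("index.html", "b110c44fd08f191062636f18cfeeaeccd5be1b73");
        return tree;
    }

    // Returns the names whose hash is new or different in newTree.
    static List<String> changedEntries(Map<String, String> oldTree, Map<String, String> newTree) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> entry : newTree.entrySet()) {
            if (!entry.getValue().equals(oldTree.get(entry.getKey()))) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Prints [index.html]
        System.out.println(changedEntries(firstTree(), secondTree()));
    }
}
```

Because equal content always gets the same hash, comparing hashes is enough to tell which entries changed, without reading any file contents.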

Well, that’s it.. commit trees…

Git Internals – Basic object data storage

Well, what makes git super fast? A look into git’s underbelly..

Before I begin, I will set up an empty repository.

nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git init
Initialized empty Git repository in /home/nikhil/dev/blog/git/.git/
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ ls -a
. .. .git

Initializing the repository creates a .git directory. Let us look at the contents of that directory. As you can see, the objects folder is empty: Git has initialized the objects directory and created pack and info subdirectories in it, but there are no regular files.

nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git$ tree
.
|-- branches
|-- config
|-- description
|-- HEAD
|-- hooks
| |-- applypatch-msg.sample
| |-- commit-msg.sample
| |-- post-update.sample
| |-- pre-applypatch.sample
| |-- pre-commit.sample
| |-- prepare-commit-msg.sample
| |-- pre-rebase.sample
| `-- update.sample
|-- info
| `-- exclude
|-- objects
| |-- info
| `-- pack
`-- refs
    |-- heads
    `-- tags

9 directories, 12 files

At the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time. To demonstrate, you can use the plumbing command hash-object, which takes some data, stores it in your .git directory, and gives you back the key the data is stored as. Note that hash-object is a plumbing command and is not meant for day-to-day use.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git$ echo 'supercompiler' | git hash-object -w --stdin
755eb4004ee1ac36d0dd51008ed6279c2fb200e5

The -w tells hash-object to store the object; otherwise, the command simply tells you what the key would be. --stdin tells the command to read the content from stdin; if you don’t specify this, hash-object expects the path to a file. The output from the command is a 40-character checksum hash. This is the SHA-1 hash — a checksum of the content you’re storing plus a header.

Let us move to the objects directory and see how the file is stored,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git/objects$ tree
.
|-- 75
| `-- 5eb4004ee1ac36d0dd51008ed6279c2fb200e5
|-- info
`-- pack

3 directories, 1 file

You can see a file in the objects directory. This is how Git stores the content initially — as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA, and the filename is the remaining 38 characters.
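The naming rule is easy to sketch in Java. The LooseObjectPath class below is a hypothetical helper, not part of Git or any Git library; it only splits a 40-character hash into the directory and file name described above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class LooseObjectPath {

    // Maps a 40-character SHA-1 hex string to <gitDir>/objects/<first 2 chars>/<remaining 38 chars>.
    static Path looseObjectPath(Path gitDir, String sha1Hex) {
        if (sha1Hex.length() != 40) {
            throw new IllegalArgumentException("expected a 40-character SHA-1, got: " + sha1Hex);
        }
        return gitDir.resolve("objects")
                .resolve(sha1Hex.substring(0, 2))
                .resolve(sha1Hex.substring(2));
    }

    public static void main(String[] args) {
        // On a Unix-like system this prints .git/objects/75/5eb4004ee1ac36d0dd51008ed6279c2fb200e5
        System.out.println(looseObjectPath(Paths.get(".git"),
                "755eb4004ee1ac36d0dd51008ed6279c2fb200e5"));
    }
}
```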

You can pull the content back out of Git with the cat-file command. This command is sort of a Swiss army knife for inspecting Git objects. Passing -p to it instructs the cat-file command to figure out the type of content and display it nicely for you.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git/objects$ git cat-file -p 755eb4004ee1ac36d0dd51008ed6279c2fb200e5
supercompiler

Ok, let us play around a bit.

I am creating a v1 of a file and writing it to the repository, followed by modifying the file and writing the v2 to the repository. We can see the contents of both versions using the cat-file command, and see a total of three different objects stored within the objects directory.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ echo "version 1" > manual.txt
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git hash-object -w manual.txt
83baae61804e65cc73a7201a7252750c76066a30
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30
version 1
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ echo "version 2" > manual.txt
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git hash-object -w manual.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
version 2
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ tree .git/objects/
.git/objects/
|-- 1f
| `-- 7a7a472abf3dd9643fd615f6da379c4acb3e3a
|-- 75
| `-- 5eb4004ee1ac36d0dd51008ed6279c2fb200e5
|-- 83
| `-- baae61804e65cc73a7201a7252750c76066a30
|-- info
`-- pack

5 directories, 3 files

You can have Git tell you the object type of any object in Git, given its SHA-1 key, with cat-file -t:

nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -t 83baae61804e65cc73a7201a7252750c76066a30
blob

Now, there are two things that have to be mentioned here.

  1. Git does not store the file. Git stores only the contents.
  2. The contents are stored as a blob object
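Both points can be checked outside Git with a few lines of Java. This is a sketch, assuming the blob header has the form "blob <size in bytes>" followed by a NUL byte; the BlobHash class is made up for illustration. The file name never enters the calculation, only the content does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BlobHash {

    // Computes the SHA-1 of "blob <length>\0" followed by the content, as a 40-character hex string.
    static String blobHash(byte[] content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] header = ("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII);
            sha1.update(header);
            sha1.update(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        // echo "version 1" writes the text plus a trailing newline, so the content is 10 bytes.
        // Prints 83baae61804e65cc73a7201a7252750c76066a30, the key shown for manual.txt above.
        System.out.println(blobHash("version 1\n".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Two files with different names but identical contents would therefore map to the same blob object.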

How to : git stash

Stashing is a great way to pause what you’re currently working on and come back to it later. Suppose you are working on something, and suddenly, something high priority comes up, like a production bug. Ah, let’s see.. a storyboard can make things light..

So, wonderful morning, and you wanted to do some performance improvements to the code..

So, you create a new branch and make the changes, and you see the changes when you run git status.

$ git checkout -b performance
Switched to a new branch 'performance'

vi JobPersist.java

$ git status
# On branch performance
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: JobPersist.java
#
no changes added to commit (use "git add" and/or "git commit -a")

But I am not done, and not ready to commit the changes. Then, something important comes up. You don’t have time to complete your work and will have to do something else.

So, to get the latest code, you do a pull, and see what happens.

$ git pull
remote: Counting objects: 51, done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 51 (delta 15), reused 37 (delta 9)
Unpacking objects: 100% (51/51), done.
From https://github.com/nikhilkuria/jobRunner
f175a2b..987a41f master -> origin/master
* [new branch] performance -> origin/performance
Updating f175a2b..987a41f
error: Your local changes to the following files would be overwritten by merge:
jobRunner/src/com/job/persist/JobPersist.java
Please, commit your changes or stash them before you can merge.
Aborting

So, git tells me to either commit or stash. Since I am not ready to commit, I just shove all the work away under the sheet and resume working on it when I am done with the new work. Stashing is the best way to do this.

The stash command in Git is kind of like a clipboard; you can stash away any changes in your current branch to work on something else for a while. You can change branches and perform other commits.

$ git stash
Saved working directory and index state WIP on master: f175a2b Adding relationships
HEAD is now at f175a2b Adding relationships

So, git stash reset my working directory to the last commit (HEAD) and saved the changes to my tracked files in a ‘stash’, making the working directory clean.

$ git status
# On branch master
nothing to commit, working directory clean

I can also see the list of all my stashes,

$ git stash list
stash@{0}: WIP on master: f175a2b Adding relationships

So, finally, you are done with the new job and are ready to continue working on the performance issue. You can bring back the stash using the command git stash apply.

$ git stash apply
# On branch master
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: JobPersist.java
#
no changes added to commit (use "git add" and/or "git commit -a")

So, this brought back my previous work and I can complete this at peace.

There are cases when you can have multiple stashes,

$ git stash list
stash@{0}: WIP on master: eec8b6f fix for prod issue
stash@{1}: WIP on master: ba8de07 initial commit

So, if you need to go back to a specific stash, you could pass the stash id as a param to git stash apply

$ git stash apply stash@{1}
# On branch master
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: JobPersist.java
#
no changes added to commit (use "git add" and/or "git commit -a")

The changes to your files were reapplied, but the file you staged before wasn’t restaged. To do that, you must run the git stash apply command with a --index option to tell the command to try to reapply the staged changes. If you had run that instead, you’d have gotten back to your original position:

$ git stash apply stash@{4} --index
# On branch master
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
# modified: JobPersist.java

You could also resume your work by creating a new branch from the stash (highly recommended). Running
git stash branch <branchname> [<stash>]
creates a new branch for you, checks out the commit you were on when you stashed your work, reapplies your work there, and then drops the stash if it applies successfully.

$ git stash branch performance-new stash@{4}
Switched to a new branch 'performance-new'
# On branch performance-new
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: JobPersist.java
#
no changes added to commit (use "git add" and/or "git commit -a")
Dropped stash@{4} (448b5faf608729000757f65a8d309a8e3fc3aa8d)
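The stash@{n} numbering is easier to remember if you picture the stash as a stack. The Java sketch below is only a toy model, with a made-up StashStack class holding plain strings. It mirrors the behaviour seen above: the newest entry is always stash@{0}, apply leaves the entry in the list, and drop removes it.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class StashStack {

    // Newest entry first, so index 0 corresponds to stash@{0}.
    private final LinkedList<String> entries = new LinkedList<>();

    // Models "git stash": the new entry becomes stash@{0}.
    void save(String description) {
        entries.addFirst(description);
    }

    // Models "git stash apply stash@{index}": returns the entry and keeps it in the list.
    String apply(int index) {
        return entries.get(index);
    }

    // Models "git stash drop stash@{index}": removes the entry, later ones move up.
    String drop(int index) {
        return entries.remove(index);
    }

    // Models "git stash list".
    List<String> list() {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            lines.add("stash@{" + i + "}: " + entries.get(i));
        }
        return lines;
    }

    // The two stashes from the listing above, saved oldest first.
    static StashStack sample() {
        StashStack stash = new StashStack();
        stash.save("WIP on master: ba8de07 initial commit");
        stash.save("WIP on master: eec8b6f fix for prod issue");
        return stash;
    }

    public static void main(String[] args) {
        StashStack stash = sample();
        for (String line : stash.list()) {
            System.out.println(line);
        }
        System.out.println("applied: " + stash.apply(1));
        System.out.println("entries left: " + stash.list().size()); // still 2
    }
}
```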

You could read further here and here.

Working with paths and files in Java 7

Well, lots is happening in the latest Java upgrade, and ironically most of us are still comfortable with Java 5 or 6. Enough said.

The changes in NIO are quite impressive. Find below a quick reference for paths. The new classes are found in the java.nio.file package. Let’s look through the most common and useful entry points.

  • java.nio.file.Files;
  • java.nio.file.Path;
  • java.nio.file.Paths;

Starting with Path,

The Path interface includes various methods that can be used to obtain information about the path, access elements of the path, convert the path to other forms, or extract portions of a path. Let’s see how to create a path.

// Microsoft Windows syntax
Path path = Paths.get("C:\\home\\joe\\foo");

// Solaris syntax
Path path = Paths.get("/home/joe/foo");

Path contains all sorts of useful methods to get information about a path. For instance, the relativize method can be used to construct a relative path between two paths. Paths can also be compared and tested against each other using the startsWith and endsWith methods.

See the example for a Windows file system

Path path = Paths.get("C:\\Temp\\data\\git-new.png");
System.out.format("The path is %s%n", path);
System.out.format("The parent of the path is %s%n", path.getParent());
System.out.format("The root of the path is %s%n", path.getRoot());

Path relativeRoot = Paths.get("C:\\Temp\\");

System.out.format("%nThe relative root path is %s%n", relativeRoot);
Path relativePath = relativeRoot.relativize(path);

System.out.format("%nThe relative path is %s%n", relativePath);

And the output


The path is C:\Temp\data\git-new.png
The parent of the path is C:\Temp\data
The root of the path is C:\

The relative root path is C:\Temp

The relative path is data\git-new.png
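The comparison methods mentioned earlier can be tried out with relative paths, which behave the same on any operating system. This is a small sketch; the PathChecks class and the relativeTo helper are names made up for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathChecks {

    // Returns the path of file relative to base, e.g. Temp and Temp/data/git-new.png give data/git-new.png.
    static Path relativeTo(Path base, Path file) {
        return base.relativize(file);
    }

    public static void main(String[] args) {
        Path base = Paths.get("Temp");
        Path file = Paths.get("Temp", "data", "git-new.png");

        System.out.println(file.startsWith(base));               // true
        System.out.println(file.endsWith("git-new.png"));        // true
        // endsWith compares whole name elements, not characters
        System.out.println(file.endsWith("new.png"));            // false

        Path relative = relativeTo(base, file);
        System.out.println(relative.getNameCount());             // 2 (data, git-new.png)
        // resolve is the inverse of relativize here
        System.out.println(base.resolve(relative).equals(file)); // true
    }
}
```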

Moving onto Files,

This class consists exclusively of static methods that operate on files, directories, or other types of files. Files works like a breeze using Path.

Let’s see a couple of uses.

Reading and writing from/to a file

Let us see two methods, one which reads from a file into a byte array and another which writes to a file from a byte array. Both of them work with Paths.


// The below methods are for small files

byte[] readSmallBinaryFile(String aFileName) throws IOException {
    Path path = Paths.get(aFileName);
    return Files.readAllBytes(path);
}

void writeSmallBinaryFile(byte[] aBytes, String aFileName) throws IOException {
    Path path = Paths.get(aFileName);
    Files.write(path, aBytes); // creates the file, or overwrites it if it exists
}

// In case of large files, use a BufferedWriter
// (path is the Path of the file to write, obtained with Paths.get as above)

Charset charset = Charset.forName("US-ASCII");
String s = "";
try (BufferedWriter writer = Files.newBufferedWriter(path, charset)) {
    writer.write(s, 0, s.length());
} catch (IOException x) {
    System.err.format("IOException: %s%n", x);
}
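Here is a self-contained round trip you can run to see the two small-file calls working together. It is a minimal sketch: the SmallFileRoundTrip class is made up for the example, it writes to a temporary file, and it wraps IOException in an unchecked exception only to keep the example short.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class SmallFileRoundTrip {

    // Writes the bytes to a temporary file, reads them back, and deletes the file.
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("roundtrip", ".bin");
            try {
                Files.write(tmp, data);          // creates or overwrites
                return Files.readAllBytes(tmp);  // whole file in one call
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] original = {1, 2, 3};
        // Prints true
        System.out.println(Arrays.equals(original, roundTrip(original)));
    }
}
```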

Most beautiful way to move files,

Moving files was a bit cumbersome, considering the amount of code needed for so trivial an operation, see here.

The Files class has a static method move,

public static Path move(Path source,
        Path target,
        CopyOption... options)
                 throws IOException

This takes in two path parameters, one for the source and one for the target. The third parameter is a variable-length list of CopyOptions. Right now, StandardCopyOption defines three copy options, and there could be more.

See the example below,

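// Needs: import static java.nio.file.StandardCopyOption.*;
Path source = Paths.get("/var/tmp/source/out.log");
Path target = Paths.get("/var/tmp/destination/out.log");
// When ATOMIC_MOVE is given, all other options are ignored
Files.move(source,
           target,
           REPLACE_EXISTING,
           ATOMIC_MOVE);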

And the copy options are pretty cool,

ATOMIC_MOVE
Move the file as an atomic file system operation.

COPY_ATTRIBUTES
Copy attributes to the new file. This one is meant for Files.copy; Files.move supports only the other two.

REPLACE_EXISTING
Replace an existing file if it exists.
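The effect of REPLACE_EXISTING is easy to see in a temporary directory. The sketch below uses a made-up MoveExample class and made-up file names. It moves one file over another with the option, and then shows that the same move without it is rejected with a FileAlreadyExistsException.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveExample {

    // Moves source over an existing target and returns the target's content afterwards.
    static String moveReplacing(String sourceText, String targetText) {
        try {
            Path dir = Files.createTempDirectory("move-demo");
            Path source = dir.resolve("source.log");
            Path target = dir.resolve("target.log");
            try {
                Files.write(source, sourceText.getBytes(StandardCharsets.UTF_8));
                Files.write(target, targetText.getBytes(StandardCharsets.UTF_8));
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new IllegalStateException("move failed", e);
        }
    }

    // Returns true if moving onto an existing target without REPLACE_EXISTING is rejected.
    static boolean moveWithoutReplaceIsRejected() {
        try {
            Path dir = Files.createTempDirectory("move-demo");
            Path source = dir.resolve("source.log");
            Path target = dir.resolve("target.log");
            try {
                Files.write(source, new byte[] {1});
                Files.write(target, new byte[] {2});
                Files.move(source, target); // no options: target must not exist
                return false;
            } catch (FileAlreadyExistsException expected) {
                return true;
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new IllegalStateException("move failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(moveReplacing("new", "old"));    // new
        System.out.println(moveWithoutReplaceIsRejected()); // true
    }
}
```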