Implementing Word Ladder game using Neo4j

Word Ladder is a pretty popular game invented by Lewis Carroll, who is better known as the author of Alice in Wonderland.

In a word ladder puzzle you must make the change occur gradually by changing one letter at a time. At each step you must transform one word into another word, you are not allowed to transform a word into a non-word. For instance, the word “FOOL” can be transformed into the word “SAGE” as

FOOL
POOL
POLL
POLE
PALE
SALE
SAGE

There are many variations of the word ladder puzzle. For example you might be given a particular number of steps in which to accomplish the transformation, or you might need to use a particular word. In this section we are interested in figuring out the smallest number of transformations needed to turn the starting word into the ending word.

However, the best possible solution of the Word Ladder problem is using graphs, and with Neo4j, it is even easier.

Here is an outline of where we are going:

  • Represent the relationships between the words as a graph.
  • Use the graph algorithm known as breadth first search to find an efficient path from the starting word to the ending word.

Building the word ladder graph

Our first problem is to figure out how to turn a large collection of words into a graph. What we would like is to have a relation from one word to another if the two words are only different by a single letter. If we can create such a graph, then any path from one word to another is a solution to the word ladder puzzle.

Writing the words onto the graph is another beautiful problem altogether.

We could use several different approaches to create the graph we need to solve this problem. Let’s start with the assumption that we have a list of words that are all the same length. As a starting point, we can create a vertex in the graph for every word in the list. To figure out how to connect the words, we could compare each word in the list with every other. When we compare we are looking to see how many letters are different. If the two words in question are different by only one letter, we can create an edge between them in the graph. For a small set of words that approach would work fine; however let’s suppose we have a list of 5,110 words. Roughly speaking, comparing one word to every other word on the list is an O(n2)algorithm. For 5,110 words, n2 is more than 26 million comparisons.

We can do much better by using the following approach. Suppose that we have a huge number of buckets, each of them with a four-letter word on the outside, except that one of the letters in the label has been replaced by an underscore. For example, consider the image below, we might have a bucket labeled “pop_.” As we process each word in our list we compare the word with each bucket, using the ‘_’ as a wildcard, so both “pope” and “pops” would match “pop_.” Every time we find a matching bucket, we put our word in that bucket. Once we have all the words in the appropriate buckets we know that all the words in the bucket must be connected.

wordbuckets

In java, this can be implemented using a Map as the obvious solution.. see the snippet below


for (String word : words) {
 for (int wordIndex = 0; wordIndex < word.length(); wordIndex++) {
   String keyWord = word.substring(0, wordIndex)+"_"+word.substring(wordIndex+1,word.length());
   if(!wordMap.containsKey(keyWord)){
    wordMap.put(keyWord, new ArrayList<String>());
   }
   wordMap.get(keyWord).add(word);
 }
}

Once we have the ‘well crafted’ map ready, it is easy to write this into a neo4j database.

  • For every word, a separate node is introduced
  • For every word under the same bucket, a relationship is made

Note : Neo4j don’t support the use of non-directional relationships. So, we go ahead and create a directional relationship and ignore the direction when we traverse the graph.

To write to the graph easily, a wrapper like the one shown below can be used,

public class WordGraph {

Map<String, Node> nodeMap;
GraphDatabaseService graphDb;

public WordGraph(){
  graphDb = NeoDatabaseHandler.getGraphDatabase();
  nodeMap = new HashMap<String, Node>();
}

private Node getNode(String word){
  if(!nodeMap.containsKey(word)){
   Node node = graphDb.createNode(NeoHelper.WordLabel.WORD);
   node.setProperty("word", word);
   nodeMap.put(word, node);
  }
  return nodeMap.get(word);
}

public void addEdge(String parentWord, String childWord){
  Node parentNode = getNode(parentWord);
  Node childNode = getNode(childWord);
  parentNode.createRelationshipTo(childNode, NeoHelper.RelationshipTypes.MOVE);
}
}

Whenever we need an edge created, we just call the method, addEdge and it will take care of the rest.

So, we iterate over the buckets and write the words and their relationships to the graph,

for (List<String> mappedWordsParent : wordMap.values()) {
  for (String parentWord : mappedWordsParent) {
    for (String childWord : mappedWordsParent) {
      if(!childWord.equals(parentWord)){
        wordGraph.addEdge(parentWord, childWord);
      }
    }
  }
}

The graph created using a minimum set of words will look like the one below, (apologies for the clutter in the relationship names)

Untitled

 

Once, we have the graph ready, we can do a decent traversal to get the required path. Let us assume, we are trying to find a path from fool to sage

The graph algorithm we are going to use is called the “breadth first search” algorithm. Breadth first search (BFS) is one of the easiest algorithms for searching a graph.

From a very neat post that i came across

Given a graph G and a starting vertex s, a breadth first search proceeds by exploring edges in the graph to find all the vertices in G for which there is a path from s. The remarkable thing about a breadth first search is that it finds all the vertices that are a distance k from s before it finds any vertices that are a distance k+1. One good way to visualize what the breadth first search algorithm does is to imagine that it is building a tree, one level of the tree at a time. A breadth first search adds all children of the starting vertex before it begins to discover any of the grandchildren.

Implementing this on our own, let us save that for another day. Neo4j provides a neat Traversal Framework.

Lets have a look at the code,

graphDb
.traversalDescription().breadthFirst()
                       .relationships(NeoHelper.RelationshipTypes.MOVE)
                       .evaluator(Evaluators.includeWhereEndNodeIs(endNode))
                       .traverse(startNode)

It’s pretty straight forward. We traverse through relationships named Move ( which is the only relationship we have in the Graph ). We use an Evaluator, which decides what to be returned as the output of a traversal, and in this case, we pass the node corresponds to the endword. So, whenever the traversal reaches the endword ( in this case sage), it returns the corresponding path. The complete code can be found in github.

Once, we have the path, printing it to standard out yields,

fool
pool
poll
pall
pale
sale
sage

Which is pretty much what we expected. You could try this out own your own as the entire project is available on github.

Much content has been adopted from this book, which is the ONE place you can start to understand data structures in python.

 

Adding DOM like features to Neo4j xml graph

This is an extension to my previous post on how to convert an xml format data to a neo4j graph. For instance, see the graph generated an xml file in the following format

<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
.
.
.
</catalog>

And the neo4j graph visualization below..

Catlog-xml

There are 4 <book> nodes under <catlog> and for each book, the associated metadata. Altogether it’s a pretty standard xml.

Most of our xml operations involve traversing the xml. JAXP is bread and butter for any developer working with xml in Java. I have done my bit of xml parsing and the interface provided is so wonderful (and resource intensive because of the DOM).

Similar flexibility is possible for the xmlGraph too..

I have tried to implemented the following methods..

  • getTags – Fetches the element objects for the specified xml tag
  • getParent – Fetches the parent element object
  • getChildren – Fetches the child element object
  • getSiblings – Fetches the siblings of the element

The implementation of the methods are in a separate service Facade..

public interface XmlTreeService {
  public List<XmlElement> getTags(String tagName);
  public List<XmlElement> getChildren(XmlElement parent);
  public XmlElement getParent(XmlElement child);
  public List<XmlElement> getSiblings(XmlElement element);
}

Implementation Details

Fetching the list of Tags, (the getTags method)

The best solution was to use the GlobalGraphOperations interface. As discussed before, while saving the xml as a graph, each node has it’s tag name as a label. For instance, the node representing a book node will have the label, ‘book‘ associated with it.

node-label-cropped

 

It can be seen the TAG property ‘book‘ is also a label.

This means fetching all Tags of a specified name resolves to fetching all nodes with a specific label, which is easy using GlobalGraphOperations

GlobalGraphOperations globalGraph = getGlobalGraphOperations();
Label label = DynamicLabel.label(tagName);
ResourceIterable<Node> nodes;
try (Transaction tx = graphDb.beginTx()) {
 nodes = globalGraph.getAllNodesWithLabel(label);
 for (Node node : nodes) {
   XmlElement element = getXmlElement(node);
   elements.add(element);
}
 tx.success();
} catch (IOException e) {
 LOGGER.severe(e.getMessage());
 e.printStackTrace();
}

 

Fetching the parent and children

In the graph we created, between a child tag and the parent tag in the xml, there exists a relationship, CHILD_OF from the child to the parent

child-of-crop

Implies finding the lineage is about finding the relationship of the node and fetching the element. For instance, see how the children nodes are fetched,

 

public List getChildren(XmlElement parent) {
 GraphDatabaseService graphDb = Neo4jDatabaseHandler.getGraphDatabase();
 List childElementList = new ArrayList<>();
 try(Transaction tx = graphDb.beginTx()) {
  Node node = graphDb.getNodeById(parent.getId());
  Iterable childRelations = node.getRelationships(Direction.INCOMING);
  Iterator relationshipIterator = childRelations.iterator();
  while(relationshipIterator.hasNext()) {
    Relationship relationship = relationshipIterator.next();
    Node childNode = relationship.getStartNode();
    XmlElement childElement = getXmlElement(childNode);
    childElementList.add(childElement);
  }
} catch (IOException e) {
// TODO Auto-generated catch block
 e.printStackTrace();
}

The process can be outlined as

  1. Fetch the node from GraphDatabaseService using the id which is stored in the XmlElement Object.
  2. Fetch the outgoing relationships which are INCOMING to the parent node.
  3. Iterate through the relationships and fetch the starting nodes.

The same method can be used to identify the parent node too..

Fetching Sibling Nodes

To fetch the sibling nodes, first fetch the parent node and then get the children. Also remove the node in question. Finding siblings is quite elementary.

You can always find the complete source code in github.

Summary

To roll all balls at once, let’s try to do a small test,

GraphDatabaseService graphDb = Neo4jDatabaseHandler.getGraphDatabase();
XmlTreeServiceGraph treeServiceGraph = new XmlTreeServiceGraph(graphDb);
System.out.println("-----Test for DOM like methods on XmlGraph----\n");
List<XmlElement> groupElements = treeServiceGraph.getTags("book");
System.out.println("Fetching all Book Nodes...");
for (XmlElement xmlElement : groupElements) {
  System.out.println(xmlElement.getAtrributeString());
}
XmlElement firstElement = groupElements.get(0);
System.out.println("\nFetching children of the first Book Node : " + firstElement.getAtrributeString() + "...");
List<XmlElement> childrenElements = treeServiceGraph.getChildren(firstElement);
for (XmlElement xmlElement : childrenElements) {
  System.out.println(xmlElement.getTagName()+ " : " + xmlElement.getTagValue());
}
XmlElement child = childrenElements.get(0);
System.out.println("\nFetching parent of the first child element : "+child.getTagName()+" : "+child.getTagValue() + "...");
XmlElement parentElement = treeServiceGraph.getParent(child);
System.out.println(parentElement.getTagName() + " : " + parentElement.getAtrributeString());
System.out.println("\nFetching siblings of the first Book Node : " + firstElement.getAttributes());
List<XmlElement> elementSibling = treeServiceGraph.getSiblings(firstElement);
for (XmlElement xmlElement : elementSibling) {
  System.out.println(xmlElement.getTagName() + " : " + xmlElement.getAtrributeString());
}

 

P.S : I know the test code is crappy, but for a demo test, this will do.. 🙂

So, it’s quite straightforward, I am doing the following steps

  1. Fetch all ‘book’ nodes in the xml, and print them
  2. Fetch all the children of the first ‘book’ node, and print them
  3. Fetch the parent of the first child from step #2, which should give us our first book node, and print them
  4. Fetch the siblings of the first node, and print them

Let’s see how the output looks like,

-----Test for DOM like methods on XmlGraph----

Fetching all Book Nodes...
{"id":"bk101"}
{"id":"bk102"}
{"id":"bk103"}
{"id":"bk104"}

Fetching children of the first Book Node : {"id":"bk101"}...
description : An in-depth look at creating applications with XML.
publish_date : 2000-10-01
price : 44.95
genre : Computer
title : XML Developer's Guide
author : Gambardella, Matthew

Fetching parent of the first child element : description : An in-depth look at creating applications with XML....
book : {"id":"bk101"}

Fetching siblings of the first Book Node : {id=bk101}
book : {"id":"bk104"}
book : {"id":"bk103"}
book : {"id":"bk102"}

It looks good, and everything as expected… 🙂

The Buendia Family Tree as a Neo4j graph – Tribute to Gabo

One of the most significant and prominent authors passed away last fortnight – Gabriel Garcia Marquez. I read One Hundred Years of Solitude during my senior college years. Magical Realism has long been associated with the vernacular literature in my home state.

gabriel-garcia-marquez

The Buendia family in One Hundred Years of Solitude is another epic on it’s own. See a very nice graphic family tree below.

100_years_of_solitude_family_t_by_clothos

I always wanted to visualize a family tree using a graph, and the Buendias of Macondo seemed so perfect. My other choice was the political families of India, which is pretttyyy vaasttt..

The visualization is pretty straightforward, two types of nodes

  • Male
  • Female

And 4 simple relationships

  • HUSBAND
  • FATHER
  • MOTHER
  • MISTRESS (Sorry, could not find an euphemism for that)

final-buendia-crop

Apologies if it is too cluttered, but you could play around with the graphgist

Visualizing an xml as a graph – Neo4j 101

Neo4j is quite awesome for pretty much everything. Here is an extract I found from this blog

Whenever someone gives you a problem, think graphs. They are the most fundamental and flexible way of representing any kind of a relationship, so it’s about a 50-50 shot that any interesting design problem has a graph involved in it. Make absolutely sure you can’t think of a way to solve it using graphs before moving on to other solution types.

As a baby step, let us try to visualize an xml as a graph.

We can start with a simple xml here,

<?xml version="1.0" encoding="UTF-8"?>
 <breakfast_menu>
  <food>
   <name attr="one">Belgian Waffles</name>
   <price>$5.95</price>
   <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
   <calories>650</calories>
  </food>
 </breakfast_menu>

It is a simple ‘Breakfast Menu’ with just one item, ‘Belgian Waffles‘.

Neo4j has a neat interface which lets us visualize the graph we created. The graph for the above xml looks slick..

small-graph-xml

 

The above XML has 6 Tags and we have 6 nodes in our graph. The nested tags/Nodes has a ‘CHILD-OF’ relationship with the Parent tag/Node.

In the graph above,

  • Node numbered 0 is the root node – Corresponds to the <breakfast_menu> tag
  • Node numbered 1 is the immediate child of the root – Corresponds to the <food> tag
  • Nodes numbered from 2-4 are the tags which come under <food> tag, viz <name>, <price>, <description> and <calories>

Transforming XML..

Let us parse the xml into a data structure which can be easily persisted using Neo4j. Personally, I would prefer SAX parser as there are serious memory constraints when creating a DOM object (really painful if you talking big data as xml). Additionally SAX parsing gives you unlimited freedom to do whatever you want to do with the XML.

This is how I represent a node, using the object XmlElement. Each XmlElement is identified by an id. This is basically an integer or long. Only thing to make sure is that the number has to be unique across all XMLElements.


public class XmlElement {

 private String tagName;
 private String tagValue;
 private Map<String, String> attributes = new HashMap<String,String>();
 private HierarchyIdentifier hierarchyIdentifier;
 private int parentId;
 private boolean persisted;

//Setters and Getters for the members

 public String getAtrributeString(){
  ObjectMapper jsonMapper = new ObjectMapper();
   try {
    return jsonMapper.writeValueAsString(this.attributes);
   } catch (JsonProcessingException e) {
    LOGGER.severe(e.getMessage());
    e.printStackTrace();
   }
  return null;
 }

}

The XML Tag name and value are stored as string and the attributes are stored as a Map. Also, there is another member Object, HierarchyIdentifier

public class HierarchyIdentifier {
 private int depth;
 private int width;
 private int id;

//Getters and Setters for member element
}

The HierarchyIdentifier Class contains the id class which is used to identify an XmlElement, which translates to identifying an XML Tag. Each XmlElement has the id of it’s parent stored.

And after the parse, the XML should be represented as a map of <Integer,XmlElement> and each XmlElement identified by the corresponding Integer id. You can see the SAX PARSER here..

So we pass the Map to the GraphWriter object.

We create a node with the following properties

  • Value – The Value of the XML Tag
  • ID – The ID of the current node
  • Parent – The ID of the parent Tag
  • Attributes – The String representation of the Tag attributes

Also, the node has the following Labels

  • NODE – For all nodes
  • PARENT – For the seed ( highest parent ) node
  • XML tag Name – Say, the node for the tag <food> will have label FOOD

Currently, there is only a single type of relationship, ‘CHILD_OF’ , between immediate nested tags.

See the below the example for a bigger XML

<breakfast_menu>
 <food>
  <name attr="one">Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
 </food>
 <food>
  <name attr="two">Masala Dosa</name>
  <price>$10.95</price>
  <description>South India's famous slim pancake with mashed potatoes</description>
  <calories>650</calories>
  <eaten>
   <name first="nikhil"/>
   <age>25</age>
  </eaten>
 </food>
 </breakfast_menu>

I have added a second item to the ‘breakfast_menu’.. the legendary masala dosa.

Also, the second food item has a new child tag, <eaten>. And the resultant graph looks prettier

xml-med

 

The sets of tags,relationships and properties can be seen in the Neo4j local server interface.

xml-med-tags

The entire codebase can be found here in github..

Visualizing as a GraphGist

An easy representation can be done in GraphGist. I have created the graph of this xml.

Here is the link to the gist. Also, the gist script can be found in the github gist here..

 

Comparing .csv files and generate sweet HTML reports in rendersnake

Here is a simple library to perform a simple operation. Compare two delimiter separated files and output an .html file.

Simple and straightforward. It can be used to output a ‘Comparison Result’ object, or it can be used to print out an .html report

Features

  1. Compare methodology can be customized using a properties file
  2. Report can be customized using a properties file

Using rendersnake and it pretty awesome.. It is available on github and v1.0 is somewhere between the first cry of the baby and kindergarten.

Here is the gitlab link.

See a report sample,

csvCompare

csvCompare

enjoy..

Git Internals – The commit tree

Git stores the commits as a trees and blobs, and a tree object can consist of both trees and blobs as it’s children. An example of storage as blob can be seen here.

See the file structure of a project below


nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ tree
.
|-- about.html
|-- css
| `-- simple.css
|-- index.html
`-- list.html

1 directory, 4 files

It is pretty straight forward with a couple of .html files and one .css file in a separate directory. Please note that all these files are committed, and that means it has already been written out to the underlying storage. To make sure we can use the really useful git log command,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git log
commit 300f5c42a5aed68268547a95db4f40b6b122fb5b
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 16:52:54 2014 +0530

initial commit

It shows only one commit has been made. It also returns a hash value that is associated with the commit.

The cat-file command we overused and abused in the previous post can be used to see the tree corresponding to last commit. We pass the hash value of the commit as a parameter.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 300f5c42a5aed68268547a95db4f40b6b122fb5b
tree 5d0d7785b65180c195f7a1bf3cf02218b56f6f0a
author nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394968974 +0530
committer nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394968974 +0530

initial commit

So, the commit hash points to a tree, whose hash is displayed. Let us see what the tree contains,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 5d0d7785b65180c195f7a1bf3cf02218b56f6f0a
100644 blob 09e16f36b3c4993ba924b1074629283a49869be9 about.html
040000 tree 02ff2e2946f969bc640886861ff8c7039e1a2339 css
100644 blob 9015a7a32ca0681be64471d3ac2f8c1f24c1040d index.html
100644 blob b92b8b70267846c8b21b5ad412666cb99f9c9211 list.html

This tree contains three blobs for the three .html files and another tree for the css directory. Let us go into this tree,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 02ff2e2946f969bc640886861ff8c7039e1a2339
100644 blob dac138d9e013a2e9a10e67d793bd4703c1b86bd1 simple.css

It contains the .css file. So, the entire structure looks something like this.

commit one

Now, lets make a mall change to the index.html file and do a second commit. After this, this is how the log looks like,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git log
commit ec8b103771588498923711c036ff3280c863f713
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 17:36:33 2014 +0530

second commit

commit 300f5c42a5aed68268547a95db4f40b6b122fb5b
Author: nikhil <nikhil@nikhil-Inspiron-3537.(none)>
Date: Sun Mar 16 16:52:54 2014 +0530

initial commit

There are two commits and both of them have two different hash values,

First Commit : 300f5c42a5aed68268547a95db4f40b6b122fb5b (Initial Commit)

Second Commit : ec8b103771588498923711c036ff3280c863f713 (Second Commit)

Let us follow the second commit tree like we did before,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p ec8b103771588498923711c036ff3280c863f713
tree 455e56fbce4fe35ee64d9e7af572e5b0adef14f6
parent 300f5c42a5aed68268547a95db4f40b6b122fb5b
author nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394971593 +0530
committer nikhil <nikhil@nikhil-Inspiron-3537.(none)> 1394971593 +0530

second commit

The second commit is a child of the first commit, which is interesting, as it is exactly how we see in tools like gitk. A commit contains a reference to its parent commits. While there is usually just a single parent (for a linear history), a commit can have any number of parents in which case it’s usually called a merge commit. Most workflows will only ever make you do merges with two parents, but you can really have any other number too.

Going further deep into the tree,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/tree$ git cat-file -p 455e56fbce4fe35ee64d9e7af572e5b0adef14f6
100644 blob 09e16f36b3c4993ba924b1074629283a49869be9 about.html
040000 tree 02ff2e2946f969bc640886861ff8c7039e1a2339 css
100644 blob b110c44fd08f191062636f18cfeeaeccd5be1b73 index.html
100644 blob b92b8b70267846c8b21b5ad412666cb99f9c9211 list.html

The most interesting thing to note here is that, all the hash values, except the one for index.html remains the same.

The two commit trees look something like this.

commit two

In the second commit tree, the hash values for the unchanged files are the same as the previous commit tree. Now, just see the hash values as pointers to files. The second tree points to the unchanged files.

Well, that’s it.. commit trees…

Git Internals – Basic object data storage

Well, what makes git super fast? A look into git’s underbelly..

Before i begin, i will be setting up an empty repository.

nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git init
Initialized empty Git repository in /home/nikhil/dev/blog/git/.git/
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ ls -a
. .. .git

Also, it can be seen that initializing the repository creates a .git directory, and see the contents of the directory. As you can see, the objects folder is empty. Git has initialized the objects directory and created pack and info subdirectories in it, but there are no regular files.

nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git$ tree
.
|-- branches
|-- config
|-- description
|-- HEAD
|-- hooks
| |-- applypatch-msg.sample
| |-- commit-msg.sample
| |-- post-update.sample
| |-- pre-applypatch.sample
| |-- pre-commit.sample
| |-- prepare-commit-msg.sample
| |-- pre-rebase.sample
| `-- update.sample
|-- info
| `-- exclude
|-- objects
| |-- info
| `-- pack
`-- refs
|-- heads
`-- tags

9 directories, 12 files

At the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time. To demonstrate, you can use the plumbing command hash-object, which takes some data, stores it in your .git directory, and gives you back the key the data is stored as. Note that the hash-object is a plumbing command and is not meant to be used in a regular day.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git$ echo 'supercompiler' | git hash-object -w --stdin
755eb4004ee1ac36d0dd51008ed6279c2fb200e5

The -w tells hash-object to store the object; otherwise, the command simply tells you what the key would be. --stdin tells the command to read the content from stdin; if you don’t specify this, hash-object expects the path to a file. The output from the command is a 40-character checksum hash. This is the SHA-1 hash — a checksum of the content you’re storing plus a header.

Let us move to the objects directory and see how the file is stored,


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git/objects$ tree
.
|-- 75
| `-- 5eb4004ee1ac36d0dd51008ed6279c2fb200e5
|-- info
`-- pack

3 directories, 1 file

You can see a file in the objects directory. This is how Git stores the content initially — as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA, and the filename is the remaining 38 characters.

You can pull the content back out of Git with the cat-file command. This command is sort of a Swiss army knife for inspecting Git objects. Passing -p to it instructs the cat-file command to figure out the type of content and display it nicely for you.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git/.git/objects$ git cat-file -p 755eb4004ee1ac36d0dd51008ed6279c2fb200e5
supercompiler

Ok, let us play around a bit.

I am creating a v1 of a file and writing it to the repository, followed by modifying the file and writing the v2 to the repository. We can see both the file contents using the cat-file command and see a total of three different hashes stored within the objects directory.


nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ echo "version 1" > manual.txt
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git hash-object -w manual.txt
83baae61804e65cc73a7201a7252750c76066a30
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30
version 1
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ echo "version 2" > manual.txt
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git hash-object -w manual.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
version 2
nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ tree .git/objects/
.git/objects/
|-- 1f
| `-- 7a7a472abf3dd9643fd615f6da379c4acb3e3a
|-- 75
| `-- 5eb4004ee1ac36d0dd51008ed6279c2fb200e5
|-- 83
| `-- baae61804e65cc73a7201a7252750c76066a30
|-- info
`-- pack

5 directories, 3 files

You can have Git tell you the object type of any object in Git, given its SHA-1 key, with cat-file -t:

nikhil@nikhil-Inspiron-3537:~/dev/blog/git$ git cat-file -t 83baae61804e65cc73a7201a7252750c76066a30
blob

Now, there are two things that has to be mentioned here.

  1. Git does not store the file. Git stores only the contents.
  2. The contents are stored as a blob object