Navigating XML Graph using Cypher

Cypher is a neat way to manipulate a Neo4j database. It would be equally amazing if the Xml graph could be queried with Cypher as well.

Honestly, I must put credits to Michael for suggesting such a possibility here..

Well, let’s start with a simple xml file.

<library>
<author firstname="Earnest" lastname="Hemingway">
<works>
<book name="A Farewell to Arms" year="1929" />
<book name="For Whom the bell tolls" year="1940" />
<book name="The Old man and the sea" year="1951" />
</works>
<awards>
<award name = "Pulitzer Prize" category="Fiction" year="1953"></award>
<award name = "Nobel Prize" category="Literature" year="1954"></award>
</awards>
</author>
<author firstname="Victor" lastname="Hugo">
<works>
<book name="The Hunchback of Notre-Dame" year="1831" />
<book name="Les Misérables" year="1862" />
</works>
</author>
</library>

It’s a simple xml with nothing fancy in it. As explained in the previous posts here and here.. A neat neo4j graph can be made out of this…

Screenshot from 2014-07-22 21:16:46

So, let’s go about traversing this graph using Cypher.. And since we are trying to traverse an XML, let’s make a rough comparison to XPath.

Let’s fetch all the book nodes,

The Xpath to get all the book nodes, no matter where they are in the document, is

//book

For the same purpose, the Cypher query would be,

MATCH (books:book) RETURN books

This will fetch the following output for the above Graph,


bookNodes

 

Let’s now try to fetch the name of all books. The XPath will require only a slight modification,

//book/@name

The XPath will return the list as,

Attribute='name="A Farewell to Arms"'
Attribute='name="For Whom the bell tolls"'
Attribute='name="The Old man and the sea"'
Attribute='name="The Hunchback of Notre-Dame"'
Attribute='name="Les Misérables"'

The Cypher will only require a small modification. Instead of returning the entire node, fetch the ‘name’ attribute for the nodes.

MATCH (books:book) RETURN books.name

bookName Next up, let’s query the awards honoured to Earnest Hemingway,

This can be achieved via XPath as,

//author[@firstname=’Earnest’]/awards

which gives the output

<awards>
  <award name="Pulitzer Prize" category="Fiction" year="1953" />
  <award name="Nobel Prize" category="Literature" year="1954" />
</awards>

As for Cypher,

MATCH (author {firstname: “Earnest”})-[*]->(award:award) RETURN award

We try to fetch any node of the type ‘award’ connected to a node of type ‘author’ with firstname = Earnest

awards

The above examples are very much trivial, and aims to prove the possibility of using Cypher to traverse a database. I am looking for a huge and most importantly meaningful xml content which can be queried to get some useful information. Keep watching this space for more..

Advertisements

Adding DOM like features to Neo4j xml graph

This is an extension to my previous post on how to convert an xml format data to a neo4j graph. For instance, see the graph generated an xml file in the following format

<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
.
.
.
</catalog>

And the neo4j graph visualization below..

Catlog-xml

There are 4 <book> nodes under <catlog> and for each book, the associated metadata. Altogether it’s a pretty standard xml.

Most of our xml operations involve traversing the xml. JAXP is bread and butter for any developer working with xml in Java. I have done my bit of xml parsing and the interface provided is so wonderful (and resource intensive because of the DOM).

Similar flexibility is possible for the xmlGraph too..

I have tried to implemented the following methods..

  • getTags – Fetches the element objects for the specified xml tag
  • getParent – Fetches the parent element object
  • getChildren – Fetches the child element object
  • getSiblings – Fetches the siblings of the element

The implementation of the methods are in a separate service Facade..

public interface XmlTreeService {
  public List<XmlElement> getTags(String tagName);
  public List<XmlElement> getChildren(XmlElement parent);
  public XmlElement getParent(XmlElement child);
  public List<XmlElement> getSiblings(XmlElement element);
}

Implementation Details

Fetching the list of Tags, (the getTags method)

The best solution was to use the GlobalGraphOperations interface. As discussed before, while saving the xml as a graph, each node has it’s tag name as a label. For instance, the node representing a book node will have the label, ‘book‘ associated with it.

node-label-cropped

 

It can be seen the TAG property ‘book‘ is also a label.

This means fetching all Tags of a specified name resolves to fetching all nodes with a specific label, which is easy using GlobalGraphOperations

GlobalGraphOperations globalGraph = getGlobalGraphOperations();
Label label = DynamicLabel.label(tagName);
ResourceIterable<Node> nodes;
try (Transaction tx = graphDb.beginTx()) {
 nodes = globalGraph.getAllNodesWithLabel(label);
 for (Node node : nodes) {
   XmlElement element = getXmlElement(node);
   elements.add(element);
}
 tx.success();
} catch (IOException e) {
 LOGGER.severe(e.getMessage());
 e.printStackTrace();
}

 

Fetching the parent and children

In the graph we created, between a child tag and the parent tag in the xml, there exists a relationship, CHILD_OF from the child to the parent

child-of-crop

Implies finding the lineage is about finding the relationship of the node and fetching the element. For instance, see how the children nodes are fetched,

 

public List getChildren(XmlElement parent) {
 GraphDatabaseService graphDb = Neo4jDatabaseHandler.getGraphDatabase();
 List childElementList = new ArrayList<>();
 try(Transaction tx = graphDb.beginTx()) {
  Node node = graphDb.getNodeById(parent.getId());
  Iterable childRelations = node.getRelationships(Direction.INCOMING);
  Iterator relationshipIterator = childRelations.iterator();
  while(relationshipIterator.hasNext()) {
    Relationship relationship = relationshipIterator.next();
    Node childNode = relationship.getStartNode();
    XmlElement childElement = getXmlElement(childNode);
    childElementList.add(childElement);
  }
} catch (IOException e) {
// TODO Auto-generated catch block
 e.printStackTrace();
}

The process can be outlined as

  1. Fetch the node from GraphDatabaseService using the id which is stored in the XmlElement Object.
  2. Fetch the outgoing relationships which are INCOMING to the parent node.
  3. Iterate through the relationships and fetch the starting nodes.

The same method can be used to identify the parent node too..

Fetching Sibling Nodes

To fetch the sibling nodes, first fetch the parent node and then get the children. Also remove the node in question. Finding siblings is quite elementary.

You can always find the complete source code in github.

Summary

To roll all balls at once, let’s try to do a small test,

GraphDatabaseService graphDb = Neo4jDatabaseHandler.getGraphDatabase();
XmlTreeServiceGraph treeServiceGraph = new XmlTreeServiceGraph(graphDb);
System.out.println("-----Test for DOM like methods on XmlGraph----\n");
List<XmlElement> groupElements = treeServiceGraph.getTags("book");
System.out.println("Fetching all Book Nodes...");
for (XmlElement xmlElement : groupElements) {
  System.out.println(xmlElement.getAtrributeString());
}
XmlElement firstElement = groupElements.get(0);
System.out.println("\nFetching children of the first Book Node : " + firstElement.getAtrributeString() + "...");
List<XmlElement> childrenElements = treeServiceGraph.getChildren(firstElement);
for (XmlElement xmlElement : childrenElements) {
  System.out.println(xmlElement.getTagName()+ " : " + xmlElement.getTagValue());
}
XmlElement child = childrenElements.get(0);
System.out.println("\nFetching parent of the first child element : "+child.getTagName()+" : "+child.getTagValue() + "...");
XmlElement parentElement = treeServiceGraph.getParent(child);
System.out.println(parentElement.getTagName() + " : " + parentElement.getAtrributeString());
System.out.println("\nFetching siblings of the first Book Node : " + firstElement.getAttributes());
List<XmlElement> elementSibling = treeServiceGraph.getSiblings(firstElement);
for (XmlElement xmlElement : elementSibling) {
  System.out.println(xmlElement.getTagName() + " : " + xmlElement.getAtrributeString());
}

 

P.S : I know the test code is crappy, but for a demo test, this will do.. 🙂

So, it’s quite straightforward, I am doing the following steps

  1. Fetch all ‘book’ nodes in the xml, and print them
  2. Fetch all the children of the first ‘book’ node, and print them
  3. Fetch the parent of the first child from step #2, which should give us our first book node, and print them
  4. Fetch the siblings of the first node, and print them

Let’s see how the output looks like,

-----Test for DOM like methods on XmlGraph----

Fetching all Book Nodes...
{"id":"bk101"}
{"id":"bk102"}
{"id":"bk103"}
{"id":"bk104"}

Fetching children of the first Book Node : {"id":"bk101"}...
description : An in-depth look at creating applications with XML.
publish_date : 2000-10-01
price : 44.95
genre : Computer
title : XML Developer's Guide
author : Gambardella, Matthew

Fetching parent of the first child element : description : An in-depth look at creating applications with XML....
book : {"id":"bk101"}

Fetching siblings of the first Book Node : {id=bk101}
book : {"id":"bk104"}
book : {"id":"bk103"}
book : {"id":"bk102"}

It looks good, and everything as expected… 🙂

Visualizing an xml as a graph – Neo4j 101

Neo4j is quite awesome for pretty much everything. Here is an extract I found from this blog

Whenever someone gives you a problem, think graphs. They are the most fundamental and flexible way of representing any kind of a relationship, so it’s about a 50-50 shot that any interesting design problem has a graph involved in it. Make absolutely sure you can’t think of a way to solve it using graphs before moving on to other solution types.

As a baby step, let us try to visualize an xml as a graph.

We can start with a simple xml here,

<?xml version="1.0" encoding="UTF-8"?>
 <breakfast_menu>
  <food>
   <name attr="one">Belgian Waffles</name>
   <price>$5.95</price>
   <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
   <calories>650</calories>
  </food>
 </breakfast_menu>

It is a simple ‘Breakfast Menu’ with just one item, ‘Belgian Waffles‘.

Neo4j has a neat interface which lets us visualize the graph we created. The graph for the above xml looks slick..

small-graph-xml

 

The above XML has 6 Tags and we have 6 nodes in our graph. The nested tags/Nodes has a ‘CHILD-OF’ relationship with the Parent tag/Node.

In the graph above,

  • Node numbered 0 is the root node – Corresponds to the <breakfast_menu> tag
  • Node numbered 1 is the immediate child of the root – Corresponds to the <food> tag
  • Nodes numbered from 2-4 are the tags which come under <food> tag, viz <name>, <price>, <description> and <calories>

Transforming XML..

Let us parse the xml into a data structure which can be easily persisted using Neo4j. Personally, I would prefer SAX parser as there are serious memory constraints when creating a DOM object (really painful if you talking big data as xml). Additionally SAX parsing gives you unlimited freedom to do whatever you want to do with the XML.

This is how I represent a node, using the object XmlElement. Each XmlElement is identified by an id. This is basically an integer or long. Only thing to make sure is that the number has to be unique across all XMLElements.


public class XmlElement {

 private String tagName;
 private String tagValue;
 private Map<String, String> attributes = new HashMap<String,String>();
 private HierarchyIdentifier hierarchyIdentifier;
 private int parentId;
 private boolean persisted;

//Setters and Getters for the members

 public String getAtrributeString(){
  ObjectMapper jsonMapper = new ObjectMapper();
   try {
    return jsonMapper.writeValueAsString(this.attributes);
   } catch (JsonProcessingException e) {
    LOGGER.severe(e.getMessage());
    e.printStackTrace();
   }
  return null;
 }

}

The XML Tag name and value are stored as string and the attributes are stored as a Map. Also, there is another member Object, HierarchyIdentifier

public class HierarchyIdentifier {
 private int depth;
 private int width;
 private int id;

//Getters and Setters for member element
}

The HierarchyIdentifier Class contains the id class which is used to identify an XmlElement, which translates to identifying an XML Tag. Each XmlElement has the id of it’s parent stored.

And after the parse, the XML should be represented as a map of <Integer,XmlElement> and each XmlElement identified by the corresponding Integer id. You can see the SAX PARSER here..

So we pass the Map to the GraphWriter object.

We create a node with the following properties

  • Value – The Value of the XML Tag
  • ID – The ID of the current node
  • Parent – The ID of the parent Tag
  • Attributes – The String representation of the Tag attributes

Also, the node has the following Labels

  • NODE – For all nodes
  • PARENT – For the seed ( highest parent ) node
  • XML tag Name – Say, the node for the tag <food> will have label FOOD

Currently, there is only a single type of relationship, ‘CHILD_OF’ , between immediate nested tags.

See the below the example for a bigger XML

<breakfast_menu>
 <food>
  <name attr="one">Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
 </food>
 <food>
  <name attr="two">Masala Dosa</name>
  <price>$10.95</price>
  <description>South India's famous slim pancake with mashed potatoes</description>
  <calories>650</calories>
  <eaten>
   <name first="nikhil"/>
   <age>25</age>
  </eaten>
 </food>
 </breakfast_menu>

I have added a second item to the ‘breakfast_menu’.. the legendary masala dosa.

Also, the second food item has a new child tag, <eaten>. And the resultant graph looks prettier

xml-med

 

The sets of tags,relationships and properties can be seen in the Neo4j local server interface.

xml-med-tags

The entire codebase can be found here in github..

Visualizing as a GraphGist

An easy representation can be done in GraphGist. I have created the graph of this xml.

Here is the link to the gist. Also, the gist script can be found in the github gist here..

 

XPath for XML

XPath is a language used to navigate through the XML document. It’s used to identify elements in the XML document. It is so limber that technologies like XQuery and XPointer are built on it. XPath uses path expressions to select nodes or node-sets in an XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.

While using XPath, the xml document is treated as tree of nodes. See the example below

<?xml version="1.0" encoding="ISO-8859-1"?></pre>
<bookstore>
  <book>
    <title lang="en">The Joke</title>
    <author>Milan Kundera</author>
    <price>350</price>
  </book>
  <book>
    <title lang="en">After Dark</title>   
    <author>Haruki Mukarami</author>   
    <price>450</price>
  </book>
</bookstore>

Here, the tag <bookstore> is the root node. The tag <author> is the element node and the attribute lang is the attribute node. The nodes also have the hierarchical properties. For instance, the nodes children of the node <book>. Also, <title>,<author> and <price> are siblings.

XPath uses the following expressions to parse through the XML Document.

/bookstore/book - returns all the &lt;book&gt; nodes which are children of &lt;bookstore&gt;
bookstore//book - returns all book elements that are descendant of the &lt;bookstore&gt; element,
//@lang - returns all attributes that are named lang

Specific nodes can also be identified in XPath

/bookstore/book[1] - Returns the first book element
/bookstore//book[last()-1] - Returns the second last element
/bookstore/book[position()&lt;3] - Returns the first two book elements
//title[@lang] - Returns all title elements that has an attribute lang
/bookstore/book[price&gt;350]/title - Returns the titles of all books which has price more than 350

XPath supports wildcard characters as well

/bookstore/* - Selects all the children of the bookstore elements
//title[@*] - Select all the title elements which has an attribute

The detailed list of parse syntax can be found here..

So, till now, everything was pretty simple. Here comes the most flexible and useful feature of XPath 

XPath Axes

An axis defines a node-set relative to the current node. When we say, ‘the children of the current node’, the children defines a nodeset and thus children is an axes. Similarly parent, sibling, attribute are all axes W3C gives the complete list here..

From the examples we saw above, there are two types of location paths in XPath – absolute and relative

An absolute location path starts with a slash ( / ) and a relative location path does not. In both cases the location path consists of one or more steps, each separated by a slash

</pre>
An absolute location path:
/step/step/...
A relative location path:
step/step/...

A step in the examples above can consist of

  • an axis (defines the tree-relationship between the selected nodes and the current node)
  • a node (identifies a node within an axis)
  • zero or more predicates (to further refine the selected node-set)

Generalizing it, a step would look like this

axis:node[predicate]

take away the axis and predicate and you are left with the kind of steps we saw in the above examples.

See some more examples


/bookstore/child::book - Select all the book nodes which are children of bookstore

/bookstore/book/attribute::* - Select all the attributes of the book node

child::*/child::price - Select all the price grandchildren of the current node

See that the first two example are absolute paths and the last one is a relative path.

XPath can be evaluated via javascript or through a PL like Java. I will soon chalk a post on that. Also, I really need to have another post dedicated to XPath axes..