Of the many NoSQL data management capabilities, the graph database offers special appeal to individuals who want to bridge the gaps between inherently connected information and apply graph analytics to find new insights not typically afforded by conventional relational database management systems.
There is a growing pool of graph database platforms and products. Yet while many offer trial licenses or community versions, the sample applications are often cursory and provide only an inkling of the potential power of graph data modeling. These sample applications demonstrate the basics of how the graph database works, but they only touch on the breadth of data management and analytical capabilities these tools provide.
In some cases, the graph approach is confusing, and some people don’t completely understand how to map their data to a graph model. In others, a lack of knowledge about the implementation of the graph structure can become a bottleneck. The result is that people may be able to tinker with those small graph examples, but they may be stymied when trying to craft a reasonable prototype or proof of concept.
However, many relational databases have latent graph representations hidden within the tabular structure, and there are some concrete steps that an analyst can take to find the data’s inner graph. In fact, these graph data modeling steps are not much different from those taken in developing a relational model.
How graph databases work
Graph databases use an alternative approach to data representation by capturing information about entities and their attributes, as well as the relationships among those entities as first-class objects. The foundation of graph databases is mathematical graph theory. Graphs consist of a collection of vertices — which are also referred to as nodes or points — that represent the modeled entities, connected by edges — which are also referred to as links, connections or relationships — that capture the way that two entities are related.
For example, your database might refer to customers (one entity) who live at specific addresses (a different entity). Lives-At is one type of relationship that links a customer to a location, and it would be the label assigned to the edge between the specific customer node and the specific location node. Without loss of generality, we can assume that every relationship can be represented as a triple consisting of a subject (the source of the edge), the relationship, and the object (the target of the edge). For example:
[customer] Lives-At 132 Main Ave. Tenley, Md., 29817
This example is one instance of a more general triple relationship: Customer Lives-At Location.
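As a minimal sketch, a triple of this kind could be held in a small Python structure. The `Triple` type, its field names and the `customer:1001` identifier are assumptions made for illustration; they do not come from any particular graph database product.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One edge of the graph: subject --relationship--> object."""
    subject: str       # source node of the edge
    relationship: str  # label assigned to the edge
    obj: str           # target node of the edge ("object" in the text)


# "customer:1001" is a made-up identifier used only for illustration.
lives_at = Triple(
    subject="customer:1001",
    relationship="Lives-At",
    obj="132 Main Ave. Tenley, Md., 29817",
)

# The general pattern that the specific instance follows.
pattern = Triple(subject="Customer", relationship="Lives-At", obj="Location")

print(lives_at.subject, lives_at.relationship, lives_at.obj)
```

Keeping the instance and the pattern in the same shape makes it easy to check that every concrete edge matches a relationship type you expect.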
Most, if not all, graph database systems are engineered to ingest a representation of a graph consisting of two artifacts: the list of nodes and the list of edges between those nodes. Using this foundation, we can follow these graph data modeling steps to create those two artifacts:
- Find the entities. Review your data sets to identify those core nouns that could be either the subjects or objects of a relationship. Some examples include customer, employee, vendor, organization, location, purchase transaction, insurance claim, workflow step, product, part, movie, author, book, etc.
- Find each entity’s properties. Entity properties are similar to entity attributes in the relational model. For example, a movie’s properties might include year, format and copyright information. However, once you identify attributes that are also entities, you can start to identify relationships.
- Find each relationship’s properties. These are the attributes associated with the links between entities. To continue our example, an actor might play a role in a stage play for a limited engagement, which makes the start date and end date properties of the relationship between that actor and that role.
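The outcome of the steps above can be sketched as two small record types, one for entities and one for relationships, each carrying its own properties. The `Node` and `Edge` classes, the `Plays` label, and every identifier, name and date below are hypothetical and serve only to illustrate the actor and role example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                                    # unique key for the entity
    label: str                                      # entity class, e.g. "Actor"
    properties: dict = field(default_factory=dict)  # entity properties


@dataclass
class Edge:
    source: str                                     # node_id of the subject
    label: str                                      # relationship name
    target: str                                     # node_id of the object
    properties: dict = field(default_factory=dict)  # relationship properties


# All identifiers, names and dates are invented for illustration.
actor = Node("actor:1", "Actor", {"name": "A. Performer"})
role = Node("role:1", "Role", {"name": "Lead"})

# The engagement dates belong to the relationship, not to either entity.
plays = Edge(
    source=actor.node_id,
    label="Plays",
    target=role.node_id,
    properties={"start_date": "2020-01-10", "end_date": "2020-03-15"},
)

print(plays)
```

Placing the dates on the edge rather than on the actor or the role lets the same actor hold the same role again in a later engagement without overwriting anything.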
The mechanics of developing the prototype involve a series of source data set scans. The first set of scans finds the unique set of entities of each entity class, along with the properties for each of those entities. Once the full set of entities is accumulated, that collection of vertices can be output to a persistent file that can subsequently be ingested by the graph database system.
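A minimal sketch of this first set of scans, using only the Python standard library, might look like the following. The source rows, the column names and the `id,label,properties` file layout are assumptions for illustration; a real graph database will document the exact import format it expects.

```python
import csv
import io
import json

# Hypothetical source rows, standing in for a relational table.
rows = [
    {"customer_id": "C1", "customer_name": "Ann", "address": "132 Main Ave."},
    {"customer_id": "C2", "customer_name": "Bob", "address": "132 Main Ave."},
    {"customer_id": "C1", "customer_name": "Ann", "address": "9 Elm St."},
]


def scan_entities(rows):
    """Return {node_id: (label, properties)} with one entry per unique entity."""
    vertices = {}
    for row in rows:
        # setdefault keeps the first occurrence, so duplicates are dropped.
        vertices.setdefault(
            "customer:" + row["customer_id"],
            ("Customer", {"name": row["customer_name"]}),
        )
        vertices.setdefault(
            "location:" + row["address"],
            ("Location", {"address": row["address"]}),
        )
    return vertices


def write_vertices(vertices, stream):
    """Write one CSV line per vertex: id, label, JSON-encoded properties."""
    writer = csv.writer(stream)
    writer.writerow(["id", "label", "properties"])
    for node_id, (label, props) in vertices.items():
        writer.writerow([node_id, label, json.dumps(props)])


vertices = scan_entities(rows)
out = io.StringIO()  # for a real file: open("vertices.csv", "w", newline="")
write_vertices(vertices, out)
print(out.getvalue())
```

Here a single pass collects both entity classes; running one pass per class, as the text describes, yields the same set of vertices.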
The second set of scans will extract the relationships between the different entities, along with the properties of those links. Because these relationships link known entities, make sure that you collect only relationship triples that refer to the entities logged during the prior set of scans. When those edges have been accumulated, output the triples to a persistent edges file.
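The second set of scans can be sketched in the same way. The set of known node identifiers, the source rows, the `since` property and the column layout of the edges file are all hypothetical; the point is the filter that keeps only triples whose subject and object were both logged during the entity scans.

```python
import csv
import io

# Hypothetical inputs: node ids logged by the entity scans, plus source rows.
known_ids = {"customer:C1", "customer:C2", "location:132 Main Ave."}
rows = [
    {"customer_id": "C1", "address": "132 Main Ave.", "since": "2019"},
    {"customer_id": "C2", "address": "132 Main Ave.", "since": "2021"},
    {"customer_id": "C3", "address": "9 Elm St.", "since": "2022"},  # not logged
]


def scan_edges(rows, known_ids):
    """Yield (subject, relationship, object, properties) for known entities only."""
    for row in rows:
        subject = "customer:" + row["customer_id"]
        target = "location:" + row["address"]
        # Skip triples that refer to an entity missing from the vertex list.
        if subject in known_ids and target in known_ids:
            yield subject, "Lives-At", target, {"since": row["since"]}


edges = list(scan_edges(rows, known_ids))

out = io.StringIO()  # for a real file: open("edges.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["source", "relationship", "target", "since"])
for subject, relationship, target, props in edges:
    writer.writerow([subject, relationship, target, props["since"]])

print(out.getvalue())
```

Dropping the third row silently is a choice made for brevity; in a prototype it may be more useful to log rejected triples so that gaps in the entity scans become visible.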
Together, those two files — the vertices and the edges — represent the graph. This exercise combines a repeatable process for extracting a graph from the source data with the actual artifacts that can be loaded into a graph database system. The result is a simplified method for creating prototypes for the evaluation of different graph database tools.