AnzoGraph: A W3C Standards-Based Graph Database – Towards Data Science

In this interview, I’m catching up with Barry Zane, Vice President at Cambridge Semantics. Barry is the creator of AnzoGraph™, a native, massively parallel processing (MPP) distributed graph database. Barry has had quite a journey in the database world. He served as Vice President of Technology at Netezza Corporation from 2000 to 2005, and was responsible for guiding all aspects of software architecture and implementation, from initial prototypes through volume shipments to leading telecommunications, retail and internet customers. Netezza was eventually sold to IBM, but prior to that, Barry had turned his attention elsewhere to found another company, ParAccel, which eventually became the core technology for AWS Redshift. As the need for a graph-based online analytical processing (graph OLAP) database began to emerge in the market, Barry founded SPARQL City in 2013.

Barry kindly agreed to speak to me this week following a recent announcement that the AnzoGraph database is now available for download for independent evaluation and use in customer applications, on premises or in the cloud. Although not yet announced, Barry also revealed that AnzoGraph has been enhanced to support RDF*/SPARQL*, which gives it complete property graph functionality. So it was exciting to speak to him and find out more about how graph analytics and W3C standards are coming together.
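The key idea behind RDF* (and the matching SPARQL* query extensions) is that a triple can itself be the subject of other triples, which is how property-graph-style edge attributes are expressed in an RDF model. Here is a rough sketch of that idea in plain Python, with invented data; the real RDF* syntax wraps the embedded triple in << >> rather than using tuples:

```python
# An ordinary RDF triple: an edge with no attributes.
likes = ("alice", "likes", "bob")

# In RDF*, the triple itself can be the subject of further triples,
# which effectively gives the edge properties (a property-graph feature).
edge_facts = {
    (likes, "certainty", 0.9),
    (likes, "since", "2018-01-01"),
}

def edge_property(edge, prop):
    """Look up one attribute attached to an edge (statement)."""
    for (e, p, o) in edge_facts:
        if e == edge and p == prop:
            return o
    return None

print(edge_property(likes, "certainty"))  # → 0.9
```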

Firstly, Barry, please can you tell us a little about Cambridge Semantics?

Cambridge Semantics has been around since about 2007. One of the solutions that we have built out over the years is a semantic layer product called Anzo. Anzo is used in a number of large enterprises across sectors such as pharmaceuticals, financial services, retail, oil & gas, healthcare and government. These enterprises have in common a tendency to have diverse data sources along with a real need for discovering and analyzing data. The semantic layer provided by Anzo combines and presents the raw data with business meaning and context. It just so happens that the graph database is a key infrastructure element of this solution.

Cambridge Semantics saw the value in graph analytics early on and was one of the first customers for SPARQL City. They acquired us in 2016. Late in 2018 we took the graph engine underneath Anzo and spun it out as its own product called AnzoGraph.

Please can you explain the key use cases for AnzoGraph?

The graph database market is well covered in terms of OLTP databases. Rather than another OLTP graph database, like Neo4j or, more recently, AWS Neptune, we decided to build an OLAP-style graph database. There was a real need in the market to perform data-warehouse-style analytics with the added benefit of handling both structured and unstructured data. With AnzoGraph we can offer reporting, BI analytics and aggregates, graph algorithms like PageRank and shortest path, inferencing, and more of the data-warehouse-style analytics that the market was missing.
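As one illustration of the kind of graph algorithm mentioned above, here is a minimal breadth-first shortest-path sketch over an edge list, in plain Python with invented example data. (AnzoGraph exposes such algorithms through queries; this is just the underlying idea, not its API.)

```python
from collections import deque

# Invented example graph: pairs of connected nodes (undirected).
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def shortest_path(start, goal):
    """Breadth-first search: returns one shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("a", "e"))
```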

Customers use AnzoGraph to discover new insights on large scale diverse data, including historical and recent data. It is great for running algorithms and analysis across a very large set of data to find relevant entities, relationships and insights. We combine the value users get in using a W3C standards-based RDF database with the value they get with property graphs.

We’ve had interest in using AnzoGraph for a wide range of purposes. Think about all the times you’d want to perform analytics where the information that connects the data is as important as the data itself. For example, knowledge graphs are popular with many companies trying to wire together disparate data sources, and our experience in doing that with Anzo helps. Companies are struggling to understand buyer intent and to build recommendation engines; graphs can help with the “those who like product A will probably also like product B” problem. In the financial services world, banks are using graphs to “follow the money”: graphs make it possible to follow transfers of derivatives and other assets, and so can help banks manage risk. Even IT organizations are looking at complex networks and trying to gain a better understanding of how IP traffic flows between devices.

There are a couple of emerging use cases that I find pretty exciting. First, when paired with a natural language processing engine or parser, AnzoGraph is great at dealing with linked structured and unstructured data, and at providing graph-based infrastructure for graph algorithms in AI and machine learning. Second, it’s interesting to follow how graph analytics is making an impact in genomic research. Rather than the brute-force techniques that brought forth many analytics-powered innovations in genetics, scientists are developing new analytics techniques with graph analysis that allow users to find new insights without explicitly programming for those insights as you would in a relational database.

What makes AnzoGraph different from other data warehouse solutions?

The difference is one that you might not expect, and it has to do with the inflexibility of schemas in the traditional RDBMS data warehouse world, where we’re tasked with creating tables and fixed schemas. Then, to get an answer, we might have to write complicated JOINs to query the tables. In the graph database world, however, everything is represented as triples (a subject, a predicate and an object describing a person, place or thing), so it’s easy to add more triples to further describe something without any schema change. Standard ontologies exist to help us describe relationships, which is especially helpful when we want to share the data. Relational database schemas aren’t usually as flexible, since they are fixed and customized from the start.
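To make the schema-flexibility point concrete, here is a minimal sketch in plain Python with made-up example data (not AnzoGraph’s API): a triple store is, at heart, a set of (subject, predicate, object) statements, so describing a new attribute means adding a triple rather than altering a table.

```python
# A toy triple store: a set of (subject, predicate, object) statements.
# The data is invented for illustration.
triples = {
    ("steve", "rdf:type", "Person"),
    ("steve", "worksFor", "acme"),
    ("acme", "rdf:type", "Company"),
}

# Adding a new kind of fact needs no schema change -- just another triple.
triples.add(("steve", "birthCity", "Boston"))

# Collect every predicate/object pair stored about one subject.
def describe(subject):
    return sorted((p, o) for (s, p, o) in triples if s == subject)

print(describe("steve"))
```

In a relational warehouse, the equivalent change would be an ALTER TABLE (or a new table plus a JOIN); here it is one more statement.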

Of course, support for analytics is a huge difference, too. While AnzoGraph offers all the analytical functions of a traditional data warehouse, it also offers graph algorithms, inferencing and more. That makes the use cases I mentioned above much easier to handle. Graph databases are better suited to certain types of machine learning algorithms, and they provide machine-based inferencing that can be very valuable in machine learning.
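To illustrate what inferencing means here, consider a toy sketch in plain Python with invented facts (real RDF stores do this via RDFS/OWL reasoning): from “Employee is a subclass of Person” and “steve is an Employee”, the engine can derive “steve is a Person” even though that fact was never stored.

```python
# Invented toy facts and subclass axioms.
types = {("steve", "Employee")}
subclass_of = {("Employee", "Person"), ("Person", "Agent")}

def infer_types(types, subclass_of):
    """Repeatedly apply the RDFS-style rule:
    (x type C) and (C subClassOf D)  =>  (x type D)."""
    inferred = set(types)
    changed = True
    while changed:
        changed = False
        new = {(x, d) for (x, c) in inferred
                      for (c2, d) in subclass_of if c == c2}
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

print(sorted(infer_types(types, subclass_of)))
```

The loop runs until no new facts appear, so chained axioms (Employee → Person → Agent) are followed transitively.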

Unlike the traditional data warehouse, AnzoGraph lends itself well to deployment flexibility and scalability. The market is responding to applications built on container technology, with Docker and orchestration tools like Kubernetes, because of the scalability factor. When you can spin up multiple containers and spin them back down at will, it makes for a very economical solution that scales. In benchmarks, we have achieved up to 100x faster performance than other databases, and the sky’s the limit. Of course, AnzoGraph can deploy on bare metal, on VMs or in any of the clouds, but containers get the most interest.

2018 has seen enormous uptake of machine learning across a range of technology spaces, while deep learning waits in the wings for its time. Do graph databases have anything to offer those with huge amounts of data who want to join the AI gold rush?

We are seeing broader adoption of machine learning and AI, and graph databases will play a part. We all know that the biggest challenge in machine learning is data preparation. Much of that preparation and curation is simplified by importing the raw data directly and curating it in the graph database itself, rather than through a complex ETL pipeline. The simplicity of the data model makes curation dramatically simpler and faster than in a relational database. Users will also find it easier to mine unstructured data once complicated schemas are gone, and they can take advantage of the scalability of containers.

Graph databases have been around for some time, but are coming of age right now. What are your predictions for the next 2 years in this space, and how will AnzoGraph be working to lead the pack of next-generation graph databases?

I’m anticipating a greater understanding over the coming years of the general categories of big data analytics as opposed to operational queries. AnzoGraph is strongly focused on big data analytics that aggregate across a graph space. We can go beyond narrow queries like “Tell me about Steve” to cover broader analytics such as “Tell me about humans”.
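The narrow-versus-broad distinction can be sketched in a few lines of plain Python over toy triples (invented data, not AnzoGraph syntax): an operational query touches one entity, while an analytical query aggregates across the whole graph.

```python
# Invented toy data: (subject, predicate, object) triples.
triples = [
    ("steve", "rdf:type", "Human"), ("steve", "age", 42),
    ("ada",   "rdf:type", "Human"), ("ada",   "age", 36),
    ("rex",   "rdf:type", "Dog"),   ("rex",   "age", 3),
]

# Narrow, OLTP-style: "Tell me about Steve."
about_steve = [(p, o) for (s, p, o) in triples if s == "steve"]

# Broad, OLAP-style: "Tell me about humans" -- an aggregate over the graph.
humans = {s for (s, p, o) in triples if p == "rdf:type" and o == "Human"}
avg_human_age = sum(o for (s, p, o) in triples
                    if s in humans and p == "age") / len(humans)

print(about_steve)       # every fact stored about one entity
print(avg_human_age)     # → 39.0
```

In SPARQL terms, the first is a lookup on one subject and the second is a GROUP BY/AVG-style aggregate spanning the dataset.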

The W3C’s SPARQL is the only current formal standard, but Cypher is clearly the de facto standard for labelled property graphs. A group has formed to create a next-generation formal standard, and it will be interesting to see how that shapes up. Here at Cambridge Semantics, we are very supportive of that process; it can only be a good thing to have a robust common language for graphs. So my prediction for the coming few years in the graph space is that the proprietary model is on the way out.

The marketplace will decide on the exact standards and we will be adapting our solutions to comply as we are strongly committed to standards. I don’t see this evolution as a threat but as a huge opportunity for us as it matches our mindset and will only grow the uptake in graph technologies.

In Conclusion

I’d like to thank Barry and the team at Cambridge Semantics for the opportunity to find out more about AnzoGraph. I’m not affiliated in any way with the company and should point out that I was not compensated by them for this interview.

If you’re after more details about AnzoGraph, there’s a great technical presentation on Slideshare from October 2018, or check out the website. Do feel free to leave any questions in the comments below!