My name is Charles Humble and I am here at QCon New York 2014 with Ian Robinson. Ian, can you introduce yourself to the InfoQ community?

Hello, I am Ian Robinson, I am engineer at Neo Technology, I am based in London and I work on the Neo4j graph database.

 

2.What is a graph database?

A graph database? It’s a database, it’s an OLTP system, the kind of database that you could insert between a web request and a web response, but it allows you to model, store and query your data as a graph or as a network structure. So instead of modeling your data as a tabular structure and querying it with set-based operations, as you do in a relational database, with a graph database you are modeling your data as a network structure, and storing it as a network structure and then querying it as a network structure. So that structure comprises nodes and relationships and properties.

3.I guess social networks have done a lot to popularise the idea of graphs; I think everyone’s heard of the Facebook social graph for instance. So what is it about social data that lends itself naturally to being treated as a graph?

Social data or social behaviors are intrinsically based upon relationships. People behave within groups and they have friends and there are friends of friends, and so on, and there is an enormous body of research that demonstrates that there are emergent behaviors within these group structures, or these network structures, and that there are properties that we can identify and effectively predict certain behaviors or the transmission of certain behaviors within those social groups. But those groups are always structured as a network, or as a graph-like structure, and therefore I think ... There have been many, many years of analysis about these kinds of things, but it’s really within the last five years or so where it’s become prominent in the public’s mind because of things like Facebook, which have really given us, everybody, the ability to make concrete all of those connections and to record them and take advantage of them over the internet.

4.So presumably, you can exploit that for things like recommendation engines and that sort of thing?

Yes, for recommendations engines, both in terms of being able to recommend other people within that network who you might want to connect with - there are emergent properties within that network that say if we have friends in common, we’re more likely to get to know one another than if we are completely disconnected within a network, and we can use those kinds of properties within a network to suggest or recommend people to connect with. And similarly, if you start introducing other elements into the network, things that people buy or actions that they perform or sports and hobbies and so on, again we can begin to identify or predict likely paths that will be pursued in the future and use that to effectively create recommendations.

5.Tell me a bit about the history of Neo4j. When did you start development on it?

As a product, it’s more than ten years old; I guess it’s about 12 years old now. It started life as a database backing a content management system for the Swedish military, I believe, and it’s still being used there in that capacity. But several years after it was broken out as a separate product and then open-sourced, so it’s been an open source graph database for nearly ten years now; lots of contributions from the community. But then about five/six years ago Neo Technology was founded to sponsor that development and Neo employ a lot of the developers who work on the kernel and we still accept a lot of contributions from the community, but Neo sponsors the development of the kernel and ensures that the integrity and the quality of the product is at it should be moving forward.

Charles: And it’s JVM based, I understand. 

Yes, it’s JVM based and hence the 4j, a wonderful piece of foresight in naming the product all those years ago. But, yes, it’s 4j, it’s JVM based, it’s written almost entirely in Java, although some of the query language or the parsing for the query language is written in Scala.

Charles: So, tell me a bit about the data structure. It’s some kind of property graph I guess. 

Yes. I think it’s probably fair to say that Neo invented the term property graph and our original data model was those three fundamental elements: nodes, relationships, and properties. We use nodes to represent the things in your domain that you are interested in, things with identity, the entities in your domain; and then you can attach properties to those nodes, simple key value pairs to represent the attributes of those entities. So at that point you have a very simple record-like structure. But then you can begin to introduce even more structure by connecting those nodes with relationships; so the relationships are first class elements in the data model. Every relationship has a name and a direction, so that lends semantic clarity. We are kind of baking in the semantics of the ways in which things are connected at the level of data model. And then we can also attach properties to these relationships: so we can qualify the relationships with their strength or their weight or their quality. That’s really useful when you’ve got a network and you want to be able to calculate the shortest route between two parts of that network: you can take advantage of the properties on the relationships in order to calculate the shortest path.

So they’re the three fundamental parts of the model that we’ve had for years: nodes, relationships and properties. And then more recently we revised the data model, at the end of last year, introducing a fourth primitive which we call labels. Labels you can attach to nodes; you can ef-fectively label or tag nodes; and we use those labels to help you identify the role that a node plays within your dataset: this node represents a person or this node represents a product. So that gives us a very simple grouping semantic where you can say to the database, “Find me all of the nodes that are labeled product”, for example. But we also then use the labels to associate certain behaviors with those nodes, things such as indexing and constraints. So we can associate indexes and constraints with labels and we can say to the database, “For every node that’s labeled ‘book’, I want you to index it based on its ISBN property” or “Every node that’s labeled ‘book’, I want you to ensure that its ISBN property is unique within the context of this dataset”. So the label is a very, very simple way of tagging nodes, of helping express the role that they play in the dataset, but they also give us those additional capabilities of indexing and optional constraints that we can layer onto the structure.

6.What kind of constraints can you do with that?

Today we have two kind of behaviors: indexing and unique constraints; this is what we released at the end of last year; but we hang them all off labels. But effectively the labels in the future will be adding additional constraints. So, I think you can expect within the next year or so to see additional constraints, such as every node that is labeled ‘person’ you can expect to see a first name, and a last name property; or every node that is labeled ‘author’ can only be connected by way of outgoing rote relationships to nodes labeled ‘book’; so all different kinds of constraints. We don’t have those today, they are things that you have to build in the application, but they are some of the kind of constraints that we have in mind for the future.

7.Neo4j is quite unusual in NoSQL in that it supports ACID transactions. Is it the data structure that informs that design choice?

Yes. Well, the data structure plus the fact that we know that enterprises still care very much about their data and care about the kind of transactional characteristics when they are dealing with their data. Traditionally Neo4j has been seen as part of that larger family of NoSQL databases, but whereas other NoSQL databases can surrender certain ACID properties in order to gain some other benefits: benefits around horizontal scalability or high availability; and they can do that very often because they are dealing with what Martin Fowler calls middle aggregates, simple key-value pairs, or discrete documents, and so on.

We are slightly different. When you are working with a graph you are working with an intrinsically connected structure and very often what you want to do is to create a sub-graph in its entirety: lots of nodes, lots of relationships that connect them, and all the properties attached to them. And when you persist that, you want to be confident that entire structure, not just the discrete elements, but the entire structure has been made durable on disk. Hence we have transactions; every chain to the database has to take place in the context of a transaction. And that way we can guarantee that when the control returns to the client, that entire sub-graph has been made durable on disk. And the advantage there is we slot very easily into enterprises that require of us all of those guarantees around the quality of their data and the durability of their data.

Charles: Presumably there are trade-offs though for having ACID compliance. 

Yes. Again, one of the benefits of relaxing some of those ACID properties and dealing with simple, aggregate oriented data - as you get in a key-value store or a column-orientated database - it makes it very, very easy to distribute that data and provide for very, very large horizontal scalability. With Neo4j today, it’s not so easy to distribute the graph data. We cluster, so Neo4j is a clustered database product, but every instance in that cluster has a full copy of the dataset. So, actually distributing the data across machine boundaries and querying across machine boundaries is a really, really tricky problem to solve in the general case. We know for sure that by far the best performance when you are creating and querying a graph structure is going to be achieved when you can work within the context of a single machine boundary; you know, within the context of a single machine. The ACID requirements, and the fact that we are dealing with an intrinsically connected structure, means it makes it more difficult for us to easily distribute that data across different machines. At the same time, we get the benefits of working locally within a single instance when we are satisfying a request and that can give us orders of magnitude improved performance over the equivalent kind of query in a relational database.

Charles: And the data replication itself is master-slave, I think.

Yes. So, we have a cluster of machines and one of those machines at any point in time will be acting as the master and coordinating all of the rights to the system, and then the other ma-chines will be serving as slaves and polling at frequent intervals, and pulling across recent tran-sactions and applying them locally. This means that if you write directly to the master, by the time the control returns to the client, that’s immediately consistent with regard to that particular interaction; but the overall system is eventually consistent, in the order of milliseconds. So you can configure that with a polling interval, but there is that window of eventual consistency where the slaves are pulling across the recent transactions and catching up.

8.Do you have any upper limit on the number of slaves that you can have?

I think we are probably talking here in terms of clusters in single digit, perhaps just touching the double digits. I’ve spun up clusters of 30-40 instances, but I only know of customers who have instances of 11-12 or so. But we are talking those kind of numbers.

9.We talked earlier on in the conversation about Facebook as an example of a use case. I’d like to look at some other examples. So, route finding and logistics I guess is one for you, is it?

Route finding goes back to the roots of graph theory itself: discovering routes and shortest routes through the system; so being able to calculate routes on a transport network, for example, this is one of the ways that graph databases and Neo4j is being employed today. In fact, the largest parcel network in Europe is using Neo4j today to route all of the parcels through that network; there are 2,000, at peak time 3,000, parcels per second entering the network, at post offices and other delivery centers. And at that point in time they have a window of up to a second to calculate the most effective or efficient route through the parcel network to the ultimate destination; because the parcel is actually travelling down a chute and at the bottom of the chute it’s going to get switched one way or the other depending on where we think it should be headed.

So, we’ve got that little window of time in which to calculate the route through the entire parcel network; and you are talking about millions of locations all connected by way of different delivery routes. And then the problem is made even more difficult because those delivery routes are time-sensitive. You’ve got more trucks on the road at Christmas time, when there are more parcels in the network, than you do during the summer, so when you are calculating the route you have to be aware of which routes are available today and in the next few days looking forwards. But as I say, Neo4j there is being used to effectively calculate the routes for 2,000 - 3,000 parcels per second and it has to do that within milliseconds; before that parcel has reached the bottom of the chute.

10.And do you have customers using Neo4j for network impact analysis?

Yes, several different telcos who are employing it in one form or another for doing network impact analysis. The nice thing here is in the past a lot of those telcos would know about the different parts of their network through having several different inventory systems; you’ve actually got a very heterogeneous set of elements in that network, from the customers themselves, through the applications and the services, the machines, the virtual machines and the data centers, and then all those lower level network elements: the fibers and the switches and the routers and so on. These companies now are using Neo4j to model the entirety of that network in the database and having done that they can then use it for impact analysis. And you can do two kinds of impact analysis: you can do top-down - given this important customer, which parts of the network do they actually depend upon, which applications and services and machines and data centers and so on; do we have redundancy throughout the network on behalf of these people and if we haven’t, how can we introduce it? That’s the kind of top-down analysis. The bottom-up analysis is, given this particular network element, this switch or this router, who is it going to impact if it fails or if we have to replace or repair it. So companies are using it to plan a lot of their repair programs so that they don’t knock out a path through the network that would impact their customers. And having done all of that, having modeled that in Neo4j, it’s helping reduce problem resolution time. In the past they would have been hours or days just to query all of those inventory systems, to identify the likely cause of the problem. Now you can identify the routes in seconds or milliseconds and that naturally improves the response times.

Another pattern I am seeing with some of those customers: they’ll have the model of the network which they will refresh at least daily, probably more often; but then they may have separate systems, complex event processes perhaps, dealing with the stream of very low level network events - hundreds of thousands of events per second coming off the network - and if those complex event processes identify a significant business event, having looked at that window of low level events will suddenly discover it looks like there is an outage in this particular region, they can then interrogate the graph to discover the likely impact and to anticipate what kind of mitigation they need to take.

11.Do you have financial services companies maybe doing fraud detection work with Neo, as well?

Yes. In fact, I think we’ve written recently something on the Neo4j blog about using Neo for fraud detection. One of the use cases there is something called first party fraud, where you have a group of people who work together to defraud financial institutions. You’ve got a fraud ring, a number of people constitute a fraud ring, and the pool of identity information, things like social security numbers, addresses, telephone numbers and so on. And those people will open multiple lines of credit, multiple accounts, credit cards and so on, picking different bits of identity in-formation from that pool in order to support their application for that credit card or whatever. No two applications need share exactly the same set of identity information, but there is a lot of overlap: you’ve got a large number of people using the pool of identity information to open mul-tiple lines of credit. And then for some period of time they will probably act like good citizens: they’ll pay their bills, they’ll pay off the credit card and so on. But at some point they’ll begin to extend those lines of credit, max out their credit cards, and then all of a sudden the entire ring will just close out, disappear off the radar and take all that money with them.

So, the graph there is being used to connect all those different things: the people, the accounts and the little bits of identity information, so that you can begin to identify the likelihood of dealing with a fraud ring. And you can apply this stuff at the point in time when someone is applying for a new account; you can take their current details, the identity details that they are submitting, see whether there are existing pieces in the graph, connect them to those existing pieces, introduce the new bits of identity where they are talking about something that is a property move, and so on. And then you can look at the graph that surrounds these bits of identity information and see how broad the span is. So, you can imagine if you‘ve got a family living at a particular address, a couple of the people there may have credit cards, bank accounts and so on; so you’d expect to see a small amount of shared information between several different accounts and several different people, but if the span is a lot broader, if you’ve got tens of people sharing that identity information, then the risk is that you are dealing with a fraud ring. So, the graph is being used there to help identify exposure and risk, and again, to help qualify applications at the point in time when someone is submitting an application online.

Charles: That’s very interesting. Ian, thank you very much.

infoq - neo4j graph db的更多相关文章

  1. Graph database_neo4j 底层存储结构分析(7)

    3.7  Relationship 的存储 下面是neo4j graph db 中,Relationship数据存储对应的文件: neostore.relationshipgroupstore.db ...

  2. Graph database_neo4j 底层存储结构分析(6)

    3.6  Node 数据存储 neo4j 中, Node 的存储是由 NodeStore 和 ArrayPropertyStore 2中类型配合来完成的. node 的label 内容是存在Array ...

  3. Graph database_neo4j 底层存储结构分析(5)

    3.5 Property 的存储 下面是neo4j graph db 中,Property数据存储对应的文件: neostore.propertystore.db neostore.propertys ...

  4. 通过Spring Data Neo4J操作您的图形数据库

    在前面的一篇文章<图形数据库Neo4J简介>中,我们已经对其内部所使用的各种机制进行了简单地介绍.而在我们尝试对Neo4J进行大版本升级时,我发现网络上并没有任何成型的样例代码以及简介,而 ...

  5. Graph database_neo4j 底层存储结构分析(1)

    1       neo4j 中节点和关系的物理存储模型 1.1  neo4j存储模型 The node records contain only a pointer to their first pr ...

  6. Coursera, Big Data 1, Introduction (week 1/2)

    Status: week 2 done. Week 1, 主要讲了大数据的的来源 - 机器产生的数据,人产生的数据(比如社交软件上的update, 一般是unstructed data), 组织产生的 ...

  7. 图形数据库Neo4J简介

    最近我在用图形数据库来完成对一个初创项目的支持.在使用过程中觉得这种图形数据库实际上挺有意思的.因此在这里给大家做一个简单的介绍. NoSQL数据库相信大家都听说过.它们常常可以用来处理传统的关系型数 ...

  8. neo4j初次使用学习简单操作-cypher语言使用

    Neo4j 使用cypher语言进行操作 Cypher语言是在学习Neo4j时用到数据库操作语言(DML),涵盖对图数据的增删改查  neo4j数据库简单除暴理解的概念: Neo4j中不存在表的概念, ...

  9. Neo4j 高可用集群安装

    安装neo4j高可用集群,抓图安装过程 http://www.ibm.com/developerworks/cn/java/j-lo-neo4j/ Step1.下载neo4j商业版并解压,复制为neo ...

随机推荐

  1. (转)基于socket的TCP和UDP编程

    一.概述 TCP(传输控制协议)和UDP(用户数据报协议是网络体系结构TCP/IP模型中传输层一层中的两个不同的通信协议. TCP:传输控制协议,一种面向连接的协议,给用户进程提供可靠的全双工的字节流 ...

  2. 【JAE】JAE京东云擎部署首个Java应用

    前几天一直在寻找好用并且免费的PaaS最终就发现了国内的京东云擎JAE和国外的Heroku.我首先选择的是Heroku,因为他有一个非常强大的eclipse插件 为开发带来了很大的方便.用它提交后的代 ...

  3. HUAS_ACM 个人训练#2 C

    题目链接:http://acm.hust.edu.cn/vjudge/contest/view.action?cid=117538#problem/C 题目大意:给出一个字符串,然后给出一系列的x,y ...

  4. Android底部菜单的实现

    前言:以前制作菜单使用TabHost,但是android 3.0以上就被废弃了,google已经不建议使这个类了.ActionBar也是菜单,不过在头部,算是导航了 ===本文就介绍怎么制作底部菜单= ...

  5. pushState与ajax实现无刷新加载

    一.JS代码: $(document).ready(function() { getContent();//初始化页面 $("nav a").click(function() { ...

  6. c++聪聪看书(低数据版代码)

    聪聪是一个善良可爱.睿智聪慧的好孩子.聪聪喜欢看书,这一天她在看一本书时看到了这样一个问题:给你一个正整数n,你要将它分成若干个自然数Ai的和的形式,并且使得这若干个自然数Ai的乘积尽量大,并输出最大 ...

  7. 【Git学习笔记】远程仓库

    第一种情景:本地初始化一个Git仓库后,接着又在github上创建了一个Git仓库,现在要让这两个仓库进行远程同步. 1. 关联本地仓库就和远程仓库  $ git remote add origin ...

  8. python logging usage

    python中,logging模块主要是处理日志的. 所谓日志,可理解为在软件运行过程中,所记录的的一些运行情况信息 软件开发人员可以根据自己的需求添加日志,日志可以帮助软件开发人员 了解软件的运行信 ...

  9. web应用程序开发原理

    企业应用计算的演变为1.主机/哑终端的集中计算模式: 2.客户机/服务器计算模式:3.浏览器    /服务器计算模式.其中,1具有部署方面的优势,但在一台计算机中进行全部的处理,应用程序难于维护,难于 ...

  10. Unity WebGL MoonSharp崩溃问题

    当前Unity的代码更新方案基本都选择的ULua,而我们项目还需要考虑Web平台,ULua不支持WebGL,所以决定选择MoonSharp.MoonSharp(http://www.moonsharp ...