Improving Performance with JanusGraph Indexes

Translated and compiled by: 纪玉奇

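JanusGraph supports two kinds of indexes: graph indexes and vertex-centric indexes. A graph index is typically used to look up vertices or edges by their properties. A vertex-centric index speeds up traversals through the graph, especially when a vertex has many incident edges.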

Graph Index

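A graph index is a global index structure over the entire graph that lets users efficiently retrieve vertices or edges by their properties. Consider the following queries: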

g.V().has('name','hercules')
g.E().has('reason', textContains('loves'))

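Both queries look up vertices or edges by property. Without an index, they require a full scan over all vertices or edges in the graph, which is unacceptable for large graphs.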
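To illustrate why the index matters, here is a minimal sketch in plain Java that contrasts a full scan with a lookup in a prebuilt value-to-vertex map. It is a toy model, not JanusGraph code: the class GraphIndexSketch, its methods, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: vertices are property maps, ids are list positions.
class GraphIndexSketch {
    // Full scan: every vertex is inspected, so cost grows with graph size.
    static List<Integer> scan(List<Map<String, Object>> vertices, String key, Object value) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < vertices.size(); id++) {
            if (value.equals(vertices.get(id).get(key))) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Index build: maps each value of 'key' to the ids of the vertices holding it.
    static Map<Object, List<Integer>> buildIndex(List<Map<String, Object>> vertices, String key) {
        Map<Object, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < vertices.size(); id++) {
            Object value = vertices.get(id).get(key);
            if (value != null) {
                index.computeIfAbsent(value, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }

    // Hypothetical sample data: four vertices with a 'name' property.
    static List<Map<String, Object>> sampleGraph() {
        List<Map<String, Object>> vertices = new ArrayList<>();
        for (String name : new String[] {"hercules", "jupiter", "pluto", "hercules"}) {
            Map<String, Object> props = new HashMap<>();
            props.put("name", name);
            vertices.add(props);
        }
        return vertices;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> vertices = sampleGraph();
        Map<Object, List<Integer>> byName = buildIndex(vertices, "name");
        System.out.println(scan(vertices, "name", "hercules")); // inspects all vertices
        System.out.println(byName.get("hercules"));             // single hash lookup
    }
}
```

Both calls return the same vertex ids. The difference is that the scan touches every vertex on every query, while the index pays its cost once, at build time.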

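JanusGraph distinguishes two kinds of graph index: composite indexes and mixed indexes. Composite indexes are very fast and efficient, but they are limited to equality lookups on a particular, predefined combination of property keys. Mixed indexes can be used for lookups on any combination of indexed keys and, depending on the backing index store, support multiple condition predicates in addition to equality.

Both kinds of index are created through the JanusGraph management system, using the index builder returned by:

JanusGraphManagement.buildIndex(String, Class)

The first argument is the name of the index, which must be unique, and the second is the element type to index (for example Vertex.class). An index built over property keys that were newly defined in the same management transaction is available immediately. Otherwise, a reindex procedure must be run to synchronize the index with the existing data, and the index cannot be used until that procedure completes. It is recommended to define graph indexes in the same transaction as the initial schema.

Note: without an index, JanusGraph falls back to a full graph scan, which performs very poorly. Full scans can be disallowed by setting the force-index configuration option (query.force-index).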

Composite Index

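A composite index retrieves vertices or edges by one or more keys in a fixed combination. In other words, the set of keys a query must constrain is fixed when the index is defined.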

// Never create new indexes while a transaction is active
graph.tx().rollback()
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
// Composite index for looking up vertices by name
mgmt.buildIndex('byNameComposite',Vertex.class).addKey(name).buildCompositeIndex()
// Composite index for looking up vertices by name and age together
mgmt.buildIndex('byNameAndAgeComposite',Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()
// Wait for the indexes to become available
ManagementSystem.awaitGraphIndexStatus(graph,'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph,'byNameAndAgeComposite').call()
// Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"),SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"),SchemaAction.REINDEX).get()
mgmt.commit()

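Note that a composite index is used only when all of its keys appear in the query's equality conditions. With the indexes above, g.V().has('name', 'hercules') and g.V().has('age',30).has('name','hercules') can both be answered from an index. g.V().has('age',30) cannot, because it constrains only age and neither index is defined on age alone. Composite indexes also support only equality conditions, not range conditions, so g.V().has('name','hercules').has('age',inside(20,50)) cannot use byNameAndAgeComposite. Only byNameComposite applies to it, for the name condition.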
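The matching rule can be sketched in a few lines of plain Java. This is a toy model of the rule as described here, not JanusGraph's actual query optimizer. The class CompositeIndexMatch and its methods are hypothetical, and a query is reduced to the set of keys it constrains by equality.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the matching rule only; this is not JanusGraph code.
class CompositeIndexMatch {
    // A composite index is usable when every one of its keys is
    // constrained by an equality condition in the query.
    static boolean usable(List<String> indexKeys, Set<String> equalityKeys) {
        return equalityKeys.containsAll(indexKeys);
    }

    // Helper to build the set of keys a query constrains by equality.
    static Set<String> eq(String... keys) {
        return new HashSet<>(Arrays.asList(keys));
    }

    public static void main(String[] args) {
        List<String> byName = Arrays.asList("name");
        List<String> byNameAndAge = Arrays.asList("name", "age");

        // has('age',30).has('name','hercules'): equality on both keys
        System.out.println(usable(byNameAndAge, eq("name", "age")));
        // has('age',30): checks whether either index applies
        System.out.println(usable(byName, eq("age")) || usable(byNameAndAge, eq("age")));
        // has('name','hercules').has('age',inside(20,50)): the range on age
        // is not an equality condition, so the query only offers 'name'
        System.out.println(usable(byName, eq("name")));
        System.out.println(usable(byNameAndAge, eq("name")));
    }
}
```

Range conditions never enter the equality set, which is why a query with a range on age is treated the same as a query on name alone.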

Index Uniqueness

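A composite index can also be used to enforce property uniqueness in the graph. If a composite graph index is defined as unique(), at most one vertex or edge can exist for any given combination of values of the indexed keys.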

graph.tx().rollback()//Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
mgmt.buildIndex('byNameUnique',Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph,'byNameUnique').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameUnique"),SchemaAction.REINDEX).get()
mgmt.commit()

Note: on an eventually consistent storage backend, the consistency of the index must be explicitly set to enable locking.

Mixed Index

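A mixed index retrieves vertices or edges by any combination of the keys it was built on. Mixed indexes are more flexible than composite indexes and support condition predicates beyond equality, such as range conditions. On the other hand, mixed indexes are slower than composite indexes.

Unlike composite indexes, mixed indexes require an indexing backend to be configured. JanusGraph can support multiple indexing backends in a single installation, and each indexing backend must be identified by a unique name in the JanusGraph configuration, called the indexing backend name.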

graph.tx().rollback()//Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('nameAndAge',Vertex.class).addKey(name).addKey(age).buildMixedIndex("search")
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph,'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"),SchemaAction.REINDEX).get()
mgmt.commit()

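The code above defines a mixed index named nameAndAge over the property keys name and age and assigns it to the indexing backend "search". The corresponding entry in the configuration file is index.search.backend. If the backend were named solrsearch, the entry would be index.solrsearch.backend.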
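As a sketch of what the matching configuration might look like, the fragment below assumes Elasticsearch running on localhost for the backend named search, plus a second, Solr-based backend named solrsearch. The choice of backends and the hostname are illustrative assumptions, not taken from the original setup.

```properties
# Indexing backend referenced by buildMixedIndex("search")
# (assumption: Elasticsearch on the local machine)
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1

# A second backend with its own name, referenced by buildMixedIndex("solrsearch")
# (assumption: Solr)
index.solrsearch.backend=solr
```

The middle segment of each option name is the indexing backend name, so it must match the string passed to buildMixedIndex().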

The following shows how to make text search the default search behavior for the name key:

mgmt.buildIndex('nameAndAge',Vertex.class).addKey(name,Mapping.TEXT.asParameter()).addKey(age).buildMixedIndex("search")

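For more details, see Chapter 21, Index Parameters and Full-Text Search.

Mixed indexes support range conditions and queries on any combination of the indexed keys, and are not limited to equality conditions: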

g.V().has('name', textContains('hercules')).has('age', inside(20,50))
g.V().has('name', textContains('hercules'))
g.V().has('age', lt(50))

Mixed indexes support full-text search, range search, geo search, and other kinds of query. See Chapter 20, Search Predicates and Data Types.

Note: unlike composite indexes, mixed indexes do not support uniqueness.

Adding Property Keys

Property keys can be added to an existing mixed index, after which they can be used in query conditions.

//Never create new indexes while a transaction is active
graph.tx().rollback()
mgmt = graph.openManagement()
//Create a new property key
location = mgmt.makePropertyKey('location').dataType(Geoshape.class).make()
nameAndAge = mgmt.getGraphIndex('nameAndAge')
//Add the new key to the existing index
mgmt.addIndexKey(nameAndAge, location)
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph,'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"),SchemaAction.REINDEX).get()
mgmt.commit()

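If the added key was defined in the same management transaction, it is available for querying immediately. If the property key is already in use, a reindex procedure must be run to ensure the index contains all existing data, and the key cannot be queried through the index until that procedure completes.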

Mapping Parameters

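When a property key is added to a mixed index, whichever way it is added, a set of parameters can be specified to control how the property value is stored in the indexing backend. See the mapping parameters overview section.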

Ordering

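The order in which the results of a graph query are returned can be specified with order().by(), which takes two arguments:

The name of the property key to order by
The sort order: ascending (incr) or descending (decr)

For example: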

g.V().has('name', textContains('hercules')).order().by('age', decr).limit(10)

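This returns up to 10 vertices whose name contains 'hercules', ordered by age in descending order.

Keep the following in mind when using ordering:

Composite graph indexes do not natively support ordering of results. All results are loaded into memory first and then sorted, which is very expensive for large result sets.
Mixed graph indexes natively support ordered results, but the property key used for ordering must have been added to the mixed index beforehand. If the ordering key is not part of the index, the entire result set is loaded into memory for sorting.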

Label Constraint

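In some cases we do not want to index every vertex or edge in the graph, only those with a particular label. For example, we may want to index only vertices labeled god. The indexOnly method restricts an index to the vertices or edges with a given label: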

//Never create new indexes while a transaction is active
graph.tx().rollback()
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
god = mgmt.getVertexLabel('god')
//Index only vertices with the label god
mgmt.buildIndex('byNameAndLabel',Vertex.class).addKey(name).indexOnly(god).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph,'byNameAndLabel').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndLabel"),SchemaAction.REINDEX).get()
mgmt.commit()

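Label restrictions apply to mixed indexes in the same way. When a composite index with a label restriction is defined as unique, the uniqueness constraint applies only to properties on vertices or edges with that label.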

Composite versus Mixed Indexes

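1. Use a composite index for exact-match lookups. Composite indexes do not require an external indexing system and usually perform better than mixed indexes.

As an exception, use a mixed index if the number of distinct values to match is small (for example, the 12 months of the year) or if a single value is associated with many elements in the graph, that is, when selectivity is low.

2. Use a mixed index for range, full-text, or geo queries. A mixed index can also improve the performance of order().by() queries.

Vertex-centric Indexes

Vertex-centric indexes are local index structures built individually per vertex. In large graphs a vertex can have thousands of incident edges, and traversing through such vertices is very slow because the edges have to be retrieved and then filtered in memory to find those that match the traversal's conditions. Vertex-centric indexes speed up such traversals by using a local index structure.

For example: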

h = g.V().has('name','hercules').next()
g.V(h).outE('battled').has('time', inside(10,20)).inV()

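Without a vertex-centric index, this query has to retrieve all battled edges and filter them to find the matching ones, which is very inefficient when the number of edges is large.

Building a vertex-centric index speeds up the query: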

//Never create new indexes while a transaction is active
graph.tx().rollback()
mgmt = graph.openManagement()
//Look up the property key
time = mgmt.getPropertyKey('time')
//Look up the edge label
battled = mgmt.getEdgeLabel('battled')
//Build the vertex-centric index
mgmt.buildEdgeIndex(battled,'battlesByTime',Direction.BOTH,Order.decr, time)
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitRelationIndexStatus(graph,'battlesByTime','battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(battled,'battlesByTime'),SchemaAction.REINDEX).get()
mgmt.commit()

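The code above builds a vertex-centric index on battled edges in both directions, sorted by time in descending order. The first argument of buildEdgeIndex() is the label of the edges to index, the second is the name of the index, and the third is the edge direction in which the index is built. BOTH means the index can be used for traversals along IN or OUT edges. Restricting the index to a single direction roughly halves the storage and maintenance cost. The last two arguments are the sort order of the index and the property key to index by. Several property keys may be given, and the sort order defaults to ascending (Order.ASC).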
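Conceptually, a vertex-centric index keeps one vertex's edges of a given label sorted by the indexed key, so a range condition becomes a range scan instead of a filter over every edge. The plain-Java sketch below models this with a TreeMap. It is a toy model, not JanusGraph's storage layout: the class name, the one-edge-per-time-value simplification, and the sample edges are assumptions, and the map is kept in ascending order for simplicity even though the index above is declared with Order.decr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: the 'battled' edges of a single vertex, keyed by 'time'.
class VertexCentricSketch {
    // Hypothetical sample edges: time -> name of the vertex the edge points to.
    static TreeMap<Integer, String> battledByTime() {
        TreeMap<Integer, String> index = new TreeMap<>();
        index.put(1, "nemean");
        index.put(2, "hydra");
        index.put(12, "cerberus");
        return index;
    }

    // has('time', inside(lo, hi)): both bounds are exclusive. The sorted map
    // jumps straight to the matching range and never visits the other edges.
    static List<String> inside(TreeMap<Integer, String> index, int lo, int hi) {
        return new ArrayList<>(index.subMap(lo, false, hi, false).values());
    }

    public static void main(String[] args) {
        // Models g.V(h).outE('battled').has('time', inside(10,20)).inV()
        System.out.println(inside(battledByTime(), 10, 20));
    }
}
```

Without the sorted structure, the same query would have to load every battled edge of the vertex and test its time property one by one.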

graph.tx().rollback()//Never create new indexes while a transaction is active
mgmt = graph.openManagement()
time = mgmt.getPropertyKey('time')
rating = mgmt.makePropertyKey('rating').dataType(Double.class).make()
battled = mgmt.getEdgeLabel('battled')
mgmt.buildEdgeIndex(battled,'battlesByRatingAndTime',Direction.OUT,Order.decr, rating, time)
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitRelationIndexStatus(graph,'battlesByRatingAndTime','battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(battled,'battlesByRatingAndTime'),SchemaAction.REINDEX).get()
mgmt.commit()

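The code above builds the index battlesByRatingAndTime over the keys rating and time. The order in which the property keys are specified matters: queries can use the index only for constraints that follow the order in which the keys were defined.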

h = g.V().has('name','hercules').next()
g.V(h).outE('battled').property('rating',5.0)//Add some rating properties
g.V(h).outE('battled').has('rating', gt(3.0)).inV()
g.V(h).outE('battled').has('rating',5.0).has('time', inside(10,50)).inV()
g.V(h).outE('battled').has('time', inside(10,50)).inV()

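Of the queries above, only the two that constrain rating, has('rating', gt(3.0)) and has('rating',5.0).has('time', inside(10,50)), can use the index. The last query constrains only time, so it does not match the index's key order of rating first, then time. Multiple vertex-centric indexes can be built for the same label to support different traversals. JanusGraph automatically chooses the most efficient index for a given query. Vertex-centric indexes support only equality and range/interval constraints.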
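The key-order rule can be expressed as a small prefix check. The plain-Java sketch below is a toy model under an explicit assumption: the index serves equality conditions on a leading run of its keys, optionally followed by one range condition on the next key. The class VertexCentricPrefix, the Kind enum, and servedPrefix are hypothetical names, not part of JanusGraph.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the key-order rule only; this is not JanusGraph code.
class VertexCentricPrefix {
    enum Kind { EQ, RANGE }

    // Returns how many leading index keys the query's constraints can use.
    // 0 means the index cannot be used at all.
    static int servedPrefix(List<String> indexKeys, Map<String, Kind> constraints) {
        int served = 0;
        for (String key : indexKeys) {
            Kind kind = constraints.get(key);
            if (kind == null) {
                break;              // gap: later keys cannot be served
            }
            served++;
            if (kind == Kind.RANGE) {
                break;              // a range ends the usable prefix
            }
        }
        return served;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("rating", "time");

        Map<String, Kind> ratingRange = new HashMap<>();
        ratingRange.put("rating", Kind.RANGE);          // has('rating', gt(3.0))

        Map<String, Kind> ratingThenTime = new HashMap<>();
        ratingThenTime.put("rating", Kind.EQ);          // has('rating', 5.0)
        ratingThenTime.put("time", Kind.RANGE);         // has('time', inside(10,50))

        Map<String, Kind> timeOnly = new HashMap<>();
        timeOnly.put("time", Kind.RANGE);               // has('time', inside(10,50))

        System.out.println(servedPrefix(keys, ratingRange));
        System.out.println(servedPrefix(keys, ratingThenTime));
        System.out.println(servedPrefix(keys, timeOnly));
    }
}
```

A query constrained only on time gets a served prefix of zero here, which is why a separate index keyed on time (such as battlesByTime) is needed for it.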

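Note: the property keys used in a vertex-centric index must have an explicitly defined data type (that is, not Object.class) that supports a native sort order. If the data type is a floating-point number, JanusGraph's Decimal or Precision data types must be used.

An index on a label newly defined in the same management transaction is available immediately. If the edge label is already in use, a reindex procedure must be run, and the index cannot be used until that procedure completes.

Note: JanusGraph automatically builds vertex-centric indexes per edge label and property key, so queries restricted to a single edge label or property key are answered efficiently even on a vertex with thousands of incident edges.

Vertex-centric indexes cannot speed up unconstrained traversals, which have to visit all incident edges. Such traversals become slower as the number of edges grows. They can often be rewritten as constrained traversals to improve performance.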

Ordering Traversals

The following queries use the local and limit steps to retrieve an ordered subset of the edges encountered during the traversal.

h = g.V().has('name','hercules').next()
g.V(h).local(outE('battled').order().by('time', decr).limit(10)).inV().values('name')
g.V(h).local(outE('battled').has('rating',5.0).order().by('time', decr).limit(10)).values('place')

These queries are answered very efficiently if the sort key and sort order match those of a vertex-centric index.
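The reason is that an index already sorted on the requested key can hand back the first few edges without sorting anything. The plain-Java sketch below illustrates this with a TreeMap read in descending order. It is a toy model with hypothetical names and sample data, not JanusGraph code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: edges of one vertex kept sorted by 'time'.
class OrderedTraversalSketch {
    // Models order().by('time', decr).limit(n): walks the sorted structure
    // from the largest key and stops after 'limit' edges, with no sort step.
    static List<String> latest(TreeMap<Integer, String> edgesByTime, int limit) {
        List<String> result = new ArrayList<>();
        for (String target : edgesByTime.descendingMap().values()) {
            if (result.size() == limit) {
                break;
            }
            result.add(target);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> edgesByTime = new TreeMap<>();
        edgesByTime.put(1, "nemean");
        edgesByTime.put(2, "hydra");
        edgesByTime.put(12, "cerberus");
        System.out.println(latest(edgesByTime, 2));
    }
}
```

If the requested sort key or direction did not match the stored order, all edges would have to be fetched and sorted in memory before the limit could be applied.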

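Note: ordered vertex queries are a JanusGraph extension to Gremlin. Using this feature requires a rather verbose syntax and the _() step to convert the JanusGraph result into a Gremlin pipeline.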