A First Exploration Of SolrCloud

Update: this article was published in August 2012, before the very first release of SolrCloud. SolrCloud has evolved since then, so please refer to the Solr website and community for up-to-date information.

SolrCloud has recently been in the news and was merged into Solr trunk, so it was high time to take a fresh look at it. The SolrCloud wiki page gives various examples but left a few things unclear to me. The examples only show Solr instances that host one core/shard, and the page doesn’t go deep into the relation between cores, collections and shards, or how to manage configurations.

In this blog, we will have a look at an example where we host multiple shards per instance, and explain some details along the way.

The setup we are going to create is shown in this diagram.

SolrCloud terminology

In SolrCloud you can have multiple collections. Collections can be divided into partitions; these partitions are called slices. Each slice can exist in multiple copies; these copies of the same slice are called shards. So the word shard has a somewhat confusing meaning in SolrCloud: it is a replica rather than a partition. One of the shards within a slice is the leader, though this is not fixed: any shard can become the leader through a leader-election process.

Each shard is a physical index, so one shard corresponds to one Solr core.

If you look at the SolrCloud wiki page, you won’t find the word slice anymore. It seems like the idea is to hide the use of this word, though once you start looking a bit deeper you will encounter it anyway, so it’s good to know about it. It’s also good to know that the words shard and slice are often used ambiguously, one being swapped for the other (even in the sources). Once you know this, things become more comprehensible. An interesting quote in this regard: “removing that ambiguity by introducing another term seemed to add more perceived complexity”. In this article I’ll use the words slice and shard as defined above, so that we can distinguish the two concepts.
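The terminology can be captured in a small data model. The following is only an illustrative Python sketch: the class names and fields are hypothetical and do not correspond to Solr classes, and the collection, slice and core names are the ones used in the example built later in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Shard:
    """One copy of a slice; physically one Solr core (hypothetical model)."""
    core_name: str
    port: int
    leader: bool = False


@dataclass
class Slice:
    """One partition of a collection, stored in one or more shards."""
    name: str
    shards: List[Shard] = field(default_factory=list)

    def leader(self) -> Optional[Shard]:
        # The leader is not fixed; here it is simply whichever shard is flagged.
        return next((shard for shard in self.shards if shard.leader), None)


@dataclass
class Collection:
    name: str
    config_name: str
    slices: Dict[str, Slice] = field(default_factory=dict)

    def core_count(self) -> int:
        # Every shard is a core, so count the shards of all slices.
        return sum(len(sl.shards) for sl in self.slices.values())


collection = Collection("collectionOne", "config1")
collection.slices["slice1"] = Slice("slice1", [
    Shard("core_collectionOne_slice1_shard1", 8501, leader=True),
    Shard("core_collectionOne_slice1_shard2", 8502),
])
collection.slices["slice2"] = Slice("slice2", [
    Shard("core_collectionOne_slice2_shard1", 8502, leader=True),
    Shard("core_collectionOne_slice2_shard2", 8501),
])

print(collection.core_count(), "cores in", len(collection.slices), "slices")
```

The point of the model is the nesting: a collection owns slices, a slice owns shards, and only the shards map to something physical.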

In SolrCloud, the Solr configuration files like schema.xml and solrconfig.xml are stored in ZooKeeper. You can upload multiple configurations to ZooKeeper, and each collection can be associated with one configuration. The Solr instances hence don’t need the configuration files on the file system; they read them from ZooKeeper.

Running ZooKeeper

Let’s start by launching a ZooKeeper instance. While Solr allows you to run an embedded ZooKeeper instance, I find that this complicates things. ZooKeeper is responsible for storing coordination and configuration information for the cluster, and should be highly available. By running it separately, we can start and stop Solr instances without having to think about which one(s) embed ZooKeeper.

For the purpose of this article, you can get it running like this:

  • download ZooKeeper 3.3. Don’t take version 3.4, as it’s not recommended for production yet.
  • extract the download
  • copy conf/zoo_sample.cfg to conf/zoo.cfg
  • mkdir /tmp/zookeeper (path can be changed in zoo.cfg)
  • ./bin/zkServer.sh start

And that’s it, you have ZooKeeper running.

Setting up Solr instance directories

We are going to run two Solr instances, thus we’ll need two Solr instance directories.

Let’s create two directories like this:

mkdir -p ~/solrcloudtest/sc_server_1
mkdir -p ~/solrcloudtest/sc_server_2

Create the file ~/solrcloudtest/sc_server_1/solr.xml containing:

<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="8501">
  </cores>
</solr>

Create the file ~/solrcloudtest/sc_server_2/solr.xml containing (note the different hostPort value):

<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="8502">
  </cores>
</solr>

We need to specify the hostPort attribute because Solr can’t detect the port itself: when the attribute is not specified, it falls back to the default 8983.
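If you want more than two instances, writing these files by hand gets tedious. As a rough sketch (the function name and the use of a temporary base directory are my own choices, not part of Solr), the instance directories and their solr.xml files could be generated like this:

```python
import tempfile
from pathlib import Path

# Same minimal solr.xml as above; only the hostPort value differs per instance.
SOLR_XML = """<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="{port}">
  </cores>
</solr>
"""


def write_instance_dirs(base: Path, ports):
    """Create sc_server_N directories under base, one per port."""
    created = []
    for index, port in enumerate(ports, start=1):
        instance_dir = base / f"sc_server_{index}"
        instance_dir.mkdir(parents=True, exist_ok=True)
        (instance_dir / "solr.xml").write_text(SOLR_XML.format(port=port))
        created.append(instance_dir)
    return created


# A temporary directory is used here so the sketch does not touch ~/solrcloudtest.
base = Path(tempfile.mkdtemp()) / "solrcloudtest"
dirs = write_instance_dirs(base, [8501, 8502])
for instance_dir in dirs:
    print(instance_dir)
```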

This is all we need: the actual core configuration will be uploaded to ZooKeeper in the next section.

Creating a Solr configuration in ZooKeeper

As explained before, the Solr configuration needs to be available in ZooKeeper rather than on the file system.

Currently, you can upload a configuration directory from the file system to ZooKeeper as part of the Solr startup. It is also possible to run ZkController’s main method for this purpose (SOLR-2805), but as there’s no script to launch it, the easiest way right now to upload a configuration is by starting Solr:

export SOLR_HOME=/path/to/solr-trunk/solr

# important: move into one of the instance directories!
# (otherwise Solr will start up with defaults and create a core etc.)
cd sc_server_1

java \
 -Djetty.port=8501 \
 -Djetty.home=$SOLR_HOME/example/ \
 -Dsolr.solr.home=. \
 -Dbootstrap_confdir=$SOLR_HOME/example/solr/conf \
 -Dcollection.configName=config1 \
 -DzkHost=localhost:2181 \
 -jar $SOLR_HOME/example/start.jar

Now that the configuration is uploaded, you can stop this Solr instance again (ctrl+c).

We can now check in ZooKeeper that the configuration has been uploaded, and that no collections have been created yet.

For this, go to the ZooKeeper directory, run ./bin/zkCli.sh, and execute the following commands:

[zk: localhost:2181(CONNECTED) 1] ls /configs
[config1]

[zk: localhost:2181(CONNECTED) 2] ls /collections
[]

You could repeat this process to upload more configurations.

If you would like to change a configuration later on, you essentially have to upload it again in the same way. However, the Solr cores that make use of that configuration won’t be reloaded automatically (SOLR-3071).

Starting the Solr servers

All SolrCloud magic is activated by specifying the zkHost parameter. Without this parameter, you run Solr ‘classic’; with it, you run SolrCloud. If you look into the source code, you will see that this parameter causes the creation of a ZkController, and that checks of the kind ‘zkController != null’ are done in various places to change behavior when in cloud mode.

Open two shells, and start the two Solr instances:

export SOLR_HOME=/path/to/solr-trunk/solr
cd sc_server_1
java \
 -Djetty.port=8501 \
 -Djetty.home=$SOLR_HOME/example/ \
 -Dsolr.solr.home=. \
 -DzkHost=localhost:2181 \
 -jar $SOLR_HOME/example/start.jar

and (note: different instance dir & jetty port)

export SOLR_HOME=/path/to/solr-trunk/solr
cd sc_server_2
java \
 -Djetty.port=8502 \
 -Djetty.home=$SOLR_HOME/example/ \
 -Dsolr.solr.home=. \
 -DzkHost=localhost:2181 \
 -jar $SOLR_HOME/example/start.jar

Note that we no longer have to specify the bootstrap_confdir and collection.configName properties (though the latter can still be useful as a default sometimes, but not with the way we will create collections & shards below).
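To make the difference between the two kinds of startup explicit, here is a small Python sketch that assembles the java command line. The helper function is hypothetical; it only uses the system properties that appear in the commands in this article.

```python
from typing import List, Optional


def solr_start_command(solr_home: str, port: int,
                       zk_host: str = "localhost:2181",
                       bootstrap_confdir: Optional[str] = None,
                       config_name: Optional[str] = None) -> List[str]:
    """Build the argument list for starting one Solr instance."""
    args = [
        "java",
        f"-Djetty.port={port}",
        f"-Djetty.home={solr_home}/example/",
        "-Dsolr.solr.home=.",
    ]
    # These two are only needed for the one-off configuration upload.
    if bootstrap_confdir is not None:
        args.append(f"-Dbootstrap_confdir={bootstrap_confdir}")
    if config_name is not None:
        args.append(f"-Dcollection.configName={config_name}")
    # zkHost is what switches Solr into cloud mode.
    args.append(f"-DzkHost={zk_host}")
    args += ["-jar", f"{solr_home}/example/start.jar"]
    return args


home = "/path/to/solr-trunk/solr"
upload = solr_start_command(home, 8501,
                            bootstrap_confdir=f"{home}/example/solr/conf",
                            config_name="config1")
normal = solr_start_command(home, 8501)
print(" ".join(upload))
print(" ".join(normal))
```

The resulting list could be passed to subprocess.run with the instance directory as working directory, since solr.solr.home is given as a relative path.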

We have not added the -DnumShards parameter either, which you might have encountered elsewhere. When you manually assign cores to shards as we will do below, I don’t think it serves any purpose.

So the situation now is that we have two Solr instances running, both with 0 cores.

Define the cores, collections, slices, shards

We are now going to create cores, and assign each core to a specific collection and slice. It is not necessary to define collections & slices anywhere: they exist implicitly because there are cores that refer to them.

In our example, the collection is called ‘collectionOne’ and the slices are called ‘slice1’ and ‘slice2’.

Let’s start with creating a core on the first server:

curl 'http://localhost:8501/solr/admin/cores?action=CREATE&name=core_collectionOne_slice1_shard1&collection=collectionOne&shard=slice1&collection.configName=config1'

(in the URL above, and the solr.xml snippet below, the word ‘shard’ is used for ‘slice’)

If you have a look now at sc_server_1/solr.xml, you will see the core was added:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" zkClientTimeout="10000" hostPort="8501"
hostContext="solr">
    <core
       name="core_collectionOne_slice1_shard1"
       collection="collectionOne"
       shard="slice1"
       instanceDir="core_collectionOne_slice1_shard1/"/>
  </cores>
</solr>

As far as I understand, the information in ZooKeeper takes precedence, so the collection and shard attributes on the core above serve mostly as documentation. They are of course also relevant if you create cores by listing them in solr.xml rather than through the cores-admin API. Listing them in solr.xml might actually be simpler than doing a bunch of API calls, but there is currently one limitation: you can’t specify the configName this way.

In ZooKeeper, you can verify this collection is associated with the config1 configuration:

[zk: localhost:2181(CONNECTED) 5] get /collections/collectionOne
{"configName":"config1"}

And you can also get an overview of all collections, slices and shards like this:

[zk: localhost:2181(CONNECTED) 0] get /clusterstate.json
{
  "collectionOne": {
    "slice1": {
      "fietsbel:8501_solr_core_collectionOne_slice1_shard1": {
        "shard_id":"slice1",
        "leader":"true",
        "state":"active",
        "core":"core_collectionOne_slice1_shard1",
        "collection":"collectionOne",
        "node_name":"fietsbel:8501_solr",
        "base_url":"http://fietsbel:8501/solr"
      }
    }
  }
}

(the somewhat strange "shard_id":"slice1" is just a back-pointer from the shard to the slice to which it belongs)

Now let’s create the remaining cores: one more on server 1, and two on server 2 (notice the different port numbers to which we send these requests).

curl 'http://localhost:8502/solr/admin/cores?action=CREATE&name=core_collectionOne_slice2_shard1&collection=collectionOne&shard=slice2&collection.configName=config1'

curl 'http://localhost:8501/solr/admin/cores?action=CREATE&name=core_collectionOne_slice2_shard2&collection=collectionOne&shard=slice2&collection.configName=config1'

curl 'http://localhost:8502/solr/admin/cores?action=CREATE&name=core_collectionOne_slice1_shard2&collection=collectionOne&shard=slice1&collection.configName=config1'
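The four requests follow a regular pattern, so they can also be generated from a table of (port, slice, shard) assignments. This is just a sketch: the LAYOUT table and the function name are my own, and the URLs are only printed, not sent.

```python
from urllib.parse import urlencode

# Which core lives on which server, as in the curl commands above.
LAYOUT = [
    (8501, "slice1", "shard1"),
    (8502, "slice2", "shard1"),
    (8501, "slice2", "shard2"),
    (8502, "slice1", "shard2"),
]


def core_create_url(port, collection, slice_name, shard_name, config_name):
    """Build the cores-admin CREATE URL for one core."""
    params = {
        "action": "CREATE",
        "name": f"core_{collection}_{slice_name}_{shard_name}",
        "collection": collection,
        # The request parameter is called 'shard' but carries the slice name.
        "shard": slice_name,
        "collection.configName": config_name,
    }
    return f"http://localhost:{port}/solr/admin/cores?{urlencode(params)}"


urls = [core_create_url(port, "collectionOne", slice_name, shard_name, "config1")
        for port, slice_name, shard_name in LAYOUT]
for url in urls:
    print(url)
```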

Let’s have a look in ZooKeeper at the current state of the clusterstate.json:

[zk: localhost:2181(CONNECTED) 0] get /clusterstate.json
{
  "collectionOne": {
    "slice1": {
      "fietsbel:8501_solr_core_collectionOne_slice1_shard1": {
        "shard_id":"slice1",
        "leader":"true",
        "state":"active",
        "core":"core_collectionOne_slice1_shard1",
        "collection":"collectionOne",
        "node_name":"fietsbel:8501_solr",
        "base_url":"http://fietsbel:8501/solr"
      },
      "fietsbel:8502_solr_core_collectionOne_slice1_shard2": {
        "shard_id":"slice1",
        "state":"active",
        "core":"core_collectionOne_slice1_shard2",
        "collection":"collectionOne",
        "node_name":"fietsbel:8502_solr",
        "base_url":"http://fietsbel:8502/solr"
      }
    },
    "slice2": {
      "fietsbel:8502_solr_core_collectionOne_slice2_shard1": {
        "shard_id":"slice2",
        "leader":"true",
        "state":"active",
        "core":"core_collectionOne_slice2_shard1",
        "collection":"collectionOne",
        "node_name":"fietsbel:8502_solr",
        "base_url":"http://fietsbel:8502/solr"
      },
      "fietsbel:8501_solr_core_collectionOne_slice2_shard2": {
        "shard_id":"slice2",
        "state":"active",
        "core":"core_collectionOne_slice2_shard2",
        "collection":"collectionOne",
        "node_name":"fietsbel:8501_solr",
        "base_url":"http://fietsbel:8501/solr"
      }
    }
  }
}

We see we have:

  • one collection named collectionOne
  • two slices named slice1 and slice2
  • each slice has two shards. Within each slice, one shard is the leader (see "leader":"true"); the others are replicas. Each Solr instance hosts one shard of each slice.
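The same summary can be derived from the JSON programmatically. The Python sketch below works on a trimmed copy of the cluster state (only the fields it needs are kept) and assumes the structure shown in this article, which may well differ in other Solr versions.

```python
import json

# Trimmed copy of /clusterstate.json: collection -> slice -> shard -> properties.
CLUSTER_STATE = """
{
  "collectionOne": {
    "slice1": {
      "fietsbel:8501_solr_core_collectionOne_slice1_shard1":
        {"leader": "true", "state": "active", "node_name": "fietsbel:8501_solr"},
      "fietsbel:8502_solr_core_collectionOne_slice1_shard2":
        {"state": "active", "node_name": "fietsbel:8502_solr"}
    },
    "slice2": {
      "fietsbel:8502_solr_core_collectionOne_slice2_shard1":
        {"leader": "true", "state": "active", "node_name": "fietsbel:8502_solr"},
      "fietsbel:8501_solr_core_collectionOne_slice2_shard2":
        {"state": "active", "node_name": "fietsbel:8501_solr"}
    }
  }
}
"""


def leaders(cluster_state: dict) -> dict:
    """Map (collection, slice) to the node hosting that slice's leader shard."""
    result = {}
    for collection, slices in cluster_state.items():
        for slice_name, shards in slices.items():
            for props in shards.values():
                # The leader flag is a string in the JSON, not a boolean.
                if props.get("leader") == "true":
                    result[(collection, slice_name)] = props["node_name"]
    return result


state = json.loads(CLUSTER_STATE)
for (collection, slice_name), node in sorted(leaders(state).items()):
    print(f"{collection}/{slice_name}: leader on {node}")
```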

Adding some documents

Now let’s check that our setup works by adding some documents:

cd $SOLR_HOME/example/exampledocs
java -Durl=http://localhost:8501/solr/core_collectionOne_slice1_shard1/update -jar post.jar *.xml

We sent the request to one specific core, but you could have picked any other core and the end result would be the same. The request will be forwarded automatically to the leader shard of the appropriate slice. The slice is selected based on the hash of the id of the document.
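To illustrate the idea of hash-based slice selection, here is a deliberately simplified Python sketch. It is not Solr’s actual algorithm: the use of CRC32 and of a plain modulo are assumptions made only to show that the document id alone determines the slice, regardless of which core received the request.

```python
import zlib


def slice_for(doc_id: str, slice_names):
    """Pick a slice from the document id (simplified stand-in for Solr's hashing)."""
    # A stable hash is needed: the same id must always map to the same slice.
    digest = zlib.crc32(doc_id.encode("utf-8"))
    return slice_names[digest % len(slice_names)]


slices = ["slice1", "slice2"]
for doc_id in ("doc-a", "doc-b", "doc-c"):
    print(doc_id, "->", slice_for(doc_id, slices))
```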

Use the admin stats page to verify that documents were added to all cores:

http://localhost:8501/solr/core_collectionOne_slice1_shard1/admin/stats.jsp
http://localhost:8502/solr/core_collectionOne_slice1_shard2/admin/stats.jsp

http://localhost:8502/solr/core_collectionOne_slice2_shard1/admin/stats.jsp
http://localhost:8501/solr/core_collectionOne_slice2_shard2/admin/stats.jsp

In my case, the cores for slice1 got 16 documents and those for slice2 got 12. Unlike traditional Solr replication, SolrCloud sends updates directly to the replicas.

Querying

Let’s just query all documents. Again, we send the request to one particular shard:

http://localhost:8501/solr/core_collectionOne_slice1_shard1/select?q=*:*

You will see numFound="28", the sum of 16 and 12.

What happens internally is that when you send a request to a core in SolrCloud mode, Solr looks up which collection the core is associated with, and does a distributed query across all slices (it picks one shard for each slice).
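The “one shard per slice” selection can be sketched as follows. This is only an illustration of the principle, not what Solr does internally: the random choice is an assumption, and the result is joined into a comma-separated list in the style of the shards parameter known from classic distributed search.

```python
import random

# Core locations per slice, taken from the example setup above.
SLICES = {
    "slice1": ["localhost:8501/solr/core_collectionOne_slice1_shard1",
               "localhost:8502/solr/core_collectionOne_slice1_shard2"],
    "slice2": ["localhost:8502/solr/core_collectionOne_slice2_shard1",
               "localhost:8501/solr/core_collectionOne_slice2_shard2"],
}


def pick_one_shard_per_slice(slices, rng=random):
    """Choose one shard for every slice; together they cover the whole collection."""
    return {name: rng.choice(shards) for name, shards in slices.items()}


chosen = pick_one_shard_per_slice(SLICES)
shards_param = ",".join(chosen[name] for name in sorted(chosen))
print(shards_param)
```

Because every slice contributes exactly one shard, each document is counted once, which is why the total in the example is 16 + 12 and not double that.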

The SolrCloud wiki page suggests that you can use the collection name in the URL (like /solr/collection1/select). In our example, this would then be /solr/collectionOne/select. This is however not the case; it is rather a particularity of that example, where the core happens to be named after the collection. As long as you don’t host more than one slice and shard of the same collection in one Solr server, it can make sense to use such a core naming strategy.

Starting from scratch

When playing around with this stuff, you might want to start from scratch sometimes. In that case, don’t forget you have to remove data in three places: (1) the state stored in ZooKeeper, (2) the cores defined in solr.xml and (3) the instance directories of these cores.
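Steps (2) and (3) can be automated per instance directory. The Python sketch below is a hypothetical helper, not a Solr tool: it assumes the solr.xml layout shown earlier, expects the Solr instance to be stopped, and leaves the ZooKeeper state of step (1) untouched. It is demonstrated on a throwaway directory.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def reset_instance(instance_dir: Path) -> list:
    """Remove all <core> entries from solr.xml and delete their directories."""
    solr_xml = instance_dir / "solr.xml"
    tree = ET.parse(str(solr_xml))
    cores = tree.getroot().find("cores")
    removed = []
    for core in list(cores.findall("core")):
        # instanceDir is relative to the Solr instance directory.
        core_dir = instance_dir / core.get("instanceDir")
        if core_dir.is_dir():
            shutil.rmtree(core_dir)
        cores.remove(core)
        removed.append(core.get("name"))
    tree.write(str(solr_xml), encoding="UTF-8", xml_declaration=True)
    return removed


# Demonstration on a temporary directory rather than a real Solr instance.
demo = Path(tempfile.mkdtemp()) / "sc_server_1"
(demo / "core_collectionOne_slice1_shard1").mkdir(parents=True)
(demo / "solr.xml").write_text(
    '<solr persistent="true">\n'
    '  <cores adminPath="/admin/cores" hostPort="8501">\n'
    '    <core name="core_collectionOne_slice1_shard1" collection="collectionOne"\n'
    '          shard="slice1" instanceDir="core_collectionOne_slice1_shard1/"/>\n'
    '  </cores>\n'
    '</solr>\n'
)
removed_cores = reset_instance(demo)
print(removed_cores)
```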

When writing the first draft of this article, I was using just one Solr instance and tried to have all 4 cores (including replicas) in that one instance. It turns out there was a bug that prevents this from working correctly (SOLR-3108).

Managing slices & shards

Once you have defined a collection, you cannot (or rather should not) add new slices to it, since documents won’t be automatically moved to the new slice to fit with the hash-based partitioning (SOLR-2595).
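A small experiment shows why this matters. The Python sketch below again uses CRC32 with a modulo as a stand-in for Solr’s real hashing (an assumption, as are the made-up document ids) and counts how many documents would map to a different slice if a collection went from two slices to three.

```python
import zlib


def slice_index(doc_id: str, slice_count: int) -> int:
    """Simplified hash partitioning: stable hash of the id, modulo slice count."""
    return zlib.crc32(doc_id.encode("utf-8")) % slice_count


doc_ids = [f"doc-{n}" for n in range(1000)]

# Documents whose slice under 3 slices differs from their slice under 2 slices.
moved = [doc_id for doc_id in doc_ids
         if slice_index(doc_id, 2) != slice_index(doc_id, 3)]

print(f"{len(moved)} of {len(doc_ids)} documents would belong to a different slice")
```

Any document in that list would sit in a slice where updates for its id no longer arrive, unless something moves it.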

Adding more replica shards should be no problem though. While above we have used a very explicit way of assigning each core to a particular slice, you can actually leave that parameter off and Solr will automatically assign the core to some slice within the collection. (I guess this is where the -DnumShards parameter kicks in, to decide whether the new core should become a new slice or an additional shard of an existing slice.)

How about removing replicas? It can be done, but manually. You have to unload the core and remove the related state in ZooKeeper. This is an area that will be improved upon later (SOLR-3080).
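The unload half of that procedure goes through the same cores-admin endpoint used for creating cores. As a sketch (the helper function is my own, and the URL is only printed, not sent), the request for removing one of the replicas of slice1 would look like this; the ZooKeeper cleanup still has to be done by hand.

```python
from urllib.parse import urlencode


def core_unload_url(port: int, core_name: str) -> str:
    """Build the cores-admin UNLOAD URL for one core."""
    params = {"action": "UNLOAD", "core": core_name}
    return f"http://localhost:{port}/solr/admin/cores?{urlencode(params)}"


print(core_unload_url(8502, "core_collectionOne_slice1_shard2"))
```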

Another interesting thing to note is that when you run in SolrCloud mode, all cores automatically take part in the cloud. If you add a core without specifying the collection, a collection named after that core will be created. You can’t mix ‘classic’ cores and ‘cloud’ cores in one Solr instance.

Conclusion

In this article we have barely scratched the surface of everything SolrCloud is: there’s the update log for durability and recovery, the syncing between replicas, the details of distributed querying and updating, the special _version_ field to help with some of these points, the coordination (election of overseer & shard leaders), … Much interesting stuff to explore!

As becomes clear from this article, SolrCloud isn’t as easy to use
yet as ElasticSearch. It still needs polishing and there’s more manual
work involved in setting it up. To some extent this has its advantages,
as long as it’s clear what you can expect from the system, and what you
have to take care of yourself. Anyway, it’s great to see that the Solr
developers were able to catch up with the cloud world.

Data-Driven Solutions from NGDATA
