【原创】大数据基础之ElasticSearch(1)简介、安装、使用
ElasticSearch 6.6.0
官方:https://www.elastic.co/
一 简介
ElasticSearch简单来说是对lucene的分布式封装,增加了shard(每个shard是一个子索引,也是一个lucene的index)和replica的概念;所以在ElasticSearch也可以见到lucene中的概念,比如index、document等。
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.
es是一个可扩展、开源的全文检索和分析引擎。es可以近乎实时的存储、搜索和分析大规模数据;es通常作为底层的技术或者引擎使得应用可以实现复杂查询的需求和场景。
核心概念
1 集群概念
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Cluster(集群)由一个或多个Node组成,Cluster作为一个整体持有所有的数据,同时提供索引和查询功能;cluster有一个唯一的名字,默认是elasticsearch。
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.
一个Node(节点)是一台服务器,作为集群的组成部分,存储数据同时参与集群的索引和查询功能;每个Node也有一个唯一的名字,默认是UUID。
每个Node都可以通过指定一个集群的名字来加入cluster;
2 索引概念
Index
An index is a collection of documents that have somewhat similar characteristics. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
一个Index是多个具有相似特征的document的集合;一个Index也有一个名字(全小写)。
Type(废弃)
A type used to be a logical category/partition of your index to allow you to store different types of documents in the same index. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version.
一个type是一个逻辑概念,表示index中的category或者partition,未来会被废弃掉;
废弃进度:
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Types will be deprecated in APIs in Elasticsearch 7.0.0, and completely removed in 8.0.0.
废弃原因:
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions. In an SQL database, tables are independent of each other. The columns in one table have no bearing on columns with the same name in another table. This is not the case for fields in a mapping type.
Document
A document is a basic unit of information that can be indexed. This document is expressed in JSON (JavaScript Object Notation) which is a ubiquitous internet data interchange format.
一个document是一个索引的基本信息单元,格式为json;
3 其他
Near Realtime (NRT)
Elasticsearch is a near-realtime search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
es是一个近乎实时的搜索平台,近乎实时的含义是从插入document到可以搜索到document的延迟很小,通常是1s。
Shards
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and is transparent to you as the user.
一个索引如果非常大可能会超出单机硬件限制,es的解决方案是将索引划分为多个子索引,每个子索引也叫shard(分片)。当创建索引的时候,可以指定shard的数量,每个shard都是独立的索引;
shard很重要:1)水平扩展;2)分布式和并行操作;
Replicas
In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
网络环境中失败(硬件故障或网络故障)在所难免,所以failover就很重要;es允许你配置每个shard有0个或多个备份,也叫replica(副本);
replica很重要:1)高可用;2)并行操作;
Summary
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).
The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may also change the number of replicas dynamically anytime. You can change the number of shards for an existing index using the _shrink and _split APIs, however this is not a trivial task and pre-planning for the correct number of shards is the optimal approach.
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
每个index可以被拆分为多个shard,每个索引都可以被复制0或多次;一旦被复制,每个索引都会有primary shard(主分片,被复制数据)和replica shard(复制分片,从主分片复制数据);
shard和replica的数量可以在创建索引的时候配置,后续也可以修改;
默认的shard数量是5,默认的replica数量是1;
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
每个es的shard都是一个lucene的index,lucene的index中有document数量限制,如果超出数量,es需要增加更多的分片;
二 安装
1 docker安装
# docker run elasticsearch
2 ambari安装
详见:https://www.cnblogs.com/barneywill/p/10281678.html
3 手工tar安装
# curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.6.2.tar.gz
# tar -xvf elasticsearch-6.6.2.tar.gz
# cd elasticsearch-6.6.2
配置文件
config/elasticsearch.yml
可以配置集群名称,数据目录(和hdfs一样,可以配置多个硬盘提升性能)等;
启动
# bin/elasticsearch
4 手工yum安装
# yum install elasticsearch
配置文件
/etc/elasticsearch/elasticsearch.yml
启动
# service elasticsearch start
启动之后访问
# curl http://$es_server:9200
{
"name" : "y8NAQ9f",
"cluster_name" : "es_jdc",
"cluster_uuid" : "fZU17hfjTKO2IDUwnGmbEg",
"version" : {
"number" : "6.6.0",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "a9861f4",
"build_date" : "2019-01-24T11:27:09.439740Z",
"build_snapshot" : false,
"lucene_version" : "7.6.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
如果启动报错
ERROR: [1] bootstrap checks failed
[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
当前配置
# sysctl -a|grep vm.max_map_count
vm.max_map_count = 65530
修改方法
sysctl -w vm.max_map_count=262144
or
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
重要配置:https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html
三 使用
Elasticsearch uses standard RESTful APIs and JSON. We also build and maintain clients in many languages such as Java, Python, .NET, SQL, and PHP.
其中rest api方式,详见 https://www.cnblogs.com/barneywill/p/10296419.html
在hadoop和es之间进行数据迁移,参考:https://www.elastic.co/products/hadoop
【原创】大数据基础之ElasticSearch(1)简介、安装、使用的更多相关文章
- 【原创】大数据基础之ElasticSearch(2)常用API整理
Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to i ...
- 【原创】大数据基础之ElasticSearch(4)es数据导入过程
1 准备analyzer 内置analyzer 参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis- ...
- 【原创】大数据基础之ElasticSearch(5)重要配置及调优
Index Settings 重要索引配置 Index level settings can be set per-index. Settings may be: 1 static 静态索引配置 Th ...
- 【原创】大数据基础之ElasticSearch(3)升级
elasticsearch版本升级方案 常用的滚动升级过程(Rolling Upgrade)如下: $ curl -XPUT '$es_server:9200/_cluster/settings?pr ...
- 大数据基础环境--jdk1.8环境安装部署
1.环境说明 1.1.机器配置说明 本次集群环境为三台linux系统机器,具体信息如下: 主机名称 IP地址 操作系统 hadoop1 10.0.0.20 CentOS Linux release 7 ...
- 【原创】大数据基础之Zookeeper(2)源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
- 大数据篇:ElasticSearch
ElasticSearch ElasticSearch是什么 ElasticSearch是一个基于Lucene的搜索服务器.它提供了一个分布式多用户能力的全文搜索引擎,基于RESTful web接口. ...
- CentOS6安装各种大数据软件 第八章:Hive安装和配置
相关文章链接 CentOS6安装各种大数据软件 第一章:各个软件版本介绍 CentOS6安装各种大数据软件 第二章:Linux各个软件启动命令 CentOS6安装各种大数据软件 第三章:Linux基础 ...
- 大数据应用日志采集之Scribe 安装配置指南
大数据应用日志采集之Scribe 安装配置指南 大数据应用日志采集之Scribe 安装配置指南 1.概述 Scribe是Facebook开源的日志收集系统,在Facebook内部已经得到大量的应用.它 ...
随机推荐
- 基于HTML5 的互联网+地铁行业
前言 近几年,互联网与交通运输的融合,改变了交易模式,影响着运输组织和经营方式,改变了运输主体的市场结构.模糊了运营与非营运的界限,也更好的实现了交通资源的集约共享,同时使得更多依靠外力和企业推动交通 ...
- 深度理解 React Suspense(附源码解析)
本文介绍与 Suspense 在三种情景下使用方法,并结合源码进行相应解析.欢迎关注个人博客. Code Spliting 在 16.6 版本之前,code-spliting 通常是由第三方库来完成的 ...
- codeforces#1011C. Fly (二分,注意精度)
题意:火箭经过1到n号星球,并回到1号星球,现在给出每消耗一砘燃油能带起的火箭质量a[i]和b[i],a[i]代表在第i个星球起飞,b[i]代表在第i个星球降落.求出最少消耗的汽油.保证:如果不能完成 ...
- Java泛型中<? extends E>和<? super E>的区别
这篇文章谈一谈Java泛型声明<? extends E>和<? super E>的作用和区别 <? extends E> <? extends E> 是 ...
- [模板] 匈牙利算法&&二分图最小字典序匹配
匈牙利算法 简介 匈牙利算法是一种求二分图最大匹配的算法. 时间复杂度: 邻接表/前向星: \(O(n * m)\), 邻接矩阵: \(O(n^3)\). 空间复杂度: 邻接表/前向星: \(O(n ...
- Python future使用
Python的每个新版本都会增加一些新的功能,或者对原来的功能作一些改动.有些改动是不兼容旧版本的,也就是在当前版本运行正常的代码,到下一个版本运行就可能不正常了. 从Python 2.7到Pytho ...
- P3389 【模板】高斯消元法
高斯消元求解n元一次线性方程组的板子题: 先举个栗子: • 2x + y - z = 8-----------① •-3x - y + 2z = -11---------② •-2x + y + ...
- C++ bitset 常用函数及运算符
C++ bitset--高端压位卡常题必备STL 以下内容翻译自cplusplus.com,极大地锻炼了我的英语能力. bitset存储二进制数位. bitset就像一个bool类型的数组一样,但是有 ...
- Window下Eclipse+Tomcat远程调试
需求: 项目在开发环境跑得好好的,但是当发布到服务器上时,却出现了一些意外的问题.服务器上不可能给你装IDE调试工具啊,又没有很好的日志帮助下,这时候就用到了JVM的Java Platfo ...
- PowerDesigner生成pdm(适用Mysql)
废话不多说,直接开始: 1.首先安装所需要的驱动以及应用程序 ①和② 是 Mysql数据库连接驱动 ,根据PowerDesigner的位数来选择下载 下载地址:https://dev.mysql.co ...