Elasticsearch-->Get Started--> Exploring Your Data
Exploring Your Data
Sample Dataset
Now that we’ve gotten a glimpse of the basics, let’s try to work on a more realistic dataset.
I’ve prepared a sample of fictitious JSON documents of customer bank account information.
Each document has the following schema:
- {
- "account_number": 0,
- "balance": 16623,
- "firstname": "Bradshaw",
- "lastname": "Mckenzie",
- "age": 29,
- "gender": "F",
- "address": "244 Columbus Place",
- "employer": "Euron",
- "email": "bradshawmckenzie@euron.com",
- "city": "Hobucken",
- "state": "CO"
- }
For the curious, this data was generated using www.json-generator.com/, so please ignore the actual values and semantics of the data, as these are all randomly generated.
Loading the Sample Dataset
You can download the sample dataset (accounts.json) from here.
Extract it to our current directory and let’s load it into our cluster as follows:
- curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"
- curl "localhost:9200/_cat/indices?v"
And the response:
- health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
- yellow open bank l7sSYV2cQXmu6_4rJWVIww 5 1 1000 0 128.6kb 128.6kb
Which means that we just successfully bulk indexed 1000 documents into the bank index (under the _doc type).
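The body sent to the _bulk endpoint is newline-delimited JSON: an action line followed by a source line for each document, with a trailing newline at the end. The sketch below shows how such a body could be assembled in Python. The helper name build_bulk_body is hypothetical, using account_number as the _id is an assumption made for illustration, and the code only builds the string without contacting a cluster.

```python
import json

def build_bulk_body(docs):
    """Build a newline-delimited _bulk body: one action line, then one source line per document."""
    lines = []
    for doc in docs:
        # Action line: index the document; using account_number as _id is an assumption of this sketch.
        lines.append(json.dumps({"index": {"_id": str(doc["account_number"])}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline character.
    return "\n".join(lines) + "\n"

# Two documents abbreviated from the samples shown in this article.
sample_docs = [
    {"account_number": 0, "balance": 16623, "firstname": "Bradshaw"},
    {"account_number": 1, "balance": 39225, "firstname": "Amber"},
]

body = build_bulk_body(sample_docs)
print(body)
```

Each pair of lines in the printed body corresponds to one entry in the items array of the bulk response.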
The query result on my own local machine is as follows:
- health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
- yellow open customer p6H8gEOdQAWBuSN2HDEjZA .4kb .4kb
- yellow open bank l45mhl-7QNibqbmbi2Jmbw .6kb .6kb
The response after importing the data (partially shown):
{
"took" : 1306,
"errors" : false,
"items" : [
{
"index" : {
"_index" : "bank",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"forced_refresh" : true,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 201
}
},
{
"index" : {
"_index" : "bank",
"_type" : "_doc",
"_id" : "6",
"_version" : 1,
"result" : "created",
"forced_refresh" : true,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 201
}
},
Viewing the settings
GET /_all/_settings HTTP/1.1
Host: localhost:9200
{
"customer": {
"settings": {
"index": {
"creation_date": "1552893444305",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "p6H8gEOdQAWBuSN2HDEjZA",
"version": {
"created": "6060199"
},
"provided_name": "customer"
}
}
},
"bank": {
"settings": {
"index": {
"creation_date": "1552962656704",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "l45mhl-7QNibqbmbi2Jmbw",
"version": {
"created": "6060199"
},
"provided_name": "bank"
}
}
}
}
The Search API
Now let’s start with some simple searches.
There are two basic ways to run searches:
one is by sending search parameters through the REST request URI
and the other by sending them through the REST request body.
The request body method allows you to be more expressive and also to define your searches in a more readable JSON format.
We’ll try one example of the request URI method but for the remainder of this tutorial, we will exclusively be using the request body method.
The REST API for search is accessible from the _search endpoint.
This example returns all documents in the bank index:
- GET /bank/_search?q=*&sort=account_number:asc&pretty
Let’s first dissect the search call.
We are searching (_search endpoint) in the bank index, and the q=* parameter instructs Elasticsearch to match all documents in the index.
The sort=account_number:asc parameter tells Elasticsearch to sort the results by the account_number field of each document in ascending order.
The pretty parameter, again, just tells Elasticsearch to return pretty-printed JSON results.
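To make the structure of the request URI concrete, here is a small Python sketch that assembles the same URI from its parameters. The host localhost:9200 is taken from the curl commands earlier; writing pretty=true instead of a bare pretty is a choice made for this sketch, and nothing is actually sent.

```python
from urllib.parse import urlencode

# The three parameters discussed above.
params = {
    "q": "*",                      # match all documents
    "sort": "account_number:asc",  # sort ascending by account_number
    "pretty": "true",              # pretty-print the JSON response
}

# safe="*:" keeps the wildcard and the colon readable instead of percent-encoding them.
url = "http://localhost:9200/bank/_search?" + urlencode(params, safe="*:")
print(url)
```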
And the response (partially shown):
{
"took" : 179,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : null,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "0",
"_score" : null,
"_source" : {
"account_number" : 0,
"balance" : 16623,
"firstname" : "Bradshaw",
"lastname" : "Mckenzie",
"age" : 29,
"gender" : "F",
"address" : "244 Columbus Place",
"employer" : "Euron",
"email" : "bradshawmckenzie@euron.com",
"city" : "Hobucken",
"state" : "CO"
},
"sort" : [
0
]
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "1",
"_score" : null,
"_source" : {
"account_number" : 1,
"balance" : 39225,
"firstname" : "Amber",
"lastname" : "Duke",
"age" : 32,
"gender" : "M",
"address" : "880 Holmes Lane",
"employer" : "Pyrami",
"email" : "amberduke@pyrami.com",
"city" : "Brogan",
"state" : "IL"
},
"sort" : [
1
]
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "2",
"_score" : null,
"_source" : {
"account_number" : 2,
"balance" : 28838,
"firstname" : "Roberta",
"lastname" : "Bender",
"age" : 22,
"gender" : "F",
"address" : "560 Kingsway Place",
"employer" : "Chillium",
"email" : "robertabender@chillium.com",
"city" : "Bennett",
"state" : "LA"
},
"sort" : [
2
]
},
As for the response, we see the following parts:
- took: time in milliseconds for Elasticsearch to execute the search
- timed_out: tells us if the search timed out or not
- _shards: tells us how many shards were searched, as well as a count of the successful/failed searched shards
- hits: search results
- hits.total: total number of documents matching our search criteria
- hits.hits: actual array of search results (defaults to first 10 documents)
- hits.sort: sort key for results (missing if sorting by score)
- hits._score and max_score: ignore these fields for now
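As an illustration of how these parts might be read in client code, the following Python sketch walks a trimmed-down copy of the response above (only two hits, with shortened _source). The function name summarize and the keys of the summary it returns are hypothetical.

```python
# A trimmed-down copy of the search response shown above.
response = {
    "took": 179,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": 1000,
        "max_score": None,
        "hits": [
            {"_id": "0", "_score": None,
             "_source": {"account_number": 0, "balance": 16623}, "sort": [0]},
            {"_id": "1", "_score": None,
             "_source": {"account_number": 1, "balance": 39225}, "sort": [1]},
        ],
    },
}

def summarize(resp):
    """Pull the commonly used parts out of a search response dictionary."""
    shards = resp["_shards"]
    hits = resp["hits"]
    return {
        "took_ms": resp["took"],
        "timed_out": resp["timed_out"],
        "all_shards_ok": shards["successful"] == shards["total"] and shards["failed"] == 0,
        "total": hits["total"],  # a plain integer in the response format shown here
        "account_numbers": [h["_source"]["account_number"] for h in hits["hits"]],
        "sort_keys": [h["sort"][0] for h in hits["hits"]],
    }

summary = summarize(response)
print(summary)
```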
Here is the same exact search above using the alternative request body method:
- GET /bank/_search
- {
- "query": { "match_all": {} },
- "sort": [
- { "account_number": "asc" }
- ]
- }
The difference here is that instead of passing q=* in the URI, we provide a JSON-style query request body to the _search API. We’ll discuss this JSON query in the next section.
It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL, wherein you may initially get a partial subset of your query results up front and then have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.
Introducing the Query Language
Elasticsearch provides a JSON-style domain-specific language that you can use to execute queries.
This is referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.
Going back to our last example, we executed this query:
- GET /bank/_search
- {
- "query": { "match_all": {} }
- }
Dissecting the above, the query part tells us what our query definition is and the match_all part is simply the type of query that we want to run.
The match_all query is simply a search for all documents in the specified index.
In addition to the query parameter, we also can pass other parameters to influence the search results.
In the example in the section above we passed in sort, here we pass in size:
- GET /bank/_search
- {
- "query": { "match_all": {} },
- "size": 1
- }
Note that if size is not specified, it defaults to 10.
This example does a match_all and returns documents 10 through 19:
- GET /bank/_search
- {
- "query": { "match_all": {} },
- "from": 10,
- "size": 10
- }
The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter.
This feature is useful when implementing paging of search results. Note that if from is not specified, it defaults to 0.
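The effect of from and size can be pictured as slicing a sorted list. The Python sketch below imitates that behavior locally and builds the request body for a given page. The helper names search_page and page_request are hypothetical, and the list of numbers stands in for the 1000 sorted accounts.

```python
def search_page(sorted_hits, from_=0, size=10):
    """Mimic from/size: skip `from_` hits, then return at most `size` hits."""
    return sorted_hits[from_:from_ + size]

def page_request(page_number, page_size=10):
    """Build a request body for a 0-based page number."""
    return {
        "query": {"match_all": {}},
        "from": page_number * page_size,
        "size": page_size,
    }

# Stand-in for 1000 accounts already sorted by account_number.
account_numbers = list(range(1000))

second_page = search_page(account_numbers, from_=10, size=10)
print(second_page)       # documents 10 through 19
print(page_request(1))   # the body that asks for that same page
```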
This example does a match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents.
- GET /bank/_search
- {
- "query": { "match_all": {} },
- "sort": { "balance": { "order": "desc" } }
- }
Executing Searches
Now that we have seen a few of the basic search parameters, let’s dig in some more into the Query DSL.
Let’s first take a look at the returned document fields.
By default, the full JSON document is returned as part of all searches.
This is referred to as the source (_source field in the search hits).
If we don’t want the entire source document returned, we have the ability to request only a few fields from within source to be returned.
This example shows how to return two fields, account_number and balance (inside of _source), from the search:
- GET /bank/_search
- {
- "query": { "match_all": {} },
- "_source": ["account_number", "balance"]
- }
Note that the above example simply reduces the _source field. It will still only return one field named _source but within it, only the fields account_number and balance are included.
If you come from a SQL background, the above is somewhat similar in concept to the SQL SELECT FROM field list.
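The reduction of _source can be imitated with a dictionary comprehension. This is only a sketch of the idea for exact top-level field names; the function name filter_source is hypothetical, and the real feature accepts more than this simple form.

```python
def filter_source(source, fields):
    """Keep only the requested top-level fields of a source document."""
    return {name: value for name, value in source.items() if name in fields}

# The first sample document from this article.
hit_source = {
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "bradshawmckenzie@euron.com",
    "city": "Hobucken",
    "state": "CO",
}

reduced = filter_source(hit_source, ["account_number", "balance"])
print(reduced)  # {'account_number': 0, 'balance': 16623}
```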
Now let’s move on to the query part. Previously, we’ve seen how the match_all query is used to match all documents.
Let’s now introduce a new query called the match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).
This example returns the account numbered 20:
- GET /bank/_search
- {
- "query": { "match": { "account_number": 20 } }
- }
This example returns all accounts containing the term "mill" in the address:
- GET /bank/_search
- {
- "query": { "match": { "address": "mill" } }
- }
This example returns all accounts containing the term "mill" or "lane" in the address:
- GET /bank/_search
- {
- "query": { "match": { "address": "mill lane" } }
- }
This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address:
- GET /bank/_search
- {
- "query": { "match_phrase": { "address": "mill lane" } }
- }
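The difference between match and match_phrase can be illustrated with a rough local model. The Python sketch below assumes a much simplified analysis step (lowercasing and splitting on whitespace) in place of the real analyzer, ignores scoring, and uses hypothetical function names. The addresses other than the ones shown earlier in this article are made up for illustration.

```python
def tokenize(text):
    """Simplified stand-in for text analysis: lowercase, then split on whitespace."""
    return text.lower().split()

def match(field_value, query):
    """True if the field contains at least one of the query terms."""
    doc_terms = set(tokenize(field_value))
    return any(term in doc_terms for term in tokenize(query))

def match_phrase(field_value, query):
    """True if the query terms appear in the field consecutively and in order."""
    doc_terms = tokenize(field_value)
    query_terms = tokenize(query)
    width = len(query_terms)
    return any(doc_terms[i:i + width] == query_terms
               for i in range(len(doc_terms) - width + 1))

addresses = [
    "244 Columbus Place",  # from the sample document above
    "880 Holmes Lane",     # from the sample response above
    "990 Mill Road",       # made up for this sketch
    "198 Mill Lane",       # made up for this sketch
]

print([a for a in addresses if match(a, "mill lane")])
print([a for a in addresses if match_phrase(a, "mill lane")])
```

In this model the match query keeps three of the addresses, while match_phrase keeps only the one where the two terms are adjacent.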
Let’s now introduce the bool query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.
This example composes two match queries and returns all accounts containing "mill" and "lane" in the address:
- GET /bank/_search
- {
- "query": {
- "bool": {
- "must": [
- { "match": { "address": "mill" } },
- { "match": { "address": "lane" } }
- ]
- }
- }
- }
In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.
In contrast, this example composes two match queries and returns all accounts containing "mill" or "lane" in the address:
- GET /bank/_search
- {
- "query": {
- "bool": {
- "should": [
- { "match": { "address": "mill" } },
- { "match": { "address": "lane" } }
- ]
- }
- }
- }
In the above example, the bool should clause specifies a list of queries, at least one of which must be true for a document to be considered a match.
This example composes two match queries and returns all accounts that contain neither "mill" nor "lane" in the address:
- GET /bank/_search
- {
- "query": {
- "bool": {
- "must_not": [
- { "match": { "address": "mill" } },
- { "match": { "address": "lane" } }
- ]
- }
- }
- }
In the above example, the bool must_not clause specifies a list of queries, none of which may be true for a document to be considered a match.
We can combine must, should, and must_not clauses simultaneously inside a bool query.
Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.
This example returns all accounts of anybody who is 40 years old but doesn’t live in ID (Idaho):
- GET /bank/_search
- {
- "query": {
- "bool": {
- "must": [
- { "match": { "age": "40" } }
- ],
- "must_not": [
- { "match": { "state": "ID" } }
- ]
- }
- }
- }
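The way the bool clauses combine can be sketched as plain boolean logic over predicates. The Python model below is a simplification that ignores scoring entirely; the function name bool_matches, the predicates, and the three accounts are all made up for illustration.

```python
def bool_matches(doc, must=(), should=(), must_not=()):
    """Evaluate a simplified bool query; each clause is a predicate taking a document."""
    # Every must clause has to match.
    if not all(clause(doc) for clause in must):
        return False
    # No must_not clause may match.
    if any(clause(doc) for clause in must_not):
        return False
    # With no must clause present, at least one should clause has to match.
    if should and not must:
        return any(clause(doc) for clause in should)
    return True

def is_40(doc):
    return doc["age"] == 40

def in_idaho(doc):
    return doc["state"] == "ID"

# Hypothetical accounts, reduced to the two fields the query looks at.
accounts = [
    {"account_number": 1, "age": 40, "state": "ID"},
    {"account_number": 2, "age": 40, "state": "TX"},
    {"account_number": 3, "age": 29, "state": "CO"},
]

result = [a["account_number"] for a in accounts
          if bool_matches(a, must=[is_40], must_not=[in_idaho])]
print(result)  # [2]
```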
Executing Filters
In the previous section, we skipped over a little detail called the document score (_score field in the search results).
The score is a numeric value that is a relative measure of how well the document matches the search query that we specified.
The higher the score, the more relevant the document; the lower the score, the less relevant the document.
But queries do not always need to produce scores, in particular when they are only used for "filtering" the document set.
Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.
The bool query that we introduced in the previous section also supports filter clauses which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed.
As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.
This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.
- GET /bank/_search
- {
- "query": {
- "bool": {
- "must": { "match_all": {} },
- "filter": {
- "range": {
- "balance": {
- "gte": 20000,
- "lte": 30000
- }
- }
- }
- }
- }
- }
Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part).
We can substitute any other queries into the query and the filter parts.
In the above case, the range query makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.
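The inclusive bounds of the range filter can be checked with a small local model. In this Python sketch the function name in_range is hypothetical, and the two accounts sitting exactly on the boundaries are made up to show that gte and lte include their endpoints.

```python
def in_range(value, gte=None, lte=None):
    """Return True when value satisfies the given inclusive bounds."""
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True

accounts = [
    {"account_number": 0, "balance": 16623},    # from the sample response
    {"account_number": 1, "balance": 39225},    # from the sample response
    {"account_number": 2, "balance": 28838},    # from the sample response
    {"account_number": 900, "balance": 20000},  # hypothetical, on the lower bound
    {"account_number": 901, "balance": 30000},  # hypothetical, on the upper bound
]

matching = [a["account_number"] for a in accounts
            if in_range(a["balance"], gte=20000, lte=30000)]
print(matching)  # [2, 900, 901]
```

A document either passes the filter or it does not, which is why no score is needed for this part of the query.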
In addition to the match_all, match, bool, and range queries, there are a lot of other query types that are available and we won’t go into them here.
Since we already have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in learning and experimenting with the other query types.
Executing Aggregations
Aggregations provide the ability to group and extract statistics from your data.
The easiest way to think about aggregations is by roughly equating them to the SQL GROUP BY and the SQL aggregate functions.
In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response.
This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.
To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default):
- GET /bank/_search
- {
- "size": 0,
- "aggs": {
- "group_by_state": {
- "terms": {
- "field": "state.keyword"
- }
- }
- }
- }
In SQL, the above aggregation is similar in concept to:
- SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10;
And the response (partially shown):
- {
- "took": 29,
- "timed_out": false,
- "_shards": {
- "total": 5,
- "successful": 5,
- "skipped" : 0,
- "failed": 0
- },
- "hits" : {
- "total" : 1000,
- "max_score" : 0.0,
- "hits" : [ ]
- },
- "aggregations" : {
- "group_by_state" : {
- "doc_count_error_upper_bound": 20,
- "sum_other_doc_count": 770,
- "buckets" : [ {
- "key" : "ID",
- "doc_count" : 27
- }, {
- "key" : "TX",
- "doc_count" : 27
- }, {
- "key" : "AL",
- "doc_count" : 25
- }, {
- "key" : "MD",
- "doc_count" : 25
- }, {
- "key" : "TN",
- "doc_count" : 23
- }, {
- "key" : "MA",
- "doc_count" : 21
- }, {
- "key" : "NC",
- "doc_count" : 21
- }, {
- "key" : "ND",
- "doc_count" : 21
- }, {
- "key" : "ME",
- "doc_count" : 20
- }, {
- "key" : "MO",
- "doc_count" : 20
- } ]
- }
- }
- }
We can see that there are 27 accounts in ID (Idaho), followed by 27 accounts in TX (Texas), followed by 25 accounts in AL (Alabama), and so forth.
Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.
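The grouping performed by this aggregation can be imitated locally with a counter. The Python sketch below is a rough model, not how Elasticsearch computes it across shards; the function name terms_agg is hypothetical, the input documents are made up, and breaking ties by key is chosen to mirror the order seen in the response above.

```python
from collections import Counter

def terms_agg(docs, field, size=10):
    """Group documents by a field and return the top buckets by document count."""
    counts = Counter(doc[field] for doc in docs)
    # Sort by count descending, then by key ascending for ties.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]

# Hypothetical documents reduced to the state field.
docs = [{"state": s} for s in ["TX", "ID", "AL", "ID", "TX", "CO", "AL", "ID", "TX"]]

buckets = terms_agg(docs, "state")
print(buckets)
```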
Building on the previous aggregation, this example calculates the average account balance by state (again only for the top 10 states sorted by count in descending order):
- GET /bank/_search
- {
- "size": 0,
- "aggs": {
- "group_by_state": {
- "terms": {
- "field": "state.keyword"
- },
- "aggs": {
- "average_balance": {
- "avg": {
- "field": "balance"
- }
- }
- }
- }
- }
- }
Notice how we nested the average_balance aggregation inside the group_by_state aggregation.
This is a common pattern for all the aggregations.
You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.
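Nesting a metric inside a bucket aggregation amounts to computing that metric once per group. A minimal Python sketch of the idea, with a hypothetical function name and made-up balances:

```python
from collections import defaultdict

def average_by(docs, group_field, value_field):
    """Compute the average of value_field for each distinct value of group_field."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[value_field])
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Hypothetical accounts reduced to state and balance.
docs = [
    {"state": "ID", "balance": 10000},
    {"state": "ID", "balance": 20000},
    {"state": "TX", "balance": 30000},
]

averages = average_by(docs, "state", "balance")
print(averages)  # {'ID': 15000.0, 'TX': 30000.0}
```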
Building on the previous aggregation, let’s now sort on the average balance in descending order:
- GET /bank/_search
- {
- "size": 0,
- "aggs": {
- "group_by_state": {
- "terms": {
- "field": "state.keyword",
- "order": {
- "average_balance": "desc"
- }
- },
- "aggs": {
- "average_balance": {
- "avg": {
- "field": "balance"
- }
- }
- }
- }
- }
- }
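Ordering buckets by a nested metric is, conceptually, a sort on the computed value. In the Python sketch below the function name order_buckets is hypothetical and the average_balance figures are invented for illustration; only the keys and doc_count values come from the response shown earlier.

```python
def order_buckets(buckets, by, direction="desc"):
    """Sort bucket dictionaries by one of their values."""
    return sorted(buckets, key=lambda bucket: bucket[by], reverse=(direction == "desc"))

# doc_count values are from the earlier response; the averages are invented.
buckets = [
    {"key": "ID", "doc_count": 27, "average_balance": 24000.0},
    {"key": "TX", "doc_count": 27, "average_balance": 27000.0},
    {"key": "AL", "doc_count": 25, "average_balance": 25500.0},
]

by_average = order_buckets(buckets, by="average_balance", direction="desc")
print([bucket["key"] for bucket in by_average])  # ['TX', 'AL', 'ID']
```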
This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender:
- GET /bank/_search
- {
- "size": 0,
- "aggs": {
- "group_by_age": {
- "range": {
- "field": "age",
- "ranges": [
- {
- "from": 20,
- "to": 30
- },
- {
- "from": 30,
- "to": 40
- },
- {
- "from": 40,
- "to": 50
- }
- ]
- },
- "aggs": {
- "group_by_gender": {
- "terms": {
- "field": "gender.keyword"
- },
- "aggs": {
- "average_balance": {
- "avg": {
- "field": "balance"
- }
- }
- }
- }
- }
- }
- }
- }
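The three levels of this aggregation (age bracket, then gender, then average balance) can be modeled locally as follows. In a range aggregation the from value is inclusive and the to value is exclusive, which is why 20 to 30 covers ages 20-29. The function names are hypothetical; the first three accounts use values from the sample response, and the fourth is made up to show the boundary at age 30.

```python
from collections import defaultdict

AGE_RANGES = [(20, 30), (30, 40), (40, 50)]

def age_bucket(age, ranges=AGE_RANGES):
    """Return the (from, to) range containing age; from is inclusive, to is exclusive."""
    for low, high in ranges:
        if low <= age < high:
            return (low, high)
    return None  # ages outside every range fall into no bucket

def average_balance_by_age_and_gender(docs):
    """Group by age bracket, then by gender, and average the balances in each group."""
    balances = defaultdict(list)
    for doc in docs:
        bucket = age_bucket(doc["age"])
        if bucket is not None:
            balances[(bucket, doc["gender"])].append(doc["balance"])
    return {key: sum(values) / len(values) for key, values in balances.items()}

accounts = [
    {"age": 29, "gender": "F", "balance": 16623},  # from the sample response
    {"age": 32, "gender": "M", "balance": 39225},  # from the sample response
    {"age": 22, "gender": "F", "balance": 28838},  # from the sample response
    {"age": 30, "gender": "M", "balance": 1000},   # hypothetical boundary case
]

result = average_balance_by_age_and_gender(accounts)
print(result)
```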
There are many other aggregation capabilities that we won’t go into in detail here.
The aggregations reference guide is a great starting point if you want to do further experimentation.