Elasticsearch 第六篇：聚合统计查询

h2.post_title { background-color: rgba(43, 102, 149, 1); color: rgba(255, 255, 255, 1); font-size: 18px !important; font-weight: bold !important; line-height: 38px !important; padding-left: 10px; border-radius: 4px }

前面一直没有记录 Elasticsearch 的聚合查询或者其它复杂的查询。本篇做一下笔记，为了方便测试，索引数据依然是第五篇生成的测试索引库 db_student_test ，别名是 student_test

第一部分基本聚合

1、最大值 max、最小值 min、平均值 avg 、总和 sum

场景：查询语文、数学、英语这三科的最大值、最小值、平均值

POST  http://localhost:9200/student_test1/_search?size=0

{

    "aggs" : {

        "max_chinese" : { "max" : { "field" : "chinese" } },

        "min_chinese" : { "min" : { "field" : "chinese" } },

        "avg_chinese" : { "avg" : { "field" : "chinese" } },

        "max_math": { "max" : { "field" : "math" } },

        "min_math": { "min" : { "field" : "math" } },

        "avg_math": { "avg" : { "field" : "math" } },

        "max_english": { "max" : { "field" : "english" } },

        "min_english": { "min" : { "field" : "english" } },

        "avg_english": { "avg" : { "field" : "english" } }

    }

}

查询结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "avg_english": {

            "value": 57.78366490546798

        },

        "max_chinese": {

            "value": 98

        },

        "min_chinese": {

            "value": 25

        },

        "min_math": {

            "value": 15

        },

        "max_english": {

            "value": 98

        },

        "avg_chinese": {

            "value": 59.353859695794505

        },

        "avg_math": {

            "value": 56.92907568735187

        },

        "min_english": {

            "value": 21

        },

        "max_math": {

            "value": 99

        }

    }

}

也可以来查询语文科目分数总和，相当于 sql 的 sum 逻辑，虽然在这里并没有什么意义：

POST  http://localhost:9200/student_test1/_search?size=0

{

    "aggs" : {

        "sum_chinese" : { "sum" : { "field" : "chinese" } }

    }

}

2、求个数，相当于 sql 的 count 逻辑

场景：查询所有学生总数，这里随便 count 一个字段就可以，例如数学这个字段

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs": {

    "age_count": {

      "value_count": {

        "field": "math"

      }

    }

  }

}

返回结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "age_count": {

            "value": 50084828

        }

    }

}

课间总数是：50084828 跟第五篇我们生成的数据总量一致

3、distinct 聚合，相当于 sql 的 count ( distinct )

场景：统计语文成绩有多少种值

POST  http://localhost:9200/student_test1/_search?size=0

{

    "aggs" : {

        "type_count" : {

            "cardinality" : {

                "field" : "chinese"

            }

        }

    }

}

返回结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "type_count": {

            "value": 74

        }

    }

}

从结果上看，只有74个不同的分数，与第五篇随机生成数据的规则匹配

4、统计聚合

场景：查询语文成绩总个数、最大值、最小值、平均值、总和等

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs": {

    "chinese_stats": {

      "stats": {

        "field": "chinese"

      }

    }

  }

}

返回结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "chinese_stats": {

            "count": 50084828,

            "min": 25,

            "max": 98,

            "avg": 59.353859695794505,

            "sum": 2972727854

        }

    }

}

5、加强版统计聚合，查询结果在上面的基础上，加上方差等统计学上的数据

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs": {

    "chinese_stats": {

      "extended_stats": {

        "field": "chinese"

      }

    }

  }

}

6、分位聚合统计

默认的分位是 1% 5% 25% 50% 75% 95% 99% 《= 的概念

分位数的概念：25% 的分位数是 54，意思是小于等于 54 的样本占据了总样本的 25% ，即是 54 这个数将最底层的1/4 的数据分割出来。

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs": {

    "chinese_percents": {

      "percentiles": {

        "field": "chinese"

      }

    }

  }

}

也可以自定义分位：

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs": {

    "chinese_percents": {

      "percentiles": {

        "field": "chinese",

        "percents" : [10,20,30,40,50,60,70,80,90]

      }

    }

  }

}

7、范围聚合统计

场景：分别查询语文成绩小于40分、小于50分、小于60分的比例

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs": {

    "gge_perc_rank": {

      "percentile_ranks": {

        "field": "chinese",

        "values": [40,50,60]

      }

    }

  }

}

以上是查询成绩小于40，小于50，小于60的占比，得到的数据是： 21.29% 36.09% 51.12% 可以看到这是一个接近等差的数列，可见测试数据的随机性还是很好的。

第二部分其它聚合方式

1、Term 聚合

场景：想知道学生的语文成绩，在所有分数值上的个数

POST  http://localhost:9200/student_test1/_search?size=0

{

    "aggs" : {

        "genres" : {

            "terms" : {

                "field" : "chinese"

            }

        }

    }

}

这个查询会将字段Chinese进行聚合，例如87分聚合成一个组，88分聚合成一个组，等等；

但是这里默认是按组的大小排序，而且不会将所有的组都显示出来，数量太小的组可能被忽略，查询结果如下：

{

    "took": 1,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "genres": {

            "doc_count_error_upper_bound": 0,

            "sum_other_doc_count": 42560269,

            "buckets": [

                {

                    "key": 61,

                    "doc_count": 752863

                },

                {

                    "key": 68,

                    "doc_count": 752835

                },

                {

                    "key": 55,

                    "doc_count": 752749

                },

                {

                    "key": 59,

                    "doc_count": 752444

                },

                {

                    "key": 76,

                    "doc_count": 752405

                },

                {

                    "key": 74,

                    "doc_count": 752309

                },

                {

                    "key": 56,

                    "doc_count": 752283

                },

                {

                    "key": 49,

                    "doc_count": 752273

                },

                {

                    "key": 52,

                    "doc_count": 752201

                },

                {

                    "key": 50,

                    "doc_count": 752197

                }

            ]

        }

    }

}

如果想要自定义筛选条件，Term聚合还可以按照以下设定来查询：

post  http://localhost:9200/student_test1/_search?size=0

{

    "aggs" : {

        "genres" : {

            "terms" : {

                "field" : "chinese",

                 "size" : 100,                     // 可能有100个不用的分数，我们将全部都展示出来

                 "order" : { "_count" : "asc" },   // 按照组数由小到大排序

                  "min_doc_count": 752200          //过滤条件：组数最小值是752200

            }

        }

    }

}

查询结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "genres": {

            "doc_count_error_upper_bound": 0,

            "sum_other_doc_count": 0,

            "buckets": [

                {

                    "key": 52,

                    "doc_count": 752201

                },

                {

                    "key": 49,

                    "doc_count": 752273

                },

                {

                    "key": 56,

                    "doc_count": 752283

                },

                {

                    "key": 74,

                    "doc_count": 752309

                },

                {

                    "key": 76,

                    "doc_count": 752405

                },

                {

                    "key": 59,

                    "doc_count": 752444

                },

                {

                    "key": 55,

                    "doc_count": 752749

                },

                {

                    "key": 68,

                    "doc_count": 752835

                },

                {

                    "key": 61,

                    "doc_count": 752863

                }

            ]

        }

    }

}

2、Filter 聚合

Filter 聚合会先进行条件过滤，在进行聚合

场景：查询华南理工大学的学生的数学科目平均分（先筛选学校，再进行分数统计聚合）

{

    "aggs" : {

        "scut_math_avg" : {

            "filter" : { "term": { "school": "华南理工大学" } },

            "aggs" : {

                "avg_price" : { "avg" : { "field" : "math" } }

            }

        }

    }

}

查询结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "scut_math_avg": {

            "doc_count": 1854993,

            "avg_price": {

                "value": 56.93080027795253

            }

        }

    }

}

3、Filters 多重聚合

场景：查询各个学校，语文、数学、英语的平均分都是多少，可以采用多重聚合，速度可能有点慢，如下

POST  http://localhost:9200/student_test1/_search?size=0

{

  "aggs" : {

    "messages" : {

      "filters" : {

        "filters" : {

          "school_1" :   { "term" : { "school" : "华南理工大学" }},

          "school_2" : { "term" : { "school" : "中山大学" }},

          "school_3" : { "match" : { "school" : "暨南大学" }}

        }

      },

      "aggs" : {

           "avg_chinese" : { "avg" : { "field" : "chinese" } },

           "avg_math" : { "avg" : { "field" : "math" } }

      }

    }

  }

}

于是得到结果：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "messages": {

            "buckets": {

                "school_1": {

                    "doc_count": 1854993,

                    "avg_chinese": {

                        "value": 59.353236912484306

                    },

                    "avg_math": {

                        "value": 56.93080027795253

                    }

                },

                "school_2": {

                    "doc_count": 1855016,

                    "avg_chinese": {

                        "value": 59.349129064115886

                    },

                    "avg_math": {

                        "value": 56.93540918245449

                    }

                },

                "school_3": {

                    "doc_count": 44519876,

                    "avg_chinese": {

                        "value": 59.35397212247402

                    },

                    "avg_math": {

                        "value": 56.92948502372289

                    }

                }

            }

        }

    }

}

4、Range 范围聚合

场景：想要查询语文成绩各个分数段的人数，可以这样查询

POST  http://localhost:9200/student_test1/_search?size=0
{

    "aggs" : {

        "chinese_ranges" : {

            "range" : {

                "field" : "chinese",

                "ranges" : [

                    { "to" : 60 },

                    { "from" : 60, "to" : 75 },

                    { "from" : 75, "to" : 85 },

                    { "from" : 85 }

                ]

            }

        }

    }

}

查询结果是：

{

    "took": 0,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "chinese_ranges": {

            "buckets": [

                {

                    "key": "*-60.0",

                    "to": 60,

                    "doc_count": 25096839

                },

                {

                    "key": "60.0-75.0",

                    "from": 60,

                    "to": 75,

                    "doc_count": 11278543

                },

                {

                    "key": "75.0-85.0",

                    "from": 75,

                    "to": 85,

                    "doc_count": 7424634

                },

                {

                    "key": "85.0-*",

                    "from": 85,

                    "doc_count": 6284812

                }

            ]

        }

    }

}

这个返回结果的组名分别是 *-60.0 60.0-75.0 75.0-85.0 85.0-*
如果我们不想要这样的组名，可以自定义组名，例如：

POST  http://localhost:9200/student_test1/_search?size=0

{

    "aggs" : {

        "chinese_ranges" : {

            "range" : {

                "field" : "chinese",

                "keyed" : true,

                "ranges" : [

                    { "key" : "不及格", "to" : 60 },

                    { "key" : "及格", "from" : 60, "to" : 75 },

                    { "key" : "良好", "from" : 75, "to" : 85 },

                    { "key" : "优秀", "from" : 85 }

                ]

            }

        }

    }

}

查询结果将会是：

{

    "took": 1675,

    "timed_out": false,

    "_shards": {

        "total": 1,

        "successful": 1,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 10000,

            "relation": "gte"

        },

        "max_score": null,

        "hits": []

    },

    "aggregations": {

        "chinese_ranges": {

            "buckets": {

                "不及格": {

                    "to": 60,

                    "doc_count": 25096839

                },

                "及格": {

                    "from": 60,

                    "to": 75,

                    "doc_count": 11278543

                },

                "良好": {

                    "from": 75,

                    "to": 85,

                    "doc_count": 7424634

                },

                "优秀": {

                    "from": 85,

                    "doc_count": 6284812

                }

            }

        }

    }

}

还有其它各种各样的、复杂的聚合查询，都是可以网上查资料，甚至还支持推荐系统的一些计算方法，例如矩阵的概念等等。

还可以参考 https://blog.csdn.net/alex_xfboy/article/details/86100037