Using ES to segment Chinese text and rank words by frequency

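Preface: the requirement is to take an article of about 10,000 characters and find out which words occur most often. The key problem is how to segment a passage into words. Take “北京是中华人民共和国的首都” (“Beijing is the capital of the People's Republic of China”): “北京”, “中华人民共和国”, “中华”, “华人”, “人民”, “共和国” and “首都” are words and should be extracted, whereas “京是” and “民共” are not meaningful words and must not be. Writing such segmentation rules by hand is tedious; the open-source IK analyzer makes it easy, and its analysis mode determines the granularity of the segmentation.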

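ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination;

ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.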
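To make the difference between the two modes concrete, here is a toy sketch in Python. It is not IK's algorithm: the tiny DICTIONARY and the function names finest_split and coarsest_split are assumptions made for illustration, and the real analyzer uses a much larger dictionary plus extra rules (it also emits single characters, for instance).

```python
# Toy dictionary, invented for this illustration only.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}
MAX_LEN = max(len(word) for word in DICTIONARY)


def finest_split(text):
    """Return every dictionary word found at every position (overlaps allowed)."""
    tokens = []
    for start in range(len(text)):
        last_end = min(len(text), start + MAX_LEN)
        for end in range(start + 1, last_end + 1):
            if text[start:end] in DICTIONARY:
                tokens.append(text[start:end])
    return tokens


def coarsest_split(text):
    """Greedy longest match from the left; unknown characters become single tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        for end in range(min(len(text), pos + MAX_LEN), pos, -1):
            if text[pos:end] in DICTIONARY or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens


if __name__ == "__main__":
    sample = "中华人民共和国国歌"
    print(finest_split(sample))    # many overlapping words
    print(coarsest_split(sample))  # ['中华人民共和国', '国歌']
```

The first function mirrors the idea behind ik_max_word (keep every plausible word), the second the idea behind ik_smart (keep the fewest, longest words).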

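Part 1: Prepare the environment

If you already have an ES environment, skip the first two steps. Here I assume you only have a freshly installed CentOS 6.x machine, so that you can run through the whole process.

(1) Install the JDK.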

$ wget http://download.oracle.com/otn-pub/java/jdk/8u111-b14/jdk-8u111-linux-x64.rpm

$ rpm -ivh jdk-8u111-linux-x64.rpm

(2) Install ES

$ wget  https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.4.2/elasticsearch-2.4.2.rpm

$ rpm -iv elasticsearch-2.4.2.rpm

(3) Install the IK analyzer

Download version 1.10.2 of the IK analyzer from GitHub. Note: the ES version here is 2.4.2, and the compatible IK version is 1.10.2.

$ mkdir /usr/share/elasticsearch/plugins/ik

$ wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.2/elasticsearch-analysis-ik-1.10.2.zip

$ unzip elasticsearch-analysis-ik-1.10.2.zip -d /usr/share/elasticsearch/plugins/ik

(4) Configure ES

$ vim /etc/elasticsearch/elasticsearch.yml

###### Cluster ######

cluster.name: test

###### Node ######

node.name: test-10.10.10.10

node.master: true

node.data: true

###### Index ######

index.number_of_shards: 5

index.number_of_replicas: 0

###### Path ######

path.data: /data/elk/es

path.logs: /var/log/elasticsearch

path.plugins: /usr/share/elasticsearch/plugins

###### Refresh ######

index.refresh_interval: 5s

###### Memory ######

bootstrap.mlockall: true

###### Network ######

network.publish_host: 10.10.10.10

network.bind_host: 0.0.0.0

transport.tcp.port: 9300

###### Http ######

http.enabled: true

http.port: 9200

###### IK ########

index.analysis.analyzer.ik.alias: [ik_analyzer]

index.analysis.analyzer.ik.type: ik

index.analysis.analyzer.ik_max_word.type: ik

index.analysis.analyzer.ik_max_word.use_smart: false

index.analysis.analyzer.ik_smart.type: ik

index.analysis.analyzer.ik_smart.use_smart: true

index.analysis.analyzer.default.type: ik

(5) Start ES

$ /etc/init.d/elasticsearch start

(6) Check the ES node status

$ curl 'localhost:9200/_cat/nodes?v'    # one node is listed and healthy

host         ip           heap.percent ram.percent load node.role master name

10.10.10.10 10.10.10.10           16          52 0.00 d         *      test-10.10.10.10

$ curl 'localhost:9200/_cat/health?v'   # cluster status is green

epoch      timestamp cluster            status node.total node.data shards pri relo init

1483672233 11:10:33  test               green           1         1     0   0    0    0
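If you want to script this check instead of reading the table by eye, the following Python sketch parses the output of a _cat API called with ?v. The function names are my own, and the URL in fetch_cat_health assumes ES listens on localhost:9200 as configured above. Splitting on whitespace only works while no column is empty or contains spaces, which holds for the sample output shown here.

```python
from urllib.request import urlopen


def parse_cat_table(text):
    """Turn the header row plus data rows of a _cat response into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def cluster_is_green(cat_health_text):
    """True when at least one row exists and every row reports status green."""
    rows = parse_cat_table(cat_health_text)
    return bool(rows) and all(row.get("status") == "green" for row in rows)


def fetch_cat_health(base_url="http://localhost:9200"):
    """Fetch the health table from a running node (assumed address)."""
    with urlopen(base_url + "/_cat/health?v") as response:
        return response.read().decode("utf-8")


# Sample copied from the output above, so the parser can be tried offline.
SAMPLE = """\
epoch      timestamp cluster            status node.total node.data shards pri relo init
1483672233 11:10:33  test               green           1         1     0   0    0    0
"""

if __name__ == "__main__":
    print(parse_cat_table(SAMPLE)[0]["status"])
    print(cluster_is_green(SAMPLE))
```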

Part 2: Verify the analyzer

(1) Create a test index

$ curl -XPUT http://localhost:9200/test

(2) Create the mapping

$ curl -XPOST http://localhost:9200/test/fulltext/_mapping -d'

  {

      "fulltext": {

               "_all": {

              "analyzer": "ik"

          },

          "properties": {

              "content": {

                  "type" : "string",

                  "boost" : 8.0,

                  "term_vector" : "with_positions_offsets",

                  "analyzer" : "ik",

                  "include_in_all" : true

              }

          }

      }

  }'

(3) Test with sample text

$ curl 'http://localhost:9200/test/_analyze?analyzer=ik&pretty=true' -d '{ "text":"美国留给伊拉克的是个烂摊子吗" }'

Response:

{

  "tokens" : [ {

    "token" : "美国",

    "start_offset" : 0,

    "end_offset" : 2,

    "type" : "CN_WORD",

    "position" : 0

  }, {

    "token" : "留给",

    "start_offset" : 2,

    "end_offset" : 4,

    "type" : "CN_WORD",

    "position" : 1

  }, {

    "token" : "伊拉克",

    "start_offset" : 4,

    "end_offset" : 7,

    "type" : "CN_WORD",

    "position" : 2

  }, {

    "token" : "伊",

    "start_offset" : 4,

    "end_offset" : 5,

    "type" : "CN_WORD",

    "position" : 3

  }, {

    "token" : "拉",

    "start_offset" : 5,

    "end_offset" : 6,

    "type" : "CN_CHAR",

    "position" : 4

  }, {

    "token" : "克",

    "start_offset" : 6,

    "end_offset" : 7,

    "type" : "CN_WORD",

    "position" : 5

  }, {

    "token" : "个",

    "start_offset" : 9,

    "end_offset" : 10,

    "type" : "CN_CHAR",

    "position" : 6

  }, {

    "token" : "烂摊子",

    "start_offset" : 10,

    "end_offset" : 13,

    "type" : "CN_WORD",

    "position" : 7

  }, {

    "token" : "摊子",

    "start_offset" : 11,

    "end_offset" : 13,

    "type" : "CN_WORD",

    "position" : 8

  }, {

    "token" : "摊",

    "start_offset" : 11,

    "end_offset" : 12,

    "type" : "CN_WORD",

    "position" : 9

  }, {

    "token" : "子",

    "start_offset" : 12,

    "end_offset" : 13,

    "type" : "CN_CHAR",

    "position" : 10

  }, {

    "token" : "吗",

    "start_offset" : 13,

    "end_offset" : 14,

    "type" : "CN_CHAR",

    "position" : 11

  } ]

}
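The response is easier to inspect once the token strings are pulled out of the JSON. Below is a small Python sketch for that; the helper names are hypothetical, and the analyze function assumes the test index and the ik analyzer set up above on localhost:9200. The offline sample is an abridged copy of the response shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def token_texts(analyze_response):
    """Return the token strings of an _analyze response, in order."""
    return [item["token"] for item in analyze_response.get("tokens", [])]


def multi_char_tokens(analyze_response):
    """Keep only tokens longer than one character, which are usually real words."""
    return [token for token in token_texts(analyze_response) if len(token) > 1]


def analyze(text, base_url="http://localhost:9200", index="test", analyzer="ik"):
    """Send text to the _analyze endpoint of a running node (assumed address)."""
    url = "%s/%s/_analyze?%s" % (base_url, index, urlencode({"analyzer": analyzer}))
    request = Request(url, data=json.dumps({"text": text}).encode("utf-8"))
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Abridged copy of the response above, for trying the helpers offline.
SAMPLE = json.loads("""
{"tokens": [
  {"token": "美国", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
  {"token": "留给", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1},
  {"token": "伊拉克", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 2},
  {"token": "伊", "start_offset": 4, "end_offset": 5, "type": "CN_WORD", "position": 3}
]}
""")

if __name__ == "__main__":
    print(multi_char_tokens(SAMPLE))
```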

Part 3: Import the real data

(1) Upload the Chinese text file to the Linux machine.

$ cat /tmp/zhongwen.txt

京津冀重污染天气持续 督查发现有企业恶意生产

《孤芳不自赏》被指“抠像演戏” 制片人：特效不到位

奥巴马不顾特朗普反对坚持外迁关塔那摩监狱囚犯

.

.

.

.

韩媒：日本叫停韩日货币互换磋商 韩财政部表遗憾

中国百万年薪须交40多万个税 精英无奈出国发展

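Note: make sure the text file is encoded as UTF-8; otherwise the text will be garbled once it reaches ES.

$ vim /tmp/zhongwen.txt

In command mode, enter :set fileencoding and Vim displays fileencoding=utf-8.

If it shows fileencoding=utf-16le instead, enter :set fileencoding=utf-8 and save the file.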
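The same check can be done without opening an editor. This Python sketch rewrites a file as UTF-8 when its bytes are not valid UTF-8. The function name ensure_utf8 and the utf-16 fallback are assumptions; the fallback expects a byte-order mark at the start of the file, so pass a different fallback_encoding if your file lacks one.

```python
from pathlib import Path


def ensure_utf8(path, fallback_encoding="utf-16"):
    """Rewrite the file as UTF-8 if needed; return True when a rewrite happened."""
    file_path = Path(path)
    raw = file_path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, leave the file untouched
    except UnicodeDecodeError:
        text = raw.decode(fallback_encoding)
        file_path.write_bytes(text.encode("utf-8"))
        return True


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "converted" if ensure_utf8(name) else "already UTF-8")
```

Saved under a name of your choice, the script can be run with /tmp/zhongwen.txt as its argument.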

(2) Create the index and the mapping

Create the index

$ curl -XPUT http://localhost:9200/index

Create the mapping. This sets the analyzer and the fielddata options for message, the field to be segmented.

$ curl -XPOST http://localhost:9200/index/logs/_mapping -d '

{

  "logs": {

    "_all": {

      "analyzer": "ik"

    },

    "properties": {

      "path": {

        "type": "string"

      },

      "@timestamp": {

        "format": "strict_date_optional_time||epoch_millis",

        "type": "date"

      },

      "@version": {

        "type": "string"

      },

      "host": {

        "type": "string"

      },

      "message": {

        "include_in_all": true,

        "analyzer": "ik",

        "term_vector": "with_positions_offsets",

        "boost": 8,

        "type": "string",

        "fielddata" : { "format" : "true" }

      },

      "tags": {

        "type": "string"

      }

    }

  }

}'

(3) Use Logstash to write the text file into ES

Install Logstash

$ wget https://download.elastic.co/logstash/logstash/packages/centos/logstash-2.1.1-1.noarch.rpm

$ rpm -ivh logstash-2.1.1-1.noarch.rpm

Configure Logstash

$ vim /etc/logstash/conf.d/logstash.conf

input {

  file {

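      # The input is plain text rather than JSON, so this codec tags every event
      # with _jsonparsefailure and stores the raw line in "message";
      # codec => "plain" would avoid the tag.
      codec => 'json'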

      path => "/tmp/zhongwen.txt"

      start_position => "beginning"

  }

}

output {

    elasticsearch {

      hosts => "10.10.10.10:9200"

      index => "index"

      flush_size => 3000

      idle_flush_time => 2

      workers => 4

     }

  stdout { codec => rubydebug }

}

Start Logstash

$ /etc/init.d/logstash start

Watch the stdout output to tell whether events are being written to ES.

$ tail -f /var/log/logstash.stdout
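For a small file, the lines could also be loaded without Logstash through the bulk API. The Python sketch below only builds the request body; build_bulk_body is a hypothetical helper, and the index and type names are taken from the mapping created above. Documents loaded this way contain only the message field, not the @timestamp, host and path fields that Logstash adds.

```python
import json


def build_bulk_body(lines, index="index", doc_type="logs"):
    """Build a newline-delimited bulk body with one document per non-empty line."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    parts = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of indexing empty documents
        parts.append(action)
        parts.append(json.dumps({"message": text}, ensure_ascii=False))
    if not parts:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    with open("/tmp/zhongwen.txt", encoding="utf-8") as source:
        body = build_bulk_body(source)
    print(body[:200])
```

The resulting body would be sent as UTF-8 bytes in a POST request to http://localhost:9200/_bulk.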

(4) Check that the index contains data

$ curl 'localhost:9200/_cat/indices/index?v'  # shows 6007 documents

health status index pri rep docs.count docs.deleted store.size pri.store.size

green  open   index   5   0       6007            0      2.5mb          2.5mb

$ curl -XPOST  "http://localhost:9200/index/_search?pretty"

{

  "took" : 1,

  "timed_out" : false,

  "_shards" : {

    "total" : 5,

    "successful" : 5,

    "failed" : 0

  },

  "hits" : {

    "total" : 5227,

    "max_score" : 1.0,

    "hits" : [ {

      "_index" : "index",

      "_type" : "logs",

      "_id" : "AVluC7Dpbw7ZlXPmUTSG",

      "_score" : 1.0,

      "_source" : {

        "message" : "中国百万年薪须交40多万个税 精英无奈出国发展",

        "tags" : [ "_jsonparsefailure" ],

        "@version" : "1",

        "@timestamp" : "2017-01-05T09:52:56.150Z",

        "host" : "0.0.0.0",

        "path" : "/tmp/333.log"

      }

    }, {

      "_index" : "index",

      "_type" : "logs",

      "_id" : "AVluC7Dpbw7ZlXPmUTSN",

      "_score" : 1.0,

      "_source" : {

        "message" : "奥巴马不顾特朗普反对坚持外迁关塔那摩监狱囚犯",

        "tags" : [ "_jsonparsefailure" ],

        "@version" : "1",

        "@timestamp" : "2017-01-05T09:52:56.222Z",

        "host" : "0.0.0.0",

        "path" : "/tmp/333.log"

      }

}

Part 4: Count and rank term frequencies

(1) Top 10 most frequent terms overall

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'

{

    "size" : 0,

    "aggs" : {

        "messages" : {

            "terms" : {

               "size" : 10,

              "field" : "message"

            }

        }

    }

}'

Response (doc_count is the number of documents, here lines of the file, that contain the term):

{

  "took" : 3,

  "timed_out" : false,

  "_shards" : {

    "total" : 5,

    "successful" : 5,

    "failed" : 0

  },

  "hits" : {

    "total" : 6007,

    "max_score" : 0.0,

    "hits" : [ ]

  },

  "aggregations" : {

    "messages" : {

      "doc_count_error_upper_bound" : 154,

      "sum_other_doc_count" : 94992,

      "buckets" : [ {

        "key" : "一",

        "doc_count" : 1582

      }, {

        "key" : "后",

        "doc_count" : 560

      }, {

        "key" : "人",

        "doc_count" : 541

      }, {

        "key" : "家",

        "doc_count" : 538

      }, {

        "key" : "出",

        "doc_count" : 489

      }, {

        "key" : "发",

        "doc_count" : 451

      }, {

        "key" : "个",

        "doc_count" : 440

      }, {

        "key" : "州",

        "doc_count" : 421

      }, {

        "key" : "岁",

        "doc_count" : 405

      }, {

        "key" : "子",

        "doc_count" : 402

      } ]

    }

  }

}
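To turn such a response into a ranked list, only the buckets array is needed. The Python sketch below extracts it and also relates each count to the total number of documents. The function names are my own, and the sample is an abridged copy of the response above.

```python
def top_terms(search_response, agg_name="messages"):
    """Return (term, doc_count) pairs in the order ES ranked them."""
    buckets = search_response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


def term_shares(search_response, agg_name="messages"):
    """Return (term, fraction of all matched documents that contain the term)."""
    total = search_response["hits"]["total"]
    return [(term, count / total) for term, count in top_terms(search_response, agg_name)]


# Abridged copy of the response above.
SAMPLE = {
    "hits": {"total": 6007},
    "aggregations": {
        "messages": {
            "buckets": [
                {"key": "一", "doc_count": 1582},
                {"key": "后", "doc_count": 560},
            ]
        }
    },
}

if __name__ == "__main__":
    for term, share in term_shares(SAMPLE):
        print(term, "%.1f%%" % (share * 100))
```

Because a terms aggregation counts documents rather than occurrences, a term that appears twice in the same line is still counted once.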

(2) Top 10 most frequent two-character words

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'

{

    "size" : 0,

    "aggs" : {

        "messages" : {

            "terms" : {

                 "size" : 10,

              "field" : "message",

                "include" : "[\u4E00-\u9FA5][\u4E00-\u9FA5]"

            }

        }

    },

   "highlight": {

     "fields": {

      "message": {}

    }

  }

}'

{

  "took" : 22,

  "timed_out" : false,

  "_shards" : {

    "total" : 5,

    "successful" : 5,

    "failed" : 0

  },

  "hits" : {

    "total" : 6007,

    "max_score" : 0.0,

    "hits" : [ ]

  },

  "aggregations" : {

    "messages" : {

      "doc_count_error_upper_bound" : 73,

      "sum_other_doc_count" : 42415,

      "buckets" : [ {

        "key" : "女子",

        "doc_count" : 291

      }, {

        "key" : "男子",

        "doc_count" : 264

      }, {

        "key" : "竟然",

        "doc_count" : 257

      }, {

        "key" : "上海",

        "doc_count" : 255

      }, {

        "key" : "这个",

        "doc_count" : 238

      }, {

        "key" : "女孩",

        "doc_count" : 174

      }, {

        "key" : "这些",

        "doc_count" : 167

      }, {

        "key" : "一个",

        "doc_count" : 159

      }, {

        "key" : "注意",

        "doc_count" : 143

      }, {

        "key" : "这样",

        "doc_count" : 142

      } ]

    }

  }

}
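When the request is built in code rather than typed into curl, the \uXXXX escapes do not have to be written by hand. In this Python sketch (the function name is an assumption), the pattern is an ordinary string, and json.dumps escapes the non-ASCII characters by default, producing the same kind of body as the curl command above.

```python
import json


def two_char_terms_query(field="message", size=10):
    """Build a terms aggregation limited to terms made of exactly two CJK characters."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "messages": {
                "terms": {
                    "field": field,
                    "size": size,
                    # Two characters, each in the range U+4E00 to U+9FA5.
                    "include": "[\u4e00-\u9fa5][\u4e00-\u9fa5]",
                }
            }
        },
    }


if __name__ == "__main__":
    # ensure_ascii is on by default, so the range appears as \u4e00-\u9fa5.
    print(json.dumps(two_char_terms_query(), indent=2))
```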

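(3) Top 10 most frequent two-character words, excluding words that begin with “女” (the pattern 女.* only matches terms that start with “女”; use .*女.* to exclude the character in any position)

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'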

{

    "size" : 0,

    "aggs" : {

        "messages" : {

            "terms" : {

              "size" : 10,

              "field" : "message",

              "include" : "[\u4E00-\u9FA5][\u4E00-\u9FA5]",

              "exclude" : "女.*"

            }

        }

    },

   "highlight": {

     "fields": {

      "message": {}

    }

  }

}'

{

  "took" : 19,

  "timed_out" : false,

  "_shards" : {

    "total" : 5,

    "successful" : 5,

    "failed" : 0

  },

  "hits" : {

    "total" : 5227,

    "max_score" : 0.0,

    "hits" : [ ]

  },

  "aggregations" : {

    "messages" : {

      "doc_count_error_upper_bound" : 71,

      "sum_other_doc_count" : 41773,

      "buckets" : [ {

        "key" : "男子",

        "doc_count" : 264

      }, {

        "key" : "竟然",

        "doc_count" : 257

      }, {

        "key" : "上海",

        "doc_count" : 255

      }, {

        "key" : "这个",

        "doc_count" : 238

      }, {

        "key" : "这些",

        "doc_count" : 167

      }, {

        "key" : "一个",

        "doc_count" : 159

      }, {

        "key" : "注意",

        "doc_count" : 143

      }, {

        "key" : "这样",

        "doc_count" : 142

      }, {

        "key" : "重庆",

        "doc_count" : 142

      }, {

        "key" : "结果",

        "doc_count" : 137

      } ]

    }

  }

}
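The effect of the include and exclude patterns can be tried offline. The Python sketch below emulates them with re.fullmatch, since the patterns in a terms aggregation have to match the whole term. This is only an approximation: ES uses Lucene's regular expression syntax, which differs from Python's in general but agrees for simple patterns like these. The function name keep_term is hypothetical.

```python
import re

INCLUDE = "[\u4e00-\u9fa5][\u4e00-\u9fa5]"  # exactly two CJK characters
EXCLUDE = "女.*"                             # terms that start with 女


def keep_term(term, include=INCLUDE, exclude=EXCLUDE):
    """Return True when a term passes the include filter and is not excluded."""
    if include is not None and re.fullmatch(include, term) is None:
        return False
    if exclude is not None and re.fullmatch(exclude, term) is not None:
        return False
    return True


if __name__ == "__main__":
    for term in ["男子", "女子", "妇女", "一", "烂摊子"]:
        print(term, keep_term(term))
```

With the patterns from the query, “女子” is dropped but “妇女” is kept; passing exclude=".*女.*" drops both.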

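There are many more analysis strategies, for example synonyms (configure “番茄” and “西红柿” as synonyms so that a search for “番茄” also returns “西红柿”) and pinyin analysis (a search for “zhonghua” also finds “中华”), and so on.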
