Data Mining: Extracting Keywords from Film and TV Reviews with Spark + HanLP (1)

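1. Background

A recent project required extracting keywords for films and TV shows from crawled viewer reviews. Because the volume of review data is large, Spark was chosen as the processing framework. Keyword extraction consists of two stages: word segmentation and algorithmic extraction. The mainstream segmentation toolkits include LTP from Harbin Institute of Technology and HanLP, and there are many keyword extraction algorithms, including TF-IDF, TextRank, and mutual information. In this task, segmentation is done with LTP, HanLP, and an Aho-Corasick (AC) double-array trie, and keywords are extracted by combining TextRank, mutual information, and TF-IDF.

Note: this project has only just started, so the quality of the results still needs iterative tuning.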

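2. Technology Choices

(1) Dictionaries

1) The dictionary data shipped with the HanLP project; see HanLP's GitHub repository for details.

2) Because film and TV is a vertical domain, the Chinese words from Tencent's word embeddings are added as well; see the referenced address.

(2) Word segmentation

1) LTP segmentation service: deployed as a multi-replica service on Docker Swarm and queried over HTTP to obtain segmentation results (deployment instructions are easy to find with a web search). LTP can also be loaded locally and called in memory, which should be more efficient (not tried).

2) AC double-array trie: longest-match segmentation, using the AC double-array trie segmenter in HanLP.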
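To make "longest match" concrete, here is a minimal Java sketch of forward longest-match segmentation against a dictionary. It is a simplification, not HanLP's segmenter: a plain HashSet stands in for the AC double-array trie, and the class name LongestMatchSketch and the maxWordLen parameter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a HashSet replaces the Aho-Corasick double-array trie.
final class LongestMatchSketch {
    /**
     * Scans the text left to right; at each position it emits the longest
     * dictionary word starting there, or a single character if none matches.
     */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            String match = null;
            // Try the longest candidate first, then shrink it one character at a time.
            for (int j = end; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(i, i + 1); // no dictionary word starts here
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }
}
```

With a dictionary containing 战狼, 爱国, 情怀, and 爱国情怀, segmenting 战狼中的爱国情怀 with maxWordLen = 4 yields 战狼 / 中 / 的 / 爱国情怀: the longer entry 爱国情怀 wins over 爱国. A trie-based automaton reaches the same kind of result without re-scanning substrings, which matters for a dictionary with millions of entries.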

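(3) Extraction

1) Classic TF-IDF: implemented from word frequency statistics

2) TextRank: modeled on the PageRank algorithm, using the interface provided by HanLP

3) Mutual information: using the interface provided by HanLP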
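HanLP hides the TextRank iteration behind its keyword interface. As a rough illustration of the idea borrowed from PageRank, the Java sketch below links words that co-occur within a sliding window and then repeatedly redistributes scores along those links. It is not HanLP's implementation; the class name TextRankSketch and the window, damping factor d, and iteration count parameters are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative TextRank over an undirected word co-occurrence graph.
final class TextRankSketch {
    /**
     * @param words      segmented words, in text order
     * @param window     a word is linked to the (window - 1) words that follow it
     * @param d          damping factor, as in PageRank
     * @param iterations number of score updates to run
     */
    static Map<String, Double> rank(List<String> words, int window, double d, int iterations) {
        // Build the graph: an edge joins two distinct words that fall in the same window.
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        for (int i = 0; i < words.size(); i++) {
            String a = words.get(i);
            graph.computeIfAbsent(a, k -> new LinkedHashSet<>());
            for (int j = i + 1; j < Math.min(words.size(), i + window); j++) {
                String b = words.get(j);
                if (a.equals(b)) {
                    continue;
                }
                graph.computeIfAbsent(b, k -> new LinkedHashSet<>());
                graph.get(a).add(b);
                graph.get(b).add(a);
            }
        }

        // Every word starts with the same score.
        Map<String, Double> score = new LinkedHashMap<>();
        for (String w : graph.keySet()) {
            score.put(w, 1.0);
        }

        // Each neighbor passes on its score divided by its own number of links.
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new LinkedHashMap<>();
            for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
                double sum = 0.0;
                for (String neighbor : e.getValue()) {
                    sum += score.get(neighbor) / graph.get(neighbor).size();
                }
                next.put(e.getKey(), (1 - d) + d * sum);
            }
            score = next;
        }
        return score;
    }
}
```

Sorting the returned map by value in descending order and taking the first few entries gives the keywords. A word that co-occurs with many different words accumulates score from all of them, which is why hub words in a review tend to rank first.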

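3. Implementation

(1) Code structure

1) Each segmentation service is wrapped in a function, and the segmenter to run is selected by name.

2) TextRank, mutual information, LTP, and the AC double-array trie each produce words or phrases, and all of these are finally scored with TF-IDF.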
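The TF-IDF scoring used in the main code can be summarized as tf = (occurrences of the word in a film's reviews) / (total words for that film) and idf = ln(N / (df + 1)), where N is the number of films and df is the number of films containing the word. The following Java sketch reproduces only that arithmetic on in-memory lists. The class and method names are assumptions, and it leaves out two details of the Spark code: the filter that drops words appearing in only one film, and the top-50 cut.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative TF-IDF on in-memory documents (one document = all words of one film).
final class TfIdfSketch {
    /** idf(w) = ln(N / (df(w) + 1)), where N is the number of documents. */
    static Map<String, Double> idf(Collection<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each word at most once per document.
            for (String w : new HashSet<>(doc)) {
                df.merge(w, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / (e.getValue() + 1)));
        }
        return idf;
    }

    /** score(w) = tf(w) * idf(w); words without an idf entry score 0. */
    static Map<String, Double> score(List<String> doc, Map<String, Double> idf) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : doc) {
            counts.merge(w, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / doc.size();
            scores.put(e.getKey(), tf * idf.getOrDefault(e.getKey(), 0.0));
        }
        return scores;
    }
}
```

Note that with the "+ 1" in the denominator, a word that occurs in every document gets a slightly negative idf, so it sorts below words that are absent from the idf table.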

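(2) Full code

1) Main code: the details depend on the structure of the downloaded raw review data, so there is no need to dwell on them; just follow the overall flow.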

 def extractFilmKeyWords(algorithm: String): Unit ={
     // Smoke test of the in-memory segmenter
     println(HanLPSpliter.getInstance.seg("如何看待《战狼2》中的爱国情怀？"))
 
     val sc = new SparkContext(new SparkConf().setAppName("extractFileKeyWords").set("spark.driver.maxResultSize", "3g"))
 
     val baseDir = "/work/ws/video/parse/key_word"
 
     import scala.collection.JavaConversions._
     def extractComments(sc: SparkContext, inputInfo: (String, String)): RDD[(String, List[String])] = {
       sc.textFile(s"$baseDir/data/${inputInfo._2}")
         .map(data => {
           val json = JSONObjectEx.fromObject(data.trim)
           if(null == json) ("", List())
           else{
             val id = json.getStringByKeys("_id")
             val comments: List[String] = json.getArrayInfo("comments", "review").toList
             val reviews: List[String] = json.getArrayInfo("reviews", "review").toList
             val titles: List[String] = json.getArrayInfo("reviews", "title").toList
             val texts = (comments ::: reviews ::: titles).filter(f => !CleanUtils.isEmpty(f))
             (IdBuilder.getSourceKey(inputInfo._1, id), texts)
           }
         })
     }
 
     // Broadcast the stop words
     val filterWordRdd = sc.broadcast(sc.textFile(s"$baseDir/data/stopwords.txt").map(_.trim).distinct().collect().toList)
 
     def formatOutput(infos: List[(Int, String)]): String ={
       infos.map(info => {
         val json = new JSONObject()
         json.put("status", info._1)
         try{
           json.put("res", info._2)
         } catch {
           case _: Exception => json.put("res", "[]")
         }
         json.toString.replaceAll("[\\s]+", "")
       }).mkString(" | ")
     }
 
     def genContArray(words: List[String]): JSONArray ={
       val arr = new JSONArray()
       words.foreach(f => {
         val json = new JSONObject()
         json.put("cont", f)
         arr.put(json)
       })
       arr
     }
 
     // Segment via the LTP service
     def splitWordByLTP(texts: List[String]): List[(Int, String)] ={
       texts.map(f => {
         val url = "http://dev.content_ltp.research.com/ltp"
         val params = new util.HashMap[String, String]()
         params.put("s", f)
         params.put("f", "json")
         params.put("t", "ner")
         // Call the LTP segmentation service
         val result = HttpPostUtil.httpPostRetry(url, params).replaceAll("[\\s]+", "")
         if (CleanUtils.isEmpty(result)) (0, f) else {
           val resultArr = new JSONArray()
 
           val jsonArr = try { JSONArray.fromString(result) } catch { case _: Exception => null }
           if (null != jsonArr && 0 < jsonArr.length()) {
             for (i <- 0 until jsonArr.getJSONArray(0).length()) {
               val subJsonArr = jsonArr.getJSONArray(0).getJSONArray(i)
               for (j <- 0 until subJsonArr.length()) {
                 val subJson = subJsonArr.getJSONObject(j)
                 if(!filterWordRdd.value.contains(subJson.getString("cont"))){
                   resultArr.put(subJson)
                 }
               }
             }
           }
           if(resultArr.length() > 0) (1, resultArr.toString) else (0, f)
         }
       })
     }
 
     // Segment via a service built on the AC double-array trie
     def splitWordByAcDoubleTreeServer(texts: List[String]): List[(Int, String)] ={
       texts.map(f => {
         val splitResults = SplitQueryHelper.splitQueryText(f)
           .filter(f => !CleanUtils.isEmpty(f) && !filterWordRdd.value.contains(f.toLowerCase)).toList
         if (0 == splitResults.size) (0, f) else (1, genContArray(splitResults).toString)
       })
     }
 
     // AC double-array trie loaded in memory
     def splitWordByAcDoubleTree(texts: List[String]): List[(Int, String)] ={
       texts.map(f => {
         val splitResults =  HanLPSpliter.getInstance().seg(f)
           .filter(f => !CleanUtils.isEmpty(f) && !filterWordRdd.value.contains(f.toLowerCase)).toList
         if (0 == splitResults.size) (0, f) else (1, genContArray(splitResults).toString)
       })
     }
 
     // TextRank
     def splitWordByTextRank(texts: List[String]): List[(Int, String)] ={
       texts.map(f => {
         val splitResults = HanLP.extractKeyword(f, 100)
           .filter(f => !CleanUtils.isEmpty(f) && !filterWordRdd.value.contains(f.toLowerCase)).toList
         if (0 == splitResults.size) (0, f) else {
           val arr = genContArray(splitResults)
           if(0 == arr.length()) (0, f) else (1, arr.toString)
         }
       })
     }
 
     // Mutual information
     def splitWordByMutualInfo(texts: List[String]): List[(Int, String)] ={
       texts.map(f => {
         val splitResults = HanLP.extractPhrase(f, 50)
           .filter(f => !CleanUtils.isEmpty(f) && !filterWordRdd.value.contains(f.toLowerCase)).toList
         if (0 == splitResults.size) (0, f) else {
           val arr = genContArray(splitResults)
           if(0 == arr.length()) (0, f) else (1, arr.toString)
         }
       })
     }
 
     // Load the review texts to be segmented
     val unionInputRdd = sc.union(
       extractComments(sc, SourceType.DB -> "db_review.json"),
       extractComments(sc, SourceType.MY -> "my_review.json"),
       extractComments(sc, SourceType.MT -> "mt_review.json"))
       .filter(_._2.nonEmpty)
 
     unionInputRdd.cache()
 
     unionInputRdd.map(data => {
       val splitResults = algorithm match {
         case "ltp" => splitWordByLTP(data._2)
         case "acServer" => splitWordByAcDoubleTreeServer(data._2)
         case "ac" => splitWordByAcDoubleTree(data._2)
         case "textRank" => splitWordByTextRank(data._2)
         case "mutualInfo" => splitWordByMutualInfo(data._2)
       }
 
       val output = formatOutput(splitResults)
       s"${data._1}\t$output"
     }).saveAsTextFile(HDFSFileUtil.clean(s"$baseDir/result/wordSplit/$algorithm"))
 
     val splitRDD = sc.textFile(s"$baseDir/result/wordSplit/$algorithm/part*", 30)
       .flatMap(data => {
         if(data.split("\\t").length < 2) None
         else{
           val sourceKey = data.split("\\t")(0)
           val words = data.split("\\t")(1).split(" \\| ").flatMap(f => {
             val json = JSONObjectEx.fromObject(f.trim)
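             // Status 1 marks a text that was segmented successfully (see the splitWordBy* functions)
             if (null != json && "1".equals(json.getStringByKeys("status"))) {
               val jsonArr = try { JSONArray.fromString(json.getStringByKeys("res")) } catch { case _: Exception => null }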
               var result: List[(String, String)] = List()
               if (jsonArr != null) {
                 for (j <- 0 until jsonArr.length()) {
                   val json = jsonArr.getJSONObject(j)
                   val cont = json.getString("cont")
                   result ::= (cont, cont)
                 }
               }
               result.reverse
             } else None
           }).toList
           Some((sourceKey, words))
         }
       }).filter(_._2.nonEmpty)
 
     splitRDD.cache()
 
     val totalFilms = splitRDD.count()
 
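     // Compute IDF: count the films each word appears in, drop words seen in only one film,
     // then take ln(totalFilms / (documentFrequency + 1))
     val idfRdd = splitRDD.flatMap(result => {
       result._2.map(_._1).distinct.map((_, 1))
     }).groupByKey().filter(f => f._2.size > 1).map(f => (f._1, Math.log(totalFilms * 1.0 / (f._2.sum + 1))))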
 
     idfRdd.cache()
     idfRdd.map(f => s"${f._1}\t${f._2}").saveAsTextFile(HDFSFileUtil.clean(s"$baseDir/result/idf/$algorithm"))
 
     val idfMap = sc.broadcast(idfRdd.collectAsMap())
     // Compute TF, multiply by IDF, and keep the top 50 words per film
     val tfRdd = splitRDD.map(result => {
       val totalWords = result._2.size
       val keyWords = result._2.groupBy(_._1)
         .map(f => {
           val word = f._1
           val tf = f._2.size * 1.0 / totalWords
           (tf * idfMap.value.getOrElse(word, 0D), word)
         }).toList.sortBy(_._1).reverse.filter(_._2.trim.length > 1).take(50)
       (result._1, keyWords)
     })
 
     tfRdd.cache()
     tfRdd.map(f => {
       val json = new JSONObject()
       json.put("_id", f._1)
 
       val arr = new JSONArray()
       for (keyWord <- f._2) {
         val subJson = new JSONObject()
         subJson.put("score", keyWord._1)
         subJson.put("word", keyWord._2)
         arr.put(subJson)
       }
       json.put("keyWords", arr)
       json.toString
     }).saveAsTextFile(HDFSFileUtil.clean(s"$baseDir/result/keyword/$algorithm/withScore"))
 
     tfRdd.map(f => s"${f._1}\t${f._2.map(_._2).toList.mkString(",")}")
       .saveAsTextFile(HDFSFileUtil.clean(s"$baseDir/result/keyword/$algorithm/noScore"))
 
     tfRdd.unpersist()
 
     splitRDD.unpersist()
     idfMap.unpersist()
     idfRdd.unpersist()
 
     unionInputRdd.unpersist()
     filterWordRdd.unpersist()
     sc.stop()
   }

2) Wrapper around the AC double-array trie segmenter provided by HanLP

 import com.google.common.collect.Lists;
 import com.hankcs.hanlp.HanLP;
 import com.hankcs.hanlp.seg.Segment;
 import com.hankcs.hanlp.seg.common.Term;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.Serializable;
 import java.util.List;
 
 public class HanLPSpliter implements Serializable{
     private static final Logger logger = LoggerFactory.getLogger(HanLPSpliter.class);
 
     private static HanLPSpliter instance = null;
 
     private static Segment segment = null;
 
     private static final String PATH = "conf/tencent_word_act.txt";
 
     // synchronized so that concurrent tasks in one executor JVM create only one instance
     public static synchronized HanLPSpliter getInstance() {
         if(null == instance){
             instance = new HanLPSpliter();
         }
         return instance;
     }
 
     public HanLPSpliter(){
         this.init();
     }
 
     public void init(){
         initSegment();
     }
 
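     public void initSegment(){
         if(null == segment){
             // addDict() is not shown in this post; it is expected to register the custom dictionary
             addDict();
             // Read dictionary files through Hadoop's FileSystem API so that HDFS paths work
             HanLP.Config.IOAdapter = new HadoopFileIOAdapter();
             // "dat" requests the double-array trie segmenter
             segment = HanLP.newSegment("dat");
             segment.enablePartOfSpeechTagging(true);
             segment.enableCustomDictionaryForcing(true);
         }
     }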
 
     public List<String> seg(String text){
         if(null == segment){
             initSegment();
         }
 
         List<Term> terms = segment.seg(text);
         List<String> results = Lists.newArrayList();
         for(Term term : terms){
             results.add(term.word);
         }
         return results;
     }
 }

3) Loading a custom dictionary from HDFS in HanLP

 import com.hankcs.hanlp.corpus.io.IIOAdapter;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.OutputStream;
 import java.net.URI;
 
 public class HadoopFileIOAdapter implements IIOAdapter{
     @Override
     public InputStream open(String path) throws IOException {
         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(URI.create(path), conf);
         return fs.open(new Path(path));
     }
 
     @Override
     public OutputStream create(String path) throws IOException {
         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(URI.create(path), conf);
         OutputStream out = fs.create(new Path(path));
         return out;
     }
 }

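4. Pitfalls

(1) Loading a HanLP custom dictionary in Spark

Because the Tencent embedding vocabulary is used, HanLP's custom dictionary feature is required. Two approaches were consulted:

a. 《基于hanLP的中文分词详解-MapReduce实现&自定义词典文件》 ("Chinese word segmentation with HanLP in detail: MapReduce implementation and custom dictionary files"): this approach suits small custom dictionaries. With a large dictionary, such as the 8.2 million+ Tencent embedding words, the jar would in theory become bloated.

b. 《Spark中使用HanLP分词》 ("Using HanLP segmentation in Spark"): the advantage of this approach is that the dictionary's bin file does not have to be built by hand, so it is simple to operate.

Important: for the custom dictionary to take effect, first delete the bin file under data/dictionary/custom. The HanLP source shows that if a bin file exists, it is loaded directly; otherwise the user-defined dictionaries in custom are reloaded and a new bin file is generated automatically in the configured environment (for example, locally or on HDFS).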
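When the dictionary directory is on the local file system, clearing the stale cache before a run can be scripted. The Java sketch below deletes every file ending in .bin directly under a given directory. The class name BinCacheCleaner is an assumption, the directory is supplied by the caller, and a dictionary stored on HDFS would need the equivalent step through Hadoop's own tooling instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper for a local dictionary directory; not part of HanLP.
final class BinCacheCleaner {
    /** Deletes regular files matching *.bin in dictDir and returns the deleted paths. */
    static List<Path> deleteBinFiles(Path dictDir) {
        List<Path> targets = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dictDir, "*.bin")) {
            // Collect first, then delete, so the listing is complete before any change.
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    targets.add(p);
                }
            }
            for (Path p : targets) {
                Files.delete(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return targets;
    }
}
```

Calling BinCacheCleaner.deleteBinFiles(Paths.get("data/dictionary/custom")) before the first segmentation forces the text dictionaries to be reloaded and the bin file to be rebuilt on that run.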

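For the 8.2-million-word Tencent dictionary, generating the bin file with HanLP takes roughly 30 minutes.

(2) Spark exceptions

Exceptions encountered while running the Spark job:

1) Exception 1

a. Message: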

Job aborted due to stage failure: Total size of serialized results of 3979 tasks (1024.2 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

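b. Fix: raise spark.driver.maxResultSize, for example to 4G (the code above sets it to 3g); see 《Spark排错与优化》 ("Spark troubleshooting and optimization").

2) Exception 2

a. Message: java.lang.OutOfMemoryError: Java heap space

b. Fix: see https://blog.csdn.net/guohecang/article/details/52088117

If you have any questions, please leave a comment!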