文本自动摘要 - 阅读笔记


自动文摘要解决的问题描述很简单,就是用一些精炼的话来概括整篇文章的大意,用户通过阅读文摘就可以了解到原文要表达的意思。

问题包括两种解决思路,

  • 一种是extractive,抽取式的,从原文中找到一些关键的句子,组合成一篇摘要;【最主流、应用最多、最容易的方法】
  • 另外一种是abstractive,摘要式的,这需要计算机可以读懂原文的内容,并且用自己的意思将其表达出来。【相对来说更有一种真正人工智能的味道】

单文档Extractive (抽取式)Summarization 

抽取式的方法基于一个假设,一篇文档的核心思想可以用文档中的某一句或几句话来概括。那么摘要的任务就变成了找到文档中最重要的几句话,也就是一个句子排序的问题。

排序针对不同的问题,需要提出不同的指标,比如有的应用关心的是相关性,有的关心的是时效性,有的关心的是新颖性等等,在这个层面上来讨论排序,会有不同的模型。 
一般的抽取式摘要问题,会考虑相关性新颖性两个指标。相关性是指摘要所用的句子最能够代表本文档的意思,而新颖性是指候选句子包含的冗余信息要少,尽可能每句话都可以独立地表达出一种独立的意思。

下面简单介绍一些思路。

1.预处理 
NLP任务的标准流程中第一步都是预处理,将拿到的文本做分句,这里有两种可能性,一是用句点或者其他可以表达一句话结尾的符号作为分隔,另外一种是用逗号作为分隔符获取句子。

2.词、句表示 
这一步的思路是:将词、句子表示成计算机能理解的量,然后计算一些指标进行排序。这个地方也是各种算法、模型最大的不同之处: 
(1)Bag Of Words。词袋模型将词定义为一个维度,一句话表示成在所有词张成的空间中的一个高维稀疏向量。 
(2)TF-IDF。可以理解为带权重的词袋模型,计算出每个词的TFIDF值,作为该词的权重。 
(3)LDA/LSI。将整篇文档利用TFIDF模型表示成一个矩阵,做SVD降维分解,生成两个矩阵,一个是文档-话题矩阵、另一个是词-话题矩阵。得到词-话题矩阵之后,可以得到句子-话题矩阵。 
(4)Word Embedding。Tomas Mikolov提出的Word2Vec,用了很多技巧和近似的思路让word很容易地表示成一个低维稠密向量,在很多情况下都可以达到不错的效果。词成为了一个向量,句子也可有很多种方法表示成一个向量。

3.排序 
这里介绍两种常见的方法。 
(1)基于图排序 
将文档的每句话作为节点,句子之间的相似度作为边权值构建图模型,用pagerank算法进行求解,得到每个句子的得分。

(2)基于特征 
特征工程在深度学习火之前是解决特定领域问题的良药,这里用到的特征包括: 
1)句子长度,长度为某个长度的句子为最理想的长度,依照距离这个长度的远近来打分。 
2)句子位置,根据句子在全文中的位置,给出分数。(比如每段的第一句是核心句的比例大概是70%) 
3)句子是否包含标题词,根据句子中包含标题词的多少来打分。 
4)句子关键词打分,文本进行预处理之后,按照词频统计出排名前10的关键词,通过比较句子中包含关键词的情况,以及关键词分布的情况来打分。 
代表算法是TextTeaser。

4.后处理 
排序之后的结果只考虑了相关性并没有考虑新颖性,非常有可能出现排名靠前的几句话表达的都是相似的意思。所以需要引入一个惩罚因子,将新颖性考虑进去。对所有的句子重新打分,如下公式: 
a score(i) + (1-a) similarity(i, i-1), i = 2,3,….N 
序号i表示排序后的顺序,从第二句开始,排第一的句子不需要重新计算,后面的句子必须被和前一句的相似度进行惩罚。 
这个算法就是所谓的MMR(Maximum Margin Relevance)

5.输出 
输出的结果一般是取排序后的前N句话,这里涉及到一个非常重要的问题,也是一直自动文摘质量被诟病的问题,可读性。因为各个句子都是从不同的段落中选择出来的,如果只是生硬地连起来生成摘要的话,很难保证句子之间的衔接和连贯。保证可读性是一件很难的事情。 
这里有一个取巧的方法,就是将排序之后的句子按照原文中的顺序输出,可以在一定程度下保证一点点连贯性。

参考: 
[1] TextRank源码阅读笔记

https://gist.github.com/rsarxiv/11470a8d763b2845f671061c21230435

[2] TextTeaser源码阅读笔记

https://gist.github.com/rsarxiv/4e949264b3bda98828b84cf2991e57e4

关于abstractive,摘要式的,详见原文链接: http://blog.csdn.net/lu839684437/article/details/71600410


代码分析:https://github.com/DataTeaser/textteaser

main.py

from summarizer import Summarizer

def getInput():
with open('input.txt') as file:
content = file.readlines() # remove unnecessary \n
content = [c.replace('\n', '') for c in content if c != '\n'] title = content[0]
text = content[-(len(content) - 1):] return {'title': title, 'text': ' '.join(text)} # ##################### input = getInput()
input['text'] = input['text'].decode("ascii", "ignore") input['text'] = " ".join(input['text'].replace("\n", " ").split()) summarizer = Summarizer()
result = summarizer.summarize(input['text'], input['title'], 'Undefined', 'Undefined')
result = summarizer.sortScore(result)
result = summarizer.sortSentences(result[:30]) print 'Summary:' for r in result:
print r['sentence']
# print r['totalScore']
# print r['order']

summarizer.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from parser import Parser class Summarizer:
def __init__(self):
self.parser = Parser() def summarize(self, text, title, source, category):
sentences = self.parser.splitSentences(text)
titleWords = self.parser.removePunctations(title)
titleWords = self.parser.splitWords(title)
(keywords, wordCount) = self.parser.getKeywords(text) topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)
result = self.computeScore(sentences, titleWords, topKeywords)
result = self.sortScore(result) return result def getTopKeywords(self, keywords, wordCount, source, category):
# Add getting top keywords in the database here
for keyword in keywords:
articleScore = 1.0 * keyword['count'] / wordCount
keyword['totalScore'] = articleScore * 1.5 return keywords def sortScore(self, dictList):
return sorted(dictList, key=lambda x: -x['totalScore']) def sortSentences(self, dictList):
return sorted(dictList, key=lambda x: x['order']) def computeScore(self, sentences, titleWords, topKeywords):
keywordList = [keyword['word'] for keyword in topKeywords]
summaries = [] for i, sentence in enumerate(sentences):
sent = self.parser.removePunctations(sentence)
words = self.parser.splitWords(sent) sbsFeature = self.sbs(words, topKeywords, keywordList)
dbsFeature = self.dbs(words, topKeywords, keywordList) titleFeature = self.parser.getTitleScore(titleWords, words)
sentenceLength = self.parser.getSentenceLengthScore(words)
sentencePosition = self.parser.getSentencePositionScore(i, len(sentences))
keywordFrequency = (sbsFeature + dbsFeature) / 2.0 * 10.0
totalScore = (titleFeature * 1.5 + keywordFrequency * 2.0 + sentenceLength * 0.5 + sentencePosition * 1.0) / 4.0 summaries.append({
# 'titleFeature': titleFeature,
# 'sentenceLength': sentenceLength,
# 'sentencePosition': sentencePosition,
# 'keywordFrequency': keywordFrequency,
'totalScore': totalScore,
'sentence': sentence,
'order': i
}) return summaries def sbs(self, words, topKeywords, keywordList):
score = 0.0 if len(words) == 0:
return 0 for word in words:
word = word.lower()
index = -1 if word in keywordList:
index = keywordList.index(word) if index > -1:
score += topKeywords[index]['totalScore'] return 1.0 / abs(len(words)) * score def dbs(self, words, topKeywords, keywordList):
k = len(list(set(words) & set(keywordList))) + 1
summ = 0.0
firstWord = {}
secondWord = {} for i, word in enumerate(words):
if word in keywordList:
index = keywordList.index(word) if firstWord == {}:
firstWord = {'i': i, 'score': topKeywords[index]['totalScore']}
else:
secondWord = firstWord
firstWord = {'i': i, 'score': topKeywords[index]['totalScore']}
distance = firstWord['i'] - secondWord['i'] summ += (firstWord['score'] * secondWord['score']) / (distance ** 2) return (1.0 / k * (k + 1.0)) * summ

parser.py: we should implement our parser for law case database.

# !/usr/bin/python
# -*- coding: utf-8 -*-
import nltk.data
import os class Parser:
def __init__(self):
self.ideal = 20.0
self.stopWords = self.getStopWords() def getKeywords(self, text):
text = self.removePunctations(text)
words = self.splitWords(text)
words = self.removeStopWords(words)
uniqueWords = list(set(words)) keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords]
keywords = sorted(keywords, key=lambda x: -x['count']) return (keywords, len(words)) def getSentenceLengthScore(self, sentence):
return (self.ideal - abs(self.ideal - len(sentence))) / self.ideal # Jagadeesh, J., Pingali, P., & Varma, V. (2005). Sentence Extraction Based Single Document Summarization. International Institute of Information Technology, Hyderabad, India, 5.
def getSentencePositionScore(self, i, sentenceCount):
normalized = i / (sentenceCount * 1.0) if normalized > 0 and normalized <= 0.1:
return 0.17
elif normalized > 0.1 and normalized <= 0.2:
return 0.23
elif normalized > 0.2 and normalized <= 0.3:
return 0.14
elif normalized > 0.3 and normalized <= 0.4:
return 0.08
elif normalized > 0.4 and normalized <= 0.5:
return 0.05
elif normalized > 0.5 and normalized <= 0.6:
return 0.04
elif normalized > 0.6 and normalized <= 0.7:
return 0.06
elif normalized > 0.7 and normalized <= 0.8:
return 0.04
elif normalized > 0.8 and normalized <= 0.9:
return 0.04
elif normalized > 0.9 and normalized <= 1.0:
return 0.15
else:
return 0 def getTitleScore(self, title, sentence):
titleWords = self.removeStopWords(title)
sentenceWords = self.removeStopWords(sentence)
matchedWords = [word for word in sentenceWords if word in titleWords] return len(matchedWords) / (len(title) * 1.0) def splitSentences(self, text):
tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file__)).decode('utf-8') + '/trainer/english.pickle') return tokenizer.tokenize(text) def splitWords(self, sentence):
return sentence.lower().split() def removePunctations(self, text):
return ''.join(t for t in text if t.isalnum() or t == ' ') def removeStopWords(self, words):
return [word for word in words if word not in self.stopWords] def getStopWords(self):
with open(os.path.dirname(os.path.abspath(__file__)) + '/trainer/stopWords.txt') as file:
words = file.readlines() return [word.replace('\n', '') for word in words]

[IR] Extraction-based Text Summarization的更多相关文章

  1. DL4NLP —— seq2seq+attention机制的应用:文档自动摘要(Automatic Text Summarization)

    两周以前读了些文档自动摘要的论文,并针对其中两篇( [2] 和 [3] )做了presentation.下面把相关内容简单整理一下. 文本自动摘要(Automatic Text Summarizati ...

  2. 【论文简读】 Deep web data extraction based on visual

    <Deep web data extraction based on visual information processing>作者 J Liu 上海海事大学 2017 AIHC会议登载 ...

  3. Learning LexRank——Graph-based Centrality as Salience in Text Summarization(一)

    (1)What is Sentence Centrality and Centroid-based Summarization ? Extractive summarization works by ...

  4. [阅读笔记]Zhang Y. 3D Information Extraction Based on GPU.2010.

    1.立体视觉基础 深度定义为物体间的距离 视差定义为同一点在左图(reference image) 和右图( target image) 中的x坐标差. 根据左图中每个点的视差得到的灰度图称为视差图. ...

  5. 【论文阅读】An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches

    懒得转成文字再写一遍了,直接把做过的PPT放出来吧. 论文连接:https://link.zhihu.com/?target=https%3A//arxiv.org/pdf/1804.09003v1. ...

  6. 【论文速读】Fangfang Wang_CVPR2018_Geometry-Aware Scene Text Detection With Instance Transformation Network

    Han Hu--[ICCV2017]WordSup_Exploiting Word Annotations for Character based Text Detection 作者和代码 caffe ...

  7. XiangBai——【CVPR2018】Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation

    XiangBai——[CVPR2018]Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentat ...

  8. Abstractive Summarization

    Sequence-to-sequence Framework A Neural Attention Model for Abstractive Sentence Summarization Alexa ...

  9. 《A computer-aided healthcare system for cataract classification and grading based on fundus image analysis》学习笔记

    Abstract This paper presents a fundus image analysis based computer aided system for automatic class ...

随机推荐

  1. php boolean

    要明确地将一个值转换成 boolean,用 (bool) 或者 (boolean) 来强制转换 var_dump((); // true 当转换为 boolean 时,以下值被认为是 FALSE: 1 ...

  2. web开发必备的浏览器常识

    浏览器内核: 1.使用Trident内核的浏览器:IE.Maxthon.TT.The World等: 2.使用Gecko内核的浏览器:Netcape6及以上版本.FireFox.MozillaSuit ...

  3. [转]java.util.Date和java.sql.Date转换

    Date 的类型转换:首先记住java.util.Date 为 java.sql.Date的父类 1.将java.util.Date 转换为 java.sql.Date java.lang.Class ...

  4. vim less vi 不显示富文本 ESC

    如图: 使用 less -r xxx.log 即可显示如下

  5. org.hibernate.QueryException: JPA-style positional param was not an integral ordinal; nested exception is java.lang.IllegalArgumentException: org.hibernate.QueryException: JPA-style positional param w

    org.hibernate.QueryException: JPA-style positional param was not an integral ordinal; nested excepti ...

  6. Mac下打开多个eclipse

    命令行执行: open -n /Eclipse所在路径/Eclipse.app  

  7. SoapUI Pro Project Solution Collection-access the soapui object

    Technorati 标签: Soapui pro,web service,apI Testing

  8. 【java】解析java网络

    目录结构: contents structure [+] 模拟Post与Get请求 设置Authorization头信息 基于TCP的网络编程 TCP协议简介 半关闭的Socket TCP长链接 TC ...

  9. Android UI系列-----RelativeLayout的相关属性

    本篇随笔将主要记录一些RelatieLayout的相关属性,并将猜拳游戏通过RelativeLayout实现出来 RelativeLayout的几组属性 第一组属性:android:layout_be ...

  10. 详解shell编程中2>&1用法

    在使用 linux 命令或者 shell 编程时,这个用法常会遇到 2>&1 下面看一个命令示例,然后分析下他是如何工作的: ls foo > /dev/null 2>&am ...