[IR] Extraction-based Text Summarization
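Automatic Text Summarization: Reading Notes
- Extractive: key sentences are selected from the source text and assembled into a summary. [The most mainstream, most widely used, and easiest approach]
- Abstractive: the computer has to understand the source text and then express its content in its own words. [Feels closer to real artificial intelligence]
Single-Document Extractive Summarization
(1) Bag of Words. The bag-of-words model treats each word as one dimension, so a sentence becomes a high-dimensional sparse vector in the space spanned by all words.
(4) Word Embedding. Word2Vec, proposed by Tomas Mikolov, uses a number of tricks and approximations to represent a word as a low-dimensional dense vector, which gives good results in many cases. Once words are vectors, a sentence can also be turned into a vector in many ways.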
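As a minimal sketch of the two representations above, the following builds sparse count vectors over a shared vocabulary and dense sentence vectors by averaging word embeddings. The function names are hypothetical, the 2-dimensional embeddings are invented toy values rather than real Word2Vec output, and averaging is only one of many possible ways to obtain a sentence vector.

```python
import math
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Count vector of `sentence` over a fixed, ordered vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def average_embedding(sentence, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sentences = ["the cat sat on the mat", "the dog sat on the log"]
vocabulary = sorted({w for s in sentences for w in s.split()})
bow = [bag_of_words(s, vocabulary) for s in sentences]

# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "cat": [1.0, 0.0], "dog": [0.8, 0.2],
    "mat": [0.0, 1.0], "log": [0.1, 0.9],
}
dense = [average_embedding(s, toy_embeddings) for s in sentences]

print(vocabulary)
print(bow, cosine(bow[0], bow[1]))
print(dense, cosine(dense[0], dense[1]))
```

With a real vocabulary the count vectors have one dimension per distinct word and are mostly zeros, while the dense vectors keep the fixed dimensionality of the embedding.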
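a * score(i) - (1 - a) * similarity(i, i-1), i = 2, 3, ..., N
Here a weights relevance against redundancy; the similarity to the previously ranked sentence acts as a penalty, so it is subtracted.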
This algorithm is known as MMR (Maximal Marginal Relevance).
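The re-scoring step can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular library: the similarity term is subtracted as a redundancy penalty (as in the standard MMR formulation), the top-ranked sentence keeps its original score because the formula only covers i = 2..N, and `word_overlap` (Jaccard similarity of word sets) is a stand-in for whatever sentence similarity is actually used.

```python
def word_overlap(a, b):
    """Jaccard similarity between the word sets of two sentences (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def rescore(ranked, a=0.7, similarity=word_overlap):
    """Re-score a list of (sentence, score) pairs sorted best first.

    Each sentence from the second one on is penalized by its similarity
    to the sentence ranked immediately before it.
    """
    rescored = list(ranked[:1])  # first sentence: score left unchanged (assumption)
    for i in range(1, len(ranked)):
        sentence, score = ranked[i]
        penalty = similarity(sentence, ranked[i - 1][0])
        rescored.append((sentence, a * score - (1 - a) * penalty))
    return rescored


ranked = [
    ("the cat sat on the mat", 0.9),
    ("the cat sat on the mat today", 0.8),  # near duplicate of the first
    ("stocks fell sharply", 0.7),           # novel content
]
result = rescore(ranked, a=0.5)
for sentence, score in result:
    print(round(score, 4), sentence)
```

In this toy input the near-duplicate second sentence drops below the third one after re-scoring, which is the novelty effect the penalty is meant to produce.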
[1] TextRank source code reading notes
[2] TextTeaser source code reading notes
For the abstractive approach, see the original article: http://blog.csdn.net/lu839684437/article/details/71600410
from summarizer import Summarizer


def getInput():
    with open('input.txt') as file:
        content = file.readlines()
    # remove unnecessary \n
    content = [c.replace('\n', '') for c in content if c != '\n']
    title = content[0]   # first non-empty line is the title
    text = content[1:]   # the remaining lines are the body
    return {'title': title, 'text': ' '.join(text)}


# #####################
article = getInput()
# drop non-ASCII characters (Python 3 form of the original str.decode call)
article['text'] = article['text'].encode('ascii', 'ignore').decode('ascii')
article['text'] = ' '.join(article['text'].replace('\n', ' ').split())

summarizer = Summarizer()
result = summarizer.summarize(article['text'], article['title'], 'Undefined', 'Undefined')
result = summarizer.sortScore(result)
# keep the 30 best sentences, then restore document order
result = summarizer.sortSentences(result[:30])

print('Summary:')
for r in result:
    print(r['sentence'])
    # print(r['totalScore'])
    # print(r['order'])
# -*- coding: utf-8 -*-
from parser import Parser


class Summarizer:
    def __init__(self):
        self.parser = Parser()

    def summarize(self, text, title, source, category):
        sentences = self.parser.splitSentences(text)
        titleWords = self.parser.removePunctations(title)
        # split the punctuation-free title (the original split the raw title,
        # discarding the result of removePunctations)
        titleWords = self.parser.splitWords(titleWords)
        (keywords, wordCount) = self.parser.getKeywords(text)

        topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)
        result = self.computeScore(sentences, titleWords, topKeywords)
        result = self.sortScore(result)

        return result

    def getTopKeywords(self, keywords, wordCount, source, category):
        # Add getting top keywords in the database here
        for keyword in keywords:
            articleScore = 1.0 * keyword['count'] / wordCount
            keyword['totalScore'] = articleScore * 1.5

        return keywords

    def sortScore(self, dictList):
        # highest score first
        return sorted(dictList, key=lambda x: -x['totalScore'])

    def sortSentences(self, dictList):
        # original document order
        return sorted(dictList, key=lambda x: x['order'])

    def computeScore(self, sentences, titleWords, topKeywords):
        keywordList = [keyword['word'] for keyword in topKeywords]
        summaries = []

        for i, sentence in enumerate(sentences):
            sent = self.parser.removePunctations(sentence)
            words = self.parser.splitWords(sent)

            sbsFeature = self.sbs(words, topKeywords, keywordList)
            dbsFeature = self.dbs(words, topKeywords, keywordList)

            titleFeature = self.parser.getTitleScore(titleWords, words)
            sentenceLength = self.parser.getSentenceLengthScore(words)
            sentencePosition = self.parser.getSentencePositionScore(i, len(sentences))
            keywordFrequency = (sbsFeature + dbsFeature) / 2.0 * 10.0
            totalScore = (titleFeature * 1.5 + keywordFrequency * 2.0 + sentenceLength * 0.5 + sentencePosition * 1.0) / 4.0

            summaries.append({
                # 'titleFeature': titleFeature,
                # 'sentenceLength': sentenceLength,
                # 'sentencePosition': sentencePosition,
                # 'keywordFrequency': keywordFrequency,
                'totalScore': totalScore,
                'sentence': sentence,
                'order': i
            })

        return summaries

    def sbs(self, words, topKeywords, keywordList):
        # sum of the scores of the keywords in the sentence, divided by its length
        score = 0.0

        if len(words) == 0:
            return 0

        for word in words:
            word = word.lower()
            index = -1

            if word in keywordList:
                index = keywordList.index(word)

            if index > -1:
                score += topKeywords[index]['totalScore']

        return 1.0 / abs(len(words)) * score

    def dbs(self, words, topKeywords, keywordList):
        # rewards consecutive keywords that occur close to each other
        k = len(list(set(words) & set(keywordList))) + 1
        summ = 0.0
        firstWord = {}
        secondWord = {}

        for i, word in enumerate(words):
            if word in keywordList:
                index = keywordList.index(word)

                if firstWord == {}:
                    firstWord = {'i': i, 'score': topKeywords[index]['totalScore']}
                else:
                    secondWord = firstWord
                    firstWord = {'i': i, 'score': topKeywords[index]['totalScore']}
                    distance = firstWord['i'] - secondWord['i']

                    summ += (firstWord['score'] * secondWord['score']) / (distance ** 2)

        # NOTE: by operator precedence this evaluates as (1.0 / k) * (k + 1.0);
        # if 1 / (k * (k + 1)) was intended, extra parentheses are needed.
        return (1.0 / k * (k + 1.0)) * summ
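To make the keyword features concrete, here is a small standalone sketch that mirrors the arithmetic of sbs, dbs and the weighted totalScore above on a toy sentence. It is a simplified re-implementation for illustration: the snake_case function names, the dict of keyword scores and all input numbers are assumptions, and the `(1.0 / k * (k + 1.0))` factor is copied as written in the source.

```python
def sbs(words, keyword_scores):
    """Sum of keyword scores in the sentence, divided by sentence length."""
    if not words:
        return 0.0
    return sum(keyword_scores.get(w.lower(), 0.0) for w in words) / len(words)


def dbs(words, keyword_scores):
    """Sum over consecutive keyword pairs of score product / squared distance."""
    k = len(set(words) & set(keyword_scores)) + 1
    total = 0.0
    previous = None  # (position, score) of the last keyword seen
    for i, word in enumerate(words):
        if word in keyword_scores:
            current = (i, keyword_scores[word])
            if previous is not None:
                distance = current[0] - previous[0]
                total += current[1] * previous[1] / distance ** 2
            previous = current
    return (1.0 / k * (k + 1.0)) * total  # factor kept exactly as in the source


def total_score(title, keyword_frequency, length, position):
    """Weighted combination used in computeScore."""
    return (title * 1.5 + keyword_frequency * 2.0 + length * 0.5 + position * 1.0) / 4.0


# Hypothetical keyword scores and sentence, for illustration only.
keyword_scores = {"summary": 0.3, "text": 0.15}
words = "automatic text summary of a long text".split()

s = sbs(words, keyword_scores)
d = dbs(words, keyword_scores)
keyword_frequency = (s + d) / 2.0 * 10.0
print(s, d, keyword_frequency)
print(total_score(0.5, keyword_frequency, 0.35, 0.17))
```

The adjacent pair "text summary" contributes far more to dbs than the distant pair "summary ... text", because each contribution is divided by the squared distance between the two keywords.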
parser.py: we should implement our own parser for the law case database.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import nltk.data
import os


class Parser:
    def __init__(self):
        self.ideal = 20.0  # ideal sentence length, in words
        self.stopWords = self.getStopWords()

    def getKeywords(self, text):
        text = self.removePunctations(text)
        words = self.splitWords(text)
        words = self.removeStopWords(words)
        uniqueWords = list(set(words))

        keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords]
        keywords = sorted(keywords, key=lambda x: -x['count'])

        return (keywords, len(words))

    def getSentenceLengthScore(self, sentence):
        # 1.0 at the ideal length, decreasing linearly on both sides
        return (self.ideal - abs(self.ideal - len(sentence))) / self.ideal

    # Jagadeesh, J., Pingali, P., & Varma, V. (2005). Sentence Extraction Based
    # Single Document Summarization. International Institute of Information
    # Technology, Hyderabad, India, 5.
    def getSentencePositionScore(self, i, sentenceCount):
        normalized = i / (sentenceCount * 1.0)

        if normalized > 0 and normalized <= 0.1:
            return 0.17
        elif normalized > 0.1 and normalized <= 0.2:
            return 0.23
        elif normalized > 0.2 and normalized <= 0.3:
            return 0.14
        elif normalized > 0.3 and normalized <= 0.4:
            return 0.08
        elif normalized > 0.4 and normalized <= 0.5:
            return 0.05
        elif normalized > 0.5 and normalized <= 0.6:
            return 0.04
        elif normalized > 0.6 and normalized <= 0.7:
            return 0.06
        elif normalized > 0.7 and normalized <= 0.8:
            return 0.04
        elif normalized > 0.8 and normalized <= 0.9:
            return 0.04
        elif normalized > 0.9 and normalized <= 1.0:
            return 0.15

        return 0

    def getTitleScore(self, title, sentence):
        titleWords = self.removeStopWords(title)
        sentenceWords = self.removeStopWords(sentence)
        matchedWords = [word for word in sentenceWords if word in titleWords]

        return len(matchedWords) / (len(title) * 1.0)

    def splitSentences(self, text):
        # load the sentence tokenizer pickle shipped next to this file
        # (the Python 2 .decode('utf-8') on the path is not needed in Python 3)
        tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file__)) + '/trainer/english.pickle')

        return tokenizer.tokenize(text)

    def splitWords(self, sentence):
        return sentence.lower().split()

    def removePunctations(self, text):
        # keep only alphanumeric characters and spaces
        return ''.join(t for t in text if t.isalnum() or t == ' ')

    def removeStopWords(self, words):
        return [word for word in words if word not in self.stopWords]

    def getStopWords(self):
        with open(os.path.dirname(os.path.abspath(__file__)) + '/trainer/stopWords.txt') as file:
            words = file.readlines()

        return [word.replace('\n', '') for word in words]
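The length and position features of parser.py can be tried in isolation. The sketch below re-implements getSentenceLengthScore and getSentencePositionScore as standalone functions; the snake_case names and the bucket table are my own layout of the same thresholds, not part of the original source.

```python
IDEAL_LENGTH = 20.0  # same value as Parser.ideal

# (upper bound of normalized position, score), same buckets as the elif chain
POSITION_BUCKETS = [
    (0.1, 0.17), (0.2, 0.23), (0.3, 0.14), (0.4, 0.08), (0.5, 0.05),
    (0.6, 0.04), (0.7, 0.06), (0.8, 0.04), (0.9, 0.04), (1.0, 0.15),
]


def sentence_length_score(words):
    """1.0 at the ideal length, decreasing linearly as the length moves away."""
    return (IDEAL_LENGTH - abs(IDEAL_LENGTH - len(words))) / IDEAL_LENGTH


def sentence_position_score(i, sentence_count):
    """Look up the score of the bucket containing i / sentence_count."""
    normalized = i / float(sentence_count)
    if normalized <= 0:
        return 0  # i = 0 lies outside every bucket, as in the original chain
    for upper, score in POSITION_BUCKETS:
        if normalized <= upper:
            return score
    return 0


print(sentence_length_score(["w"] * 20))  # ideal length
print(sentence_length_score(["w"] * 10))  # half the ideal length
print(sentence_length_score(["w"] * 45))  # very long sentences go negative
print([sentence_position_score(i, 20) for i in (0, 1, 5, 19)])
```

Two properties are worth noticing when adapting the parser: sentences longer than twice the ideal length receive a negative length score, and the very first sentence (i = 0) receives a position score of 0 because the first bucket requires a normalized position strictly greater than 0.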