Implementing document retrieval with the vector space model (VSM)

Structure of a query (topic) in the XML file:

<topic>

<number>CIRB010TopicZH006</number>

<title>科索沃難民潮</title>

<question>

查詢科索沃戰爭中的難民潮情況，以及國際間對其采取的援助。

</question>

<narrative>

相關文件內容包含科省難民湧入的地點、人數。受安置的狀況，難民潮引發的問題，参與救援之國家與國際組織，其援助策略與行動內容之報導。

</narrative>

<concepts>

科省、柯省、科索沃、柯索伏、難民、難民潮、難民營、援助、收容、救援、醫療、人道、避難、馬其頓、土耳其、外交部、國際、聯合國、紅十字會、阿爾巴尼亞裔難民。

</concepts>

</topic>
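
The fields of such a topic can be read with xml.dom.minidom, the same module the script below uses. This is only a minimal sketch: the root element name `xml`, the helper `read_topics`, and the shortened `SAMPLE` string are assumptions made for illustration, and it expects every field to contain plain text only.

```python
# -*- coding: utf-8 -*-
import xml.dom.minidom

# Hypothetical, shortened sample: one topic wrapped in an assumed root element.
SAMPLE = u"""<xml>
<topic>
<number>CIRB010TopicZH006</number>
<title>科索沃難民潮</title>
<question>
查詢科索沃戰爭中的難民潮情況，以及國際間對其采取的援助。
</question>
</topic>
</xml>"""


def read_topics(xml_text):
    """Return one dict per <topic>, mapping each child tag name to its stripped text."""
    doc = xml.dom.minidom.parseString(xml_text.encode("utf-8"))
    topics = []
    for topic in doc.documentElement.getElementsByTagName("topic"):
        fields = {}
        for node in topic.childNodes:
            # Skip the whitespace text nodes that sit between the child elements.
            if node.nodeType == node.ELEMENT_NODE and node.firstChild is not None:
                fields[node.tagName] = node.firstChild.data.strip()
        topics.append(fields)
    return topics


topics = read_topics(SAMPLE)
```

Stripping each field matters because the text inside tags such as `<question>` is surrounded by line breaks.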

What the document list (file-list) looks like:

CIRB010/cdn/loc/CDN_LOC_0001457

CIRB010/cdn/loc/CDN_LOC_0000294

CIRB010/cdn/loc/CDN_LOC_0000120

CIRB010/cdn/loc/CDN_LOC_0000661

CIRB010/cdn/loc/CDN_LOC_0001347

CIRB010/cdn/loc/CDN_LOC_0000439

What the vocabulary (vocab.all) looks like; for Chinese, each line holds a single character:

utf8

Copper

version

EGCG

432Kbps

RESERVECHARDONNAY

TommyHolloway

platts

Celeron266MHz

VOLKSWAGEN

INDEX

SmarTone

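Format of the inverted index (inverted-file)

Header line: vocabulary line number of term 1, vocabulary line number of term 2 (-1 means a single term, in which case only term 1 is considered), number of documents

Posting line: the document's line number in the file list, then the number of times the term occurs in it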
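
The header-plus-postings layout can be walked with a small generator. The sketch below is illustrative only: `parse_inverted_file` and the `SAMPLE` lines are made up, and it assumes whitespace-separated integer fields exactly as described above.

```python
def parse_inverted_file(lines):
    """Yield (term_id, doc_count, postings) for single-term entries only.

    postings is a list of (doc_id, term_count) pairs.
    """
    i = 0
    while i < len(lines):
        term1, term2, doc_count = lines[i].split()
        doc_count = int(doc_count)
        if term2 == "-1":  # single term; two-term entries are skipped
            postings = [
                tuple(int(x) for x in lines[i + 1 + j].split())
                for j in range(doc_count)
            ]
            yield int(term1), doc_count, postings
        i += 1 + doc_count  # jump past the header and its posting lines


# Made-up sample: two single-term entries and one two-term entry in between.
SAMPLE = """1 -1 2
0 3
4 1
1 2 1
4 1
2 -1 1
3 5""".splitlines()

entries = list(parse_inverted_file(SAMPLE))
```

Advancing by `1 + doc_count` after every header skips the posting lines of two-term entries in one step instead of re-examining them line by line.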

The implementation below considers single characters only:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

import getopt

from xml.dom.minidom import parse

import xml.dom.minidom

import scipy.sparse as sp

from numpy import *

from math import log

from sklearn.preprocessing import normalize

#deal with the argv

def main(argv):

	ifFeedback=False

	try:

		opts,args=getopt.getopt(argv,'ri:o:m:d:',[])

	except getopt.GetoptError:

		# invalid options: report and exit, because opts is undefined past this point

		print('wrong input')

		sys.exit(2)

	for opt,arg in opts:

		if opt=='-r' and ifFeedback==False:

			ifFeedback=True

		elif opt=='-i':

			queryFile=arg

		elif opt=='-o':

			rankedList=arg

		elif opt=='-m':

			modelDir=arg

		elif opt=='-d':

			NTCIRDir=arg

		else:

			pass

	return ifFeedback,queryFile,rankedList,modelDir,NTCIRDir

#if __name__=='__main__' :

#get the path in the arguments

ifFeedback,queryFile,rankedList,modelDir,NTCIRDir=main(sys.argv[1:])

#print ifFeedback,queryFile,rankedList,modelDir,NTCIRDir

#get the file path in the model-dir

vocab=modelDir+'/vocab.all'

fileList=modelDir+'/file-list'

invList=modelDir+'/inverted-file'

#read

pf=open(vocab,'r')

vocab=pf.read()

pf.close()

pf=open(fileList,'r')

fileList=pf.read()

pf.close()

pf=open(invList,'r')

invList=pf.read()

pf.close()

#splitlines

vocab=vocab.splitlines()

fileList=fileList.splitlines()

invList=invList.splitlines()

# vocab dict

vocabDict={}

k=0

while k <len(vocab):

	vocabDict[vocab[k]]=k

	k+=1

#get the TF and IDF matrix

#dimension:

#tfMatrix=sp.csr_matrix(len(fileList),len(vocab))

IDFVector=zeros(len(vocab))

totalDocs=len(fileList)

count=0

tempMatrix=zeros((len(fileList),len(vocab)))

while count<len(invList):

	postings=invList[count]

	post=postings.split(' ')

	k=1

	#just deal with the single word

	if(len(post)>2 and post[1]=='-1'):

		IDFVector[int(post[0])]=int(post[2])

		while k<=int(post[2]):

			line=invList[count+k].split(' ')

			tempMatrix[int(line[0])][int(post[0])]=int(line[1])

			k+=1

	count+=k

tfMatrix=sp.csr_matrix(tempMatrix)

#BM25

doclens=tfMatrix.sum(1)

avglen=doclens.mean()

k=7

b=0.7

#BM25 term weight: tf*(k+1)/(tf+k*(1-b+b*doclen/avglen))

tp1=tfMatrix*(k+1)

tp2=k*(1-b+b*doclens/avglen)

tfMatrix.data+=array(tp2[tfMatrix.tocoo().row]).reshape(len(tfMatrix.data))

tfMatrix.data=tp1.data/tfMatrix.data

#calculate the idf

k=0

while k<len(vocab):

	if IDFVector[k]!=0:

		IDFVector[k]=log(float(totalDocs)/IDFVector[k])

	k+=1

#tf-idf

tfMatrix.data*=IDFVector[tfMatrix.indices]

#row normalization for tf-idf matrix

normalize(tfMatrix,norm='l2',axis=1,copy=False)

#deal with the query

doc=xml.dom.minidom.parse(queryFile)

root=doc.documentElement

topics=root.getElementsByTagName('topic')

rankList=''

for topic in topics:

	#query vector

	qVector=zeros(len(vocab))

	number=topic.getElementsByTagName('number')[0].childNodes[0].data

	title=topic.getElementsByTagName('title')[0].childNodes[0].data

	question=topic.getElementsByTagName('question')[0].childNodes[0].data

	narrative=topic.getElementsByTagName('narrative')[0].childNodes[0].data

	concepts=topic.getElementsByTagName('concepts')[0].childNodes[0].data

	narrative+=question+concepts

	for w in narrative:

		if w.encode('utf8') in vocabDict:

			qVector[vocabDict[w.encode('utf8')]]+=1

	for w in title:

		if w.encode('utf8') in vocabDict:

			qVector[vocabDict[w.encode('utf8')]]+=1

	#L2-normalize the query vector (normalize expects a 2-D array)

	qVector=normalize(qVector.reshape(1,-1),norm='l2',axis=1)[0]

	#similarity compute:

	#a sparse matrix

	sim=tfMatrix*(sp.csr_matrix(qVector).transpose())

	#flatten to 1-D so that each score is a plain number

	sim=sim.toarray().ravel()

	k=0

	simCount=[]

	while k<len(fileList):

		tup=(sim[k],k)

		simCount.append(tup)

		k+=1

	#sort

	simCount.sort(reverse=True)

	simCount=simCount[:100]

	if ifFeedback:

		topk=[]

		for score,k in simCount[:20]:

			topk.append(k)

		d=tfMatrix[topk,:].sum(0)/20

		qVector+=array(0.8*d).reshape(len(qVector))

	#re-normalize the (possibly expanded) query vector

	qVector=normalize(qVector.reshape(1,-1),norm='l2',axis=1)[0]

	#similarity compute:

	#a sparse matrix

	sim=tfMatrix*(sp.csr_matrix(qVector).transpose())

	#flatten to 1-D so that each score is a plain number

	sim=sim.toarray().ravel()

	k=0

	simCount=[]

	while k<len(fileList):

		tup=(sim[k],k)

		simCount.append(tup)

		k+=1

	#sort

	simCount.sort(reverse=True)

	simCount=simCount[:100]

	#topic id: the part of the topic number after 'ZH'

	num=number.split('ZH')

	num=num[1]

	for sim in simCount:

		name=fileList[sim[1]]

		name=name.split('/')

		name=name[3].lower()

		rank=num+' '+name

		rankList+=rank+'\n'

pf=open(rankedList,'w')

pf.write(rankList)

pf.close()
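
To make the weighting and ranking steps easier to follow, here is a condensed sketch of the same pipeline on a toy count matrix. It is not the script above: it assumes Python 3 and dense NumPy arrays instead of scipy.sparse, and the names `bm25_tfidf`, `l2_rows`, `rank`, and `feedback`, as well as the toy data, are invented for illustration. The defaults k=7, b=0.7, and the feedback weight 0.8 are taken from the script.

```python
import numpy as np


def l2_rows(m):
    """Scale every row to unit L2 length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms


def bm25_tfidf(counts, k=7.0, b=0.7):
    """Turn a docs-by-terms count matrix into L2-normalized BM25 TF-IDF rows."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_len = counts.sum(axis=1, keepdims=True)
    # Length normalization term: k * (1 - b + b * doclen / avglen)
    length_norm = k * (1.0 - b + b * doc_len / doc_len.mean())
    tf = counts * (k + 1.0) / (counts + length_norm)
    # idf = log(N / df), left at 0 for terms that occur in no document
    df = (counts > 0).sum(axis=0)
    idf = np.zeros(counts.shape[1])
    idf[df > 0] = np.log(n_docs / df[df > 0])
    return l2_rows(tf * idf)


def rank(doc_matrix, query, top_n=100):
    """Return (document indices by descending cosine score, all scores)."""
    q = l2_rows(np.asarray(query, dtype=float).reshape(1, -1))[0]
    scores = doc_matrix @ q
    order = np.argsort(-scores, kind="stable")[:top_n]
    return order, scores


def feedback(doc_matrix, query, top_docs, beta=0.8):
    """Pseudo-relevance feedback: add beta times the mean of the top documents."""
    return np.asarray(query, dtype=float) + beta * doc_matrix[top_docs].mean(axis=0)


# Toy data: 3 documents, 3 terms; the query contains only term 0.
counts = [[3, 0, 0],
          [0, 2, 1],
          [1, 1, 0]]
docs = bm25_tfidf(counts)
order, scores = rank(docs, [1, 0, 0])
expanded = feedback(docs, [1, 0, 0], order[:2])
order2, scores2 = rank(docs, expanded)
```

In this toy example, document 1 shares no term with the original query and scores zero on the first pass. After feedback, the expanded query picks up term 1 from the top-ranked documents, so document 1 receives a positive score.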
