TF-IDF: Principles and Implementation

1. Principle

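\[\text{TF-IDF}_{t,d} = tf_{t,d} \times idf_{t}\\
tf_{t,d} = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}\\
idf_{t} = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)
\]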
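To make the formula concrete, here is a small worked example in Python. The numbers are hypothetical and chosen only for illustration, and the natural logarithm is an assumption (the formula does not fix a base).

```python
import math


def tf_idf_value(term_count, doc_length, n_docs, n_docs_with_term):
    """Evaluate tf * idf for a single term and a single document."""
    tf = term_count / doc_length               # share of the document taken up by the term
    idf = math.log(n_docs / n_docs_with_term)  # natural log; the base is an assumption
    return tf * idf


# Hypothetical numbers: the term occurs 3 times in a 100-term document,
# and 10 of the 1000 documents in the collection contain it.
score = tf_idf_value(3, 100, 1000, 10)
print(round(score, 4))  # 0.03 * ln(100) -> 0.1382
```

A term that appears in every document gets idf = log(1) = 0, so its TF-IDF is 0 no matter how often it occurs in any one document.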

2. Pseudocode
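One way to outline the computation is the following minimal sketch, written as compact Python rather than formal pseudocode. It assumes the corpus is already in memory as a dict of word lists and that a term matches a word only when the two are equal. The function name `tf_idf_scores` and the sample corpus are hypothetical.

```python
import math


def tf_idf_scores(corpus, term):
    """corpus maps a document name to its list of words."""
    # Step 1: term frequency of the term in each document.
    tfs = {}
    for name, words in corpus.items():
        tfs[name] = words.count(term) / len(words) if words else 0.0
    # Step 2: inverse document frequency over the whole corpus.
    df = sum(1 for words in corpus.values() if term in words)
    idf = math.log(len(corpus) / df) if df else 0.0
    # Step 3: multiply the two and rank documents by the product.
    scores = {name: tf * idf for name, tf in tfs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-document corpus.
corpus = {
    "a.txt": ["people", "like", "music"],
    "b.txt": ["people", "and", "people"],
    "c.txt": ["no", "match", "here"],
}
ranking = tf_idf_scores(corpus, "people")
print(ranking)
```

Here "b.txt" ranks first because two of its three words are the term, while "c.txt" scores 0 because it never contains it.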

3. Implementation

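The script expects a folder named documents in the same directory as the script (the current working directory at run time); put the document collection in that folder.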
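For a quick trial, a folder with that layout can be generated by a few lines of Python. This is only a sketch: the file names and contents are hypothetical, and a temporary directory is used so nothing in the working directory is overwritten.

```python
import os
import tempfile


def make_sample_documents(base_dir):
    """Create base_dir/documents and fill it with two tiny text files."""
    doc_dir = os.path.join(base_dir, "documents")
    os.makedirs(doc_dir, exist_ok=True)
    # Hypothetical sample texts, one space-separated sentence per file.
    samples = {
        "doc1.txt": "people like music\n",
        "doc2.txt": "some people read books\n",
    }
    for name, text in samples.items():
        with open(os.path.join(doc_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    return doc_dir


base = tempfile.mkdtemp()  # pass os.getcwd() instead to create ./documents
doc_dir = make_sample_documents(base)
print(sorted(os.listdir(doc_dir)))
```

The sample texts are plain ASCII, so they read back identically under common text encodings.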

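#!/usr/bin/python
# -*- coding: utf-8 -*-
import math
import os


def set_doc():
    """Read every file in ./documents into a list of words, keyed by file name."""
    docs = dict()
    doc_dir = os.path.join(os.getcwd(), "documents")
    for d in os.listdir(doc_dir):
        docs[d] = list()
        # "ANSI" is a codec alias available on Windows only; on other systems
        # use the actual encoding of the files, for example "utf-8".
        with open(os.path.join(doc_dir, d), encoding="ANSI") as f:
            for line in f:
                # split() without arguments drops the empty strings that
                # blank lines and repeated spaces would otherwise produce.
                docs[d].extend(line.split())
    return docs


def tf(docs, keyword):
    """Return {document: matching words / total words}."""
    tfs = dict()
    for doc in docs:
        # Substring match: "people" also counts words such as "peoples".
        count = sum(1 for word in docs[doc] if keyword in word)
        # A document with no words gets a term frequency of 0.
        tfs[doc] = count / len(docs[doc]) if docs[doc] else 0
    return tfs


def idf(docs, keyword):
    """Return log(number of documents / number of documents with the keyword)."""
    doc_with_keyword = set()
    for doc in docs:
        for word in docs[doc]:
            if keyword in word:
                doc_with_keyword.add(doc)
    # Avoid dividing by zero when no document contains the keyword; every tf
    # is 0 in that case, so the product is 0 whatever value is returned here.
    if not doc_with_keyword:
        return 0.0
    return math.log(len(docs) / len(doc_with_keyword))


def tf_idf(tfs, term_idf):
    """Multiply each document's term frequency by the keyword's idf."""
    term_tf_idf = dict()
    for doc in tfs:
        term_tf_idf[doc] = tfs[doc] * term_idf
    return term_tf_idf


if __name__ == "__main__":
    keyword = "people"
    docs = set_doc()
    tfs = tf(docs, keyword)
    term_idf = idf(docs, keyword)
    term_tf_idf = tf_idf(tfs, term_idf)
    # Sort documents from the highest TF-IDF score to the lowest.
    term_tf_idf = sorted(term_tf_idf.items(), key=lambda d: d[1], reverse=True)
    print(term_tf_idf)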
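The idf formula divides by the number of documents containing the term, which is zero for a term that appears nowhere in the collection. One common workaround is add-one smoothing. The sketch below assumes that variant; the function name `smoothed_idf` is hypothetical and is not part of the script above.

```python
import math


def smoothed_idf(n_docs, n_docs_with_term):
    """idf with one added to both counts, so the ratio is always defined."""
    return math.log((1 + n_docs) / (1 + n_docs_with_term))


print(smoothed_idf(4, 0))  # term in no document: log(5 / 1), no division by zero
print(smoothed_idf(4, 4))  # term in every document: log(5 / 5) = 0.0
```

Smoothing changes the scores slightly for every term, so values computed this way are not directly comparable with those from the unsmoothed formula.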

