Python基础:一起来面向对象 (二) 之搜索引擎
实例 搜索引擎
一个搜索引擎由搜索器、索引器、检索器和用户接口四个部分组成
搜索器就是爬虫(scrawler),爬出的内容送给索引器生成索引(Index)存储在内部数据库。用户通过用户接口发出询问(query),询问解析后送达检索器,检索器高效检索后,将结果返回给用户。
以下5个文件为爬取的搜索样本。
# # 1.txt
# I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today. # # 2.txt
# I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today. # # 3.txt
# I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together. # # 4.txt
# This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . . # # 5.txt
# And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
简单搜索引擎
class SearchEngineBase(object):
def __init__(self):
pass
#将文件作为id,与内容一起送到process_corpus
def add_corpus(self, file_path):
with open(file_path, 'r') as fin:
text = fin.read()
self.process_corpus(file_path, text)
#索引器 将文件路径作为id,将处理的内容作为索引存储
def process_corpus(self, id, text):
raise Exception('process_corpus not implemented.')
#检索器
def search(self, query):
raise Exception('search not implemented.') #多态
def main(search_engine):
for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
search_engine.add_corpus(file_path) while True:
query = input()
results = search_engine.search(query)
print('found {} result(s):'.format(len(results)))
for result in results:
print(result) class SimpleEngine(SearchEngineBase):
def __init__(self):
super(SimpleEngine, self).__init__()
self.__id_to_texts = dict() def process_corpus(self, id, text):
self.__id_to_texts[id] = text def search(self, query):
results = []
for id, text in self.__id_to_texts.items():
if query in text:
results.append(id)
return results search_engine = SimpleEngine()
main(search_engine) ########## 输出 ##########
# simple
# found 0 result(s):
# whe
# found 2 result(s):
# 1.txt
# 5.txt
词袋模型 (bag of words model)
import re
class BOWEngine(SearchEngineBase):
def __init__(self):
super(BOWEngine, self).__init__()
self.__id_to_words = dict() def process_corpus(self, id, text):
self.__id_to_words[id] = self.parse_text_to_words(text) def search(self, query):
query_words = self.parse_text_to_words(query)
results = []
for id, words in self.__id_to_words.items():
if self.query_match(query_words, words):
results.append(id)
return results @staticmethod
def query_match(query_words, words):
for query_word in query_words:
if query_word not in words:
return False
return True
#for query_word in query_words:
# return False if query_word not in words else True #result = filter(lambda x:x not in words,query_words)
#return False if (len(list(result)) > 0) else True @staticmethod
def parse_text_to_words(text):
# 使用正则表达式去除标点符号和换行符
text = re.sub(r'[^\w ]', ' ', text)
# 转为小写
text = text.lower()
# 生成所有单词的列表
word_list = text.split(' ')
# 去除空白单词
word_list = filter(None, word_list)
# 返回单词的 set
return set(word_list) search_engine = BOWEngine()
main(search_engine) ########## 输出 ##########
# i have a dream
# found 3 result(s):
# 1.txt
# 2.txt
# 3.txt
# freedom children
# found 1 result(s):
# 5.txt
缺点:每次search还是需要遍历所有文档
Inverted index 倒序索引
import re
class BOWInvertedIndexEngine(SearchEngineBase):
def __init__(self):
super(BOWInvertedIndexEngine, self).__init__()
self.inverted_index = dict() #生成索引 word -> id
def process_corpus(self, id, text):
words = self.parse_text_to_words(text)
for word in words:
if word not in self.inverted_index:
self.inverted_index[word] = []
self.inverted_index[word].append(id) #{'little':['1.txt','2.txt'],...} def search(self, query):
query_words = list(self.parse_text_to_words(query))
query_words_index = list()
for query_word in query_words:
query_words_index.append(0) # 如果某一个查询单词的倒序索引为空,我们就立刻返回
for query_word in query_words:
if query_word not in self.inverted_index:
return [] result = []
while True: # 首先,获得当前状态下所有倒序索引的 index
current_ids = [] for idx, query_word in enumerate(query_words):
current_index = query_words_index[idx]
current_inverted_list = self.inverted_index[query_word] #['1.txt','2.txt'] # 已经遍历到了某一个倒序索引的末尾,结束 search
if current_index >= len(current_inverted_list):
return result
current_ids.append(current_inverted_list[current_index]) # 然后,如果 current_ids 的所有元素都一样,那么表明这个单词在这个元素对应的文档中都出现了
if all(x == current_ids[0] for x in current_ids):
result.append(current_ids[0])
query_words_index = [x + 1 for x in query_words_index]
continue # 如果不是,我们就把最小的元素加一
min_val = min(current_ids)
min_val_pos = current_ids.index(min_val)
query_words_index[min_val_pos] += 1 @staticmethod
def parse_text_to_words(text):
# 使用正则表达式去除标点符号和换行符
text = re.sub(r'[^\w ]', ' ', text)
# 转为小写
text = text.lower()
# 生成所有单词的列表
word_list = text.split(' ')
# 去除空白单词
word_list = filter(None, word_list)
# 返回单词的 set
return set(word_list) search_engine = BOWInvertedIndexEngine()
main(search_engine) ########## 输出 ########## # little
# found 2 result(s):
# 1.txt
# 2.txt
# little vicious
# found 1 result(s):
# 2.txt
LRUCache
import pylru
class LRUCache(object):
def __init__(self, size=32):
self.cache = pylru.lrucache(size) def has(self, key):
return key in self.cache def get(self, key):
return self.cache[key] def set(self, key, value):
self.cache[key] = value class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
def __init__(self):
super(BOWInvertedIndexEngineWithCache, self).__init__()
LRUCache.__init__(self) def search(self, query):
if self.has(query):
print('cache hit!')
return self.get(query) result = super(BOWInvertedIndexEngineWithCache, self).search(query)
self.set(query, result) return result search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine) ########## 输出 ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little
# cache hit!
# found 2 result(s):
# 1.txt
# 2.txt
注意BOWInvertedIndexEngineWithCache继承了两个类。
Python基础:一起来面向对象 (二) 之搜索引擎的更多相关文章
- Python基础学习笔记(二)变量类型
参考资料: 1. <Python基础教程> 2. http://www.runoob.com/python/python-chinese-encoding.html 3. http://w ...
- (Python基础教程之十二)Python读写CSV文件
Python基础教程 在SublimeEditor中配置Python环境 Python代码中添加注释 Python中的变量的使用 Python中的数据类型 Python中的关键字 Python字符串操 ...
- python基础整理4——面向对象装饰器惰性器及高级模块
面向对象编程 面向过程:根据业务逻辑从上到下写代码 面向对象:将数据与函数绑定到一起,进行封装,这样能够更快速的开发程序,减少了重复代码的重写过程 面向对象编程(Object Oriented Pro ...
- python基础-9.1 面向对象进阶 super 类对象成员 类属性 私有属性 查找源码类对象步骤 类特殊成员 isinstance issubclass 异常处理
上一篇文章介绍了面向对象基本知识: 面向对象是一种编程方式,此编程方式的实现是基于对 类 和 对象 的使用 类 是一个模板,模板中包装了多个“函数”供使用(可以讲多函数中公用的变量封装到对象中) 对象 ...
- python基础篇_006_面向对象
面向对象 1.初识类: # 定义一个函数,我们使用关键字 def """ def 函数名(参数): '''函数说明''' 函数体 return 返回值 "&qu ...
- python基础学习 Day19 面向对象的三大特性之多态、封装 property的用法(1)
一.课前内容回顾 继承作用:提高代码的重用性(要继承父类的子类都实现相同的方法:抽象类.接口) 继承解释:当你开始编写两个类的时候,出现了重复的代码,通过继承来简化代码,把重复的代码放在父类中. 单继 ...
- python基础学习 Day19 面向对象的三大特性之多态、封装
一.课前内容回顾 继承作用:提高代码的重用性(要继承父类的子类都实现相同的方法:抽象类.接口) 继承解释:当你开始编写两个类的时候,出现了重复的代码,通过继承来简化代码,把重复的代码放在父类中. 单继 ...
- python基础学习Day17 面向对象的三大特性之继承、类与对象名称空间小试
一.课前回顾 类:具有相同属性和方法的一类事物 实例化:类名() 过程: 开辟了一块内存空间 执行init方法 封装属性 自动的把self返回给实例化对象的地方 对象:实例 一个实实在在存在的实体 组 ...
- python基础笔记之面向对象
# class Foo:# name="kevin"## def __init__(self,puppy):# self.tomato= 'red'# self.dog = pup ...
随机推荐
- 键值对集合Dictionary<K,V>根据索引提取数据
Dictionary<K,V>中ToList方法返回 List<KeyValuePair<K,V>>定义可设置检索的键/值对
- Pascal Hexagrammum Mysticum 的深度探索
PASCAL . Hexagrammum Mysticum . (六角迷魂图) . 的深度探索 . 英中对比.英文蓝色,译文黑色,译者补充说明用紫红色 (已校完,但尚未定稿,想再整理并补充内容 ...
- 【linux驱动分析】之dm9000驱动分析(三):sk_buff结构分析
[linux驱动分析]之dm9000驱动分析(一):dm9000原理及硬件分析 [linux驱动分析]之dm9000驱动分析(二):定义在板文件里的资源和设备以及几个宏 [linux驱动分析]之dm9 ...
- 标准C头文件
ISO C标准定义的头文件: POSIX标准定义的必须的头文件: POSIX标准定义的XSI可选头文件: POSIX标准定义的可选头文件:
- 移动Web开发实践
移动设备的高速发展给用户带来了非常大的便利.用户使用Android.iPhone和其他移动设备非常easy接入互联网. 移动设备对网页的性能要求比較高.以下就说说Mobile Web开发的一些心得. ...
- grunt简单教程
Grunt简单教程 1.grunt简单介绍 Grunt是一个基于任务的命令行工具.依赖于node.js环境. 它能帮你合并js文件,压缩js文件,验证js.编译less,合并css.还能够配置自己主动 ...
- oracle的shared、dedicated模式解析
主要參考文档:http://www.itpub.net/thread-1714191-1-1.html Oracleh有两种server模式shared mode和dedicated mode. De ...
- LeetCode 112 Path Sum(路径和)(BT、DP)(*)
翻译 给定一个二叉树root和一个和sum, 决定这个树是否存在一条从根到叶子的路径使得沿路全部节点的和等于给定的sum. 比如: 给定例如以下二叉树和sum=22. 5 / \ 4 8 / / \ ...
- 自己定义msi安装包的运行过程
有时候我们须要在程序中运行还有一个程序的安装.这就须要我们去自己定义msi安装包的运行过程. 比方我要做一个安装管理程序,能够依据用户的选择安装不同的子产品.当用户选择了三个产品时,假设分别显示这三个 ...
- SQL常见问题及解决备忘
1.mysql中:you cant't specify tartget table for update in from clause 错误 含义:在同一语句中update或delete某张表的时候, ...