[Original] shadowebdict development diary: a concise English-Chinese dictionary for Linux (Part 3)

Continuing from the previous post.

Now on to the response module.

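This module's job: when the local word list does not contain the word the user looked up, fetch the matching entry from the web, return it as the result, and store it in the local database.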
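The look-up-locally-then-fall-back flow can be sketched as follows. This is only an illustration: a plain dict stands in for the local database, and `lookup` and `fake_fetch` are hypothetical names, not part of shadowebdict.

```python
def lookup(word, local_db, fetch_remote):
    """Return the entry for word, going to the network only on a local miss."""
    entry = local_db.get(word)
    if entry is None:
        # Local miss: ask the remote source (in the real project this would
        # wrap something like GetResponse.get_dict_data).
        entry = fetch_remote(word)
        if entry is not None:
            local_db[word] = entry  # store it so the next lookup stays local
    return entry


# Demo with a stand-in fetcher, so no network access is needed.
calls = []

def fake_fetch(word):
    calls.append(word)
    return {'word': word, 'explain': 'stub entry', 'times': 1}

db = {}
first = lookup('hello', db, fake_fetch)
second = lookup('hello', db, fake_fetch)  # served from db, fake_fetch not called
```

Passing the fetcher in as a parameter keeps the caching logic testable without touching the network.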

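The online source I chose is iciba, for a simple reason: it needs no complicated cookie handling, and nearly everything about the queried word is embedded in the returned HTML source.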
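Because the entry is embedded in the HTML, a couple of regular expressions are enough to pull it out. The sketch below runs on a made-up fragment shaped like the markup the module's patterns expect; `SAMPLE_HTML` is an assumption, and the real iciba page may look different.

```python
import re

# Hypothetical fragment, for illustration only.
SAMPLE_HTML = '''
<div class="group_pos">
  <strong class="fl">n.</strong>
  <span class="label_list"><label>greeting;</label><label>salutation</label></span>
</div>
'''

# re.DOTALL lets '.' cross line breaks, since the block spans several lines.
GROUP_POS = re.compile(r'<div class="group_pos">(.*?)</div>', re.DOTALL)
LABEL = re.compile(r'<label>(?P<item>.*?)</label>', re.DOTALL)

# First isolate the definition block, then collect every <label> inside it.
block = GROUP_POS.search(SAMPLE_HTML)
items = [m.group('item') for m in LABEL.finditer(block.group(1))] if block else []
```

Regex scraping like this is brittle: it breaks as soon as the site changes its markup, so the patterns need checking against the live page.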

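Note that iciba will ban you if requests come too frequently, so if you plan to use this code to crawl iciba's word list, add a sleep yourself. The test driver at the end of my code already has one; adjust it as needed.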
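One way to add that sleep is a small throttle object that enforces a minimum gap between requests. This is a sketch, not part of the module: `Throttle` is a hypothetical helper, and the 3-second interval in the usage note only mirrors the `time.sleep(3)` in the test driver, since the rate iciba actually tolerates is not stated here.

```python
import time


class Throttle(object):
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested quickly
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Called too soon: sleep off the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Typical use would be `throttle = Throttle(3)` once, then `throttle.wait()` right before every request.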

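The module's logic is:

0. Expose an interface for other modules to call; the input is the word to look up.

1. Build the URL request and fetch the returned data.

2. Parse the returned data according to its format and extract the content of the entry.

3. Return the entry's content, in the agreed format, to the calling module.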
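Judging from the source listing, the agreed format in step 3 is a list with one item per word: a dict with the keys `word`, `explain`, `net_explain`, `sentence` and `times`, or `-1` when the word was not found (a missing field inside a dict is also `-1`). A caller might normalize that as sketched below; `usable_entries` and the sample values are made up for illustration.

```python
NOT_FOUND = -1  # sentinel the listing uses for a missing word or a missing field


def usable_entries(results):
    """Drop not-found results and blank out fields marked as missing."""
    cleaned = []
    for item in results:
        if item == NOT_FOUND:
            continue  # the whole word was not found
        cleaned.append(dict(
            (key, '' if value == NOT_FOUND else value)
            for key, value in item.items()
        ))
    return cleaned


# Made-up sample in the shape described above.
sample = [
    {'word': 'hello', 'explain': 'int. hi\n', 'net_explain': -1,
     'sentence': '', 'times': 1},
    -1,
]
cleaned = usable_entries(sample)
```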

See the source code below for the details.

# -*- coding:utf-8 -*-
# Python 2 code: it relies on urllib2 and the print statement.

__author__ = 'wmydx'

import re
import time
import urllib
import urllib2

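class GetResponse:

    def __init__(self):
        self.url = 'http://www.iciba.com/'
        # Letters and whitespace only. Accepts the same strings as the earlier
        # (([a-zA-Z]*)(\s*))*$ but without nested quantifiers, which can
        # backtrack badly on long input that does not match.
        self.isEng = re.compile(r'[a-zA-Z\s]*$')
        # Blocks of the result page: definitions, web paraphrases, example sentences.
        self.group_pos = re.compile(r'<div class="group_pos">(.*?)</div>', re.DOTALL)
        self.net_paraphrase = re.compile(r'<div class="net_paraphrase">(.*?)</div>', re.DOTALL)
        self.sentence = re.compile(r'<dl class="vDef_list">(.*?)</dl>', re.DOTALL)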

    def process_input(self, word):
        # Trim the query and join multi-word queries with underscores for the URL.
        word = word.strip()
        word = word.replace(' ', '_')
        return word

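    def get_data_from_web(self, word, max_retries=3):
        # Adjacent string literals are joined by the parser, so the User-Agent
        # contains no stray indentation.
        headers = {'Referer': 'http://www.iciba.com/',
                   'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/28.0.1500.72 Safari/537.36'}
        request = urllib2.Request(self.url + word, headers=headers)
        # Retry a bounded number of times instead of looping forever on errors.
        for attempt in range(max_retries):
            try:
                return urllib2.urlopen(request).read()
            except IOError:
                # urllib2.URLError and socket.error both derive from IOError.
                if attempt == max_retries - 1:
                    raise
                time.sleep(1)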

    def get_eng_from_chinese(self, word):
        # Look up a Chinese query and collect the English words listed for it.
        word = self.process_input(word)
        word = urllib.quote(word)
        data = self.get_data_from_web(word)
        label_lst = re.compile(r'<span class="label_list">(.*?)</span>', re.DOTALL)
        label_itm = re.compile(r'<label>(?P<item>.*?)</a>(.*?)</label>', re.DOTALL)
        first = label_lst.search(data)
        if not first:
            # No label list on the page: nothing to translate.
            return []
        data = first.group(0)
        res = []
        for second in label_itm.finditer(data):
            res.append(self.get_sentence_from_dt(second.group('item')))
        return res

    def get_dict_data(self, word):
        # Return a list with one item per looked-up word:
        # a dict on success, -1 if the word was not found.
        res = []
        if self.isEng.match(word):
            englst = [word]
        else:
            # Chinese input: translate it to candidate English words first.
            englst = self.get_eng_from_chinese(word)
        for item in englst:
            word = self.process_input(item)
            data = self.get_data_from_web(word)
            # The literal below is iciba's "sorry, nothing found" message.
            if '对不起，没有找到' in data:
                res.append(-1)
            else:
                tmp_dict = self.analysis_eng_data(data)
                tmp_dict['word'] = word
                tmp_dict['times'] = 1
                res.append(tmp_dict)
        return res

    def analysis_eng_data(self, data):
        # Split one result page into definitions, web paraphrases and
        # example sentences. A missing block is recorded as -1.
        res = {}
        explain = self.group_pos.search(data)
        if explain:
            res['explain'] = self.generate_explain(explain.group(0))
        else:
            res['explain'] = -1
        net_explain = self.net_paraphrase.search(data)
        if net_explain:
            res['net_explain'] = self.generate_net_explain(net_explain.group(0))
        else:
            res['net_explain'] = -1
        sentence_lst = []
        for sentence in self.sentence.finditer(data):
            sentence_lst.append(self.generate_sentence(sentence.group(0)))
        res['sentence'] = "\n\n".join(sentence_lst)
        return res

    def generate_explain(self, target):
        # Build one line per part of speech: '<part of speech> <meanings>\n'.
        meta_word = re.compile(r'<strong class="fl">(?P<meta_word>.*?)</strong>', re.DOTALL)
        label_lst = re.compile(r'<span class="label_list">(.*?)</span>', re.DOTALL)
        label_itm = re.compile(r'<label>(?P<item>.*?)</label>', re.DOTALL)
        res = ''
        start_word = 0
        while start_word < len(target):
            first = meta_word.search(target, start_word)
            if not first:
                break
            second = label_lst.search(target, first.end('meta_word'))
            if not second:
                # Part of speech without a meaning list: stop instead of crashing.
                break
            res += first.group('meta_word') + ' '
            # Limit the search to this label list so that items belonging to
            # the next part of speech are not picked up here as well.
            for third in label_itm.finditer(target, second.start(), second.end()):
                res += third.group('item')
            res += '\n'
            start_word = second.end()
        return res

    def generate_net_explain(self, target):
        # Concatenate the <li> items of the web-paraphrase block.
        li_item = re.compile(r'<li>(?P<item>.*?)</li>', re.DOTALL)
        # The prefix means "Web definitions: ".
        res = '网络释义： '
        for first in li_item.finditer(target):
            res += first.group('item')
        return res

    def generate_sentence(self, target):
        # Return 'English sentence\nChinese translation' for one example block.
        english = re.compile(r'<dt>(?P<eng>.*?)</dt>', re.DOTALL)
        chinese = re.compile(r'<dd>(?P<chn>.*?)</dd>', re.DOTALL)
        first = english.search(target)
        second = chinese.search(target)
        res = ''
        # Either half may be missing; skip it rather than fail on None.
        if first:
            res += self.get_sentence_from_dt(first.group('eng'))
        res += '\n'
        if second:
            res += second.group('chn')
        return res

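    def get_sentence_from_dt(self, target):
        # Strip HTML tags: skip everything from '<' up to the next '>'.
        res = ''
        length = len(target)
        index = 0
        while index < length:
            if target[index] == '<':
                # The bounds check avoids an IndexError on an unclosed '<'.
                while index < length and target[index] != '>':
                    index += 1
            else:
                res += target[index]
            index += 1
        return res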

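if __name__ == '__main__':
    p = GetResponse()
    test = ['hello', 'computer', 'nothing', 'bad guy', 'someday']
    for item in test:
        res = p.get_dict_data(item)
        for entry in res:
            if entry == -1:
                # Word not found: there is no dict to print.
                print "%s: not found" % item
                continue
            for (k, v) in entry.items():
                print "dict[%s]=" % k, v
            print
        # Pause between lookups so iciba does not ban us.
        time.sleep(3)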
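The listing's `get_sentence_from_dt` strips tags by walking the string one character at a time. A regular-expression substitution is a shorter alternative, shown here as a sketch rather than as the author's method; `strip_tags` is a hypothetical name.

```python
import re

# Anything from '<' up to the next '>' is treated as a tag.
TAG = re.compile(r'<[^>]*>')


def strip_tags(fragment):
    """Remove everything that looks like an HTML tag, keeping the text between tags."""
    return TAG.sub('', fragment)


plain = strip_tags('I said <b>hello</b> to <a href="#">him</a>.')
```

One difference in behavior: an unclosed `<` is left in place by this version, whereas the character loop skips everything after it.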
