Learning Python Web Scraping: Parsing a Lottery Results Site with XPath

Example requirement: use Python to crawl all the data on the lottery results site http://kaijiang.zhcw.com/zhcw/html/ssq/list_1.html and save it to a txt file.

Example environment: Python 3.7
    BeautifulSoup and lxml (which provides XPath), both installed manually
    urllib (built into Python, no installation needed)

Example website:

Step 1: open http://kaijiang.zhcw.com/zhcw/html/ssq/list_1.html and review the basic information on the site. Note that there are 118 pages of data to crawl in total.

Step 2: view the page source and get familiar with the page structure, tags, and other details.
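The row layout that the spider depends on can be checked offline before any crawling. The snippet below is a minimal sketch under stated assumptions: the HTML fragment is a made-up stand-in whose shape is inferred from the XPath expressions in the example code (it is not copied from the site), `parse_rows` is a hypothetical helper name, and the parsing uses the standard library's `xml.etree.ElementTree` rather than lxml. ElementTree supports only a subset of XPath, so the "no attributes" test that lxml writes as `//tr[not(@*)]` is done in plain Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one header row (with an attribute) and one data row.
# All values are placeholders, not real draw results.
SAMPLE = """
<table>
  <tr class="header"><th>date</th><th>issue</th><th>numbers</th></tr>
  <tr>
    <td>2000-01-01</td>
    <td>0000001</td>
    <td><em>01</em><em>02</em><em>03</em><em>04</em><em>05</em><em>06</em><em>07</em></td>
    <td><strong>1,000</strong></td>
    <td><strong>1</strong></td>
    <td><strong>2</strong></td>
  </tr>
</table>
"""

def parse_rows(markup):
    """Return one dict per attribute-less <tr> that contains <em> cells."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.iter('tr'):
        if tr.attrib:                      # skip rows that carry any attribute
            continue
        balls = [em.text for em in tr.findall('./td[3]/em')]
        if not balls:                      # skip rows without number cells
            continue
        rows.append({
            'date': tr.findtext('./td[1]'),
            'order': tr.findtext('./td[2]'),
            'red': balls[:6],              # first six <em> cells
            'blue': balls[6],              # seventh <em> cell
            'money': tr.findtext('./td[4]/strong'),
        })
    return rows

rows = parse_rows(SAMPLE)
print(rows[0]['date'], rows[0]['red'], rows[0]['blue'])
```

If the real page is not well-formed XML, ElementTree will reject it. That is one reason the example code in this post uses `etree.HTML` from lxml, which tolerates broken markup.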

Example code:

#encoding=utf-8

#pip install lxml beautifulsoup4

from bs4 import BeautifulSoup

import urllib.request

from lxml import etree

class GetDoubleColorBallNumber(object):

    def __init__(self):

        self.urls = []

        self.getUrls()

        self.items = self.spider(self.urls)

        self.pipelines(self.items)

    # URL module: read the page count from the index page and build the page URLs

    def getUrls(self):

        URL = r'http://kaijiang.zhcw.com/zhcw/html/ssq/list.html'

        htmlContent = self.getResponseContent(URL)

        soup = BeautifulSoup(htmlContent, 'html.parser')

        tag = soup.find_all('p')[-1]

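        pages = tag.strong.get_text()   # total number of pages, read from the pager

        # pages = '3'   # uncomment to limit the crawl to a few pages while testing

        for i in range(1, int(pages)+1):   # start at 1 so list_1.html is included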

            url = r'http://kaijiang.zhcw.com/zhcw/html/ssq/list_' + str(i) + '.html'

            self.urls.append(url)

    # Network module (NETWORK): download a page and return it as decoded text

    def getResponseContent(self, url):

        try:

            response = urllib.request.urlopen(url)

        except urllib.request.URLError as e:

            raise e

        else:

            return response.read().decode("utf-8")

    # Spider module (Spider): parse each page and collect one dict per draw

    def spider(self,urls):

        items = []

        for url in urls:

            try:

                html = self.getResponseContent(url)

                xpath_tree = etree.HTML(html)

                trTags = xpath_tree.xpath('//tr[not(@*)]')   # match every <tr> element that has no attributes

                for tag in trTags:

                    # if tag.xpath('../html'):

                    #     print("found the html tag")

                    # if tag.xpath('/td/em'):

                    #     print("****************")

                    # Only rows that have <em> elements inside a <td> are data rows

                    if tag.xpath('./td/em'):

                        item = {}

                        item['date'] = tag.xpath('./td[1]/text()')[0]

                        item['order'] = tag.xpath('./td[2]/text()')[0]

                        item['red1'] = tag.xpath('./td[3]/em[1]/text()')[0]

                        item['red2'] = tag.xpath('./td[3]/em[2]/text()')[0]

                        item['red3'] = tag.xpath('./td[3]/em[3]/text()')[0]

                        item['red4'] = tag.xpath('./td[3]/em[4]/text()')[0]

                        item['red5'] = tag.xpath('./td[3]/em[5]/text()')[0]

                        item['red6'] = tag.xpath('./td[3]/em[6]/text()')[0]

                        item['blue'] = tag.xpath('./td[3]/em[7]/text()')[0]

                        item['money'] = tag.xpath('./td[4]/strong/text()')[0]

                        item['first'] = tag.xpath('./td[5]/strong/text()')[0]

                        item['second'] = tag.xpath('./td[6]/strong/text()')[0]

                        items.append(item)

            except Exception as e:

                print(str(e))

                raise e

        return items

    # Pipeline module: write every record to a text file

    def pipelines(self,items):

        fileName = u'双色球.txt'   # "Double Color Ball" results file

        with open(fileName, 'w', encoding='utf-8') as fp:

            for item in items:

                fp.write('%s %s \t %s %s %s %s %s %s  %s \t %s \t %s %s \n'

                      %(item['date'],item['order'],item['red1'],item['red2'],item['red3'],item['red4'],item['red5'],item['red6'],item['blue'],item['money'],item['first'],item['second']))

if __name__ == '__main__':

    GDCBN = GetDoubleColorBallNumber()
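The URL pattern used in getUrls can be exercised without touching the network. The following is a minimal sketch, assuming the page count is already known (118 is the figure quoted in Step 1 and may have changed since). `build_page_urls` and `BASE` are hypothetical names introduced here for illustration and are not part of the class above.

```python
# URL template taken from the pattern used in the example code.
BASE = 'http://kaijiang.zhcw.com/zhcw/html/ssq/list_{}.html'

def build_page_urls(total_pages, first_page=1):
    """Return the list-page URLs for pages first_page..total_pages inclusive."""
    if total_pages < first_page:
        return []
    return [BASE.format(i) for i in range(first_page, total_pages + 1)]

urls = build_page_urls(118)   # 118 is the page count quoted in the text
print(len(urls))
print(urls[0])
print(urls[-1])
```

Keeping the URL construction in a pure function like this makes it easy to confirm that the first page is included before starting a long crawl.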

Example result: the script writes one line per draw to 双色球.txt in the working directory.
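Each output line holds the twelve fields collected by the spider. The snippet below is a rough sketch of that shape, not the exact output of the script: it uses a plain tab-separated layout instead of the mixed space-and-tab format string in pipelines, the record is invented for illustration, and `format_item`, `FIELDS`, and the temporary file name are hypothetical.

```python
import os
import tempfile

# Field order follows the keys filled in by the spider in the example code.
FIELDS = ['date', 'order', 'red1', 'red2', 'red3', 'red4', 'red5', 'red6',
          'blue', 'money', 'first', 'second']

def format_item(item):
    """Render one record as a single tab-separated line."""
    return '\t'.join(str(item[name]) for name in FIELDS) + '\n'

# Placeholder record: every value here is invented, not real draw data.
sample = {'date': '2000-01-01', 'order': '0000001',
          'red1': '01', 'red2': '02', 'red3': '03',
          'red4': '04', 'red5': '05', 'red6': '06',
          'blue': '07', 'money': '1,000', 'first': '1', 'second': '2'}

line = format_item(sample)

# Write to a temporary file with an explicit encoding, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'ssq_sample.txt')
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(line)
with open(path, encoding='utf-8') as fp:
    print(fp.read(), end='')
```

A single delimiter makes the file easier to load later with tools that expect tab-separated values.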
