Learning Python Web Scraping: Using the BeautifulSoup Library to Scrape a Lottery Results Website (Modular Version)

Goal: use Python to scrape all of the draw results listed on the lottery site http://kaijiang.zhcw.com/zhcw/html/ssq/list_1.html and save them as both a txt file and an Excel file.

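Environment: Python 3.7
             BeautifulSoup and xlwt libraries (must be installed manually)
             lxml (the parser name passed to BeautifulSoup in the code below; also installed manually)
             urllib and re libraries (part of the Python standard library, no installation needed)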

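The target site:

  Step 1: open http://kaijiang.zhcw.com/zhcw/html/ssq/list_1.html and look over the site. Note that there were 118 pages of data to scrape at the time of writing.

  Step 2: view the page source to get familiar with the page structure, the tags used, and so on.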
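As a rough illustration of step 2, the snippet below parses a small hand-written HTML fragment. The fragment is an assumption: it is not copied from the site, only modeled on the tags the scraper further down reads (`tr`, `td`, `em`, `strong`, and a final `p` holding the page count), and all values in it are made up. It uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with made-up values, shaped like the rows the scraper expects.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>draw</th><th>numbers</th><th>sales</th><th>first</th><th>second</th></tr>
  <tr>
    <td>2019-01-01</td>
    <td>2019001</td>
    <td><em>01</em><em>02</em><em>03</em><em>04</em><em>05</em><em>06</em><em>07</em></td>
    <td><strong>100</strong></td>
    <td><strong>1</strong></td>
    <td><strong>2</strong></td>
  </tr>
</table>
<p>pages: <strong>3</strong></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only rows that contain <em> tags carry draw numbers; the header row is skipped.
rows = [tr for tr in soup.find_all("tr") if tr.find("em")]
cells = rows[0].find_all("td")

# Six red balls followed by one blue ball, all inside the third cell.
balls = [em.get_text() for em in cells[2].find_all("em")]

# The page count is read from the <strong> tag inside the last <p>.
page_count = int(soup.find_all("p")[-1].strong.get_text())

print(cells[0].get_text(), cells[1].get_text(), balls, page_count)
```

Checking a fragment like this first makes it easier to see which index in `find_all('td')` maps to which field before running against all pages.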

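Approach:

  The structure of a crawler program:
  1. Scheduler module: decides the strategy for issuing network requests
  2. Network module: issues network requests and receives the server's responses
  3. Spider module: parses pages and extracts data
  4. Item module: defines the data fields to scrape
  5. Pipelines module: post-processes the scraped data (storing it in a database, writing it to the file system, passing it to a stream-processing framework, and so on)

  The example program below implements a basic version of each of these modules.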
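The five modules can be sketched in a few lines to show how they hand data to one another. This is only a minimal outline, not the program from this post: the function names, the `example.com` URLs, and `fake_fetch` (which stands in for a real network call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """Item module: the data fields to collect (a single field here)."""
    title: str


def scheduler(page_count: int) -> List[str]:
    """Scheduler module: decide which URLs to request, and in what order."""
    return [f"http://example.com/list_{i}.html" for i in range(1, page_count + 1)]


def network(url: str, fetch: Callable[[str], str]) -> str:
    """Network module: issue the request and return the response body."""
    return fetch(url)


def spider(html: str) -> List[Item]:
    """Spider module: turn a response body into items (one per non-empty line)."""
    return [Item(title=line.strip()) for line in html.splitlines() if line.strip()]


def pipeline(items: List[Item]) -> List[str]:
    """Pipelines module: post-process the items (here, just collect the titles)."""
    return [item.title for item in items]


def run(page_count: int, fetch: Callable[[str], str]) -> List[str]:
    collected: List[Item] = []
    for url in scheduler(page_count):
        collected.extend(spider(network(url, fetch)))
    return pipeline(collected)


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request, so the sketch runs offline."""
    return f"row from {url}\n"


result = run(2, fake_fetch)
print(result)
```

Passing the fetch function in as a parameter keeps the parsing and saving steps testable without touching the network.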

Example code:

　　getWinningNum.py

# encoding=utf-8
import re
import urllib.request
from urllib.error import URLError  # needed by the except clause in getResponseContent

from bs4 import BeautifulSoup

from save2excel import SavaBallDate


# 4. Item module: defines the data fields to scrape
class DoubleColorBallItem(object):
    date = None
    order = None
    red1 = None
    red2 = None
    red3 = None
    red4 = None
    red5 = None
    red6 = None
    blue = None
    money = None
    firstPrize = None
    secondPrize = None


# Drives the whole run: collect the page URLs, scrape them, then save the results
class GetDoubleColorBallNumber(object):
    def __init__(self):
        self.urls = []
        self.urls = self.getUrls()
        self.items = self.spider(self.urls)
        self.pipelines(self.items)
        SavaBallDate(self.items)

    # Build the list of page URLs
    def getUrls(self):
        URL = r'http://kaijiang.zhcw.com/zhcw/html/ssq/list_1.html'
        htmlContent = self.getResponseContent(URL)
        soup = BeautifulSoup(htmlContent, 'lxml')
        # The last <p> on the first page holds the total page count in a <strong> tag
        tag = soup.find_all('p')[-1]
        pages = tag.strong.get_text()
        for i in range(1, int(pages)+1):
            url = r'http://kaijiang.zhcw.com/zhcw/html/ssq/list_' + str(i) + '.html'
            self.urls.append(url)
        return self.urls

    # 2. Network module: issues the network request and receives the server's response
    def getResponseContent(self, url):
        try:
            response = urllib.request.urlopen(url)
        except URLError as e:
            raise e
        else:
            return response.read().decode("utf-8")

    # 3. Spider module: parses the pages and extracts the data
    def spider(self, urls):
        items = []
        for url in urls:
            try:
                htmlContent = self.getResponseContent(url)
                soup = BeautifulSoup(htmlContent, 'lxml')
                tags = soup.find_all('tr', attrs={})
                for tag in tags:
                    # Only rows containing <em> tags hold draw numbers
                    if tag.find('em'):
                        item = DoubleColorBallItem()
                        tagTd = tag.find_all('td')
                        item.date = tagTd[0].get_text()
                        item.order = tagTd[1].get_text()
                        tagEm = tagTd[2].find_all('em')
                        item.red1 = tagEm[0].get_text()
                        item.red2 = tagEm[1].get_text()
                        item.red3 = tagEm[2].get_text()
                        item.red4 = tagEm[3].get_text()
                        item.red5 = tagEm[4].get_text()
                        item.red6 = tagEm[5].get_text()
                        item.blue = tagEm[6].get_text()
                        item.money = tagTd[3].find('strong').get_text()
                        item.firstPrize = tagTd[4].find('strong').get_text()
                        item.secondPrize = tagTd[5].find('strong').get_text()
                        items.append(item)
            except Exception as e:
                raise e
                # print(str(e))
        return items

    # 5. Pipelines module: post-processes the scraped data (storing it in a database,
    #    writing it to the file system, passing it to a stream-processing framework, and so on)
    def pipelines(self, items):
        fileName = u'双色球.txt'
        # 'w' overwrites the file if it already exists; 'a' would append instead
        with open(fileName, 'w', encoding='utf-8') as fp:
            for item in items:
                fp.write('%s %s \t %s %s %s %s %s %s  %s \t %s \t %s %s \n'
                      %(item.date,item.order,item.red1,item.red2,item.red3,item.red4,item.red5,item.red6,item.blue,item.money,item.firstPrize,item.secondPrize))


if __name__ == '__main__':
    GDCBN = GetDoubleColorBallNumber()
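The item class and the long `%` format string above could also be written more compactly. The following is one possible variation, not the code this post uses: `DrawRecord` and `format_record` are hypothetical names, the red balls are kept in a list instead of six separate attributes, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrawRecord:
    """Hypothetical alternative to DoubleColorBallItem: one record per draw."""
    date: str
    order: str
    reds: List[str]      # the six red balls, in page order
    blue: str
    money: str
    first_prize: str
    second_prize: str


def format_record(record: DrawRecord) -> str:
    """Build one tab-separated text line for a draw."""
    return "\t".join([
        f"{record.date} {record.order}",
        " ".join(record.reds + [record.blue]),
        record.money,
        f"{record.first_prize} {record.second_prize}",
    ])


# Made-up sample values, used only to show the output shape.
sample = DrawRecord(
    date="2019-01-01",
    order="2019001",
    reds=["01", "02", "03", "04", "05", "06"],
    blue="07",
    money="100",
    first_prize="1",
    second_prize="2",
)
line = format_record(sample)
print(line)
```

A dataclass requires every field at construction time, so a row with a missing cell fails immediately instead of silently leaving an attribute as `None`.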

　　save2excel.py

# encoding=utf-8
import xlwt


class SavaBallDate(object):
    def __init__(self, items):
        self.items = items
        self.run(self.items)

    def run(self, items):
        fileName = u'双色球.xls'
        book = xlwt.Workbook(encoding='utf-8')
        sheet = book.add_sheet('ball', cell_overwrite_ok=True)
        # Header row
        sheet.write(0, 0, u'开奖日期')   # draw date
        sheet.write(0, 1, u'期号')       # draw number
        sheet.write(0, 2, u'红1')        # red 1
        sheet.write(0, 3, u'红2')        # red 2
        sheet.write(0, 4, u'红3')        # red 3
        sheet.write(0, 5, u'红4')        # red 4
        sheet.write(0, 6, u'红5')        # red 5
        sheet.write(0, 7, u'红6')        # red 6
        sheet.write(0, 8, u'蓝')         # blue
        sheet.write(0, 9, u'销售金额')   # sales amount
        sheet.write(0, 10, u'一等奖')    # first prize
        sheet.write(0, 11, u'二等奖')    # second prize
        # One row per item, starting below the header
        i = 1
        while i <= len(items):
            item = items[i-1]
            sheet.write(i, 0, item.date)
            sheet.write(i, 1, item.order)
            sheet.write(i, 2, item.red1)
            sheet.write(i, 3, item.red2)
            sheet.write(i, 4, item.red3)
            sheet.write(i, 5, item.red4)
            sheet.write(i, 6, item.red5)
            sheet.write(i, 7, item.red6)
            sheet.write(i, 8, item.blue)
            sheet.write(i, 9, item.money)
            sheet.write(i, 10, item.firstPrize)
            sheet.write(i, 11, item.secondPrize)
            i += 1
        book.save(fileName)


if __name__ == '__main__':
    pass
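The repeated `sheet.write` calls follow one pattern: row 0 holds the headers, and row i holds item i. A loop-based sketch of that pattern is shown below. `write_table` and `RecordingSheet` are hypothetical names; `RecordingSheet` is only an in-memory stand-in offering the same `write(row, col, value)` call used above, so the example runs without xlwt. With xlwt, the sheet returned by `book.add_sheet('ball')` would be passed in instead.

```python
from typing import Dict, Sequence, Tuple

# Attribute names of DoubleColorBallItem, in column order.
FIELDS = ["date", "order", "red1", "red2", "red3", "red4",
          "red5", "red6", "blue", "money", "firstPrize", "secondPrize"]


class RecordingSheet:
    """Hypothetical stand-in for a worksheet: remembers every written cell."""

    def __init__(self) -> None:
        self.cells: Dict[Tuple[int, int], str] = {}

    def write(self, row: int, col: int, value: str) -> None:
        self.cells[(row, col)] = value


def write_table(sheet, headers: Sequence[str], rows: Sequence[Sequence[str]]) -> None:
    """Write the headers to row 0, then each data row starting at row 1."""
    for col, title in enumerate(headers):
        sheet.write(0, col, title)
    for row_index, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            sheet.write(row_index, col, value)


# Made-up sample row, used only to show where each value lands.
sample_rows = [["2019-01-01", "2019001", "01", "02", "03", "04",
                "05", "06", "07", "100", "1", "2"]]
sheet = RecordingSheet()
write_table(sheet, FIELDS, sample_rows)
print(sheet.cells[(0, 0)], sheet.cells[(1, 8)])
```

To feed real items into `write_table`, each row could be built with `[getattr(item, name) for name in FIELDS]`.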

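Results:

  There is a fair amount of data, so the run may take a while. When the program finishes, the working folder contains the two output files, 双色球.txt and 双色球.xls.

  The __pycache__ folder is generated automatically when the program runs and can be ignored.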
