Learning Web Scraping (2): Scraping App Information from the 360 App Market

You are welcome to join the Python study and discussion group: 667279387

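Learning Web Scraping

Learning Web Scraping (1): Scraping Download Links from Movie Heaven (电影天堂)

Learning Web Scraping (2): Scraping App Information from the 360 App Market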

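Environment: Windows 10, Python 3.5

Main packages: SQLAlchemy, re

I am new to scraping, so rather than using the Scrapy framework I put together a simple framework of my own. The code does no logging or error handling; it only does the basic job. Feel free to modify the source to suit your needs. The download link is at the end of the article.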

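1. Analyzing the page source

This crawl collects each app's name, download count, rating, developer, latest version number, and update time.

Start by opening the page of a specific app and viewing its source:

http://zhushou.360.cn/detail/index/soft_id/77208

Below are two excerpts of the page source that contain the relevant information.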

<h2 id="app-name"><span title="360手机卫士-一键连免费wifi">360手机卫士-一键连免费wi...</span><cite class="verify_tag"></cite><cite class="white_tag"></cite></h2>

<div class="pf"> <span class="s-1 js-votepanel">8.8<em>分</em></span>

<span class="s-2"><a href="#comment-list" id="comment-num"><span class="js-comments review-count-all" style="margin:0;">0</span>条评价</a></span>

<span class="s-3">下载：187373万次</span>

<span class="s-3">15.82M</span>



<td width="50%"><strong>作者：</strong>北京奇虎科技有限公司</td>

<td width="50%"><strong>更新时间：</strong>2017-09-13</td>

<td><strong>版本：</strong>7.7.4<!--versioncode:257--><!--updatetime:2017-09-13--></td>

<td><strong>系统：</strong>Android 4.0.3以上</td>

<td colspan="2"><strong>语言：</strong>中文</td>

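I did not use XPath for parsing either; the fields are matched directly with regular expressions. The patterns are listed below. Note that the app name is taken from the page's <title> tag rather than from the h2 element shown above.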

import re

# Each pattern captures one field; re.DOTALL lets "." match newlines as well
r_name = re.compile(u"<title>(.*?)_360手机助手</title>", re.DOTALL)
r_download_num = re.compile(u'<span class="s-3">下载：(.*?)次</span>', re.DOTALL)
r_score = re.compile(u'<span class="s-1 js-votepanel">(.*?)<em>分</em>', re.DOTALL)
r_author = re.compile(u"<strong>作者：</strong>(.*?)</td>", re.DOTALL)
r_version = re.compile(u"<strong>版本：</strong>(.*?)<!--", re.DOTALL)
r_update_time = re.compile(u"<strong>更新时间：</strong>(.*?)</td>", re.DOTALL)

The patterns are applied to a page like this:

m = r_name.search(html)
# search() returns None when the pattern is not found, so guard before calling group()
app_name = m.group(1).strip() if m else None

The other fields are parsed in much the same way.
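To make "much the same way" concrete, here is a minimal sketch that runs all six patterns over a page in one loop. The function name extract_fields and the dictionary keys are my own choices for illustration; the article's own extract_details function is not shown, so this is not necessarily how it works. The sample string is assembled from the page excerpts above.

```python
import re

# Patterns copied from the article; the dict keys are illustrative names.
PATTERNS = {
    "name": re.compile(r"<title>(.*?)_360手机助手</title>", re.DOTALL),
    "download_num": re.compile(r'<span class="s-3">下载：(.*?)次</span>', re.DOTALL),
    "score": re.compile(r'<span class="s-1 js-votepanel">(.*?)<em>分</em>', re.DOTALL),
    "author": re.compile(r"<strong>作者：</strong>(.*?)</td>", re.DOTALL),
    "version": re.compile(r"<strong>版本：</strong>(.*?)<!--", re.DOTALL),
    "update_time": re.compile(r"<strong>更新时间：</strong>(.*?)</td>", re.DOTALL),
}


def extract_fields(html):
    """Map each field name to its stripped text, or None if the pattern is absent."""
    fields = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(html)
        fields[key] = match.group(1).strip() if match else None
    return fields


# Sample built from the excerpts in this article (it has no <title> tag).
sample = (
    '<span class="s-1 js-votepanel">8.8<em>分</em></span>'
    '<span class="s-3">下载：187373万次</span>'
    '<td width="50%"><strong>作者：</strong>北京奇虎科技有限公司</td>'
    '<td width="50%"><strong>更新时间：</strong>2017-09-13</td>'
    '<td><strong>版本：</strong>7.7.4<!--versioncode:257--></td>'
)
print(extract_fields(sample))
```

Returning None for a missing field, instead of raising, lets the caller decide whether an incomplete page should be skipped or stored.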

2. Designing the database schema

The ORM layer is implemented with SQLAlchemy.

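from sqlalchemy import Column, Integer, String, Float, DateTime
from sqlalchemy.ext.declarative import declarative_base

# Assumption: BaseModel is SQLAlchemy's declarative base (its definition is not shown here)
BaseModel = declarative_base()


class App360(BaseModel):
    __tablename__ = 'app360'

    id = Column(Integer, primary_key=True, autoincrement=True)
    soft_id = Column(Integer, nullable=False)           # id used in the detail page URL
    name = Column(String(100), nullable=False)
    author = Column(String(50), nullable=False)
    download_num = Column(String(50), nullable=False)   # stored as text, e.g. "187373万"
    score = Column(Float, nullable=False)
    # comments_num = Column(Integer, nullable=False)
    update_time = Column(DateTime, nullable=False)
    version = Column(String(50))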
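The regular expressions return strings, while score is a Float column and update_time is a DateTime column, so some conversion has to happen before a row is saved. The article does not show that step; the sketch below is one possible way to do it with the standard library. The helper names are hypothetical, and parse_download_num is only needed if you prefer a numeric download count over the String column used above.

```python
from datetime import datetime


def parse_download_num(text):
    """Convert strings such as '187373万' or '5200' to an int ('万' means 10,000)."""
    text = text.strip()
    if text.endswith("万"):
        # round() guards against floating-point error for values like '2.3万'
        return int(round(float(text[:-1]) * 10000))
    return int(text)


def parse_update_time(text):
    """Parse a 'YYYY-MM-DD' string into a datetime for the DateTime column."""
    return datetime.strptime(text.strip(), "%Y-%m-%d")


print(parse_download_num("187373万"))
print(float("8.8"))
print(parse_update_time("2017-09-13"))
```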

The database manager below handles database initialization, inserts, and queries.

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker


class DbManager(object):
    def __init__(self, Dbstring):
        # echo=True logs every SQL statement that is executed
        self.engine = create_engine(Dbstring, echo=True)
        # scoped_session gives each thread its own session
        self._dbSession = scoped_session(
            sessionmaker(autocommit=False, autoflush=False, bind=self.engine)
        )

    def init_db(self):
        # Create any tables that do not exist yet
        BaseModel.metadata.create_all(self.engine)

    def closeDB(self):
        self._dbSession().close()

    def getAppWithSoftId(self, soft_id):
        # first() already returns None when no row matches
        return self._dbSession().query(App360).filter(App360.soft_id == soft_id).first()

    def saveAppItem(self, app_object):
        # Insert only if no record with this soft_id exists yet
        session = self._dbSession()
        db_item = session.query(App360).filter(App360.soft_id == app_object.soft_id).first()
        if not db_item:
            session.add(app_object)
            session.commit()

3. Fetching pages

First, collect the soft_id of every app listed on a page:

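# The sid attribute of each app link holds its soft_id
r_url = re.compile(u'<a sid="(.*?)" href=', re.DOTALL)


def get_onePage_SoftId(url):
    res_html = do_request(url)
    # findall() returns a list of all captured soft_ids (empty if none match)
    soft_ids = r_url.findall(res_html)
    return soft_ids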
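The do_request helper is not shown in this article. As a rough idea of what such a helper could look like, here is a sketch built on urllib from the standard library. Everything in it is an assumption: the signature, the User-Agent header, the UTF-8 decoding, and the opener parameter, which exists only so the example can run offline against a fake response. The list URL in the call is also a placeholder.

```python
import io
import re
import urllib.request

r_url = re.compile(r'<a sid="(.*?)" href=', re.DOTALL)


def do_request(url, timeout=10, opener=urllib.request.urlopen):
    """Fetch url and return the body as text (a guess at what the real helper does)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fake_opener(request, timeout=None):
    """Offline stand-in for urlopen so the example needs no network access."""
    return io.BytesIO(
        b'<a sid="77208" href="/detail/index/soft_id/77208">'
        b'<a sid="1234" href="/detail/index/soft_id/1234">'
    )


# Placeholder list URL; the real values come from start_urls.
html = do_request("http://zhushou.360.cn/list/index/cid/1", opener=fake_opener)
print(r_url.findall(html))
```

Leaving opener at its default performs a real HTTP request; passing a fake makes the parsing step easy to check without hitting the site.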

Then fetch the details of a single app:

def get_app_detail(soft_id):
    # Skip apps that are already stored in the database
    db_item = db.getAppWithSoftId(soft_id)
    if not db_item:
        url = "http://zhushou.360.cn/detail/index/soft_id/" + str(soft_id)
        app_html = do_request(url)
        app_item = extract_details(app_html, soft_id)
        db.saveAppItem(app_object=app_item)

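Here I simply use nested loops, which is crude; if performance matters, this part should be optimized. I will look into how to optimize it when I have time.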

for url in start_urls:
    for i in range(50):
        # Build the page URL in a separate variable so `url` is not overwritten
        page_url = "http://zhushou.360.cn" + url + "?page=%s" % i
        ids = get_onePage_SoftId(page_url)
        for soft_id in ids:
            get_app_detail(soft_id)

[Screenshot of the collected data]

Source code download:

Link: https://pan.baidu.com/s/1sl6xPEl Password: k48g

————————————————————————————

I later improved the crawler to run in parallel, which made it much faster: more than 7,000 records were downloaded in roughly ten minutes.

from utils import *  # presumably provides do_request, extract_details, r_url, start_urls
from concurrent import futures
from models import DbManager, App360


def get_app_detail(soft_id):
    # Skip apps that are already stored in the database
    db_item = db.getAppWithSoftId(soft_id)
    if not db_item:
        url = "http://zhushou.360.cn/detail/index/soft_id/" + str(soft_id)
        app_html = do_request(url)
        app_item = extract_details(app_html, soft_id)
        db.saveAppItem(app_object=app_item)


def get_onePage_SoftId(url):
    res_html = do_request(url)
    # findall() always returns a list (empty when nothing matches)
    return r_url.findall(res_html)

if __name__ == "__main__":
    # Initialize the database
    DB_CONNECT_STRING = 'mysql+pymysql://root:hillstone@localhost:3306/app?charset=utf8'
    db = DbManager(Dbstring=DB_CONNECT_STRING)
    db.init_db()

    # Reuse one thread pool for all pages instead of creating a new one per page;
    # leaving the with block waits for every submitted task to finish
    with futures.ThreadPoolExecutor(max_workers=20) as executor:
        for url in start_urls:
            for i in range(50):
                one_url = "http://zhushou.360.cn" + url + "?page=%s" % i
                executor.map(get_app_detail, get_onePage_SoftId(one_url))
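One thing to be aware of with executor.map is that an exception raised inside a worker only surfaces when its result is read, so failures go unnoticed if the results are never consumed. The sketch below shows one way to collect failures with submit and as_completed. It is not part of the original crawler: crawl_all and fake_worker are hypothetical names, and fake_worker stands in for get_app_detail so the example runs without a network or database.

```python
from concurrent import futures


def crawl_all(soft_ids, worker, max_workers=20):
    """Run worker(soft_id) in a thread pool; return (succeeded ids, {failed id: exception})."""
    succeeded, failed = [], {}
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_id = {executor.submit(worker, sid): sid for sid in soft_ids}
        for future in futures.as_completed(future_to_id):
            sid = future_to_id[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(sid)
            except Exception as exc:
                failed[sid] = exc
    return succeeded, failed


def fake_worker(soft_id):
    """Stand-in for get_app_detail; fails on one id to show error reporting."""
    if soft_id == "2":
        raise ValueError("simulated parse failure")


ok, errors = crawl_all(["1", "2", "3"], fake_worker, max_workers=2)
print(sorted(ok), errors)
```

The failed ids can then be logged or retried in a second pass, which covers part of the error handling the crawler leaves out.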

Updated code:

https://github.com/Zhanben/python/tree/master/360Spider

If the source code is useful to you, please leave a comment on the blog to say thanks.
