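0. Preparation

1. Related tutorials

Python web-scraping tutorial series: http://cuiqingcai.com/1052.html

Python Web course: http://www.cnblogs.com/moonache/p/5110322.html

Python reference documentation (Chinese): http://python.usyiyi.cn/

2. Notes

The code below is only at the "it works" stage and is not portable; this blog post is mostly a personal record.

This blog post uses Python 2.7.

The CPU information is taken from this site: http://zj.zol.com.cn/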

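3. Result

1. Getting the CPU model and clock-speed information

1. The AJAX headache

I originally wanted to scrape the pages directly, but the URL of http://zj.zol.com.cn/ does not change when you move to another page. The Chrome DevTools (F12) showed that the list is refreshed through AJAX.

I also found the request behind it: http://zj.zol.com.cn/index.php?c=Ajax_ParamResponse&a=GetGoods&subcateId=28&type=0&priceId=noPrice&page=3&manuId=&paramStr=&keyword=&locationId=1&queryType=0

So I only need to change page=n to get the CPU information on page n.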
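As a small illustration of the page=n idea, here is a minimal sketch that builds the request URL for any page. The query string is copied from the request above; the helper name `build_page_url` is my own and not part of the original script.

```python
# Sketch: build the AJAX list URL for a given page number.
# The query string is copied from the request observed in DevTools;
# build_page_url is a hypothetical helper name.
BASE_URL = 'http://zj.zol.com.cn/index.php'
QUERY_TEMPLATE = ('c=Ajax_ParamResponse&a=GetGoods&subcateId=28&type=0'
                  '&priceId=noPrice&page={page}&manuId=&paramStr='
                  '&keyword=&locationId=1&queryType=0')


def build_page_url(page):
    """Return the list URL with the page parameter set to `page`."""
    return BASE_URL + '?' + QUERY_TEMPLATE.format(page=int(page))


print(build_page_url(3))
```

Only the page value varies here; every other parameter is kept exactly as the browser sent it.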

2. Getting the CPU name

Format example

tag.contents[0] :AMD \u7cfb\u5217 A8-7670\uff08\u76d2\u88c5\uff09<\/a>\r\n\t\t\t\t\t <\/h3>\r\n\t\t\t\t\t

manufacturer:AMD

modalDetail:A8-7670

modal:AMD A8-7670
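The format example above can be checked offline, without hitting the site. The sketch below applies the same two regular expressions to the sample string. The function name `parse_modal` and the extra length check before the APU case are my own additions, so treat it as an illustration rather than the exact script.

```python
import re

# Sample copied from the format example. The \uXXXX sequences are literal
# backslash escapes because the response text has not been decoded.
SAMPLE = r'AMD \u7cfb\u5217 A8-7670\uff08\u76d2\u88c5\uff09<\/a>\r\n\t\t\t\t\t <\/h3>\r\n\t\t\t\t\t'


def parse_modal(raw):
    """Return (manufacturer, modalDetail, modal) for one raw name string."""
    # Non-greedy match: everything up to the first space.
    manufacturer = re.findall('(.+?) ', raw)[0]
    # Runs of letters, digits, hyphens and spaces that follow a space.
    details = re.findall(' ([0-9a-zA-Z- ]+)', raw)
    modal_detail = details[0]
    # Prefix the series name for i3 / i5 / i7 parts.
    for series in ('i3', 'i5', 'i7'):
        if series in raw:
            modal_detail = series + ' ' + details[0]
            break
    # APU names carry the real model in the second match (guard is an addition).
    if modal_detail == 'APU' and len(details) > 1:
        modal_detail += ' ' + details[1]
    return manufacturer, modal_detail, manufacturer + ' ' + modal_detail


print(parse_modal(SAMPLE))
```

Running this on the sample gives the same manufacturer, modalDetail and modal values as listed in the format example.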

#-*- coding: UTF-8 -*-

import urllib

import re

from bs4 import BeautifulSoup

url='http://zj.zol.com.cn/index.php?c=Ajax_ParamResponse&a=GetGoods&subcateId=28&type=0&priceId=noPrice&page=2&manuId=&paramStr=&keyword=&locationId=1&queryType=0'

html = urllib.urlopen(url).read()

soup=BeautifulSoup(html,"html.parser")

listModal=[]

listSpecs=[]

tags = soup.find_all("a",attrs={"target":"\\\"_blank\\\""})

cnt=0

for tag in tags:

    cnt+=1

    modalSubstr=tag.contents[0]

    #print 'modalSubstr:'+modalSubstr

    manufacturer=re.findall('(.+?) ',modalSubstr)[0]  # non-greedy match: stops at the first space; take the first match

    #print 'manufacturer:'+manufacturer

    detailSubstr=re.findall(' ([0-9a-zA-Z- ]+)',modalSubstr)

    #print detailSubstr

    detailSubstr0=detailSubstr[0]

    # special handling for i3, i5 and i7

    if "i3" in modalSubstr:

        modalDetail="i3 "+detailSubstr0

    elif "i5" in modalSubstr:

        modalDetail="i5 "+detailSubstr0

    elif "i7" in modalSubstr:

        modalDetail="i7 "+detailSubstr0

    else:

        modalDetail=detailSubstr0

    # special handling for APU

    if modalDetail=="APU":

        modalDetail+=" "+detailSubstr[1]

    modal=manufacturer+" "+modalDetail

    print "modal:"+modal

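Result

3. Getting the CPU clock speed

except IndexError: the last CPU listed on the ZOL (Zhongguancun Online) site currently has no clock-speed information, so in that case its specs value is set to "Data Missed".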
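The fallback can be tried in isolation. This is a minimal sketch that reuses the regular expression from the script; the helper name `extract_specs` and both input fragments are made up for illustration and are not real responses from the site.

```python
import re

# Same pattern as in the script: title='\" followed by a decimal number and GHz.
SPECS_PATTERN = re.compile(r'title=\'\\\"([0-9.]+GHz)')


def extract_specs(fragment):
    """Return the first clock speed found, or "Data Missed" if there is none."""
    matches = SPECS_PATTERN.findall(fragment)
    try:
        return matches[0]
    except IndexError:
        # No clock speed in this fragment.
        return 'Data Missed'


# Hypothetical fragments, invented for this example only.
with_specs = r'''<span title='\"3.6GHz\"'>'''
without_specs = '<span>no clock speed listed</span>'
print(extract_specs(with_specs))
print(extract_specs(without_specs))
```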

#-*- coding: UTF-8 -*-

import urllib

import re

from bs4 import BeautifulSoup

url='http://zj.zol.com.cn/index.php?c=Ajax_ParamResponse&a=GetGoods&subcateId=28&type=0&priceId=noPrice&page=2&manuId=&paramStr=&keyword=&locationId=1&queryType=0'

html = urllib.urlopen(url).read()

soup=BeautifulSoup(html,"html.parser")

listModal=[]

listSpecs=[]

tags = soup.find_all("a",attrs={"target":"\\\"_blank\\\""})

cnt=0

for tag in tags:

    cnt+=1

    print cnt

    substr=str(tag)[100:500]

    # starts with title='\" followed by a decimal number and ends with GHz

    specsDictionary=re.findall(r'title=\'\\\"([0-9.]+GHz)',substr)

    try:

        specs=specsDictionary[0]

    except IndexError:

        specs="Data Missed"

    print specs

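Result

4. Reading page after page and stopping automatically

There are 16 pages in total, so a plain counted loop would have worked. However, I noticed that the beginning of each response contains the page value, and that whenever page>=16 in the URL, index.php just returns the content of page=16. That leads to the code below, which keeps reading the next page and stops by itself.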
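The stopping rule can be tested without the network. Below is a minimal sketch in which `fake_fetch` is a hypothetical stand-in for the site: it clamps the page number to a last page of 3, and its JSON-like layout is an assumption, since all the script relies on is that `page":N` appears in the first 50 characters.

```python
import re


def read_page_index(body):
    """Read the page value reported near the start of a response body."""
    return int(re.findall('page":([0-9]+)', body[0:50])[0])


def crawl(fetch):
    """Call fetch(1), fetch(2), ... until the reported page stops advancing."""
    pages = []
    requested = 1
    while True:
        body = fetch(requested)
        if read_page_index(body) != requested:
            break  # the server clamped the page number: we are past the end
        pages.append(body)
        requested += 1
    return pages


def fake_fetch(page, last_page=3):
    """Hypothetical server: any page beyond last_page returns last_page."""
    served = min(page, last_page)
    return '{"page":%d,"data":"..."}' % served


print(len(crawl(fake_fetch)))
```

The loop makes one request beyond the last page, which is what lets it detect the end without knowing the page count in advance.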

urlLeft='http://zj.zol.com.cn/index.php?c=Ajax_ParamResponse&a=GetGoods&subcateId=28&type=0&priceId=noPrice&page='

urlRight='&manuId=&paramStr=&keyword=&locationId=1&queryType=0'

urlPageIndex=1

while (1):

    url=urlLeft+str(urlPageIndex)+urlRight

    html = urllib.urlopen(url).read()

    soup=BeautifulSoup(html,"html.parser")

    soupSub=str(soup)[0:50]

    pageIndex=int(re.findall('page\":([0-9]+)',soupSub)[0])

    if urlPageIndex==pageIndex:

        tags = soup.find_all("a",attrs={"target":"\\\"_blank\\\""})

        cnt=0

        for tag in tags:

            pass  # ......omitted (the per-tag processing shown in the earlier snippets)

        print "yes"+str(urlPageIndex)

        urlPageIndex+=1

    else:

        print "no"+str(urlPageIndex)

        break

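5. Writing the output to CSV

Python has a built-in csv module for reading and writing CSV files; I followed the CSV export write-up on crifan's site.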

import csv

with open('excel_2010_ms-dos.csv', 'rb') as csvfile:

    spamreader = csv.reader(csvfile, dialect='excel')

    for row in spamreader:

        print ', '.join(row)
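The snippet above reads a CSV file, while the script needs the writing direction. Here is a minimal sketch of that, written for Python 3 (the post itself uses 2.7) and using an in-memory buffer so it runs without touching the disk. The two sample rows are made up.

```python
import csv
import io

# Made-up sample rows: (model, clock speed).
rows = [('AMD A8-7670', '3.6GHz'), ('Intel i5 6500', 'Data Missed')]

buffer = io.StringIO()
writer = csv.writer(buffer, dialect='excel')
# Header row first, then one row per CPU.
writer.writerow(['Config_Type', 'Config_Modal', 'Config_Specs'])
for modal, specs in rows:
    writer.writerow(['CPU', modal, specs])

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file in Python 3, open it with `open('Config.csv', 'w', newline='')`; in Python 2.7 the file is opened in `'wb'` mode instead.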

6. Final code

#-*- coding: UTF-8 -*-

import urllib

import re

import csv

from bs4 import BeautifulSoup

listModal=[]

listSpecs=[]

urlLeft='http://zj.zol.com.cn/index.php?c=Ajax_ParamResponse&a=GetGoods&subcateId=28&type=0&priceId=noPrice&page='

urlRight='&manuId=&paramStr=&keyword=&locationId=1&queryType=0'

urlPageIndex=1

while (1):

    url=urlLeft+str(urlPageIndex)+urlRight

    html = urllib.urlopen(url).read()

    soup=BeautifulSoup(html,"html.parser")

    soupSub=str(soup)[0:50]

    pageIndex=int(re.findall('page\":([0-9]+)',soupSub)[0])

    if urlPageIndex==pageIndex:

        tags = soup.find_all("a",attrs={"target":"\\\"_blank\\\""})

        cnt=0

        for tag in tags:

            cnt+=1

            modalSubstr=tag.contents[0]

            manufacturer=re.findall('(.+?) ',modalSubstr)[0]  # non-greedy match: stops at the first space; take the first match

            detailSubstr=re.findall(' ([0-9a-zA-Z- ]+)',modalSubstr)

            detailSubstr0=detailSubstr[0]

            # special handling for i3, i5 and i7

            if "i3" in modalSubstr:

                modalDetail="i3 "+detailSubstr0

            elif "i5" in modalSubstr:

                modalDetail="i5 "+detailSubstr0

            elif "i7" in modalSubstr:

                modalDetail="i7 "+detailSubstr0

            else:

                modalDetail=detailSubstr0

            # special handling for APU

            if modalDetail=="APU":

                modalDetail+=" "+detailSubstr[1]

            modal=manufacturer+" "+modalDetail

            listModal.append(modal)

            substr=str(tag)[100:500]

            # starts with title='\" followed by a decimal number and ends with GHz

            specsDictionary=re.findall(r'title=\'\\\"([0-9.]+GHz)',substr)

            try:

                specs=specsDictionary[0]

            except IndexError:

                specs="Data Missed"

            listSpecs.append(specs)

        print "yes"+str(urlPageIndex)

        urlPageIndex+=1

    else:

        print "no"+str(urlPageIndex)

        break

with open('Config.csv', 'wb') as csvfile:

    spamwriter = csv.writer(csvfile, dialect='excel')

    # write the header row

    spamwriter.writerow(['Config_Type','Config_Modal','Config_Specs','Config_MinorSpecs'])

    # one row per CPU, pairing each model with its clock speed
    for elementModal, elementSpecs in zip(listModal, listSpecs):

        spamwriter.writerow(['CPU', elementModal, elementSpecs])


Scraping CPU names and clock speeds from ZOL (Zhongguancun Online) with Python