Scraping car review pages with a Python crawler, with a simple analysis (static crawler)

Environment:

Windows, Python 3.4

Reference:

https://blog.csdn.net/weixin_36604953/article/details/78156605

Code (tested and working):

 import requests
 from bs4 import BeautifulSoup
 import random
 import time

 # Strip surrounding whitespace and remove embedded newlines, tabs and carriage returns
 def clean(text):
     return text.strip().replace("\n", "").replace("\t", "").replace("\r", "")

 # Main crawler function
 def mm(url):
     # Request the first review page with a browser-like User-Agent
     header = {
         "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
     req0 = requests.get(url=url, headers=header)
     req0.encoding = "gb18030"  # decode as gb18030 to avoid garbled Chinese text
     html0 = req0.text
     # Parse the HTML into a BeautifulSoup instance, stored as soup0
     soup0 = BeautifulSoup(html0, "html.parser")
     # The last page number is the text of the second-to-last link in the pager
     # (see the section on finding the last page in the reference article)
     total_page = int(soup0.find("div", class_="pagers").findAll("a")[-2].get_text())
     # errors='ignore' skips characters that cannot be encoded as gb18030
     with open("aika_qc_gn_1_1_1.txt", "a", encoding='gb18030', errors='ignore') as myfile:
         # Header columns: user, source, number of people who found the review useful, type, comment
         print("user", " 来源", " 认为有用人数", " 类型", " comment")
         NAME = "user" + " 来源" + " 认为有用人数" + " 类型" + " comment"
         myfile.write(NAME + "\n")
         for i in range(1, total_page + 1):
             # Random pause (in seconds) applied after each review on this page
             stop = random.uniform(1, 3)
             page_url = "http://newcar.xcar.com.cn/257/review/0/0_" + str(i) + ".htm"
             req = requests.get(url=page_url, headers=header)
             req.encoding = "gb18030"  # decode as gb18030 to avoid garbled Chinese text
             soup = BeautifulSoup(req.text, "html.parser")
             # Each review on the page sits in its own <dl> tag
             contents = soup.find('div', class_="review_comments").findAll("dl")
             for tiaoshu, content in enumerate(contents):
                 try:
                     # Progress message: which review is being crawled
                     ss = "Crawling page %d, review %d, URL: %s" % (i, tiaoshu + 1, page_url)
                     print(ss)
                     # Aspect the review focuses on; "sunny" is the placeholder for any missing field
                     try:
                         comment_jiaodu = clean(content.find("dt").find("em").find("a").get_text())
                     except Exception:
                         comment_jiaodu = "sunny"
                     # Review type: the text between 【 and 】 in the <dt>
                     try:
                         comment_type = clean(content.find("dt").get_text()).split("【")[1].split("】")[0]
                     except Exception:
                         comment_type = "sunny"
                     # Number of people who found this review useful
                     try:
                         useful = int(clean(
                             content.find("dd").find("div", class_="useful").find("i").find("span").get_text()))
                     except Exception:
                         useful = "sunny"
                     # Review source
                     try:
                         comment_region = clean(content.find("dd").find("p").find("a").get_text())
                     except Exception:
                         comment_region = "sunny"
                     # Reviewer name: the text after the last full-width colon
                     try:
                         user = clean(content.find("dd").find("p").get_text()).split("：")[-1]
                     except Exception:
                         user = "sunny"
                     # Review text: follow the last link in the <dt> to the detail page
                     try:
                         comment_url = content.find('dt').findAll('a')[-1]['href']
                         reqc = requests.get(comment_url, headers=header)
                         soupc = BeautifulSoup(reqc.text, "html.parser")
                         msg_table = soupc.find('div', id='mainNew').find('div', class_='maintable').findAll('form')[
                             1].find('table', class_='t_msg')
                         comment0 = msg_table.findAll('tr')[1]
                         try:
                             comment = clean(comment0.find('font').get_text())
                         except Exception:
                             comment = "sunny"
                         # Post time (collected here but not written to the output file)
                         try:
                             comment_time = clean(
                                 msg_table.find('div', style='padding-top: 4px;float:left').get_text())[4:]
                         except Exception:
                             comment_time = "sunny"
                     except Exception:
                         # Detail page unavailable: fall back to the summary text on the list page
                         try:
                             comment = clean(
                                 content.find("dd").get_text().split("\n")[-1].split('\r')[-1]).split("：")[-1]
                         except Exception:
                             comment = "sunny"
                     time.sleep(stop)
                     print(user, comment_region, useful, comment_type, comment)
                     tt = user + " " + comment_region + " " + str(useful) + " " + comment_type + " " + comment
                     myfile.write(tt + "\n")
                 except Exception as e:
                     print(e)
                     s = "Failed to crawl page %d, review %d, URL: %s" % (i, tiaoshu + 1, page_url)
                     print(s)

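 # Count the distribution of review types
 def fenxi():
     good = 0
     middle = 0
     bad = 0
     nn = 0
     # Read with the same encoding the crawler used when writing the file
     with open("aika_qc_gn_1_1_1.txt", "r", encoding='gb18030', errors='ignore') as myfile:
         for line in myfile:
             if not line.strip():
                 continue  # skip blank lines
             fields = line.rstrip("\n").split(" ")
             # Skip the header row that is written at the start of every crawl
             if fields[:4] == ["user", "来源", "认为有用人数", "类型"]:
                 continue
             # The review type is the fourth space-separated field
             commit = fields[3] if len(fields) > 3 else ""
             if commit == "好评":  # positive review
                 good = good + 1
             elif commit == "中评":  # neutral review
                 middle = middle + 1
             elif commit == "差评":  # negative review
                 bad = bad + 1
             else:  # no recognizable rating
                 nn = nn + 1
     rated = good + middle + bad
     if rated == 0:
         # Avoid dividing by zero when the file holds no rated reviews
         print("No rated reviews found")
         return
     # All four percentages are relative to the number of rated reviews
     g = round(good / rated * 100, 2)
     m = round(middle / rated * 100, 2)
     b = round(bad / rated * 100, 2)
     n = round(nn / rated * 100, 2)
     print("Positive reviews (%):", g)
     print("Neutral reviews (%):", m)
     print("Negative reviews (%):", b)
     print("Unrated (%):", n)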

 url = "http://newcar.xcar.com.cn/257/review/0.htm"
 mm(url)
 fenxi()
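
One weakness of the script above is that it joins the fields with single spaces, so a user name or comment that itself contains a space shifts the columns when fenxi() splits the line. The following is only a sketch of one possible alternative, not part of the original script: it writes tab-separated rows with the standard csv module and counts the review types with collections.Counter. The sample rows, the English column names and the in-memory buffer standing in for the output file are all made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the same column order the crawler writes:
# user, source, useful count, review type, comment
rows = [
    ["user one", "北京", "3", "好评", "great car, quiet cabin"],
    ["user2", "上海", "0", "差评", "too thirsty"],
    ["user3", "sunny", "sunny", "sunny", "no rating found"],
]

# Splitting a space-joined line picks the wrong field once a name contains a space
naive_type = " ".join(rows[0]).split(" ")[3]

# Write tab-separated rows; StringIO stands in for the real output file
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["user", "source", "useful", "type", "comment"])  # assumed column names
writer.writerows(rows)

# Read the rows back by column name and count each review type
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
counts = Counter(record["type"] for record in reader)

print("naive split saw:", naive_type)
print("counts by type:", dict(counts))
```

Reading by column name also means the header row never needs to be skipped by hand, because DictReader consumes it.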

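BeautifulSoup, a handy tool

bs4 is a third-party Python package that provides BeautifulSoup, a library for parsing HTML. In other words, it makes it easier to locate the information you need by tag. Only the most important methods are covered here: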

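1. The find and findAll methods:
BeautifulSoup first turns the whole HTML document, or whatever HTML fragment you pass in, into an instance of the BeautifulSoup class. (If objects and instances are unfamiliar, just think of it as the tree of HTML you see when you press F12 in the browser.) This instance offers many methods, and the most commonly used are find and findAll. Both search in the same way: the arguments in the parentheses give the tag name, attribute name and attribute value to look for, and the matching tag is returned. The difference is that find returns only the first match, while findAll returns every match in a list-like result. findAll is typically used to locate sibling tags. In the code above, for example, each review sits in its own dl tag, so the 10 reviews on a page correspond to 10 dl tags. find would return only the first of them, whereas findAll returns all 10 in a list, and a for loop over that list gives access to the content of each tag.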
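
The difference between the two methods can be sketched on a small fragment. The HTML below is made up for illustration and only mimics the container-plus-siblings layout the crawler expects. The sketch uses find_all, which is the current bs4 spelling; findAll is the older alias for the same method.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one container div holding several <dl> siblings, one per review
html = """
<div class="review_comments">
  <dl><dt>first review</dt></dl>
  <dl><dt>second review</dt></dl>
  <dl><dt>third review</dt></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="review_comments")

# find returns only the first matching tag
first_dl = container.find("dl")

# find_all returns every matching tag in a list-like result
all_dls = container.find_all("dl")

# Loop over the result to read each sibling in turn
titles = [dl.find("dt").get_text() for dl in all_dls]

# find returns None when nothing matches (the sample has no <table>)
missing = container.find("table")

print(first_dl.find("dt").get_text())
print(len(all_dls), titles)
print(missing)
```

Because find returns None for a missing tag, chaining another call onto it raises AttributeError, which is why the crawler wraps each chained lookup in try/except.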

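2. The get_text() method:
What find returns is not just the content we want: it is the whole tag, including the tag name, attribute names and attribute values. Suppose we want the content xxxx from "<Y yy='aaa'>xxxx</Y>". find gives us the entire "<Y yy='aaa'>xxxx</Y>", which is more than we need. Calling get_text() on the object returned by find yields only the text inside the tag, which here is xxxx.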
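
A short sketch makes the difference between the tag and its text concrete. The fragment is again hypothetical: a reviewer line and a padded "useful" counter, loosely shaped like the fields the crawler reads.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a reviewer line and a counter padded with whitespace
html = '<dd><p>Reviewer: <a href="#">alice</a></p><span class="useful">\n  12\n</span></dd>'
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.find("p")

# Converting the tag to a string gives the full markup, tags and attributes included
markup = str(p_tag)

# get_text() keeps only the text, including the text of nested tags
text = p_tag.get_text()
link_text = p_tag.find("a").get_text()

# The text keeps the page's whitespace, so strip it before converting to int
useful = int(soup.find("span", class_="useful").get_text().strip())

print(markup)
print(text)
print(link_text, useful)
```

This is the same pattern the crawler follows for every field: locate the tag, call get_text(), then clean up whitespace before using the value.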
