JD.com Lipstick Top 30 Analysis

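1. Scraping the product IDs

Inspecting the page source shows that every product ID sits in a tag with class="gl-item". The tags can be located with bs4's select method, and the ID is then read from each tag's data-sku attribute.

Product pages differ only in the ID they contain, so once the IDs are known each product URL can be constructed as https://item.jd.com/<id>.html.

The collected IDs and the constructed URLs are stored in lists, as in the following source: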

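 # Imports used by the whole script
 import csv
 import json
 import re
 import urllib.request

 import bs4
 import requests
 from wordcloud import WordCloud

 pid = []    # product IDs
 links = []  # product page URLs
 word = []   # comment texts of the product currently being processed


 def get_product_url(url):
     global pid
     global links
     req = urllib.request.Request(url)
     # urllib sends a GET request by default, so only the User-Agent is set
     req.add_header("User-Agent",
                    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                    '(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36')
     content = urllib.request.urlopen(req).read()
     soup = bs4.BeautifulSoup(content, "lxml")
     # Every search result is a tag with class "gl-item"; its ID is in data-sku
     for item in soup.select('.gl-item'):
         sku = item.get('data-sku')
         # Build the product link
         links.append("https://item.jd.com/" + str(sku) + ".html")
         # Record the ID
         pid.append(sku)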
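The extraction step can be tried offline on a small HTML fragment. The sketch below is only an illustration: it swaps bs4 for the standard-library html.parser so that it runs without third-party packages, and the names SkuParser, extract_skus, build_links and the sample markup are assumptions, not part of the original script.

```python
from html.parser import HTMLParser


class SkuParser(HTMLParser):
    """Collect data-sku values from tags whose class list contains gl-item."""

    def __init__(self):
        super().__init__()
        self.skus = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "gl-item" in classes and attr_map.get("data-sku"):
            self.skus.append(attr_map["data-sku"])


def extract_skus(html):
    """Return the product IDs found in an HTML string, in page order."""
    parser = SkuParser()
    parser.feed(html)
    return parser.skus


def build_links(skus):
    """Build one product page URL per ID."""
    return ["https://item.jd.com/{}.html".format(sku) for sku in skus]


# Hypothetical search-result fragment; the real page markup may differ.
SAMPLE_HTML = """
<ul>
  <li class="gl-item" data-sku="1001"><a href="#">A</a></li>
  <li class="gl-item gl-item-presell" data-sku="1002"><a href="#">B</a></li>
  <li class="other" data-sku="9999"></li>
</ul>
"""

skus = extract_skus(SAMPLE_HTML)
links = build_links(skus)
print(skus)
print(links)
```

Keeping the parsing separate from the network request makes it possible to check the selector logic against a saved page before running the full crawl.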

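2. Collecting product information

The basic product information (product name, shop name, price, and so on) is read from the product page: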

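         product_url = links[i]
         req = urllib.request.Request(product_url)
         req.add_header("User-Agent",
                        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0')
         content = urllib.request.urlopen(req).read()
         # Parse the product page source
         soup = bs4.BeautifulSoup(content, "lxml")
         # Product name
         sku_name = soup.select_one('.sku-name').getText().strip()
         # Shop name (fall back to the alternative clstag when the first one is absent)
         try:
             shop_name = soup.find(clstag="shangpin|keycount|product|dianpuname1").get('title')
         except AttributeError:
             shop_name = soup.find(clstag="shangpin|keycount|product|zcdpmc_oversea").get('title')
         # Product ID, left-justified and padded to 20 characters
         sku_id = str(pid[i]).ljust(20)
         # Product price, served by a separate JSON endpoint
         price_url = 'https://p.3.cn/prices/mgets?pduid=1580197051&skuIds=J_{}'.format(pid[i])
         response = requests.get(price_url).content
         price = json.loads(response)[0]['p']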
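The script reads the price from a JSON response as price[0]['p']. A minimal sketch of that parsing step, run on a hard-coded body, is shown here. The sample payload (including its "id" field) and the helper name parse_price are assumptions; only the list-of-objects shape and the "p" key come from the script.

```python
import json
from decimal import Decimal

# Hypothetical response body; only the list shape and the "p" key are taken
# from the script, the "id" field is an assumption.
sample_body = b'[{"id": "J_1001", "p": "59.00"}]'


def parse_price(body):
    """Return the first price in the response as a Decimal, or None if empty."""
    data = json.loads(body)
    if not data:
        return None
    return Decimal(data[0]["p"])


price = parse_price(sample_body)
print(price)
```

Converting the string to Decimal is optional; it only matters if the prices are later sorted or averaged rather than written straight to the CSV file.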

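The hot-comment tags, the positive-review rate, and the comments themselves are obtained from the comment JSON endpoint.

Source for fetching the hot-comment tags: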

 def get_product_comment(product_id):
     comment_url = 'https://club.jd.com/comment/productPageComments.action?' \
                   'callback=fetchJSON_comment98vv16496&' \
                   'productId={}&' \
                   'score=0&' \
                   'sortType=6&' \
                   'page=0&' \
                   'pageSize=10' \
                   '&isShadowSku=0'.format(str(product_id))
     response = urllib.request.urlopen(comment_url).read().decode('gbk', 'ignore')
     # Strip the JSONP wrapper fetchJSON_comment98vv16496(...); to leave plain JSON
     response = re.search(r'(?<=fetchJSON_comment98vv16496\().*(?=\);)', response).group(0)
     response_json = json.loads(response)
     # Collect the hot-comment tags, each formatted as "name(count)"
     hot_comments = []
     for h_comment in response_json['hotCommentTagStatistics']:
         hot = str(h_comment['name'])
         count = str(h_comment['count'])
         hot_comments.append(hot + '(' + count + ')')
     return ','.join(hot_comments)
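The regular expression in the function above is tied to one hard-coded callback name. As a sketch under that observation, the wrapper can also be removed without knowing the name; strip_jsonp and the sample response below are hypothetical and only mirror the hotCommentTagStatistics structure used in the script.

```python
import json
import re

# Matches "<callback>( ... )" with an optional trailing semicolon.
_JSONP = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP response."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("response is not JSONP")
    return json.loads(match.group(1))


# Hypothetical response; real responses carry many more fields.
sample = ('fetchJSON_comment98vv16496('
          '{"hotCommentTagStatistics": [{"name": "颜色正", "count": 12}]});')

data = strip_jsonp(sample)
tags = ["{}({})".format(t["name"], t["count"])
        for t in data["hotCommentTagStatistics"]]
print(",".join(tags))
```

Raising ValueError when the wrapper is missing gives a clearer message than the AttributeError that re.search(...).group(0) produces when the server returns an error page instead of JSONP.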

Source for fetching the positive-review rate:

 def get_good_percent(product_id):
     comment_url = 'https://club.jd.com/comment/productPageComments.action?' \
                   'callback=fetchJSON_comment98vv16496&' \
                   'productId={}&' \
                   'score=0&' \
                   'sortType=6&' \
                   'page=0&' \
                   'pageSize=10' \
                   '&isShadowSku=0'.format(str(product_id))
     response = requests.get(comment_url).text
     # Strip the JSONP wrapper to leave plain JSON
     response = re.search(r'(?<=fetchJSON_comment98vv16496\().*(?=\);)', response).group(0)
     response_json = json.loads(response)
     # Read the positive-review rate and format it as a percentage
     percent = response_json['productCommentSummary']['goodRateShow']
     percent = str(percent) + '%'
     return percent
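All three comment functions build the same URL by string concatenation. One possible way to factor that out, sketched here with urllib.parse.urlencode, is a single helper; build_comment_url and COMMENT_ENDPOINT are hypothetical names, while the parameter names and values are the ones used in the script.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

COMMENT_ENDPOINT = 'https://club.jd.com/comment/productPageComments.action'


def build_comment_url(product_id, page=0, callback='fetchJSON_comment98vv16496'):
    """Build the comment URL for one product and one page."""
    params = {
        'callback': callback,
        'productId': product_id,
        'score': 0,
        'sortType': 6,
        'page': page,
        'pageSize': 10,
        'isShadowSku': 0,
    }
    return COMMENT_ENDPOINT + '?' + urlencode(params)


url = build_comment_url(1001, page=2)
query = parse_qs(urlsplit(url).query)
print(url)
```

With such a helper, get_product_comment and get_good_percent would request page 0 and get_comment would pass its page argument, so the query string is defined in one place only.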

Source for fetching the comments:

 def get_comment(product_id, page):
     global word
     comment_url = 'https://club.jd.com/comment/productPageComments.action?' \
                   'callback=fetchJSON_comment98vv16496&' \
                   'productId={}&' \
                   'score=0&' \
                   'sortType=6&' \
                   'page={}&' \
                   'pageSize=10' \
                   '&isShadowSku=0'.format(str(product_id), str(page))
     response = urllib.request.urlopen(comment_url).read().decode('gbk', 'ignore')
     # Strip the JSONP wrapper to leave plain JSON
     response = re.search(r'(?<=fetchJSON_comment98vv16496\().*(?=\);)', response).group(0)
     response_json = json.loads(response)
     # Append to 评论.csv (comments); path, the output directory, is not shown in this article
     with open('{0}\\评论.csv'.format(path), 'a', newline='', encoding='utf-8', errors='ignore') as comment_file:
         write = csv.writer(comment_file)
         # Iterate over the user comments on this page
         for item in response_json['comments']:
             # Comment time
             creation_time = str(item['creationTime'])
             # Product color
             product_color = str(item['productColor'])
             # Product name
             reference_name = str(item['referenceName'])
             # Customer score
             score = str(item['score'])
             # Customer comment (kept in its own variable instead of overwriting the loop variable)
             text = str(item['content']).strip()
             # Keep the comment text for the word cloud
             word.append(text)
             write.writerow([product_id, reference_name, product_color, creation_time, score, text])
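The comment file is created in one function and appended to in another, so both places have to agree on the encoding and the header. The sketch below shows one way to keep that in a single helper; append_rows, the English header names, and the sample rows are assumptions for illustration, and os.path.join replaces the hard-coded backslash so the path also works outside Windows.

```python
import csv
import os
import tempfile

# Hypothetical English header; the script writes Chinese column names.
HEADER = ['product_id', 'product', 'color', 'time', 'score', 'comment']


def append_rows(csv_path, rows):
    """Append rows to a UTF-8 CSV file, writing the header if the file is new."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(HEADER)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'comments.csv')
    append_rows(csv_path, [[1001, 'Lipstick A', 'Red', '2017-10-01 12:00:00', 5, '颜色很正']])
    append_rows(csv_path, [[1001, 'Lipstick A', 'Pink', '2017-10-02 08:30:00', 4, 'Nice, "smooth" texture']])
    # Read the file back to confirm that quoting and encoding round-trip
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))

print(rows)
```

The csv module quotes commas and double quotes inside a comment automatically, which is why the second sample comment survives the round trip unchanged.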

Complete source for collecting the product information:

 def get_product_info():
     global pid
     global links
     global word
     # Create 评论.csv (comments). Header columns: product ID, product, color,
     # comment time, customer score, customer comment.
     # The encoding matches the later appends, so the file is not mixed-encoding.
     with open('{0}\\评论.csv'.format(path), 'w', newline='', encoding='utf-8') as comment_file:
         csv.writer(comment_file).writerow(['商品id', '商品', '颜色', '评论时间', '客户评分', '客户评论'])
     # Create 商品.csv (products). Header columns: product ID, shop, product,
     # price, positive-review rate, hot-comment tags.
     with open('{0}\\商品.csv'.format(path), 'w', newline='', encoding='utf-8') as product_file:
         csv.writer(product_file).writerow(['商品id', '所属商店', '商品', '价格', '商品好评率', '商品评价'])
     for i in range(len(pid)):
         print('[*] Collecting data...')
         product_url = links[i]
         req = urllib.request.Request(product_url)
         req.add_header("User-Agent",
                        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0')
         content = urllib.request.urlopen(req).read()
         # Parse the product page source
         soup = bs4.BeautifulSoup(content, "lxml")
         # Product name
         sku_name = soup.select_one('.sku-name').getText().strip()
         # Shop name (fall back to the alternative clstag when the first one is absent)
         try:
             shop_name = soup.find(clstag="shangpin|keycount|product|dianpuname1").get('title')
         except AttributeError:
             shop_name = soup.find(clstag="shangpin|keycount|product|zcdpmc_oversea").get('title')
         # Product ID, left-justified and padded to 20 characters
         sku_id = str(pid[i]).ljust(20)
         # Product price, served by a separate JSON endpoint
         price_url = 'https://p.3.cn/prices/mgets?pduid=1580197051&skuIds=J_{}'.format(pid[i])
         response = requests.get(price_url).content
         price = json.loads(response)
         price = price[0]['p']
         # Append the product row to 商品.csv
         with open('{0}\\商品.csv'.format(path), 'a', newline='', encoding='utf-8', errors='ignore') as product_file:
             csv.writer(product_file).writerow(
                 [sku_id, shop_name, sku_name, price, get_good_percent(pid[i]), get_product_comment(pid[i])])
         # get_comment_count is not shown in this article
         pages = int(get_comment_count(pid[i]))
         word = []
         try:
             for j in range(pages):
                 get_comment(pid[i], j)
         except Exception as e:
             print("[!!!] Failed to load comments for product {}!".format(pid[i]))
             print("[!!!] Error: {}".format(e))
         print('[*] Product {} ({}) collected!'.format(i + 1, pid[i]))
         # Generate the word cloud from this product's comments
         word = " ".join(word)
         # Raw string, so the backslashes in the font path are not treated as escapes
         my_wordcloud = WordCloud(font_path=r'C:\Windows\Fonts\STZHONGS.TTF', background_color='white').generate(word)
         my_wordcloud.to_file("{}.jpg".format(pid[i]))

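The product information and the comments are written to CSV tables, and a word cloud is generated from each product's comments.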
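In the complete source, the try block wraps the whole page loop, so the first page that fails ends the collection for that product. A possible alternative, sketched here with a stand-in fetch function, handles errors per page and records which pages failed. collect_pages and fake_fetch are hypothetical, and unlike get_comment the fetch function is assumed to return the comments of a page as a list.

```python
import time


def collect_pages(fetch_page, pages, delay=0.0):
    """Call fetch_page(n) for every page; return (results, failed_pages)."""
    results, failed = [], []
    for page in range(pages):
        try:
            results.extend(fetch_page(page))
        except Exception as exc:
            # Remember the failure and continue with the next page
            failed.append((page, str(exc)))
        if delay:
            time.sleep(delay)
    return results, failed


def fake_fetch(page):
    """Stand-in for a network call: page 1 fails, the others return two comments."""
    if page == 1:
        raise ValueError("bad JSON on page 1")
    return ["comment {}-{}".format(page, n) for n in range(2)]


comments, failed = collect_pages(fake_fetch, 3)
print(comments)
print(failed)
```

The optional delay parameter is a simple way to space out requests; whether it is needed depends on how the site responds to the crawl.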

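3. Summary

The most frequent problem during scraping was encoding. The page content returned by a request is of type bytes and has to be decoded with decode("gbk"). Encoding errors still occurred after that, and some articles pointed out that passing "ignore" as the second argument to decode resolves them. Note that "ignore" silently drops the bytes that cannot be decoded. Because so much data is scraped, some of it is lost along the way, but the remaining volume is large enough that the loss does not affect the product analysis.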
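The effect of the error handler passed to decode can be seen on a few bytes. The sketch below builds a GBK byte string with one invalid byte inserted (the sample text and the inserted byte are assumptions chosen for the demonstration) and decodes it with the default strict handler, with "ignore", and with "replace".

```python
# "你好" encoded as GBK, with an invalid byte (0xFF) inserted in the middle
good = "你好".encode("gbk")
raw = good[:2] + b"\xff" + good[2:]

# Default (strict) handling raises an error on the invalid byte
try:
    raw.decode("gbk")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" drops the invalid byte without any indication
ignored = raw.decode("gbk", "ignore")

# "replace" keeps a visible marker (U+FFFD) where the byte was dropped
replaced = raw.decode("gbk", "replace")

print(strict_failed, ignored, replaced)
```

Using "replace" instead of "ignore" makes it possible to count how many characters were lost, which "ignore" hides.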
