Crawling WeChat mini program articles with a Scrapy CrawlSpider

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxSpider(CrawlSpider):
    name = 'wx'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # List pages: keep following pagination links; no callback is needed.
        Rule(LinkExtractor(allow=r'.*mod=list&catid=2&page=\d+'), follow=True),
        # Article pages: parse each one, but do not follow the links inside it.
        Rule(LinkExtractor(allow=r'.*article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        detail_href = response.request.url
        title = response.xpath('//h1[@class="ph"]/text()').get()
        # Collect every text node of the article body, trim each one, then join them.
        content = response.xpath('//td[@id="article_content"]//text()').getall()
        content = [c.strip() for c in content]
        content = ''.join(content).strip()
        pub_time = response.xpath('//p[@class="authors"]/span/text()').get()
        author = response.xpath('//p[@class="authors"]/a/text()').get()
        item = WxappItem(title=title, content=content, detail_href=detail_href, pub_time=pub_time, author=author)
        yield item
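A quick way to see what the two rules select is to run their regular expressions against sample URLs. The sketch below uses only the standard `re` module and assumes, as `LinkExtractor` does for its `allow` patterns, that a pattern matches when it is found anywhere in the URL. `classify` is a hypothetical helper, not part of Scrapy, and the article and forum URLs are made-up examples; only the first URL comes from the spider.

```python
import re

# Patterns copied from the spider's two rules.
LIST_PATTERN = r'.*mod=list&catid=2&page=\d+'
DETAIL_PATTERN = r'.*article-.+\.html'


def classify(url):
    """Return which rule's pattern a URL matches: 'list', 'detail', or None."""
    if re.search(LIST_PATTERN, url):
        return 'list'
    if re.search(DETAIL_PATTERN, url):
        return 'detail'
    return None


# The first URL is the spider's start URL; the other two are hypothetical.
start_url = 'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1'
article_url = 'http://www.wxapp-union.com/article-1234-1.html'
other_url = 'http://www.wxapp-union.com/forum.php'

for url in (start_url, article_url, other_url):
    print(url, '->', classify(url))
```

URLs that match neither pattern are never requested by the spider, which is what keeps the crawl limited to the tutorial list and its articles.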

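from scrapy.exporters import JsonLinesItemExporter, JsonItemExporter


class WxappPipeline(object):

    def __init__(self):
        """
        Runs when the pipeline is instantiated, before the spider starts.
        """
        self.fp = open("data.json", 'wb')
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        """
        Runs when the spider starts.
        :param spider:
        :return:
        """
        # Write the opening "[" of the JSON array.
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        """
        Runs when the spider finishes.
        :param spider:
        :return:
        """
        # Write the closing "]" so data.json is valid JSON, then close the file.
        self.exporter.finish_exporting()
        self.fp.close()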
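`JsonItemExporter` writes all items as a single JSON array, while Scrapy's `JsonLinesItemExporter` writes one object per line. The sketch below imitates the two output shapes with the standard `json` module only. It is not Scrapy's implementation, and the sample items and both helper functions are hypothetical.

```python
import io
import json

# Hypothetical items standing in for scraped WxappItem objects.
items = [
    {"title": "微信小程序 tutorial 1", "author": "a"},
    {"title": "微信小程序 tutorial 2", "author": "b"},
]


def write_json_array(fp, rows):
    """One JSON array: the items sit between an opening and a closing bracket."""
    fp.write("[")
    for i, row in enumerate(rows):
        if i:
            fp.write(",\n")
        fp.write(json.dumps(row, ensure_ascii=False))
    fp.write("]")


def write_json_lines(fp, rows):
    """JSON Lines: one self-contained object per line, no enclosing brackets."""
    for row in rows:
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")


array_buf, lines_buf = io.StringIO(), io.StringIO()
write_json_array(array_buf, items)
write_json_lines(lines_buf, items)

# The array only parses as a whole; the lines parse one at a time.
parsed_array = json.loads(array_buf.getvalue())
parsed_lines = [json.loads(line) for line in lines_buf.getvalue().splitlines()]
```

The array form is only valid once the closing bracket has been written, so an array exporter has to be told when exporting starts and when it finishes. The line-per-item form stays readable even if the crawl is interrupted.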

import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    pub_time = scrapy.Field()
    author = scrapy.Field()
    detail_href = scrapy.Field()
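`WxappPipeline` only runs if it is registered in the project settings. A minimal sketch, assuming the default project layout in which the class lives in `wxapp/pipelines.py` (the original does not show the file name). The priority `300` is the conventional value from Scrapy's project template, not something the original specifies.

```python
# settings.py (sketch): the dotted path assumes WxappPipeline is defined in wxapp/pipelines.py.
ITEM_PIPELINES = {
    # Pipelines with lower numbers run first; values conventionally range from 0 to 1000.
    "wxapp.pipelines.WxappPipeline": 300,
}
```

With the pipeline registered, the spider is started from the project directory with `scrapy crawl wx`, where `wx` is the spider's `name`.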
