I. Crawling Sina News

1. Crawling Sina news (full-site crawl)

Project setup and kickoff

scrapy startproject sina
cd sina
scrapy genspider mysina http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml
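
For reference, startproject plus genspider leaves a layout roughly like this (middlewares.py appears in recent Scrapy versions):

sina/
    scrapy.cfg            # deploy configuration
    sina/                 # project package
        __init__.py
        items.py          # item definitions (step 4)
        middlewares.py
        pipelines.py      # storage pipeline (step 6)
        settings.py       # project settings (step 2)
        spiders/
            __init__.py
            mysina.py     # spider generated by genspider (step 5)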

2. Project settings configuration

ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'sina.pipelines.SinaPipeline': 300,
}
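
If more pipelines are added later, the number after each class path decides the order: Scrapy accepts values in the 0–1000 range and runs lower numbers first. A sketch with a second, hypothetical pipeline:

ITEM_PIPELINES = {
    'sina.pipelines.CleanPipeline': 200,  # hypothetical pre-processing stage, runs before SinaPipeline
    'sina.pipelines.SinaPipeline': 300,
}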

3. Launcher file start.py

import scrapy.cmdline

def main():
    # -o ['json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle']
    scrapy.cmdline.execute(['scrapy', 'crawl', 'mysina'])

if __name__ == '__main__':
    main()
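
If file output is wanted as well, the formats listed in the comment can be passed through the same call with -o (news.json is an arbitrary file name):

scrapy.cmdline.execute(['scrapy', 'crawl', 'mysina', '-o', 'news.json'])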

4. Item definition (items.py)

import scrapy

class SinaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    newsTitle = scrapy.Field()
    newsUrl = scrapy.Field()
    newsTime = scrapy.Field()
    content = scrapy.Field()

5. Spider logic file: mysina.py

import scrapy
import requests
from lxml import etree
from sina import items
from scrapy.spiders import CrawlSpider, Rule  # CrawlSpider defines rules for following links
from scrapy.linkextractors import LinkExtractor  # extracts links from responses


class MysinaSpider(CrawlSpider):  # inherits CrawlSpider, so the parse callback must be renamed to avoid a conflict
    name = 'mysina'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml']
    '''
    Rule arguments: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity
    LinkExtractor arguments (partial): allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=()
        allow = regexes of URLs to follow, deny = regexes of URLs to exclude
    callback = name of the callback method
    follow = whether to keep following links from matched pages (True to follow)
    '''
    rules = [Rule(LinkExtractor(allow=(r'index_(\d+).shtml',)), callback='getParse', follow=True)]

    def getParse(self, response):  # renamed parse callback
        newsList = response.xpath("//ul[@class='list_009']/li")
        for news in newsList:
            item = items.SinaItem()  # instantiate the item
            newsTitle = news.xpath('./a/text()')[0].extract()
            newsUrl = news.xpath('./a/@href')[0].extract()
            newsTime = news.xpath('./span/text()')[0].extract()
            content = self.getContent(newsUrl)
            item['newsTitle'] = newsTitle
            item['newsUrl'] = newsUrl
            item['newsTime'] = newsTime
            item['content'] = content
            yield item

    def getContent(self, url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
        }
        response = requests.get(url, headers=headers).content.decode('utf-8', 'ignore')  # .content is raw bytes
        mytree = etree.HTML(response)
        contentList = mytree.xpath("//div[@class='article']//text()")
        print(contentList)
        content = ''
        for c in contentList:
            # strip() removes the given characters (whitespace/newlines by default) from both ends of the string
            content += c.strip().replace('\n', '')  # join the fragments so content holds the whole article
        return content

Method 2: mysina.py can also build the detail-page requests with Scrapy itself, so they go through Scrapy's scheduler instead of blocking requests.get calls

# -*- coding: utf-8 -*-
import scrapy
from sina import items
from scrapy.spiders import CrawlSpider, Rule  # CrawlSpider defines rules for following links
from scrapy.linkextractors import LinkExtractor  # extracts links from responses


class MysinaSpider(CrawlSpider):
    name = 'mysina'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml']
    rules = [Rule(LinkExtractor(allow=(r'index_(\d+).shtml',)), callback='getParse', follow=True)]

    def getParse(self, response):
        newsList = response.xpath("//ul[@class='list_009']/li")
        for news in newsList:
            newsTitle = news.xpath('./a/text()')[0].extract()
            newsUrl = news.xpath('./a/@href')[0].extract()
            newsTime = news.xpath('./span/text()')[0].extract()
            # build the detail request with the framework's own Request
            request = scrapy.Request(newsUrl, callback=self.getMataContent)  # callback is getMataContent
            # hand the fields over via meta
            request.meta['newsTitle'] = newsTitle
            request.meta['newsUrl'] = newsUrl
            request.meta['newsTime'] = newsTime
            yield request

    def getMataContent(self, response):
        '''
        getMataContent receives the response for the request issued above
        '''
        contentList = response.xpath("//div[@class='article']//text()")
        content = ''
        for c in contentList:
            content += c.extract().strip()
        item = items.SinaItem()
        # copy the fields carried on the response back onto the item
        item['newsTitle'] = response.meta['newsTitle']
        item['newsUrl'] = response.meta['newsUrl']
        item['newsTime'] = response.meta['newsTime']
        item['content'] = content
        yield item
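
As a side note, Scrapy 1.7+ also supports cb_kwargs, which passes such fields to the callback as keyword arguments instead of going through meta; a minimal sketch of the same hand-off:

# Alternative to meta (Scrapy 1.7+): pass the fields as callback keyword arguments.
request = scrapy.Request(newsUrl, callback=self.getMataContent,
                         cb_kwargs={'newsTitle': newsTitle, 'newsUrl': newsUrl, 'newsTime': newsTime})
# The callback then receives them directly:
# def getMataContent(self, response, newsTitle, newsUrl, newsTime): ...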

6. Pipeline storage: pipelines.py

import pymysql


class SinaPipeline(object):
    def __init__(self):
        self.conn = None
        self.cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='111.230.169.xxx', user='root', password='xxx',
                                    database='sina', port=3306, charset='utf8')  # open the connection
        self.cursor = self.conn.cursor()  # create a database cursor

    def process_item(self, item, spider):
        # values are interpolated with %r here; Method 2 below uses driver-side placeholders instead
        sql = 'insert into sina_news(newsTitle,newsUrl,newsTime,content) VALUES (%r,%r,%r,%r)' % (
            item['newsTitle'], item['newsUrl'], item['newsTime'], item['content'])
        self.cursor.execute(sql)  # run the SQL statement
        self.conn.commit()  # commit
        return item

    def close_spider(self, spider):
        self.cursor.close()  # close cursor and connection
        self.conn.close()
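
The pipeline assumes a sina_news table already exists in the sina database; a one-off creation sketch (the column types are assumptions, adjust them to the real data):

import pymysql

conn = pymysql.connect(host='111.230.169.xxx', user='root', password='xxx',
                       database='sina', port=3306, charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS sina_news (
            id INT AUTO_INCREMENT PRIMARY KEY,
            newsTitle VARCHAR(255),
            newsUrl VARCHAR(255),
            newsTime VARCHAR(64),
            content TEXT
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()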

Method 2: pipelines.py, building the SQL statement generically from the item's fields

import pymysql


class DemoPipeline(object):

    def __init__(self):
        self.conn = None
        self.cur = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='123456',
            db='fate',
            charset='utf8')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        cols, values = zip(*item.items())  # zip(*...) splits the item into column names and values
        sql = "INSERT INTO `%s` (%s) VALUES (%s)" % \
            (
                'sina_news',
                ','.join(cols),
                ','.join(['%s'] * len(values))
            )
        self.cur.execute(sql, values)  # run the SQL, letting the driver fill values into the %s placeholders
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
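
For clarity, a quick illustration of what the zip(*item.items()) unpacking produces (the field values here are made up):

# Illustration only: splitting an item's fields into parallel tuples of columns and values.
item = {'newsTitle': 't1', 'newsUrl': 'u1', 'newsTime': '06-08 10:30', 'content': '...'}
cols, values = zip(*item.items())
print(cols)    # ('newsTitle', 'newsUrl', 'newsTime', 'content')
print(values)  # ('t1', 'u1', '06-08 10:30', '...')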

II. Crawling Baidu Baike Entries

1. Crawling Baike entries

Project setup and kickoff

scrapy startproject baike
cd baike
scrapy genspider mybaike baike.baidu.com/item/Python/407313

2. Project settings configuration

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
ITEM_PIPELINES = {
    'baike.pipelines.BaikePipeline': 300,
}

3. Launcher file start.py

import scrapy.cmdline

def main():
    # -o ['json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle']
    scrapy.cmdline.execute(['scrapy', 'crawl', 'mybaike'])

if __name__ == '__main__':
    main()

4. Item definition (items.py)

import scrapy

class BaikeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    level1Title = scrapy.Field()
    level2Title = scrapy.Field()
    content = scrapy.Field()

5. Spider logic file: mybaike.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from baike.items import BaikeItem


class MybaikeSpider(CrawlSpider):
    name = 'mybaike'
    allowed_domains = ['baike.baidu.com']
    start_urls = ['https://baike.baidu.com/item/Python/407313']
    rules = [Rule(LinkExtractor(allow=(r'item/(.*)',)), callback='getParse', follow=True)]

    def getParse(self, response):
        level1Title = response.xpath("//dd[@class='lemmaWgt-lemmaTitle-title']/h1/text()")[0].extract()
        level2Title = response.xpath("//dd[@class='lemmaWgt-lemmaTitle-title']/h2/text()")
        if len(level2Title) != 0:
            level2Title = level2Title[0].extract()
        else:
            level2Title = '待编辑'  # placeholder ("to be edited") when the entry has no subtitle
        contentList = response.xpath("//div[@class='lemma-summary']//text()")
        content = ''
        for c in contentList:
            content += c.extract()
        item = BaikeItem()
        item['level1Title'] = level1Title
        item['level2Title'] = level2Title
        item['content'] = content
        yield item
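
To sanity-check what the rule will follow before a full crawl, the same LinkExtractor can be run by hand inside scrapy shell (a quick sketch; response is predefined by the shell):

# Run inside: scrapy shell https://baike.baidu.com/item/Python/407313
from scrapy.linkextractors import LinkExtractor
links = LinkExtractor(allow=(r'item/(.*)',)).extract_links(response)
print([link.url for link in links][:5])  # preview a few of the entry URLs the rule would follow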

6. Pipeline storage: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class BaikePipeline(object):
    def __init__(self):
        self.conn = None
        self.cursor = None

    def open_spider(self, spider):
        # connection
        self.conn = pymysql.connect(host='111.230.169.107', user='root', password="20111673",
                                    database='baike', port=3306,
                                    charset='utf8')
        # cursor
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        cols, values = zip(*item.items())
        # `table name` goes into the first placeholder
        sql = "INSERT INTO `%s`(%s) VALUES (%s)" % \
            ('baike', ','.join(cols), ','.join(['%s'] * len(values)))
        self.cursor.execute(sql, values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

III. Crawling Douban Movies

1. Douban movie Top 250 ranking

Project setup and kickoff

scrapy startproject douban
cd douban
scrapy genspider mydouban movie.douban.com/top250

2. Project settings configuration

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

3. Launcher file start.py

import scrapy.cmdline

def main():
    # -o ['json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle']
    scrapy.cmdline.execute(['scrapy', 'crawl', 'mydouban'])

if __name__ == '__main__':
    main()

4. Item definition (items.py)

import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    movieInfo = scrapy.Field()
    star = scrapy.Field()
    quote = scrapy.Field()

5. Spider logic file: mydouban.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from douban.items import DoubanItem


class MydoubanSpider(scrapy.Spider):
    name = 'mydouban'
    start_urls = ['https://movie.douban.com/top250']  # method 1: let Scrapy issue the start request
    '''# method 2: build the start request by hand
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url, headers=self.headers)
    '''

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item = DoubanItem()  # create a fresh item per movie so earlier ones are not overwritten
            item['name'] = movie.xpath(".//div[@class='pic']/a/img/@alt").extract()[0]
            item['movieInfo'] = movie.xpath(".//div[@class='info']/div[@class='bd']/p/text()").extract()[0].strip()
            item['star'] = movie.xpath(".//div[@class='info']/div[@class='bd']/div[@class='star']/span[2]/text()").extract()[0]
            item['quote'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            yield item
        next_url = response.xpath('//span[@class="next"]/a/@href').extract()  # link to the next page
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield Request(next_url, callback=self.parse)  # recurse into the next page with the same callback

6. Pipeline storage: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DoubanPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host='111.230.169.107', port=3306, user='root', passwd='xxx',
                                    database='douban', charset='utf8')
        self.cursor = self.conn.cursor()
        self.cursor.execute("truncate table Movie")  # empty the table every time the spider starts
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                "insert into Movie (name,movieInfo,star,quote) VALUES (%s,%s,%s,%s)",
                (item['name'], item['movieInfo'], item['star'], item['quote']))
            self.conn.commit()
        except pymysql.Error:
            print("Error%s,%s,%s,%s" % (item['name'], item['movieInfo'], item['star'], item['quote']))
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
