Scrapy Item usage example (saving items to MySQL and MongoDB, and downloading images with the official ImagesPipeline component)

What to learn:

Saving items to a MySQL database and a MongoDB database, and downloading images

1. Spider file images.py

# -*- coding: utf-8 -*-
import json
from urllib.parse import urlencode

from scrapy import Spider, Request

from images360.items import ImageItem


class ImagesSpider(Spider):
    name = 'images'
    # Must cover the host requested in start_requests, otherwise the offsite filter may drop the requests
    allowed_domains = ['image.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        # Query parameters shared by every page request
        data = {'ch': 'photography', 'listtype': 'new'}
        base_url = 'https://image.so.com/zj?'
        # getint also handles MAX_PAGE being passed as a string on the command line
        for page in range(1, self.settings.getint('MAX_PAGE') + 1):
            # 'sn' advances by 30 for each page
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        # The response body is JSON; each entry under 'list' describes one image
        result = json.loads(response.text)
        for image in result.get('list', []):
            item = ImageItem()
            item['id'] = image.get('imageid')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('group_title')
            item['thumb'] = image.get('qhimg_thumb_url')
            yield item
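To check which URLs start_requests would produce without running a crawl, the URL-building logic can be pulled into a small standalone function. This is only a sketch: build_urls is a hypothetical helper name, not part of the project, and it simply repeats the parameter handling shown in the spider.

```python
from urllib.parse import urlencode


def build_urls(max_page, base_url='https://image.so.com/zj?'):
    """Return the list URLs the spider would request for pages 1..max_page.

    build_urls is a hypothetical helper used only for illustration.
    """
    data = {'ch': 'photography', 'listtype': 'new'}
    urls = []
    for page in range(1, max_page + 1):
        data['sn'] = page * 30  # same offset rule as in start_requests
        urls.append(base_url + urlencode(data))
    return urls


urls = build_urls(2)
for url in urls:
    print(url)
```

With max_page set to 2 this prints one URL ending in sn=30 and one ending in sn=60.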

2. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


class ImageItem(Item):
    # Plain class attributes (not Fields): the MongoDB collection name and the MySQL table name
    collection = table = 'images'

    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
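The MySQL pipeline in pipelines.py only runs INSERT statements, so a table named after ImageItem.table, with one column per field, has to exist before the crawl starts. The sketch below generates a matching CREATE TABLE statement. The column type VARCHAR(255) and the helper name create_table_sql are assumptions made for illustration; the original post does not give the schema.

```python
# Field names copied from ImageItem
FIELDS = ['id', 'url', 'title', 'thumb']


def create_table_sql(table, fields, column_type='VARCHAR(255)'):
    """Build a CREATE TABLE statement with one nullable column per item field.

    The column type is an assumption; adjust it to the data you expect.
    """
    columns = ', '.join('%s %s NULL' % (name, column_type) for name in fields)
    return 'CREATE TABLE IF NOT EXISTS %s (%s)' % (table, columns)


sql = create_table_sql('images', FIELDS)
print(sql)
```

The resulting statement can be run once in the images360 database, for example from the mysql command-line client.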

3. pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
import pymysql
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection is the class attribute defined on ImageItem
        name = item.collection
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
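MongoPipeline writes a new document for every item, so crawling the same pages twice stores each image twice. One possible alternative, not used in the original post, is an upsert keyed on the item's id via pymongo's update_one with upsert=True. The sketch below shows the call shape; InMemoryCollection is a tiny stand-in written for this example so it runs without a MongoDB server, and save_item is a hypothetical helper name.

```python
class InMemoryCollection:
    """Minimal stand-in for a pymongo collection, for demonstration only.

    It supports just the update_one call shape used below.
    """

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter['id']
        if key in self.docs:
            self.docs[key].update(update['$set'])
        elif upsert:
            self.docs[key] = dict(update['$set'])


def save_item(collection, data):
    """Insert the document, or update it if one with the same id already exists."""
    collection.update_one({'id': data['id']}, {'$set': data}, upsert=True)


collection = InMemoryCollection()
save_item(collection, {'id': 'a1', 'title': 'first'})
save_item(collection, {'id': 'a1', 'title': 'second'})
print(len(collection.docs), collection.docs['a1']['title'])
```

In the pipeline, the same call would be made on self.db[name] with dict(item) as the data.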

class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        # Keyword arguments work across PyMySQL versions; 1.0 and later no longer accept positional ones
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        # Build the column list and one %s placeholder per value from the item itself
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        # The values are passed separately so the driver escapes them
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
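The dynamic INSERT in MysqlPipeline.process_item can be tried out without a MySQL server. The sketch below repeats the same string building in a hypothetical helper, build_insert, and runs the result against an in-memory sqlite3 database as a stand-in. Note that sqlite3 uses ? as its placeholder where pymysql uses %s, so the placeholder is a parameter here; the sample data and example.com URLs are made up.

```python
import sqlite3


def build_insert(table, data, placeholder='%s'):
    """Return (sql, params) for inserting one row built from a dict of column values."""
    keys = ', '.join(data.keys())
    values = ', '.join([placeholder] * len(data))
    sql = 'insert into %s (%s) values (%s)' % (table, keys, values)
    return sql, tuple(data.values())


# Made-up sample item
data = {
    'id': 'a1',
    'url': 'http://example.com/a1.jpg',
    'title': 'demo',
    'thumb': 'http://example.com/a1_thumb.jpg',
}

# The statement as pymysql would receive it
sql, params = build_insert('images', data)
print(sql)

# sqlite3 stand-in: same logic, '?' placeholders
conn = sqlite3.connect(':memory:')
conn.execute('create table images (id text, url text, title text, thumb text)')
sqlite_sql, sqlite_params = build_insert('images', data, placeholder='?')
conn.execute(sqlite_sql, sqlite_params)
conn.commit()
rows = conn.execute('select id, title from images').fetchall()
print(rows)
conn.close()
```

Only the values go through placeholders. The table and column names are interpolated directly into the SQL string, which is acceptable here because they come from the item definition in the code rather than from scraped data.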

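class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image under the last segment of its URL;
        # the keyword-only item argument is passed by newer Scrapy versions
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per download request
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image download failed')
        return item

    def get_media_requests(self, item, info):
        # Download the image referenced by the item's url field
        yield Request(item['url'])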
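The file_path override names each file after the last segment of the image URL. A quick standalone check shows what that produces, and also what happens when a URL carries a query string. Both function names below are hypothetical, the example.com URLs are made up, and the second variant is an optional alternative that is not part of the original code.

```python
import posixpath
from urllib.parse import urlparse


def file_name_from_url(url):
    """Same rule as ImagePipeline.file_path: the text after the last slash."""
    return url.split('/')[-1]


def file_name_from_path(url):
    """Alternative: use only the URL path, so any query string is ignored."""
    return posixpath.basename(urlparse(url).path)


plain = 'https://example.com/t01/abc123.jpg'
with_query = 'https://example.com/t01/abc123.jpg?size=large'

print(file_name_from_url(plain))
print(file_name_from_url(with_query))
print(file_name_from_path(with_query))
```

Because the directory part of the URL is discarded, two images with the same file name in different directories would overwrite each other under IMAGES_STORE.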

4. settings.py

Add the following to the configuration file:

ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
    'images360.pipelines.MysqlPipeline': 302,
}

IMAGES_STORE = './images'

MAX_PAGE = 50

MONGO_URI = 'localhost'
MONGO_DB = 'images360'

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'images360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
MYSQL_PORT = 3306
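The numbers in ITEM_PIPELINES set the order in which Scrapy runs the pipelines: lower values run first. With the values above, ImagePipeline runs before the two database pipelines, so an item dropped because its image failed to download never reaches MongoDB or MySQL. The short sketch below just sorts the same dictionary to make that order visible.

```python
ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
    'images360.pipelines.MysqlPipeline': 302,
}

# Sort the pipeline paths by their priority value, lowest first
ordered = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in ordered:
    print(ITEM_PIPELINES[path], path)
```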

Code download: https://files.cnblogs.com/files/sanduzxcvbnm/Images360-master.7z
