Scrapy MongoDB storage
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class JobsPipeline(object):  # (replace "Jobs" with your own project name)
    def process_item(self, item, spider):
        return item


class MongoPipeline(object):
    collection_name = 'collection_name'  # name of the MongoDB collection (table)

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces the deprecated insert()
        self.db[self.collection_name].insert_one(dict(item))
        return item
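The DropItem exception imported above is never actually raised in the pipeline as written. A common use for it is dropping duplicates before they reach MongoDB. Below is a minimal sketch of such a variant; the 'url' field used as the unique key is an assumption, so replace it with whatever field actually identifies an item in your project.

class MongoDedupPipeline(MongoPipeline):
    """Same as MongoPipeline, but skips items that are already stored."""

    def process_item(self, item, spider):
        data = dict(item)
        # 'url' is a hypothetical unique field; use your own key here.
        if self.db[self.collection_name].find_one({'url': data.get('url')}):
            raise DropItem('Duplicate item found: {}'.format(data.get('url')))
        self.db[self.collection_name].insert_one(data)
        return item

If you use this variant, register 'Jobs.pipelines.MongoDedupPipeline' in ITEM_PIPELINES instead of MongoPipeline.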
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for Jobs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Jobs'

SPIDER_MODULES = ['Jobs.spiders']
NEWSPIDER_MODULE = 'Jobs.spiders'
DOWNLOAD_DELAY = 0
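# The numbers assigned in ITEM_PIPELINES are priorities in the 0-1000 range;
# items pass through the enabled pipelines in ascending order, so lower
# numbers run first.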
ITEM_PIPELINES = {
# 'zhaopin.pipelines.ZhaopinPipeline': 300,
'Jobs.pipelines.MongoPipeline': 400
}
MONGO_URI = 'localhost'       # must match the setting name read in MongoPipeline.from_crawler
MONGO_DATABASE = 'shuju'      # database name

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Jobs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Jobs.middlewares.JobsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Jobs.middlewares.JobsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'Jobs.pipelines.JobsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Set the log level to ERROR; messages below this level will not be emitted
LOG_LEVEL = 'ERROR'
# Disable logging output entirely
# LOG_ENABLED = False

# Keep Chinese characters in feed exports instead of Unicode-escaping them
FEED_EXPORT_ENCODING = 'utf-8'
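With the settings above in place, running the spider (scrapy crawl <your spider name>) should write one document per item into MongoDB. A quick way to confirm this is a few lines of pymongo; this is just a sanity-check sketch, assuming MongoDB is running locally and the database and collection names match the ones configured above.

import pymongo

client = pymongo.MongoClient('localhost')   # same value as MONGO_URI
db = client['shuju']                        # same value as MONGO_DATABASE
collection = db['collection_name']          # same as MongoPipeline.collection_name

print(collection.count_documents({}))       # number of stored items
print(collection.find_one())                # one sample document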