开始用scrapy 爬取数据的时候  开始用同步操作始终会报1064  的错误  因为 mysql 语法和导入的字段不兼容

尝试了  n  次之后  开始用  异步爬取  虽然一路报错 但是还是能把数据保存到mysql 数据库里

关于spider:

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
import re
from copy import deepcopy
from ..items import MyspiderItem class TbSpider(scrapy.Spider):
name = 'tb' allowed_domains = []
start_urls = ['http://tieba.baidu.com/mo/q----,sz@320_240-1-3---2/m?kw=%E6%A1%82%E6%9E%97%E7%94%B5%E5%AD%90%E7%A7%91%E6%8A%80%E5%A4%A7%E5%AD%A6%E5%8C%97%E6%B5%B7%E6%A0%A1%E5%8C%BA&pn=26140',
] def parse(self, response): # 总页面
item = MyspiderItem() all_elements = response.xpath(".//div[@class='i']")
# print(all_elements) for all_element in all_elements:
content = all_element.xpath("./a/text()").extract_first()
content = "".join(content.split())
change = re.compile(r'[\d]+.')
content = change.sub('', content)
item['comment'] = content person = all_element.xpath("./p/text()").extract_first()
person = "".join(person.split())
# 去掉点赞数 评论数
change2 = re.compile(r'点[\d]+回[\d]+')
person = change2.sub('',person)
# 选择日期
change3 = re.compile(r'[\d]?[\d]?-[\d][\d](?=)')
date = change3.findall(person) # 如果为今天则选择时间
change4 = re.compile(r'[\d]?[\d]?:[\d][\d](?=)')
time = change4.findall(person) person = change3.sub('',person)
person = change4.sub('',person) if time ==[]:
item['time'] = date
else:
item['time'] = time item['name'] = person # 增加密码 活跃
item['is_active'] = ''
item['password'] = '' print(item)
yield item # 下一页
next_url ='http://tieba.baidu.com/mo/q----,sz@320_240-1-3---2/' + parse.unquote( response.xpath(".//div[@class='bc p']/a/@href").extract_first()) print(next_url)
yield scrapy.Request(
next_url,
callback=self.parse, )

关于  item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html import scrapy class MyspiderItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
comment = scrapy.Field()
time = scrapy.Field()
name = scrapy.Field()
password = scrapy.Field()
is_active = scrapy.Field()

关于setting

# -*- coding: utf-8 -*-

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'mySpider' SPIDER_MODULES = ['mySpider.spiders']
NEWSPIDER_MODULE = 'mySpider.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36' MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'mu_ke'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root' # Obey robots.txt rules
ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'mySpider.pipelines.MysqlPipelineTwo':200, } # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

关于  异步的爬取   重点

import pymysql
from twisted.enterprise import adbapi class MysqlPipelineTwo(object):
def __init__(self, dbpool):
self.dbpool = dbpool @classmethod
def from_settings(cls, settings): # 函数名固定,会被scrapy调用,直接可用settings的值
"""
数据库建立连接
:param settings: 配置参数
:return: 实例化参数
"""
adbparams = dict(
host='127.0.0.1',
db='mu_ke',
user='root',
password='root',
cursorclass=pymysql.cursors.DictCursor # 指定cursor类型
)
# 连接数据池ConnectionPool,使用pymysql或者Mysqldb连接
dbpool = adbapi.ConnectionPool('pymysql', **adbparams)
# 返回实例化参数
return cls(dbpool) def process_item(self, item, spider):
"""
使用twisted将MySQL插入变成异步执行。通过连接池执行具体的sql操作,返回一个对象
"""
query = self.dbpool.runInteraction(self.do_insert, item) # 指定操作方法和操作数据
# 添加异常处理
query.addCallback(self.handle_error) # 处理异常 def do_insert(self, cursor, item):
# 对数据库进行插入操作,并不需要commit,twisted会自动commit
insert_sql = """
insert into login_person(name,password,is_active,comment,time) VALUES(%s,%s,%s,%s,%s)
"""
cursor.execute(insert_sql, (item['name'], item['password'], item['is_active'], item['comment'],
item['time'])) def handle_error(self, failure):
if failure:
# 打印错误信息
print(failure)

scrapy 实现mysql 数据保存的更多相关文章

  1. scrapy爬取数据保存csv、mysql、mongodb、json

    目录 前言 Items Pipelines 前言 用Scrapy进行数据的保存进行一个常用的方法进行解析 Items item 是我们保存数据的容器,其类似于 python 中的字典.使用 item ...

  2. Echart显示后端mysql数据

    一.基本思想 1.将数据存储在mysql数据库中 2.后端链接数据库,将数据库中的数据保存为json格式 3.将json格式数据使用ajax传到前端JSP页面中的Echarts 二.实现的关键点 1. ...

  3. scrapy爬虫事件以及数据保存为txt,json,mysql

    今天要爬取的网页是虎嗅网 我们将完成如下几个步骤: 创建一个新的Scrapy工程 定义你所需要要抽取的Item对象 编写一个spider来爬取某个网站并提取出所有的Item对象 编写一个Item Pi ...

  4. 第三百四十二节,Python分布式爬虫打造搜索引擎Scrapy精讲—爬虫数据保存

    第三百四十二节,Python分布式爬虫打造搜索引擎Scrapy精讲—爬虫数据保存 注意:数据保存的操作都是在pipelines.py文件里操作的 将数据保存为json文件 spider是一个信号检测 ...

  5. 二十一 Python分布式爬虫打造搜索引擎Scrapy精讲—爬虫数据保存

    注意:数据保存的操作都是在pipelines.py文件里操作的 将数据保存为json文件 spider是一个信号检测 # -*- coding: utf-8 -*- # Define your ite ...

  6. EF 连接MySQL 数据库  保存中文数据后乱码问题

    EF 连接MySQL 数据库  保存中文数据后乱码问题 采用Code First 生成的数据库,MySQL数据库中,生成的表的编码格式为***** 发现这个问题后,全部手动改成UTF8(图是另一个表的 ...

  7. pandas数据保存至Mysql数据库

    pandas数据保存至Mysql数据库 import pandas as pd from sqlalchemy import create_engine host = '127.0.0.1' port ...

  8. Spark使用Java读取mysql数据和保存数据到mysql

    原文引自:http://blog.csdn.net/fengzhimohan/article/details/78471952 项目应用需要利用Spark读取mysql数据进行数据分析,然后将分析结果 ...

  9. 简书全站爬取 mysql异步保存

    # 简书网 # 数据保存在mysql中; 将selenium+chromedriver集成到scrapy; 整个网站数据爬取 # 抓取ajax数据 #爬虫文件 # -*- coding: utf-8 ...

随机推荐

  1. laravel 邮箱认证

    修改 User 模型,将 Laravel 自带的邮箱认证功能集成到我们的程序中 <?php namespace App\Models; use Illuminate\Notifications\ ...

  2. (2)Linux Java环境变量安装

    install default JRE/JDK Installing Java with apt-get is easy. First, update the package index: sudo ...

  3. L298N模块的使用(文末有惊喜)

    这个模块有两个供电口,标示着“12V输入”的是功率驱动电源输入,供电范围可以是7-46V, 一般12V供电就能满足我们大部分的DIY需求.标示着“5V输出可不接”的是逻辑供电, 当我们将“板载5V输出 ...

  4. 吴裕雄 Bootstrap 前端框架开发——Bootstrap 网格系统实例:手机、平板电脑、台式电脑

    <!DOCTYPE html> <html> <head> <title>Bootstrap 实例 - 手机.平板电脑.台式电脑</title&g ...

  5. IDEA中常用优化设置

    1.设置鼠标悬浮提示 Editor->General 这里要勾选下,后面设置的是延迟时间 默认半秒:设置后,我们鼠标移动到类上看看: 2.显示方法分隔符 Editor->General - ...

  6. 【译】索引进阶(十七): SQL SERVER索引最佳实践

    [译注:此文为翻译,由于本人水平所限,疏漏在所难免,欢迎探讨指正] 原文链接:传送门. 在本章我们给出一些建议:贯穿本系列我们提取出了十四条基本指南,这些基本的指南将会帮助你为你的数据库创建最佳的索引 ...

  7. IIS 配置迁移

    使用管理员身份运行cmd 应用程序池: # 导出所有应用程序池 %windir%\system32\inetsrv\appcmd list apppool /config /xml > c:\a ...

  8. druid监控sql完整版

    利用Druid实现应用和SQL监控 一.关于Druid Druid是一个JDBC组件,它包括三部分: DruidDriver 代理Driver,能够提供基于Filter-Chain模式的插件体系. D ...

  9. HA: Infinity Stones-Write-up

    下载地址:点我 哔哩哔哩:点我 主题还是关于复仇者联盟的,这次是无限宝石的. 信息收集 虚拟机的IP为:192.168.116.137 ➜ ~ nmap -sn 192.168.116.1/24 St ...

  10. mongodb插入性能

    转自 https://blog.csdn.net/asdfsadfasdfsa/article/details/60872180 MongoDB与MySQL的插入.查询性能测试     7.1  平均 ...