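When I first used Scrapy to crawl data, I wrote the results to MySQL synchronously and kept getting error 1064, because the MySQL statement syntax and the fields being inserted were incompatible.

After n attempts I switched to an asynchronous approach. Errors were still reported along the way, but the data did get saved to the MySQL database.

The spider: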

# -*- coding: utf-8 -*-
import re
from urllib import parse

import scrapy

from ..items import MyspiderItem


class TbSpider(scrapy.Spider):
    name = 'tb'
    allowed_domains = []
    start_urls = [
        'http://tieba.baidu.com/mo/q----,sz@320_240-1-3---2/m?kw=%E6%A1%82%E6%9E%97%E7%94%B5%E5%AD%90%E7%A7%91%E6%8A%80%E5%A4%A7%E5%AD%A6%E5%8C%97%E6%B5%B7%E6%A0%A1%E5%8C%BA&pn=26140',
    ]

    def parse(self, response):
        # Listing page: each <div class="i"> holds one thread
        all_elements = response.xpath(".//div[@class='i']")
        for all_element in all_elements:
            item = MyspiderItem()  # a fresh item per thread

            # Thread title: drop whitespace and digit runs (e.g. the leading index)
            content = all_element.xpath("./a/text()").extract_first() or ''
            content = "".join(content.split())
            change = re.compile(r'[\d]+.')
            content = change.sub('', content)
            item['comment'] = content

            person = all_element.xpath("./p/text()").extract_first() or ''
            person = "".join(person.split())
            # Remove the like count and reply count
            change2 = re.compile(r'点[\d]+回[\d]+')
            person = change2.sub('', person)
            # Pick out the date
            change3 = re.compile(r'[\d]?[\d]?-[\d][\d]')
            dates = change3.findall(person)
            # Posts from today show a clock time instead of a date
            change4 = re.compile(r'[\d]?[\d]?:[\d][\d]')
            times = change4.findall(person)
            person = change3.sub('', person)
            person = change4.sub('', person)

            # findall() returns a list; store a plain string so the value
            # can be bound to a single SQL column
            if times:
                item['time'] = times[0]
            else:
                item['time'] = dates[0] if dates else ''
            item['name'] = person
            # Extra columns expected by the table: password and active flag
            item['is_active'] = ''
            item['password'] = ''
            yield item

        # Next page
        next_href = response.xpath(".//div[@class='bc p']/a/@href").extract_first()
        if next_href:
            next_url = 'http://tieba.baidu.com/mo/q----,sz@320_240-1-3---2/' + parse.unquote(next_href)
            yield scrapy.Request(next_url, callback=self.parse)
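
The name and time cleanup done in parse can be tried outside Scrapy. The following is a minimal sketch that reuses the post's three regular expressions; the helper name split_person and the sample strings are hypothetical, chosen only to mimic the "点N回N name date" layout the patterns imply. It returns the time as a single string rather than the list that findall() produces, which is an assumption about what the table column should receive.

```python
import re

# Same patterns as in the spider
LIKES_REPLIES = re.compile(r'点[\d]+回[\d]+')   # like count and reply count
DATE = re.compile(r'[\d]?[\d]?-[\d][\d]')        # e.g. 11-28
CLOCK = re.compile(r'[\d]?[\d]?:[\d][\d]')       # e.g. 9:05 (posts from today)


def split_person(raw):
    """Split the raw author line into (name, time_string)."""
    text = "".join(raw.split())          # drop all whitespace
    text = LIKES_REPLIES.sub('', text)   # drop the counters
    dates = DATE.findall(text)
    clocks = CLOCK.findall(text)
    name = CLOCK.sub('', DATE.sub('', text))
    # Prefer the clock time, fall back to the date, else an empty string
    if clocks:
        stamp = clocks[0]
    elif dates:
        stamp = dates[0]
    else:
        stamp = ''
    return name, stamp


print(split_person("点 3 回 12 alice 11-28"))
print(split_person("点 0 回 5 bob 9:05"))
```

Note that a user name ending in digits would be partly swallowed by the date or clock pattern; the sketch keeps that limitation of the original expressions.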

The item:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    comment = scrapy.Field()
    time = scrapy.Field()
    name = scrapy.Field()
    password = scrapy.Field()
    is_active = scrapy.Field()

The settings:

# -*- coding: utf-8 -*-

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'mySpider'

SPIDER_MODULES = ['mySpider.spiders']
NEWSPIDER_MODULE = 'mySpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

# MySQL connection settings
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'mu_ke'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'mySpider.pipelines.MysqlPipelineTwo': 200,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The asynchronous pipeline (the key part):

import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi


class MysqlPipelineTwo(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        """
        Set up the database connection pool.

        The method name is fixed: Scrapy calls it and passes in the project
        settings, so their values can be used directly. (Newer Scrapy
        releases prefer from_crawler for the same purpose.)
        :param settings: project settings
        :return: a pipeline instance
        """
        adbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWD'],
            cursorclass=pymysql.cursors.DictCursor,  # cursor type
        )
        # Connection pool; the first argument names the DB-API module
        # (pymysql or MySQLdb)
        dbpool = adbapi.ConnectionPool('pymysql', **adbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        """
        Run the MySQL insert asynchronously through Twisted's connection
        pool. runInteraction returns a Deferred.
        """
        query = self.dbpool.runInteraction(self.do_insert, item)
        # Errback: called only if the insert fails
        query.addErrback(self.handle_error)
        return item

    def do_insert(self, cursor, item):
        # No explicit commit is needed; runInteraction commits on success
        insert_sql = """
            insert into login_person(name,password,is_active,comment,time) VALUES(%s,%s,%s,%s,%s)
        """
        cursor.execute(insert_sql, (item['name'], item['password'], item['is_active'],
                                    item['comment'], item['time']))

    def handle_error(self, failure):
        if failure:
            # Print the error
            print(failure)
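
The post does not pin down the exact cause of the 1064 error. One plausible contributor (an assumption) is that re.findall() returns a list, which is not a plain scalar value for a single SQL column. The sketch below shows a defensive conversion of an item into a tuple of strings before it is bound to the insert statement. The helper to_row is hypothetical, the column types are assumed because the post does not show the table definition, and sqlite3 stands in for MySQL so the example runs without a server (sqlite3 uses ? placeholders where pymysql uses %s).

```python
import sqlite3

COLUMNS = ("name", "password", "is_active", "comment", "time")


def to_row(item):
    """Return the item's values as a tuple of plain strings, in column order."""
    row = []
    for key in COLUMNS:
        value = item.get(key, "")
        if isinstance(value, (list, tuple)):
            # e.g. the result of re.findall(): keep the first match, or ''
            value = value[0] if value else ""
        row.append("" if value is None else str(value))
    return tuple(row)


# In-memory SQLite database as a stand-in for the MySQL table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login_person "
    "(name TEXT, password TEXT, is_active TEXT, comment TEXT, time TEXT)"
)

item = {"name": "alice", "password": "", "is_active": "",
        "comment": "hello", "time": ["11-28"]}
conn.execute(
    "INSERT INTO login_person (name, password, is_active, comment, time) "
    "VALUES (?, ?, ?, ?, ?)",
    to_row(item),
)
rows = conn.execute("SELECT name, comment, time FROM login_person").fetchall()
print(rows)
```

In the pipeline, the same tuple could be passed as the second argument of cursor.execute in do_insert.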
