Python: scraping data with Scrapy and saving it to a MySQL database
1. Create the project
- scrapy startproject tencent
2. Create the spider
- scrapy genspider mahuateng careers.tencent.com
3. Since the data is going into a database, install pymysql
- pip install pymysql
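Before wiring pymysql into Scrapy, it can be worth a quick standalone check that the driver is installed and MySQL is reachable. This is only a sketch; the connection values mirror the MYSQL_* settings configured in step 4 below.

```python
# Sanity check: pymysql is importable and MySQL accepts the credentials.
# Host/user/password/database mirror the MYSQL_* values in settings.py (step 4).
import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='yang156122', database='test', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT VERSION()")
        print(cursor.fetchone())   # e.g. ('5.5.19',)
finally:
    conn.close()
```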
4. Configure settings.py, including the database connection details
```python
# -*- coding: utf-8 -*-

# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

LOG_LEVEL = "WARNING"
LOG_FILE = "./qq.log"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MySQL connection settings
# Database host
MYSQL_HOST = 'localhost'
# Database user
MYSQL_USER = 'root'
# Database password
MYSQL_PASSWORD = 'yang156122'
# Database port
MYSQL_PORT = 3306
# Database name
MYSQL_DBNAME = 'test'
# Database character set
MYSQL_CHARSET = 'utf8'
```
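The MYSQL_* keys are custom settings, so Scrapy itself ignores them; only the pipeline in step 7 reads them. If you want to confirm they load, a minimal sketch run from the project root (assuming the standard scrapy.cfg layout):

```python
# Confirm the custom MYSQL_* settings are picked up from settings.py.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('MYSQL_HOST'), settings.getint('MYSQL_PORT'), settings.get('MYSQL_DBNAME'))
```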
5. Define the data fields in items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    """
    Field definitions for one job posting.
    """
    postId = scrapy.Field()
    recruitPostId = scrapy.Field()
    recruitPostName = scrapy.Field()
    countryName = scrapy.Field()
    locationName = scrapy.Field()
    categoryName = scrapy.Field()
    lastUpdateTime = scrapy.Field()
```
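Note that the spider in the next step yields plain dicts rather than TencentItem objects; Scrapy accepts both. If you prefer typed items, a sketch of a small helper is below; the PascalCase keys are the field names the careers API returns, the same ones the pipeline reads in step 7. If you adopt it, process_item would then read the lowercase keys instead of the API's PascalCase ones.

```python
# Optional: build a TencentItem from one API record instead of yielding the raw dict.
from tencent.items import TencentItem

def build_item(con):
    item = TencentItem()
    item['postId'] = con.get('PostId')
    item['recruitPostId'] = con.get('RecruitPostId')
    item['recruitPostName'] = con.get('RecruitPostName')
    item['countryName'] = con.get('CountryName')
    item['locationName'] = con.get('LocationName')
    item['categoryName'] = con.get('CategoryName')
    item['lastUpdateTime'] = con.get('LastUpdateTime')
    return item
```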
6. The spider mahuateng.py does the actual scraping
```python
# -*- coding: utf-8 -*-
import json

import scrapy


class MahuatengSpider(scrapy.Spider):
    name = 'mahuateng'
    allowed_domains = ['careers.tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1561688387174&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40003&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    pageNum = 1

    def parse(self, response):
        """
        Parse one page of the job-listing API.
        :param response:
        :return:
        """
        content = json.loads(response.body.decode())
        posts = content['Data']['Posts']
        for con in posts:
            # Drop keys with empty values
            for key in list(con.keys()):
                if not con.get(key):
                    del con[key]
            # Yield one record per job posting
            yield con

        ##### Pagination #####
        self.pageNum = self.pageNum + 1
        if self.pageNum <= 118:
            next_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1561688387174&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40003&attrId=&keyword=&pageIndex=" + str(self.pageNum) + "&pageSize=10&language=zh-cn&area=cn"
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )
```
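On Scrapy 2.2 or newer, the manual json.loads(response.body.decode()) can be replaced with the built-in JSON helper; a minimal sketch:

```python
# Scrapy >= 2.2 only: TextResponse.json() parses the JSON body directly.
posts = response.json()['Data']['Posts']
```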
7. pipelines.py processes each item, including writing it to MySQL
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import copy
import logging

from pymysql import cursors
from twisted.enterprise import adbapi


class TencentPipeline(object):

    def __init__(self, db_pool):
        self.db_pool = db_pool

    @classmethod
    def from_settings(cls, settings):
        """Class method, called once: set up the database connection pool."""
        db_params = dict(
            host=settings['MYSQL_HOST'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            port=settings['MYSQL_PORT'],
            database=settings['MYSQL_DBNAME'],
            charset=settings['MYSQL_CHARSET'],
            use_unicode=True,
            # Return rows as dicts
            cursorclass=cursors.DictCursor
        )
        # Create the connection pool
        db_pool = adbapi.ConnectionPool('pymysql', **db_params)
        # Return a pipeline instance
        return cls(db_pool)

    def process_item(self, item, spider):
        """
        Map the API fields onto the table columns and queue the insert.
        :param item:
        :param spider:
        :return:
        """
        myItem = {}
        myItem["postId"] = item["PostId"]
        myItem["recruitPostId"] = item["RecruitPostId"]
        myItem["recruitPostName"] = item["RecruitPostName"]
        myItem["countryName"] = item["CountryName"]
        myItem["locationName"] = item["LocationName"]
        myItem["categoryName"] = item["CategoryName"]
        myItem["lastUpdateTime"] = item["LastUpdateTime"]
        logging.warning(myItem)
        # Deep copy the item before handing it to the async connection pool.
        # This is what fixes the duplicated-data problem.
        asynItem = copy.deepcopy(myItem)
        # Queue the SQL on the connection pool
        query = self.db_pool.runInteraction(self.insert_into, asynItem)
        # If the SQL fails, handle_error() is called via the errback
        query.addErrback(self.handle_error, myItem, spider)
        return myItem

    def insert_into(self, cursor, item):
        """Build and execute the INSERT statement."""
        sql = "INSERT INTO tencent (postId,recruitPostId,recruitPostName,countryName,locationName,categoryName,lastUpdateTime) VALUES ('{}','{}','{}','{}','{}','{}','{}')".format(
            item['postId'], item['recruitPostId'], item['recruitPostName'], item['countryName'],
            item['locationName'], item['categoryName'], item['lastUpdateTime'])
        cursor.execute(sql)

    def handle_error(self, failure, item, spider):
        """Log any database error."""
        print("failure", failure)
```
8. Create the database table
```sql
/*
Navicat MySQL Data Transfer

Source Server         : local
Source Server Version : 50519
Source Host           : localhost:3306
Source Database       : test

Target Server Type    : MYSQL
Target Server Version : 50519
File Encoding         : 65001

Date: 2019-06-28 12:47:06
*/

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for tencent
-- ----------------------------
DROP TABLE IF EXISTS `tencent`;
CREATE TABLE `tencent` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `postId` varchar(100) DEFAULT NULL,
  `recruitPostId` varchar(100) DEFAULT NULL,
  `recruitPostName` varchar(100) DEFAULT NULL,
  `countryName` varchar(100) DEFAULT NULL,
  `locationName` varchar(100) DEFAULT NULL,
  `categoryName` varchar(100) DEFAULT NULL,
  `lastUpdateTime` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1181 DEFAULT CHARSET=utf8;
```
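9. Run the crawl

With the table in place, start the spider from the project root:
- scrapy crawl mahuateng

When it finishes, a quick SELECT COUNT(*) FROM tencent; in MySQL should show around 1,180 rows (118 pages, 10 postings per page).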
And that's a wrap!