python之scrapy篇(一)
一、首先创建工程(cmd中进行)
scrapy startproject xxx
二、编写Item文件
添加要字段
# -*- coding: utf-8 -*- # Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html import scrapy class DoubanItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
# 电影标题
title = scrapy.Field() # 电影信息
info = scrapy.Field() # 电影评分
score = scrapy.Field() # 评分人数
number = scrapy.Field() # 简介
content = scrapy.Field()
# 简介
content = scrapy.Field()
三、进入spider文件(cmd中进行)
scrapy genspider demo 'www.movie.douban.com'
创建完成后进入
编写代码
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem class DoubanmovieSpider(scrapy.Spider):
name = 'doubanmovie'
allowed_domains = ['movie.douban.com']
offset = 0
url = "https://movie.douban.com/top250?start="
start_urls = (
url + str(offset),
) def parse(self, response):
item = DoubanItem()
# 标题
movies = response.xpath("//div[@class='info']") for movie in movies:
name = movie.xpath('div[@class="hd"]/a/span/text()').extract() message = movie.xpath('div[@class="bd"]/p/text()').extract() star = movie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract() number = movie.xpath('div[@class="bd"]/div[@class="star"]/span/text()').extract() quote = movie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract() if quote:
quote = quote[0]
else:
quote = '' item['title'] = ''.join(name)
item['info'] = quote
item['score'] = star[0]
item['content'] = ';'.join(message).replace(' ', '').replace('\n', '')
item['number'] = number[1].split('人')[0] yield item if self.offset < 225:
self.offset += 25
yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
四、配置setting文件
# -*- coding: utf-8 -*- # Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'douban' SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36" # Obey robots.txt rules
# ROBOTSTXT_OBEY = True # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
} # Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban.middlewares.DoubanSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'douban.middlewares.DoubanDownloaderMiddleware': 100,
'douban.middlewares.RandomUserAgent': 100,
} USER_AGENT = [
# Opera
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
"Opera/8.0 (Windows NT 5.1; U; en)",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
# Firefox
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
# Safari
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
# chrome
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
# 360
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
# 淘宝浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
# 猎豹浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
# QQ浏览器
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
# sogou浏览器
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
# maxthon浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
# UC浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
] # Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'douban.pipelines.DoubanPipeline': 300,
'douban.pipelines.DoubanSqlPipeline': 300,
'douban.pipelines.DoubanWritePipeline': 250,
}
# 主机名
MYSQL_HOST = "IP"
# 端口号
MYSQL_PORT = 3306
# 数据库用户名
MYSQL_USER = "root"
# 数据库密码
MYSQL_PASSWORD = "Password"
# 数据库名称
MYSQL_DBNAME = "mydouban"
# 存放数据的表名称
MYSQL_TABLENAME = "doubanmovies"
# 数据库编码
MYSQL_CHARSET = "utf8" # Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
配置管道文件
ITEM_PIPELINES = {
# 'douban.pipelines.DoubanPipeline': 300,
'douban.pipelines.DoubanSqlPipeline': 300,
'douban.pipelines.DoubanWritePipeline': 250,
}
User_Agent配置(包含了大多数浏览器)
DOWNLOADER_MIDDLEWARES = {
# 'douban.middlewares.DoubanDownloaderMiddleware': 100,
'douban.middlewares.RandomUserAgent': 100,
}
USER_AGENT = [
# Opera
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
"Opera/8.0 (Windows NT 5.1; U; en)",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
# Firefox
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
# Safari
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
# chrome
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
# 360
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
# 淘宝浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
# 猎豹浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
# QQ浏览器
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
# sogou浏览器
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
# maxthon浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
# UC浏览器
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
]
# 主机名
MYSQL_HOST = "IP"
# 端口号
MYSQL_PORT = 3306
# 数据库用户名
MYSQL_USER = "root"
# 数据库密码
MYSQL_PASSWORD = "password"
# 数据库名称
MYSQL_DBNAME = "mydouban"
# 存放数据的表名称
MYSQL_TABLENAME = "doubanmovies"
# 数据库编码
MYSQL_CHARSET = "utf8"
五、管道文件pipelines
# -*- coding: utf-8 -*- # Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html import pymongo
import pymysql
from scrapy import settings
import json
import logging
from pymysql import cursors
from twisted.enterprise import adbapi
import time
import copy #保存到本地
class DoubanWritePipeline(object):
def __init__(self):
self.filename = open("douban.json", "wb") def process_item(self, item, spider):
text = json.dumps(dict(item), ensure_ascii=False) + "\n"
self.filename.write(text.encode("utf-8"))
return item def colse_spider(self, spider):
self.filename.close() #保存到mysql(提前建立好数据库)
class DoubanSqlPipeline(object):
def __init__(self):
self.conn = pymysql.connect(host='IP', user='root',
passwd='password', db='mydouban', charset='utf8')
self.cur = self.conn.cursor() def process_item(self, item, spider):
title = item["title"]
info = item["info"]
score = item["score"]
number = item["number"]
content = item["content"]
# 创建sql语句
sql = """INSERT INTO doubanmovies (title,info,score,number,content,createtime) VALUES ("{}","{}","{}","{}","{}","{}")""".format(
pymysql.escape_string(title), pymysql.escape_string(info), pymysql.escape_string(score), pymysql.escape_string(number), pymysql.escape_string(content),pymysql.escape_string(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
# 执行sql语句
self.conn.ping(reconnect=True)
self.cur.execute(sql)
self.conn.commit() def close_spider(self, spider):
self.cur.close()
self.conn.close()
运行
scrapy crawl xxx
OVER!
下面展示一下效果...
python之scrapy篇(一)的更多相关文章
- python之scrapy篇(三)
一.创建工程(cmd) scrapy startproject xxxx 二.编写item文件 # -*- coding: utf-8 -*- # Define here the models for ...
- python之scrapy篇(二)
一.创建工程 scarpy startproject xxx 二.编写iteam文件 # -*- coding: utf-8 -*- # Define here the models for your ...
- Python爬虫Scrapy框架入门(0)
想学习爬虫,又想了解python语言,有个python高手推荐我看看scrapy. scrapy是一个python爬虫框架,据说很灵活,网上介绍该框架的信息很多,此处不再赘述.专心记录我自己遇到的问题 ...
- [Python爬虫] scrapy爬虫系列 <一>.安装及入门介绍
前面介绍了很多Selenium基于自动测试的Python爬虫程序,主要利用它的xpath语句,通过分析网页DOM树结构进行爬取内容,同时可以结合Phantomjs模拟浏览器进行鼠标或键盘操作.但是,更 ...
- python爬虫Scrapy(一)-我爬了boss数据
一.概述 学习python有一段时间了,最近了解了下Python的入门爬虫框架Scrapy,参考了文章Python爬虫框架Scrapy入门.本篇文章属于初学经验记录,比较简单,适合刚学习爬虫的小伙伴. ...
- Python之Scrapy爬虫框架安装及简单使用
题记:早已听闻python爬虫框架的大名.近些天学习了下其中的Scrapy爬虫框架,将自己理解的跟大家分享.有表述不当之处,望大神们斧正. 一.初窥Scrapy Scrapy是一个为了爬取网站数据,提 ...
- dota玩家与英雄契合度的计算器,python语言scrapy爬虫的使用
首发:个人博客,更新&纠错&回复 演示地址在这里,代码在这里. 一个dota玩家与英雄契合度的计算器(查看效果),包括两部分代码: 1.python的scrapy爬虫,总体思路是pag ...
- python爬虫scrapy框架——人工识别登录知乎倒立文字验证码和数字英文验证码(2)
操作环境:python3 在上一文中python爬虫scrapy框架--人工识别知乎登录知乎倒立文字验证码和数字英文验证码(1)我们已经介绍了用Requests库来登录知乎,本文如果看不懂可以先看之前 ...
- python爬虫scrapy项目详解(关注、持续更新)
python爬虫scrapy项目(一) 爬取目标:腾讯招聘网站(起始url:https://hr.tencent.com/position.php?keywords=&tid=0&st ...
随机推荐
- day6(短信验证接口)
1.注册容联云账号 1.1注册账号 https://www.yuntongxun.com/user/login 1.2登录即可看到开发者账号信息 1.3 添加测试账号 2.使用容联云发送代码测试 ' ...
- 2017 Mid Central Regional G.Hopscotch (组合计数)
这道题有点意思,给出点(N,N),你在原点处向目标点走,每次只能向x和y两个方向走路,每次xy两个方向的步幅分别不能小于dx和dy,问走到终点的方案数,答案对1e9 + 7取模 这道题最直接的想法就是 ...
- C++ 虚基类的定义、功能、规定
原文声明:http://blog.sina.com.cn/s/blog_93b45b0f01011pkz.html 虚继承和虚基类的定义是非常的简单的,同时也是非常容易判断一个继承是否是虚继承的,虽然 ...
- 第6.5节 exec函数:一个自说自话的强大Python动态编译器
在Python动态执行的函数中,exec是用于执行一个字符串内包含的Python源码或其编译后对应的字节码. 一. 语法 1. exec(Code, globals=None, local ...
- Python中可迭代对象是什么?
Python中可迭代对象(Iterable)并不是指某种具体的数据类型,它是指存储了元素的一个容器对象,且容器中的元素可以通过__iter__( )方法或__getitem__( )方法访问. __i ...
- PyQt(Python+Qt)学习随笔:Qt Designer中Action关联menu菜单和toolBar的方法
1.Action关联菜单 通过菜单创建的Action,已经与菜单自动关联,如果是单独创建的Action,需要与菜单挂接时,直接将Action Editor中定义好的Action对象拖拽到菜单栏上即可以 ...
- 再次学习sql注入
爆所有数据库 select schema_name from information_schema.schemata 先爆出多少个字段 id = 1 order by ?; mysql5.0及以上 都 ...
- ASP自动刷新页面的实现方法总结
1) <meta http-equiv="refresh" content="10"> 10表示间隔10秒刷新一次 2) <script> ...
- NOIP2020 浙江 游记
day - ? 由于 CSP-S 的失利,感觉这一次 NOIP 的心态反而是非常的淡定,感觉反正已经炸过一次了,再炸一次好像也没什么,就抱着这样的心态去考试的. day 1 考试当天起晚了,到考场的时 ...
- Linux 批量创建user和批量删除用户
Linux 批量创建user和批量删除用户 以下为批量创建用户: #首先我们需要创建一个xxx.txt文件,把需要的我们创建的用户写在这个文本里面来,注意:每写完一个用户都需要换行. vim user ...