1. Create the project

Here the project is named scrapyuniversal, and I create it in the root of drive D. The steps are as follows.

Open cmd, switch to the root of drive D, and enter the following command:

scrapy startproject scrapyuniversal

If the command succeeds, a folder named scrapyuniversal is created in the root of drive D.
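
The generated layout (before any custom files are added) looks roughly like the following; the exact set of files varies slightly between Scrapy versions. The extra files shown in section 3, such as start.py, spider.sql and loaders.py, are added by hand later.

scrapyuniversal/
    scrapy.cfg
    scrapyuniversal/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py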

2. Generate the crawl template

Open a command-line window, change into the scrapyuniversal folder just created on drive D, and enter the following command:

scrapy genspider -t crawl china tech.china.com

If it succeeds, a new spider file appears in the spiders directory under scrapyuniversal. The finished, commented version of this file is shown in section 4 below.
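
For reference, the freshly generated file looks roughly like this before any editing (the exact skeleton differs slightly between Scrapy versions); section 4 shows it after being filled in:

# spiders/china.py as generated by `scrapy genspider -t crawl china tech.china.com` (approximate)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]').extract()
        return item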

3. Directory structure

scrapyuniversal
│  scrapy.cfg
│  spider.sql
│  start.py
│
└─scrapyuniversal
    │  items.py
    │  loaders.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  china.py
    │  │  __init__.py
    │  │
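
start.py appears in the tree but its contents are not shown in the post; a common minimal launcher (assumed here, not necessarily the author's actual file) simply runs the spider from inside the IDE:

# start.py - assumed minimal launcher (the original post does not show this file)
from scrapy.cmdline import execute

# Run from the project root; equivalent to typing `scrapy crawl china` on the command line
execute(['scrapy', 'crawl', 'china'])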

  

4. china.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import NewsItem
from ..loaders import ChinaLoader


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles/']

    # When a link is followed, the response it returns is fed through the rules
    # again for further extraction.
    # The second rule restricts pagination to the first two pages only.
    rules = (
        Rule(LinkExtractor(allow=r'article/.*\.html',
                           restrict_xpaths='//div[@id="left_side"]//div[@class="con_item"]'),
             callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//div[@id='pageStyle']//span[text()<3]")),
    )

    def parse_item(self, response):
        # Plain Item assignment (kept below for reference) produced messy output,
        # so an ItemLoader is used instead.
        # item = NewsItem()
        # item['title'] = response.xpath("//h1[@id='chan_newsTitle']/text()").extract_first()
        # item['url'] = response.url
        # item['text'] = ''.join(response.xpath("//div[@id='chan_newsDetail']//text()").extract()).strip()
        # # re_first extracts the publish time with a regular expression
        # item['datetime'] = response.xpath("//div[@id='chan_newsInfo']/text()").re_first('(\d+-\d+-\d+\s\d+:\d+:\d+)')
        # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first("来源: (.*)").strip()
        # item['website'] = "中华网"
        # yield item

        loader = ChinaLoader(item=NewsItem(), response=response)
        loader.add_xpath('title', '//h1[@id="chan_newsTitle"]/text()')
        loader.add_value('url', response.url)
        loader.add_xpath('text', '//div[@id="chan_newsDetail"]//text()')
        loader.add_xpath('datetime', '//div[@id="chan_newsInfo"]/text()', re=r'(\d+-\d+-\d+\s\d+:\d+:\d+)')
        loader.add_xpath('source', '//div[@id="chan_newsInfo"]/text()', re='来源:(.*)')
        loader.add_value('website', '中华网')
        yield loader.load_item()
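
Before running the full crawl, the two LinkExtractor rules can be sanity-checked interactively with the Scrapy shell (a hedged illustration; results depend on the live page, which may have changed since this was written):

# In a terminal: scrapy shell http://tech.china.com/articles/
# Then, inside the shell:
from scrapy.linkextractors import LinkExtractor

article_le = LinkExtractor(allow=r'article/.*\.html',
                           restrict_xpaths='//div[@id="left_side"]//div[@class="con_item"]')
# Links matched by the first rule; each entry is a Link object with a .url attribute
for link in article_le.extract_links(response)[:5]:
    print(link.url)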

5. loaders.py

#!/usr/bin/env python
# encoding: utf-8
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, Join, Compose


class NewsLoader(ItemLoader):
    """
    Define a common default output processor, TakeFirst.
    TakeFirst: takes the first non-empty element of an iterable,
    equivalent to the extract_first() used with plain Items earlier.
    """
    default_output_processor = TakeFirst()


class ChinaLoader(NewsLoader):
    """
    Compose's first argument, Join, concatenates the list into a string;
    its second argument is a lambda that post-processes that string.
    """
    text_out = Compose(Join(), lambda s: s.strip())
    source_out = Compose(Join(), lambda s: s.strip())
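
To make the processors concrete, here is a small standalone illustration (the sample values are made up for demonstration):

from scrapy.loader.processors import TakeFirst, Join, Compose

take_first = TakeFirst()
# Returns the first non-empty value, like extract_first()
print(take_first(['', None, '2018-06-12 10:30:00']))   # -> '2018-06-12 10:30:00'

text_out = Compose(Join(), lambda s: s.strip())
# Join() glues the extracted fragments together with spaces, then the lambda trims the ends
print(text_out([' first paragraph', 'second paragraph ']))  # -> 'first paragraph second paragraph'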

6. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item


class NewsItem(Item):
    # headline
    title = Field()
    # article URL
    url = Field()
    # body text
    text = Field()
    # publish time
    datetime = Field()
    # source
    source = Field()
    # site name, assigned the fixed value 中华网
    website = Field()

7. Middleware changes: random User-Agent selection

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
import random


class ProcessHeaderMidware():
    """Downloader middleware that adds a random User-Agent to each request."""

    def __init__(self):
        self.USER_AGENT_LIST = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]

    def process_request(self, request, spider):
        """
        Pick a random header from the list and use it as the User-Agent.
        """
        ua = random.choice(self.USER_AGENT_LIST)
        spider.logger.info(msg='now entering download middleware')
        if ua:
            request.headers['User-Agent'] = ua
            # Add desired logging message here.
            spider.logger.info(u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request))

8. settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for scrapyuniversal project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyuniversal'

SPIDER_MODULES = ['scrapyuniversal.spiders']
NEWSPIDER_MODULE = 'scrapyuniversal.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapyuniversal (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapyuniversal.middlewares.ScrapyuniversalSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapyuniversal.middlewares.ScrapyuniversalDownloaderMiddleware': 543,
#}
DOWNLOADER_MIDDLEWARES = {
    'scrapyuniversal.middlewares.ProcessHeaderMidware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapyuniversal.pipelines.ScrapyuniversalPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

HTTP_PROXY = "127.0.0.1:5000"  # replace with the proxy you need
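
ITEM_PIPELINES enables scrapyuniversal.pipelines.ScrapyuniversalPipeline, but the post never shows pipelines.py (given spider.sql in the project tree, the real pipeline presumably writes to MySQL). As a placeholder, the default generated pipeline simply passes items through; a minimal sketch, not the author's actual storage code:

# pipelines.py - default pass-through pipeline (the actual storage code is not shown in the post)
class ScrapyuniversalPipeline(object):

    def process_item(self, item, spider):
        # Store or clean the item here; returning it hands it to the next pipeline
        return item

With all the pieces in place, the crawl is started with scrapy crawl china from the project root, or by running the start.py launcher shown earlier.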

  

 
