一、创建工程

scarpy startproject xxx

二、编写iteam文件

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html import scrapy class TestScrapyItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
name = scrapy.Field()
title = scrapy.Field()
info = scrapy.Field()

二、编写setting文件

# -*- coding: utf-8 -*-

# Scrapy settings for test_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'test_scrapy' SPIDER_MODULES = ['test_scrapy.spiders']
NEWSPIDER_MODULE = 'test_scrapy.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'test_scrapy (+http://www.yourdomain.com)' # Obey robots.txt rules
# ROBOTSTXT_OBEY = True # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
} # Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'test_scrapy.middlewares.TestScrapySpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'test_scrapy.middlewares.TestScrapyDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'test_scrapy.pipelines.TestScrapyPipeline': 300,
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

三、进入spider文件(cmd创建自定义爬虫文件)

scrapy genspider demo 'www.douyu.com'

编写代码

import scrapy
from test_scrapy.items import TestScrapyItem class ItcastSpider(scrapy.Spider):
# 爬虫名
name = "itcast"
# 允许爬虫作用的范围
allowd_domains = ['http://www.xxxx.cn/']
# 爬虫起始的url
start_urls = ["http://www.xxxx.cn/channel/teacher.shtml#ajavaee"] def parse(self, response):
# with open("teacher.html", "w") as f:
# f.write(response.body)
# 通过自带的xpath匹配出所有老师的根节点列表集合
teacher_list = response.xpath('//div[@class="li_txt"]') # 所有老师信息的列表集合
# 遍历根节点集合
for each in teacher_list:
# 保存数据
item = TestScrapyItem()
# name,extract()将匹配出来的结果转换为unicode字符串
# 不加extract()结果为xpath匹配对象
name = each.xpath('./h3/text()').extract()
# title
title = each.xpath('./h4/text()').extract()
# info
info = each.xpath('./p/text()').extract() item['name'] = name[0]
item['title'] = title[0]
item['info'] = info[0] yield item

四、运行

scrapy crawl xxxx

OVER!

python之scrapy篇(二)的更多相关文章

  1. Python:Scrapy(二) 实例分析与总结、写一个爬虫的一般步骤

    学习自:Scrapy爬虫框架教程(二)-- 爬取豆瓣电影TOP250 - 知乎 Python Scrapy 爬虫框架实例(一) - Blue·Sky - 博客园 1.声明Item 爬虫爬取的目标是从非 ...

  2. [Python学习]错误篇二:切换当前工作目录时出错——FileNotFoundError: [WinError 3] 系统找不到指定的路径

    REFERENCE:<Head First Python> ID:我的第二篇[Python学习] BIRTHDAY:2019.7.13 EXPERIENCE_SHARING:解决切换当前工 ...

  3. python之scrapy篇(三)

    一.创建工程(cmd) scrapy startproject xxxx 二.编写item文件 # -*- coding: utf-8 -*- # Define here the models for ...

  4. python之scrapy篇(一)

    一.首先创建工程(cmd中进行) scrapy startproject xxx 二.编写Item文件 添加要字段 # -*- coding: utf-8 -*- # Define here the ...

  5. Python项目--Scrapy框架(二)

    本文主要是利用scrapy框架爬取果壳问答中热门问答, 精彩问答的相关信息 环境 win8, python3.7, pycharm 正文 1. 创建scrapy项目文件 在cmd命令行中任意目录下执行 ...

  6. [Python爬虫] scrapy爬虫系列 <一>.安装及入门介绍

    前面介绍了很多Selenium基于自动测试的Python爬虫程序,主要利用它的xpath语句,通过分析网页DOM树结构进行爬取内容,同时可以结合Phantomjs模拟浏览器进行鼠标或键盘操作.但是,更 ...

  7. python爬虫Scrapy(一)-我爬了boss数据

    一.概述 学习python有一段时间了,最近了解了下Python的入门爬虫框架Scrapy,参考了文章Python爬虫框架Scrapy入门.本篇文章属于初学经验记录,比较简单,适合刚学习爬虫的小伙伴. ...

  8. Python的单元测试(二)

    title: Python的单元测试(二) date: 2015-03-04 19:08:20 categories: Python tags: [Python,单元测试] --- 在Python的单 ...

  9. Python爬虫Scrapy框架入门(0)

    想学习爬虫,又想了解python语言,有个python高手推荐我看看scrapy. scrapy是一个python爬虫框架,据说很灵活,网上介绍该框架的信息很多,此处不再赘述.专心记录我自己遇到的问题 ...

随机推荐

  1. 在 CentOS 7 安装 Tomcat

    一. 安装 JDK 8 1.1 下载 JDK 8 cd /opt/ wget --no-cookies --no-check-certificate --header "Cookie: gp ...

  2. Java虚拟机之内存区域

    原创文章,转载请标明出处! 目录 一.背景 二.运行时内存区域概述 1.官方描述 2.中文翻译 3.内存区域简述 4.运行时数据区简图 5.运行时数据区详图 三.JVM线程 JVM数据区域与线程关系 ...

  3. PyQt(Python+Qt)学习随笔:QToolBox工具箱currentItem对应的index、text、name、icon、ToolTip属性

    老猿Python博文目录 专栏:使用PyQt开发图形界面Python应用 老猿Python博客地址 在Designer中,toolBox主要有如下属性: 可以看到,toolBox的属性主要是与当前项相 ...

  4. Python类知识学习时的部分问题

    Python的富比较方法__eq__和__ne__之间的关联关系分析 Python的富比较方法__le__.ge__之间的关联关系分析 Python的富比较方法__lt.__gt__之间的关联关系分析 ...

  5. PyQt(Python+Qt)学习随笔:QTreeWidgetItem项下子项的指示符展示原则childIndicatorPolicy

    老猿Python博文目录 专栏:使用PyQt开发图形界面Python应用 老猿Python博客地址 树型部件QTreeWidget中的QTreeWidgetItem项下可以有子项,如果存在子项,则父项 ...

  6. 第14.17节 爬虫实战3: request+BeautifulSoup实现自动获取本机上网公网地址

    一. 引言 一般情况下,没有特殊要求的客户,宽带服务提供商提供的上网服务,给客户家庭宽带分配的地址都是一个宽带服务提供商的内部服务地址,真正对外访问时通过NAT进行映射到一个公网地址,如果我们想确认自 ...

  7. Newbe.ObjectVisitor 0.4.4 发布,模型验证器上线

    Newbe.Claptrap 0.4.4 发布,模型验证器上线. 更新内容 完全基于表达式树的模型验证器 本版本,我们带来了基于表达式树实现的模型验证器.并实现了很多内置的验证方法. 我们罗列了与 F ...

  8. 【题解】「P6832」[Cnoi2020]子弦

    [题解]「P6832」[Cnoi2020]子弦第一次写月赛题解( 首先第一眼看到这题,怎么感觉要用 \(\texttt{SAM}\) 什么高科技的?结果一仔细读题,简单模拟即可. 我们不难想出,出现最 ...

  9. 得物(毒)APP,8位抽奖码需求,这不就是产品给我留的数学作业!

    作者:小傅哥 博客:https://bugstack.cn Github:https://github.com/fuzhengwei/CodeGuide/wiki 沉淀.分享.成长,让自己和他人都能有 ...

  10. Docker安装RabbitMQ与Kafka

    RabbitMq安装(dokcer) 下载镜像 docker pull rabbitmq 创建并启动容器 docker run -d --name rabbitmq -p 5672:5672 -p 1 ...