1. Install Python

I won't go into detail here: download the installer from the official website and run it. The version I picked is the 64-bit (x86-64) Windows build of Python 3.6.5.

2. Configure PATH

I'm on Windows 10. Right-click 'This PC' and choose 'Properties'; in the panel that opens, click 'Advanced system settings', switch to the 'Advanced' tab in the new window, then click 'Environment Variables'. Under the user variables, select Path (create it if it doesn't exist) and append C:\Python365\Scripts;C:\Python365; to its value.
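To verify the change, here is a quick sanity check (my addition; the paths assume the C:\Python365 install location above). Run it from a freshly opened terminal so the new PATH is picked up:

    # Check that the right interpreter is first on PATH.
    import shutil
    import sys

    print(sys.version)               # expect 3.6.5
    print(shutil.which("python"))    # expect C:\Python365\python.exe
    print(shutil.which("scrapy"))    # resolves once step 3 below is done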

3. Install Scrapy

I installed with pip. In a cmd window, run pip install scrapy; you will most likely hit the following error:

    building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools":
    http://landinghub.visualstudio.com/visual-cpp-build-tools

    ----------------------------------------
    Command "C:\Python365\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\admin\\AppData\\Local\\Temp\\pip-install-fkvobf_0\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\admin\AppData\Local\Temp\pip-record-6z5m4wfj\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\admin\AppData\Local\Temp\pip-install-fkvobf_0\Twisted\

This happens because the Visual C++ 2015 build tools are not installed. We don't actually need them, and the link given in the error message is dead anyway. Instead, download a prebuilt Twisted wheel (whl) file from this site and install it directly:

https://www.lfd.uci.edu/~gohlke/pythonlibs/

https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

On that page, pick the wheel that matches your Python version and architecture (I chose Twisted-18.7.0-cp36-cp36m-win_amd64.whl; if unsure, see the check after this list):

    Twisted, an event-driven networking engine.
    Twisted-18.7.0-cp27-cp27m-win32.whl
    Twisted-18.7.0-cp27-cp27m-win_amd64.whl
    Twisted-18.7.0-cp34-cp34m-win32.whl
    Twisted-18.7.0-cp34-cp34m-win_amd64.whl
    Twisted-18.7.0-cp35-cp35m-win32.whl
    Twisted-18.7.0-cp35-cp35m-win_amd64.whl
    Twisted-18.7.0-cp36-cp36m-win32.whl
    Twisted-18.7.0-cp36-cp36m-win_amd64.whl
    Twisted-18.7.0-cp37-cp37m-win32.whl
    Twisted-18.7.0-cp37-cp37m-win_amd64.whl
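The cpNN tag must match your CPython version, and win32/win_amd64 must match the interpreter's bitness (not the OS's). If in doubt, this minimal check (my addition, standard library only) tells you which wheel to grab:

    # Identify the wheel tag this interpreter needs.
    import platform
    import struct

    major, minor = platform.python_version_tuple()[:2]
    print("cp%s%s" % (major, minor))   # e.g. cp36 for Python 3.6
    print(struct.calcsize("P") * 8)    # 64 -> win_amd64, 32 -> win32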

Once it has downloaded, run pip install Twisted-18.7.0-cp36-cp36m-win_amd64.whl. When that completes, run pip install scrapy again and the rest of the installation goes through:

    Installing collected packages: scrapy
    Successfully installed scrapy-1.5.1
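As a quick sanity check (my addition), the versions Python reports should match the pip output:

    # Verify the freshly installed packages import cleanly.
    import scrapy
    import twisted

    print(scrapy.__version__)    # expect 1.5.1
    print(twisted.__version__)   # expect 18.7.0, the wheel installed above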

4. GitHub repository

Study and work are best tracked, so I set up my own GitHub repository for this:

  https://github.com/timscm/myscrapy

5. Example

The official documentation provides a simple example. I won't explain it here; the goal is just to verify that it runs.

https://docs.scrapy.org/en/latest/intro/tutorial.html

    PS D:\pycharm\labs> scrapy
    Scrapy 1.5.1 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command

5.1. Create the project

    PS D:\pycharm\labs> scrapy startproject tutorial .
    New Scrapy project 'tutorial', using template directory 'c:\\python365\\lib\\site-packages\\scrapy\\templates\\project', created in:
        D:\pycharm\labs

    You can start your first spider with:
        cd .
        scrapy genspider example example.com
    PS D:\pycharm\labs> dir

        Directory: D:\pycharm\labs

    Mode           LastWriteTime   Length Name
    ----           -------------   ------ ----
    d-----    2018/9/23    10:58          .idea
    d-----    2018/9/23    11:46          tutorial
    -a----    2018/9/23    11:05     1307 .gitignore
    -a----    2018/9/23    11:05    11558 LICENSE
    -a----    2018/9/23    11:05       24 README.md
    -a----    2018/9/23    11:46      259 scrapy.cfg

5.2. Create the spider

The file structure at this point:
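(The original post showed a screenshot here; this tree is reconstructed from the dir listing above and the files committed in section 5.4.)

    labs/
    ├── .gitignore
    ├── LICENSE
    ├── README.md
    ├── scrapy.cfg
    └── tutorial/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            ├── __init__.py
            └── quotes_spider.py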

tutorial/spiders/quotes_spider.py contains:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Save each page's raw HTML to a local quotes-N.html file.
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)
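The tutorial spider only dumps raw HTML to disk. As a sketch of where the official tutorial takes this next (my addition, not part of this post's run), the same spider can yield structured items via CSS selectors:

    # Sketch, following the official tutorial's later steps: yield one
    # dict per quote instead of saving the raw page body.
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # start_urls is a shortcut that replaces the explicit start_requests().
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

Running scrapy crawl quotes -o quotes.json with this version would write the items to a JSON file instead of HTML dumps.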

5.3. Run it

Run the crawl from a terminal:

    PS D:\pycharm\labs> scrapy crawl quotes
    2018-09-23 11:51:41 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
    2018-09-23 11:51:41 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
    2018-09-23 11:51:41 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
    2018-09-23 11:51:41 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    Unhandled error in Deferred:
    2018-09-23 11:51:41 [twisted] CRITICAL: Unhandled error in Deferred:

    2018-09-23 11:51:41 [twisted] CRITICAL:
    Traceback (most recent call last):
      File "c:\python365\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "c:\python365\lib\site-packages\scrapy\crawler.py", line 80, in crawl
        self.engine = self._create_engine()
      File "c:\python365\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
        return ExecutionEngine(self, lambda _: self.stop())
      File "c:\python365\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
        self.downloader = downloader_cls(crawler)
      File "c:\python365\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
        self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
      File "c:\python365\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
        return cls.from_settings(crawler.settings, crawler)
      File "c:\python365\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
        mwcls = load_object(clspath)
      File "c:\python365\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
        mod = import_module(module)
      File "c:\python365\lib\importlib\__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 994, in _gcd_import
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 678, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "c:\python365\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
        from twisted.web.client import ResponseFailed
      File "c:\python365\lib\site-packages\twisted\web\client.py", line 41, in <module>
        from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
      File "c:\python365\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
        from twisted.internet.stdio import StandardIO, PipeAddress
      File "c:\python365\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
        from twisted.internet import _win32stdio
      File "c:\python365\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
        import win32api
    ModuleNotFoundError: No module named 'win32api'
    PS D:\pycharm\labs>

It failed: the traceback says the win32api module is missing, so we need to install the pypiwin32 package:

    PS D:\pycharm\labs> pip install pypiwin32
    Collecting pypiwin32
      Downloading https://files.pythonhosted.org/packages/d0/1b/2f292bbd742e369a100c91faa0483172cd91a1a422a6692055ac920946c5/pypiwin32-223-py3-none-any.whl
    Collecting pywin32>=223 (from pypiwin32)
      Downloading https://files.pythonhosted.org/packages/9f/9d/f4b2170e8ff5d825cd4398856fee88f6c70c60bce0aa8411ed17c1e1b21f/pywin32-223-cp36-cp36m-win_amd64.whl (9.0MB)
        100% |████████████████████████████████| 9.0MB 1.1MB/s
    Installing collected packages: pywin32, pypiwin32
    Successfully installed pypiwin32-223 pywin32-223
    PS D:\pycharm\labs>
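Before re-running the crawl, a one-liner (my addition) confirms the module Twisted failed to import is now available:

    # win32api is provided by pywin32, which pypiwin32 pulls in.
    import win32api

    print("win32api loaded from:", win32api.__file__)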

Then run it again:

    PS D:\pycharm\labs> scrapy crawl quotes
    2018-09-23 11:53:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
    2018-09-23 11:53:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
    2018-09-23 11:53:05 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-09-23 11:53:06 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-09-23 11:53:06 [scrapy.core.engine] INFO: Spider opened
    2018-09-23 11:53:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-09-23 11:53:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-09-23 11:53:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2018-09-23 11:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2018-09-23 11:53:08 [quotes] DEBUG: Saved file quotes-1.html    # Timlinux: file saved
    2018-09-23 11:53:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2018-09-23 11:53:08 [quotes] DEBUG: Saved file quotes-2.html    # Timlinux: file saved
    2018-09-23 11:53:08 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-09-23 11:53:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 678,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 5976,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 2,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 9, 23, 3, 53, 8, 822749),
     'log_count/DEBUG': 6,
     'log_count/INFO': 7,
     'response_received_count': 3,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2018, 9, 23, 3, 53, 6, 381170)}
    2018-09-23 11:53:08 [scrapy.core.engine] INFO: Spider closed (finished)
    PS D:\pycharm\labs>

The crawl succeeded. The 404 on robots.txt is expected: ROBOTSTXT_OBEY is enabled and the site has no robots.txt. Let's take a look at the saved files.

Their content:

    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    </head>
    <body>
    <div class="container">
    <div class="row header-box">

The file is long; this short excerpt is enough.
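Rather than eyeballing raw HTML, the saved file can also be inspected with parsel, the selector library Scrapy uses internally (a sketch of mine; assumes quotes-1.html is in the working directory):

    # Parse the saved page offline with parsel (installed alongside Scrapy).
    from parsel import Selector

    with open('quotes-1.html', encoding='utf-8') as f:
        sel = Selector(text=f.read())

    print(sel.css('title::text').extract_first())         # 'Quotes to Scrape'
    print(sel.css('.quote .text::text').extract_first())  # first quote text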

5.4. Push the example code

    $ git commit -m "init scrapy tutorial."
    [master b1d6e1d] init scrapy tutorial.
     9 files changed, 259 insertions(+)
     create mode 100644 .idea/vcs.xml
     create mode 100644 scrapy.cfg
     create mode 100644 tutorial/__init__.py
     create mode 100644 tutorial/items.py
     create mode 100644 tutorial/middlewares.py
     create mode 100644 tutorial/pipelines.py
     create mode 100644 tutorial/settings.py
     create mode 100644 tutorial/spiders/__init__.py
     create mode 100644 tutorial/spiders/quotes_spider.py

    $ git push
    Counting objects: 14, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (12/12), done.
    Writing objects: 100% (14/14), 4.02 KiB | 293.00 KiB/s, done.
    Total 14 (delta 0), reused 0 (delta 0)
    To https://github.com/timscm/myscrapy.git
       c7e93fc..b1d6e1d  master -> master
