Pitfalls I Hit Installing the Python Crawler Framework Scrapy, and Some Thoughts Beyond Programming
A friend recently asked me to scrape post data from a forum. I looked around for open-source crawlers, read a number of comparisons, and concluded that Scrapy was the most usable. But I had always worked in Java and PHP and didn't know Python, so I spent a day skimming the basics of the language and got started, not expecting that merely setting up a working environment would cost me another whole day. Below is a record of the pitfalls of installing and configuring Scrapy.
Environment: CentOS 6.0 virtual machine
First, a Python runtime. I ran the python command and found CentOS ships with one; I quietly rejoiced (straight into a big pit, as it turned out). A quick Google for install steps said to run pip install Scrapy directly. That failed: pip was missing, so I installed pip. pip install Scrapy again, and now python-devel was missing. I went back and forth like this all morning. Eventually I downloaded the Scrapy source to install it directly, and it suddenly complained that Python 2.7 is required. python --version confirmed it: 2.6 stared back at me, and a thousand grass-mud horses stampeded through my heart.
The official docs (http://doc.scrapy.org/en/master/intro/install.html) confirmed it: Python 2.7 is required. Nothing for it but to upgrade Python.
1. Upgrading Python
- Download and install Python 2.7
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -zxvf Python-2.7.10.tgz
cd Python-2.7.10
./configure
make all
make install
make clean
make distclean
- Check the Python version
python --version
Still 2.6.
- Repoint the python command
mv /usr/bin/python /usr/bin/python2.6_bak
ln -s /usr/local/bin/python2.7 /usr/bin/python
- Check the version again
# python --version
Python 2.7.10
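One caution worth adding here (my own note, not part of the original steps): on CentOS 6, yum itself runs on the stock Python 2.6 through /usr/bin/python, so swapping that symlink can break yum. A safer sketch of the same upgrade uses make altinstall, which leaves the system interpreter untouched:
cd Python-2.7.10
./configure
make
make altinstall                        # installs /usr/local/bin/python2.7, leaves /usr/bin/python alone
/usr/local/bin/python2.7 --version     # Python 2.7.10
# If yum already broke after the symlink swap, point its shebang back at the
# renamed interpreter (hypothetical fix matching the rename above):
# sed -i '1s|.*|#!/usr/bin/python2.6_bak|' /usr/bin/yum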
That completes the Python upgrade; on to installing Scrapy. Another pip install scrapy, still an error:
-bash: pip: command not found
- Install pip
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
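Before retrying, a quick sanity check that pip really landed under the new interpreter (my addition; paths assume the build above):
pip --version      # should end with "(python 2.7)"
which pip          # expected under /usr/local/bin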
Then pip install scrapy once more, and another error:
Collecting Twisted>=10.0.0 (from scrapy)
Could not find a version that satisfies the requirement Twisted>=10.0.0 (from scrapy) (from versions: )
No matching distribution found for Twisted>=10.0.0 (from scrapy)
Twisted was missing and pip could not fetch it, so I installed it by hand.
2. Installing Twisted
- Download Twisted (https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2#md5=4be066a899c714e18af1ecfcb01cfef7)
- Install it
wget https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2
tar -xjvf Twisted-15.2.1.tar.bz2
cd Twisted-15.2.1
python setup.py install
- Verify the installation
python
Python 2.7.10 (default, Jun , ::)
[GCC 4.4. (Red Hat 4.4.-)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import twisted
>>>
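The same check fits in a one-liner if you prefer not to open an interactive session (twisted.version prints the installed release):
python -c "import twisted; print twisted.version"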
So Twisted was now installed. On to pip install scrapy again; still an error:
3. Installing libxslt, libxml2, and xslt-config
Collecting libxlst
Could not find a version that satisfies the requirement libxlst (from versions: )
No matching distribution found for libxlst
Collecting libxml2
Could not find a version that satisfies the requirement libxml2 (from versions: )
No matching distribution found for libxml2
pip fails here because libxslt and libxml2 are C libraries, not PyPI packages ("libxlst" is also a typo on top of that). What Scrapy actually needs is the Python binding lxml, which requires these libraries and their headers at build time, so they have to be built from source:
wget http://xmlsoft.org/sources/libxslt-1.1.28.tar.gz
tar -zxvf libxslt-1.1.28.tar.gz
cd libxslt-1.1.28/
./configure
make
make install
wget ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
tar -zxvf libxml2-git-snapshot.tar.gz
cd libxml2-2.9./
./configure
make
make install
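Both libraries install shared objects into /usr/local/lib, so it is worth confirming right away that the build tools and the dynamic linker can see them; the LIBXML2_2.9.0 error in step 7 below is exactly this issue surfacing later (a hedged aside, not an original step):
xml2-config --version        # should report the freshly built libxml2 (2.9.x); if it shows the old system version, /usr/local/bin is not first in PATH
xslt-config --version        # should report 1.1.28
ldconfig -p | grep libxml2   # shows which libxml2.so the loader will actually pick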
With those installed, pip install scrapy again; the lucky star still declined to appear.
4. Installing cryptography
Failed building wheel for cryptography
Download cryptography (https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz)
Install it:
wget https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz
tar -zxvf cryptography-0.4.tar.gz
cd cryptography-0.4
python setup.py build
python setup.py install
The install failed with:
No package 'libffi' found
So I downloaded and installed libffi:
wget ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz
tar -zxvf libffi-3.2.1.tar.gz
cd libffi-3.2.1
./configure
make
make install
After installing, it still failed:
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
So I set PKG_CONFIG_PATH:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
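With the variable exported, pkg-config should resolve libffi; checking before rebuilding saves a wasted compile cycle (my addition):
pkg-config --modversion libffi       # expect 3.2.1
pkg-config --cflags --libs libffi    # the flags cryptography's build will pick up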
Then installed Scrapy again:
pip install scrapy
Where had the goddess of fortune gone this time?
ImportError: libffi.so.6: cannot open shared object file: No such file or directory
So:
whereis libffi
libffi: /usr/local/lib/libffi.a /usr/local/lib/libffi.la /usr/local/lib/libffi.so
So the library itself was installed correctly. A round of searching showed the real problem: LD_LIBRARY_PATH was not set. So:
export LD_LIBRARY_PATH=/usr/local/lib
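A quick way to confirm the loader now resolves the library: ctypes honors LD_LIBRARY_PATH, and libffi.so.6 is the soname that libffi 3.2.1 installs (a sketch, assuming the build above):
python -c "import ctypes; ctypes.CDLL('libffi.so.6'); print 'libffi OK'"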
Then built cryptography-0.4 again:
python setup.py build
python setup.py install
This time it installed cleanly, with no errors.
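To double-check before moving on, the module can be imported and queried for its version (cryptography exposes __version__):
python -c "import cryptography; print cryptography.__version__"   # expect 0.4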
5. Installing Scrapy, continued
pip install scrapy
Watching the output:
Building wheels for collected packages: cryptography
Running setup.py bdist_wheel for cryptography
It paused here for a long while. Had the goddess of fortune arrived at last? After a wait:
Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /usr/local/lib/python2.7/site-packages/zope.interface-4.1.-py2.7-linux-i686.egg (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=0.7 (from pyOpenSSL->scrapy)
Using cached cryptography-0.9.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/site-packages (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): idna in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=0.8 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ordereddict in /usr/local/lib/python2.7/site-packages (from enum34->cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/local/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.7->pyOpenSSL->scrapy)
Building wheels for collected packages: cryptography
Running setup.py bdist_wheel for cryptography
Stored in directory: /root/.cache/pip/wheels/d7///7258f08eae0b9c930c04209959c9a0794b9729c2b64258117e
Successfully built cryptography
Installing collected packages: cryptography
Found existing installation: cryptography 0.4
Uninstalling cryptography-0.4:
Successfully uninstalled cryptography-0.4
Successfully installed cryptography-0.9
That was the output. At this point, tears streamed down my face. Thank you CCAV, thank you MTV, and the Diaoyu Islands belong to China. It was finally installed.
6. Testing Scrapy
Create a test script:
cat > myspider.py <<EOF
from scrapy import Spider, Item, Field

class Post(Item):
    title = Field()

class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://www.cnblogs.com/rwxwsblog/']

    def parse(self, response):
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]
EOF
Check that the script runs:
scrapy runspider myspider.py
-- :: [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
-- :: [scrapy] INFO: Optional features available: ssl, http11
-- :: [scrapy] INFO: Overridden settings: {}
-- :: [py.warnings] WARNING: :: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
-- :: [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
-- :: [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
-- :: [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
-- :: [scrapy] INFO: Enabled item pipelines:
-- :: [scrapy] INFO: Spider opened
-- :: [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
-- :: [scrapy] DEBUG: Telnet console listening on 127.0.0.1:
-- :: [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/rwxwsblog/> (referer: None)
-- :: [scrapy] INFO: Closing spider (finished)
-- :: [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': ,
'downloader/request_count': ,
'downloader/request_method_count/GET': ,
'downloader/response_bytes': ,
'downloader/response_count': ,
'downloader/response_status_count/200': ,
'finish_reason': 'finished',
'finish_time': datetime.datetime(, , , , , , ),
'log_count/DEBUG': ,
'log_count/INFO': ,
'log_count/WARNING': ,
'response_received_count': ,
'scheduler/dequeued': ,
'scheduler/dequeued/memory': ,
'scheduler/enqueued': ,
'scheduler/enqueued/memory': ,
'start_time': datetime.datetime(, , , , , , )}
-- :: [scrapy] INFO: Spider closed (finished)
It ran fine (quiet rejoicing at this point, ^_^....).
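Two optional follow-ups at this point, offered as suggestions rather than required steps: installing service_identity clears the UserWarning in the log above, and runspider's -o flag dumps the scraped items to a file so you can inspect the results:
pip install service_identity                 # enables full TLS hostname verification
scrapy runspider myspider.py -o posts.json   # write the scraped Post items as JSON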
7. Creating my own Scrapy project (in a new shell session this time)
scrapy startproject tutorial
This printed:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line , in <module>
load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line , in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line , in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line , in load
return self.resolve()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line , in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=)
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/__init__.py", line , in <module>
from scrapy.spiders import Spider
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/spiders/__init__.py", line , in <module>
from scrapy.http import Request
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/__init__.py", line , in <module>
from scrapy.http.request.form import FormRequest
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/request/form.py", line , in <module>
import lxml.html
File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line , in <module>
from lxml import etree
ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so)
Countless grass-mud horses stampeded through my heart again. Why was it failing now? Sorcery? Taking a calmer look: ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so). That looked awfully familiar. Wasn't it the same shape as the earlier ImportError: libffi.so.6: cannot open shared object file: No such file or directory? Of course: the new shell session no longer had the variable. So:
8. Setting the environment variable
export LD_LIBRARY_PATH=/usr/local/lib
And ran it again:
scrapy startproject tutorial
This printed:
[root@bogon scrapy]# scrapy startproject tutorial
-- :: [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
-- :: [scrapy] INFO: Optional features available: ssl, http11
-- :: [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
/root/scrapy/tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
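For reference, the generated project should look roughly like this (the default Scrapy 1.0 template, reproduced from memory, so treat it as a sketch):
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for your spiders
            __init__.py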
Damn it, success at last. Evidently Scrapy needs the LD_LIBRARY_PATH environment variable at runtime, so it is worth adding it to the shell profile:
vi /etc/profile
Add the line: export LD_LIBRARY_PATH=/usr/local/lib (the earlier PKG_CONFIG_PATH is worth adding here too: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH).
Note: when building, keep an eye on where the libraries actually get installed. Taking libffi as an example:
libtool: install: /usr/bin/install -c .libs/libffi.so.6.0. /usr/local/lib/../lib64/libffi.so.6.0.
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0. libffi.so. || { rm -f libffi.so. && ln -s libffi.so.6.0. libffi.so.; }; })
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0. libffi.so || { rm -f libffi.so && ln -s libffi.so.6.0. libffi.so; }; })
libtool: install: /usr/bin/install -c .libs/libffi.lai /usr/local/lib/../lib64/libffi.la
libtool: install: /usr/bin/install -c .libs/libffi.a /usr/local/lib/../lib64/libffi.a
libtool: install: chmod /usr/local/lib/../lib64/libffi.a
libtool: install: ranlib /usr/local/lib/../lib64/libffi.a
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/www/wdlinux/mysql/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib/../lib64
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib/../lib64
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/bin/mkdir -p '/usr/local/share/info'
/usr/bin/install -c -m ../doc/libffi.info '/usr/local/share/info'
install-info --info-dir='/usr/local/share/info' '/usr/local/share/info/libffi.info'
/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m libffi.pc '/usr/local/lib/pkgconfig'
make[]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
Notice that libffi actually installed into /usr/local/lib/../lib64, that is, /usr/local/lib64. So the export should really be: export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:$LD_LIBRARY_PATH. This one is easy to miss.
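An alternative to exporting LD_LIBRARY_PATH, and the one the libtool notice above actually recommends, is registering the directories with the loader system-wide (the file name local-libs.conf is my own choice):
cat > /etc/ld.so.conf.d/local-libs.conf <<EOF
/usr/local/lib
/usr/local/lib64
EOF
ldconfig                                  # rebuild the loader cache
ldconfig -p | grep -E 'libffi|libxml2'    # both should now be listed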
After saving, check that nothing errors out:
source /etc/profile
Open a new session and run:
scrapy runspider myspider.py
It ran normally, so the LD_LIBRARY_PATH setting is taking effect. With that, Scrapy counts as properly installed.
Checking the Scrapy version: scrapy version reports "Scrapy 1.0.0rc2".
9. Thoughts beyond programming (thanks for reading this far; even I am a little dizzy)
- Is there a better way to install all this? Is my approach flawed? If so, please tell me. (Many of the dependencies would install with neither pip nor easy_install; I suspect the package index configured for pip was the problem.)
- Always read the official documentation. What Google and Baidu turn up tends to be fragmentary and incomplete; the docs spare you many detours and a lot of unnecessary work.
- When you hit a problem, think first, work out what the problem actually is, and only then search Google or Baidu.
- Write up the problems you solve; it helps you and it helps others.
10. References
http://scrapy.org/
http://doc.scrapy.org/en/master/
http://blog.csdn.net/slvher/article/details/42346887
http://blog.csdn.net/niying/article/details/27103081
http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html