<Web Scraping with Python>
Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing
- BeautifulSoup
Key:
P5:
- urllib or urllib2?
If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and urllib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib.
I had already run into this before reading the book: I started learning Python with 3.x, while the Mac ships with Python 2.x, so I hit this error at the time and found the answer on Stack Overflow. Now that the book mentions the same point, it is worth reviewing. In Python 3.x, urllib2 was renamed urllib and split into the submodules urllib.request, urllib.parse, and urllib.error. Although most function names stay the same, when using the new urllib you need to note which functions have moved into which submodule.
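A minimal Python 3 sketch of the renamed submodules (the URL is just a placeholder, not an example from the book):

    # urlopen lived in urllib2 in Python 2.x; in Python 3.x it is in urllib.request.
    from urllib.request import urlopen
    from urllib.error import URLError

    try:
        html = urlopen("http://example.com")  # placeholder URL
        print(html.read()[:200])              # first 200 bytes of the response
    except URLError as e:
        print("Could not reach the server:", e)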
P15:
- When to get_text() and When to Preserve Tags?
.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.
Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.
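A small sketch of that workflow, using a made-up HTML snippet: search and navigate while the tags are still intact, and call .get_text() only at the end.

    from bs4 import BeautifulSoup

    html = "<div><p>First <a href='#'>link</a></p><p>Second paragraph.</p></div>"
    soup = BeautifulSoup(html, "html.parser")

    # Keep the tag structure while searching: the BeautifulSoup object is still navigable.
    first_paragraph = soup.find("p")
    print(first_paragraph)             # <p>First <a href="#">link</a></p>

    # Strip tags only at the very end, right before printing or storing the text.
    print(first_paragraph.get_text())  # First link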
P16:
- find() and findAll() with BeautifulSoup?
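In short, findAll() returns a list of every tag that matches the given name and attributes, while find() returns only the first match (the tag itself, not a list). A small sketch with a made-up HTML snippet:

    from bs4 import BeautifulSoup

    html = """
    <div>
      <span class="green">Anna Pavlovna</span>
      <span class="red">the prince</span>
      <span class="green">Prince Vasili</span>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")

    # findAll() returns every <span class="green"> in the document.
    for span in soup.findAll("span", {"class": "green"}):
        print(span.get_text())

    # find() returns only the first matching tag.
    print(soup.find("span", {"class": "red"}).get_text())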