<Web Scraping with Python>

Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing

  • BeautifulSoup

Key:

    P5:

  • urllib or urllib2?

 If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and urllib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib.

I had already used this package before reading the book (I started learning Python on 3.x, while the Mac ships with Python 2.x). I hit exactly this error at the time and found the answer on Stack Overflow, so it is worth reviewing now that the book brings it up: in Python 3.x, urllib2 was renamed urllib and split into the submodules urllib.request, urllib.parse, and urllib.error. Most function names stay the same, but you need to note which functions have moved into which submodule when using the new urllib.
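A minimal sketch of the Python 3 layout described above (the URL is only a placeholder, not taken from the book):

    from urllib.request import urlopen
    from urllib.error import URLError

    try:
        # urlopen now lives in the urllib.request submodule
        html = urlopen("http://example.com")
        print(html.read()[:100])
    except URLError as e:
        # the error classes moved to urllib.error
        print("Could not reach the server:", e)

    # The Python 2.x equivalent would have been:
    #   import urllib2
    #   html = urllib2.urlopen("http://example.com")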

 

    P15:

  • When to get_text() and When to Preserve Tags?

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.
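A short sketch of this idea, with a placeholder URL: navigate while the tag structure is still intact, and strip the tags only at the very end.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://example.com")  # placeholder URL
    bs = BeautifulSoup(html.read(), "html.parser")

    # Navigate while the tag structure is still intact
    first_paragraph = bs.find("p")

    # Strip the tags only right before printing/storing the final data
    if first_paragraph is not None:
        print(first_paragraph.get_text())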

    P16:

  • find() and findAll() with BeautifulSoup?
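A minimal sketch of the difference between the two calls (the URL, tag names, and attribute values below are only illustrative):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://example.com")  # placeholder URL
    bs = BeautifulSoup(html.read(), "html.parser")

    # find() returns the first matching tag, or None if nothing matches
    first_heading = bs.find("h1")

    # findAll() returns a list of every matching tag;
    # a dict of attributes narrows the match
    green_spans = bs.findAll("span", {"class": "green"})
    for span in green_spans:
        print(span.get_text())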
