QQ:231469242

欢迎交流

Parsing HTML with the BeautifulSoup Module

Beautiful Soup是用于提取HTML网页信息的模板,BeautifulSoup模板名字是bs4。
 
bs4.BeautifulSoup()函数需要调用时,携带包含HTML的一个字符串。这个字符串将被复制。
bs4.BeautifulSoup()返回一个BeautifulSoup对象。
 
Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The BeautifulSoupmodule’s name is bs4 (for Beautiful Soup, version 4). To install it, you will need to run pip install beautifulsoup4 from the command line. (Check out Appendix A for instructions on installing third-party modules.) While beautifulsoup4 is the name used for installation, to import Beautiful Soup you run import bs4.

For this chapter, the Beautiful Soup examples will parse (that is, analyze and identify the parts of) an HTML file on the hard drive. Open a new file editor window in IDLE, enter the following, and save it as example.html. Alternatively, download it from http://nostarch.com/automatestuff/.

<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body><p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p><p class="slogan">Learn Python the easy way!</p><p>By <span id="author">Al Sweigart</span></p>
</body></html>

As you can see, even a simple HTML file involves many different tags and attributes, and matters quickly get confusing with complex websites. Thankfully, Beautiful Soup makes working with HTML much easier.

Creating a BeautifulSoup Object from HTML

bs4.BeautifulSoup()函数需要调用时,携带包含HTML的一个字符串。这个字符串将被复制。
bs4.BeautifulSoup()返回一个BeautifulSoup对象。
The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns is a BeautifulSoupobject. Enter the following into the interactive shell while your computer is connected to the Internet:

>>> import requests, bs4
>>> res = requests.get('http://nostarch.com')
>>> res.raise_for_status()
>>> noStarchSoup = bs4.BeautifulSoup(res.text) #返回文字属性给bs4.BeautifulSoup函数
>>> type(noStarchSoup)
 
<class 'bs4.BeautifulSoup'>

 

This code uses requests.get() to download the main page from the No Starch Press website and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named noStarchSoup.

You can also load an HTML file from your hard drive by passing a File object tobs4.BeautifulSoup(). Enter the following into the interactive shell (make sure theexample.html file is in the working directory):

>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile)
>>> type(exampleSoup)
 
<class 'bs4.BeautifulSoup'>

Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.

如何创建一个example.html测试文件
打开一个idle文件,复制好下图HTML代码,用example.html文件名报存

bs4_2的更多相关文章

随机推荐

  1. HashTable、HashMap、HashSet

    1. HashMap 1)  hashmap的数据结构 Hashmap是一个数组和链表的结合体(在数据结构称“链表散列“),如下图示: 当我们往hashmap中put元素的时候,先根据key的hash ...

  2. ASP.NET杂谈-一切都从web.config说起(2)(ConfigSections详解-中)

    我们就接着上一篇继续说,上一篇中介绍了ConfigSection的结构和两个简单的DEMO,本篇就说一下SectionGroup.ConfigurationElementCollection和key/ ...

  3. json-jsonConfig使用

    一,setCycleDetectionStrategy 防止自包含 /** * 这里测试如果含有自包含的时候需要CycleDetectionStrategy */ public static void ...

  4. 一起学HTML基础-JavaScritp简介与语法

    简介: 1.什么是JavaScript? 它是个脚本语言,作用是使 HTML 页面具有更强的动态和交互性,它需要有宿主文件,它的宿主文件就是html文件.  JavaScript 是 Web 的编程语 ...

  5. C#微信开发之旅(二):基础类之HttpClientHelper(更新:SSL安全策略)

    public class HttpClientHelper   2     {   3         /// <summary>   4         /// get请求   5    ...

  6. 【USACO 1.5】Prime Palindromes

    /* TASK: pprime LANG: C++ SOLVE: 枚举数的长度,dfs出对称的数,判断是否在范围内,是否是素数 原来想着枚举每个范围里的数,但是显然超时,范围最大是10^9. 对称的数 ...

  7. 【POJ 1679】The Unique MST(次小生成树)

    找出最小生成树,同时用Max[i][j]记录i到j的唯一路径上最大边权.然后用不在最小生成树里的边i-j来替换,看看是否差值为0. #include <algorithm> #includ ...

  8. Leetcode 221. Maximal Square

    本题用brute force超时.可以用DP,也可以不用. dp[i][j] 代表 以(i,j)为右下角正方形的边长. class Solution(object): def maximalSquar ...

  9. 在VS里配置及查看IL

    在VS里配置及查看IL 来源:网络 编辑:admin 在之前的版本VS2010中,在Tools下有IL Disassembler(IL中间语言查看器),但是我想直接集成在VS2012里使用,方法如下: ...

  10. JDK与Java SE/EE/ME的区别

    1. Java SE(Java Platform,Standard Edition). Java SE 以前称为 J2SE.它允许开发和部署在桌面.服务器.嵌入式环境和实时环境中使用的 Java 应用 ...