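Python Web Crawler 009 (Programming): using regular expressions to get all the URL links in a web page and download the source code of those links

Operating system: Windows 10, 64-bit

Python version: Python 2.7.10

Python IDE: PyCharm 2016 04

HTTP library: urllib2

Note: this post uses Python 2, not Python 3.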

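I. Preface

In two earlier posts (writing a crawler that downloads a single web page, and fixing garbled text in the downloaded page) we arrived at the final download() function.

Two posts ago, we crawled every page of the target site by parsing the URLs in its sitemap. In the previous post, we introduced a way to crawl all the pages linked from a web page. In this post, we use a regular expression to collect all the URL links in a web page and download the source code of those links.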

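II. Introduction

So far we have built two simple crawlers that exploit the structure of the target site. Whenever those two techniques are available they should be used, because they minimize the number of pages that have to be downloaded. For some sites, however, the crawler has to behave more like an ordinary user: follow links and visit the content of interest.

By following every link we can easily download all the pages of a site, but that approach also downloads many pages we do not need. For example, to scrape the user account detail pages of an online forum, we only need to download the account pages, not the discussion threads. The link crawler in this post uses a regular expression to decide which pages to download.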
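To make the forum example concrete, here is a minimal sketch of filtering links with a regular expression. The link paths and the pattern are invented for illustration and do not come from any real site.

```python
import re

# Hypothetical forum links, invented for this illustration.
links = [
    '/user/1001',
    '/thread/42',
    '/user/1002',
    '/thread/43?page=2',
]

# Hypothetical pattern: keep account pages only, skip discussion threads.
account_regex = re.compile(r'/user/\d+$')

# re.match() tests the pattern from the start of each link.
wanted = [link for link in links if account_regex.match(link)]
print(wanted)
```

Only the two /user/ links survive the filter, so a crawler built this way would never download the thread pages.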

III. Initial code

import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        # download() is the function completed in the earlier posts
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

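IV. Explaining the initial code

1.

def link_crawler(seed_url, link_regex):

This is the function we call from outside. It first downloads the source of the seed_url page and extracts every link URL in it. Each extracted link is then tested against link_regex with re.match(), which only succeeds when the pattern matches at the start of the link. Matching links are appended to crawl_queue, and the next pass of while crawl_queue: repeats the same steps for them. This continues until crawl_queue is empty, at which point the function returns.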
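The behaviour of re.match() is worth checking by hand, because it anchors at the start of the string, unlike re.search(). The sketch below uses the same link_regex as the run later in this post. The /user/login path is only an assumed example of a link we do not want.

```python
import re

link_regex = '/(index|view)'

# Relative links that start with /index or /view are accepted.
index_ok = bool(re.match(link_regex, '/index/1'))
view_ok = bool(re.match(link_regex, '/view/Zambia-251'))

# A link that starts with something else is rejected (assumed example path).
login_ok = bool(re.match(link_regex, '/user/login'))

# re.match() anchors at the start, so an absolute URL does not match,
# while re.search() finds the pattern anywhere in the string.
absolute_match = bool(re.match(link_regex, 'http://example.webscraping.com/index/1'))
absolute_search = bool(re.search(link_regex, 'http://example.webscraping.com/index/1'))

print(index_ok, view_ok, login_ok, absolute_match, absolute_search)
```

This is why the filter is applied to the link as it appears in the page, before any conversion to an absolute URL.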

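2.

get_links(html) returns all the link URLs found in the html page.

3.

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

This compiles a matching pattern and stores it in the webpage_regex object. The pattern matches strings of the form <a href="xxx"> and captures the xxx part, which is the URL.

4.

return webpage_regex.findall(html)

This applies webpage_regex to the html source, finds every string of the form <a href="xxx">, and returns the captured xxx parts as a list.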
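As a quick check of the pattern, the sketch below runs the same get_links() on a small HTML snippet. The snippet itself is made up for illustration. It mixes quote styles and letter case, and includes one anchor without an href.

```python
import re

def get_links(html):
    """Return a list of links from html (same pattern as in the post)."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

# Made-up HTML used only for this illustration.
sample_html = (
    '<a href="/index/1">Next</a> '
    "<A class=\"x\" HREF='/view/Zambia-251'>Zambia</A> "
    '<a name="top">no href here</a>'
)

links = get_links(sample_html)
print(links)
```

The anchor without an href contributes nothing, and re.IGNORECASE is what lets the upper-case <A HREF=...> tag match.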

For a detailed introduction to regular expressions, see:

http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

V. Running

First start an interactive Python session by running the following command in PyCharm's Terminal window or in the Windows command prompt:

C:\Python27\python.exe -i 1-4-4-regular_expression.py

Then call link_crawler():

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  /index/1
Traceback (most recent call last):
  File "1-4-4-regular_expression.py", line 50, in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1-4-4-regular_expression.py", line 36, in link_crawler
    html = download(url)
  File "1-4-4-regular_expression.py", line 13, in download
    html = urllib2.urlopen(request).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 423, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1

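The run fails while downloading the URL /index/1. This /index/1 is a relative link on the target site: it is only the path part of the full page URL, without the protocol and server parts, so download() cannot fetch it. In a browser, relative links work because the browser knows which page you are viewing. urllib2 has no such context, so the download fails.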
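The missing parts can be seen by splitting the link with the standard library. This is a small sketch: the is_absolute() helper is a name introduced here for illustration, and the import fallback is only there so the snippet also runs on Python 3, where urlparse lives in urllib.parse.

```python
try:
    from urlparse import urlparse        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlparse    # Python 3

def is_absolute(url):
    """Hypothetical helper: True when the URL has both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)

relative = urlparse('/index/1')
# For a relative link the scheme and host are empty strings; only the path is set.
print(repr(relative.scheme), repr(relative.netloc), repr(relative.path))

print(is_absolute('/index/1'))
print(is_absolute('http://example.webscraping.com/index/1'))
```

With an empty scheme, urllib2 cannot tell which protocol handler to use, which is what the "unknown url type" error reports.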

VII. Improving the code

So that urllib2 can locate the page, we need to convert relative links into absolute links.

Python has a module that does this: urlparse.
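Before wiring it into the crawler, it helps to see what urljoin() returns for a few inputs. In this sketch the page-relative link view/Zambia-251 and the https://www.python.org/ link are extra cases added for illustration, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3

seed_url = 'http://example.webscraping.com'

# A root-relative link is resolved against the host of the base URL.
joined_root = urljoin(seed_url, '/index/1')

# A link without a leading slash is resolved against the directory of the base page.
joined_page = urljoin(seed_url + '/index/1', 'view/Zambia-251')

# A link that is already absolute is returned unchanged.
joined_abs = urljoin(seed_url, 'https://www.python.org/')

print(joined_root)
print(joined_page)
print(joined_abs)
```

The second case shows that the base matters for page-relative links: joining against the seed URL is safe for links that begin with a slash, while other relative links would need the URL of the page they were found on.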

Here is the improved link_crawler() function:

import urlparse

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                # convert the relative link into an absolute link
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)

VIII. Running

Run the program:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/index/1
Downloading:  http://example.webscraping.com/index/2
Downloading:  http://example.webscraping.com/index/3
Downloading:  http://example.webscraping.com/index/4
Downloading:  http://example.webscraping.com/index/5
Downloading:  http://example.webscraping.com/index/6
Downloading:  http://example.webscraping.com/index/7
Downloading:  http://example.webscraping.com/index/8
Downloading:  http://example.webscraping.com/index/9
Downloading:  http://example.webscraping.com/index/10
Downloading:  http://example.webscraping.com/index/11
Downloading:  http://example.webscraping.com/index/12
Downloading:  http://example.webscraping.com/index/13
Downloading:  http://example.webscraping.com/index/14
Downloading:  http://example.webscraping.com/index/15
Downloading:  http://example.webscraping.com/index/16
Downloading:  http://example.webscraping.com/index/17
Downloading:  http://example.webscraping.com/index/18
Downloading:  http://example.webscraping.com/index/19
Downloading:  http://example.webscraping.com/index/20
Downloading:  http://example.webscraping.com/index/21
Downloading:  http://example.webscraping.com/index/22
Downloading:  http://example.webscraping.com/index/23
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24

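As the output shows, the pages now download without errors, but the same pages are downloaded over and over. Why? Because these pages link to each other. If two pages each contain a link to the other, this program bounces between them in an endless loop.

So the program needs another improvement: avoid crawling the same link twice. To do that we record which links have already been crawled, and skip any link that is already in that record.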

IX. Improving the `link_crawler()` function further:

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if we have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

X. Running:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/index/1
Downloading:  http://example.webscraping.com/index/2
Downloading:  http://example.webscraping.com/index/3
Downloading:  http://example.webscraping.com/index/4
Downloading:  http://example.webscraping.com/index/5
Downloading:  http://example.webscraping.com/index/6
Downloading:  http://example.webscraping.com/index/7
Downloading:  http://example.webscraping.com/index/8
Downloading:  http://example.webscraping.com/index/9
Downloading:  http://example.webscraping.com/index/10
Downloading:  http://example.webscraping.com/index/11
Downloading:  http://example.webscraping.com/index/12
Downloading:  http://example.webscraping.com/index/13
Downloading:  http://example.webscraping.com/index/14
Downloading:  http://example.webscraping.com/index/15
Downloading:  http://example.webscraping.com/index/16
Downloading:  http://example.webscraping.com/index/17
Downloading:  http://example.webscraping.com/index/18
Downloading:  http://example.webscraping.com/index/19
Downloading:  http://example.webscraping.com/index/20
Downloading:  http://example.webscraping.com/index/21
Downloading:  http://example.webscraping.com/index/22
Downloading:  http://example.webscraping.com/index/23
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/view/Zimbabwe-252
Downloading:  http://example.webscraping.com/view/Zambia-251
Downloading:  http://example.webscraping.com/view/Yemen-250
Downloading:  http://example.webscraping.com/view/Western-Sahara-249

The program now behaves as intended: it crawls all the locations and stops as expected. We finally have a usable crawler.
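The effect of the seen set can be checked without touching the network. The sketch below is not the code from this post verbatim. It swaps download() for a hypothetical fake_download() that reads from an in-memory dictionary, FAKE_SITE, whose pages are invented and deliberately link back to each other. It also returns the list of visited URLs, which the original function does not do.

```python
import re

try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3

BASE = 'http://example.webscraping.com'

# Hypothetical in-memory site: the index pages and view pages link to each other.
FAKE_SITE = {
    BASE: '<a href="/index/1">Next</a>',
    BASE + '/index/1': '<a href="/index/2">Next</a> <a href="/view/A-1">A</a>',
    BASE + '/index/2': '<a href="/index/1">Previous</a> <a href="/view/B-2">B</a>',
    BASE + '/view/A-1': '<a href="/index/1">Back</a>',
    BASE + '/view/B-2': '<a href="/index/2">Back</a>',
}

def fake_download(url):
    """Stand-in for download(): look the page up instead of using the network."""
    return FAKE_SITE.get(url, '')

def get_links(html):
    """Return a list of links from html."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(seed_url, link_regex, download=fake_download):
    """Crawl from seed_url and return the URLs in the order they were fetched."""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        visited.append(url)
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return visited

visited = link_crawler(BASE, '/(index|view)')
for url in visited:
    print('Downloading:  ' + url)
```

Each of the five pages is fetched exactly once even though the pages point at each other. Because pop() takes the most recently added link from the end of the list, the crawl order is last-in, first-out rather than strictly page by page.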

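Summary:

We have now covered three ways to crawl the source code of all the link URLs in a site or a web page. These are only first versions. Going forward, we may also run into the following problems:

1. Some sites declare URLs that must not be crawled. To respect a site's rules, the crawler has to follow its robots.txt file.

2. Google cannot be reached from mainland China. To reach it through a proxy, the crawler needs proxy support.

3. If the crawler fetches pages too quickly, the target site's server may ban it, so we need to limit the download rate.

4. Some pages contain something like a calendar in which every date is a URL link, and we do not want to crawl such meaningless pages. Dates never end, so for a crawler this is a spider trap that we need to avoid falling into.

We need to solve these four problems to get the final version of the crawler.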
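As a preview of problems 1 and 3, here is a rough sketch of a robots.txt check and a per-domain download delay. It is one possible approach under stated assumptions, not the final version of this series: the robots.txt rules are invented and parsed from a list instead of being fetched, the agent names are made up, and the Throttle class is a name introduced here. The import fallback is only there so the snippet also runs on Python 3.

```python
import time

try:
    import robotparser                   # Python 2, as used in this post
    from urlparse import urlparse
except ImportError:
    from urllib import robotparser       # Python 3
    from urllib.parse import urlparse

# Invented robots.txt rules: one crawler name is banned from the whole site.
ROBOTS_LINES = [
    'User-agent: BadCrawler',
    'Disallow: /',
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_LINES)   # a real crawler would call set_url() and read() instead

url = 'http://example.webscraping.com/index/1'
bad_allowed = rp.can_fetch('BadCrawler', url)
good_allowed = rp.can_fetch('GoodCrawler', url)

class Throttle(object):
    """Hypothetical helper: wait between two downloads from the same domain."""

    def __init__(self, delay):
        self.delay = delay           # minimum seconds between requests per domain
        self.last_accessed = {}      # domain -> time of the last request

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if self.delay > 0 and last is not None:
            sleep_secs = self.delay - (time.time() - last)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.last_accessed[domain] = time.time()

print(bad_allowed, good_allowed)
```

In a crawler, rp.can_fetch() would be checked before each download, and throttle.wait(url) would be called right before the request is sent.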

