python3 抓取网页资源的 N 种方法

1. 最简单

import urllib.request

response = urllib.request.urlopen('http://python.org/')

html = response.read()

2. 使用Request

import urllib.request

req = urllib.request.Request('http://python.org/')

response = urllib.request.urlopen(req)

the_page = response.read()

3. 发送数据

#! /usr/bin/env python3

import urllib.parse

import urllib.request

url = 'http://localhost/login.php'

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

values = {

          'act' : 'login',

          'login[email]' : '12345@qq.com',

          'login[password]' : '123456'

         }

data = urllib.parse.urlencode(values)

req = urllib.request.Request(url, data)

req.add_header('Referer', 'http://www.python.org/')

response = urllib.request.urlopen(req)

the_page = response.read()

print(the_page.decode("utf8"))

4. 发送数据和header

#! /usr/bin/env python3

import urllib.parse

import urllib.request

url = 'http://localhost/login.php'

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

values = {

          'act' : 'login',

          'login[email]' : '12334@qq.com',

          'login[password]' : '123456'

         }

headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)

req = urllib.request.Request(url, data, headers)

response = urllib.request.urlopen(req)

the_page = response.read()

print(the_page.decode("utf8"))

5. http错误

#! /usr/bin/env python3

import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')

try:

    urllib.request.urlopen(req)

except urllib.error.HTTPError as e:

    print(e.code)

    print(e.read().decode("utf8"))

6. 异常处理1

#! /usr/bin/env python3

from urllib.request import Request, urlopen

from urllib.error import URLError, HTTPError

req = Request("http://twitter.com/")

try:

    response = urlopen(req)

except HTTPError as e:

    print('The server couldn\'t fulfill the request.')

    print('Error code: ', e.code)

except URLError as e:

    print('We failed to reach a server.')

    print('Reason: ', e.reason)

else:

    print("good!")

    print(response.read().decode("utf8"))

7. 异常处理2

#! /usr/bin/env python3

from urllib.request import Request, urlopen

from urllib.error import  URLError

req = Request("http://twitter.com/")

try:

    response = urlopen(req)

except URLError as e:

    if hasattr(e, 'reason'):

        print('We failed to reach a server.')

        print('Reason: ', e.reason)

    elif hasattr(e, 'code'):

        print('The server couldn\'t fulfill the request.')

        print('Error code: ', e.code)

else:

    print("good!")

    print(response.read().decode("utf8"))

8. HTTP认证

#! /usr/bin/env python3

import urllib.request

# create a password manager

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.

# If we knew the realm, we could use it instead of None.

top_level_url = "https://cms.tetx.com/"

password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)

opener = urllib.request.build_opener(handler)

# use the opener to fetch a URL

a_url = "https://cms.tetx.com/"

x = opener.open(a_url)

print(x.read())

# Install the opener.

# Now all calls to urllib.request.urlopen use our opener.

urllib.request.install_opener(opener)

a = urllib.request.urlopen(a_url).read().decode('utf8')

print(a)

9. 使用代理

#! /usr/bin/env python3

import urllib.request

proxy_support = urllib.request.ProxyHandler({'sock5': 'localhost:1080'})

opener = urllib.request.build_opener(proxy_support)

urllib.request.install_opener(opener)

a = urllib.request.urlopen("http://g.cn").read().decode("utf8")

print(a)

10. 超时

#! /usr/bin/env python3

import socket

import urllib.request

# timeout in seconds

timeout = 2

socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout

# we have set in the socket module

req = urllib.request.Request('http://twitter.com/')

a = urllib.request.urlopen(req).read()

print(a)

python3 抓取网页资源的 N 种方法的更多相关文章

python3爬取网页
爬虫 python3爬取网页资源方式(1.最简单: import'http://www.baidu.com/'print2.通过request import'http://www.baidu.com' ...
php抓取网页中的内容
以下就是几种常用的用php抓取网页中的内容的方法.1.file_get_contentsPHP代码代码如下:>>>>>>>>>>>&g ...
php抓取网页
用php抓取页面的内容在实际的开发其中是很实用的,如作一个简单的内容採集器,提取网页中的部分内容等等.抓取到的内容在通过正則表達式做一下过滤就得到了你想要的内容.下面就是几种经常使用的用php抓取网页 ...
Python3简单爬虫抓取网页图片
现在网上有很多python2写的爬虫抓取网页图片的实例,但不适用新手(新手都使用python3环境,不兼容python2), 所以我用Python3的语法写了一个简单抓取网页图片的实例,希望能够帮助到 ...
Python3抓取javascript生成的html网页
用urllib等抓取网页,只能读取网页的静态源文件,而抓不到由javascript生成的内容. 究其原因,是因为urllib是瞬时抓取,它不会等javascript的加载延迟,所以页面中由javasc ...
python3下scrapy爬虫(第四卷:初步抓取网页内容之抓取网页里的指定数据延展方法）
上卷中我运用创建HtmlXPathSelector 对象进行抓取数据: 现在咱们再试一下其他的方法,先试一下我得最爱XPATH 看下结果: 直接打印出结果了我现在就正常拼下路径只求打印结果: 现在 ...
[Python]网络爬虫（一）：抓取网页的含义和URL基本构成
一.网络爬虫的定义网络爬虫,即Web Spider,是一个很形象的名字. 把互联网比喻成一个蜘蛛网,那么Spider就是在网上爬来爬去的蜘蛛.网络蜘蛛是通过网页的链接地址来寻找网页的. 从网站某一个 ...
python抓取网页引用的模块和类
在Python3.x中,我们可以使用urlib这个组件抓取网页,urllib是一个URL处理包,这个包中集合了一些处理URL的模块,如下:1.urllib.request模块用来打开和读取URLs:2 ...
PHP的CURL方法curl_setopt()函数案例介绍(抓取网页,POST数据)
通过curl_setopt()函数可以方便快捷的抓取网页(采集很方便),curl_setopt 是php的一个扩展库使用条件:需要在php.ini 中配置开启.(PHP 4 >= 4.0.2) ...

随机推荐

DbHelper为什么要用Using？
我们分析一下DbHelper做什么事情,大家都知道它用于数据库的连接操作,这里的数据库连接会创建非托管资源,c#的垃圾回收机制不会对它处理,需要实现IDisposable接口手动释放. 手动释放的 ...
完整的PHP MYSQL数据库类
<?php class mysql { private $db_host; //数据库主机 private $db_user; //数据库用户名 private $db_ ...
CSS修改input[type=range]滑块样式
input[type="range"]是html5中的input标签新属性,样子如下: <input type="range" value="4 ...
POJ2914 （未解决）无向图最小割|Stoer-Wagner算法|模板
还不是很懂,贴两篇学习的博客: http://www.hankcs.com/program/algorithm/poj-2914-minimum-cut.html http://blog.sina.c ...
HDU 4946 Area of Mushroom（构造凸包）
题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=4946 题目大意:在一个平面上有n个点p1,p2,p3,p4....pn,每个点可以以v的速度在平面上移 ...
Android多线程通信机制
掌握Android的多线程通信机制,我们首先应该掌握Android中进程与线程是什么. 1. 进程在Android中,一个应用程序就是一个独立的进程(应用运行在一个独立的环境中,可以避免其他应用程序 ...
plist中的中文数据
直接打印那个值就会显示中文了,你打引整个 plist 文件的内容,则会显示出 NSUTF8 编码后的数据:请放心使用:
c语言中的一些注意点
1.头文件两种形式的区别(#include<mystring.h>与#include"mystring.h") 当运行一个程序时,需要调用自己写的函数时,需要在头文件加 ...
FreeRTOS学习及移植笔记之一：开始FreeRTOS之旅
1.必要的准备工作工欲善其事,必先利其器,在开始学习和移植之前,相应的准备工作必不可少.所以在开始我们写要准备如下: 测试环境:我准备在STM32F103平台上移植和测试FreeRTOS系统准备F ...
a0=1、a1=1、a2=a1+a0、a3=a2+a1，以此类推，请写代码用递归算出a30？
public class Test { public static void main(String[] args) { System.out.println(recursive(30)); } pu ...

python3 抓取网页资源的 N 种方法

python3 抓取网页资源的 N 种方法的更多相关文章

随机推荐

热门专题