Python challenge 3 - urllib & re

URL of the third challenge: http://www.pythonchallenge.com/pc/def/ocr.html

Hint1：recognize the characters. maybe they are in the book, but MAYBE they are in the page source.

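Hint2: a comment in the page source says: find rare characters in the mess below; it is followed by a large block of characters.

Clearly the task is to find the characters that occur least often in this block, ignoring whitespace. Characters that occur equally often are kept in their order of appearance.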

import re

import urllib

# urllib to open the website

response = urllib.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html")

source = response.read()

response.close()

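# Print the whole HTML source that was fetched

print source

# Get the contents of every HTML comment

data = re.findall(r'<!--(.*?)-->', source, re.S)

# Extract the letters from the second comment

charList = re.findall(r'([a-zA-Z])', data[1], re.S)

print charList

print ''.join(charList)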

The final result is

['e', 'q', 'u', 'a', 'l', 'i', 't', 'y']

equality
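The hint asks for the rarest characters, whereas the listing above simply keeps the letters. A minimal sketch of the counting approach follows, written in Python 3 syntax (the listings in this post are Python 2). The function name rare_characters and the sample string are made up for illustration and stand in for the real block of characters.

```python
from collections import Counter


def rare_characters(mess):
    """Return the least frequent non-whitespace characters, in order of first appearance."""
    # Count every character except whitespace.
    counts = Counter(ch for ch in mess if not ch.isspace())
    if not counts:
        return ''
    fewest = min(counts.values())
    # Counter keeps insertion order, so first-appearance order is preserved.
    return ''.join(ch for ch, n in counts.items() if n == fewest)


sample = "%%$$@@\n%e$@q%\n$@u$%@"  # hypothetical stand-in for the real mess
print(rare_characters(sample))  # equ
```

Counting does not depend on the rare characters being letters, so it would still work if the answer were hidden in digits or punctuation.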

####################################################################################################################################

Python's urllib library can fetch web page data from a given URL so that it can then be analyzed.

import urllib

google = urllib.urlopen('http://www.google.com')

print 'http header:\n', google.info()

print 'http status:', google.getcode()

print 'url:', google.geturl()

# result

http header:

Date: Tue, 21 Oct 2014 19:30:35 GMT

Expires: -1

Cache-Control: private, max-age=0

Content-Type: text/html; charset=ISO-8859-1

Set-Cookie: PREF=ID=521bc5021bb6e976:FF=0:TM=1413919835:LM=1413919835:S=7cbCQWnhLCPJFOiw; expires=Thu, 20-Oct-2016 19:30:35 GMT; path=/; domain=.google.com

Set-Cookie: NID=67=mzfYCxoBC3d9VaQC6-cXKIcbxt4eekorvE6lon1ZHQhLeVxasD2oeRKEG2In90zRAqNPQ1xLfzR_ha1ife0JqdJankdexWaFjZiQN2mLGjavWCfMBYETbFfIst08iNtR; expires=Wed, 22-Apr-2015 19:30:35 GMT; path=/; domain=.google.com; HttpOnly

P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."

Server: gws

X-XSS-Protection: 1; mode=block

X-Frame-Options: SAMEORIGIN

Alternate-Protocol: 80:quic,p=0.01

http status: 200

url: http://www.google.com

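We can fetch a web page with urlopen and then get its full contents with the read method.

info returns the HTTP headers as an httplib.HTTPMessage object, which represents the header information sent back by the remote server.

getcode returns the HTTP status. For an HTTP request, 200 means success and 404 means the URL was not found.

geturl returns the URL the data actually came from, which can differ from the requested URL after a redirect.

Environment variables, by contrast, are read and set with getenv and putenv in the os module; these are not part of urllib.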

print help(urllib.urlopen)

#result

Help on function urlopen in module urllib:

urlopen(url, data=None, proxies=None)

    Create a file-like object for the specified URL to read from.

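From the above we can see that urlopen creates a file-like object for reading from the specified URL.

The url parameter is the location of the remote data, usually an http or ftp URL.

The data parameter is the data to submit to the URL; when it is given, the request is sent as a POST instead of a GET.

The proxies parameter holds the proxy settings.

urlopen returns a file-like object.

It has read(), readline(), readlines(), fileno(), close() and other methods, just like a file object.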
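The file-like behaviour can be tried without a network connection. The sketch below is an assumption-laden variant of the listings above: it uses Python 3, where urlopen lives in urllib.request, and a data: URL (chosen here only so that nothing has to be downloaded) in place of a real web address.

```python
from urllib.request import urlopen

# A data: URL carries its own content, so no request leaves the machine.
url = "data:text/plain,first%20line%0Asecond%20line"

with urlopen(url) as response:
    source_url = response.geturl()                      # where the data came from
    content_type = response.info().get_content_type()   # header information
    first_line = response.readline()                    # file-like: a single line
    rest = response.read()                              # file-like: the remainder

print(source_url)
print(content_type)
print(first_line, rest)
```

With an http URL the same calls apply, and getcode() additionally reports the HTTP status.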

####################################################################################################################################

The re regular expression module in Python

re.match: matching a pattern against a string

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:

   print "matchObj.group() : ", matchObj.group()

   print "matchObj.group(1) : ", matchObj.group(1)

   print "matchObj.group(2) : ", matchObj.group(2)

else:

   print "No match!!"

The code above produces

matchObj.group() :  Cats are smarter than dogs

matchObj.group(1) :  Cats

matchObj.group(2) :  smarter

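As we can see, group() returns the entire match, and group(n) returns the n-th submatch; the pattern above has two capture groups.

The main call is re.match(pattern, string, flags).

pattern is the regular expression to match with.

string is the input string to be matched.

flags is optional; several flags can be combined with |.

re.I or re.IGNORECASE: matching ignores upper and lower case.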

（Performs case-insensitive matching.）

re.S or re.DOTALL: dot-matches-all mode. It changes the behavior of '.' so that it also matches \n.

（Makes a period (dot) match any character, including a newline.）

re.M or re.MULTILINE: multi-line mode. It changes the behavior of '^' and '$'.

（Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).）

re.L or re.LOCALE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the current locale.

（Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b
and \B).）

re.U or re.UNICODE: makes the predefined character classes \w, \W, \b, \B, \s, \S depend on the character properties defined by Unicode.

（Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.）

re.X or re.VERBOSE: verbose mode. In this mode a regular expression can span several lines, whitespace is ignored, and comments can be added.

（Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a
comment marker.）
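The effect of these flags is easiest to see side by side. The following is a small sketch in Python 3 syntax (the listings in this post are Python 2); the sample text and variable names are invented for the demonstration.

```python
import re

text = "first line\nSecond LINE"

# re.I: case-insensitive matching
ignore_case = re.findall(r'line', text, re.I)

# re.S: '.' also matches the newline, so the match can span both lines
without_dotall = re.search(r'first.*Second', text)
with_dotall = re.search(r'first.*Second', text, re.S)

# re.M: '^' matches at the start of every line, not only the start of the string
line_starts = re.findall(r'^\w+', text, re.M)

# re.X: whitespace and comments inside the pattern are ignored
verbose = re.compile(r"""
    (\w+)   # first word
    \s+     # separating whitespace
    (\w+)   # second word
""", re.X)

print(ignore_case)                 # ['line', 'LINE']
print(without_dotall)              # None
print(repr(with_dotall.group()))   # 'first line\nSecond'
print(line_starts)                 # ['first', 'Second']
print(verbose.match(text).groups())  # ('first', 'line')
```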

re.search vs. re.match

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'dogs', line, re.M|re.I)

if matchObj:

   print "match --> matchObj.group() : ", matchObj.group()

else:

   print "No match!!"

searchObj = re.search( r'dogs', line, re.M|re.I)

if searchObj:

   print "search --> searchObj.group() : ", searchObj.group()

else:

   print "Nothing found!!"

# result

No match!!

search --> searchObj.group() :  dogs

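We can see that match only checks at the beginning of the string: if the pattern does not match there, there is no match at all.

search, by contrast, scans the entire string from start to end.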

re.sub

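The signature is as follows

re.sub(pattern, repl, string, count=0)

It replaces the matches in string with repl. By default every match is replaced; if count is given, at most count matches are replaced.

It then returns the modified string.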

import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments

num = re.sub(r'#.*$', "", phone)

print "Phone Num : ", num

# Remove anything other than digits

num = re.sub(r'\D', "", phone)

print "Phone Num : ", num

# result

Phone Num :  2004-959-559

Phone Num :  2004959559
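The count argument is not exercised by the listing above. A short sketch in Python 3 syntax (the variable names are chosen for this illustration) shows the difference between the default and count=1:

```python
import re

phone = "2004-959-559"

# Default count=0: every match is replaced.
all_removed = re.sub(r'-', "", phone)

# count=1: only the first match is replaced.
first_removed = re.sub(r'-', "", phone, count=1)

print(all_removed)    # 2004959559
print(first_removed)  # 2004959-559
```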

re.split(pattern, string, maxsplit=0)

re.split splits a string by a pattern. maxsplit is the maximum number of splits; maxsplit=1 means split once. The default is 0, which means no limit.

import re

print re.split(r'\W+', 'Words, words, words.')

print re.split(r'(\W+)', 'Words, words, words.')

print re.split(r'\W+', 'Words, words, words.', 1)

# result

['Words', 'words', 'words', '']

['Words', ', ', 'words', ', ', 'words', '.', '']

['Words', 'words, words.']

If the pattern matches at the very start or end of the string, the returned list begins or ends with an empty string.

import re

print re.split(r'(\W+)', '...words, words...')

# result

['', '...', 'words', ', ', 'words', '...', '']

If the pattern matches nowhere in the string, a list containing the whole string is returned.

import re

print re.split('a', '...words, words...')

# result

['...words, words...']

####

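str.split('\s') and re.split('\s', str) both split a string and return a list, but they behave differently.

1. str.split('\s') splits on the literal two-character text '\s'.

2. re.split('\s', str) splits on whitespace, because '\s' in a regular expression means a whitespace character.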
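The difference shows up on a string that contains both a literal backslash-s and real whitespace. This is a sketch in Python 3 syntax with a made-up sample string:

```python
import re

# Contains a literal backslash followed by 's', then a space and a tab.
text = r"a\sb c" + "\td"

literal_parts = text.split(r'\s')    # splits on the two characters '\' and 's'
regex_parts = re.split(r'\s', text)  # splits on each whitespace character

print(literal_parts)  # ['a', 'b c\td']
print(regex_parts)    # ['a\\sb', 'c', 'd']
```

Calling text.split() with no argument also splits on whitespace, without needing the re module.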

re.findall(pattern, string, flags=0)

Finds every substring that the pattern matches and returns them as a list, in left-to-right order. If nothing matches, an empty list is returned.

import re

print re.findall('a', 'bcdef')

print re.findall(r'\d+', '12a34b56c789e')

# result

[]

['12', '34', '56', '789']
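When the pattern contains capture groups, findall returns the group contents instead of the whole match: a list of strings for one group, a list of tuples for several. A brief sketch in Python 3 syntax, with an invented HTML snippet as input:

```python
import re

html = "<!-- first --><p>text</p><!-- second\ncomment -->"

# One group: the list holds only the text captured inside each comment.
comments = re.findall(r'<!--(.*?)-->', html, re.S)

# Two groups: each match becomes a tuple of the captured parts.
pairs = re.findall(r'(\w)(\d)', 'a1 b2')

print(comments)  # [' first ', ' second\ncomment ']
print(pairs)     # [('a', '1'), ('b', '2')]
```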

re.compile(pattern, flags=0)

Compiles a regular expression and returns a RegexObject, whose match or search method can then be called.

prog = re.compile(pattern)

result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

The first form allows the compiled regular expression to be reused.
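Reuse here means compiling the pattern once and calling its methods many times. A minimal sketch in Python 3 syntax, with hypothetical input lines:

```python
import re

# Compile once, then reuse the same pattern object for several calls.
digits = re.compile(r'\d+')

lines = ["12a34", "no digits here", "b56c789"]

found = [digits.findall(line) for line in lines]
starts_with_digits = [bool(digits.match(line)) for line in lines]

print(found)               # [['12', '34'], [], ['56', '789']]
print(starts_with_digits)  # [True, False, False]
```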
