Writing a Web Crawler in Python (Part 1)

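About Python:

I learned C, then C++, and in the end it was Java that paid the bills.

I have been living in my little Java world ever since.

There is a saying: "Life is short, you need Python!"

Just how powerful and how concise is it?

Out of curiosity, I couldn't resist picking up a little of it during a few slow days. (In fact I have been learning it for less than two days.)

Take a simple "HelloWorld" example: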

//Java
class Main{
	public static void main(String[] args){
		String str = "HelloWorld!";
		System.out.println(str);
	}
}

#Python
str = 'HelloWorld'
print str

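At first glance Python really does feel great: concise, to the point, and time-saving.

As for efficiency, I believe that in any language what matters most for a developer is clear, economical thinking.

Others say: "Python is fun for a moment, but refactoring is a crematorium."

Still, Python is used in so many scenarios. Choose according to your own needs; I trust that so many predecessors cannot all be wrong.

Python is certainly a language worth learning.

About web crawlers:

I don't really understand this topic either, so I grabbed a definition to get a rough idea.

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

Truly sophisticated, powerful crawlers use many crawling algorithms and strategies.

The example I wrote is about as simple as it gets.

I hadn't even finished the Python basics before I was itching to build something.

So I thought of writing a web crawler as a small exercise.

My alma mater is Gansu Agricultural University (甘肃农大). I often read the university news, so I tried crawling it to have something to read in my spare time.

Since I have learned so little, the crawler's functionality is very simple: it downloads the latest news from the university's official site and saves each article as a web page. You can fetch as many pages as you like.

Here I fetched the four most recent pages of news.

This is my first crawler. I split this very simple task into three steps:

Step 1: crawl a single news article, then download and save it.

Step 2: crawl all the news articles on one page, then download and save them.

Step 3: crawl all the news articles on x pages, then download and save them.

A very important part of web crawling is analyzing the page elements.

Step 1: fetch the content of a single URL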

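# coding: utf-8
# Python 2 script (print statement, urllib.urlopen).
# The coding line must be on line 1 or 2 because the source contains Chinese text.
# Crawler: Gansu Agricultural University (甘肃农业大学) news site -> University News (学校新闻) section.
# Fetch one article from this page:
# http://news.gsau.edu.cn/tzgg1/xxxw33.htm
import urllib

# One <a> element copied from the list page (named link to avoid shadowing the built-in str)
link = '<a class="c43092" href="../info/1037/30577.htm" target="_blank" title="双联行动水泉乡克那村工作组赴联系村开展精准扶贫相关工作">双联行动水泉乡克那村工作组赴联系村开展精准扶贫相关工作</a>'

hrefBeg = link.find('href=')
hrefEnd = link.find('.htm', hrefBeg)
# Skip the 6 characters of href=" and keep the 4 characters of .htm
href = link[hrefBeg+6: hrefEnd+4]
print href
# Strip the leading ../
href = href[3:]
print href

titleBeg = link.find(r'title=')
titleEnd = link.find(r'>', titleBeg)
# Skip the 7 characters of title=" and drop the closing quote
title = link[titleBeg+7: titleEnd-1]
print title

url = 'http://news.gsau.edu.cn/' + href
print 'url: '+url
content = urllib.urlopen(url).read()
#print content

filename = title + '.html'
# Write the fetched page content to filename in the current directory
with open(filename, 'wb') as f:
    f.write(content)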

There isn't much page analysis needed here; a bit of string slicing and concatenation is enough.
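The slicing in Step 1 can be wrapped in a small helper. The following is only a sketch in Python 3 (the script above is Python 2); `attr_value` is a hypothetical name that does not appear in the original, and the sample title is a placeholder. It looks for the closing quote instead of counting characters, so it does not depend on the link ending in `.htm`.

```python
# Python 3 sketch; attr_value is a hypothetical helper, not part of the original script.
def attr_value(tag, name):
    """Return the value of attribute `name` in an HTML tag string, or None."""
    marker = name + '="'
    beg = tag.find(marker)
    if beg == -1:          # str.find returns -1 when nothing matches
        return None
    beg += len(marker)     # move past name="
    end = tag.find('"', beg)
    if end == -1:
        return None
    return tag[beg:end]


anchor = ('<a class="c43092" href="../info/1037/30577.htm" '
          'target="_blank" title="Sample title">Sample title</a>')

href = attr_value(anchor, "href")
title = attr_value(anchor, "title")
# Drop the leading ../ and prepend the site root, as the script above does
url = "http://news.gsau.edu.cn/" + href[len("../"):]
print(url)
print(title)
```

Checking for -1 matters because slicing with a failed `find` result silently produces a wrong substring rather than an error.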

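Step 2: crawl all the articles on this page, 23 per page

Now a little analysis is needed: how do we find each of these 23 URLs?

You can first lock onto one element and search from it. Then pay attention to the pattern of each find call, which really comes down to the position each search starts from.

I store each URL in an array (a Python list), and once the search is complete, download the URLs in it.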
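The "search, then continue from where the last match ended" pattern can be sketched on its own. This is a Python 3 illustration, not the author's code; `collect_links` and the sample HTML are assumptions made for the example, while the marker `class="c43092"` and the page size of 23 come from the article.

```python
# Python 3 sketch; collect_links and the sample HTML are hypothetical.
def collect_links(html, marker='class="c43092"', limit=23):
    """Collect href values of elements carrying `marker`, in page order."""
    urls = []
    pos = html.find(marker)                 # first occurrence of the marker
    while pos != -1 and len(urls) < limit:
        href_beg = html.find('href="', pos)
        if href_beg == -1:
            break
        href_beg += len('href="')
        href_end = html.find('"', href_beg)
        if href_end == -1:
            break
        urls.append(html[href_beg:href_end])
        # Resume the search after the link just read
        pos = html.find(marker, href_end)
    return urls


sample = (
    '<a class="c43092" href="../info/1037/30596.htm">A</a>'
    '<a class="other" href="../skip.htm">skip</a>'
    '<a class="c43092" href="../info/1037/30595.htm">B</a>'
)
print(collect_links(sample))
```

The second argument of `find` is what keeps the loop moving forward; without it every search would return the first link again.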

# coding: utf-8
# Python 2 script.
# Crawler: Gansu Agricultural University (甘肃农业大学) news site -> University News (学校新闻) section.
# http://news.gsau.edu.cn/tzgg1/xxxw33.htm
# Each article link on the list page looks like this:
#<a class="c43092" href="../info/1037/30567.htm" target="_blank" title="双联行动水泉乡克那村工作组赴联系村开展精准扶贫相关工作">双联行动水泉乡克那村工作组赴联系村开展精准扶贫相关工作</a>
import urllib
import time
import os

pageSize = 23
articleList = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
urlList = []

# Lock onto class="c43092"
hrefClass = articleList.find('class="c43092"')
i = 0
# find() returns -1 when there are no more matches, so test the search result
# (the extracted href is a string and can never equal -1)
while hrefClass != -1 and i < pageSize:
    hrefBeg = articleList.find(r'href=', hrefClass)
    hrefEnd = articleList.find(r'.htm', hrefBeg)
    # Skip href=" and then strip the leading ../
    href = articleList[hrefBeg+6: hrefEnd+4][3:]
    urlList.append('http://news.gsau.edu.cn/' + href)
    print urlList[i]
    # Continue searching after the end of the previous URL
    hrefClass = articleList.find('class="c43092"', hrefEnd)
    i = i+1
else:
    print r'本页所有URL已爬完!!!'  # "All URLs on this page have been collected!!!"

# Download every article on this page, using the article title as the file name
# title: <HTML><HEAD><TITLE>酒泉市市长都伟来校对接校地合作事宜-新闻网</TITLE>
# The output directory must exist before files are written into it
if not os.path.isdir('GsauNews'):
    os.mkdir('GsauNews')

j = 0
while j < len(urlList):
    content = urllib.urlopen(urlList[j]).read()
    titleBeg = content.find(r'<TITLE>')
    titleEnd = content.find(r'</TITLE>', titleBeg)
    title = content[titleBeg+7: titleEnd]
    print title
    print urlList[j]+r'正在下载...'  # "... is downloading"
    time.sleep(1)
    # The title is decoded as UTF-8 and re-encoded as GBK for the file name
    # (this assumes a UTF-8 page and a GBK file system, as on Chinese Windows)
    with open(r'GsauNews' + os.path.sep + title.decode('utf-8').encode('gbk')+'.html', 'wb') as f:
        f.write(content)
    j = j + 1
else:
    print r'当页所有url完成下载!'  # "All URLs on this page have been downloaded!"
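The download loop above derives each file name from the page's `<TITLE>` element. As a hedged Python 3 sketch, the two steps can be isolated as follows; `extract_title` and `safe_filename` are hypothetical helpers, a regular expression is swapped in for the `find` calls, and replacing characters that Windows forbids in file names is an addition that the original script does not perform.

```python
# Python 3 sketch; extract_title and safe_filename are hypothetical helpers.
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text between <title> and </title>, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def safe_filename(title, ext=".html"):
    """Replace characters that are not allowed in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + ext


sample = "<HTML><HEAD><TITLE>酒泉市市长都伟来校对接校地合作事宜-新闻网</TITLE></HEAD></HTML>"
title = extract_title(sample)
print(safe_filename(title))
```

In Python 3 the title is already a `str` once the page bytes have been decoded, so the `decode('utf-8').encode('gbk')` round trip from the Python 2 script is not needed for the file name.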

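Step 3: crawl all the articles on N pages

To crawl N pages, first note that you want the latest pages, not a fixed set of pages.

So the pagination data has to be analyzed. Conveniently, the bottom of the index page provides it, so we use it directly.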
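As a rough Python 3 sketch of reading the pagination data: the sample markup is the pagination cell that the author's Step 3 script quotes in a comment, while `parse_page_count` is a hypothetical helper and the regular expression is a swapped-in technique (the script itself uses `find`).

```python
# Python 3 sketch; parse_page_count is a hypothetical helper.
import re


def parse_page_count(html, element_id="fanye43092"):
    """Return the total page count from text like '1/222' after the given id."""
    start = html.find('id="%s"' % element_id)
    if start == -1:
        return None
    # First "current/total" pair that follows the pagination element
    match = re.search(r"(\d+)/(\d+)", html[start:])
    if match is None:
        return None
    return int(match.group(2))


sample = '<td width="1%" align="left" id="fanye43092" nowrap="">共5084条  1/222 </td>'
print(parse_page_count(sample))
```

Returning None instead of raising lets the caller decide what to do when the page layout changes.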

Look at the URLs of the most recent pages:

#http://news.gsau.edu.cn/tzgg1/xxxw33.htm        page 1
#http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm    page 2
#http://news.gsau.edu.cn/tzgg1/xxxw33/220.htm    page 3
#http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm    page 4

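Comparing these with the pagination data (1/222 on the first page), the pattern is easy to spot: for every page after the first, the number in the URL is fenyeCount-pageNo+1, where fenyeCount is the total number of pages. With 222 pages, page 2 is 222-2+1 = 221.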
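The rule can be written as a small function. This is a Python 3 sketch; `list_page_url` and `BASE` are hypothetical names, and the total of 222 pages is the value shown on the site when the article was written.

```python
# Python 3 sketch; list_page_url and BASE are hypothetical names.
BASE = "http://news.gsau.edu.cn/tzgg1/xxxw33"


def list_page_url(page_no, page_total):
    """URL of the page_no-th newest list page, given the total page count."""
    if not 1 <= page_no <= page_total:
        raise ValueError("page_no out of range")
    if page_no == 1:
        # The newest page has no number in its URL
        return BASE + ".htm"
    # Older pages count down from the total: fenyeCount - pageNo + 1
    return "%s/%d.htm" % (BASE, page_total - page_no + 1)


for n in range(1, 5):
    print(n, list_page_url(n, 222))
```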

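One annoying thing: for reasons I don't understand, every page other than the first has some data mixed in that belongs to the previous page rather than the current one. It took me a long time to track this down.

I had to add quite a few conditional checks to extract the right links.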

# coding: utf-8
# Python 2 script.
# Crawler: Gansu Agricultural University (甘肃农业大学) news site -> University News (学校新闻) section.
import urllib
import time
import os

pageCount = 4    # number of list pages to crawl
pageSize = 23    # articles per list page
pageNo = 1
urlList = [' ']*pageSize*pageCount

# Analyze the pagination element:
#<td width="1%" align="left" id="fanye43092" nowrap="">共5084条  1/222 </td>
indexContent = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
fenyeId = indexContent.find('id="fanye43092"') # lock onto the pagination id
fenyeBeg = indexContent.find('1/', fenyeId)
fenyeEnd = indexContent.find(' ', fenyeBeg)
fenyeCount = int(indexContent[fenyeBeg+2: fenyeEnd])  # total number of pages, e.g. 222

i = 0
while pageNo <= pageCount:
    if pageNo==1:
        articleUrl = 'http://news.gsau.edu.cn/tzgg1/xxxw33.htm'
    else:
        articleUrl = 'http://news.gsau.edu.cn/tzgg1/xxxw33/'+ str(fenyeCount-pageNo+1) + '.htm'
    # "Crawling N pages in total, currently on page M, URL: ..."
    print r'--------共爬取'+ str(pageCount) + '页  当前第' + str(pageNo) + '页  URL:' + articleUrl
    articleList = urllib.urlopen(articleUrl).read()
    while i<pageSize*pageNo:
        if i == pageSize*(pageNo-1):
            # First link of a page (i = 0, 23, 46, ...): search from a fixed element.
            # Later pages start at lineimg43092_16 to skip the stray entries
            # that belong to the previous page.
            if pageNo == 1:
                hrefId = articleList.find('id="line43092_0"')
            else:
                hrefId = articleList.find('id="lineimg43092_16"')
        else:
            # Otherwise continue from where the previous URL ended
            hrefId = articleList.find('class="c43092"', hrefEnd)
        hrefBeg = articleList.find(r'href=', hrefId)
        hrefEnd = articleList.find(r'.htm', hrefBeg)
        if pageNo == 1:
            href=articleList[hrefBeg+6: hrefEnd+4][3:]  # strip ../
        else:
            href=articleList[hrefBeg+6: hrefEnd+4][6:]  # strip ../../
        urlList[i] = 'http://news.gsau.edu.cn/' + href
        print urlList[i]
        i = i+1
    else:
        print r'========第'+str(pageNo)+'页url提取完毕!!!'  # "URLs of page N extracted"
    pageNo = pageNo + 1

print r'============全部url提取完毕!!!============'+'\n'*3  # "All URLs extracted"
print r'==========開始下载到本地==========='  # "Starting download to local disk"

# The output directory must exist before files are written into it
if not os.path.isdir('GsauNews'):
    os.mkdir('GsauNews')

j = 0
while j < pageCount * pageSize:
    content = urllib.urlopen(urlList[j]).read()
    titleBeg = content.find(r'<TITLE>')
    titleEnd = content.find(r'</TITLE>', titleBeg)
    title = content[titleBeg+7: titleEnd]
    print title
    print urlList[j]+r'正在下载...'+'\n'  # "... is downloading"
    time.sleep(1)
    # File name: title decoded as UTF-8 and re-encoded as GBK
    with open(r'GsauNews' + os.path.sep + title.decode('utf-8').encode('gbk')+'.html', 'wb') as f:
        f.write(content)
    j = j + 1
else:
    # "Download finished: N pages, M articles in total"
    print r'下载完毕, 共下载'+str(pageCount)+'页, '+str(pageCount*pageSize)+'篇新闻'

And that's the whole crawl.

Here is the output of a run:

==================== RESTART: D:\python\CSDNCrawler03.py ====================

--------共爬取4页  当前第1页  URL:http://news.gsau.edu.cn/tzgg1/xxxw33.htm

http://news.gsau.edu.cn/info/1037/30596.htm

http://news.gsau.edu.cn/info/1037/30595.htm

http://news.gsau.edu.cn/info/1037/30593.htm

http://news.gsau.edu.cn/info/1037/30591.htm

http://news.gsau.edu.cn/info/1037/30584.htm

http://news.gsau.edu.cn/info/1037/30583.htm

http://news.gsau.edu.cn/info/1037/30580.htm

http://news.gsau.edu.cn/info/1037/30577.htm

http://news.gsau.edu.cn/info/1037/30574.htm

http://news.gsau.edu.cn/info/1037/30573.htm

http://news.gsau.edu.cn/info/1037/30571.htm

http://news.gsau.edu.cn/info/1037/30569.htm

http://news.gsau.edu.cn/info/1037/30567.htm

http://news.gsau.edu.cn/info/1037/30566.htm

http://news.gsau.edu.cn/info/1037/30565.htm

http://news.gsau.edu.cn/info/1037/30559.htm

http://news.gsau.edu.cn/info/1037/30558.htm

http://news.gsau.edu.cn/info/1037/30557.htm

http://news.gsau.edu.cn/info/1037/30555.htm

http://news.gsau.edu.cn/info/1037/30554.htm

http://news.gsau.edu.cn/info/1037/30546.htm

http://news.gsau.edu.cn/info/1037/30542.htm

http://news.gsau.edu.cn/info/1037/30540.htm

========第1页url提取完毕!!!

--------共爬取4页  当前第2页  URL:http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm

http://news.gsau.edu.cn/info/1037/30536.htm

http://news.gsau.edu.cn/info/1037/30534.htm

http://news.gsau.edu.cn/info/1037/30528.htm

http://news.gsau.edu.cn/info/1037/30525.htm

http://news.gsau.edu.cn/info/1037/30527.htm

http://news.gsau.edu.cn/info/1037/30524.htm

http://news.gsau.edu.cn/info/1037/30520.htm

http://news.gsau.edu.cn/info/1037/30519.htm

http://news.gsau.edu.cn/info/1037/30515.htm

http://news.gsau.edu.cn/info/1037/30508.htm

http://news.gsau.edu.cn/info/1037/30507.htm

http://news.gsau.edu.cn/info/1037/30506.htm

http://news.gsau.edu.cn/info/1037/30505.htm

http://news.gsau.edu.cn/info/1037/30501.htm

http://news.gsau.edu.cn/info/1037/30498.htm

http://news.gsau.edu.cn/info/1037/30495.htm

http://news.gsau.edu.cn/info/1037/30493.htm

http://news.gsau.edu.cn/info/1037/30482.htm

http://news.gsau.edu.cn/info/1037/30480.htm

http://news.gsau.edu.cn/info/1037/30472.htm

http://news.gsau.edu.cn/info/1037/30471.htm

http://news.gsau.edu.cn/info/1037/30470.htm

http://news.gsau.edu.cn/info/1037/30469.htm

========第2页url提取完毕!!!

--------共爬取4页  当前第3页  URL:http://news.gsau.edu.cn/tzgg1/xxxw33/220.htm

http://news.gsau.edu.cn/info/1037/30468.htm

http://news.gsau.edu.cn/info/1037/30467.htm

http://news.gsau.edu.cn/info/1037/30466.htm

http://news.gsau.edu.cn/info/1037/30465.htm

http://news.gsau.edu.cn/info/1037/30461.htm

http://news.gsau.edu.cn/info/1037/30457.htm

http://news.gsau.edu.cn/info/1037/30452.htm

http://news.gsau.edu.cn/info/1037/30450.htm

http://news.gsau.edu.cn/info/1037/30449.htm

http://news.gsau.edu.cn/info/1037/30441.htm

http://news.gsau.edu.cn/info/1037/30437.htm

http://news.gsau.edu.cn/info/1037/30429.htm

http://news.gsau.edu.cn/info/1037/30422.htm

http://news.gsau.edu.cn/info/1037/30408.htm

http://news.gsau.edu.cn/info/1037/30397.htm

http://news.gsau.edu.cn/info/1037/30396.htm

http://news.gsau.edu.cn/info/1037/30394.htm

http://news.gsau.edu.cn/info/1037/30392.htm

http://news.gsau.edu.cn/info/1037/30390.htm

http://news.gsau.edu.cn/info/1037/30386.htm

http://news.gsau.edu.cn/info/1037/30385.htm

http://news.gsau.edu.cn/info/1037/30376.htm

http://news.gsau.edu.cn/info/1037/30374.htm

========第3页url提取完毕!!!

--------共爬取4页  当前第4页  URL:http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm

http://news.gsau.edu.cn/info/1037/30370.htm

http://news.gsau.edu.cn/info/1037/30369.htm

http://news.gsau.edu.cn/info/1037/30355.htm

http://news.gsau.edu.cn/info/1037/30345.htm

http://news.gsau.edu.cn/info/1037/30343.htm

http://news.gsau.edu.cn/info/1037/30342.htm

http://news.gsau.edu.cn/info/1037/30340.htm

http://news.gsau.edu.cn/info/1037/30339.htm

http://news.gsau.edu.cn/info/1037/30335.htm

http://news.gsau.edu.cn/info/1037/30333.htm

http://news.gsau.edu.cn/info/1037/30331.htm

http://news.gsau.edu.cn/info/1037/30324.htm

http://news.gsau.edu.cn/info/1037/30312.htm

http://news.gsau.edu.cn/info/1037/30311.htm

http://news.gsau.edu.cn/info/1037/30302.htm

http://news.gsau.edu.cn/info/1037/30301.htm

http://news.gsau.edu.cn/info/1037/30298.htm

http://news.gsau.edu.cn/info/1037/30294.htm

http://news.gsau.edu.cn/info/1037/30293.htm

http://news.gsau.edu.cn/info/1037/30289.htm

http://news.gsau.edu.cn/info/1037/30287.htm

http://news.gsau.edu.cn/info/1037/30286.htm

http://news.gsau.edu.cn/info/1037/30279.htm

========第4页url提取完毕!!!

============全部url提取完毕!!!============

==========開始下载到本地===========

甘肃爽口源生态科技股份有限公司来校洽谈校企合作-新闻网

http://news.gsau.edu.cn/info/1037/30596.htm正在下载...

我校第二届辅导员职业能力大赛圆满落幕-新闻网

http://news.gsau.edu.cn/info/1037/30595.htm正在下载...

双联行动周家山村工作组开展精准扶贫工作-新闻网

http://news.gsau.edu.cn/info/1037/30593.htm正在下载...

新疆生产建设兵团来校举办专场校园招聘会-新闻网

http://news.gsau.edu.cn/info/1037/30591.htm正在下载...

【图片新闻】桃李吐新叶   羲园醉春风-新闻网

http://news.gsau.edu.cn/info/1037/30584.htm正在下载...

甘农师生爱心救助白血病学生张渭-新闻网

http://news.gsau.edu.cn/info/1037/30583.htm正在下载...

副校长赵兴绪带队赴新庄村开展精准扶贫相关工作-新闻网

http://news.gsau.edu.cn/info/1037/30580.htm正在下载...

双联行动红庄村工作组开展精准扶贫工作-新闻网

http://news.gsau.edu.cn/info/1037/30577.htm正在下载...

校长吴建民带队赴广河县开展精准扶贫和双联工作-新闻网

http://news.gsau.edu.cn/info/1037/30574.htm正在下载...

动物医学院赴园子村开展精准扶贫相关工作-新闻网

http://news.gsau.edu.cn/info/1037/30573.htm正在下载...

..........

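It crawled about 90 web pages in a little over a minute.

Of course the code could be optimized a lot, but my Python fundamentals are really weak, so I need to go back and shore them up.

It could also be upgraded further: use regular expressions to match just the content I want, instead of saving whole pages indiscriminately like this.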
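As one possible direction for that upgrade, here is a Python 3 sketch that pulls link and title pairs out of a list page with the standard `re` module. `LINK_RE`, `find_articles`, and the sample titles are assumptions for illustration; the pattern expects the attribute order seen in the links quoted earlier (class, href, then title) and would need adjusting if the site's markup differs.

```python
# Python 3 sketch; LINK_RE and find_articles are hypothetical names.
import re
from urllib.parse import urljoin

# Matches <a class="c43092" href="..." ... title="..."> and captures href and title
LINK_RE = re.compile(
    r'<a\s+class="c43092"\s+href="([^"]+)"[^>]*\stitle="([^"]*)"'
)


def find_articles(html, page_url):
    """Return (absolute_url, title) pairs for every article link in html."""
    return [(urljoin(page_url, href), title)
            for href, title in LINK_RE.findall(html)]


sample = (
    '<a class="c43092" href="../info/1037/30577.htm" target="_blank" '
    'title="Sample A">Sample A</a>\n'
    '<a class="c43092" href="../info/1037/30574.htm" target="_blank" '
    'title="Sample B">Sample B</a>'
)
for url, title in find_articles(sample, "http://news.gsau.edu.cn/tzgg1/xxxw33.htm"):
    print(url, title)
```

Because `urljoin` resolves `../` and `../../` against the URL of the list page, the separate `[3:]` and `[6:]` slices for the first and later pages are no longer needed.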

I can keep improving it as I learn more, and crawl more interesting things.

I built this example just to see what Python could do for me as a beginner.
