Scraping HTML with urllib2 and Saving It to Excel

Fetch an HTML page with urllib2, keep only the lines that contain a given string, and write the text extracted from them to an Excel file:

# -*- coding: utf-8 -*-
import urllib2

from xlwt import Workbook


def getdata(keywords, line):
    """Return the text between the first '>' and the next '</' in line,
    or None if line does not contain keywords or has no such delimiters."""
    if keywords not in line:
        return None
    start = line.find('>')
    end = line.find('</', start + 1)
    if start == -1 or end == -1:
        return None
    return line[start + 1:end]


def FetchDataByUrllib(checkUrl):
    book = Workbook(encoding='utf-8')
    # add_sheet creates a new sheet. Cells cannot be overwritten by default,
    # so overwriting has to be enabled explicitly.
    sheet = book.add_sheet('mySheet', cell_overwrite_ok=True)
    try:
        checkFile = urllib2.urlopen(checkUrl)
    except Exception as e:
        print(e)
        return
    row = 1
    for line in checkFile:
        # Decode according to the page's encoding (use "GBK" for GBK pages).
        # Keeping the text as unicode lets it match the unicode keyword below.
        line = line.decode('UTF-8')
        # To dump every line instead of filtering, write each line directly:
        #sheet.write(row, 1, line)
        #row += 1
        # Extract the wanted data and write it to the Excel file.
        targetStr = getdata(u'体育', line)  # lines containing '体育' (sports)
        if targetStr is not None:
            sheet.write(row, 1, targetStr)  # row index, column index, value
            row += 1
    book.save('simple.xls')
    print('finish!')


print('Starting...')
myUrl = 'http://www.sina.com.cn'
FetchDataByUrllib(myUrl)

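Result: when the script finishes, it prints finish! and saves the extracted text to simple.xls in the current working directory.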
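
The script above targets Python 2, which is the only place urllib2 exists. As a rough sketch of the same filtering step on Python 3, the snippet below works on str lines and keeps the extraction separate from the download and the Excel output, so it can be tried without a network connection. The names extract_text and filter_lines and the sample lines are made up for illustration; they are not part of the original script.

```python
def extract_text(keyword, line):
    """Return the text between the first '>' and the next '</' in line.

    Returns None when the line does not contain keyword or when either
    delimiter is missing.
    """
    if keyword not in line:
        return None
    start = line.find('>')
    if start == -1:
        return None
    end = line.find('</', start + 1)
    if end == -1:
        return None
    return line[start + 1:end]


def filter_lines(keyword, lines):
    """Collect the extracted text of every line that mentions keyword."""
    results = []
    for line in lines:
        text = extract_text(keyword, line)
        if text:  # skip None and empty strings
            results.append(text)
    return results


# Hypothetical sample input standing in for the lines of a downloaded page.
sample_lines = [
    '<a href="/sports">体育</a>',
    '<a href="/news">新闻</a>',
    '<span>体育新闻</span>',
    '体育 with no tags',
]
matches = filter_lines('体育', sample_lines)
print(matches)
```

Note that this is a plain string heuristic, not an HTML parser: it only gives clean text when the first tag on the line is the one wrapping the text. For a line such as `<li><a href="/sports">体育</a></li>` the result would still include the opening `<a ...>` tag.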
