Python Web Crawler

 import os
 import re
 import urllib.request as lib
 from urllib.parse import urljoin


 def craw_links(url, depth, keyword, processed):
     '''Crawl a page, save it if it matches, then follow its links.

     url: the URL to crawl
     depth: how many more levels of links to follow
     keyword: the tuple of keywords to filter on
     processed: the list of URLs that have already been visited
     '''
     if not url.startswith(('http://', 'https://')):
         return
     if url in processed:
         # avoid processing the same URL again
         return
     # mark this URL as processed
     processed.append(url)
     print('Crawling ' + url + '...')
     try:
         with lib.urlopen(url) as fp:
             contents = fp.read()
     except OSError as err:
         # a dead link should not abort the whole crawl
         print('Failed to fetch ' + url + ': ' + str(err))
         return
     # Python 3 returns bytes, so the content must be decoded before searching
     contents_decoded = contents.decode('UTF-8', errors='replace')
     # keywords are treated as literal text, so escape them before joining
     pattern = '|'.join(re.escape(k) for k in keyword)
     # save the page if it contains any keyword, or if no keywords were given
     if not pattern or re.search(pattern, contents_decoded):
         filename = url.replace(':', '_').replace('/', '_')
         with open(os.path.join('craw', filename), 'wb') as out:
             out.write(contents)
     if depth <= 0:
         return
     # find all the links in the current page
     links = re.findall('href="(.*?)"', contents_decoded)
     # crawl every link found in the current page
     for link in links:
         # resolve relative paths against the current URL
         link = urljoin(url, link)
         if link.endswith(('.htm', '.html')):
             craw_links(link, depth - 1, keyword, processed)


 if __name__ == '__main__':
     processed = []
     keywords = ('KeyWord1', 'KeyWord2')
     # create the output directory if it does not exist yet
     os.makedirs('craw', exist_ok=True)
     craw_links('http://docs.python.org/3/library/index.html', 1, keywords, processed)
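The listing above finds links with the regular expression `href="(.*?)"`, which misses attributes written with single quotes and also picks up `href` values from tags such as `<link>`. As a rough sketch of one alternative, the standard library's `html.parser` can collect the `href` of every `<a>` tag instead. The names `LinkCollector` and `extract_links` are made up for this illustration and are not part of the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) URLs for every <a href> found in html."""
    parser = LinkCollector()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.hrefs)
    # drop mailto:, javascript: and other non-HTTP schemes
    return [u for u in absolute if u.startswith(('http://', 'https://'))]


if __name__ == '__main__':
    sample = ('<a href="intro.html">a</a> '
              "<a href='/3/index.html'>b</a> "
              '<a href="mailto:someone@example.com">c</a>')
    for link in extract_links(sample, 'http://docs.python.org/3/library/index.html'):
        print(link)
```

If this approach is preferred, the `re.findall` call in `craw_links` could be swapped for `extract_links(contents_decoded, url)`, and the `urljoin` step in the loop would then no longer be needed.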

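The script decodes every page as UTF-8, which garbles pages served in other encodings such as GBK. A minimal sketch of a more tolerant approach, assuming the server declares the charset in its `Content-Type` header, is shown below. The helper name `decode_body` is hypothetical. The object passed as `headers` is expected to behave like the `headers` attribute of the response returned by `urlopen`, which provides `get_content_charset()`.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode raw bytes using the charset declared in headers, if any."""
    charset = headers.get_content_charset() or default
    try:
        return raw.decode(charset, errors='replace')
    except LookupError:
        # the server declared a charset that Python does not know
        return raw.decode(default, errors='replace')


if __name__ == '__main__':
    # build a stand-in for the response headers so the example runs offline
    headers = Message()
    headers['Content-Type'] = 'text/html; charset=gbk'
    print(decode_body('网页爬虫'.encode('gbk'), headers))
```

Inside `craw_links` this would be called as `decode_body(contents, fp.headers)` while the response is still open. Pages that declare their encoding only in a `<meta>` tag are not covered by this sketch.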