Reading notes on OReilly.Web.Scraping.with.Python.2015.6 --- Crawl

1. The function calls itself, so the crawl forms a recursive loop in which each new page leads to the next:

from urllib.request import urlopen

from bs4 import BeautifulSoup

import re

pages = set()

def getLinks(pageUrl):

    global pages

    html = urlopen("http://en.wikipedia.org"+pageUrl)

    bsObj = BeautifulSoup(html,"lxml")

    try:

        print(bsObj.h1.get_text())

        print(bsObj.find(id="mw-content-text").findAll("p")[0])  # find the element with id=mw-content-text, then the <p> tags under it; [0] selects the first one

        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])  # take the href value of the <a> inside the <span> inside the element with id=ca-edit

    except AttributeError:

        print("This page is missing something! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):

        if 'href' in link.attrs:

            if link.attrs['href'] not in pages:

                #We have encountered a new page

                newPage = link.attrs['href']

                print(newPage)

                pages.add(newPage)

                getLinks(newPage)

getLinks("")
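One caveat with the recursive version above: Python's default recursion limit is about 1000 frames, so a crawl that keeps following new links will eventually raise RecursionError. A minimal sketch of the same "visit each page once" idea, written as a loop with an explicit queue, might look like the following. Note that it visits pages breadth-first rather than depth-first. The names crawl and fetch_links and the small in-memory link graph are hypothetical, chosen so the example runs without network access; in a real crawl fetch_links would download the page and extract its /wiki/ links the way getLinks does.

```python
from collections import deque

def crawl(start, fetch_links, max_pages=100):
    """Visit pages breadth-first, each at most once, without recursion."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in fetch_links(page):
            if link not in seen:
                # Mark the link as seen when queued so it is never queued twice.
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical link graph standing in for real pages.
fake_site = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}

visited = crawl("/wiki/A", lambda page: fake_site.get(page, []))
print(visited)
```

The max_pages argument is an assumed safety cap, so that an open-ended crawl stops after a fixed number of pages.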

2. Process a URL by splitting its characters on "/"

def splitAddress(address):

    addressParts = address.replace("http://", "").split("/")

    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")

print(addr)

Output:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')

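['https:', '', 'hao.360.cn', '?a1004']                   (the function strips only "http://", so "https:" remains; nothing lies between the two slashes of "//", hence the empty string '')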

def splitAddress(address):

    addressParts = address.replace("http://", "").split("/")

    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")

print(addr)

Output:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')

['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
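Because splitAddress removes only "http://", an https URL keeps its scheme as the first elements of the list, as the first result shows. One possible alternative, offered as a sketch rather than as the book's method, is the standard library's urllib.parse.urlparse, which separates the host, path, query, and fragment whatever the scheme is. The helper name split_address is hypothetical.

```python
from urllib.parse import urlparse

def split_address(address):
    """Return (host, path segments) for an http or https URL."""
    parts = urlparse(address)
    # Drop the empty segments produced by leading and trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    return parts.netloc, segments

host1, segs1 = split_address("https://hao.360.cn/?a1004")
host2, segs2 = split_address("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(host1, segs1)
print(host2, segs2)
```

Here the query string (?a1004) and the fragment (#pvareaid=100519) are parsed into their own fields instead of ending up as path elements.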

3. Collect a site's internal links

from urllib.request import urlopen

from bs4 import BeautifulSoup

import re

#Retrieves a list of all Internal links found on a page

def getInternalLinks(bsObj, includeUrl):

    internalLinks = []

    #Finds all links that begin with a "/"

    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):

        if link.attrs['href'] is not None:

            if link.attrs['href'] not in internalLinks:

                internalLinks.append(link.attrs['href'])

    return internalLinks

startingPage = "http://oreilly.com"

html = urlopen(startingPage)

bsObj = BeautifulSoup(html,"lxml")

def splitAddress(address):

    addressParts = address.replace("http://", "").split("/")

    return addressParts

internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])

print(internalLinks)

Output (all internal links found on this page):

runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')

['https://www.oreilly.com', 'http://www.oreilly.com/ideas', 
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav', 
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business', 
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering', 
'/topics/web-programming', 'https://www.oreilly.com/topics', 
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now', 
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in', 
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now', 
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial', 
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in', 
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course', 
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path', 
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid', 
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform', 
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends', 
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/', 
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html', 
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
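The output above also contains several safaribooksonline.com URLs. The pattern "^(/|.*"+includeUrl+")" accepts any href that contains the domain text anywhere, and those URLs carry utm_source=oreilly.com in their query string. A hedged sketch of a stricter check compares the parsed host name instead. The helper is_internal is hypothetical, and treating subdomains such as shop.oreilly.com as internal is an assumption.

```python
from urllib.parse import urljoin, urlparse

def is_internal(href, base_url):
    """True if href resolves to the base host or one of its subdomains."""
    base_host = urlparse(base_url).hostname
    # Resolve relative links such as "/topics/ai" against the base URL.
    host = urlparse(urljoin(base_url, href)).hostname
    if host is None or base_host is None:
        return False
    return host == base_host or host.endswith("." + base_host)

base = "http://oreilly.com"
samples = [
    "/topics/ai",
    "http://shop.oreilly.com/",
    "https://www.safaribooksonline.com/?utm_source=oreilly.com",
]
for href in samples:
    print(href, is_internal(href, base))
```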

4. Collect a site's external links

from urllib.request import urlopen

from bs4 import BeautifulSoup

import re

#Retrieves a list of all external links found on a page

def getExternalLinks(bsObj, excludeUrl):

    externalLinks = []

    #Finds all links that start with "http" or "www" that do

    #not contain the current URL

    for link in bsObj.findAll("a",

                              href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):

        if link.attrs['href'] is not None:

            if link.attrs['href'] not in externalLinks:

                externalLinks.append(link.attrs['href'])

    return externalLinks

startingPage = "http://oreilly.com"

html = urlopen(startingPage)

bsObj = BeautifulSoup(html,"lxml")

def splitAddress(address):

    addressParts = address.replace("http://", "").split("/")

    return addressParts

print(splitAddress(startingPage))

print(splitAddress(startingPage)[0])

externalLinks = getExternalLinks(bsObj,splitAddress(startingPage)[0])

print(externalLinks)

Output:

runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')

['oreilly.com']

oreilly.com

['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']
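The pattern passed to findAll in getExternalLinks relies on a negative lookahead: after an http or www prefix, every remaining character must sit at a position where the excluded domain does not begin. A small sketch of that behaviour on plain strings follows. Wrapping the domain in re.escape is an addition that is not in the original code, made so that the dot matches only a literal dot, and the name external_pattern is hypothetical.

```python
import re

def external_pattern(exclude_url):
    """Build a regex for http/www links that never contain exclude_url."""
    return re.compile("^(http|www)((?!" + re.escape(exclude_url) + ").)*$")

pattern = external_pattern("oreilly.com")
hrefs = [
    "http://twitter.com/oreillymedia",
    "http://www.oreilly.com/ideas",
    "/topics/ai",
]
# Keep only hrefs that start with http/www and never mention the domain.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)
```

Relative links such as /topics/ai are rejected by the prefix requirement, and any URL that mentions the excluded domain is rejected by the lookahead.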
