[Python] Wikipedia Crawler

import time
import urllib.parse

import bs4
import requests

start_url = "https://en.wikipedia.org/wiki/Special:Random"

target_url = "https://en.wikipedia.org/wiki/Philosophy"

def find_first_link(url):
    """Return the full URL of the first article link on the page, or None."""
    response = requests.get(url)
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")

    # This div contains the article's body
    # (June 2017 Note: Body nested in two div tags)
    content_text = soup.find(id="mw-content-text")
    content_div = content_text.find(class_="mw-parser-output") if content_text else None
    if content_div is None:
        # Page layout not recognized, so treat the page as having no links
        return None

    # Stores the first link found in the article; if the article contains no
    # links, this value will remain None
    article_link = None

    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        # Find the first anchor tag that's a direct child of a paragraph.
        # It's important to only look at direct children, because other types
        # of link, e.g. footnotes and pronunciation, could come before the
        # first link to an article. Those other link types aren't direct
        # children, though; they're in divs of various classes.
        anchor = element.find("a", recursive=False)
        if anchor:
            article_link = anchor.get("href")
            break

    if not article_link:
        return None

    # Build a full URL from the relative article_link URL
    first_link = urllib.parse.urljoin("https://en.wikipedia.org/", article_link)
    return first_link

def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True

article_chain = [start_url]

while continue_crawl(article_chain, target_url):
    print(article_chain[-1])

    first_link = find_first_link(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break

    article_chain.append(first_link)

    time.sleep(2)  # Slow things down so as to not hammer Wikipedia's servers
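
The stopping rules and the URL-joining step can be tried without touching the network. The following is a minimal sketch, not part of the original script: stop_reason is a hypothetical helper that mirrors the checks in continue_crawl but returns a label instead of printing, and the article names "A", "B" and the numbered pages are made up for illustration.

```python
from urllib.parse import urljoin

TARGET = "https://en.wikipedia.org/wiki/Philosophy"


def stop_reason(search_history, target_url, max_steps=25):
    """Return why the crawl should stop, or None if it should continue.

    Hypothetical helper: same checks, in the same order, as continue_crawl.
    """
    if search_history[-1] == target_url:
        return "target"
    if len(search_history) > max_steps:
        return "too_long"
    if search_history[-1] in search_history[:-1]:
        return "loop"
    return None


# A relative href such as "/wiki/Logic" is resolved against the site root.
print(urljoin("https://en.wikipedia.org/", "/wiki/Logic"))

# Hand-built histories; no requests are made.
base = "https://en.wikipedia.org/wiki/"
print(stop_reason([base + "A", base + "B"], TARGET))              # keep crawling
print(stop_reason([base + "A", TARGET], TARGET))                  # reached the target
print(stop_reason([base + "A", base + "B", base + "A"], TARGET))  # revisited a page
print(stop_reason([base + str(i) for i in range(26)], TARGET))    # more than max_steps entries
```

Returning a label rather than printing makes each rule easy to check in isolation; the original continue_crawl keeps the printed messages for interactive use.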
