Batch-Downloading Podcasts from Plus with Python

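Plus is an online magazine about the beauty of mathematics and its practical applications. It covers mathematical ideas, anecdotes, history, and much more. The magazine also has a Podcast section that offers many interviews and lectures as mp3 audio. I wrote a Python script to download all of the podcast files so that I can listen to them on my commute, when reading is impractical.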

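The script imports four modules:

re is used for regular-expression matching when converting a podcast title into an audio file name. File names on Linux should avoid characters such as / > < | : & (the slash is the path separator, and the others are awkward to handle in a shell or on other file systems), so any such characters in a podcast title have to be replaced.
shutil is used to save the audio data stream downloaded from the network to a local file.
requests is used to send GET requests to the web server and fetch the HTML pages.
html.parser is used to parse the fetched HTML pages and extract the parts of interest, including the links to the podcast subpages and the URLs of the mp3 files. To use it, derive a subclass from the module's HTMLParser class and override its methods handle_starttag and handle_data; this makes it possible to extract and process HTML tags, their attributes, and the text they contain.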
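
To make the subclassing pattern concrete, here is a minimal sketch that is separate from the script itself. The LinkCollector class and the collect_links helper are hypothetical names introduced only for illustration. They gather the href and text of every <a> tag and do not apply the div/span class filtering that the real pages require.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []            # finished (href, text) pairs
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []      # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == 'a':
            self._current_href = dict(attrs).get('href')
            self._text_parts = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current_href is not None:
            text = ''.join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def collect_links(html):
    """Parse an HTML string and return the collected (href, text) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, collect_links('<a href="/content/x">Title</a>') returns [('/content/x', 'Title')]. Overriding handle_endtag as well is a design choice of this sketch: it lets the link text be assembled even when the parser delivers it in more than one piece.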

The complete code is as follows:

import re
import shutil
import requests
from html.parser import HTMLParser

# Replace or remove characters that are unsuitable for a file name.
def StringToFileName(s):
    s = s.replace('#', 'No.')
    s = re.sub('[?]', '', s)
    s = re.sub('[&+$*!@%^|<>/]', '_', s)
    # Raw string, so that \s is not treated as an invalid string escape.
    s = re.sub(r':\s*', '-', s)
    return s

# Parser for a podcast index page; it locates the links to the subpages.
class PlusPodcastPageParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.root_url = 'https://plus.maths.org'
        self.subpage_url_list = list()
        self.subpage_title_list = list()
        self.is_subpage_div_found = False
        self.is_subpage_span_found = False
        self.is_subpage_link_found = False

    def handle_starttag(self, tag, attrs):
        if (not self.is_subpage_link_found and not self.is_subpage_div_found and tag == 'div'):
            if (len(attrs) > 0):
                for attr in attrs:
                    if (attr[0] == 'class' and attr[1] == 'views-field views-field-title'):
                        self.is_subpage_div_found = True
                        break
        elif (not self.is_subpage_link_found and self.is_subpage_div_found and tag == 'span'):
            if (len(attrs) > 0):
                for attr in attrs:
                    if (attr[0] == 'class' and attr[1] == 'field-content'):
                        self.is_subpage_span_found = True
                        break
        elif (not self.is_subpage_link_found and self.is_subpage_span_found and tag == 'a'):
            if (len(attrs) > 0):
                for attr in attrs:
                    if (attr[0] == 'href'):
                        self.is_subpage_link_found = True
                        self.subpage_url_list.append(self.root_url + attr[1])
                        break

    def handle_data(self, data):
        if (self.is_subpage_link_found):
            # The text inside the link is the podcast title.
            podcast_file_name = StringToFileName(data)
            self.subpage_title_list.append(podcast_file_name)
            # Reset the pattern-searching flags and prepare for the next search.
            self.is_subpage_div_found = False
            self.is_subpage_span_found = False
            self.is_subpage_link_found = False

# Parser for a podcast subpage, which contains the link to the mp3 file.
class PlusPodcastSubpageParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.root_url = 'https://plus.maths.org'
        self.podcast_url = ''
        self.is_subpage_div_found = False
        self.is_subpage_span_found = False
        self.is_podcast_link_found = False

    def handle_starttag(self, tag, attrs):
        if (not self.is_podcast_link_found and not self.is_subpage_div_found and tag == 'div'):
            if (len(attrs) > 0):
                for attr in attrs:
                    if (attr[0] == 'class' and attr[1] == 'field-item even'):
                        self.is_subpage_div_found = True
                        break
        elif (not self.is_podcast_link_found and self.is_subpage_div_found and tag == 'span'):
            if (len(attrs) > 0):
                for attr in attrs:
                    if (attr[0] == 'class' and attr[1] == 'file'):
                        self.is_subpage_span_found = True
                        break
        elif (not self.is_podcast_link_found and self.is_subpage_span_found and tag == 'a'):
            if (len(attrs) > 0):
                for attr in attrs:
                    if (attr[0] == 'href'):
                        self.is_podcast_link_found = True
                        self.podcast_url = attr[1]
                        break

number_of_podcast_pages = 16
mp3_file_counter = 1

for page_idx in range(0, number_of_podcast_pages):
    if (page_idx > 0):
        page_url = 'https://plus.maths.org/content/podcast?page=' + str(page_idx)
    else:
        page_url = 'https://plus.maths.org/content/podcast'

    response = requests.get(page_url)
    if (response.status_code == 200):
        current_page = response.text
        del response
        page_parser = PlusPodcastPageParser()
        page_parser.feed(current_page)

        if (len(page_parser.subpage_url_list) == len(page_parser.subpage_title_list)):
            # Iterate over each subpage url that was found.
            for subpage_idx in range(len(page_parser.subpage_url_list)):
                subpage_url = page_parser.subpage_url_list[subpage_idx]
                response = requests.get(subpage_url)
                if (response.status_code == 200):
                    current_subpage = response.text
                    del response
                    subpage_parser = PlusPodcastSubpageParser()
                    subpage_parser.feed(current_subpage)

                    if (len(subpage_parser.podcast_url) > 0):
                        print('*** Downloading ' + subpage_parser.podcast_url + ' ...')
                        response = requests.get(subpage_parser.podcast_url, stream = True)
                        # Only save the body if the mp3 request itself succeeded.
                        if (response.status_code == 200):
                            with open(str(mp3_file_counter) + '-' + page_parser.subpage_title_list[subpage_idx] + '.mp3', 'wb') as mp3_file:
                                shutil.copyfileobj(response.raw, mp3_file)
                            mp3_file_counter = mp3_file_counter + 1
                        else:
                            print('Cannot download the podcast file: ' + subpage_parser.podcast_url)
                        del response
                else:
                    print('Cannot get the podcast subpage: ' + subpage_url)
        else:
            print('The numbers of subpage urls and titles should be the same!')
    else:
        print('Cannot get the podcast page: ' + page_url)

Run the program, wait a moment, and you will have all of the podcast files.
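
As a possible variation on the download step, the sketch below streams the response body in chunks through the iter_content method of requests and raises an error on a non-2xx status. The helper names save_chunks and download, the 64 KiB chunk size, and the 30-second timeout are assumptions chosen for illustration and are not part of the script above.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    written = 0
    with open(path, 'wb') as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


def download(url, path, chunk_size=64 * 1024, timeout=30):
    """Stream url to path; raises requests.HTTPError on a non-2xx status."""
    # requests is third-party; it is imported here so that save_chunks
    # can be used and tested without it.
    import requests

    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return save_chunks(response.iter_content(chunk_size=chunk_size), path)
```

Keeping the file-writing part separate from the network call means it can be exercised with any iterable of bytes, for instance save_chunks([b'ab', b'cde'], 'demo.bin'), which writes five bytes.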
