MusiCode: Batch-Download All Albums by a Given Artist (CAPTCHA Restriction Lifted)

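I had long wanted to sort and download every album by the artists I like. There were far too many albums to handle by hand, and since I had recently started learning Python, I figured: why not write a Python script to automate the download process? So I spent some time putting this together, and I am sharing it for anyone who needs it. :)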

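When I started writing this, I did not anticipate that crawling too frequently, or for too long, would trigger a CAPTCHA. I tried several ways around the CAPTCHA and none of them worked well, so I added a separate step that generates a download list. At first this step stored the final download URLs, but it turned out that those URLs expire. In the end I had no choice but to store the URL of the download page instead; when the download command runs, the script goes back to that page to obtain the final download URL.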
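
The download-list idea can be sketched as a pair of helpers: one writes a tab-separated entry holding the song page URL, the title, and the target folder, and the other reads it back at download time, which is when the short-lived final URL gets resolved. The helper names `format_entry` and `parse_entry` and the sample song URL are assumptions made for this illustration; the script itself does the same thing inline.

```python
# Sketch of one download-list entry: song page URL, title, target folder,
# separated by tabs. Only the page URL is stored because the final URL expires.

def format_entry(music_page, title, path):
    """Serialize one song as a single tab-separated line."""
    return '%s\t%s\t%s\n' % (music_page, title, path)

def parse_entry(line):
    """Split a line written by format_entry back into its three fields."""
    music_page, title, path = line.rstrip('\r\n').split('\t')
    return music_page, title, path

# Hypothetical song page URL, used only for illustration.
entry = format_entry('http://music.baidu.com/song/123', 'Heroes',
                     'G:\\crawl\\david bowie\\Heroes\\')
fields = parse_entry(entry)
```

One limitation of this format is that a title containing a tab character would break the split, so a more defensive version might strip tabs from titles before writing.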

The script uses two open-source modules, gevent and BeautifulSoup.

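Update -------------------------------------------------------------------------------------------

The CAPTCHA restriction has been lifted: if a CAPTCHA page comes back, the script extracts the required cookie from that page and re-sends the request.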
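
A minimal sketch of that retry flow follows. The form field names `vcode`, `id`, and `di` are the ones the script looks for; the sample HTML, the injected `fetch` callable, and the `max_retries` cap are assumptions added here so that the example runs without network access and cannot loop forever.

```python
import re

def extract_verify(html):
    """Return 'vcode:id:di' if html looks like the CAPTCHA page, else ''."""
    fields = []
    for name in ('vcode', 'id', 'di'):
        m = re.search(r'name="%s" value="(.*?)"' % name, html, re.I)
        if not m:
            return ''
        fields.append(m.group(1))
    return ':'.join(fields)

def fetch_with_verify(fetch, url, max_retries=3):
    """Call fetch(url, cookie); retry with the extracted cookie on a CAPTCHA page.

    fetch is a hypothetical callable taking (url, cookie_value) and returning HTML.
    """
    cookie = ''
    for _ in range(max_retries + 1):
        html = fetch(url, cookie)
        verify = extract_verify(html)
        if verify == '':
            return html      # normal page: done
        cookie = verify      # CAPTCHA page: retry with the value it carries
    raise RuntimeError('still blocked after %d retries' % max_retries)

# Fake server for illustration: serves the CAPTCHA page until the cookie is sent.
CAPTCHA_PAGE = ('<input name="vcode" value="abc"><input name="id" value="42">'
                '<input name="di" value="xyz">')

def fake_fetch(url, cookie):
    return '<html>albums</html>' if cookie == 'abc:42:xyz' else CAPTCHA_PAGE

result = fetch_with_verify(fake_fetch, 'http://music.baidu.com/artist/2825')
```

In the script the extracted value is sent as the `BAIDUVERIFY` cookie and the retry is a plain recursive call; the retry cap above is only a suggestion for avoiding endless retries if the site keeps returning the CAPTCHA page.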

    #coding=utf-8
    # Python 2 script; requires gevent and BeautifulSoup 3.

    import urllib,urllib2,re,os,json,gevent,traceback
    from BeautifulSoup import BeautifulSoup
    from gevent import monkey

    monkey.patch_all()

    rootUrl='http://music.baidu.com'
    artistId=2825 # Replace with the Baidu Music id of the artist whose albums you want, e.g. http://music.baidu.com/artist/2825
    pagesize=10
    savePath='G:\\crawl\\david bowie\\' # Change this to the folder you want to save into
    listDir='_____downlist\\'
    handleCount=0
    BAIDUVERIFY=''

    def crawlList():
        # Build the download list, one greenlet per page of the artist's album listing.
        artistUrl=rootUrl+'/artist/'+str(artistId)
        homeHtml=request(artistUrl)
        soup=BeautifulSoup(homeHtml)
        try:
            pagecount=len(soup.findAll("div",{"class":"page-inner"})[1].findAll(text=re.compile(r'\d+')))
        except Exception:
            traceback.print_exc()
            print homeHtml
            return
        jobs=[]
        listPath=savePath+listDir
        if not os.path.exists(listPath):
            os.makedirs(listPath) # also creates savePath if it does not exist yet
        for i in range(pagecount):
            jobs.append(gevent.spawn(crawlPage,i))
        gevent.joinall(jobs)

    def request(url):
        # Fetch a page; if the CAPTCHA page comes back, retry with the cookie extracted from it.
        global BAIDUVERIFY
        req=urllib2.Request(url)
        if BAIDUVERIFY!='':
            req.add_header('Cookie','BAIDUVERIFY='+BAIDUVERIFY+';')
        resp=urllib2.urlopen(req)
        html= resp.read()
        verify=getBaiduVerify(html)
        if verify!='':
            print u'Extracted the verification cookie; re-sending the request'
            BAIDUVERIFY=verify
            return request(url)
        return html

    def getBaiduVerify(html):
        # Return 'vcode:id:di' taken from the CAPTCHA form, or '' for a normal page.
        vcode=re.search(r'name=\"vcode\" value=\"(.*?)\"' , html, re.I)
        id=re.search(r'name=\"id\" value=\"(.*?)\"' , html, re.I)
        di=re.search(r'name=\"di\" value=\"(.*?)\"' , html, re.I)
        if vcode and id and di:
            return vcode.group(1)+':'+id.group(1)+':'+di.group(1)
        return ''

    def crawlPage(page):
        # Fetch one page of the album listing and write a list file for each album on it.
        start=page*pagesize
        albumListUrl='http://music.baidu.com/data/user/getalbums?start=%d&ting_uid=%d&order=time' % (start,artistId)
        print albumListUrl
        albumListHtml=json.loads(request(albumListUrl))["data"]["html"]
        albumListSoup=BeautifulSoup(albumListHtml)
        covers=albumListSoup.findAll('a',{'class':'cover'})
        pagePath=savePath+listDir+str(page)+'\\'
        if not os.path.exists(pagePath):
            os.mkdir(pagePath)
        for cover in covers:
            try:
                crawlAlbum(pagePath,rootUrl+cover['href'],cover['title'])
            except Exception:
                traceback.print_exc()

    def crawlAlbum(pagePath,albumUrl,title):
        # Write one line per song: song page URL, song title, target folder.
        print albumUrl,title
        albumHtml=request(albumUrl)
        albumSoup=BeautifulSoup(albumHtml)
        musicWraps=albumSoup.findAll('span',{'class':'song-title '})
        title=re.subn(r'\\|\/|:|\*|\?|\"|\<|\>|\|','',title)[0] # strip characters not allowed in Windows file names
        path=savePath+title+'\\'
        albumListPath=pagePath+title+'.txt'
        albumFile=open(albumListPath,'w')
        for wrap in musicWraps:
            link=wrap.find('a')
            try:
                musicPage=rootUrl+link['href']
                albumFile.write('%s\t%s\t%s\n' % (musicPage,link['title'],path)) # the real download URL expires, so store the page URL here
            except Exception:
                traceback.print_exc()
        albumFile.close()

    def crawlDownloadUrl(musicPage):
        # Resolve the final (short-lived) download URL from the song's download page.
        downPage=musicPage+'/download'
        downHtml=request(downPage)
        downUrl=re.search('http://[^ ]*xcode.[a-z0-9]*' , downHtml, re.M).group()
        return downUrl

    def downList():
        # Download everything in the list, one greenlet per page folder.
        listPath=savePath+listDir
        jobs=[]
        for pageDir in os.listdir(listPath):
            jobs.append(gevent.spawn(downPage,listPath+pageDir))
        gevent.joinall(jobs)

    def downPage(pagePath):
        for filename in os.listdir(pagePath):
            filePath=pagePath+'\\'+filename
            albumFile=open(filePath,'r')
            try:
                for args in albumFile.readlines():
                    arrArgs=args.split('\t')
                    downMusic(arrArgs[0],arrArgs[1],arrArgs[2].replace('\n',''))
            except Exception:
                traceback.print_exc()
            finally:
                albumFile.close()

    def downMusic(musicPage,title,path):
        global handleCount
        if not os.path.exists(path):
            os.mkdir(path)
        handleCount+=1
        print handleCount,musicPage,title,path
        filename=path+re.subn(r'\\|\/|:|\*|\?|\"|\<|\>|\|','',title)[0]+'.mp3'
        if os.path.isfile(filename):
            return # already downloaded
        try:
            downUrl=crawlDownloadUrl(musicPage)
            urllib.urlretrieve(downUrl,filename)
        except Exception:
            traceback.print_exc()
            if os.path.isfile(filename):
                os.remove(filename) # drop the partial file so a later run retries it

    if __name__=='__main__':
        print u'Commands:\n\tlist\tgenerate the download list\n\tdown\tstart downloading\n\texit\tquit'
        cmd=raw_input('>>>')
        while cmd!='exit':
            if cmd=='list':
                crawlList()
                print u'Download list generated'
            elif cmd=='down':
                downList()
                print u'Download finished'
            else:
                print 'unknown cmd'
            cmd=raw_input('>>>')
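
The script strips characters that Windows does not allow in file names before it uses an album or song title as part of a path. The same rule can be written as a small helper; `sanitize_name` is a name chosen for this illustration, and the character class is meant as an equivalent, more compact form of the alternation used in the script.

```python
import re

# Characters Windows rejects in file names: \ / : * ? " < > |
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')

def sanitize_name(title):
    """Remove characters that are invalid in Windows file names."""
    return INVALID_CHARS.sub('', title)

cleaned = sanitize_name('AC/DC: Live?')
```

Compiling the pattern once and sharing it between the album-title and song-title code paths would keep the two places from drifting apart.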
