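Building an automated YouTube downloader in Python: the code
In accordance with the savefrom terms:
This example and tutorial are for learning and discussion only; all rights belong to savefrom.net.
The final code plus comments comes to roughly 100 lines. The code on GitHub is authoritative (bugs may get fixed there); this article only explains it in detail.
Project address
Approach
Walkthrough
1. POST
Following the first step of the approach, we first need a POST request to fetch the encrypted JavaScript. I use the third-party requests library for this; for background on web scraping, see my earlier articles.
i. First, set up the headers for the POST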
# Set the headers, or the website will not return any information.
# You may need to change the cookie value.
headers = {
    "cache-Control": "no-cache",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
              "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
    "content-type": "application/x-www-form-urlencoded",
    "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
              "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
              "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
              "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
              "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
              "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
    "origin": "https://en.savefrom.net",
    "pragma": "no-cache",
    "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
    "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
    "sec-ch-ua-mobile": "?0",
    "sec-fetch-dest": "iframe",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/87.0.4280.88 Safari/537.36"}
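The cookie may need changing; ideally copy the values from your own browser. What each header means is beyond the scope of this article; a search engine will tell you.
ii. Then set up the parameters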
# Set the form parameters; they can be copied from Chrome
kv = {"sf_url": url,
      "sf_submit": "",
      "new": "1",
      "lang": "en",
      "app": "",
      "country": "cn",
      "os": "Windows",
      "browser": "Chrome"}
The sf_url field is the URL of the YouTube video we want to download; the other parameters stay the same.
iii. Finally, send the POST request with requests
# Do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                  data=kv)
r.raise_for_status()
Note that it is data=kv, so the parameters are sent as a form-encoded request body.
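To see what data=kv puts on the wire, here is a small sketch that builds the same kind of form-encoded body with the standard library's urllib.parse.urlencode. It assumes requests encodes a plain dict passed as data in this way (as application/x-www-form-urlencoded); form_body is a name made up for this illustration and is not part of the project code.

```python
from urllib.parse import urlencode

# Same form fields as kv above, filled in for one sample video URL.
kv = {"sf_url": "https://www.youtube.com/watch?v=YPvtz1lHRiw",
      "sf_submit": "",
      "new": "1",
      "lang": "en",
      "app": "",
      "country": "cn",
      "os": "Windows",
      "browser": "Chrome"}

# A dict sent as a form becomes key=value pairs joined by "&",
# with reserved characters in the values percent-encoded.
form_body = urlencode(kv)
print(form_body)
```

The video URL ends up percent-encoded inside the body rather than appended to the savefrom.php address, which is what passing the dict as params would do instead.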
iv. Wrap it in a function
import requests


def gethtml(url):
    # Set the headers, or the website will not return any information.
    # You may need to change the cookie value.
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # Set the form parameters; they can be copied from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # Do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # Return the response body
    return r.text
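2. Call the decryption function
i. Analysis
The hard part is running JavaScript from Python. Solutions found online include PyV8 and others; this article uses execjs. In the approach article we saw that the last few lines of the JavaScript are the decryption function, so we only need to run the whole script once in execjs and then run the decryption function on its own.
ii. First extract the JavaScript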
# Target URL (the YouTube address)
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# Get the response text
reo = gethtml(url)
# Strip everything before and after the JavaScript part (the encrypted information is stored there)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
A regular expression would also work here, but since I am not yet fluent with regular expressions I simply used split.
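For comparison, a regular-expression version might look like the sketch below. It runs on a made-up sample page rather than the real savefrom response; sample_html and extract_script are hypothetical names used only for this illustration.

```python
import re

# Hypothetical stand-in for the HTML returned by gethtml(url).
sample_html = (
    "<html><body>\n"
    '<script type="text/javascript">\n'
    "var a = 1;\n"
    "var b = a + 1;\n"
    "</script>\n"
    "</body></html>"
)

# Non-greedy match of everything between the opening tag and the first
# closing </script>; re.S lets "." match newlines as well.
pattern = re.compile(r'<script type="text/javascript">(.*?)</script>', re.S)


def extract_script(html):
    """Return the body of the first matching <script> block, or None."""
    match = pattern.search(html)
    return match.group(1) if match else None


script = extract_script(sample_html)
print(script)
```

Unlike the split version, which raises IndexError when the tag is missing, this returns None, so the caller can report a clearer error.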
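iii. Take the first decryption function as the one we use
If you fetch the results for several different videos, you will find that the decryption function is different every time, but it always sits on the same line.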
# Split into lines (this helps us find the decryption function in the last few lines)
reA = reo.split("\n")
# Get the decryption function (third line from the end, first statement only)
name = reA[-3].split(";")[0] + ";"
So name is our decryption function (not the best variable name, haha).
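As a rough illustration of what these two lines do, the sketch below applies them to an invented five-line script. The real script is obfuscated and its layout may differ; sample_js, decode_1 and call are hypothetical names, and the last step only previews the expression that is later handed to eval.

```python
# Hypothetical stand-in for the extracted script; the real one is obfuscated
# and its function names change on every request.
sample_js = "\n".join([
    "(function(){",
    "function decode_1(s){return s;}",
    'var result=decode_1("payload");window.done=1;',
    "})();",
    "",
])

lines = sample_js.split("\n")
# Third line from the end; keep only the first statement on it.
name = lines[-3].split(";")[0] + ";"
# Drop the assignment so that only the call expression is left.
call = name.split("=")[1].replace(";", "")
print(name)
print(call)
```

Because the position is hard-coded, this breaks silently if the site changes the script layout, so it is worth checking that name looks like an assignment before using it.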
iv. Run it with execjs
# Use execjs to compile the JavaScript code
ct = execjs.compile(reo)
# Do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
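Taking only the part after = and dropping the semicolon means we just execute the function call without the assignment. Of course, running the assignment plus decryption first and then reading the value would also work.
But this fails immediately (if only it were that simple).
1. this, i.e. the window variable, does not exist
If I remember correctly, the error is about this or $b. I tried removing every this, and also wrapping everything in a class (so that this becomes that class), but neither worked. I then found that jsdom, an npm package, can simulate the window variable inside execjs (there is probably a better way). So we need to install npm and the jsdom package, then rewrite the code above: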
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# Use execjs to compile the JavaScript code; cwd is the output of `npm root -g` (the npm modules path on your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')
Where:
- cwd is the output of npm root -g, i.e. the path of npm's modules
- addition is used to simulate window
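Rather than hard-coding the path, the output of npm root -g could be read at run time. The sketch below is one way to do that under a few assumptions: npm is on PATH, and Python is 3.7 or later (capture_output and text are not available in the 3.6.2 listed in this article's environment). npm_global_root is a hypothetical helper, not part of the project code.

```python
import shutil
import subprocess


def npm_global_root():
    """Return the output of `npm root -g`, or None if npm is unavailable."""
    npm = shutil.which("npm")
    if npm is None:
        return None
    try:
        done = subprocess.run([npm, "root", "-g"], capture_output=True,
                              text=True, check=True, timeout=60)
    except (OSError, subprocess.SubprocessError):
        # npm could not be started, exited with an error, or timed out.
        return None
    return done.stdout.strip() or None


root = npm_global_root()
print(root)
```

The returned string, when not None, can be passed as the cwd argument of execjs.compile in place of the literal Windows path.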
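But then we hit the next error.
2. alert does not exist
This error occurs because calling alert under execjs is meaningless: there is no browser to show the dialog. In addition, alert is normally defined on window, and we have supplied our own window. So we override alert at the start of the code (in effect, defining an alert ourselves).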
# Override the alert function: the script calls it in one place, and alerting is
# meaningless under execjs. Without the override, the code raises an error.
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
v. Putting the code together
# Target URL (the YouTube address)
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# Get the response text
reo = gethtml(url)
# Strip everything before and after the JavaScript part (the encrypted information is stored there)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
# Override the alert function: the script calls it in one place, and alerting is
# meaningless under execjs. Without the override, the code raises an error.
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# Split into lines (this helps us find the decryption function in the last few lines)
reA = reo.split("\n")
# Get the decryption function (third line from the end, first statement only)
name = reA[-3].split(";")[0] + ";"
# Prepend the jsdom setup because the script needs window (there may be a solution without jsdom, but I do not know one)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# Use execjs to compile the JavaScript code; cwd is the output of `npm root -g` (the npm modules path on your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# Do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
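3. Analyze the decrypted result
i. Extract the key JSON
After running the part above, the decrypted result is stored in text. As the approach article showed, what really matters to us is the JSON passed to window.parent.sf.videoResult.show(), so we use a regular expression to extract that JSON.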
# Get the JSON argument of show(...); group(1) is the captured text between "show(" and ");;"
result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
ii. Parse the JSON
Python has many libraries that can parse JSON; here I use the json library (remember to import it).
# use `json` to load json
j = json.loads(result)
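The extraction and parsing steps can be tried offline on a made-up string. In the sketch below only the show(...);; shape comes from this article; sample_text and its field values are invented for illustration, and the None check is an addition so that a changed page layout fails with a readable message.

```python
import json
import re

# Hypothetical stand-in for the decrypted text; only the show(...);; shape
# is taken from the article, the field values are invented.
sample_text = (
    'try{window.parent.sf.videoResult.show('
    '{"url": [{"url": "https://example.com/a", "quality": "720"},'
    ' {"url": "https://example.com/b", "quality": "360"}]}'
    ');;}catch(e){}'
)

match = re.search(r"show\((.*?)\);;", sample_text, re.I | re.M)
if match is None:
    raise ValueError("no show(...) call found in the decrypted text")
# group(1) is just the captured argument, without "show(" and ");;".
data = json.loads(match.group(1))
print(len(data["url"]))
```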
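iii. Get the download URL
Now for the last step. From the approach article and a JSON formatting tool, we can see that j["url"][num]["url"] is the download link, where num selects the video format we want (different resolutions and types).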
# Select the video format. In this case, num = 1 means the video is:
# - 360p, known from j["url"][num]["quality"]
# - MP4, known from j["url"][num]["type"]
# - audio, known from j["url"][num]["audio"]
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -
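A fixed index such as num = 1 depends on the order of the list, which may vary between videos. One alternative is to search the list by its fields, as in the sketch below. The keys quality and type come from the comments above, but the value formats ("360", "mp4") and the sample entries are assumptions, and pick_format is a hypothetical helper.

```python
def pick_format(formats, quality, file_type):
    """Return the first entry matching quality and type, or None."""
    for entry in formats:
        same_quality = str(entry.get("quality")) == quality
        same_type = str(entry.get("type", "")).lower() == file_type.lower()
        if same_quality and same_type:
            return entry
    return None


# Invented sample in the shape described above (a list under the "url" key).
sample = {"url": [
    {"url": "https://example.com/hd.mp4", "quality": "720", "type": "mp4", "audio": False},
    {"url": "https://example.com/sd.mp4", "quality": "360", "type": "mp4", "audio": True},
]}

chosen = pick_format(sample["url"], "360", "mp4")
downurl = chosen["url"] if chosen else None
print(downurl)
```

With the real data, the call would be pick_format(j["url"], ...) after checking by hand which values the quality and type fields actually contain.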
4. Complete code
# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
# - windows 10
# - python 3.6.2
# @Dependencies:
# - jsdom from npm (also works on Windows)
# - requests, execjs, re, json in python
import requests
import execjs
import re
import json


def gethtml(url):
    # Set the headers, or the website will not return any information.
    # You may need to change the cookie value.
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # Set the form parameters; they can be copied from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # Do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # Return the response body
    return r.text


if __name__ == '__main__':
    # Target URL (the YouTube address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # Get the response text
    reo = gethtml(url)
    # Strip everything before and after the JavaScript part (the encrypted information is stored there)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # Override the alert function: the script calls it in one place, and alerting is
    # meaningless under execjs. Without the override, the code raises an error.
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # Split into lines (this helps us find the decryption function in the last few lines)
    reA = reo.split("\n")
    # Get the decryption function (third line from the end, first statement only)
    name = reA[-3].split(";")[0] + ";"
    # Prepend the jsdom setup because the script needs window (there may be a solution without jsdom, but I do not know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # Use execjs to compile the JavaScript code; cwd is the output of `npm root -g` (the npm modules path on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # Do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
    # Get the JSON argument of show(...); group(1) is the captured text between "show(" and ");;"
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
    # Use `json` to load the JSON
    j = json.loads(result)
    # Select the video format. In this case, num = 1 means the video is:
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"]
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
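- 102 lines in total
- Development environment
# @Environment:
# - windows 10
# - python 3.6.2
- Dependencies
# @Dependencies:
# - jsdom from npm (also works on Windows)
# - requests, execjs, re, json in python
-end-
For web scraping
Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Author: https://www.cnblogs.com/Eritque-arcus/ or https://blog.csdn.net/qq_40832960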