I'll use this site as the worked example: https://91mjw.com/. The approach is much the same for other sites.

Step 1: collect the links to every show on the site

html = requests.get("https://91mjw.com",headers=gHeads).text
xmlcontent = etree.HTML(html)
UrlList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/a/@href")
NameList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/h2/a/text()")
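The two XPath expressions can be sanity-checked offline against a static fragment shaped like the listing page (the markup below is an illustrative stand-in, not the site's real HTML):

```python
from lxml import etree

# Stand-in for the listing page's markup; the real page has many <article> entries.
html = """
<div class="m-movies clearfix">
  <article><a href="/movie/1.html"></a><h2><a href="/movie/1.html">Show One</a></h2></article>
  <article><a href="/movie/2.html"></a><h2><a href="/movie/2.html">Show Two</a></h2></article>
</div>
"""

doc = etree.HTML(html)
# Same expressions as above: direct <a> child gives the link, <h2><a> gives the title.
urls = doc.xpath("//div[@class='m-movies clearfix']/article/a/@href")
names = doc.xpath("//div[@class='m-movies clearfix']/article/h2/a/text()")
print(urls)   # ['/movie/1.html', '/movie/2.html']
print(names)  # ['Show One', 'Show Two']
```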

  

Step 2: open each show's page and collect the links to its episodes

UrlList = xmlContent.xpath("//div[@id='video_list_li']/a/@href")

Step 3: build the parameters needed to request the video
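Step 3 boils down to regex-scraping two quoted values (the player's type and vkey) out of an inline script tag, then POSTing them to the player API. A minimal sketch of the extraction, using a made-up script body (the real page's script differs):

```python
import re

# Made-up stand-in for the inline <script> on an episode page; the real page
# embeds two quoted values that the player uses as `type` and `vkey`.
script_text = 'var a = "m1905"; var b = "abcdef0123456789";'

# Same pattern the spider uses: grab everything between double quotes.
values = re.findall('"(.*?)"', script_text)
video_type, vkey = values[0], values[1]
print(video_type, vkey)  # m1905 abcdef0123456789
```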

Step 4: download the video and save it locally
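Step 4 is a streamed download: read the response body in fixed-size chunks and append each chunk to a file on disk. A self-contained sketch of the write loop, fed by an in-memory stand-in for response.iter_content():

```python
import os
import tempfile

def save_stream(chunks, path):
    """Write an iterable of byte chunks to path; return the total bytes written."""
    written = 0
    with open(path, "wb") as f:
        for data in chunks:
            f.write(data)
            written += len(data)
    return written

# Simulate a response body; in the spider the chunks come from
# response.iter_content(chunk_size=10240) on a streamed request.
body = b"x" * 25000
chunks = (body[i:i + 10240] for i in range(0, len(body), 10240))
path = os.path.join(tempfile.mkdtemp(), "01.mp4")
print(save_stream(chunks, path))  # 25000
```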

Now the implementation. It uses multiple threads gated by a semaphore: by default five threads run at once, and each thread downloads one complete show (a whole show per thread, not a single episode). You can change how many threads run at the same time by editing nMaxThread.

Implementation:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import re
import requests
from threading import Thread, BoundedSemaphore
from bs4 import BeautifulSoup
from lxml import etree
from contextlib import closing
 
nMaxThread = 5
connectlock = BoundedSemaphore(nMaxThread)
gHeads = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
 
class MovieThread(Thread):
    def __init__(self,url,movieName):
        Thread.__init__(self)
        self.url = url
        self.movieName = movieName
 
    def run(self):
        try:
            urlList = self.GetMovieUrl(self.url)
            for i in range(len(urlList)):
                vType,vkey = self.GetVkeyParam(self.url,urlList[i])
                if vType is not None and vkey is not None:
                    payload,DownloadUrl = self.GetOtherParam(self.url,urlList[i],vType,vkey)
                    if DownloadUrl :
                        videoUrl = self.GetDownloadUrl(payload,DownloadUrl)
                        if videoUrl :
                            self.DownloadVideo(videoUrl,self.movieName,i+1)
        finally:
            connectlock.release()
 
    def GetMovieUrl(self,url):
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host":"91mjw.com",
            "Referer":"https://91mjw.com/"
        }
        html = requests.get(url,headers=heads).text
        xmlContent = etree.HTML(html)
        UrlList = xmlContent.xpath("//div[@id='video_list_li']/a/@href")
        if  len(UrlList) > 0:
            return UrlList
        else:
            return None
 
    def GetVkeyParam(self,firstUrl,secUrl):
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host": "91mjw.com",
            "Referer": firstUrl
        }
        try :
            html = requests.get(firstUrl+secUrl,headers=heads).text
            bs = BeautifulSoup(html,"html.parser")
            content = bs.find("body").find("script")
            reContent = re.findall('"(.*?)"',content.text)
            return reContent[0],reContent[1]
        except:
            return None,None
 
    def GetOtherParam(self,firstUrl,SecUrl,vType,vKey):
        url = "https://api.1suplayer.me/player/?userID=&type=%s&vkey=%s"%(vType,vKey)
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host": "api.1suplayer.me",
            "Referer": firstUrl+SecUrl
        }
        try:
            html = requests.get(url,headers=heads).text
            bs = BeautifulSoup(html,"html.parser")
            content = bs.find("body").find("script").text
            recontent = re.findall(" = '(.+?)'",content)
            payload = {
                    "type":recontent[3],
                    "vkey":recontent[4],
                    "ckey":recontent[2],
                    "userID":"",
                    "userIP":recontent[0],
                    "refres":1,
                    "my_url":recontent[1]
                }
            return payload,url
        except Exception:  # request failed or the script block changed
            return None,None
 
    def GetDownloadUrl(self,payload,refereUrl):
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host": "api.1suplayer.me",
            "Referer": refereUrl,
            "Origin": "https://api.1suplayer.me",
            "X-Requested-With": "XMLHttpRequest"
        }
        nRetry = 0
        while nRetry < 10:  # cap the retries so a stuck 404 cannot loop forever
            retData = requests.post("https://api.1suplayer.me/player/api.php",data=payload,headers=heads).json()
            if retData["code"] == 200:
                return retData["url"]
            elif retData["code"] == 404:
                payload["refres"] += 1
                nRetry += 1
            else:
                return None
        return None
 
    def DownloadVideo(self,url,videoName,videoNum):
        CurrentSize = 0
        heads = {
            "chrome-proxy":"frfr",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host":"sh-yun-ftn.weiyun.com",
            "Range":"bytes=0-"
        }
        with closing(requests.get(url,headers=heads,stream=True)) as response:  # stream=True so the body is fetched chunk by chunk
            retSize = int(response.headers['Content-Length'])
            chunkSize = 10240
            if response.status_code == 206:
                print('[File Size]: %0.2f MB\n' % (retSize/1024.0/1024.0))
                import os
                os.makedirs("./video/%s" % videoName, exist_ok=True)  # make sure the output folder exists
                with open("./video/%s/%02d.mp4"%(videoName,videoNum),"wb") as f:
                    for data in response.iter_content(chunk_size=chunkSize):
                        f.write(data)
                        CurrentSize += len(data)
                        f.flush()
                        print('[Progress]: %0.2f%%' % (CurrentSize*100.0/retSize), end='\r')
 
def main():
    html = requests.get("https://91mjw.com",headers=gHeads).text
    xmlcontent = etree.HTML(html)
    UrlList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/a/@href")
    NameList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/h2/a/text()")
    for i in range(len(UrlList)):
        connectlock.acquire()
        url = UrlList[i]
        name = NameList[i]  # already a str in Python 3; no .encode() needed
        t = MovieThread(url,name)
        t.start()
 
if __name__ == '__main__':
    main()
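The concurrency pattern above, acquire a BoundedSemaphore in the main loop and release it at the end of each thread's run(), is what caps the number of simultaneous downloads. The same pattern in a minimal, self-contained form (squaring numbers stands in for the download work):

```python
from threading import Thread, BoundedSemaphore

nMaxThread = 5
gate = BoundedSemaphore(nMaxThread)
results = []

class Job(Thread):
    def __init__(self, n):
        Thread.__init__(self)
        self.n = n

    def run(self):
        try:
            results.append(self.n * self.n)  # stand-in for the download work
        finally:
            gate.release()  # free a slot even if the work raised

threads = []
for i in range(20):
    gate.acquire()          # blocks once 5 jobs are in flight
    t = Job(i)
    threads.append(t)
    t.start()

for t in threads:
    t.join()
print(len(results))  # 20
```

Because acquire() happens in the main loop and release() in each thread's finally block, the semaphore count stays balanced and at most nMaxThread jobs ever run at once.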

  
