Scraping the Maoyan Top 100 movies with requests and regular expressions

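import json
import re
import time

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None if the request fails."""
    # Send a browser-like User-Agent so the request resembles a normal visit.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    try:
        # A timeout keeps the script from hanging on a stalled connection.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    # Raw strings keep \d from being read as a string escape.
    # re.S lets '.' match newlines, because each <dd> block spans several lines.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    for item in pattern.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # Drop the leading 3-character label ("主演：" on the page).
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # Drop the leading 5-character label ("上映时间：" on the page).
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            # The page splits the score into an integer part and a fraction part.
            'score': item[5].strip() + item[6].strip()
        }


def write_to_file(content):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese text readable.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # Skip pages that could not be downloaded instead of crashing in the parser.
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    # Ten pages of ten movies each: offsets 0, 10, ..., 90.
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause between requests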
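To see what each capture group in the pattern picks up without hitting the site, the same regular expression can be run against a small fragment. This is only a sketch: `SAMPLE_HTML` is a hypothetical fragment modelled on the class names the pattern looks for, and the URL, film name, and values in it are invented for illustration. The helper name `parse_sample` is also an assumption.

```python
import re

# Hypothetical fragment modelled on the markup the pattern expects;
# the URL, names, and values are invented for illustration.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" class="image-link">
    <img data-src="https://example.com/poster.jpg" class="board-img" />
  </a>
  <p class="name"><a href="/films/1" title="Example Film">Example Film</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Same pattern as in the script, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)


def parse_sample(html):
    """Return one dict per <dd> block matched by PATTERN."""
    movies = []
    for index, image, title, star, release, integer, fraction in PATTERN.findall(html):
        movies.append({
            'index': index,
            'image': image,
            'title': title.strip(),
            'actor': star.strip()[3:],     # drop the 3-character "主演：" label
            'time': release.strip()[5:],   # drop the 5-character "上映时间：" label
            'score': integer.strip() + fraction.strip(),
        })
    return movies


if __name__ == '__main__':
    for movie in parse_sample(SAMPLE_HTML):
        print(movie)
```

Because every `.*?` is lazy and `re.S` is set, each group stops at the first closing delimiter that follows it, even when the tags are spread over several lines. If the site changes its markup, the pattern will simply return no matches.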

Scraping the Maoyan Top 100 movies with requests and BeautifulSoup

import requests
from bs4 import BeautifulSoup


def gethtmlpage(url):
    """Fetch a page and return its HTML, or None on a non-200 response."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None


def parsehtmlpage(html):
    """Return the <a> tags that hold the movie titles."""
    soup = BeautifulSoup(html, 'lxml')
    return soup.select('.movie-item-info a')


def write_to_file(content):
    # Append one title per line.
    with open('result.txt', 'a', encoding="utf-8") as f:
        f.write(content + '\n')


def main(url):
    html = gethtmlpage(url)
    if html is None:
        # Skip pages that could not be downloaded.
        return
    for link in parsehtmlpage(html):
        write_to_file(link.get_text(strip=True))


if __name__ == '__main__':
    # Offsets 0, 10, ..., 90 cover all ten pages; starting at 10 would skip the first page.
    for i in range(0, 100, 10):
        url = "https://maoyan.com/board/4?offset=%d" % i
        main(url)
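Both Maoyan scripts page through the board with the `offset` query parameter in steps of ten. A minimal sketch of generating the full list of page URLs is given below; the names `page_urls`, `PAGE_SIZE`, and `TOTAL` are assumptions introduced for illustration, and the page size of ten is inferred from the offset step used in the scripts.

```python
BASE_URL = "https://maoyan.com/board/4"
PAGE_SIZE = 10   # assumed: ten films per page, inferred from the offset step
TOTAL = 100      # the board lists the top 100 films


def page_urls(base_url=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page, starting at offset 0."""
    return [f"{base_url}?offset={offset}" for offset in range(0, total, page_size)]


if __name__ == '__main__':
    for url in page_urls():
        print(url)
```

Starting the range at 0 matters: the first page of the board corresponds to `offset=0`, so a range that begins at 10 silently drops the first ten films.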

Scraping the Douban Top 250 movies with requests and BeautifulSoup:

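import requests
from bs4 import BeautifulSoup


def gethtmlpage(url):
    """Fetch a page and return its HTML, or None if the request fails."""
    # Assumption: the site may reject the default requests User-Agent,
    # so send a browser-like one as in the scripts above.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
    }
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.exceptions.ConnectionError:
        # The builtin ConnectionError does not catch the exception raised by requests.
        print("Network connection error")
        return None
    except requests.exceptions.RequestException:
        print("Unknown request error")
        return None


def parsehtmlpage(html):
    """Return [index, title, rating, link, summary] for each movie on the page."""
    soup = BeautifulSoup(html, 'lxml')
    ol = soup.select("ol.grid_view")
    if not ol:
        return []
    movie = []
    for li in ol[0].select('li'):
        index = li.select(".pic em")[0].string
        # Pass attrs as a dict mapping attribute name to value
        # (the original passed a set by mistake).
        title = li.find("span", attrs={'class': 'title'}).string
        rating_num = li.find("span", attrs={'class': 'rating_num'}).string
        lianjie = li.select(".hd a")[0].get('href')  # link to the movie page
        # Some entries have no one-line summary.
        inq_tag = li.find("span", attrs={'class': 'inq'})
        inq = inq_tag.string if inq_tag is not None else "No summary"
        movie.append([index, title, rating_num, lianjie, inq])
    return movie


def writetofile(content):
    with open('result.csv', 'a', encoding='utf-8') as f:
        f.write(content)


def main(url):
    html = gethtmlpage(url)
    if html is None:
        # Skip pages that could not be downloaded.
        return
    for row in parsehtmlpage(html):
        # chr(12288) is the full-width space, used as the fill character so CJK titles line up.
        writetofile("{0:^5}\t{1:{5}^10}\t{2:^10}\t{3:^40}\t{4:<10}\n".format(row[0], row[1], row[2], row[3], row[4], chr(12288)))


if __name__ == '__main__':
    # Write the header row once, then fetch the ten pages of 25 movies each.
    writetofile("{0:^5}\t{1:{5}^10}\t{2:^8}\t{3:^40}\t{4:<10}\n".format("Rank", "Title", "Rating", "Link", "One-line summary", chr(12288)))
    for i in range(0, 10):
        url = "https://movie.douban.com/top250?start={}".format(i * 25)
        main(url)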
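The script above pads each field and joins the fields with tabs before writing them to `result.csv`, so the file is aligned text rather than comma-separated data. As an alternative, here is a minimal sketch that uses the standard library's `csv` module, which handles quoting of commas and quotes inside fields. The helper name `write_rows`, the header labels, and the sample row are assumptions for illustration and are not part of the original script.

```python
import csv
import io

# Hypothetical column labels matching the five fields collected per movie.
HEADER = ["rank", "title", "rating", "link", "summary"]


def write_rows(rows, fileobj):
    """Write a header row followed by the movie rows as CSV to fileobj."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    writer.writerows(rows)


if __name__ == '__main__':
    # Invented sample row; the summary contains a comma to show the quoting.
    sample = [["1", "Example Film", "9.7",
               "https://example.com/film/1", "A quote, with a comma"]]
    buffer = io.StringIO()
    write_rows(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with `newline=''` (for example `open('result.csv', 'w', newline='', encoding='utf-8')`) so that the `csv` module controls the line endings itself.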
