Python 3: Scraping news headlines and article content from four major sites: Sina, NetEase, Toutiao, and UC

Each site's society-news channel is used as the example:

1. Sina:

Sina's news is the easiest to scrape. I parse the pages directly with BeautifulSoup; nothing is loaded asynchronously with JavaScript, so a plain request for the HTML is enough.
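Before committing to the full script, it is easy to confirm that the list page really is rendered server-side. This is my own minimal check, assuming the <ul class="seo_data_list"> structure that the script below relies on is still present:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of the society-news list page; no JS engine involved.
html = requests.get('http://news.sina.com.cn/society/').text
soup = BeautifulSoup(html, 'lxml')

# If the <ul class="seo_data_list"> block and its links show up here,
# the headlines live in the static HTML and plain parsing is enough.
ul = soup.find('ul', class_='seo_data_list')
print(ul is not None and len(ul.find_all('a')) > 0)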

'''
Sina news: http://news.sina.com.cn/society/
Date: 20180920
Author: lizm
Description: fetch Sina society news
'''
import requests
from bs4 import BeautifulSoup
from urllib import request
import sys
import re
import os

def getNews(title, url, m):
    Hostreferer = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    req = request.Request(url)
    response = request.urlopen(req)
    # ignore pages that are not valid UTF-8
    response = response.read().decode('utf-8', 'ignore')
    soup = BeautifulSoup(response, 'lxml')
    tag = soup.find('div', class_='article')
    if tag is None:
        return 0
    # publication time of the article
    fb_date = soup.find('div', 'date-source').span.string
    # name of the publishing site
    fb_www = soup.find('div', 'date-source').a.string
    # strip characters from the title that cannot appear in a file name
    rep = re.compile("[\s+\.\!\/_,$%^*(+\"\']+|[+<>?、~*()]+")
    title = rep.sub('', title)
    title = title.replace('：', ':')  # normalize the full-width colon
    # write the article text
    filename = sys.path[0] + "/news/" + title + ".txt"
    with open(filename, 'w', encoding='utf8') as file_object:
        file_object.write(fb_date + " " + fb_www)
        file_object.write("\n")
        file_object.write("URL: " + url)
        file_object.write("\n")
        file_object.write(title)
        file_object.write(tag.get_text())
    # save the article's images
    i = 0
    for image in tag.find_all('div', 'img_wrapper'):
        title_img = title + str(i)
        # create a per-article directory if it does not exist yet
        if not os.path.exists(sys.path[0] + "/news/" + title):
            os.mkdir(sys.path[0] + "/news/" + title)
        os.chdir(sys.path[0] + "/news/" + title)
        # the img src on these pages is protocol-relative ("//..."), so rebuild an absolute URL
        file_name = "http://" + image.img.get('src').replace('//', '')
        html = requests.get(file_name, headers=Hostreferer)
        # images are binary, not text, so write html.content in 'wb' mode
        title_img = title_img + ".jpg"
        with open(title_img, 'wb') as f:
            f.write(html.content)
        i += 1
    print('Fetched news item', m, ':', title)
    return 0

# fetch the society-news list (the latest 162 items)
def getTitle(url):
    req = request.Request(url)
    response = request.urlopen(req)
    response = response.read().decode('utf8')
    soup = BeautifulSoup(response, 'lxml')
    y = 0
    for tag in soup.find('ul', class_='seo_data_list').find_all('li'):
        if tag.a is not None:
            print(y, tag.a.string, tag.a.get('href'))
            temp = tag.a.string
            getNews(temp, tag.a.get('href'), y)
        y += 1

if __name__ == '__main__':
    url = 'http://news.sina.com.cn/society/'
    getTitle(url)
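One practical note on the script above: it saves text files under sys.path[0] + "/news/", so that directory must exist before the first article is written. A small optional preparation step (my own addition, not part of the original script):

import os
import sys

# Create the output directory once, before getTitle() runs;
# exist_ok avoids an error if it is already there.
os.makedirs(os.path.join(sys.path[0], 'news'), exist_ok=True)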

2. NetEase:

NetEase loads its headlines and links asynchronously with JavaScript, so the raw page source alone contains neither. The data can be found among the JS requests listed under the Network tab. Here I use regular expressions to pull the headlines and their links out of those responses, and BeautifulSoup to fetch the body of each article.
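Before looking at the full script, the two regular expressions it relies on can be tried against a small fragment of a data_callback response. The sample string below is made up for illustration; the real responses are much longer, but the "title"/"tlink" fields follow this shape:

import re

# A made-up fragment in the same shape as the cm_shehui*.js callback payload.
sample = '"title":"示例新闻标题","tlink":"http://news.163.com/18/0920/00/XXXX.html",'

pat1 = r'"title":"(.*?)",'   # capture the headline
pat2 = r'"tlink":"(.*?)",'   # capture the article link

print(re.findall(pat1, sample))   # ['示例新闻标题']
print(re.findall(pat2, sample))   # ['http://news.163.com/18/0920/00/XXXX.html']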

import re
from urllib import request
from bs4 import BeautifulSoup

def download(title, url):
    req = request.urlopen(url)
    res = req.read()
    soup = BeautifulSoup(res, 'lxml')
    # the article body lives in <div class="post_text">
    tag = soup.find('div', class_='post_text')
    # strip characters that are illegal in Windows file names
    title = title.replace(':', '')
    title = title.replace('"', '')
    title = title.replace('|', '')
    title = title.replace('/', '')
    title = title.replace('\\', '')
    title = title.replace('*', '')
    title = title.replace('<', '')
    title = title.replace('>', '')
    title = title.replace('?', '')
    file_name = r'D:\code\python\spider_news\NetEase_news\sociaty\\' + title + '.txt'
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write(tag.get_text())

if __name__ == '__main__':
    urls = ['http://temp.163.com/special/00804KVA/cm_shehui.js?callback=data_callback',
            'http://temp.163.com/special/00804KVA/cm_shehui_02.js?callback=data_callback',
            'http://temp.163.com/special/00804KVA/cm_shehui_03.js?callback=data_callback']
    for url in urls:
        req = request.urlopen(url)
        res = req.read().decode('gbk')
        # pull the headlines and article links out of the JS callback payload
        pat1 = r'"title":"(.*?)",'
        pat2 = r'"tlink":"(.*?)",'
        news_title = re.findall(pat1, res)
        news_url = re.findall(pat2, res)
        for i in range(len(news_url)):
            download(news_title[i], news_url[i])
            print('Fetched news item ' + str(i) + ':', news_title[i])
 

3. Toutiao:

Toutiao is different again from the first two sites. Its headlines and links are wrapped in a JSON feed, but the feed URL's parameters are generated by a JavaScript routine, so we have to reproduce those parameters ourselves or we can never build the concrete URL of the JSON file. I learned how the URL is constructed from the post at http://www.jianshu.com/p/5a93673ce1c0, which also solved the problem of repeatedly downloading the same news items. The site has basic anti-scraping measures, so a cookie has to be sent. For the article body I used a regular expression to extract the Chinese text.
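To see what the feed actually returns, one page can be inspected by hand. The sketch below uses max_behot_time=0 together with the fixed fallback as/cp values that get_ASCP() in the script below also returns; whether those constants still pass Toutiao's server-side check, and which cookies are required, may well have changed since this was written, so treat it purely as an inspection aid:

import json
import requests

# First page of the society feed, using the hard-coded fallback as/cp constants.
url = ('https://www.toutiao.com/api/pc/feed/?category=news_society'
       '&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0'
       '&tadrequire=true&as=479BB4B7254C150&cp=7E0AC8874BB0985')

resp = requests.get(url, cookies={'tt_webid': ''})
feed = json.loads(resp.text)

# Each entry carries the headline and a relative link; 'next' holds the paging cursor.
for item in feed.get('data', []):
    print(item['title'], 'https://www.toutiao.com' + item['source_url'])
print('next max_behot_time:', feed.get('next', {}).get('max_behot_time'))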

from urllib import request
import requests
import json
import time
import math
import hashlib
import re

def get_url(max_behot_time, AS, CP):
    # the feed URL takes the paging cursor plus the as/cp signature parameters
    url = 'https://www.toutiao.com/api/pc/feed/?category=news_society&utm_source=toutiao&widen=1' \
          '&max_behot_time={0}' \
          '&max_behot_time_tmp={0}' \
          '&tadrequire=true' \
          '&as={1}' \
          '&cp={2}'.format(max_behot_time, AS, CP)
    return url

def get_ASCP():
    # as/cp are derived from the current timestamp: interleave the uppercase
    # hex timestamp with characters of its md5 digest
    t = int(math.floor(time.time()))
    e = hex(t).upper()[2:]
    m = hashlib.md5()
    m.update(str(t).encode(encoding='utf-8'))
    i = m.hexdigest().upper()
    if len(e) != 8:
        AS = '479BB4B7254C150'
        CP = '7E0AC8874BB0985'
        return AS, CP
    n = i[0:5]
    a = i[-5:]
    s = ''
    r = ''
    for o in range(5):
        s += n[o] + e[o]
        r += e[o + 3] + a[o]
    AS = 'A1' + s + e[-3:]
    CP = e[0:3] + r + 'E1'
    # print("AS:" + AS, "CP:" + CP)
    return AS, CP

def download(title, news_url):
    req = request.urlopen(news_url)
    if req.getcode() != 200:
        return 0
    res = req.read().decode('utf-8')
    # the article body sits in an inline "content:" JS field; keep only the Chinese text
    pat1 = r'content:(.*?),'
    pat2 = re.compile('[\u4e00-\u9fa5]+')
    result1 = re.findall(pat1, res)
    if len(result1) == 0:
        return 0
    result2 = re.findall(pat2, str(result1))
    # drop duplicated fragments while keeping their order
    result3 = []
    for i in result2:
        if i not in result3:
            result3.append(i)
    # strip characters that are illegal in Windows file names
    title = title.replace(':', '')
    title = title.replace('"', '')
    title = title.replace('|', '')
    title = title.replace('/', '')
    title = title.replace('\\', '')
    title = title.replace('*', '')
    title = title.replace('<', '')
    title = title.replace('>', '')
    title = title.replace('?', '')
    with open(r'D:\code\python\spider_news\Toutiao_news\society\\' + title + '.txt', 'w', encoding='utf-8') as file_object:
        file_object.write('\t\t\t\t')
        file_object.write(title)
        file_object.write('\n')
        file_object.write('Article URL: ')
        file_object.write(news_url)
        file_object.write('\n')
        for i in result3:
            file_object.write(i)
            file_object.write('\n')

def get_item(url):
    # time.sleep(5)
    cookies = {'tt_webid': ''}
    wbdata = requests.get(url, cookies=cookies)
    wbdata2 = json.loads(wbdata.text)
    data = wbdata2['data']
    for news in data:
        title = news['title']
        news_url = 'https://www.toutiao.com' + news['source_url']
        print(title, news_url)
        # skip advertisement entries
        if 'ad_label' in news:
            print(news['ad_label'])
            continue
        download(title, news_url)
    # max_behot_time of this batch is the paging cursor for the next request
    next_data = wbdata2['next']
    next_max_behot_time = next_data['max_behot_time']
    return next_max_behot_time

if __name__ == '__main__':
    refresh = 50
    for x in range(0, refresh + 1):
        print('Round {0}:'.format(x))
        if x == 0:
            max_behot_time = 0
        else:
            max_behot_time = next_max_behot_time
        AS, CP = get_ASCP()
        url = get_url(max_behot_time, AS, CP)
        next_max_behot_time = get_item(url)
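One small follow-up on the main loop: get_item() already contains a commented-out time.sleep(5), and with fifty back-to-back refreshes a short pause between requests seems prudent. This is my own tweak to the __main__ block, a sketch that assumes the functions defined above:

refresh = 50
max_behot_time = 0
for x in range(refresh + 1):
    print('Round {0}:'.format(x))
    AS, CP = get_ASCP()
    url = get_url(max_behot_time, AS, CP)
    max_behot_time = get_item(url)   # carry the paging cursor into the next round
    time.sleep(5)                    # pause between feed refreshes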
 

4. UC

UC is much like Sina: there is no complicated anti-scraping, so the pages can be fetched and parsed directly.

from bs4 import BeautifulSoup
from urllib import request

def download(title, url):
    req = request.Request(url)
    response = request.urlopen(req)
    response = response.read().decode('utf-8')
    soup = BeautifulSoup(response, 'lxml')
    # the article body lives in <div class="sm-article-content">
    tag = soup.find('div', class_='sm-article-content')
    if tag is None:
        return 0
    # strip characters that are illegal in Windows file names
    title = title.replace(':', '')
    title = title.replace('"', '')
    title = title.replace('|', '')
    title = title.replace('/', '')
    title = title.replace('\\', '')
    title = title.replace('*', '')
    title = title.replace('<', '')
    title = title.replace('>', '')
    title = title.replace('?', '')
    with open(r'D:\code\python\spider_news\UC_news\society\\' + title + '.txt', 'w', encoding='utf-8') as file_object:
        file_object.write('\t\t\t\t')
        file_object.write(title)
        file_object.write('\n')
        file_object.write('Article URL: ')
        file_object.write(url)
        file_object.write('\n')
        file_object.write(tag.get_text())

if __name__ == '__main__':
    url = 'https://news.uc.cn/c_shehui/'
    # headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
    #            "cookie": "sn=3957284397500558579; _uc_pramas=%7B%22fr%22%3A%22pc%22%7D"}
    # res = request.Request(url, headers=headers)
    # request the list page several times; presumably each refresh serves a different batch of items
    for i in range(0, 7):
        res = request.urlopen(url)
        req = res.read().decode('utf-8')
        soup = BeautifulSoup(req, 'lxml')
        tag = soup.find_all('div', class_='txt-area-title')
        for x in tag:
            news_url = 'https://news.uc.cn' + x.a.get('href')
            print(x.a.string, news_url)
            download(x.a.string, news_url)
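A closing note: the NetEase, Toutiao, and UC scripts all repeat the same nine title.replace(...) calls to strip characters that Windows does not allow in file names. If you prefer, that can be factored into one small helper; the re.sub below is my own rewrite of those replace chains, not part of the original scripts:

import re

def safe_filename(title):
    # Remove characters that Windows forbids in file names, equivalent to
    # the chained title.replace(...) calls used in the scripts above.
    return re.sub(r'[:"|/\\*<>?]', '', title)

print(safe_filename('独家|记者调查:某地"怪象"?'))   # 独家记者调查某地怪象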
 
