[Python crawler] A simple crawler for scraping Baijiahao articles

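Requirements

Search Baidu Baijiahao for articles matching the keyword "老龄智能" (smart technology for the elderly), then scrape each article's content and related metadata.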

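Inspecting the page

The area outlined in red on the search page lets you choose the news source. I chose Baijiahao because it aggregates news reports from many platforms. I first checked robots.txt, which as far as I could tell places few restrictions on crawlers. Next I located the page elements. My plan was to collect the link of every article on the search results page shown above, put the links in a list, and then visit each one in a loop to fetch its content. This is the other reason for choosing Baijiahao: whichever article link you follow, Baijiahao article pages all share the same HTML structure.

The article links are easy to locate with Chrome's developer tools (F12). Pagination also has to be handled; on sites without much anti-crawling protection, you can usually page through results just by changing the URL.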

https://www.baidu.com/s?tn=news&rtt=4&bsst=1&cl=2&wd=%E8%80%81%E9%BE%84%E6%99%BA%E8%83%BD&medium=2&x_bfe_rqs=20001&x_bfe_tjscore=0.000000&tngroupname=organic_news&newVideo=12&rsv_dl=news_b_pn&pn=20

https://www.baidu.com/s?tn=news&rtt=4&bsst=1&cl=2&wd=%E8%80%81%E9%BE%84%E6%99%BA%E8%83%BD&medium=2&x_bfe_rqs=03E80&x_bfe_tjscore=0.000000&tngroupname=organic_news&newVideo=12&rsv_dl=news_b_pn&pn=10

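The two URLs above are for the third and second pages respectively. The number after the final "pn=" clearly determines the page: it increases by 10 per page, so the first page is pn=0. With that, we can start writing the crawler.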
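The pn rule can be sketched as a small helper. This is only an illustration: the function name build_search_url is hypothetical, and it assumes the tracking-style parameters in the URLs above (x_bfe_rqs, x_bfe_tjscore and so on) can be left out, which the original post does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"
RESULTS_PER_PAGE = 10  # inferred from pn=10 (page 2) and pn=20 (page 3)


def build_search_url(keyword, page):
    """Return the news-search URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    params = {
        "tn": "news",
        "rtt": 4,
        "bsst": 1,
        "cl": 2,
        "wd": keyword,  # urlencode percent-encodes the keyword as UTF-8
        "medium": 2,
        "pn": (page - 1) * RESULTS_PER_PAGE,  # page 1 -> 0, page 2 -> 10, ...
    }
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("老龄智能", 3))
```

Building the query with urlencode keeps the keyword encoding and the page offset in one place, rather than concatenating strings by hand.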

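Writing the code

Start with the part that collects the article links. A crawler first needs its request headers settled. I only used user-agent and Host here; in my tests, Baidu did need the Host header.

Open the developer tools again (F12), click any request in the Network tab, and look at its Request Headers to find the values you need. Copy the user-agent and Host from there into your own code.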

import time
import requests
from bs4 import BeautifulSoup
from openpyxl import workbook

# Headers for the Baidu search results pages
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Host': 'www.baidu.com'
}

# Headers for the Baijiahao article pages
headers1 = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Host': 'baijiahao.baidu.com'
}

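After importing the libraries, I define two sets of headers: the first is for the Baidu search pages, and the second is for fetching the Baijiahao articles.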

# Collect the Baijiahao article links from one search results page
def get_connect(link):
    try:
        r = requests.get(link, headers=headers, timeout=10)
        if r.status_code != 200:
            return None
        url_list = []
        soup = BeautifulSoup(r.text, "lxml")
        div_list = soup.find_all('div', class_='result-op c-container xpath-log new-pmd')
        for div in div_list:
            # The article URL is stored in the div's "mu" attribute
            mu = div.get('mu')
            if not mu:
                continue  # skip result blocks that carry no link
            mu = mu.strip()
            url_list.append(mu)
            print(mu)
        return get_content(url_list)
    except Exception as e:
        print('Error:\t', e)
    finally:
        print('go ahead!\n\n')

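This function collects the Baijiahao links, that is, it scrapes the Baidu search results page. Parsing the page with bs4 is all it takes, so there is nothing difficult here. A try/except block provides basic error handling.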
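One detail worth illustrating: passing a multi-class string to find_all matches only when the class attribute is exactly that string, whereas a CSS selector matches the classes in any order. The sketch below runs on a made-up HTML snippet (the ids in the URLs are invented, and extract_links is a hypothetical name), so it shows the technique rather than Baidu's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up sample: same classes in different orders, plus one div without "mu"
SAMPLE_HTML = """
<div class="result-op c-container xpath-log new-pmd" mu=" https://baijiahao.baidu.com/s?id=1 "></div>
<div class="c-container new-pmd result-op xpath-log" mu="https://baijiahao.baidu.com/s?id=2"></div>
<div class="result-op c-container xpath-log new-pmd"></div>
"""


def extract_links(html):
    """Return the stripped "mu" attribute of every matching result div."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # [mu] keeps only divs that actually have the attribute
    for div in soup.select("div.result-op.c-container[mu]"):
        links.append(div["mu"].strip())
    return links


print(extract_links(SAMPLE_HTML))
```

The selector also skips blocks with no mu attribute, which avoids a KeyError on results that are not article links.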

# Fetch the content of each Baijiahao article
def get_content(url_list):
    try:
        for url in url_list:
            try:
                r1 = requests.get(url, headers=headers1, timeout=10)
                soup1 = BeautifulSoup(r1.text, "lxml")
                s1 = soup1.select('.article-title > h2:nth-child(1)')
                s2 = soup1.select('.date')
                s3 = soup1.select('.author-name > a:nth-child(1)')
                s4 = soup1.find_all('span', class_='bjh-p')
                title = s1[0].get_text().strip()
                date = s2[0].get_text().strip()
                source = s3[0].get_text().strip()
                # Strip surrounding whitespace and remove newlines inside each paragraph
                clist = [t4.get_text().strip().replace('\n', '') for t4 in s4]
                content = ''.join(clist)
                ws.append([title, date, source, content])
                print([title, date])
            except Exception as e:
                # One failing article (e.g. a missing element) should not stop the rest
                print("Error: ", e)
    finally:
        wb.save('XXX.xlsx')  # save everything scraped so far to Excel
        print('OK!\n\n')

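Crawlers usually locate page elements in one of three ways: the plain bs4 find/find_all methods, CSS selectors (via soup.select), or XPath. Note that BeautifulSoup itself does not support XPath; for that you need a library such as lxml. In Firefox's developer tools (F12), you can right-click the element you want and copy either its CSS selector or its XPath.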
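As a rough comparison of the find-style and CSS-selector approaches, the sketch below looks up the same fields both ways. The class names come from the code above, but the HTML itself is a made-up stand-in for an article page, not real Baijiahao markup.

```python
from bs4 import BeautifulSoup

# Made-up article snippet reusing the class names from get_content
SAMPLE_ARTICLE = """
<div class="article-title"><h2>Sample title</h2></div>
<span class="date">2020-11-20</span>
<p class="author-name"><a href="#">Sample author</a></p>
<span class="bjh-p">First paragraph.</span>
<span class="bjh-p">Second paragraph.</span>
"""

soup = BeautifulSoup(SAMPLE_ARTICLE, "html.parser")

# CSS selector: one expression describes the whole path
title_css = soup.select_one(".article-title > h2").get_text(strip=True)

# find(): the same lookup written as chained calls
title_find = soup.find("div", class_="article-title").find("h2").get_text(strip=True)

# find_all(): every paragraph span, in document order
paragraphs = [s.get_text(strip=True) for s in soup.find_all("span", class_="bjh-p")]

print(title_css, title_find, paragraphs)
```

Both lookups return the same element here; the selector form tends to be shorter for nested paths, while find and find_all read more naturally for single-tag searches.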

# Main entry point
if __name__ == '__main__':
    raw_url = 'https://www.baidu.com/s?tn=news&rtt=4&bsst=1&cl=2&wd=老龄智能&medium=2&x_bfe_rqs=20001&x_bfe_tjscore=0.000000&tngroupname=organic_news&newVideo=12&rsv_dl=news_b_pn&pn='
    wb = workbook.Workbook()  # create the Excel workbook
    ws = wb.active  # get the active worksheet
    # Write the header row, passed as a list
    ws.append(['title', 'dt', 'source', 'content'])
    # Page through the results by changing pn in the URL
    for i in range(1000):
        link = raw_url + str(i * 10)
        get_connect(link)
        print('page', i + 1)
        # time.sleep(5)  # optional pause between pages
    wb.save('XXX.xlsx')
    print('finished')
    wb.close()

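Finally, write the main block, which uses a for loop to page through the results. The range of 1000 is a number I picked off the top of my head; adjust it to the actual number of result pages.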
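Instead of guessing the page count, one option is to stop as soon as a page yields no links. The sketch below shows that idea with a stand-in fetcher. It is not the post's code: crawl_pages and fake_fetch are hypothetical names, and it assumes the link-collecting function returns the list of links it found, which get_connect above does not currently do.

```python
import time


def crawl_pages(fetch_links, max_pages=1000, delay=0.0):
    """Call fetch_links(page_index) until it returns no links or max_pages is reached."""
    all_links = []
    for i in range(max_pages):
        links = fetch_links(i)
        if not links:
            break  # an empty page is treated as the end of the results
        all_links.extend(links)
        if delay:
            time.sleep(delay)  # optional pause between pages
    return all_links


# Stand-in for a real fetcher: two pages of results, then nothing
FAKE_RESULTS = {0: ["a", "b"], 1: ["c"]}


def fake_fetch(page_index):
    return FAKE_RESULTS.get(page_index, [])


print(crawl_pages(fake_fetch))
```

The max_pages cap still bounds the loop in case the site keeps returning results indefinitely.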

Result

Done!
