爬虫：HTTP请求与HTML解析（爬取某乎网站）

1. 发送web请求

1.1 requests

　　用requests库的get()方法发送get请求，常常会添加请求头"user-agent"，以及登录"cookie"等参数

1.1.1 user-agent

　　登录网站，将"user-agent"值复制到文本文件

1.1.2 cookie

　　登录网站，将"cookie"值复制到文本文件

1.1.3 测试代码

import requests

from requests.exceptions import RequestException

headers = {

    'cookie': '',

    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'

} # 替换为自己的cookie

def get_page(url):

    try:

        html = requests.get(url, headers=headers, timeout=5)

        if html.status_code == 200:

            print('请求成功')

            return html.text

        else:   # 这个else语句不是必须的

            return None

    except RequestException:

        print('请求失败')

if __name__ == '__main__':

    input_url = 'https://www.zhihu.com/hot'

    get_page(input_url)

1.2 selenium

　　多数网站能通过window.navigator.webdriver的值识别selenium爬虫，因此selenium爬虫首先要防止网站识别selenium模拟浏览器。同样，selenium请求也常常需要添加请求头"user-agent"，以及登录"cookie"等参数

1.2.1 移除Selenium中window.navigator.webdriver的值

　　在程序中添加如下代码（对应老版本谷歌）

from selenium.webdriver import Chrome

from selenium.webdriver import ChromeOptions

option = ChromeOptions()

option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = Chrome(options=option)

time.sleep(10)

1.2.2 user-agent

　　登录网站，将"user-agent"值复制到文本文件，执行如下代码将添加请求头

from selenium.webdriver import Chrome

from selenium.webdriver import ChromeOptions

option = ChromeOptions()

option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')

1.2.3 cookie

　　因为selenium要求cookie需要有"name","value"两个键以及对应的值的值，如果网站上面的cookie是字符串的形式，直接复制网站的cookie值将不符合selenium要求，可以用selenium中的get_cookies()方法获取登录"cookie"

from selenium.webdriver import Chrome

from selenium.webdriver import ChromeOptions

import time

import json

option = ChromeOptions()

option.add_experimental_option('excludeSwitches', ['enable-automation'])

option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')

driver = Chrome(options=option)

time.sleep(10)

driver.get('https://www.zhihu.com/signin?next=%2F')

time.sleep(30)

driver.get('https://www.zhihu.com/')

cookies = driver.get_cookies()

jsonCookies = json.dumps(cookies)

with open('cookies.txt', 'a') as f:  # 文件名和文件位置自己定义

    f.write(jsonCookies)

    f.write('\n')

1.2.4 测试代码示例

　　将上面获取到的cookie复制到下面程序中便可运行

from selenium.webdriver import Chrome

from selenium.webdriver import ChromeOptions

import time

option = ChromeOptions()

option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = Chrome(options=option)

time.sleep(10)

driver.get('https://www.zhihu.com')

time.sleep(10)

driver.delete_all_cookies()   # 清除刚才的cookie

time.sleep(2)

cookie = {}  #  替换为自己的cookie

driver.add_cookie(cookie)

driver.get('https://www.zhihu.com/')

time.sleep(5)

for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):

    print(i.text)

2. HTML解析（元素定位）

　　要爬取到目标数据首先要定位数据所属元素，BeautifulSoup和selenium都很容易实现对HTML的元素遍历

2.1 BeautifulSoup元素定位

　　下面代码BeautifulSoup首先定位到属性为"HotItem-title"的"h2"标签，然后再通过.text()方法获取字符串值

import requests

from requests.exceptions import RequestException

headers = {

    'cookie': '',

    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'

}           # 替换为自己的cookie

def get_page(url):

    try:

        html = requests.get(url, headers=headers, timeout=5)

        if html.status_code == 200:

            print('请求成功')

            return html.text

        else:   # 这个else语句不是必须的

            return None

    except RequestException:

        print('请求失败')

def parse_page(html):

    html = BeautifulSoup(html, "html.parser")

    titles = html.find_all("h2", {'class': 'HotItem-title'})[:10]

    for title in titles:

        print(title.text())

if __name__ == '__main__':

    input_url = 'https://www.zhihu.com/hot'

    parse_page(get_page(input_url))

2.2 selenium元素定位

　　selenium元素定位语法形式与requests不太相同，下面代码示例（1.2.4 测试代码示例）采用了一种层级定位方法：'div[itemprop="zhihu:question"] > a'，笔者觉得这样定位比较放心。

　　selenium获取文本值得方法是.text，区别于requests的.text()

from selenium.webdriver import Chrome

from selenium.webdriver import ChromeOptions

import time

option = ChromeOptions()

option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = Chrome(options=option)

time.sleep(10)

driver.get('https://www.zhihu.com')

time.sleep(10)

driver.delete_all_cookies()   # 清除刚才的cookie

time.sleep(2)

cookie = {}  #  替换为自己的cookie

driver.add_cookie(cookie)

driver.get('https://www.zhihu.com/')

time.sleep(5)

for i in driver.find_elements_by_css_selector('div[itemprop="zhihu:question"] > a'):

    print(i.text)