[Python] A single-topic crawler for the Lixiang Forum (理想论坛)

Code:

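# Single-topic crawler: walks every page of one Lixiang Forum (理想论坛) topic and collects
# each post's floor number, author, date and time. Example URLs are in main().
from bs4 import BeautifulSoup
import requests
import threading
import re

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Matches a run of whitespace so it can be collapsed into a single space
RE = re.compile(r'\s+')

# Topic crawler class (runs in its own thread)
class topicCrawler(threading.Thread):
    def __init__(self, name, url):
        threading.Thread.__init__(self, name=name)  # also sets self.name
        self.url = url
        self.infos = []

    def run(self):
        while self.url != "none":
            print("线程" + self.name + "开始爬取页面" + self.url)  # "thread <name> starts crawling page <url>"

            try:
                # The 30 s timeout is an arbitrary limit so a stalled connection cannot hang the thread
                rsp = requests.get(self.url, headers=headers, timeout=30)
                self.url = "none"  # reset after use; set again below if a next page is found
                # rsp.text is already a decoded str, so bs4 ignores from_encoding and only warns about it
                soup = BeautifulSoup(rsp.text, 'html.parser', from_encoding='utf-8')

                # Each post on the page has one div with class "postinfo"
                for divs in soup.find_all('div', class_="postinfo"):
                    # divs.text holds the floor, author and posting time; collapse its whitespace
                    line = RE.sub(" ", divs.text)

                    # Fields are read by fixed position, so an extra token before the date
                    # shifts 日期/时间 by one field
                    arr = line.split(' ')
                    info = {'楼层': arr[1],                        # floor
                            '作者': arr[2].replace('只看：', ''),  # author, minus the "只看：" ("view only:") prefix
                            '日期': arr[4],                        # date
                            '时间': arr[5]}                        # time
                    self.infos.append(info)

                # Look for the address of the next page
                for pagesDiv in soup.find_all('div', class_="pages"):
                    for strong in pagesDiv.find_all('strong'):
                        print('当前为第' + strong.text + '页')  # "currently on page <n>"

                        # <strong> holds the current page number; its right-hand sibling,
                        # if it exists and has an href, links to the next page
                        nextNode = strong.next_sibling
                        if nextNode and nextNode.get("href"):
                            self.url = 'http://www.55188.com/' + nextNode.get("href")

                if self.url != "none":
                    print("有下一页，线程" + self.name + "前往下一页")  # "next page found, moving on"
                    continue
                else:
                    print("无下一页，线程" + self.name + '爬取结束，开始打印...')  # "no next page, printing results"

                    for info in self.infos:
                        print('\n')
                        for key in info:
                            print(key + ":" + info[key])

                    print("线程" + self.name + '打印结束.')  # "printing finished"

            except Exception as e:
                # Whatever the cause, report it and keep crawling. A failed request leaves
                # self.url unchanged, so the same page is retried without limit.
                print("线程" + self.name + "发生异常。重新爬行")  # "exception, crawling again"
                print(e)
                continue

# Entry point
def main():
    # Other example topic URLs:
    # http://www.55188.com/thread-8205979-1-1.html
    # http://www.55188.com/thread-8324517-1-1.html
    # http://www.55188.com/thread-8205979-61-1.html
    url = 'http://www.55188.com/thread-8319519-1-1.html'
    tc = topicCrawler(name='crawler01', url=url)
    tc.start()

# Start
if __name__ == '__main__':
    main()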
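
The except branch in run() retries without any limit, so a page that keeps failing would keep the thread looping forever. Below is a minimal sketch of a bounded retry, not part of the original crawler. The names fetch_with_retries, max_retries and delay are hypothetical, and the fetcher is passed in as a callable so the sketch can run without network access. In the crawler it could wrap a requests.get call.

```python
import time

def fetch_with_retries(fetch, url, max_retries=3, delay=1.0):
    """Call fetch(url) until it succeeds or max_retries attempts have failed.

    fetch is any callable that takes a URL and returns the page text.
    Returns the page text, or None if every attempt raised an exception.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as e:
            print("attempt %d/%d for %s failed: %s" % (attempt, max_retries, url, e))
            if attempt < max_retries:
                time.sleep(delay)  # short pause before the next attempt
    return None


# Usage with a stand-in fetcher that fails twice and then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("simulated network error")
    return "<html>ok</html>"

page = fetch_with_retries(flaky_fetch,
                          "http://www.55188.com/thread-8319519-1-1.html",
                          delay=0)
print(page)  # <html>ok</html>
```

When the helper returns None, the caller can decide whether to skip the page or stop the crawl, instead of looping indefinitely.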

Output (the program prints its messages in Chinese; 楼层 = floor, 作者 = author, 日期 = date, 时间 = time):

C:\Users\horn1\Desktop\python\14>python topicCrawler.py
线程crawler01开始爬取页面http://www.55188.com/thread-8319519-1-1.html
C:\Users\horn1\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py:146: UserWarning: You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.
  warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
当前为第1页
当前为第1页
有下一页，线程crawler01前往下一页
线程crawler01开始爬取页面http://www.55188.com/thread-8319519-2-1.html
当前为第2页
当前为第2页
有下一页，线程crawler01前往下一页
线程crawler01开始爬取页面http://www.55188.com/thread-8319519-3-1.html
当前为第3页
当前为第3页
无下一页，线程crawler01爬取结束，开始打印...

楼层:楼主
作者:马泰的哥们
日期:2018-3-30
时间:09:59

楼层:2楼
作者:龙波2010
日期:2018-3-30
时间:10:00

楼层:3楼
作者:吗日个边
日期:2018-3-30
时间:10:07

楼层:4楼
作者:小兵旨
日期:2018-3-30
时间:10:30

楼层:5楼
作者:勇儿马甲
日期:2018-3-30
时间:10:37

楼层:6楼
作者:培训资料
日期:2018-3-30
时间:10:43

楼层:7楼
作者:短线冲
日期:2018-3-30
时间:10:56

楼层:8楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:9楼
作者:一赚
日期:2018-3-30
时间:11:01

楼层:10楼
作者:叼叼狼
日期:2018-3-30
时间:11:25

楼层:11楼
作者:酷我行
日期:2018-3-30
时间:11:40

楼层:12楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:13楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:14楼
作者:生活如愿
日期:2018-3-30
时间:11:55

楼层:15楼
作者:小兵旨
日期:2018-3-30
时间:12:42

楼层:16楼
作者:李汶安
日期:2018-3-30
时间:12:50

楼层:17楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:18楼
作者:小兵旨
日期:2018-3-30
时间:13:49

楼层:19楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:20楼
作者:酷我行
日期:2018-3-30
时间:17:21

楼层:21楼
作者:酷我行
日期:2018-3-30
时间:17:24

楼层:22楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:23楼
作者:酷我行
日期:2018-3-30
时间:21:37

楼层:24楼
作者:马泰的哥们
日期:发表于
时间:2018-3-30

楼层:25楼
作者:破局
日期:2018-3-30
时间:21:50

楼层:26楼
作者:小中大学生
日期:2018-3-31
时间:00:27

楼层:27楼
作者:理想5e9a18
日期:2018-3-31
时间:00:57

楼层:28楼
作者:龍樹
日期:2018-3-31
时间:06:29

楼层:29楼
作者:生活如愿
日期:2018-3-31
时间:07:49

楼层:30楼
作者:胶东判官
日期:2018-3-31
时间:12:32

楼层:31楼
作者:胶东判官
日期:2018-3-31
时间:12:32

楼层:32楼
作者:天上下鱼
日期:2018-3-31
时间:13:04

楼层:33楼
作者:天上下鱼
日期:2018-3-31
时间:13:05

楼层:34楼
作者:股市小小手
日期:2018-3-31
时间:14:48

楼层:35楼
作者:股市小小手
日期:2018-3-31
时间:14:50

楼层:36楼
作者:逍遥茶
日期:2018-3-31
时间:15:45

楼层:37楼
作者:马泰的哥们
日期:发表于
时间:2018-4-1

楼层:38楼
作者:理想5e9a18
日期:2018-4-1
时间:03:04

楼层:39楼
作者:马泰的哥们
日期:发表于
时间:2018-4-1

楼层:40楼
作者:陈龙333
日期:2018-4-1
时间:03:05

楼层:41楼
作者:马泰的哥们
日期:发表于
时间:2018-4-1

楼层:42楼
作者:理想5e9a18
日期:2018-4-1
时间:03:10

楼层:43楼
作者:马泰的哥们
日期:发表于
时间:2018-4-2

楼层:44楼
作者:理想5e9a18
日期:2018-4-2
时间:11:18

楼层:45楼
作者:马泰效应
日期:2018-4-4
时间:03:00

楼层:46楼
作者:马泰效应
日期:2018-4-4
时间:03:00

楼层:47楼
作者:韭菜008
日期:2018-4-4
时间:08:08
线程crawler01打印结束.
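
In the output above, several posts show 日期:发表于 with the date in the 时间 field, so the fixed positions arr[4] and arr[5] evidently do not hold for every post. The following is a minimal sketch that locates the date and time by pattern instead, not the original code. The function name parse_postinfo and both sample lines are made up for illustration. It assumes the date and time always appear next to each other, as they do for the correctly parsed posts.

```python
import re

WHITESPACE = re.compile(r'\s+')
# A date such as 2018-3-30 followed by a time such as 09:59
DATE_TIME = re.compile(r'(\d{4}-\d{1,2}-\d{1,2})\s+(\d{1,2}:\d{2})')

def parse_postinfo(text):
    """Extract floor, author, date and time from the text of a postinfo div.

    Floor and author are still read by position (indexes 0 and 1 here,
    because the line is stripped first); date and time are found by
    pattern, so extra tokens in between cannot shift them.
    Returns None if the line has no date/time pair.
    """
    line = WHITESPACE.sub(' ', text).strip()
    arr = line.split(' ')
    match = DATE_TIME.search(line)
    if len(arr) < 2 or match is None:
        return None
    return {'楼层': arr[0],
            '作者': arr[1].replace('只看：', ''),
            '日期': match.group(1),
            '时间': match.group(2)}


# Invented sample lines: a regular one, and one with an extra token before 发表于
normal = "\n 2楼 只看：user_a \n 发表于 2018-3-30 10:00 \n"
shifted = " 8楼 只看：user_b EXTRA 发表于 2018-3-30 11:00 "

print(parse_postinfo(normal))
print(parse_postinfo(shifted))
```

Both sample lines yield the date in 日期 and the time in 时间, regardless of the extra token.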

This crawler is simple, but it is one step in a larger plan.
