[Python] Crawling the details of every vulnerability entry on the WooYun forum

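Each vulnerability entry contains:

WooYun ID, vulnerability title, affected vendor, white hat (the reporter), vulnerability type, and the Rank value assigned by the vendor or the platform

The data is mainly intended for analysis, for example:
counting each type of vulnerability for a given vendor;
or assessing the abilities of individual white hats, and so on.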
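
As a rough illustration of the two analyses mentioned above, here is a minimal sketch. It assumes the result file uses the `----`-separated line format that the crawler script writes (ID, title, vendor, white hat, type, Rank, with `Null` for a missing Rank). The function names, field names, and sample lines are invented for this example and are not part of the original script or data set.

```python
from collections import Counter, defaultdict

# Field order follows the line the crawler writes:
# id----title----vendor----author----bug_type----rank
FIELDS = ("wooyun_id", "title", "vendor", "author", "bug_type", "rank")


def parse_line(line):
    """Split one result line into a dict; return None if it is malformed."""
    parts = line.rstrip("\n").split("----")
    if len(parts) != len(FIELDS):
        # e.g. a title that itself contains '----'; skipped in this sketch
        return None
    return dict(zip(FIELDS, parts))


def vendor_type_counts(lines):
    """Count vulnerabilities per vendor, broken down by vulnerability type."""
    counts = defaultdict(Counter)
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[rec["vendor"]][rec["bug_type"]] += 1
    return counts


def author_rank_totals(lines):
    """Sum the Rank values per white hat, ignoring entries with rank 'Null'."""
    totals = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec["rank"].isdigit():
            totals[rec["author"]] += int(rec["rank"])
    return totals


# Made-up sample lines, only to show the expected shape of the input.
SAMPLE_LINES = [
    "wooyun-2016-000001----Example SQL injection----ExampleCorp----alice----SQL injection----15",
    "wooyun-2016-000002----Example XSS----ExampleCorp----bob----XSS----5",
    "wooyun-2016-000003----Another SQL injection----ExampleCorp----alice----SQL injection----Null",
    "wooyun-2016-000004----Weak password----OtherCorp----alice----Weak password----10",
]

if __name__ == "__main__":
    for vendor, by_type in vendor_type_counts(SAMPLE_LINES).items():
        print(vendor, dict(by_type))
    print(author_rank_totals(SAMPLE_LINES).most_common())
```

To run this against the real data, the sample list could be replaced by an open file handle, since both functions only iterate over lines. Summed Rank is just one possible proxy for a white hat's ability; counts per type or average Rank would work the same way.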

Data last updated: 2016/5/27
Vulnerability entries: 104,796

Screenshot of the data:

Download link for the data (Baidu Pan):

Link: http://pan.baidu.com/s/1bpDNKOv Password: 6y57

Crawler script:

# coding:utf-8
# author: anka9080
# version: 1.0  py3

import sys,re,time,socket
from requests import get
from queue import Queue, Empty
from threading import Thread

# Global variables (shared by all crawler threads)
COUNT = 1
START_URL = 'http://wooyun.org/bugs'
ID_DETAILS = []
ALL_ID = []
Failed_ID = []
PROXIES = []

HEADERS = {
	"Accept": "text/html,application/xhtml+xml,application/xml,application/json;q=0.9,image/webp,*/*;q=0.8",
	"Accept-Encoding": "gzip, deflate, sdch",
	"Accept-Language": "zh-CN,zh;q=0.8",
	"Cache-Control": "max-age=0",
	"Connection": "keep-alive",
	"DNT": "1",
	"Host": "wooyun.org",
	"Upgrade-Insecure-Requests": "1",
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2716.0 Safari/537.36"
}

class WooYunSpider(Thread):
	"""Crawler thread: fetches bug pages by ID and extracts their details."""

	def __init__(self,queue):
		Thread.__init__(self)
		# Flags such as re.S are specified when the pattern is compiled
		self.pattern1 = re.compile(r'title>(.*?)\| WooYun.*?keywords" content="(.*?),(.*?),(.*?),wooyun',re.S)
		# Matches the literal Chinese label on the page ("漏洞Rank：" means "Vulnerability Rank:")
		self.pattern2 = re.compile(r"漏洞Rank：(\d{1,3})")
		self.queue = queue
		self.start() # runs run()

	def run(self):
		"Read one entry from the queue per iteration"
		global COUNT,RES_LOG,ERR_LOG
		while True:
			try:
				id = self.queue.get(block = False)
				r = get('http://wooyun.org/bugs/' + id,headers = HEADERS)
				html = r.text
			except Empty:
				break
			except Exception as e:
				msg = '[ - Socket_Excpt ] Connection refused, adding back to the queue: ' + id
				print(msg)
				ERR_LOG.write(msg+'\n')
				self.queue.put(id)  # if the request fails, put this ID back on the queue
			else:
				title,comp,author,bug_type,rank = self.get_detail(html,id)
				detail = id+'----'+title+'----'+comp+'----'+author+'----'+bug_type+'----'+rank
				try: # writing to the file may raise a gbk encoding error; save the ID to Failed_ID
					RES_LOG.write(detail + '\n')
				except Exception as e:
					Failed_ID.append(id)
					msg = '[ - Encode_Excpt ] Character encoding error: ' + id
					print(msg)
					ERR_LOG.write(msg+'\n')
				ID_DETAILS.append(detail)
			# time.sleep(1)
			print('[ - info ] id: {}  count: {}  time: {:.2f}s'.format(id,COUNT,time.time() - start))
			COUNT += 1

	# Given a bug ID, extract the title, vendor, white hat, vulnerability type and Rank
	def get_detail(self,html,id):
		global ERR_LOG
		try:
			# print(html)
			res = self.pattern1.search(html)
			title = res.group(1).strip()
			comp = res.group(2).strip()
			author = res.group(3).strip()
			bug_type = res.group(4).strip()
		except Exception as e:
			msg = '[ - Detail_Excpt ] Could not parse the title and related fields: ' + id
			print(msg)
			ERR_LOG.write(msg+'\n')
			Failed_ID.append(id)
			title,comp,author,bug_type,rank = 'Null','Null','Null','Null','Null'
		else:
			try:
				res2 = self.pattern2.search(html)  # if the vendor has not responded yet, rank is Null
				rank = res2.group(1).strip()
			except Exception as e:
				msg = '[ - Rank_Excpt ] Could not parse the Rank: ' + id
				print(msg)
				ERR_LOG.write(msg+'\n')
				rank = 'Null'
		finally:
			try:
				print (title,comp,author,bug_type,rank)
			except Exception as e:
				msg = '[ - Print_Excpt ] Character encoding error: ' + id +'::'+ str(e)
				print(msg)
				ERR_LOG.write(msg+'\n')
			return title,comp,author,bug_type,rank

class ThreadPool(object):

	def __init__(self,thread_num,id_file):
		self.queue = Queue() # queue of IDs still to be processed
		self.threads = [] # list of worker threads
		self.add_task(id_file)
		self.init_threads(thread_num)

	def add_task(self,id_file):
		with open(id_file) as input:
			for id in input.readlines():
				self.queue.put(id.strip())

	def init_threads(self,thread_num):
		for i in range(thread_num):
			print ('[ - info :] loading threading ---> ',i)
			# time.sleep(1)
			self.threads.append(WooYunSpider(self.queue)) # the threads list holds the crawler threads

	def wait(self):
		for t in self.threads:
			if t.is_alive(): # Thread.isAlive() was removed in Python 3.9
				t.join()

def test():
	# Runs three separate patterns against one page, 500 times
	url = 'http://wooyun.org/bugs/wooyun-2016-0177647'
	r = get(url,headers = HEADERS)
	html = r.text
	# print(type(html))
	# keywords" content="(.*?),(.*?),(.*?),wooyun  ====> vendor, white hat, type
	pattern1 = re.compile(r'title>(.*?)\| WooYun')
	pattern2 = re.compile(r'keywords" content="(.*?),(.*?),(.*?),wooyun')
	pattern3 = re.compile(r'漏洞Rank：(\d{1,3})')
	for x in range(500):
		res = pattern1.search(html)
		# print (res.group(1))
		res = pattern2.search(html)
		# print (res.group(1),res.group(2),res.group(3))
		res = pattern3.search(html)
		# print (res.group(1))
		x += 1
		print(x)
	# rank = res.group(4).strip()
	# print(html)

def test2():
	# Runs one combined pattern against the same page, 500 times
	url = 'http://wooyun.org/bugs/wooyun-2016-0177647'
	r = get(url,headers = HEADERS)
	html = r.text
	pattern = re.compile(r'title>(.*?)\| WooYun.*?keywords" content="(.*?),(.*?),(.*?),wooyun.*?漏洞Rank：(\d{1,3})',re.S)
	for x in range(500):
		res = pattern.search(html)
		# print (res.group(1),res.group(2),res.group(3),res.group(4),res.group(5))
		x += 1
		print(x)

# Save the results
def save2file(filename,filename_failed_id):
	with open(filename,'w',encoding='utf-8') as output:
		for item in ID_DETAILS:
			try: # writing to the file may raise an encoding error; ignored here
				output.write(item + '\n')
			except Exception as e:
				pass
	with open(filename_failed_id,'w',encoding='utf-8') as output:
		output.write('\n'.join(Failed_ID))

if __name__ == '__main__':
	socket.setdefaulttimeout(1)
	start = time.time()
	# test()
	# Log files; UTF-8 is set explicitly so Chinese text does not depend on the platform default (e.g. gbk)
	ERR_LOG = open('err_log.txt','w',encoding='utf-8')
	RES_LOG = open('res_log.txt','w',encoding='utf-8')
	id_file = 'id_0526.txt'
	# id_file = 'id_test.txt'
	tp = ThreadPool(20,id_file)
	tp.wait()
	save2file('id_details.txt','failed_id.txt')
	ERR_LOG.close()
	RES_LOG.close()
	end = time.time()
	print ('[ - info ] cost time :{:.2f}s'.format(end - start))
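
To see what the two regular expressions in `WooYunSpider` capture without touching the network, the extraction step can be tried on a small HTML fragment. The sketch below reuses the same two patterns from the script; the `parse_detail` helper and the sample page are invented for illustration, and the sample markup is only a guess at the shape the patterns expect, not real WooYun output.

```python
import re

# Same two patterns as in WooYunSpider.__init__ in the script above.
DETAIL_RE = re.compile(
    r'title>(.*?)\| WooYun.*?keywords" content="(.*?),(.*?),(.*?),wooyun',
    re.S,
)
# The Chinese label is matched literally, including the full-width colon.
RANK_RE = re.compile(r"漏洞Rank：(\d{1,3})")

# Hypothetical page fragment, written to fit the patterns.
SAMPLE_HTML = """<html><head>
<title>Example SQL injection | WooYun-2016-000001 | WooYun.org</title>
<meta name="keywords" content="ExampleCorp,alice,SQL injection,wooyun,bug">
</head><body>
<p>漏洞Rank：15</p>
</body></html>"""


def parse_detail(html):
    """Return (title, vendor, author, bug_type, rank), using 'Null' for gaps."""
    match = DETAIL_RE.search(html)
    if match is None:
        # Mirrors the script: if the title block is missing, every field is 'Null'
        return ("Null",) * 5
    title, comp, author, bug_type = (g.strip() for g in match.groups())
    rank_match = RANK_RE.search(html)
    rank = rank_match.group(1) if rank_match else "Null"
    return title, comp, author, bug_type, rank


if __name__ == "__main__":
    print(parse_detail(SAMPLE_HTML))
```

Checking the match object for `None` instead of catching a broad `Exception` keeps the two failure cases (no title block, no Rank) explicit, while returning the same `'Null'` placeholders as the script.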
