python - Scraping free proxy IPs and verifying that they work

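Foreword

I have been revisiting the Python basics lately, regular expressions in particular. Regex is powerful but a little dry to study on its own, so I decided to put it to work on something fun.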

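00xx01---Proxy IPs

There are plenty of free proxy IPs out there, but saving them one at a time is tedious to the point of being impractical, so let's scrape them with Python instead.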

00xx02---Extracting IPs with regex

 import re

 import requests

 # Send a browser User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
 url = "https://www.xicidaili.com/nn/1"
 response = requests.get(url, headers=headers, timeout=10)
 html = response.text
 # print(html)

 # Raw strings keep the regex backslashes intact. The re.S flag is dropped:
 # it only affects an unescaped ".", and these patterns do not contain one.
 ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
 ports = re.findall(r"<td>(\d+)</td>", html)
 print(ips)
 print(ports)
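To see what the two patterns capture without touching the network, here is a minimal sketch that runs them on a small HTML fragment. The fragment and the `extract` helper are made up for illustration (the addresses come from the reserved documentation ranges); the real page markup may differ.

```python
import re

# Hypothetical sample rows; the markup of the real page may differ.
SAMPLE_HTML = """
<tr>
  <td class="country"><img src="cn.png" alt="Cn" /></td>
  <td>192.0.2.1</td>
  <td>8080</td>
  <td>HTTP</td>
</tr>
<tr>
  <td class="country"><img src="cn.png" alt="Cn" /></td>
  <td>198.51.100.7</td>
  <td>3128</td>
  <td>HTTPS</td>
</tr>
"""

IP_PATTERN = re.compile(r"<td>(\d+\.\d+\.\d+\.\d+)</td>")
PORT_PATTERN = re.compile(r"<td>(\d+)</td>")


def extract(html):
    """Return (ips, ports): every dotted-quad cell and every all-digit cell."""
    return IP_PATTERN.findall(html), PORT_PATTERN.findall(html)


ips, ports = extract(SAMPLE_HTML)
print(ips)    # the two IP strings
print(ports)  # the two port strings
```

The port pattern does not match the IP cells because `\d+` has to be followed directly by `</td>`, and in an IP cell the first run of digits is followed by a dot.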

00xx03---Pairing IPs with ports

 import re

 import requests

 # Send a browser User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
 url = "https://www.xicidaili.com/nn/1"
 response = requests.get(url, headers=headers, timeout=10)
 html = response.text
 # print(html)

 ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
 ports = re.findall(r"<td>(\d+)</td>", html)
 # print(ips)
 # print(ports)

 for ip in zip(ips, ports):  # pair each IP with its port as an (ip, port) tuple
     print(ip)
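One thing to watch with `zip`: it stops at the shorter list, so if the two `findall` calls ever return different numbers of matches, the pairs can drift out of step without any error. A possible alternative, sketched below, captures the IP and the port with a single pattern. It assumes the port cell comes right after the IP cell; the sample fragment and the names `ROW_PATTERN` and `extract_pairs` are hypothetical.

```python
import re

# Hypothetical markup: assumes the port cell directly follows the IP cell.
SAMPLE_HTML = """
<tr><td>192.0.2.1</td>
<td>8080</td><td>HTTP</td></tr>
<tr><td>198.51.100.7</td>
<td>3128</td><td>HTTPS</td></tr>
"""

# Two capture groups make findall() return a list of (ip, port) tuples.
ROW_PATTERN = re.compile(r"<td>(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)</td>")


def extract_pairs(html):
    """Return a list of (ip, port) tuples, each taken from adjacent cells."""
    return ROW_PATTERN.findall(html)


pairs = extract_pairs(SAMPLE_HTML)
addresses = [f"{ip}:{port}" for ip, port in pairs]
print(addresses)

# For comparison: zip() silently drops the IP that has no matching port.
truncated = list(zip(["192.0.2.1", "198.51.100.7"], ["8080"]))
print(truncated)
```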

00xx03---Verifying that an IP works

The idea: send a request to some website through each IP and port. Baidu will do.

 import re

 import requests

 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

 for i in range(1, 1000):
     # Listing page number i
     url = "https://www.xicidaili.com/nn/{}".format(i)
     response = requests.get(url, headers=headers, timeout=10)
     html = response.text

     ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
     ports = re.findall(r"<td>(\d+)</td>", html)
     # print(ips)
     # print(ports)

     for ip in zip(ips, ports):  # pair each IP with its port
         proxies = {
             "http": "http://" + ip[0] + ":" + ip[1],
             "https": "http://" + ip[0] + ":" + ip[1],
         }
         try:
             # Give up if the proxy has not answered within 3 seconds
             res = requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
             print(ip, "works")
             with open("ip.text", mode="a") as f:
                 f.write(":".join(ip))  # append "ip:port" to ip.text
                 f.write("\n")
         except requests.RequestException:  # connection errors, timeouts, etc.
             print(ip, "does not work")
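The check itself can be pulled out into a small function, which makes it easier to reuse and to try out without a live proxy. The sketch below is one way to do that, not the only one: `build_proxies` and `is_usable` are hypothetical names, and the HTTP function is passed in as the `get` argument (in real use that would be `requests.get`) so the block itself has no third-party import. It also treats only a 200 status as success, which is stricter than the loop above.

```python
def build_proxies(ip, port):
    """Build the proxies mapping that requests expects for one proxy."""
    address = "http://{}:{}".format(ip, port)
    return {"http": address, "https": address}


def is_usable(ip, port, get, test_url="http://www.baidu.com", timeout=3):
    """Return True if test_url answers with status 200 through the proxy.

    `get` is any callable with the same keyword arguments as requests.get.
    requests' own exceptions derive from OSError, so catching OSError
    covers them as well as plain socket errors.
    """
    try:
        response = get(test_url, proxies=build_proxies(ip, port), timeout=timeout)
    except OSError:
        return False
    return response.status_code == 200
```

With `requests` installed, a call would look like `is_usable("192.0.2.1", "8080", requests.get)`.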

00xx04---Writing to a text file

 import re

 import requests

 # Send a browser User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
 url = "https://www.xicidaili.com/nn/1"
 response = requests.get(url, headers=headers, timeout=10)
 html = response.text
 # print(html)

 ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
 ports = re.findall(r"<td>(\d+)</td>", html)
 # print(ips)
 # print(ports)

 for ip in zip(ips, ports):  # pair each IP with its port
     print(ip)
     proxies = {
         "http": "http://" + ip[0] + ":" + ip[1],
         "https": "http://" + ip[0] + ":" + ip[1],
     }
     try:
         # Give up if the proxy has not answered within 3 seconds
         res = requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
         print(ip, "works")
         with open("ip.text", mode="a") as f:
             f.write(":".join(ip))  # append "ip:port" to ip.text
             f.write("\n")
     except requests.RequestException:  # connection errors, timeouts, etc.
         print(ip, "does not work")
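Because the file is opened in append mode, running the script twice writes the same proxy twice. If that matters, a small helper can skip addresses that are already in the file. This is only a sketch: `save_address` is a hypothetical name, and re-reading the whole file on every call is fine for a short list but would be wasteful for a very large one.

```python
from pathlib import Path


def save_address(address, path="ip.text"):
    """Append address to the file unless it is already listed.

    Returns True if the address was written, False if it was a duplicate.
    """
    file = Path(path)
    known = set(file.read_text(encoding="utf-8").split()) if file.exists() else set()
    if address in known:
        return False
    with file.open("a", encoding="utf-8") as f:
        f.write(address + "\n")
    return True
```

Inside the loop, the two `f.write` calls would then become a single `save_address(":".join(ip))`.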

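One page yields only a handful of working proxies, and there are more than 3,000 pages, so going through them by hand is out of the question.

00xx05---Batch crawling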

 #!/usr/bin/env python3
 # coding:utf-8
 # 2019/11/18 22:38
 # lanxing

 import re

 import requests

 # Send a browser User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

 for i in range(1, 3000):  # listing pages 1 through 2999
     url = "https://www.xicidaili.com/nn/{}".format(i)
     response = requests.get(url, headers=headers, timeout=10)
     html = response.text
     # print(html)

     ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
     ports = re.findall(r"<td>(\d+)</td>", html)
     # print(ips)
     # print(ports)

     for ip in zip(ips, ports):  # pair each IP with its port
         print(ip)
         proxies = {
             "http": "http://" + ip[0] + ":" + ip[1],
             "https": "http://" + ip[0] + ":" + ip[1],
         }
         try:
             # Give up if the proxy has not answered within 3 seconds
             res = requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
             print(ip, "works")
             with open("ip.text", mode="a") as f:
                 f.write(":".join(ip))  # append "ip:port" to ip.text
                 f.write("\n")
         except requests.RequestException:  # connection errors, timeouts, etc.
             print(ip, "does not work")
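The page loop can also be written as a generator that stops as soon as a page yields no proxies and pauses briefly between pages. The sketch below is an assumption-laden variant, not the script above: `crawl_pages`, `fetch_page`, `extract`, and the one-second `delay` are all hypothetical, and the fetch and extract steps are passed in as functions so the loop logic can be run on its own.

```python
import time


def crawl_pages(fetch_page, extract, first=1, last=3, delay=1.0):
    """Yield (ip, port) pairs from pages first..last.

    fetch_page(n) returns the HTML of page n; extract(html) returns a list
    of (ip, port) tuples. Crawling stops early at the first page that
    produces no pairs, and sleeps `delay` seconds after each page.
    """
    for page in range(first, last + 1):
        pairs = extract(fetch_page(page))
        if not pairs:
            break
        yield from pairs
        time.sleep(delay)
```

In real use, `fetch_page` would wrap the `requests.get(url, headers=headers)` call and `extract` would wrap the two `re.findall` calls plus `zip`.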

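00xx06---Wrapping up

Crawling this way feels too slow, which is no surprise given that it is single-threaded. If you need more speed, multithreading is worth a try.

I'll come back and flesh out the code later.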
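As a starting point for the multithreaded version, here is a minimal sketch built on the standard library's `concurrent.futures.ThreadPoolExecutor`. It is an illustration rather than finished code: `filter_usable` is a hypothetical name, the `check` function is supplied by the caller (for example a wrapper around the `requests.get` test against Baidu), and the worker count of 20 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_usable(addresses, check, max_workers=20):
    """Run check(address) in a thread pool and keep the addresses that pass.

    `addresses` must be a list (it is iterated twice). pool.map returns
    results in input order, so the output keeps the original ordering.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, addresses))
    return [address for address, ok in zip(addresses, results) if ok]
```

Threads suit this job because each check spends nearly all of its time waiting on the network, so many checks can be in flight at once.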
