Python: fetching proxy IPs

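A Python crawler typically goes through three stages: it crawls, it gets blocked, and it works around the block. After that the site tightens its anti-crawler measures and the crawler works around them again, an escalating back-and-forth in which each side keeps trying to outdo the other.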

In the early stages of a crawler, adding request headers and a proxy IP solves many problems.
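As a minimal sketch of what "adding headers and a proxy" can look like with requests: the proxy address 192.0.2.10:8080, the short User-Agent string and the helper names build_request_kwargs and fetch are placeholders made up for illustration, not values from this post.

```python
def build_request_kwargs(proxy, user_agent):
    """Build keyword arguments for requests.get from an 'ip:port' proxy and a User-Agent."""
    proxy_url = "http://" + proxy
    return {
        "headers": {"User-Agent": user_agent},
        # Route both http:// and https:// URLs through the same proxy.
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": 3,
    }


def fetch(url, proxy, user_agent):
    """Send a GET request through the proxy (hypothetical helper)."""
    import requests  # third-party; imported lazily so the helper above runs without it
    return requests.get(url, **build_request_kwargs(proxy, user_agent))


kwargs = build_request_kwargs("192.0.2.10:8080", "Mozilla/5.0")
print(kwargs["proxies"])
```

The same keyword arguments can be reused for every request, so rotating proxies only means calling the helper with a different address.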

Here is the code. First, the approach:

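1. Scrape proxy IP addresses from http://www.xicidaili.com/nn/. The site lists many addresses, but there is no guarantee that any of them work. Save them to a list first.

2. Verify the proxy IPs in multiple threads, then write the working ones to the corresponding txt file.

3. When a proxy IP is needed, import the module and run the main() function to obtain working proxy IPs for the subsequent steps.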
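Step 2 can be sketched roughly as follows. This is an illustration only: it swaps in concurrent.futures.ThreadPoolExecutor instead of the raw threading.Thread objects used in the listing below, and the names filter_working, save_proxies and fake_check, as well as the 192.0.2.x addresses, are assumptions made for the example. The check function is passed in, so a real validator can replace the stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=20):
    """Run check(proxy) in parallel; return the proxies for which it returned True."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))  # map keeps the input order
    return [p for p, ok in zip(proxies, results) if ok]


def save_proxies(proxies, path):
    """Write one 'ip:port' entry per line, replacing any previous file content."""
    with open(path, "w") as f:
        for p in proxies:
            f.write(p + "\n")


def fake_check(proxy):
    """Stand-in validator for the demo: pretends only port 80 proxies work."""
    return proxy.endswith(":80")


candidates = ["192.0.2.1:80", "192.0.2.2:8080", "192.0.2.3:80"]
working = filter_working(candidates, fake_check)
print(working)  # ['192.0.2.1:80', '192.0.2.3:80']
```

Collecting the results first and writing the file once from the main thread avoids having several threads append to the same file at the same time.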

Two methods are used to validate the IPs: telnetlib and requests. It is better to validate with requests directly against whichever page you intend to crawl.
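One reason to prefer the requests check: a telnetlib-style check only confirms that the port accepts a TCP connection, not that the proxy can actually fetch the target page. For comparison, here is a minimal sketch of that port check using the standard socket module, assuming Python 3; port_is_open is a hypothetical helper name, not part of the original code.

```python
import socket


def port_is_open(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and unreachable hosts
        return False
```

A proxy that passes this check may still be rejected by the site being crawled, so it is best treated as a cheap first filter before the requests-based check.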

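# coding: utf-8

import random
import socket
import threading
import time

import requests
from bs4 import BeautifulSoup

# telnetlib is needed only for the commented-out telnetlib variant further down.
# It was removed from the standard library in Python 3.13, so the import is left disabled.
# import telnetlib

# Set a global timeout of 3 s: if a request gets no response within 3 s, it is aborted and a timeout is raised.
socket.setdefaulttimeout(3)

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
}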

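def get_ip():
    # Fetch proxy IPs; returns two lists: (HTTP proxies, HTTPS proxies)
    httpResult = []
    httpsResult = []
    try:
        for page in range(1, 2):  # only page 1; widen the range to scrape more pages
            IPurl = 'http://www.xicidaili.com/nn/%s' % page
            rIP = requests.get(IPurl, headers=headers, timeout=3)
            IPContent = rIP.text
            print(IPContent)
            soupIP = BeautifulSoup(IPContent, 'lxml')
            trs = soupIP.find_all('tr')
            for tr in trs[1:]:  # skip the first row
                tds = tr.find_all('td')
                ip = tds[1].text.strip()
                port = tds[2].text.strip()
                protocol = tds[5].text.strip()
                if protocol == 'HTTP':
                    httpResult.append('http://' + ip + ':' + port)
                elif protocol == 'HTTPS':
                    httpsResult.append('https://' + ip + ':' + port)
    except Exception as e:
        # Report the failure instead of silently swallowing it; whatever was collected so far is still returned
        print('get_ip failed: %s' % e)
    return httpResult, httpsResult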

r'''
# Alternative: check that the proxy is reachable with the telnetlib module (HTTP list).
# Requires "import telnetlib".
def cip(x, y):
    try:
        telnetlib.Telnet(x, port=y, timeout=5)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        with open(r"E:\ip_http.txt", "a") as f:
            f.write(x + ':' + y + '\n')

# Alternative: check that the proxy is reachable with the telnetlib module (HTTPS list).
def csip(x, y):
    try:
        telnetlib.Telnet(x, port=y, timeout=5)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        with open(r"E:\ip_https.txt", "a") as f:
            f.write(x + ':' + y + '\n')
'''

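# Check that a proxy works using the requests module (HTTP). Use the page you intend to crawl as the test URL.
def cip(x, y):
    proxy = x + ':' + y
    try:
        print(proxy)
        requests.get('http://ip.chinaz.com/getip.aspx', proxies={'http': 'http://' + proxy}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append only. Truncating here would risk wiping results written by other threads.
        with open(r"E:\ip_http.txt", "a") as f:
            f.write(proxy + '\n')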

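# Check that a proxy works using the requests module (HTTPS). Use the page you intend to crawl as the test URL.
def csip(x, y):
    proxy = x + ':' + y
    try:
        print(proxy)
        # The 'https' key selects the proxy used for https:// URLs.
        requests.get('https://www.lagou.com/', proxies={'https': 'http://' + proxy}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append only. Truncating here would risk wiping results written by other threads.
        with open(r"E:\ip_https.txt", "a") as f:
            f.write(proxy + '\n')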

def main():
    httpResult, httpsResult = get_ip()

    # Start each run with an empty HTTP result file
    open(r"E:\ip_http.txt", "w").close()
    threads = []
    for i in httpResult:
        # i looks like 'http://ip:port'; split it into ip and port
        a = str(i.split(":")[-2][2:].strip())
        b = str(i.split(":")[-1].strip())
        t = threading.Thread(target=cip, args=(a, b))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Same again for the HTTPS proxies
    open(r"E:\ip_https.txt", "w").close()
    threads1 = []
    for i in httpsResult:
        a = str(i.split(":")[-2][2:].strip())
        b = str(i.split(":")[-1].strip())
        t = threading.Thread(target=csip, args=(a, b))
        threads1.append(t)
    for t in threads1:
        t.start()
    for t in threads1:
        t.join()


if __name__ == '__main__':
    main()
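Once main() has run, the txt files contain one ip:port entry per line. A possible way to consume them from a crawler is sketched below, assuming Python 3; load_proxies and pick_proxy are hypothetical helper names, not part of the script above.

```python
import random


def load_proxies(path):
    """Read one 'ip:port' entry per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def pick_proxy(proxy_list, scheme="http"):
    """Pick a random entry and return it as a proxies dict for requests."""
    proxy = random.choice(proxy_list)
    return {scheme: "http://" + proxy}


# Typical use inside a crawler (needs the file written by main() and the requests package):
# proxies = pick_proxy(load_proxies(r"E:\ip_http.txt"))
# requests.get(url, headers=headers, proxies=proxies, timeout=3)
```

Free proxies tend to stop working quickly, so it may be worth re-running main() regularly and retrying a request with another random entry when one fails.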
