1. Case 1

a. Create the project

scrapy startproject renren_login

Change into the project directory, then generate the spider:

scrapy genspider renren "renren.com"
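
These two commands leave the usual Scrapy layout (abridged):

renren_login/
    scrapy.cfg
    renren_login/
        settings.py
        items.py
        spiders/
            renren.py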

renren.py

# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        # Post the credentials straight to Renren's login endpoint
        url = "http://www.renren.com/PLogin.do"
        data = {"email": "xxxxxxxx@126.com", "password": "xxxxxxx"}
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        # Logged in: the session cookie is now attached to every request
        request = scrapy.Request(url='http://www.renren.com/326282648/profile',
                                 callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        # Save the profile page to verify the login worked
        with open("wenliang.html", "w", encoding="utf-8") as fp:
            fp.write(response.text)
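
Renren's PLogin.do endpoint accepts the credentials directly, so a bare FormRequest works. When a login page carries hidden fields (CSRF tokens and the like), FormRequest.from_response can prefill them from the page itself. A minimal sketch, with placeholder credentials and a start URL that is assumed to serve the login form:

import scrapy


class RenrenFormSpider(scrapy.Spider):
    # Hypothetical variant: let Scrapy read the login <form> on the page and
    # prefill its hidden fields, overriding only the visible inputs.
    name = 'renren_form'
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/']  # assumed to serve the login form

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"email": "xxxxxxxx@126.com", "password": "xxxxxxx"},
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("Logged in, landed on %s", response.url)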

Create start.py in the project directory; running it is equivalent to executing scrapy crawl renren from the shell:

from scrapy import cmdline
cmdline.execute(["scrapy","crawl","renren"])

2. Case 2

a. Entering the captcha manually

Create the project:

scrapy startproject douban_login

Change into the project directory, then generate the spider:

scrapy genspider douban "douban.com"

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban_login project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban_login'

SPIDER_MODULES = ['douban_login.spiders']
NEWSPIDER_MODULE = 'douban_login.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_login (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban_login.middlewares.DoubanLoginSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban_login.middlewares.DoubanLoginDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'douban_login.pipelines.DoubanLoginPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
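
Only two settings differ from the generated defaults: ROBOTSTXT_OBEY is set to False so the login requests are not filtered out, and DEFAULT_REQUEST_HEADERS carries a real browser User-Agent. One setting matters even though it is left untouched: the cookie middleware, which simulated login depends on, is enabled by default.

# Scrapy's default; the session cookie set by the login POST is what keeps
# every later request authenticated, so do not disable this.
COOKIES_ENABLED = True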

douban.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://www.douban.com/login']
    login_url = "https://www.douban.com/login"
    profile_url = "https://www.douban.com/people/184480369/"
    editsignature_url = "https://www.douban.com/j/people/184480369/edit_signature"

    def parse(self, response):
        # The login page is the start URL, so parse() builds the login POST
        formdata = {
            "source": "None",
            "redir": "https://www.douban.com/",
            "form_email": "xxxxxx@qq.com",
            "form_password": "xxxxxx!",
            "remember": "on",
            "login": "登录"
        }
        # If a captcha is shown, solve it and attach it to the form
        captcha_url = response.css("img#captcha_image::attr(src)").get()
        if captcha_url:
            captcha = self.recognize_captcha(captcha_url)
            formdata["captcha-solution"] = captcha
            captcha_id = response.xpath("//input[@name='captcha-id']/@value").get()
            formdata["captcha-id"] = captcha_id
        yield scrapy.FormRequest(url=self.login_url, formdata=formdata,
                                 callback=self.parse_after_login)

    def parse_after_login(self, response):
        # A successful login redirects to the Douban home page
        if response.url == "https://www.douban.com/":
            yield scrapy.Request(self.profile_url, callback=self.parse_profile)
            print("登录成功")  # login succeeded
        else:
            print("登录失败")  # login failed

    def parse_profile(self, response):
        print(response.url)
        if response.url == self.profile_url:
            print("进入到了个人中心")  # reached the profile page
            # The hidden 'ck' token must accompany form POSTs once logged in
            ck = response.xpath("//input[@name='ck']/@value").get()
            formdata = {
                "ck": ck,
                "signature": "丈夫处世兮立功名"  # new signature text
            }
            yield scrapy.FormRequest(self.editsignature_url, formdata=formdata)
        else:
            print("没有进入个人中心")  # did not reach the profile page

    def recognize_captcha(self, image_url):
        # Download the captcha, display it, and have the user type it in
        request.urlretrieve(image_url, "captcha.png")
        image = Image.open("captcha.png")
        image.show()
        captcha = input("请输入验证码:")
        return captcha
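
A small variation on recognize_captcha that skips the temporary file: fetch the image into memory and hand PIL a file-like object (a sketch; Image.open accepts any binary stream):

from io import BytesIO

import requests
from PIL import Image


def recognize_captcha(image_url):
    # Fetch the captcha into memory and show it, with no captcha.png on disk
    resp = requests.get(image_url)
    Image.open(BytesIO(resp.content)).show()
    return input("请输入验证码:")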

Create start.py in the douban_login directory:

from scrapy import cmdline

cmdline.execute("scrapy crawl douban".split())

Run start.py to launch the spider.

b. Recognizing the captcha automatically

from urllib import request
from base64 import b64encode
import requests

# Download the captcha image locally
captcha_url = "https://www.douban.com/misc/captcha?id=TCEAV2F8SbBgKbXZ5JAI2G6L:en&size=s"
request.urlretrieve(captcha_url, "captcha.png")

# Recognition service endpoint and appcode (placeholders)
recognize_url = "http://xxxxxx"
appcode = 'xxxxxxxxxxxxxxx'

formdata = {}
with open("captcha.png", "rb") as fp:
    data = fp.read()
# Base64-encode (not decode) the raw bytes before posting them to the API
pic = b64encode(data)
formdata['pic'] = pic

headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    'Authorization': 'APPCODE ' + appcode  # note the space after APPCODE
}
response = requests.post(recognize_url, data=formdata, headers=headers)
print(response.text)
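
Wired into the Case 2 spider, this replaces the manual input() in recognize_captcha. A sketch, assuming the service returns JSON with a v_code field like the Aliyun endpoint used in the next example (endpoint and appcode remain placeholders):

from base64 import b64encode

import requests


def recognize_captcha(image_url, recognize_url="http://xxxxxx",
                      appcode='xxxxxxxxxxxxxxx'):
    # Hypothetical helper: download the captcha, submit it to the
    # recognition service, and return the decoded text; the 'v_code'
    # response field is an assumption borrowed from example c below.
    image_bytes = requests.get(image_url).content
    headers = {'Authorization': 'APPCODE ' + appcode}
    resp = requests.post(recognize_url,
                         data={'pic': b64encode(image_bytes)},
                         headers=headers)
    return resp.json()['v_code']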

c. Another automatic recognition example

from selenium import webdriver
import time
import requests
from lxml import etree
import base64

# Drive the browser
driver = webdriver.Chrome()
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001'
driver.get(url)
time.sleep(1)
driver.find_element_by_id('email').send_keys('')
time.sleep(1)
driver.find_element_by_id('password').send_keys('yaoqinglin2011')
time.sleep(1)

# Pull the captcha info out of the rendered page
html_str = driver.page_source
html_ele = etree.HTML(html_str)
# The captcha image URL
image_url = html_ele.xpath('//img[@id="captcha_image"]/@src')[0]
# Download the captcha image bytes
response = requests.get(image_url)
# Base64-encode the image for the recognition API
# https://market.aliyun.com/products/57124001/cmapi028447.html?spm=5176.2020520132.101.5.2HEXEG#sku=yuncode2244700000
b64_str = base64.b64encode(response.content)
v_type = 'cn'
# POST payload for the captcha-recognition platform
form = {
    'v_pic': b64_str,
    'v_type': v_type,
}
# Authentication header
headers = {
    'Authorization': 'APPCODE eab23fa1d03f40d48b43c826c57bd284',
}
# Ask the platform to recognize the captcha
dmpt_url = 'http://yzmplus.market.alicloudapi.com/fzyzm'
response = requests.post(dmpt_url, form, headers=headers)
print(response.text)
# captcha_value is the recognized captcha text
captcha_value = response.json()['v_code']
print(image_url)
print(captcha_value)

# captcha_value = input('请输入验证码')  # manual fallback
driver.find_element_by_id('captcha_field').send_keys(captcha_value)
time.sleep(1)
driver.find_element_by_class_name('btn-submit').click()
time.sleep(1)

# Collect every cookie from the logged-in browser session
cookies = driver.get_cookies()
cookie_list = []
# For each cookie dict, take name and value and join them as "name=value"
for cookie_dict in cookies:
    cookie_str = cookie_dict['name'] + '=' + cookie_dict['value']
    cookie_list.append(cookie_str)
# Join all cookies into a single Cookie header value
header_cookie = '; '.join(cookie_list)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Cookie': header_cookie,
}
another_url = 'https://www.douban.com/accounts/'
response = requests.get(another_url, headers=headers)
with open('cc.html', 'wb') as f:
    f.write(response.content)
# with open('douban.html', 'wb') as f:
#     f.write(driver.page_source.encode('utf-8'))
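
The same cookie handoff also works back into Scrapy: export driver.get_cookies() after Selenium finishes logging in, then seed the spider's first request with it. A minimal sketch (the spider name, cookie file, and target URL are illustrative):

import json

import scrapy


class CookieHandoffSpider(scrapy.Spider):
    # Hypothetical spider reusing cookies exported from the Selenium session
    name = 'douban_cookies'

    def start_requests(self):
        # In practice, dump driver.get_cookies() to cookies.json beforehand;
        # Scrapy accepts cookies as a plain {name: value} mapping.
        with open('cookies.json') as fp:
            selenium_cookies = json.load(fp)
        cookies = {c['name']: c['value'] for c in selenium_cookies}
        yield scrapy.Request('https://www.douban.com/accounts/',
                             cookies=cookies,
                             callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s with the browser session", response.url)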
