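For the past few days I have been using Scrapy to write crawlers that collect data from websites, so here I pick one of the crawlers I wrote and document it.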

Hangzhou Construction Cost website (杭州造价网): http://183.129.219.195:8081/bs/hzzjb/web/list

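The main problems I ran into:

1.
My code has a memory problem. The program runs for a long time and its memory use keeps growing (memory is not released; this still needs improvement), until the crawler appears to hang and wait.

2.
yield scrapy.FormRequest(url='http://183.129.219.195:8081/bs/hzzjb/web/list', callback=self.parse, formdata=data, method="POST", dont_filter=True)

dont_filter defaults to False. With the default, the requests were filtered out and the next page was never fetched, so I set dont_filter=True.

3.
The last page is still unsolved. I hard-coded a page count of 5032, and the pagination logic is not finished yet.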
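A possible second cause for the filtering in item 2, stated as an assumption rather than a confirmed diagnosis: Scrapy's offsite filter compares only the host name of each request against allowed_domains, and a request with dont_filter=True skips that check too. An allowed_domains entry that carries a port and a path, such as '183.129.219.195:8081/bs', would then never match the host. The sketch below is a simplified model of that comparison, not Scrapy's actual code; host_regex and is_onsite are hypothetical names.

```python
import re
from urllib.parse import urlparse


def host_regex(allowed_domains):
    """Build a host-matching pattern in the style of an offsite filter.

    Simplified model for illustration only; not Scrapy's implementation.
    """
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_onsite(url, allowed_domains):
    """Return True when the URL's host name matches an allowed domain."""
    host = urlparse(url).hostname or ""
    return bool(host_regex(allowed_domains).search(host))


URL = "http://183.129.219.195:8081/bs/hzzjb/web/list"

# An entry with a port and a path never equals the bare host name
print(is_onsite(URL, ["183.129.219.195:8081/bs"]))  # False
# A bare host name matches
print(is_onsite(URL, ["183.129.219.195"]))  # True
```

If this is what happened, listing only the host in allowed_domains should let the paging requests through without dont_filter=True.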

Main information on the site:


hzzjb.py

# -*- coding: utf-8 -*-
import scrapy
from hzzjb_web.items import HzzjbWebItem


class HzzjbSpider(scrapy.Spider):
    name = 'hzzjb'
    # allowed_domains takes bare host names only: no port, no path
    allowed_domains = ['183.129.219.195']
    start_urls = ['http://183.129.219.195:8081/bs/hzzjb/web/list']
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "ITEM_PIPELINES": {
            'hzzjb_web.pipelines.MysqlPipeline': 320,
        },
        "DOWNLOADER_MIDDLEWARES": {
            'hzzjb_web.middlewares.HzzjbWebDownloaderMiddleware': 500
        },
    }
    # Each table row has 9 <td> cells; the first 8 map to these item fields:
    # district, category, material name, specification and model, unit,
    # price including tax, price excluding tax, year/month
    cells_per_row = 9
    fields = ['district', 'category', 'material_name', 'version', 'unit',
              'tax_information_price', 'except_tax_information_price', 'y_m']
    # Page count I observed; adjust it if the site grows
    max_page = 5032

    def parse(self, response):
        # The first (GET) response carries no page number, so it is page 1
        page = response.meta.get('page', 1)
        # Flat list of every <td> in the data table
        tag_list = response.xpath("//table[@class='table1']//tr/td").extract()
        if not tag_list:
            self.logger.warning('No table cells on page %s, stopping', page)
            return
        # Group the flat cell list into complete rows of 9 cells
        row_count = len(tag_list) // self.cells_per_row
        for index in range(row_count):
            start = index * self.cells_per_row
            tag = tag_list[start:start + self.cells_per_row]
            item = HzzjbWebItem()
            for field, cell in zip(self.fields, tag):
                item[field] = cell.replace('<td>', '').replace('</td>', '')
            yield item
        # Pagination: request only the next page, and only while the current
        # page had rows. Scheduling every page from every response would make
        # the request queue (and memory use) grow without bound.
        if page < self.max_page:
            data = {
                'mtype': '',
                '_query.nfStart': '',
                '_query.yfStart': '',
                '_query.nfEnd': '',
                '_query.yfEnd': '',
                '_query.dqstr': '',
                '_query.dq': '',
                '_query.lbtype': '',
                '_query.clmc': '',
                '_query.ggjxh': '',
                'pageNumber': str(page + 1),
                'pageSize': '',
                'orderColunm': '',
                'orderMode': '',
            }
            yield scrapy.FormRequest(url='http://183.129.219.195:8081/bs/hzzjb/web/list', callback=self.parse, formdata=data, method="POST", dont_filter=True, meta={'page': page + 1})
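One way to keep both the request queue and the last-page problem under control is to ask for page n+1 only after page n has returned rows. The sketch below shows that idea in plain Python, without Scrapy. build_form_data, crawl_pages, and the fetch_rows callable are hypothetical names, and it assumes a page past the end comes back with no rows, which I have not verified for this site.

```python
FORM_KEYS = (
    "mtype", "_query.nfStart", "_query.yfStart", "_query.nfEnd",
    "_query.yfEnd", "_query.dqstr", "_query.dq", "_query.lbtype",
    "_query.clmc", "_query.ggjxh", "pageNumber", "pageSize",
    "orderColunm", "orderMode",
)


def build_form_data(page):
    """Form fields for one page; only pageNumber differs between requests."""
    data = {key: "" for key in FORM_KEYS}
    data["pageNumber"] = str(page)
    return data


def crawl_pages(fetch_rows, first_page=1, max_page=None):
    """Yield (page, rows) until a page is empty or max_page has been read.

    fetch_rows is a stand-in for the HTTP request: it takes the form data
    and returns the list of rows found on that page.
    """
    page = first_page
    while max_page is None or page <= max_page:
        rows = fetch_rows(build_form_data(page))
        if not rows:
            break
        yield page, rows
        page += 1


# Usage with a fake three-page site instead of real HTTP requests
fake_site = {"1": ["a", "b"], "2": ["c"], "3": ["d"]}
pages = list(crawl_pages(lambda form: fake_site.get(form["pageNumber"], [])))
print([page for page, _ in pages])  # [1, 2, 3]
```

Only one paging request is pending at any time, so the queue cannot grow, and the crawl ends on its own without a hard-coded page count.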
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HzzjbWebItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    district = scrapy.Field()
    category = scrapy.Field()
    material_name = scrapy.Field()
    version = scrapy.Field()
    unit = scrapy.Field()
    tax_information_price = scrapy.Field()
    except_tax_information_price = scrapy.Field()
    y_m = scrapy.Field()
middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class HzzjbWebSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class HzzjbWebDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class HzzjbWebPipeline(object):
    def process_item(self, item, spider):
        return item


# Save the data to MySQL
class MysqlPipeline(object):

    def open_spider(self, spider):
        # Read the project settings through the spider instead of the
        # deprecated scrapy.conf module
        settings = spider.settings
        self.host = settings.get('MYSQL_HOST')
        self.port = settings.get('MYSQL_PORT')
        self.user = settings.get('MYSQL_USER')
        self.password = settings.get('MYSQL_PASSWORD')
        self.db = settings.get('MYSQL_DB')
        self.table = settings.get('TABLE')
        self.client = pymysql.connect(host=self.host, user=self.user, password=self.password, port=self.port, db=self.db, charset='utf8')

    def process_item(self, item, spider):
        item_dict = dict(item)
        values = ','.join(['%s'] * len(item_dict))
        keys = ','.join(item_dict.keys())
        sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=self.table, keys=keys, values=values)
        cursor = self.client.cursor()
        try:
            # First argument is the SQL statement, second is the tuple of values
            if cursor.execute(sql, tuple(item_dict.values())):
                print('Row inserted successfully!')
                self.client.commit()
        except Exception as e:
            print(e)
            print('Insert failed; the row may already exist!')
            self.client.rollback()
        finally:
            cursor.close()
        return item

    def close_spider(self, spider):
        self.client.close()
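The pipeline builds its INSERT statement from the item's keys, so the values go through %s placeholders but the table and column names are pasted into the SQL text. A minimal sketch of that step as a standalone function follows, with a name check added because identifiers cannot be passed as parameters. build_insert and the sample values are made up for illustration and are not part of the project.

```python
import re

# Plain identifier: letters, digits and underscores, not starting with a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_insert(table, item):
    """Return (sql, params) for a parameterized INSERT with %s placeholders."""
    for name in [table] + list(item):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError("unsafe SQL identifier: %r" % name)
    keys = ",".join(item)
    placeholders = ",".join(["%s"] * len(item))
    sql = "INSERT INTO {}({}) VALUES ({})".format(table, keys, placeholders)
    return sql, tuple(item.values())


# Made-up sample values, only to show the shape of the result
sql, params = build_insert("web_hzzjb", {"district": "example", "unit": "t"})
print(sql)     # INSERT INTO web_hzzjb(district,unit) VALUES (%s,%s)
print(params)  # ('example', 't')
```

The returned pair can be handed to cursor.execute(sql, params), which is the same call shape the pipeline uses.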
settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for hzzjb_web project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'hzzjb_web'

SPIDER_MODULES = ['hzzjb_web.spiders']
NEWSPIDER_MODULE = 'hzzjb_web.spiders'

# MySQL connection parameters
MYSQL_HOST = "172.16.10.197"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = ""
MYSQL_DB = 'web_datas'
TABLE = "web_hzzjb"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'hzzjb_web (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'hzzjb_web.middlewares.HzzjbWebSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'hzzjb_web.middlewares.HzzjbWebDownloaderMiddleware': 500,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'hzzjb_web.pipelines.HzzjbWebPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run the spider with persistence enabled: scrapy crawl hzzjb -s JOBDIR=myspider

This generates the following files:


The data stored in the database looks like this:

