Python personal-use code (Scrapy multi-level page (three-level page) crawler)

More articles related to "Python personal-use code (Scrapy multi-level page (three-level page) crawler)"

Scrapy: crawling the three-level pages of Daomu Biji (盗墓笔记)

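#Today's goal **Scrapy: crawling the three-level pages of Daomu Biji** Today's target is the novel Daomu Biji (盗墓笔记). Analysis shows that the novel's main content sits on the third-level pages, so each level has to be parsed in turn. *Implementation* daomu.py ``` import scrapy from ..items import DaomuItem class DaomuSpider(scrapy.Spider): name = 'daomu' allowed_domains = ['daomubiji.com'] start_urls = ['http://www.daom…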

Python personal-use code (Scrapy multi-level page (three-level page) crawler)

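2017-03-28 My first small task after joining the company was a Scrapy multi-level page crawler. I had never written a crawler, had never learned Scrapy, and had not even used XPath, so it took nearly a week to finish. There are surely plenty of clumsy parts, and suggestions are welcome. Spider file: # -*- coding: utf-8 -*- import scrapy from nosta.items import NostaItem import time import hashlib class NostaSpider(scrapy.Spider):…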

Python personal-use code (cleaning the page source of CNKI conference papers)

#coding=utf-8 from pymongo import MongoClient from lxml import etree import requests jigou = u"\r\n [机构]\r\n " zuozhe = u"\r\n [作者]\r\n " # Get the database def get_db(): client = MongoClient('localhost', 27017) db = client.cnki db.authenticate(&…

Python personal-use code (cleaning the page source of standards pages from a "某方" site)

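Used to clean "standard" records stored in MongoDB. The data is raw page source, from which the following fields must be extracted: standard title, foreign-language title, standard number, issuing body, issue date, status, implementation date, format and page count, adoption relationship, Chinese Library Classification number, Chinese Classification for Standards number, International Classification for Standards number, country, keywords, abstract, and replacement standard. The extracted fields are assembled into a dict and stored in another collection. #coding=utf-8 from pymongo import MongoClient from lxml import etree import requests s = [u'标准编号:',u'发布单位:',u'发布日期:'…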

Python personal-use code (recursively cleaning adoption status)

Store strings of the form ‘ISO 3408-1-2006,MOD ISO 3408-2-1991,MOD ISO 3408-3-2006,MOD’ as a list in this format: [{'code': 'ISO 3408-1-2006', 'type': 'MOD'}, {'code': 'ISO 3408-2-1991', 'type': 'MOD'}, {'code': 'ISO 3408-3-2006', 'type': 'MOD'}] #coding=utf-8 s = 'ISO 3408-1-2006,MOD I…
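The excerpt above is cut off before the actual code, so the following is only a minimal sketch of how such a recursive split might look, not the author's implementation. The function name `parse_adoptions` is hypothetical, and the sketch assumes that a code never contains a comma and that a type (such as `MOD`) never contains a space.

```python
def parse_adoptions(s):
    """Recursively split 'CODE,TYPE CODE,TYPE ...' into a list of dicts.

    Assumptions (not confirmed by the source): codes contain no commas
    and types contain no spaces.
    """
    s = s.strip()
    if not s:
        # Base case: nothing left to parse.
        return []
    # Everything before the first comma is the standard code.
    code, sep, rest = s.partition(',')
    if not sep:
        # No comma left: a trailing code without an adoption type.
        return [{'code': code.strip(), 'type': ''}]
    # The type runs up to the next space; the remainder is parsed recursively.
    type_, _, remainder = rest.lstrip().partition(' ')
    return [{'code': code.strip(), 'type': type_}] + parse_adoptions(remainder)


if __name__ == '__main__':
    sample = 'ISO 3408-1-2006,MOD ISO 3408-2-1991,MOD ISO 3408-3-2006,MOD'
    for entry in parse_adoptions(sample):
        print(entry)
```

Recursion mirrors the title of the post, but for long inputs an iterative loop or a regular expression would avoid Python's recursion limit.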

Python personal-use code (adjusting date formats)

2017年6月28日 to 2017-06-28; 2017年10月27日 to 2017-10-27; 2017年12月1日 to 2017-12-01; 2017年7月1日 to 2017-07-01 #coding=utf-8 def func(string): year = string.find(u'年') month = string.find(u'月') day = string.find(u'日') if month-year==2: string = string.replace(u"年&…
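The original function is truncated, and it appears to work from the positions of 年, 月 and 日 in the string. As an alternative sketch (a swapped-in technique, not the author's code), the same conversion can be done with a regular expression and zero-padding. The name `normalize_date` is hypothetical, and strings that do not match the pattern are returned unchanged, which is an assumption about the desired behavior.

```python
import re

# Matches dates such as 2017年6月28日 (year, month and day with CJK suffixes).
_DATE_RE = re.compile(r'(\d{4})年(\d{1,2})月(\d{1,2})日')


def normalize_date(text):
    """Convert 'YYYY年M月D日' to 'YYYY-MM-DD'; return other input unchanged."""
    match = _DATE_RE.fullmatch(text.strip())
    if match is None:
        return text
    year, month, day = match.groups()
    # Zero-pad month and day to two digits.
    return '{}-{:02d}-{:02d}'.format(year, int(month), int(day))


if __name__ == '__main__':
    for raw in ['2017年6月28日', '2017年10月27日', '2017年12月1日', '2017年7月1日']:
        print(raw, 'to', normalize_date(raw))
```

Padding through `{:02d}` avoids the separate branches that a position-based approach needs for one-digit and two-digit months and days.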