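Scraping fund data with Python

This script scrapes funded-project data from the Science Fund Shared Service Network (科学基金共享服务网, npd.nsfc.gov.cn) and stores it in MongoDB.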

# coding=utf-8
# Note: this script targets Python 2 (it imports the Python 2 HTMLParser module).
import requests
from lxml import etree
from HTMLParser import HTMLParser
from pymongo import MongoClient

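# Form fields of the search request. All search filters are left empty;
# only 'currentPage' changes from one request to the next.
data = {
    'pageSize': 10,
    'currentPage': 1,
    'fundingProject.projectNo': '',
    'fundingProject.name': '',
    'fundingProject.person': '',
    'fundingProject.org': '',
    'fundingProject.applyCode': '',
    'fundingProject.grantCode': '',
    'fundingProject.subGrantCode': '',
    'fundingProject.helpGrantCode': '',
    'fundingProject.keyword': '',
    'fundingProject.statYear': '',
    # Percent-encoded placeholder text "请输入验证码" ("please enter the verification code").
    'checkCode': '%E8%AF%B7%E8%BE%93%E5%85%A5%E9%AA%8C%E8%AF%81%E7%A0%81',
}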

url = 'http://npd.nsfc.gov.cn/fundingProjectSearchAction!search.action'

# Request headers copied from a browser session.
# The empty 'Content-Length' entry is dropped: requests computes it from the body.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    # Session cookie from the original browser session; replace it with a current one.
    'Cookie': 'JSESSIONID=8BD27CE37366ED8022B42BFC68FF82D4',
    'Host': 'npd.nsfc.gov.cn',
    'Origin': 'http://npd.nsfc.gov.cn',
    'Referer': 'http://npd.nsfc.gov.cn/fundingProjectSearchAction!search.action',
    'Upgrade-Insecure-Requests': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}

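def main():
    client = MongoClient('localhost', 27017)
    db = client.ScienceFund
    # Fill in the MongoDB user name and password (left blank in the original).
    # Database.authenticate() was removed in PyMongo 4; with that version,
    # pass the credentials to MongoClient instead.
    db.authenticate("", "")
    collection = db.science_fund
    # Walk through result pages 1..43183, 10 records per page.
    for i in range(1, 43184):
        print(i)
        data['currentPage'] = i
        result = requests.post(url, data=data, headers=headers)
        html = result.text
        tree = etree.HTML(html)
        # Each search result is selected as a <dl class="time_dl"> element.
        table = tree.xpath("//dl[@class='time_dl']")
        for item in table:
            # Serialize the element, then decode HTML entities back into text.
            content = etree.tostring(item, method='html')
            content = HTMLParser().unescape(content)
            # print content
            bson = jiexi(content)
            # insert_one() replaces the deprecated Collection.insert().
            collection.insert_one(bson)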

def jiexi(content):
    """Parse one <dl class="time_dl"> fragment into a dict (jiexi = "parse").

    Every field is located by its Chinese label and read up to the next </dd>.
    The offset added to each start index skips the label plus the colon after it.
    """
    # Title
    title1 = content.find('">', 20)
    title2 = content.find('</')
    title = content[title1+2:title2]
    # Approval number (批准号)
    standard_no1 = content.find(u'批准号', title2)
    standard_no2 = content.find('</dd>', standard_no1)
    standard_no = content[standard_no1+4:standard_no2].strip()
    # Project type (项目类别)
    standard_type1 = content.find(u'项目类别', standard_no2)
    standard_type2 = content.find('</dd>', standard_type1)
    standard_type = content[standard_type1+5:standard_type2].strip()
    # Supporting institution (依托单位)
    supporting_institution1 = content.find(u'依托单位', standard_type2)
    supporting_institution2 = content.find('</dd>', supporting_institution1)
    supporting_institution = content[supporting_institution1+5:supporting_institution2].strip()
    # Project principal (项目负责人)
    project_principal1 = content.find(u'项目负责人', supporting_institution2)
    project_principal2 = content.find('</dd>', project_principal1)
    project_principal = content[project_principal1+6:project_principal2].strip()
    # Funding amount (资助经费)
    funds1 = content.find(u'资助经费', project_principal2)
    funds2 = content.find('</dd>', funds1)
    funds = content[funds1+5:funds2].strip()
    # Approval year (批准年度)
    year1 = content.find(u'批准年度', funds2)
    year2 = content.find('</dd>', year1)
    year = content[year1+5:year2].strip()
    # Keywords (关键词)
    keywords1 = content.find(u'关键词', year2)
    keywords2 = content.find('</dd>', keywords1)
    keywords = content[keywords1+4:keywords2].strip()
    return {
        'title': title,
        'standard_no': standard_no,
        'standard_type': standard_type,
        'supporting_institution': supporting_institution,
        'project_principal': project_principal,
        'funds': funds,
        'year': year,
        'keywords': keywords,
    }

if __name__ == '__main__':

    main()
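
The jiexi() function above relies on fixed character offsets and on the fields appearing in a fixed order, so a missing label silently yields a wrong slice. As a minimal sketch of a more forgiving alternative, the fields could be looked up by label with a regular expression. This is written for Python 3 and uses only the standard library. The helper name parse_fields, the FIELD_LABELS table, and the sample fragment are hypothetical: the site's real markup is not shown in this post, and the sketch assumes each field sits in its own <dd> element followed by a full-width or ASCII colon.

```python
import re

# Chinese field labels used by jiexi(), mapped to the dictionary keys it returns.
FIELD_LABELS = {
    '批准号': 'standard_no',
    '项目类别': 'standard_type',
    '依托单位': 'supporting_institution',
    '项目负责人': 'project_principal',
    '资助经费': 'funds',
    '批准年度': 'year',
    '关键词': 'keywords',
}


def parse_fields(content):
    """Return {key: value} for each known label; missing labels map to None."""
    record = {}
    for label, key in FIELD_LABELS.items():
        # Label, a full-width or ASCII colon, then everything up to </dd>.
        match = re.search(re.escape(label) + '[：:](.*?)</dd>', content, re.S)
        record[key] = match.group(1).strip() if match else None
    return record


# Hypothetical fragment, made up for illustration only.
sample = (
    '<dl class="time_dl"><dt><a href="#">Example project</a></dt>'
    '<dd>批准号：12345678</dd><dd>项目类别：面上项目</dd>'
    '<dd>批准年度： 2015 </dd></dl>'
)
record = parse_fields(sample)
print(record['standard_no'], record['year'], record['keywords'])
```

Because each field is searched independently, a record that lacks a label produces None for that key instead of a slice taken from the wrong position. The title is not covered here and would still need the original logic or an XPath query.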
