I. Working with MongoDB in Python 3

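　　1. Connection prerequisites

Install the pymongo library (for example, with pip install pymongo).

Start the MongoDB server. If it runs in the foreground, keep its terminal window open, because closing the window also stops the server.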
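
Both prerequisites can be checked from Python before running the examples. The following is a minimal sketch, assuming an unauthenticated server on 127.0.0.1:27017. The helper names build_uri and server_is_up are hypothetical; serverSelectionTimeoutMS and the ping command are standard PyMongo and MongoDB features.

```python
def build_uri(host='127.0.0.1', port=27017):
    """Build a MongoDB connection string for a single unauthenticated server."""
    return 'mongodb://{}:{}/'.format(host, port)


def server_is_up(uri, timeout_ms=2000):
    """Return True if a MongoDB server answers a ping at the given URI."""
    # Imported here so that build_uri() works even before pymongo is installed.
    import pymongo
    from pymongo.errors import PyMongoError

    # MongoClient connects lazily; the short timeout makes a missing server fail fast.
    client = pymongo.MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    try:
        client.admin.command('ping')
        return True
    except PyMongoError:
        return False
    finally:
        client.close()


# Example call (needs pymongo installed; returns False if no server is listening):
# print(server_is_up(build_uri()))
```

If the import fails, pymongo is not installed; if server_is_up() returns False, the server is probably not running or is listening on a different address.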

　　3. Usage

import pymongo

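# Connect through MongoClient; passing the server's host and port is usually enough.
# The first parameter is host and the second is port, which defaults to 27017.
client = pymongo.MongoClient(host='127.0.0.1', port=27017)
# client is now a client object for that server.
# The host parameter also accepts a MongoDB connection string that starts with "mongodb://",
# for example: client = pymongo.MongoClient('mongodb://localhost:27017/') connects the same way.
# print(client)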

################### Select a database
db = client.test
# Equivalent form:
# db = client['test']

################### Select a collection
collections = db.student
# Equivalent form:
# collections = db['student']

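################### Insert data
# student = {
#     'id': '1111',
#     'name': 'xiaowang',
#     'age': 20,
#     'sex': 'boy',
# }
#
# res = collections.insert_one(student)
# print(res.inserted_id)
# Every MongoDB document has an _id field that uniquely identifies it.
# If _id is not given explicitly, an ObjectId-typed _id is generated automatically.
# The legacy insert() method (deprecated in PyMongo 3, removed in PyMongo 4) returned the _id directly,
# for example 5c7fb5ae35573f14b85101c0; insert_one() returns an InsertOneResult whose inserted_id holds that value.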

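# Several documents can be inserted at once.
# student1 = {
#     'name': 'xx',
#     'age': 20,
#     'sex': 'boy'
# }
#
# student2 = {
#     'name': 'ww',
#     'age': 21,
#     'sex': 'girl'
# }
# student3 = {
#     'name': 'xxx',
#     'age': 22,
#     'sex': 'boy'
# }
#
# result = collections.insert_many([student1, student2, student3])
# print(result)
# print(result.inserted_ids)
# The return value here is an InsertManyResult object rather than the _id values themselves;
# its inserted_ids attribute lists the _id of each inserted document.

# There are two insert methods:
# insert_one() inserts a single document, and insert_many() inserts several documents passed as a list.
# The legacy insert() accepted either a single document or a list, but it is deprecated in PyMongo 3 and removed in PyMongo 4.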

################### Query: find a single document
# re=collections.find_one({'name':'xx'})
# print(re)
# print(type(re))
#{'_id': ObjectId('5c7fb8d535573f13f85a6933'), 'name': 'xx', 'age': 20, 'sex': 'boy'}
# <class 'dict'>

################### Query: find multiple documents
# re = collections.find({'name': 'xx'})
# print(re)
# print(type(re))
# for r in re:
#     print(r)
# find() returns a Cursor, which is iterable like a generator; loop over it to get each matching document.
# <pymongo.cursor.Cursor object at 0x000000000A98E630>
# <class 'pymongo.cursor.Cursor'>

# re=collections.find({'age':{'$gt':20}})
# print(re)
# print(type(re))
# for r in re:
#     print(r)
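# Here the value in the filter is no longer a plain number but a dict whose key is the comparison operator $gt
# (greater than) and whose value is 20, so the query returns every document with age greater than 20.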

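# The comparison operators are summarized below:
"""
Operator  Meaning                           Example
$lt       less than                         {'age': {'$lt': 20}}
$gt       greater than                      {'age': {'$gt': 20}}
$lte      less than or equal to             {'age': {'$lte': 20}}
$gte      greater than or equal to          {'age': {'$gte': 20}}
$ne       not equal to                      {'age': {'$ne': 20}}
$in       equal to any value in the list    {'age': {'$in': [20, 23]}}
$nin      equal to no value in the list     {'age': {'$nin': [20, 23]}}
"""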

# Query with a regular expression
# re = collections.find({'name': {'$regex': '^x.*'}})
# print(re)
# print(type(re))
# for r in re:
#     print(r)

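# Some further operators:
"""
Operator  Meaning                   Example                                              What the example matches
$regex    regular-expression match  {'name': {'$regex': '^M.*'}}                         name starts with M
$exists   field exists              {'name': {'$exists': True}}                          the name field exists
$type     type check                {'age': {'$type': 'int'}}                            age is of type int
$mod      modulo                    {'age': {'$mod': [5, 0]}}                            age modulo 5 equals 0
$text     text search               {'$text': {'$search': 'Mike'}}                       a text-indexed field contains the string Mike
$where    JavaScript condition      {'$where': 'obj.fans_count == obj.follows_count'}    fans_count equals follows_count
"""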

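################### Counting
# count = collections.count_documents({'age': {'$gt': 20}})
# print(count)
# The older form find(...).count() is deprecated since PyMongo 3.7 and removed in PyMongo 4.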

################### Sorting
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING)
# print([re['name'] for re in result])

################### Offset: to take only part of the results, use skip() to skip leading documents; skip(2), for example, ignores the first two and returns the third onward.
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING).skip(1)
# print([re['name'] for re in result])

################### limit() caps the number of results returned, for example:
# results = collections.find().sort('age', pymongo.ASCENDING).skip(1).limit(2)
# print([result['name'] for result in results])

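# Note: on very large collections (tens or hundreds of millions of documents), avoid large offsets. The server still has
# to walk past every skipped document, so such queries become slow and resource-hungry.
# Instead, record the _id of the last document from the previous query and filter on it,
# for example find({'_id': {'$gt': ObjectId('593278c815c2602678bb2b8d')}}).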

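################### Updating data
# The legacy update() method (deprecated in PyMongo 3, removed in PyMongo 4) was used like this:
# condition = {'name': 'xx'}
# student = collections.find_one(condition)
# student['age'] = 100
# result = collections.update(condition, student)
# print(result)

# This updates the age of the document whose name is xx: specify the filter, fetch the document, change the age,
# then pass the filter and the modified document to update(). Without a $ operator, update() replaces the whole matched document.
# {'ok': 1, 'nModified': 1, 'n': 1, 'updatedExisting': True}
# The result is a dict: ok means the command succeeded, and nModified is the number of documents modified.

# update() is not recommended by the official documentation. Use update_one() and update_many() instead; they are stricter,
# and the second argument must use a $ operator as its top-level key. An example follows.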

# condition={'name':'xx'}
# student=collections.find_one(condition)
# print(student)
# student['age']=112
# result=collections.update_one(condition,{'$set':student})
# print(result)
# print(result.matched_count,result.modified_count)

# Another example
# condition={'age':{'$gt':20}}
# result=collections.update_one(condition,{'$inc':{'age':1}})
# print(result)
# print(result.matched_count,result.modified_count)
# Here the filter is age greater than 20 and the update document is {'$inc': {'age': 1}},
# so the first matching document has its age increased by 1.
# <pymongo.results.UpdateResult object at 0x000000000A99AB48>
# 1 1

# update_many() updates every matching document instead, for example:

condition = {'age': {'$gt': 20}}
result = collections.update_many(condition, {'$inc': {'age': 1}})
print(result)
print(result.matched_count, result.modified_count)
# The matched count is no longer 1. Output:

# <pymongo.results.UpdateResult object at 0x10c6384c8>
# 3 3
# Every matching document was updated.

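################### Deleting
# With the legacy remove() method (deprecated in PyMongo 3, removed in PyMongo 4), pass a filter and every matching document is deleted:

# result = collections.remove({'name': 'Kevin'})
# print(result)
# Output:

# {'ok': 1, 'n': 1}
# The recommended methods are delete_one() and delete_many(), for example: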

# result = collections.delete_one({'name': 'Kevin'})
# print(result)
# print(result.deleted_count)
# result = collections.delete_many({'age': {'$lt': 25}})
# print(result.deleted_count)
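# Output:

# <pymongo.results.DeleteResult object at 0x10e6ba4c8>
# 1
# 4
# delete_one() deletes the first matching document and delete_many() deletes all matching documents.
# Both return a DeleteResult, whose deleted_count attribute gives the number of documents deleted.

# More
# PyMongo also provides combined methods such as find_one_and_delete(), find_one_and_replace(), and find_one_and_update(),
# which find a document and then delete, replace, or update it. Their usage is largely the same as the methods above.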
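
The note earlier in this section about avoiding large skip() offsets can be sketched as _id-based (keyset) paging. This is an illustration rather than part of the original script: page_filter and iter_pages are hypothetical helper names, collection is assumed to be a PyMongo collection such as db.student, and base_filter is assumed not to contain its own _id condition.

```python
def page_filter(base_filter, last_id=None):
    """Return a copy of base_filter restricted to documents after last_id."""
    query = dict(base_filter)
    if last_id is not None:
        # Only documents whose _id sorts after the last one already seen
        query['_id'] = {'$gt': last_id}
    return query


def iter_pages(collection, base_filter, page_size=100):
    """Yield lists of documents, page by page, without using skip()."""
    last_id = None
    while True:
        cursor = collection.find(page_filter(base_filter, last_id))
        # 1 is the value of pymongo.ASCENDING
        page = list(cursor.sort('_id', 1).limit(page_size))
        if not page:
            break
        yield page
        # Remember where this page ended so the next query starts after it
        last_id = page[-1]['_id']
```

A call such as iter_pages(db.student, {'age': {'$gt': 20}}, page_size=2) would then yield the matching documents two at a time, and each query can use the index on _id instead of walking past skipped documents.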

II. Scraping Tencent job postings

　　Spider file

# -*- coding: utf-8 -*-
import scrapy

from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['www.xxx.com']

    # Base URL that the page offset is appended to
    base_url = 'http://hr.tencent.com/position.php?&start='
    page_num = 0
    start_urls = [base_url + str(page_num)]

    def parse(self, response):
        # Each posting sits in a table row with class "even" or "odd"; select all rows, then loop over them.
        tr_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for tr in tr_list:
            name = tr.xpath('./td[1]/a/text()').extract_first()
            url = tr.xpath('./td[1]/a/@href').extract_first()

            # The job category cell is sometimes empty. Indexing extract()[0] raises on an empty result,
            # so a default has to be supplied explicitly:
            # if len(tr.xpath("./td[2]/text()")):
            #     worktype = tr.xpath("./td[2]/text()").extract()[0]
            # else:
            #     worktype = "NULL"
            # extract_first() returns None instead of raising, so this form is enough:
            worktype = tr.xpath('./td[2]/text()').extract_first()
            num = tr.xpath('./td[3]/text()').extract_first()
            location = tr.xpath('./td[4]/text()').extract_first()
            publish_time = tr.xpath('./td[5]/text()').extract_first()

            item = TencentItem()
            item['name'] = name
            item['worktype'] = worktype
            item['url'] = url
            item['num'] = num
            item['location'] = location
            item['publish_time'] = publish_time

            print('----', name)
            print('----', url)
            print('----', worktype)
            print('----', location)
            print('----', num)
            print('----', publish_time)

            yield item

        # Pagination, method 1:
        # use this when the page offsets are known in advance and there is no "next page" link to follow,
        # so the URL has to be built by concatenation.
        # if self.page_num < 3060:
        #     self.page_num += 10
        #     url = self.base_url + str(self.page_num)
        #     # yield scrapy.Request(url=url, callback=self.parse)
        #     yield scrapy.Request(url, callback=self.parse)

        # Pagination, method 2:
        # follow the "next page" link directly.
        # If no <a id="next" class="noactive"> element is found (length 0), this is not the last page,
        # so extract the next-page href and request it.
        if len(response.xpath("//a[@id='next' and @class='noactive']")) == 0:
            next_url = response.xpath('//a[@id="next"]/@href').extract_first()
            if next_url:
                # urljoin resolves the href against the URL of the current page
                yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
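
Next-page hrefs may be relative or absolute, so resolving them against the current page URL is more robust than prefixing a fixed host. The sketch below uses the standard library's urljoin to show the resolution; inside a spider, response.urljoin(href) performs the same resolution against response.url. The helper name next_page_url and the sample hrefs are assumptions for illustration, not values taken from the site.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the next-page href against the current page URL.

    Returns None when there is no href, so the caller can stop paginating.
    """
    if not href:
        return None
    return urljoin(current_url, href)


current = 'https://hr.tencent.com/position.php?&start=0'
# A bare relative href and a root-relative href resolve to the same URL
relative = next_page_url(current, 'position.php?&start=10')
absolute_path = next_page_url(current, '/position.php?&start=10')
# A missing href (extract_first() returned None) yields None instead of raising
missing = next_page_url(current, None)
print(relative, absolute_path, missing)
```

Guarding against a missing href matters because extract_first() returns None when the XPath matches nothing, and concatenating a string with None raises a TypeError.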


　　Pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

import pymongo
import pymysql
from redis import Redis


# Store items in a local text file
class TencentPipeline(object):
    f = None

    def open_spider(self, spider):
        self.f = open('./tencent2.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # extract_first() may return None (for example, an empty job category), so substitute '' before joining
        fields = ('name', 'url', 'num', 'worktype', 'location', 'publish_time')
        self.f.write(':'.join(item[key] or '' for key in fields) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()


# Store items in MySQL
class TencentPipelineMysql(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='', db='tencent')
        # Create the cursor once here so close_spider() can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        print('MySQL pipeline: process_item called')
        try:
            # Let the driver bind the values instead of formatting them into the SQL string
            self.cursor.execute(
                'insert into tencent values (%s, %s, %s, %s, %s, %s)',
                (item['name'], item['worktype'], item['url'], item['num'], item['publish_time'], item['location']),
            )
            self.conn.commit()
        except Exception as e:
            print('MySQL insert failed:', e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Store items in Redis
class TencentPipelineRedis(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # Serialize the item to JSON and push it onto the "tencent" list
        item_dic = dict(item)
        item_json = json.dumps(item_dic)
        self.conn.lpush('tencent', item_json)
        return item


# Store items in MongoDB
class TencentPipelineMongo(object):
    client = None

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        self.db = self.client['test']

    def process_item(self, item, spider):
        collection = self.db['tencent']
        item_dic = dict(item)
        # insert_one() replaces the legacy insert(), which was removed in PyMongo 4
        collection.insert_one(item_dic)
        return item

    def close_spider(self, spider):
        self.client.close()
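
When a pipeline writes scraped text into SQL, the values should be bound by the driver rather than formatted into the statement; a value containing a quote would otherwise break the statement or open it to SQL injection. The sketch below demonstrates parameter binding with the standard library's sqlite3 standing in for MySQL, so it runs without a server. Note that sqlite3 uses ? placeholders where pymysql uses %s. The table layout mirrors the six scraped fields, and the sample row is made up for the demonstration.

```python
import sqlite3

# In-memory database standing in for the MySQL "tencent" table
conn = sqlite3.connect(':memory:')
conn.execute(
    'create table tencent '
    '(name text, worktype text, url text, num text, publish_time text, location text)'
)

# Hypothetical item; the name contains quotes that would break string formatting
item = {
    'name': 'Engineer "Backend" (O\'Reilly team)',
    'worktype': None,  # extract_first() returns None for an empty cell
    'url': 'position_detail.php?id=1',
    'num': '2',
    'publish_time': '2019-03-06',
    'location': 'Shenzhen',
}

columns = ('name', 'worktype', 'url', 'num', 'publish_time', 'location')
# The driver binds each value; None is stored as SQL NULL
conn.execute(
    'insert into tencent values (?, ?, ?, ?, ?, ?)',
    tuple(item[c] for c in columns),
)
conn.commit()

# Read the row back to confirm the quoted text survived unchanged
row = conn.execute('select name, worktype from tencent').fetchone()
print(row)
conn.close()
```

The same pattern applies to pymysql: pass the SQL with %s placeholders as the first argument of cursor.execute() and the values as a tuple in the second argument.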


　　settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentPipelineMysql': 301,
    'Tencent.pipelines.TencentPipelineRedis': 302,
    'Tencent.pipelines.TencentPipelineMongo': 303,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

　　item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    worktype = scrapy.Field()
    location = scrapy.Field()
    num = scrapy.Field()
    publish_time = scrapy.Field()
