Learning Python Web Scraping: Bloom Filters

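Bloom filter implementation, method 1: write your own

Reference: http://www.cnblogs.com/naive/p/5815433.html

BloomFilter takes two parameters: the size of the bloom filter (the number of bits) and the number of hash functions.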
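
How these two parameters affect accuracy can be estimated with the standard approximation for a bloom filter's false-positive probability, (1 - e^(-kn/m))^k, where m is the size in bits, k the number of hash functions and n the number of inserted items. The sketch below is only an illustration of that formula; the function name `false_positive_rate` is an assumption and does not appear in the referenced code.

```python
import math


def false_positive_rate(size, hash_count, n_items):
    """Approximate false-positive probability of a bloom filter.

    size:       number of bits in the filter (m)
    hash_count: number of hash functions (k)
    n_items:    number of distinct items inserted so far (n)
    """
    # A given bit is still 0 after n_items insertions with probability
    # roughly exp(-k * n / m); a false positive needs all k probed bits set.
    return (1.0 - math.exp(-hash_count * n_items / float(size))) ** hash_count


if __name__ == "__main__":
    # 1000 bits and 10 hash functions, at increasing fill levels
    for n in (19, 100, 500):
        print(n, false_positive_rate(1000, 10, n))
```

For a filter of 1000 bits with 10 hash functions, the estimate is negligible for a couple of dozen items and rises steeply as more items are added, so the size should be chosen with the expected number of items in mind.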

#!/usr/bin/env python
# coding: utf-8

# 3rd party
from bitarray import bitarray
import mmh3
import scrapy

class BloomFilter(object):
    """A simple bloom filter backed by a bit array."""

    def __init__(self, size, hash_count):
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size              # number of bits
        self.hash_count = hash_count  # number of hash functions

    def __len__(self):
        # The number of bits, not the number of inserted items
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # The loop index is the hash seed, giving hash_count different hashes
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # A single unset bit proves the item was never added
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                return False
        return True

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # Optional: dump the fetched page to a local file for inspection
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # 1000 bits, 10 hash functions
        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were never inserted
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
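
The animal lists above only exercise the filter. In a crawler, the same add-then-test pattern is typically applied to URLs so that a page is not fetched twice. The following is a minimal sketch of that idea, not the referenced article's code: `UrlBloomFilter` and `filter_unseen` are hypothetical names, and `hashlib.sha256` with a `bytearray` stands in for `mmh3` and `bitarray` so that the example runs with the standard library alone.

```python
import hashlib


class UrlBloomFilter(object):
    """Dependency-free bloom filter for URL strings (hypothetical helper)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray((size + 7) // 8)  # one bit per slot, all zero

    def _indexes(self, url):
        # Derive hash_count bit positions by salting SHA-256 with the round number
        for seed in range(self.hash_count):
            data = ("%d:%s" % (seed, url)).encode("utf-8")
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.size

    def add(self, url):
        for index in self._indexes(url):
            self.bits[index // 8] |= 1 << (index % 8)

    def __contains__(self, url):
        return all(self.bits[index // 8] & (1 << (index % 8))
                   for index in self._indexes(url))


def filter_unseen(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording them on the way."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = UrlBloomFilter(size=10000, hash_count=7)
    links = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
    print(filter_unseen(links, seen))
```

Because a bloom filter can return false positives, a crawler using this approach may occasionally skip a URL it has never visited; it never revisits one it has recorded.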

Bloom filter implementation, method 2: use pybloom

Reference: http://www.jianshu.com/p/f57187e2b5b9

#!/usr/bin/env python
# coding: utf-8

# 3rd party. If pybloom does not install on Python 3, the pybloom_live fork
# provides a BloomFilter class with the same constructor arguments.
from pybloom import BloomFilter
import scrapy

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # Optional: dump the fetched page to a local file for inspection
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # pybloom takes the expected number of items and the target
        # false-positive rate, not a bit count and a hash count
        bloom = BloomFilter(capacity=1000, error_rate=0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were never inserted
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))

Output

dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom filter as expected
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom filter as expected
fish is not in the bloom filter as expected
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom filter as expected
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
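
None of the 20 absent animals is reported as a false positive, which is expected for a filter sized for 1000 items at a 0.001 error rate that holds only 19. The relationship between those two arguments and the bit count and hash count of method 1 can be sketched with the textbook sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The function names below are illustrative assumptions, and pybloom's internal layout may differ from this classic calculation.

```python
import math


def optimal_size(capacity, error_rate):
    """Bits needed (m) to hold `capacity` items at the target false-positive rate."""
    return int(math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))


def optimal_hash_count(size, capacity):
    """Number of hash functions (k) that minimises false positives."""
    return max(1, int(round(size / float(capacity) * math.log(2))))


if __name__ == "__main__":
    m = optimal_size(1000, 0.001)
    k = optimal_hash_count(m, 1000)
    print(m, k)
```

For 1000 items at a 0.001 error rate this gives roughly 14 bits per item and about 10 hash functions.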
