scrapy_redis: updating score/request.priority with multiple threads

0. Background

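When crawling with scrapy_redis, forgetting to set request.priority or setting it incorrectly (a Rule can also set request.priority through its process_request parameter) leaves the requests that extract items at the tail of the sorted set xxx:requests, where they keep occupying memory.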
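
To make the cause concrete, here is a minimal sketch of a process_request callable that raises the priority of item requests, assuming that scrapy_redis's PriorityQueue stores each request with score = -priority and pops the lowest score first. `FakeRequest`, `ITEM_URL_PATTERN`, `ITEM_PRIORITY`, `boost_item_requests` and `queue_score` are hypothetical names used only for illustration; `FakeRequest` stands in for scrapy.Request so the sketch runs without Scrapy installed.

```python
import re
from collections import namedtuple


# Hypothetical stand-in for scrapy.Request so the sketch runs without Scrapy.
class FakeRequest(namedtuple('FakeRequest', 'url priority')):
    __slots__ = ()

    def replace(self, **kwargs):
        # scrapy.Request.replace() likewise returns a modified copy.
        return self._replace(**kwargs)


ITEM_URL_PATTERN = re.compile(r'apps/details')  # assumed item-page URL pattern
ITEM_PRIORITY = 10                              # assumed priority for item requests


def boost_item_requests(request, response=None):
    """Candidate for Rule(process_request=...).

    The second argument is optional because newer Scrapy versions also pass
    the response to this callable.
    """
    if ITEM_URL_PATTERN.search(request.url):
        return request.replace(priority=ITEM_PRIORITY)
    return request


def queue_score(request):
    # Assumption: the sorted-set score is the negated priority.
    return -request.priority


item = boost_item_requests(FakeRequest('http://example.com/apps/details?id=1', 0))
listing = boost_item_requests(FakeRequest('http://example.com/list?page=2', 0))
print(queue_score(item), queue_score(listing))
```

With a higher priority the item request gets a lower score, so it moves toward the head of the sorted set instead of waiting at the tail.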

1. Implementation

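Iterate over every item in the SortedSet, match the url in each item's data against regular expressions built from a predefined dictionary, copy the item to a temporary newkey with its updated score, and finally rename newkey to the original key.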
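
The matching step can be tried without a Redis server. The following is a minimal sketch, assuming each item's data is a protocol-2 pickle of a dict with 'url' and 'callback' keys; `build_pattern_score`, `pick_score` and `fake_member` are hypothetical helper names. The `re.S` flag is an addition in this sketch: the pickle stores the URL length as a raw byte right before the URL, and a URL whose length byte happens to be a newline (for example a 10-character URL) would not be matched by `.` otherwise.

```python
import pickle
import re


def build_pattern_score(keyword_score):
    # One compiled pattern per keyword; re.S lets '.' also match newline bytes.
    return [(re.compile(r'url.*?%s.*?callback' % k, re.S), v)
            for k, v in keyword_score.items()]


def pick_score(data, pattern_score, original_score):
    """Return the score of the first matching pattern, else the original score."""
    text = data.decode('utf-8', 'replace')
    for pattern, score in pattern_score:
        if pattern.search(text):
            return score
    return original_score


def fake_member(url):
    # Hypothetical stand-in for the serialized request stored in the sorted set.
    return pickle.dumps({'url': url, 'callback': 'parse'}, protocol=2)


pattern_score = build_pattern_score({'httpbin': -12, 'apps/details': 1})
print(pick_score(fake_member('http://httpbin.org/'), pattern_score, 0))
print(pick_score(fake_member('http://example.com/'), pattern_score, 5))
```

Items that match no keyword keep their original score, which mirrors the for/else branch in the script.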

# -*- coding: UTF-8 -*-
import re
import sys
from functools import partial
from multiprocessing.dummy import Pool as ThreadPool

import redis

try:
    input = raw_input  # For py2
except NameError:
    pass


def print_line(string):
    # Print a section banner such as '##########     Step 0: pre check'
    print('\n{symbol}{space}{string}'.format(symbol='#'*10, space=' '*5, string=string))

def check_key_scores(key):
    # Print how many items of the sorted set `key` have each integer score
    try:
        total = redis_server.zcard(key)
    except redis.exceptions.ResponseError:
        print("The value of '{key}' is not a SortedSet".format(key=key))
        sys.exit()
    except Exception as err:
        print(err)
        sys.exit()

    if total == 0:
        print("key '{key}' does not exist or has no items".format(key=key))
        sys.exit()

    __, min_score = redis_server.zrange(key, 0, 0, withscores=True)[0]
    __, max_score = redis_server.zrange(key, -1, -1, withscores=True)[0]

    print('score  amount')
    total_ = 0
    # Assuming that score/request.priority is an integer, rather than a float like 1.1
    for score in range(int(min_score), int(max_score)+1):
        count = redis_server.zcount(key, score, score)
        print(score, count)
        total_ += count
    # total_ counts the items with an integer score, total counts all items
    print("{total_}/{total} items of key '{key}' have an integer priority".format(
            total_=total_, total=total, key=key))

def zadd_with_new_score(startstop, total_items):
    # Fetch the item at rank `startstop` together with its current score
    data, ori_score = redis_server.zrange(key, startstop, startstop, withscores=True)[0]
    for pattern, score in pattern_score:
        # data eg: b'\\x80\\x02}q\\x00(X\\x03\\x00\\x00\\x00urlq\\x01X\\x13\\x00\\x00\\x00http://httpbin.org/q\\x02X\\x08\\x00\\x00\\x00callbackq\\x03X\\x
        m = pattern.search(data.decode('utf-8', 'replace'))
        if m:
            # See /site-packages/scrapy_redis/queue.py:
            #   We don't use zadd method as the order of arguments change depending on
            #   whether the class is Redis or StrictRedis, and the option of using
            #   kwargs only accepts strings, not bytes.
            redis_server.execute_command('ZADD', newkey, score, data)
            break
    else:
        # No pattern matched: copy the item with its original score
        redis_server.execute_command('ZADD', newkey, ori_score, data)
    print('{startstop} / {total_items}'.format(
            startstop=startstop+1, total_items=total_items))

if __name__ == '__main__':
    password = 'password'
    host = '127.0.0.1'
    port = ''
    database_num = 0
    key = 'test:requests'
    newkey = 'temp'
    # A request whose url matches any key of keyword_score is updated with the corresponding value as its score
    # Smaller value/score means higher request.priority
    keyword_score = {'httpbin': -12, 'apps/details': 1}
    # re.S lets '.' also match a newline byte inside the serialized data
    pattern_score = [(re.compile(r'url.*?%s.*?callback' % k, re.S), v) for (k, v) in keyword_score.items()]
    threads_amount = 10

    redis_server = redis.StrictRedis.from_url('redis://:{password}@{host}:{port}/{database_num}'.format(
                                                password=password, host=host,
                                                port=port, database_num=database_num))

    print_line('Step 0: pre check')
    check_key_scores(key)

    print_line('Step 1: copy items and update score')
    # total_items = redis_server.zlexcount(key, '-', '+')
    total_items = redis_server.zcard(key)
    input("Press Enter to copy {total_items} items of '{key}' into '{newkey}' with new score".format(
            total_items=total_items, key=key, newkey=newkey))

    p = ThreadPool(threads_amount)
    p.map(partial(zadd_with_new_score, total_items=total_items), range(total_items))
    p.close()   # Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker threads will exit.
    p.join()    # Wait for the worker threads to exit. One must call close() or terminate() before using join().

    # For py3
    # https://stackoverflow.com/questions/5442910/python-multiprocessing-pool-map-for-multiple-arguments
    # with ThreadPool(threads_amount) as pool:
        # pool.map(partial(zadd_with_new_score, total_items=total_items), range(total_items))
    # print('zadd_with_new_score done')

    print_line('Step 2: check copy result')
    check_key_scores(key)
    check_key_scores(newkey)

    print_line('Step 3: delete, rename and check key')
    input("Press Enter to DELETE '{key}' and RENAME '{newkey}' to '{key}'".format(
            key=key, newkey=newkey))
    print(redis_server.delete(key))
    print(redis_server.rename(newkey, key))
    check_key_scores(key)
    # newkey no longer exists after the rename, so this call reports that and exits
    check_key_scores(newkey)
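
The script issues one ZRANGE per item. A possible variation, which is not part of the original script, is to hand each worker an inclusive rank range so that it fetches a whole slice with a single `zrange(key, start, stop, withscores=True)` call. The sketch below only covers the Redis-independent parts; `index_ranges`, `rescore_batch` and `choose_score` are hypothetical names.

```python
def index_ranges(total_items, batch_size):
    """Yield inclusive (start, stop) ranks covering 0 .. total_items - 1."""
    for start in range(0, total_items, batch_size):
        yield start, min(start + batch_size, total_items) - 1


def rescore_batch(members_with_scores, choose_score):
    """Return (new_score, member) pairs for one batch.

    members_with_scores has the shape returned by zrange(..., withscores=True):
    a list of (member, score) tuples. choose_score(member, score) returns the
    score the member should be copied with.
    """
    return [(choose_score(member, score), member)
            for member, score in members_with_scores]


ranges = list(index_ranges(10, 4))
print(ranges)  # [(0, 3), (4, 7), (8, 9)]
```

ZRANGE treats both ends of the range as inclusive, which is why each stop value is the last rank of its batch rather than one past it.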

2. Results
