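0. Contents

1. References
2. Building the URLs
3. Creating the database
4. Creating the summary table
5. Defining the database connection function: connect_db(db=None, cursorclass=DictCursor)
6. Filling the summary table with the required data
7. Creating the per-province tables
8. Full code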

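1. References

2. Building the URLs

See the post "python之多线程 queue 实践 筛选有效url" (Python multithreading and Queue in practice: filtering valid URLs).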

3. Creating the database

    mysql> CREATE DATABASE mobile
        -> CHARACTER SET 'utf8'
        -> COLLATE 'utf8_general_ci';
    Query OK, 1 row affected (0.00 sec)

    mysql> use mobile;
    Database changed

4. Creating the summary table

In int(M), M is only the display width, so there is no need to set it.

A signed INT covers the range [-2147483648, 2147483647]. An 11-digit mobile number exceeds that range, so the column holding numbers (latest_num) is a BIGINT.

    mysql> use mobile
    Database changed
    mysql> CREATE TABLE china(
        -> id INT NOT NULL auto_increment,
        -> province VARCHAR(100) NOT NULL,
        -> city VARCHAR(100) NOT NULL,
        -> num_count INT NULL,
        -> new_time DATETIME NULL,
        -> update_time DATETIME NULL,
        -> latest_num BIGINT NULL,
        -> province_zh VARCHAR(100) NOT NULL,
        -> city_zh VARCHAR(100) NOT NULL,
        -> url VARCHAR(255) NOT NULL,
        ->
        -> PRIMARY KEY(id)
        -> )engine=InnoDB DEFAULT CHARSET=utf8;
    Query OK, 0 rows affected (0.01 sec)
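
The range remark above can be checked with a few lines of Python. This is only a minimal sketch: the sample number 18522223333 is the placeholder that appears in a comment of the full code, and the constant names are illustrative assumptions.

```python
# Signed 32-bit INT bounds, as quoted in the text above.
INT_MIN, INT_MAX = -2147483648, 2147483647
# Upper bound of a signed 64-bit BIGINT.
BIGINT_MAX = 2**63 - 1

sample_num = 18522223333  # illustrative 11-digit number, not real data

fits_int = INT_MIN <= sample_num <= INT_MAX
fits_bigint = sample_num <= BIGINT_MAX
print(fits_int, fits_bigint)  # False True
```

An 11-digit number is at least 10000000000, which is already above INT_MAX, so any such number needs BIGINT.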

5. Defining the database connection function: connect_db(db=None, cursorclass=DictCursor)

    import MySQLdb
    from MySQLdb.cursors import Cursor, DictCursor

    def connect_db(db=None, cursorclass=DictCursor):
        conn = MySQLdb.connect(
            host='127.0.0.1',  # or 'localhost'
            port=3306,
            user='root',
            passwd='root',
            # db='mobile',
            charset='utf8'
        )
        curs = conn.cursor(cursorclass)
        if db is not None:
            # db is interpolated as an identifier, so pass trusted names only
            curs.execute("USE %s" % db)
        return conn, curs

    conn, curs = connect_db('mobile')

6. Filling the summary table with the required data

Structure of the info data:

    In [105]: info = pickle.load(open(''))

    In [106]: info[0]
    Out[106]:
    ('18xxxxx3664',  # number details hidden here
     u'\u6cb3\u5317',
     u'\u79e6\u7687\u5c9b\u5e02',
     u'hebei',
     u'qinhuangdaoshi',
     'https://m.1xxxx.com/xxxxxxxxxx')  # URL details hidden here

    In [107]: info = [dict(province_zh=i[1], city_zh=i[2], province=i[3], city=i[4], url=i[-1]) for i in info]

    In [108]: info[0]
    Out[108]:
    {'city': u'qinhuangdaoshi',
     'city_zh': u'\u79e6\u7687\u5c9b\u5e02',
     'province': u'hebei',
     'province_zh': u'\u6cb3\u5317',
     'url': 'https://m.1xxxx.com/xxxxxxxxxx'}  # URL details hidden here

    # Sorting in Python: sorting by multiple attributes
    # https://www.2cto.com/kf/201312/265675.html
    In [114]: info = sorted(info, key=lambda x:(x.get('province'), x.get('city')))
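
The multi-attribute sort in the last step can be tried on its own. The sketch below uses made-up rows shaped like the info dicts (the city values are assumptions, not data from the post) and shows that a tuple key orders by province first and by city within each province.

```python
# Hypothetical rows shaped like the info dicts above (values are made up).
rows = [
    {'province': 'hebei', 'city': 'tangshanshi'},
    {'province': 'anhui', 'city': 'hefeishi'},
    {'province': 'hebei', 'city': 'baodingshi'},
]

# A tuple key compares province first, then city when provinces are equal.
rows_sorted = sorted(rows, key=lambda x: (x.get('province'), x.get('city')))

for row in rows_sorted:
    print(row['province'], row['city'])
# anhui hefeishi
# hebei baodingshi
# hebei tangshanshi
```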

Inserting the required data:

    In [115]: curs.executemany("""
         ...: INSERT INTO china(province, city, province_zh, city_zh, url)
         ...: values(%(province)s,%(city)s,%(province_zh)s,%(city_zh)s,%(url)s)""", info)
         ...: conn.commit()
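
For readers without a MySQL server at hand, the same executemany pattern with named parameters can be sketched against an in-memory SQLite database. This is a stand-in, not the setup used in this post: the sqlite3 module writes named parameters as :name where MySQLdb writes %(name)s, and the table is reduced to three columns with made-up rows.

```python
import sqlite3

# In-memory SQLite stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute("""
    CREATE TABLE china(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        province TEXT NOT NULL,
        city TEXT NOT NULL,
        url TEXT NOT NULL
    )""")

# Made-up rows; each dict supplies the named parameters of one INSERT.
info = [
    {'province': 'anhui', 'city': 'hefeishi', 'url': 'https://example.invalid/a'},
    {'province': 'hebei', 'city': 'baodingshi', 'url': 'https://example.invalid/b'},
]

# sqlite3 spells named parameters :name; MySQLdb spells them %(name)s.
curs.executemany(
    "INSERT INTO china(province, city, url) VALUES (:province, :city, :url)",
    info)
conn.commit()

inserted = curs.execute("SELECT province, city FROM china ORDER BY id").fetchall()
print(inserted)
```

Passing a list of dicts keeps each row self-describing, so the column order in the SQL does not have to match the order in which the data was collected.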

Shanxi (山西) and Shaanxi (陕西) have the same pinyin, so the province column holds only 30 distinct values against 31 in province_zh:

    mysql> select count(distinct province),count(distinct city),count(distinct province_zh),count(distinct city_zh),count(distinct url) from china;
    +--------------------------+----------------------+-----------------------------+-------------------------+---------------------+
    | count(distinct province) | count(distinct city) | count(distinct province_zh) | count(distinct city_zh) | count(distinct url) |
    +--------------------------+----------------------+-----------------------------+-------------------------+---------------------+
    |                       30 |                  326 |                          31 |                     331 |                 339 |
    +--------------------------+----------------------+-----------------------------+-------------------------+---------------------+
    1 row in set (0.00 sec)
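
The same mismatch can be located from Python by grouping the Chinese names under each pinyin key. The sketch below is an illustration on made-up pairs, not the author's procedure; the variable names are assumptions.

```python
from collections import defaultdict

# Made-up (province, province_zh) pairs shaped like rows of the china table.
pairs = [
    ('shanxi', u'\u5c71\u897f'),  # 山西
    ('shanxi', u'\u9655\u897f'),  # 陕西
    ('hebei', u'\u6cb3\u5317'),   # 河北
    ('hebei', u'\u6cb3\u5317'),
]

# Collect the distinct Chinese names seen for each pinyin spelling.
names_by_pinyin = defaultdict(set)
for province, province_zh in pairs:
    names_by_pinyin[province].add(province_zh)

# A pinyin key with more than one Chinese name is a collision.
collisions = {p: zh for p, zh in names_by_pinyin.items() if len(zh) > 1}
print(sorted(collisions))  # ['shanxi']
```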

Change the province value for 陕西 (Shaanxi) to shanxi3:

    mysql> set character_set_client = gbk;
    Query OK, 0 rows affected (0.00 sec)

    mysql> set character_set_results = gbk;
    Query OK, 0 rows affected (0.00 sec)

    mysql> UPDATE china
        -> SET province = 'shanxi3'
        -> WHERE province_zh = '陕西';
    Query OK, 10 rows affected (0.01 sec)
    Rows matched: 10  Changed: 10  Warnings: 0

If the UPDATE is run through curs instead, conn.commit() is required:

    In [99]: curs.execute("""
        ...: UPDATE china
        ...: SET province = 'shanxi3'
        ...: WHERE province_zh = '陕西';
        ...: """)
    "\nUPDATE china\nSET province = 'shanxi3'\nWHERE province_zh = '\xe9\x99\x95\xe8\xa5\xbf';\n"
    Out[99]: 10L

    In [100]: conn.commit()

    In [101]: curs.fetchall()
    Out[101]: ()
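
Why the commit matters can be shown with a small sketch. It uses the standard-library sqlite3 module as a stand-in for MySQLdb (both follow the Python DB-API, and both leave a data-changing statement in an open transaction until commit() is called); the one-row table and the helper current_province are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute("CREATE TABLE china(province TEXT, province_zh TEXT)")
curs.execute("INSERT INTO china VALUES ('shanxi', 'Shaanxi')")
conn.commit()

def current_province():
    """Read back the single province value."""
    return curs.execute("SELECT province FROM china").fetchone()[0]

# Without commit(), the UPDATE is lost when the transaction is rolled back.
curs.execute("UPDATE china SET province = 'shanxi3'")
conn.rollback()
after_rollback = current_province()

# With commit(), the UPDATE survives a later rollback().
curs.execute("UPDATE china SET province = 'shanxi3'")
conn.commit()
conn.rollback()
after_commit = current_province()

print(after_rollback, after_commit)  # shanxi shanxi3
```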

7. Creating the per-province tables

    curs.execute("""
        SELECT DISTINCT province
        FROM china
    """)
    table_names = [i.get('province') for i in curs]

    # The id column is not strictly necessary here.
    # UNIQUE lets later inserts update only some columns of the existing row
    # when the number is already present (ON DUPLICATE KEY UPDATE).
    # FOREIGN KEY keeps the rows in sync with the summary table.

    for table_name in table_names:
        curs.execute("""
            CREATE TABLE %s(
                id INT NOT NULL AUTO_INCREMENT,
                china_id INT NOT NULL,
                num BIGINT NOT NULL,
                times INT NULL DEFAULT 1,
                update_time datetime NULL,
                head INT(3) NULL,
                mid INT(4) ZEROFILL NULL,
                tail INT(4) ZEROFILL NULL,
                mid_match CHAR(4) NULL,
                tail_match CHAR(4) NULL,

                PRIMARY KEY(id),
                UNIQUE KEY(num),
                FOREIGN KEY (china_id)
                    REFERENCES china(id)
                    ON DELETE CASCADE
                    ON UPDATE CASCADE
            )ENGINE=InnoDB DEFAULT CHARSET=utf8;""" % (table_name))
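
The effect that the UNIQUE key on num is meant to enable (a repeated number bumps times and refreshes update_time instead of adding a row) can be sketched without MySQL. The block below is an emulation in SQLite using an UPDATE followed by a fallback INSERT; it is not the ON DUPLICATE KEY UPDATE statement itself, and the reduced table, the helper upsert, and the timestamps are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
curs = conn.cursor()
# Reduced province table: only the columns the upsert touches.
curs.execute("""
    CREATE TABLE anhui(
        num INTEGER NOT NULL UNIQUE,
        times INTEGER DEFAULT 1,
        update_time TEXT
    )""")

def upsert(num, update_time):
    """Bump times and update_time for a known number, else insert it."""
    curs.execute(
        "UPDATE anhui SET times = times + 1, update_time = ? WHERE num = ?",
        (update_time, num))
    if curs.rowcount == 0:  # no existing row matched, so this is a new number
        curs.execute(
            "INSERT INTO anhui(num, update_time) VALUES (?, ?)",
            (num, update_time))
    conn.commit()

upsert(18522223333, '2017-01-01 00:00:00')
upsert(18522223333, '2017-01-01 00:05:00')  # same number seen again

rows = curs.execute("SELECT num, times, update_time FROM anhui").fetchall()
print(rows)  # [(18522223333, 2, '2017-01-01 00:05:00')]
```

In MySQL the single INSERT ... ON DUPLICATE KEY UPDATE statement does both steps atomically, which is why the per-province tables declare num as UNIQUE.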

Check the indexes:

    mysql> show index from anhui;
    +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | anhui |          0 | PRIMARY  |            1 | id          | A         |           0 |     NULL | NULL   |      | BTREE      |         |               |
    | anhui |          0 | num      |            1 | num         | A         |           0 |     NULL | NULL   |      | BTREE      |         |               |
    | anhui |          1 | china_id |            1 | china_id    | A         |           0 |     NULL | NULL   |      | BTREE      |         |               |
    +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

8. Full code

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    import time
    import re
    import MySQLdb
    from MySQLdb.cursors import Cursor, DictCursor
    import json
    import traceback

    import threading
    lock = threading.Lock()
    import Queue
    task_queue = Queue.Queue()
    result_queue = Queue.Queue()

    import requests
    from requests.exceptions import (ConnectionError, ConnectTimeout, ReadTimeout, SSLError,
                                     ProxyError, RetryError, InvalidSchema)
    s = requests.Session()
    s.headers.update({'user-agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_5 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13G36 MicroMessenger/6.5.12 NetType/4G'})
    # Referer details are hidden here; the header can also be left out
    # s.headers.update({'Referer':'https://servicewechat.com/xxxxxxxxxxx'})
    s.verify = False
    s.mount('https://', requests.adapters.HTTPAdapter(pool_connections=1000, pool_maxsize=1000))

    import copy
    sp = copy.deepcopy(s)
    proxies = {'http': 'http://127.0.0.1:3128', 'https': 'https://127.0.0.1:3128'}
    sp.proxies = proxies

    from urllib3.exceptions import InsecureRequestWarning
    from warnings import filterwarnings
    filterwarnings('ignore', category = InsecureRequestWarning)

    import logging
    def get_logger():
        logger = logging.getLogger("threading_example")
        logger.setLevel(logging.DEBUG)

        # fh = logging.FileHandler("d:/threading.log")
        fh = logging.StreamHandler()
        fmt = '%(asctime)s - %(threadName)-10s - %(levelname)s - %(message)s'
        formatter = logging.Formatter(fmt)
        fh.setFormatter(formatter)

        logger.addHandler(fh)
        return logger

    logger = get_logger()

    def connect_db(db=None, cursorclass=DictCursor):
        conn = MySQLdb.connect(
            host='127.0.0.1',  # or 'localhost'
            port=3306,
            user='root',
            passwd='root',
            # db='mobile',
            charset='utf8'
        )
        curs = conn.cursor(cursorclass)
        if db is not None:
            curs.execute("USE %s" % db)
        return conn, curs

    def read_table_china():
        dict_conn, dict_curs = connect_db('mobile', DictCursor)
        dict_curs.execute("""
            SELECT id, province, province_zh, city_zh, url
            FROM china
        """)
        table_china = dict_curs.fetchall()
        dict_conn.close()

        return table_china

    def add_task():
        while True:
            if task_queue.qsize() < 1000:
                for t in table_china:
                    # The queue still holds only the 300-odd row dicts, referenced
                    # repeatedly, so consumers must copy them with dict() after get()
                    task_queue.put(t)
            else:
                time.sleep(0.1)  # avoid spinning while the queue is full

    loop = 0
    def do_task():
        global loop
        while True:
            task = dict(task_queue.get())  # copy: the row dict is shared
            nums = get_nums(task.get('url'))
            if nums is None:
                task_queue.task_done()
                continue

            task.update(dict(update_time=time.strftime('%y-%m-%d %H:%M:%S')))

            # results feeds the province tables later; executemany INSERT is more efficient
            results = []
            for num in nums:
                result = dict(task)  # copy: one dict per number
                match_dict = parse_num(num)
                result.update(match_dict)
                results.append(result)

            result_queue.put(results)
            task_queue.task_done()

            # Remember the u prefix on the format string
            logger.debug(u'{} tasks:{} results:{} threads: {} loop: {} {}_{}'.format(nums[-1], task_queue.qsize(), result_queue.qsize(),
                                                                                   threading.activeCount(), loop,
                                                                                   task.get('province_zh'), task.get('city_zh')))

            with lock:
                loop += 1

    def get_nums(url):
        """Return the list of numbers behind url, or None on any failure."""
        try:
            url = url + str(int(time.time()*1000))
            resp = sp.get(url)  # , timeout=10)  # with a timeout, network throughput dropped off sharply
            rst = resp.content

            # rst = rst[rst.index('{'):rst.index('}')+1]
            m = re.search(r'({.*?})', rst)
            match = m.group()
            rst = json.loads(match)
            nums = [num for num in rst['numArray'] if num>10000]
            # nums_len = len(nums)
            # assert nums_len == 10
            assert nums != []

            return nums

        except (ConnectionError, ConnectTimeout, ReadTimeout, SSLError,
                ProxyError, RetryError, InvalidSchema) as err:
            pass
        except (ValueError, AttributeError, IndexError) as err:
            pass
        except AssertionError as err:
            pass
        except Exception as err:
            print err,traceback.format_exc()

    # Parse the pattern features of a number
    def parse_num(num):
        # num = 18522223333
        num_str = str(num)
        head = num_str[:3]
        mid = num_str[3:7]
        tail = num_str[-4:]

        match_dict = {'mid_match':mid, 'tail_match':tail}
        for k,v in match_dict.items():
            part_1, part_2, part_3, part_4 = [int(i) for i in v]
            if part_1-part_2==part_2-part_3==part_3-part_4== -1:
                match_dict[k] = 'ABCD'
            elif part_1-part_2==part_2-part_3==part_3-part_4== 1:
                match_dict[k] = 'DCBA'
            elif part_1==part_2==part_3==part_4:
                match_dict[k] = 'SSSS'
            elif part_2==part_3 and (part_1==part_2 or part_3==part_4):
                match_dict[k] = '3S'
            elif part_1==part_2 and part_3==part_4:
                match_dict[k] = 'XXYY'
            elif part_1==part_3 and part_2==part_4:
                match_dict[k] = 'XYXY'
            elif part_1==part_3 and k == 'mid_match':
                match_dict[k] = 'XYXZ'
            else:
                match_dict[k] = None

        match_dict.update(dict(num=num, head=int(head), mid=int(mid), tail=int(tail)))

        return match_dict

    def update_table_province():
        conn, curs = connect_db('mobile', Cursor)
        while True:
            try:
                results = result_queue.get()

                prefix = "insert into %s"%(results[0].get('province'))
                # num is UNIQUE: on a duplicate, update update_time and add 1 to times
                sql = (prefix+\
                """(china_id, num, update_time, head, mid, tail, mid_match, tail_match)
                values(%(id)s, %(num)s, %(update_time)s, %(head)s, %(mid)s, %(tail)s, %(mid_match)s, %(tail_match)s)
                ON DUPLICATE KEY UPDATE update_time=values(update_time), times=times+1""")
                curs.executemany(sql, results)

                conn.commit()
                result_queue.task_done()
            except Exception as err:
                # pass
                print err,traceback.format_exc()
                result_queue.put(results)
                try:
                    conn.close()
                except:
                    pass
                conn, curs = connect_db('mobile', Cursor)

    def update_table_china():
        dict_conn, dict_curs = connect_db('mobile', DictCursor)
        province_list = set([info.get('province') for info in table_china])
        while True:
            try:
                for province in province_list:
                    # Order by id descending first so the most recently obtained numbers come first
                    dict_curs.execute("""
                        SELECT china_id, count(*) AS num_count, update_time AS new_time
                        FROM (select * from %s order by id desc) AS temp
                        GROUP BY china_id;
                    """%(province))

                    # Fetch the rows before reusing dict_curs: executing on the same
                    # cursor while iterating over it discards the pending result set
                    rows = dict_curs.fetchall()
                    dict_curs.executemany("""
                        UPDATE china
                        SET num_count = %(num_count)s, new_time = %(new_time)s
                        WHERE id = %(china_id)s;
                    """, rows)
                    dict_conn.commit()

                    # Order by update time descending first
                    dict_curs.execute("""
                        SELECT china_id, update_time, num AS latest_num
                        FROM (select * from %s order by update_time desc) AS temp
                        GROUP BY china_id;
                    """%(province))

                    rows = dict_curs.fetchall()
                    dict_curs.executemany("""
                        UPDATE china
                        SET update_time = %(update_time)s, latest_num = %(latest_num)s
                        WHERE id = %(china_id)s;
                    """, rows)
                    dict_conn.commit()

                time.sleep(300)
            except Exception as err:
                # pass
                print err,traceback.format_exc()
                try:
                    dict_conn.close()
                except:
                    pass
                dict_conn, dict_curs = connect_db('mobile', DictCursor)

    if __name__ == '__main__':

        table_china = read_table_china()

        threads = []

        t = threading.Thread(target=add_task)  # args takes a tuple, at least (a,)
        threads.append(t)

        for i in range(500):
            t = threading.Thread(target=do_task)
            threads.append(t)

        for i in range(20):
            t = threading.Thread(target=update_table_province)
            threads.append(t)

        t = threading.Thread(target=update_table_china)
        threads.append(t)

        # for t in threads:
        #     t.setDaemon(True)
        #     t.start()
        # while True:
        #     pass

        for t in threads:
            t.start()
        # for t in threads:
        #     t.join()

        # Keep the main thread alive without burning CPU
        while True:
            time.sleep(1)
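
The pattern rules in parse_num are easier to follow in isolation. The sketch below restates them as a standalone function that runs on Python 2 or 3; the function name classify and its is_mid flag are assumptions introduced for the illustration, while the rules and labels mirror those in the full code.

```python
def classify(part, is_mid=False):
    """Return the pattern label for a 4-digit string, or None if none applies."""
    a, b, c, d = [int(ch) for ch in part]
    if a - b == b - c == c - d == -1:
        return 'ABCD'   # ascending run, e.g. 1234
    if a - b == b - c == c - d == 1:
        return 'DCBA'   # descending run, e.g. 4321
    if a == b == c == d:
        return 'SSSS'   # four identical digits
    if b == c and (a == b or c == d):
        return '3S'     # three identical digits in a row
    if a == b and c == d:
        return 'XXYY'
    if a == c and b == d:
        return 'XYXY'
    if a == c and is_mid:
        return 'XYXZ'   # only reported for the middle four digits
    return None

num = 18522223333  # placeholder number from the comment in parse_num
num_str = str(num)
head, mid, tail = num_str[:3], num_str[3:7], num_str[-4:]
print(head, mid, tail, classify(mid, is_mid=True), classify(tail))
```

The order of the checks matters: SSSS has to be tested before 3S and XXYY, because four identical digits also satisfy both of those looser conditions.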

9. Results
