Python Web Scraping: Multithreading


Follow the WeChat official account "轻松学编程" (Easy Programming) to learn more.

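Multithreading

Before introducing threads in Python, one point needs to be clear: multithreading in Python (more precisely, in the official CPython interpreter) is often called "fake" multithreading. The threads are real operating-system threads, but only one of them can execute Python bytecode at any moment.
To see why, we first need to understand one concept: the Global Interpreter Lock (GIL).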

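I. What is the GIL

Python code is executed by the Python virtual machine (the interpreter), and only one thread executes at any given time. Access to the Python virtual machine is controlled by the Global Interpreter Lock (GIL); this lock is what guarantees that only one thread runs at a time.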

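II. Why the GIL is used

The GIL keeps data consistent and state synchronized between threads. Inside the interpreter this mainly means protecting internal state, such as object reference counts, that is not thread-safe on its own. As an analogy for why shared state needs protection: suppose thread 2 needs a result produced by thread 1, but thread 2 takes less time to run than thread 1. If thread 2 finishes while thread 1 is still running, it has worked with a result that was not ready yet. This is the data synchronization problem. Note that the GIL protects only the interpreter's own state; it does not make your code thread-safe, as the thread-conflict section later in this article shows.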

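III. The impact of the GIL

Only one thread runs at a time, so multiple cores cannot be used.

In a multithreaded environment, the Python virtual machine executes as follows.

1. Acquire the GIL.
2. Switch to a thread to execute.
3. Run it.
4. Put the thread to sleep.
5. Release the GIL.
6. Repeat the steps above.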
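
How long the "run" step lasts before the interpreter asks the running thread to give up the GIL is configurable. As a minimal sketch, assuming CPython 3.2 or later (where `sys.getswitchinterval()` and `sys.setswitchinterval()` exist), the current value can be inspected like this:

```python
import sys

# The switch interval is the time slice (in seconds) after which CPython
# asks the running thread to release the GIL so another thread can run.
interval = sys.getswitchinterval()
print("GIL switch interval: %.4f s" % interval)

# It can be tuned with sys.setswitchinterval(); here we only write back the
# value we just read, so nothing actually changes.
sys.setswitchinterval(interval)
```

A thread also gives up the GIL early whenever it blocks, for example on I/O or in `time.sleep()`, so the interval is only an upper bound for threads that never block.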
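Suppose I have a 4-core CPU. Normally each core runs one thread at a time, so four threads can truly run in parallel, with time slices rotating among any additional threads.
Python is different: no matter how many cores you have, only one thread can run across all of them at any moment, and the threads take turns by time slice.
Each thread runs for a while and then yields, so threads in Python can only execute alternately; even with 10 cores, only 1 core's worth of computation is used.
For example, with threads (CPU usage: 30%):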

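from threading import Thread

def loop():
    # Busy loop that prints forever
    while True:
        print("Honey, I was wrong. Can I eat now?")

if __name__ == '__main__':
    # Start 3 threads; they all compete for the single GIL
    for i in range(3):
        t = Thread(target=loop)
        t.start()
    # Keep the main thread busy as well
    while True:
        pass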

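What if we switch to processes? CPU usage: 100%

from multiprocessing import Process

def loop():
    # Busy loop that prints forever
    while True:
        print("Honey, I was wrong. Can I eat now?")

if __name__ == '__main__':
    # Start 3 processes; each has its own interpreter and its own GIL
    for i in range(3):
        t = Process(target=loop)
        t.start()
    # Keep the main process busy as well
    while True:
        pass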

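IV. How multithreaded code can use multiple cores

1. Replace the interpreter (the official one is CPython) with one that has no GIL, such as Jython or IronPython. Note that PyPy, which is often suggested for this, also has a GIL.
2. Call libraries written in C, which can release the GIL while they work.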
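
As an illustration of the second option, here is a minimal sketch using the standard `hashlib` module, which is implemented in C and, according to its documentation, releases the GIL while hashing buffers larger than 2047 bytes. The `digest` helper and the buffer sizes are assumptions made up for this example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(data):
    # The hashing itself runs in C with the GIL released (for large
    # buffers), so several of these calls can overlap on different cores.
    return hashlib.sha256(data).hexdigest()

# Hypothetical workload: four 4 MB buffers with different contents.
buffers = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]

# Hash the buffers in four threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))

# Hash them again one after another to check that the results agree.
sequential = [digest(b) for b in buffers]
print(threaded == sequential)
```

Whether this actually runs faster than the sequential version depends on the machine and the buffer size; the point is only that threads calling into such C code are not serialized by the GIL.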

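V. CPU-bound (compute-intensive) and I/O-bound tasks

CPU-bound tasks mainly consume CPU resources, so the execution speed of the code is critical; they are best written in C.
I/O-bound tasks are those that involve network or disk I/O. They use very little CPU and spend most of their time waiting for I/O operations to complete. Since 99% of the time goes to I/O, a scripting language such as Python is the first choice because it is fast to develop in, while C is the worst choice because it gains almost no speed at a much higher development cost.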
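
For I/O-bound work such as a crawler, threads help even with the GIL, because a thread that is waiting releases it. The following is a minimal sketch in which `time.sleep` stands in for a network request; `fake_download` and the page numbers are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(page):
    # time.sleep stands in for waiting on the network; a sleeping thread
    # releases the GIL, so the other threads can run in the meantime.
    time.sleep(0.2)
    return "page-%d" % page

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns the results in the order of the inputs
    results = list(pool.map(fake_download, range(5)))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2f s" % elapsed)
```

Run sequentially, the five calls would wait about 1 second in total; with five threads the waits overlap, so the elapsed time should be much closer to a single wait.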

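VI. Creating threads

##### 1. Starting a child thread with _thread.start_new_thread

Threads created this way behave like daemon threads (when the lord dies, the guards go with him): when the main thread dies, the child threads die too, whether or not they have finished. Note: in Python 3 the old thread module was renamed to _thread to mark it as a low-level interface, and the threading module is recommended instead, so this approach may cause errors in use.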

import _thread
import threading
import time

def doSth(arg):
    # Get the current thread's name and thread id
    threadName = threading.current_thread().name
    tid = threading.current_thread().ident
    for i in range(5):
        print("%s *%d @%s,tid=%d" % (arg, i, threadName, tid))
        time.sleep(2)

def simpleThread():
    # Create a child thread that runs doSth.
    # A thread created this way behaves like a daemon thread:
    # when the main thread dies, it dies as well
    _thread.start_new_thread(doSth, ("child thread started",))
    mainThreadName = threading.current_thread().name
    print(threading.current_thread())
    for i in range(5):
        print("I am the main thread @%s" % (mainThreadName))
        time.sleep(1)
    # Block the main thread so that the child thread can finish
    # (this loop never ends; stop the program with Ctrl+C)
    while True:
        time.sleep(1)

if __name__ == '__main__':
    simpleThread()

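##### 2. Creating a threading.Thread object

A thread created this way is not a daemon thread by default. This can be changed by setting t.daemon = True before the thread is started (the older setDaemon(True) method is deprecated since Python 3.10).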



import threading
import time

def doSth(arg):
    # Get the current thread's name and thread id
    threadName = threading.current_thread().name
    tid = threading.current_thread().ident
    for i in range(5):
        print("%s *%d @%s,tid=%d" % (arg, i, threadName, tid))
        time.sleep(2)

def threadingThread():
    # Not a daemon thread by default
    # args must be a tuple, hence the trailing comma
    t = threading.Thread(target=doSth, args=('I am the child thread',))
    # t.daemon = True  # make it a daemon thread
    # Set the child thread's name
    t.name = 'worker'
    # Start the thread; run() is called in the new thread
    t.start()
    # Wait for the child thread to finish
    t.join()
    # Read the thread's name
    print(t.name, 'finished')

if __name__ == '__main__':
    threadingThread()
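
The difference between a normal and a daemon thread can be checked directly. This is a small sketch, assuming Python 3.3 or later (where `Thread` accepts the `daemon` keyword); `background_task` is a hypothetical placeholder job.

```python
import threading
import time

def background_task():
    # Hypothetical short job; a daemon thread that is still running
    # is abandoned when the main thread exits.
    time.sleep(0.1)

normal = threading.Thread(target=background_task)
daemon = threading.Thread(target=background_task, daemon=True)

print(normal.daemon)   # inherited from the thread that created it
print(daemon.daemon)

normal.start()
daemon.start()
# join() with a timeout waits at most that many seconds and then returns,
# whether or not the thread has finished.
normal.join(timeout=2)
daemon.join(timeout=2)
print(normal.is_alive(), daemon.is_alive())
```

Setting the flag through the constructor or the `daemon` attribute has to happen before `start()`; changing it on a running thread raises `RuntimeError`.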

##### 3. Subclassing threading.Thread and creating objects from the subclass

Override the parent class's run method.



import threading
import time

def doSth(arg):
    # Get the current thread's name and thread id
    threadName = threading.current_thread().name
    tid = threading.current_thread().ident
    print("%s  @%s,tid=%d" % (arg, threadName, tid))
    time.sleep(2)

class MyThread(threading.Thread):
    def __init__(self, name):
        super().__init__()
        # Overrides the name set by the parent class
        self.name = name

    # Override the parent class's run method.
    # The body of run is the logic that executes in the child thread;
    # it is triggered by thread.start()
    def run(self):
        print(threading.current_thread().name)
        # True means this is a daemon thread
        print(threading.current_thread().daemon)
        # threading.current_thread().ident is the thread id
        doSth("thread id is %d" % threading.current_thread().ident)

if __name__ == '__main__':
    for i in range(5):
        mt = MyThread('thread-%d' % i)
        # Start the thread
        mt.start()
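
One practical reason to subclass is that `run()` cannot return a value to the caller, so results are usually stored on the thread object. The sketch below illustrates that pattern; `SquareThread` and its `result` attribute are hypothetical names, not part of the threading module.

```python
import threading

class SquareThread(threading.Thread):
    # Hypothetical subclass: computes n * n in the child thread
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.result = None

    def run(self):
        # The return value of run() is discarded, so keep the result on self
        self.result = self.n * self.n

threads = [SquareThread(n) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    # After join() returns, run() has finished and result is set
    t.join()

results = [t.result for t in threads]
print(results)
```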

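##### 4. Several important APIs

import threading
import time

def doSth(arg):
    # Placeholder task so that this example runs on its own
    print(arg)
    time.sleep(2)

def importantAPI():
    print(threading.current_thread())  # Returns the current Thread object
    # Create three child threads
    t1 = threading.Thread(target=doSth, args=("patrol the mountains",))
    t2 = threading.Thread(target=doSth, args=("patrol the waters",))
    t3 = threading.Thread(target=doSth, args=("patrol the birds",))
    t1.start()  # Start the thread
    t2.start()
    t3.start()
    # isAlive() was removed in Python 3.9; use is_alive()
    print(t1.is_alive())  # Whether the thread is alive
    print(t2.daemon)  # Whether it is a daemon thread
    print(t3.name)  # The thread's name
    t3.name = "bird patrol"  # Set the thread's name
    print(t3.name)
    print(t3.ident)  # The thread id
    # Returns a list of the threads that are currently alive
    tlist = threading.enumerate()
    print("Active threads:", tlist)
    # Returns the number of threads currently alive (equal to len(tlist))
    count = threading.active_count()
    print("There are %d active threads" % (count))

if __name__ == '__main__':
    importantAPI()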

VII. Thread conflicts

1. Example:

import threading

money = 0

def addMoney():
    global money
    for i in range(10000000):
        money += 1
    print(money)

if __name__ == '__main__':
    # addMoney()
    # Two threads increment the same global variable without any locking
    for i in range(2):
        t = threading.Thread(target=addMoney)
        t.start()

Output:

11769218

12363994

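If the two threads did not interfere with each other (for example, if they ran one after the other), the output would be: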

10000000

20000000

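Cause: money += 1 is not a single atomic step. The thread reads money, adds 1, and writes the result back, and the time slice a thread gets is not enough to finish all ten million additions. A thread can therefore be interrupted after it has read the value but before the result has been saved back to memory, and the update made by the other thread in the meantime is lost.

Because multiple threads access the same variable concurrently and interfere with each other, the output is wrong.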
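
That `money += 1` consists of several steps can be seen in its bytecode. The sketch below uses the standard `dis` module; the function name `add_one` is made up for illustration, and the exact instruction names vary between Python versions.

```python
import dis

money = 0

def add_one():
    global money
    money += 1

# Collect the names of the bytecode instructions that make up add_one.
ops = [ins.opname for ins in dis.get_instructions(add_one)]
print(ops)
```

On CPython the list contains a `LOAD_GLOBAL` that reads `money`, an instruction that performs the addition, and a `STORE_GLOBAL` that writes the result back. A thread switch between the load and the store is what lets one thread overwrite the other's update.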

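2. Resolving conflicts with a mutex

Mutex (mutual exclusion lock)
States: locked / unlocked
Creating a lock: lock = threading.Lock()

acquire() and release() must come in pairs:

if lock.acquire():
    money += 1
    lock.release()

Or let a with statement manage the lock:

with lock:
    money += 1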

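import threading
import time

money = 0
# Create the thread lock
lock = threading.Lock()

def addMoney():
    global money
    for i in range(10000000):
        money += 1
    print(money)

def addMoneyLock():
    global money
    if lock.acquire():
        # ----- The code below runs only while holding the lock -----
        for i in range(10000000):
            money += 1
        # Release the lock so that other threads can acquire it and run
        lock.release()
        # ---------------- The lock has been released ----------------
    print(money)

def addMoneyWithLock():
    time.sleep(1)
    global money
    # Hold the lock exclusively
    with lock:  # Blocks until the lock is acquired
        # ----- The code below runs only while holding the lock -----
        for i in range(1000000):
            money += 1
    # ---- Leaving the with block releases the lock automatically ----
    print(money)

# 5 threads access money at the same time, so the result is wrong
def conflictDemo():
    for i in range(5):
        t = threading.Thread(target=addMoney)
        t.start()

# Resolve the conflict by having the threads take the lock in turn
def handleConflictByLock():
    # Run 5 threads concurrently
    threads = []
    for i in range(5):
        t = threading.Thread(target=addMoneyWithLock)
        t.start()
        threads.append(t)
    # Wait for all threads so that the timing below covers their work
    for t in threads:
        t.join()

if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; use time.perf_counter()
    start = time.perf_counter()
    # conflictDemo()
    handleConflictByLock()
    print(time.perf_counter() - start)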
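
To confirm that a lock really removes the lost updates, the final total can be checked after all threads have been joined. This is a compact sketch with smaller numbers so that it runs quickly; `counter` and `add` are illustrative names.

```python
import threading

counter = 0
lock = threading.Lock()

def add(times):
    global counter
    for _ in range(times):
        # Lock only around the increment to keep the critical section short
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every increment happened under the lock, so none was lost.
print(counter)
```

Locking around each increment lets the threads interleave, at the price of many acquire and release calls. Holding the lock for the whole loop, as in the example above, is faster but makes the threads run strictly one after another.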

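3. Resolving conflicts with a reentrant lock (RLock)

A thread that tries to acquire an ordinary Lock it already holds blocks forever and deadlocks itself. A reentrant lock (RLock) avoids this: the thread that holds it can acquire it again, as long as every acquire is matched by a release. It does not, however, prevent the deadlock in which two threads each hold a resource that the other one needs.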

import threading

money = 0
# Create the reentrant lock
rlock = threading.RLock()

def addMoney():
    global money
    # Only one thread at a time can be inside this block
    with rlock:
        for i in range(10000000):
            money += 1
    print(money)

if __name__ == '__main__':
    for i in range(5):
        t = threading.Thread(target=addMoney)
        t.start()
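
The example above would work equally well with a plain Lock. What makes an RLock different shows up only when the thread that holds the lock acquires it again, as in this sketch (`deposit` and `deposit_twice` are hypothetical helpers):

```python
import threading

rlock = threading.RLock()
balance = 0

def deposit(amount):
    global balance
    with rlock:
        balance += amount

def deposit_twice(amount):
    # This thread already holds rlock when it calls deposit(), which
    # acquires rlock again. With a plain threading.Lock the second
    # acquire would block forever.
    with rlock:
        deposit(amount)
        deposit(amount)

t = threading.Thread(target=deposit_twice, args=(50,))
t.start()
t.join()
print(balance)
```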

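4. Resolving conflicts through thread synchronization

Block with t.join(): each thread is started only after the previous one has finished, so the threads effectively run one after another.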

import threading
import time

money = 0

def addMoney():
    global money
    for i in range(10000000):
        money += 1
    print(money)

# Resolve the conflict by synchronizing the threads (running them in turn)
def handleConflictBySync():
    for i in range(5):
        t = threading.Thread(target=addMoney)
        t.start()
        t.join()  # Block until t has finished

if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; use time.perf_counter()
    start = time.perf_counter()
    handleConflictBySync()
    print(time.perf_counter() - start)

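VIII. Scheduling threads with Semaphore: limiting maximum concurrency

Parallelism: multiple threads really run at the same time.

Concurrency: pseudo-parallelism; multiple threads are started in the same period and take turns executing.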

import threading
import time

# The value passed to Semaphore is the maximum number of threads
# allowed inside the guarded block at the same time
sem = threading.Semaphore(3)

'''
sem.acquire() # take one slot (blocks when none is free)
sem.release() # give the slot back
'''

def doSth(arg):
    with sem:
        tname = threading.current_thread().name
        print("%s is running [%s]" % (tname, arg))
        time.sleep(1)
        print("-----%s finished!-----\n" % (tname))
        time.sleep(0.1)

if __name__ == '__main__':
    threadList = []
    for i in range(10):
        t = threading.Thread(target=doSth, args=(i,))
        t.start()
        threadList.append(t)
    # Make sure the child threads finish normally
    for t in threadList:
        t.join()
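
To check that the semaphore really caps concurrency, the number of threads inside the guarded block can be counted. The sketch below is one way to do that; `MAX_WORKERS`, `running`, and `peak` are names invented for this example, and a separate Lock protects the counters.

```python
import threading
import time

MAX_WORKERS = 3
sem = threading.BoundedSemaphore(MAX_WORKERS)
state_lock = threading.Lock()   # protects running and peak
running = 0                     # threads currently inside the semaphore
peak = 0                        # highest value running ever reached

def worker():
    global running, peak
    with sem:
        with state_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)        # simulated work
        with state_lock:
            running -= 1

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak concurrency:", peak)
```

`BoundedSemaphore` is used here instead of `Semaphore` because it raises `ValueError` if it is released more often than it was acquired, which catches pairing mistakes.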

IX. The producer-consumer model

Threads communicate through threading.Condition.

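'''
Producer-consumer model
'''
import random
import threading

# The token the threads use to communicate
condition = threading.Condition()
# Container for the products
pList = []

class Product():
    '''
    Product class
    '''
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "%s products" % self.name

class Producer(threading.Thread):
    '''
    Producer
    '''
    def run(self):
        while True:
            # Produce a product
            with condition:
                p = Product(random.randint(100, 1000))
                print("Produced:", p)
                # Put it into the container
                pList.append(p)
                # Notify the consumer: whoever called wait() gets notified
                condition.notify()
                # Wait for the consumer's notification: calling wait() means
                # asking to be notified (wait() releases condition while waiting)
                condition.wait()
            # Leaving the with block hands condition over;
            # at this point it has been released (condition.release())

class Consumer(threading.Thread):
    '''
    Consumer
    '''
    def run(self):
        while True:
            # Take a product
            with condition:
                try:
                    p = pList.pop()
                    print("Consumer consumed:", p)
                    # Notify the producer to produce: whoever is waiting
                    # on the same condition gets notified
                    condition.notify()
                    # Wait for the producer's message
                    # (wait() releases condition while waiting)
                    condition.wait()
                except IndexError:
                    # pList.pop() raises IndexError when the list is empty
                    print("No products")
            # At this point condition has been released (condition.release())

if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.start()
    c.start()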
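
The same model is often written with the standard `queue.Queue` instead of a hand-managed Condition, since the queue does the waiting and notifying internally. The following sketch uses that different technique, not the Condition approach above, and it stops after five products so that it terminates; `SENTINEL` and `consumed` are names made up for the example.

```python
import queue
import threading

q = queue.Queue(maxsize=2)   # small buffer, so the producer sometimes waits
SENTINEL = None              # hypothetical marker meaning "no more products"
consumed = []

def producer():
    for i in range(5):
        q.put(i)             # blocks while the queue is full
    q.put(SENTINEL)          # tell the consumer to stop

def consumer():
    while True:
        item = q.get()       # blocks while the queue is empty
        if item is SENTINEL:
            break
        consumed.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start()
c.start()
p.join()
c.join()
print(consumed)
```

With a single producer and a single consumer the items arrive in the order they were put in, because `queue.Queue` is first-in, first-out.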

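Afterword

[Afterword] To help everyone learn programming easily, I created a WeChat official account, 【轻松学编程】 (Easy Programming). It has articles that help you learn programming quickly, some in-depth material to raise your programming skills, and some programming projects suitable for course assignments and similar work.

You can also add me on WeChat 【1257309054】 and I will add you to the group, so that we can all exchange ideas and learn together.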