Python Multiprocessing in Practice

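An earlier post of mine covered multithreading in Python in detail:

Processes, Threads, the GIL, Python Multithreading, and the Producer-Consumer Model: What Are They All?

However, because of the GIL, Python threads cannot take full advantage of multiple CPU cores. To use all the cores, we can turn to multiple processes.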

1. Parent and child processes

Wikipedia defines parent and child processes as follows:

a）Parent process

In Unix-like operating systems, every process except process 0 (the swapper) is created when another process executes the fork() system call. The process that invoked fork is the parent process and the newly created process is the child process. Every process (except process 0) has one parent process, but can have many child processes.

In the Linux kernel, in which there is a very slim difference between processes and POSIX threads, there are two kinds of parent processes, namely real parent and parent. Parent is the process that receives the SIGCHLD signal on child's termination, whereas real parent is the thread that actually created this child process in a multithreaded environment. For a normal process, both these two values are same, but for a POSIX thread which acts as a process, these two values may be different.^[1]

b）Child process

A child process in computing is a process created by another process (the parent process). This technique pertains to multitasking operating systems, and is sometimes called a subprocess or traditionally a subtask.

There are two major procedures for creating a child process: the fork system call (preferred in Unix-like systems and the POSIX standard) and the spawn (preferred in the modern (NT) kernel of Microsoft Windows, as well as in some historical operating systems).

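In other words, Unix/Linux provides the fork() system call for creating child processes, and fork() is unusual. An ordinary function is called once and returns once, but fork() is called once and returns twice: the operating system copies the current process (the parent) to create a new one (the child), and the call then returns in both. In the child, fork() always returns 0; in the parent, it returns the child's process ID. The reason is that one parent can fork many children, so it needs to keep track of each child's ID, while a child can always find its parent by calling getppid().

Python's os module exposes fork():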

#!/usr/bin/env python
# coding: utf-8

import os

print('Process %s start...' % os.getpid())

# fork() returns twice: 0 in the child, the child's pid in the parent
pid = os.fork()

if pid == 0:
    print('i am child process %s and my parent is %s' % (os.getpid(), os.getppid()))
else:
    print('i %s just created a child process %s' % (os.getpid(), pid))

Output:

Process 3522 start...

i 3522 just created a child process 3523

i am child process 3523 and my parent is 3522

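fork() is called once but returns twice, which is why both branches print. Note that Windows has no fork() call, so the code above does not run on Windows. With fork(), a process that receives a new task can copy itself and let the child handle that task. The classic example is a forking web server: the parent process listens on the port and hands each new HTTP request to a forked child (Apache's prefork MPM is a well-known variant that forks its workers ahead of time).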
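The point that a parent has to remember each child's ID can be sketched as follows. This is only a minimal illustration, assuming a Unix-like system; the helper name fork_children is made up for this post. The parent stores every pid that fork() returns and later uses os.waitpid() to collect each child's exit code.

```python
import os


def fork_children(count):
    """Fork `count` children and return {pid: exit_code} once all have exited."""
    pids = []
    for index in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: fork() returned 0. Exit right away with a distinct code
            # so the child never continues the parent's loop.
            os._exit(index)
        # Parent: fork() returned the child's pid, so remember it.
        pids.append(pid)

    results = {}
    for pid in pids:
        # waitpid() blocks until that child exits and reaps it (no zombie left).
        _, status = os.waitpid(pid, 0)
        results[pid] = os.WEXITSTATUS(status)
    return results


if __name__ == '__main__':
    print(fork_children(3))
```

The child calls os._exit() rather than returning, so that it never runs code meant for the parent.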

2. multiprocessing

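As noted above, Windows has no fork() call. So how do we run multiple processes on Windows?

With the multiprocessing module. Python is cross-platform, so it naturally provides cross-platform support for multiple processes, and multiprocessing is that module.

Two modules are commonly used for working with processes in Python: subprocess and multiprocessing. subprocess is normally used to run external programs, such as third-party applications, rather than Python code. If you also need to monitor what you launch, the third-party psutil module is worth a look: its psutil.Popen class wraps subprocess.Popen, and the library can additionally report on the host and on the processes it started (network, CPU, memory usage, and so on), which makes it more complete for operations automation. multiprocessing is Python's multi-process module: it starts Python processes and runs the target callable in them to do the work.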
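As a rough sketch of the subprocess use case, the snippet below runs an external command and captures its output. The "external program" here is simply the current Python interpreter (sys.executable), chosen only so that the example runs anywhere; the helper name run_external is made up for illustration, and capture_output/text assume Python 3.7 or later.

```python
import subprocess
import sys


def run_external(args):
    """Run an external command and return (exit code, captured stdout)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout


if __name__ == '__main__':
    # Assumption: the interpreter itself stands in for a third-party program.
    code, out = run_external([sys.executable, '-c', 'print("hello from child")'])
    print(code, out.strip())
```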

Note: the multiprocessing API closely mirrors the threading API, so this post only gives example code without going through each call in detail.

1）Basic usage of multiprocessing

As with threading, there are two styles.

a）Calling Process directly

from multiprocessing import Process, freeze_support
import os

processes = []

def run(item):
    print('-'*50)
    print('child process %s id: %s'%(item, os.getpid()))
    print('child process %s parent id: %s' % (item, os.getppid()))

def main():
    # print the main process id
    print('main process id: ', os.getpid())
    # create the child processes (pid stays None until start() is called)
    for item in range(2):
        p = Process(target=run, args=(item, ))
        processes.append(p)
        print('child process %s name: %s' % (item, p.name))
        print('child process %s id: %s' % (item, p.pid))
    for item in processes:
        item.start()
    for item in processes:
        item.join()

if __name__ == '__main__':
    # only matters for frozen Windows executables; must come first in this block
    freeze_support()
    main()

b）Object-oriented style: subclassing Process

from multiprocessing import Process, freeze_support
import os

processes = []

class MyProcess(Process):
    def __init__(self, func, item):
        super(MyProcess, self).__init__()
        self.__func = func
        self.__item = item

    # run() is what the child process executes after start()
    def run(self):
        self.__func(self.__item)

def proc(item):
    print('-'*50)
    print('child process %s id: %s'%(item, os.getpid()))
    print('child process %s parent id: %s' % (item, os.getppid()))

def main():
    # print the main process id
    print('main process id: ', os.getpid())
    # create the child processes (pid stays None until start() is called)
    for item in range(2):
        p = MyProcess(proc, item)
        processes.append(p)
        print('child process %s name: %s' % (item, p.name))
        print('child process %s id: %s' % (item, p.pid))
    for item in processes:
        item.start()
    for item in processes:
        item.join()

if __name__ == '__main__':
    # only matters for frozen Windows executables; must come first in this block
    freeze_support()
    main()

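Note: freeze_support() is only needed when the program is frozen into a Windows executable (for example with py2exe or PyInstaller); in a normal run it does nothing. When it is needed, it should be the first statement inside the if __name__ == '__main__' block.

Output: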

main process id:  10972

child process 0 name: MyProcess-1

child process 0 id: None

child process 1 name: MyProcess-2

child process 1 id: None

--------------------------------------------------

child process 0 id: 10636

child process 0 parent id: 10972

--------------------------------------------------

child process 1 id: 8076

child process 1 parent id: 10972

2）Setting the daemon attribute

from multiprocessing import Process
import time

processes = []

def run(item):
    time.sleep(1)
    print('item: ', item)

def main():
    # create the child processes
    for item in range(2):
        p = Process(target=run, args=(item, ))
        processes.append(p)
        # daemon must be set before start()
        p.daemon = True
    for item in processes:
        item.start()
    print('all done')

if __name__ == '__main__':
    main()

Output:

all done

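Only "all done" is printed: daemonic children are terminated as soon as the main process exits, so they never get past the sleep. Note that daemon is set here as an attribute rather than through a method call such as threading's older setDaemon(), and it must be set before start() is called.

3）Process synchronization

If processes do not share data, why would they need synchronization at all? Consider several processes writing to the same file, or printing to the same terminal. Access to such shared resources still has to be coordinated, and a Lock does that.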
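Here is a minimal sketch of the shared-file case, under a few assumptions of mine: the file is a temporary file created just for the demo, and the names append_lines and write_with_lock are made up. Each worker takes the lock before appending a line, so only one process writes at a time.

```python
import os
import tempfile
from multiprocessing import Lock, Process


def append_lines(lock, path, worker_id, count):
    """Append `count` lines to the shared file, one locked write at a time."""
    for n in range(count):
        # The lock arrives as an argument; `with` acquires and releases it.
        with lock:
            with open(path, 'a') as f:
                f.write('worker %d line %d\n' % (worker_id, n))


def write_with_lock(workers=3, count=5):
    """Start several writers on one file and return the lines they produced."""
    lock = Lock()
    fd, path = tempfile.mkstemp()  # demo-only file, removed below
    os.close(fd)
    procs = [Process(target=append_lines, args=(lock, path, i, count))
             for i in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    with open(path) as f:
        lines = f.read().splitlines()
    os.remove(path)
    return lines


if __name__ == '__main__':
    for line in write_with_lock():
        print(line)
```

The order of the lines depends on scheduling, but each line is written while the lock is held.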

4）Semaphore

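Used the same way as threading.Semaphore(), except that the Semaphore has to be passed to the child processes as an argument, because processes do not share objects.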
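A small sketch of passing a Semaphore to child processes, assuming we want at most two workers inside a section at once. The two Value counters are my own addition, used only to observe how many workers were inside at the same time; worker and run_limited are made-up names.

```python
import time
from multiprocessing import Process, Semaphore, Value


def worker(sem, active, peak):
    """Hold the semaphore briefly and record how many workers hold it at once."""
    with sem:  # blocks while `limit` other workers are already inside
        with active.get_lock():
            active.value += 1
            peak.value = max(peak.value, active.value)
        time.sleep(0.2)  # stand-in for using the limited resource
        with active.get_lock():
            active.value -= 1


def run_limited(total=4, limit=2):
    """Run `total` workers with at most `limit` inside at once; return the peak."""
    sem = Semaphore(limit)
    active = Value('i', 0)  # shared counters, used only to observe the effect
    peak = Value('i', 0)
    procs = [Process(target=worker, args=(sem, active, peak))
             for _ in range(total)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return peak.value


if __name__ == '__main__':
    print('peak concurrency:', run_limited())
```

The reported peak never exceeds the semaphore's limit, however many workers are started.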

5）Event

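Used the same way as threading.Event(), except that the Event has to be passed to the child processes as an argument.

6）Inter-process communication

Processes do not share memory. Here is an example that demonstrates it: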
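A minimal sketch of an Event shared with a child process. The child blocks in wait() until the parent calls set(); the Value flag is my own addition so that the parent can see whether the child has passed the wait, and the names waiter and run_event_demo are made up.

```python
import time
from multiprocessing import Event, Process, Value


def waiter(event, done):
    """Block until the event is set, then raise the shared flag."""
    event.wait()
    done.value = 1


def run_event_demo():
    """Return the flag as seen before and after the parent sets the event."""
    event = Event()
    done = Value('i', 0)
    p = Process(target=waiter, args=(event, done))
    p.start()
    time.sleep(0.2)       # give the child time to reach wait()
    before = done.value   # still 0: the child cannot pass wait() yet
    event.set()           # release the child
    p.join()
    return before, done.value


if __name__ == '__main__':
    print(run_event_demo())
```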

 from multiprocessing import Process

 processes = []

 data_list = []

 def run(lst, item):

     lst.append(item)

     print('%s : %s' % (item, lst))

 def main():

     for item in range(4):

         p = Process(target=run, args=(data_list, item))

         processes.append(p)

     for item in processes:

         item.start()

     for item in processes:

         item.join()

     print('final lst: ', data_list)

 if __name__ == '__main__':

     main()

Output:

1 : [1]

2 : [2]

0 : [0]

3 : [3]

final lst:  []

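Each child appended to its own copy of the list, and the parent's list stayed empty. Inter-process communication therefore has to go through an intermediary. Three options are described below.

a）Queue

Used the same way as queue.Queue is with threads, except that the queue has to be passed to the child processes as an argument.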

from multiprocessing import Process, Queue
import time

def put(q):
    for i in range(3):
        q.put(i)
    # qsize() is approximate and raises NotImplementedError on macOS
    print('queue size after put: %s' % q.qsize())

def get(q):
    print('queue size before get: %s' % q.qsize())
    # empty() is only reliable here because the producer has already finished
    while not q.empty():
        print('queue get: ', q.get())

def main():
    q = Queue(10)
    p_put = Process(target=put, args=(q,))
    p_get = Process(target=get, args=(q,))
    p_put.start()
    # give the producer time to finish before the consumer starts
    time.sleep(1)
    p_get.start()
    p_put.join()
    p_get.join()
    print('all done')

if __name__ == '__main__':
    main()

Output:

queue size after put: 3

queue size before get: 3

queue get:  0

queue get:  1

queue get:  2

all done
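The example above relies on time.sleep() and q.empty(), which only works because the producer is known to have finished. A more robust pattern, sketched here under my own assumptions, is to send an end-of-stream marker through the queue. The SENTINEL value and the function names are made up for illustration.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # hypothetical end-of-stream marker


def producer(q, count):
    """Put `count` items on the queue, then signal the end of the stream."""
    for i in range(count):
        q.put(i)
    q.put(SENTINEL)


def consumer(q, out):
    """Read items until the sentinel arrives, forwarding each one times ten."""
    while True:
        item = q.get()  # blocks until an item is available
        if item is SENTINEL:
            break
        out.put(item * 10)
    out.put(SENTINEL)


def run_pipeline(count=3):
    """Run producer and consumer in separate processes and collect the output."""
    q, out = Queue(), Queue()
    procs = [Process(target=producer, args=(q, count)),
             Process(target=consumer, args=(q, out))]
    for p in procs:
        p.start()
    results = []
    while True:
        item = out.get()
        if item is SENTINEL:
            break
        results.append(item)
    # Drain the queue before join(), so no child is stuck flushing its data.
    for p in procs:
        p.join()
    return results


if __name__ == '__main__':
    print(run_pipeline())
```

With this pattern the consumer can start at any time and needs neither sleep() nor empty().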

b）Pipe

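Pipe() returns a pair (conn1, conn2) representing the two ends of a pipe. It takes a duplex parameter: when duplex is True (the default) the pipe is bidirectional, so both conn1 and conn2 can send and receive. When duplex is False, conn1 can only receive messages and conn2 can only send them.

send() and recv() send and receive messages. In duplex mode, for example, you can call conn1.send() to send a message and conn1.recv() to receive one. recv() blocks until there is something to receive, and raises EOFError once there is nothing left to receive and the other end has been closed.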

 import multiprocessing

 import time

 pipe = multiprocessing.Pipe()

 def send(pipe):

     for i in range(5):

         print("send: %s" % (i,))

         pipe.send(i)

         time.sleep(0.2)

 def recv_1(pipe):

     while True:

         print("rev_1:", pipe.recv())

         time.sleep(1)

 def recv_2(pipe):

     while True:

         print("rev_2:", pipe.recv())

         time.sleep(1)

 def main():

     p_send = multiprocessing.Process(target=send, args=(pipe[0],))

     p_recv_1 = multiprocessing.Process(target=recv_1, args=(pipe[1],))

     p_recv_2 = multiprocessing.Process(target=recv_2, args=(pipe[1],))

     p_send.start()

     p_recv_1.start()

     p_recv_2.start()

     p_send.join()

     p_recv_1.join()

     p_recv_2.join()

 if __name__ == "__main__":

     main()

Output:

send: 0

rev_1: 0

send: 1

rev_2: 1

send: 2

send: 3

send: 4

rev_1: 2

rev_2: 3

rev_1: 4
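Note that the example above never exits, because both receivers loop forever, and that the multiprocessing documentation warns that data may become corrupted if two processes read from the same end of a pipe at the same time. Below is a minimal sketch of a variant that shuts down cleanly, assuming a single reader and a one-way pipe; the names producer and consume_all are made up. The reader stops when recv() raises EOFError.

```python
from multiprocessing import Pipe, Process


def producer(conn, count):
    """Send `count` integers, then close the sending end."""
    for i in range(count):
        conn.send(i)
    conn.close()


def consume_all(count=5):
    """Receive everything the child sends until the pipe reports EOF."""
    # duplex=False: the first connection only receives, the second only sends.
    recv_conn, send_conn = Pipe(duplex=False)
    p = Process(target=producer, args=(send_conn, count))
    p.start()
    # The parent must close its own copy of the sending end; otherwise
    # recv() would never see EOF.
    send_conn.close()
    received = []
    while True:
        try:
            received.append(recv_conn.recv())
        except EOFError:
            break  # nothing left and every sending end is closed
    p.join()
    return received


if __name__ == '__main__':
    print(consume_all())
```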

c）Manager

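Manager is remarkably capable. Queue and Pipe above can only pass data between processes; they do not give you shared data, that is, one object that several processes can modify. A Manager does.

From the official documentation: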

A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.

A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array.

from multiprocessing import Process, Manager

def run(d, l):

    d['name'] = 'winter'

    l.reverse()

def main():

    p = Process(target=run, args=(d, l, ))

    p.start()

    p.join()

    print('final dict: ', d)

    print('final list: ', l)

if __name__ == "__main__":

    mgmt = Manager()

    d = mgmt.dict()

    l = mgmt.list(range(10))

    main()

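Note: mgmt = Manager() must be inside the if __name__ == "__main__" block. Otherwise, on Windows, where each child re-imports the main module, it fails with a RuntimeError that mentions freeze_support().

Also note this passage: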

Server process managers are more flexible than using shared memory objects because they can be made to support arbitrary object types. Also, a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory.

In other words, a manager can even share data between different hosts.

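7）Process pool: Pool

To run a large number of tasks in child processes, use a process pool. Pool starts a fixed number of worker processes and hands submitted tasks to them. If a worker is idle when a task is submitted, the task runs right away; if all workers are busy, the task waits in a queue until one of them finishes its current task. By default the workers are reused rather than created anew for each task.

There are two ways to submit a single task: the blocking Pool.apply() and the non-blocking Pool.apply_async().

a）Blocking: Pool.apply()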

import multiprocessing
import time

def func(name):
    print("start: %s" % name)
    time.sleep(2)
    return 'end: %s' % name

if __name__ == "__main__":
    name_list = ['winter', 'elly', 'james', 'yule']
    # create a pool with 3 worker processes
    pool = multiprocessing.Pool(3)
    for member in name_list:
        # submit the task to the pool; no start() is needed
        res = pool.apply(func, (member,))
        print(res)
    # close() must be called before join(); after close() no more tasks can be submitted
    pool.close()
    pool.join()
    print("all done...")

Output:

start: winter

end: winter

start: elly

end: elly

start: james

end: james

start: yule

end: yule

all done...

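In blocking mode the tasks run one after another, that is, serially, which is why apply() is rarely used.

Two things to note:

1. Tasks submitted to a pool do not need start();

2. close() (or terminate()) must be called before join(), and once close() has been called no more tasks can be submitted.

b）Non-blocking: Pool.apply_async()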

import multiprocessing
import time

def func(name):
    print("start: %s" % name)
    time.sleep(2)
    return 'end: %s' % name

def func_exp(msg):
    print('callback: %s' % msg)

if __name__ == "__main__":
    name_list = ['winter', 'elly', 'james', 'yule']
    res_list = []
    # create a pool with the default size (one worker per logical CPU)
    pool = multiprocessing.Pool()
    for member in name_list:
        # submit the task to the pool; no start() is needed
        res = pool.apply_async(func, (member,), callback=func_exp)
        # append res itself, not res.get(); calling get() here would block
        res_list.append(res)
    for res_mem in res_list:
        print(res_mem.get())
    # close() must be called before join(); after close() no more tasks can be submitted
    pool.close()
    pool.join()
    print("all done...")

Output:

start: winter

start: elly

start: james

start: yule

callback: end: winter

end: winter

callback: end: elly

end: elly

callback: end: james

end: james

callback: end: yule

end: yule

all done...

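Analysis:

1. In non-blocking mode the tasks run concurrently in separate worker processes, so they can make use of multiple cores;

2. apply_async() takes a callback parameter, which is called with the result when a task completes;

3. Why does apply() block? Where exactly does it block? And what does apply_async() do differently?

Look at the source of apply():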

def apply(self, func, args=(), kwds={}):

    '''

    Equivalent of `func(*args, **kwds)`.

    '''

    assert self._state == RUN

    return self.apply_async(func, args, kwds).get()

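apply() simply runs self.apply_async(func, args, kwds).get(): it calls apply_async() as well, and then immediately calls get() on the result. That get() is where it blocks.

So if I change how my code uses apply_async(), can I make apply_async() block too? Let's try: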

import multiprocessing
import time

def func(name):
    print("start: %s" % name)
    time.sleep(2)
    return 'end: %s' % name

def func_exp(msg):
    print('callback: %s' % msg)

if __name__ == "__main__":
    name_list = ['winter', 'elly', 'james', 'yule']
    # create a pool with the default size (one worker per logical CPU)
    pool = multiprocessing.Pool()
    for member in name_list:
        # submit the task to the pool; no start() is needed
        res = pool.apply_async(func, (member,), callback=func_exp)
        # calling get() right away waits for this task before submitting the next
        print(res.get())
    # close() must be called before join(); after close() no more tasks can be submitted
    pool.close()
    pool.join()
    print("all done...")

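The only change is that res.get() is now called inside the loop, right after each apply_async(). Sure enough, the tasks now run one at a time: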

start: winter

callback: end: winter

end: winter

start: elly

callback: end: elly

end: elly

start: james

callback: end: james

end: james

start: yule

callback: end: yule

end: yule

all done...

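c）How many processes should a pool have?

Since multiple processes can use multiple cores, is it better to create as many processes as possible? No. Switching between processes is expensive, so with too many of them the constant switching actually lowers throughput.

The right number is empirical and depends heavily on the hardware; the optimum has to be found by tuning.

When a Pool is created without an explicit size, it defaults to the number of logical CPUs (hardware threads) reported by os.cpu_count().

As a rule of thumb:

A 1:1 ratio of processes to CPU cores works well. For models that use threads, the usual recommendation is at least 1.5 threads per core, so that some threads are left over for I/O. Python programs that use multiple processes are usually either doing pure computation, or running a coroutine model (little or no time spent waiting on I/O), or using threads inside each process (strongly discouraged, since it requires understanding how fork interacts with threads). In all of these cases one process per core is generally enough: process switches are fairly expensive, and too many processes switching back and forth lowers efficiency. There is one exception: with a lot of disk I/O, which is generally synchronous even under coroutines, adding some extra processes may help.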
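The 1:1 rule of thumb can be sketched like this. It is only an illustration: Pool.map() is used in place of apply_async() for brevity, and the names square and parallel_squares are made up. When no size is given, the helper falls back to os.cpu_count().

```python
import os
from multiprocessing import Pool


def square(n):
    """A small CPU-bound stand-in for real work."""
    return n * n


def parallel_squares(numbers, workers=None):
    """Square the numbers in a pool sized to the logical CPU count by default."""
    # Assumption: one worker per logical CPU, following the 1:1 rule of thumb.
    workers = workers or os.cpu_count() or 1
    with Pool(processes=workers) as pool:
        # map() blocks until every result is ready and preserves input order.
        return pool.map(square, numbers)


if __name__ == '__main__':
    print(parallel_squares(range(6)))
```

For I/O-heavy workloads, passing a larger workers value than the CPU count is the knob to experiment with.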