PEP URL:

https://peps.python.org/pep-0703/

PEP 703 – Making the Global Interpreter Lock Optional in CPython

================================================

Abstract

CPython’s global interpreter lock (“GIL”) prevents multiple threads from executing Python code at the same time. The GIL is an obstacle to using multi-core CPUs from Python efficiently. This PEP proposes adding a build configuration (--disable-gil) to CPython to let it run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe.

Motivation

The GIL is a major obstacle to concurrency. For scientific computing tasks, this lack of concurrency is often a bigger issue than speed of executing Python code, since most of the processor cycles are spent in optimized CPU or GPU kernels. The GIL introduces a global bottleneck that can prevent other threads from making progress if they call any Python code. There are existing ways to enable parallelism in CPython today, but those techniques come with significant limitations (see Alternatives).
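To make the bottleneck concrete, here is a minimal sketch (mine, not the PEP's): pure-Python, CPU-bound work run on four threads finishes no faster than running it serially, because only one thread can execute Python bytecode at a time.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def count(n):
        # Pure-Python, CPU-bound work; the GIL is held while this runs.
        total = 0
        for i in range(n):
            total += i
        return total

    N = 5_000_000

    start = time.perf_counter()
    for _ in range(4):
        count(N)
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(count, [N] * 4))
    threaded = time.perf_counter() - start

    # Under the GIL, the threaded run takes about as long as the serial
    # run (often slightly longer, due to lock contention).
    print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")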

This section focuses on the GIL’s impact on scientific computing, particularly AI/ML workloads, because that is the area with which this author has the most experience, but the GIL also affects other users of Python.

The GIL Makes Many Types of Parallelism Difficult to Express

Neural network-based AI models expose multiple opportunities for parallelism. For example, individual operations may be parallelized internally (“intra-operator”), multiple operations may be executed simultaneously (“inter-operator”), and requests (spanning multiple operations) may also be parallelized. Efficient execution requires exploiting multiple types of parallelism [1].

The GIL makes it difficult to express inter-operator parallelism, as well as some forms of request parallelism, efficiently in Python. In other programming languages, a system might use threads to run different parts of a neural network on separate CPU cores, but this is inefficient in Python due to the GIL. Similarly, latency-sensitive inference workloads frequently use threads to parallelize across requests, but face the same scaling bottlenecks in Python.
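As a rough sketch of what inter-operator parallelism looks like from Python (the branch names are hypothetical; this is not code from the PEP), one might dispatch independent operators to a thread pool. The native kernels release the GIL while running, but the Python-level dispatch between them still serializes on it, which limits scaling for small operators:

    from concurrent.futures import ThreadPoolExecutor
    import torch

    # Two hypothetical, independent operators from the same network.
    branch_a = torch.nn.Linear(256, 256)
    branch_b = torch.nn.Linear(256, 256)
    x = torch.randn(64, 256)

    # Inter-operator parallelism: run independent operators on separate
    # threads, then combine their outputs.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(branch_a, x)
        fut_b = pool.submit(branch_b, x)
        out = torch.cat([fut_a.result(), fut_b.result()], dim=1)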

The challenges the GIL poses to exploiting parallelism in Python frequently come up in reinforcement learning. Heinrich Kuttler, author of the NetHack Learning Environment and Member of Technical Staff at Inflection AI, writes:

Recent breakthroughs in reinforcement learning, such as on Dota 2, StarCraft, and NetHack rely on running multiple environments (simulated games) in parallel using asynchronous actor-critic methods. Straightforward multithreaded implementations in Python don’t scale beyond a few parallel environments due to GIL contention. Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.

Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:

We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers.

Projects that involve interfacing with multiple hardware devices face similar challenges: efficient communication requires use of multiple CPU cores. The Dose-3D project aims to improve cancer radiotherapy with precise dose planning. It uses medical phantoms (stand-ins for human tissue) together with custom hardware and a server application written in Python. Paweł Jurgielewicz, lead software architect for the data acquisition system on the Dose-3D project, describes the scaling challenges posed by the GIL and how using a fork of Python without the GIL simplified the project:

In the Dose-3D project, the key challenge was to maintain a stable, non-trivial concurrent communication link with hardware units while utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally, we started with the multiprocessing package, but at some point, it became clear that most CPU time was consumed by the data transfers between the data processing stages, not by data processing itself. The CPython multithreading implementation based on GIL was a dead end too. When we found out about the “nogil” fork of Python it took a single person less than half a working day to adjust the codebase to use this fork and the results were astonishing. Now we can focus on data acquisition system development rather than fine-tuning data exchange algorithms.

Allen Goodman, author of CellProfiler and staff engineer at Prescient Design and Genentech, describes how the GIL makes biological methods research more difficult in Python:

Issues with Python’s global interpreter lock are a frequent source of frustration throughout biological methods research.

I wanted to better understand the current multithreading situation so I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs. HMMER is a relatively elementary bioinformatics method and newer methods have far bigger multi-threading demands.

Method researchers are begging to use Python (myself included), because of its ease of use, the Python ecosystem, and because “it’s what people know.” Many biologists only know a little bit of programming (and that’s almost always Python). Until Python’s multithreading situation is addressed, C and C++ will remain the lingua franca of the biological methods research community.

The GIL Affects Python Library Usability

The GIL is a CPython implementation detail that limits multithreaded parallelism, so it might seem unintuitive to think of it as a usability issue. However, library authors frequently care a great deal about performance and will design APIs that support working around the GIL. These workarounds frequently lead to APIs that are more difficult to use. Consequently, users of these APIs may experience the GIL as a usability issue and not just a performance issue.

For example, PyTorch exposes a multiprocessing-based API called DataLoader for building data input pipelines. It uses fork() on Linux because it is generally faster and uses less memory than spawn(), but this leads to additional challenges for users: creating a DataLoader after accessing a GPU can lead to confusing CUDA errors. Accessing GPUs within a DataLoader worker quickly leads to out-of-memory errors because processes do not share CUDA contexts (unlike threads within a process).
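As a hedged sketch of the common workaround (not from the PEP): DataLoader accepts a multiprocessing_context argument, so a user who has already touched CUDA can request the spawn start method instead of the default fork():

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    if __name__ == "__main__":  # required: spawn workers re-import this module
        dataset = TensorDataset(torch.randn(1024, 3, 32, 32))

        # On Linux the default start method is fork(); forking after CUDA
        # has been initialized leads to the confusing errors described
        # above. Requesting spawn avoids forking an initialized CUDA
        # context, at the cost of slower worker startup and more memory.
        loader = DataLoader(
            dataset,
            batch_size=64,
            num_workers=4,
            multiprocessing_context="spawn",
        )

        for (batch,) in loader:
            pass  # a training step would consume the batch here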

Olivier Grisel, scikit-learn developer and software engineer at Inria, describes how having to work around the GIL in scikit-learn related libraries leads to a more complex and confusing user experience:

Over the years, scikit-learn developers have maintained ancillary libraries such as joblib and loky to try to work around some of the limitations of multiprocessing: extra memory usage partially mitigated via semi-automated memory mapping of large data buffers, slow worker startup by transparently reusing a pool of long-running workers, fork-safety problems of third-party native runtime libraries such as GNU OpenMP by never using the fork-only start-method, ability to perform parallel calls of interactively defined functions in notebooks and REPLs in a cross-platform manner via cloudpickle. Despite our efforts, this multiprocessing-based solution is still brittle, complex to maintain and confusing to data scientists with limited understanding of system-level constraints. Furthermore, there are still irreducible limitations such as the overhead caused by the pickle-based serialization/deserialization steps required for inter-process communication. A lot of this extra work and complexity would not be needed anymore if we could use threads without contention on multicore hosts (sometimes with 64 physical cores or more) to run data science pipelines that alternate between Python-level operations and calls to native libraries.
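For a sense of the multiprocessing-based model the quote describes, here is a small joblib sketch (illustrative only; fit_and_score is a made-up stand-in for a real estimator):

    import random
    from joblib import Parallel, delayed

    def fit_and_score(seed):
        # Stand-in for a CPU-bound model-fitting step.
        rng = random.Random(seed)
        return sum(rng.random() for _ in range(1_000_000))

    if __name__ == "__main__":
        # loky, joblib's default backend, keeps a reusable pool of worker
        # processes; arguments and results cross process boundaries via
        # pickle/cloudpickle, the serialization overhead the quote mentions.
        scores = Parallel(n_jobs=4)(delayed(fit_and_score)(s) for s in range(8))
        print(scores)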

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy maintainer, describes how the GIL affects the user experience of NumPy and numeric Python libraries:

A key problem in NumPy and the stack of packages built around it is that NumPy is still (mostly) single-threaded — and that has shaped significant parts of the user experience and projects built around it. NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g., dask.array, which wraps numpy.ndarray. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package, threadpoolctl. The main reason is that NumPy calls into BLAS for linear algebra, and it has no control over those calls; they use all cores by default via either pthreads or OpenMP.

Coordinating on APIs and design decisions to control parallelism is still a major amount of work, and one of the harder challenges across the PyData ecosystem. It would have looked a lot different (better, easier) without a GIL.
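As an illustration of the oversubscription management the quote mentions (my sketch, not from the PEP), threadpoolctl can cap the BLAS thread pool around a call that would otherwise use every core:

    import numpy as np
    from threadpoolctl import threadpool_limits

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    # NumPy delegates the matrix product to BLAS, which uses all cores by
    # default. If the caller already parallelizes at a higher level, the
    # nested BLAS threads oversubscribe the machine; threadpool_limits
    # caps them for the duration of the block.
    with threadpool_limits(limits=1, user_api="blas"):
        c = a @ b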

GPU-Heavy Workloads Require Multi-Core Processing

Many high-performance computing (HPC) and AI workloads make heavy use of GPUs. These applications frequently require efficient multi-core CPU execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR (Meta AI), describes how the GIL makes multithreaded scaling inefficient even when the bulk of computation is performed outside of Python:

In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.

The use of many processes (instead of threads) makes common tasks more difficult. Zachary DeVito continues:

On three separate occasions in the past couple of months (reducing redundant compute in data loaders, writing model checkpoints asynchronously, and parallelizing compiler optimizations), I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component. For example, computer vision tasks typically require multiple “pre-processing” steps in the data input pipeline, like image decoding, cropping, and resizing. These tasks are commonly performed on the CPU and may use Python libraries like Pillow or Pillow-SIMD. It is necessary to run the data input pipeline on multiple CPU cores in order to keep the GPU “fed” with data.
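A minimal sketch of such a pipeline (the file paths are hypothetical): because the decode/crop/resize stage is CPU-bound Python, worker processes rather than threads are needed to spread it across cores under the GIL:

    from concurrent.futures import ProcessPoolExecutor
    from PIL import Image

    def preprocess(path):
        # Decode and resize on the CPU before the data reaches the GPU.
        with Image.open(path) as img:
            img = img.convert("RGB")
            img = img.resize((224, 224))
            return img.tobytes()

    if __name__ == "__main__":
        paths = [f"images/{i:06d}.jpg" for i in range(1024)]  # hypothetical
        # Worker processes, rather than threads, are needed to use
        # multiple cores for this stage while the GIL is in place.
        with ProcessPoolExecutor(max_workers=8) as pool:
            batches = list(pool.map(preprocess, paths, chunksize=32))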

The increase in GPU performance compared to individual CPU cores makes multi-core performance more important. It is progressively more difficult to keep the GPUs fully occupied. To do so requires efficient use of multiple CPU cores, especially on multi-GPU systems. For example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to keep the GPUs “fed” with data.

The GIL Makes Deploying Python AI Models Difficult

Python is widely used to develop neural network-based AI models. In PyTorch, models are frequently deployed as part of multi-threaded, mostly C++, environments. Python is often viewed skeptically because the GIL can be a global bottleneck, preventing efficient scaling even though the vast majority of the computations occur “outside” of Python with the GIL released. The torchdeploy paper [2] shows experimental evidence for these scaling bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI models that avoid or work around the GIL, but they all come with substantial limitations. For example, TorchScript captures a representation of the model that can be executed from C++ without any Python dependencies, but it only supports a limited subset of Python and often requires rewriting some of the model’s code. The torch::deploy API allows multiple Python interpreters, each with its own GIL, in the same process (similar to PEP 684). However, torch::deploy has limited support for Python modules that use C-API extensions.
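To illustrate the TorchScript path (a sketch with a made-up TinyModel, not code from the PEP or the torchdeploy paper):

    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.linear(x))

    # torch.jit.script captures a Python-free representation of the model;
    # a C++ server can then torch::jit::load() the saved file and run it
    # without holding the GIL. Only a subset of Python is supported, which
    # is the limitation noted above.
    scripted = torch.jit.script(TinyModel())
    scripted.save("tiny_model.pt")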

Motivation Summary

Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries, as well as leading to library designs that are more difficult to use.

(The remainder of the PEP is omitted here.)

================================================

Related:

https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit?pli=1#heading=h.jcxfoklnvp0i
