PyTorch's GPU memory release mechanism: torch.cuda.empty_cache()

Reference:

https://cloud.tencent.com/developer/article/1626387

It is said that torch.cuda.empty_cache() can release cached GPU memory in PyTorch, so I ran a few experiments.

Here is the code:

import torch
import time
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "3"
device = 'cuda:2'

# 120*3*512*512*4/1024/1024 = 360.0 MB
dummy_tensor_4 = torch.randn(120, 3, 512, 512).float().to(device)

# Stage 1: right after creating the tensor
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 1:")
print("Tensor dtype:", dummy_tensor_4.dtype)
print("Memory actually occupied by the tensor:", 120*3*512*512*4/1024/1024, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 2: empty the cache while the tensor is still alive
torch.cuda.empty_cache()
time.sleep(15)  # pause (e.g. to check nvidia-smi)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 2:")
print("After emptying the cache:", "."*100)
print("Memory actually occupied by the tensor:", 120*3*512*512*4/1024/1024, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 3: delete the tensor, then empty the cache
del dummy_tensor_4
torch.cuda.empty_cache()
time.sleep(15)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 3:")
print("After deleting the tensor and emptying the cache:", "."*100)
print("Memory actually occupied by the tensor:", 0, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

time.sleep(60)

Results (the original output screenshots for stages 1, 2, and 3 are not available):

===================================================

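Creating a 360 MB tensor on the GPU in PyTorch took up 1321 MB in total. The tensor itself accounts for 360 MB and the cache figure also reads 360 MB, which leaves 1321 - 360*2 = 601 MB that I could not explain, which is very strange. (As the summary at the end notes, memory_reserved already includes memory_allocated, so the two 360 MB figures are probably the same memory counted twice. The remainder is most likely the CUDA context that each process creates on the GPU, which PyTorch's allocator statistics do not count.)

Overall, torch.cuda.empty_cache() is of some use, but not much.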
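The arithmetic behind these figures can be checked with a few lines of plain Python. This is only a sketch: the helper name `tensor_mib` is made up for illustration, the 1321 MB and 360 MB values are the ones quoted in the text, and treating the "cache" figure as `memory_reserved` (which includes the tensor) is an assumption based on the summary at the end of this post.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (float32 uses 4 bytes per element)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1024 / 1024


tensor_size = tensor_mib((120, 3, 512, 512))
print("tensor size:", tensor_size, "MB")

total_reported = 1321   # MB, total usage quoted in the text
reserved = 360          # MB, assumed to be memory_reserved (tensor included)

# Memory that PyTorch's allocator statistics do not account for.
outside_allocator = total_reported - reserved
print("outside the allocator:", outside_allocator, "MB")
```

Under that assumption the unaccounted part is 961 MB rather than 601 MB, because the tensor is not subtracted twice.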

===================================================

Modified code:

import torch
import time
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "3"
device = 'cuda:2'

# Each tensor: 120*3*512*512*4/1024/1024 = 360.0 MB
dummy_tensor_4 = torch.randn(120, 3, 512, 512).float().to(device)
dummy_tensor_5 = torch.randn(120, 3, 512, 512).float().to(device)

# Stage 1: right after creating the tensors
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 1:")
print("Tensor dtype:", dummy_tensor_4.dtype)
print("Memory actually occupied by the tensors:", 2*120*3*512*512*4/1024/1024, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 2: empty the cache while the tensors are still alive
torch.cuda.empty_cache()
time.sleep(15)  # pause (e.g. to check nvidia-smi)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 2:")
print("After emptying the cache:", "."*100)
print("Memory actually occupied by the tensors:", 2*120*3*512*512*4/1024/1024, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 3: delete both tensors, then empty the cache
del dummy_tensor_4
del dummy_tensor_5
torch.cuda.empty_cache()
time.sleep(15)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 3:")
print("After deleting the tensors and emptying the cache:", "."*100)
print("Memory actually occupied by the tensors:", 0, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

time.sleep(60)

Results (the original output screenshots for stages 1, 2, and 3 are not available):

There is still some GPU memory that cannot be explained.

=============================================

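All of the above was run on a Titan card with 24 GB of memory. Finally I decided to repeat the experiment on a 1060 card, since its 6 GB of memory is easier to fill.

Code: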

import torch
import time
import os
import functools

# os.environ["CUDA_VISIBLE_DEVICES"] = "3"
device = 'cuda:0'

shape_ = (4, 1024, 512, 512)     # 4 GB as float32

# dummy_tensor_4 = torch.randn(120, 3, 512, 512).float().to(device)  # 120*3*512*512*4/1024/1024 = 360.0 MB
# dummy_tensor_5 = torch.randn(10, 120, 3, 512, 512).float().to(device)  # 10*120*3*512*512*4/1024/1024 = 3600.0 MB
dummy_tensor_6 = torch.randn(*shape_).float().to(device)

# Stage 1: right after creating the tensor
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 1:")
print("Tensor dtype:", dummy_tensor_6.dtype)
print("Memory actually occupied by the tensor:", functools.reduce(lambda x, y: x*y, shape_)*4/1024/1024, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 2: empty the cache while the tensor is still alive
torch.cuda.empty_cache()
time.sleep(15)  # pause (e.g. to check nvidia-smi)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 2:")
print("After emptying the cache:", "."*100)
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 3: delete the tensor, then empty the cache
del dummy_tensor_6
torch.cuda.empty_cache()
time.sleep(15)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 3:")
print("After deleting the tensor and emptying the cache:", "."*100)
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

time.sleep(60)

Output (the original screenshots for stages 1, 2, and 3 are not available):

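Since the card has only 6 GB of memory in total, memory_allocated and memory_reserved must be referring to the same memory: each reports 4 GB, and the card only has 6 GB.

Calling torch.cuda.empty_cache() on its own did not release any GPU memory; usage stayed at 4775 MB. But after running

del dummy_tensor_6
torch.cuda.empty_cache()

the memory was released and usage dropped to 679 MB.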
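The sequence described above can be reproduced in a few lines. This is a minimal sketch, assuming a CUDA device is available as cuda:0; the helper name `trace_release` and the smaller tensor shape are illustrative choices, not part of the original experiment. It records `memory_allocated` and `memory_reserved` after each step so the effect of `del` and `empty_cache()` can be compared directly.

```python
import torch


def mib(n_bytes):
    return n_bytes / 1024 / 1024


def trace_release(device="cuda:0", shape=(16, 3, 512, 512)):
    """Return {step: (allocated_MiB, reserved_MiB)}, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    def snapshot():
        return (mib(torch.cuda.memory_allocated(device)),
                mib(torch.cuda.memory_reserved(device)))

    steps = {}
    x = torch.randn(*shape, device=device)   # create directly on the GPU
    steps["after alloc"] = snapshot()

    torch.cuda.empty_cache()                  # tensor is still alive
    steps["after empty_cache"] = snapshot()

    del x                                     # block returns to the cache
    steps["after del"] = snapshot()

    torch.cuda.empty_cache()                  # unused cached blocks are released
    steps["after del + empty_cache"] = snapshot()
    return steps


steps = trace_release()
if steps is None:
    print("CUDA is not available; nothing to measure.")
else:
    for name, (allocated, reserved) in steps.items():
        print(f"{name}: allocated={allocated:.1f} MB, reserved={reserved:.1f} MB")
```

The numbers printed here come only from PyTorch's allocator, so they will typically be lower than what nvidia-smi reports for the same process.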

Modified code:

import torch
import time
import os
import functools

# os.environ["CUDA_VISIBLE_DEVICES"] = "3"
device = 'cuda:0'

shape_ = (4, 1024, 512, 512)     # 4 GB as float32

# dummy_tensor_4 = torch.randn(120, 3, 512, 512).float().to(device)  # 120*3*512*512*4/1024/1024 = 360.0 MB
# dummy_tensor_5 = torch.randn(10, 120, 3, 512, 512).float().to(device)  # 10*120*3*512*512*4/1024/1024 = 3600.0 MB
dummy_tensor_6 = torch.randn(*shape_).float().to(device)

# Stage 1: right after creating the tensor
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 1:")
print("After creating the tensor:", "."*100)
print("Tensor dtype:", dummy_tensor_6.dtype)
print("Memory actually occupied by the tensor:", functools.reduce(lambda x, y: x*y, shape_)*4/1024/1024, "MB")
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# Stage 2: empty the cache while the tensor is still alive
torch.cuda.empty_cache()
time.sleep(15)  # pause (e.g. to check nvidia-smi)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 2:")
print("After emptying the cache:", "."*100)
print("Tensor dtype:", dummy_tensor_6.dtype)
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

# for _ in range(10000):
#     dummy_tensor_6 += 0.001
# print(torch.sum(dummy_tensor_6))

# Stage 3: delete the tensor WITHOUT calling empty_cache() afterwards
del dummy_tensor_6
time.sleep(15)
memory_allocated = torch.cuda.memory_allocated(device)/1024/1024
memory_reserved = torch.cuda.memory_reserved(device)/1024/1024
print("Stage 3:")
print("After deleting the tensor (cache not emptied):", "."*100)
print("GPU memory allocated to tensors:", memory_allocated, "MB")
print("GPU memory reserved (including cache):", memory_reserved, "MB")

time.sleep(60)

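Results:

The GPU memory usage shown by the NVIDIA tools is the same in stages 1, 2, and 3 (the original screenshot is not available).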

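If torch.cuda.empty_cache() is not called, the GPU memory is not released even after the tensor on the GPU is deleted; that memory is still held by the process as cache.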
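One way to picture this behaviour is with a toy model of a caching allocator. The class below is purely illustrative and is not PyTorch's real implementation (the name `ToyCachingAllocator` and its bookkeeping are assumptions made for this sketch); it only mimics the idea that freed blocks go back to a cache rather than to the driver, and that `empty_cache()` returns only unused cached blocks.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator; NOT PyTorch's actual implementation."""

    def __init__(self):
        self.in_use = {}        # handle -> block size (MB) held by live tensors
        self.cached = []        # sizes (MB) of freed blocks kept for reuse
        self._next_handle = 0

    def malloc(self, size):
        # Reuse a cached block of the same size if there is one; otherwise
        # new memory is requested from the driver, which grows "reserved".
        if size in self.cached:
            self.cached.remove(size)
        self._next_handle += 1
        self.in_use[self._next_handle] = size
        return self._next_handle

    def free(self, handle):
        # Like `del tensor`: the block moves to the cache, not to the driver.
        self.cached.append(self.in_use.pop(handle))

    def empty_cache(self):
        # Only unused cached blocks are returned; live blocks are untouched.
        self.cached.clear()

    def memory_allocated(self):
        return sum(self.in_use.values())

    def memory_reserved(self):
        return self.memory_allocated() + sum(self.cached)


alloc = ToyCachingAllocator()
handle = alloc.malloc(4096)                      # a 4 GB tensor
alloc.empty_cache()                              # tensor alive: nothing to release
stage2 = (alloc.memory_allocated(), alloc.memory_reserved())
alloc.free(handle)                               # like `del dummy_tensor_6`
stage3 = (alloc.memory_allocated(), alloc.memory_reserved())
alloc.empty_cache()                              # now the cached block is released
stage4 = (alloc.memory_allocated(), alloc.memory_reserved())
print(stage2, stage3, stage4)
```

In this model the memory stays reserved after the free, matching the observation above, and a later allocation of the same size would reuse the cached block instead of growing the reserved total.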

================================================

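Summary:

torch.cuda.memory_reserved() is the total GPU memory that PyTorch's caching allocator has obtained for the process (memory used by tensors plus the cache).

torch.cuda.memory_allocated() is the GPU memory that the process has allocated to tensors.

torch.cuda.memory_reserved() - torch.cuda.memory_allocated() is the idle GPU memory inside the process, which generally means the size of the cache. (This is not the GPU's free memory; it is the unused part of the memory the process has already obtained.)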
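These three quantities can be gathered in one helper. This is a hedged sketch rather than a definitive tool: the function name `memory_report` and the dictionary keys are made up for illustration, a CUDA device at cuda:0 is assumed, and `torch.cuda.mem_get_info` (available in recent PyTorch releases) is used to add the device-wide free and total memory for comparison.

```python
import torch


def memory_report(device="cuda:0"):
    """Return allocator and device-level memory figures in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    mib = 1024 * 1024
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)

    # Device-wide view: free and total bytes as seen by the CUDA driver.
    index = torch.device(device).index
    free_on_device, total_on_device = torch.cuda.mem_get_info(index)

    return {
        "allocated": allocated / mib,                    # held by live tensors
        "reserved": reserved / mib,                      # tensors + cache
        "cached_idle": (reserved - allocated) / mib,     # cache inside the process
        "device_used": (total_on_device - free_on_device) / mib,
        "device_total": total_on_device / mib,
    }


report = memory_report()
if report is None:
    print("CUDA is not available; nothing to report.")
else:
    for key, value in report.items():
        print(f"{key}: {value:.1f} MB")
```

The gap between `device_used` and `reserved` covers memory that PyTorch's allocator does not track, such as the CUDA context and memory used by other processes on the same GPU.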
