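A while ago, while working on NVIDIA hardware decoding, the graphics card kept dropping out for no apparent reason; it later turned out that the card was shutting down because its temperature was too high. Recently I found that CUDA ships with NVML (NVIDIA Management Library), which can query GPU information; nvidia-smi is built on the same library.

The CUDA version used here is CUDA 8.0.

1. Adding NVML to a program

After installing CUDA, the following can be found:

Figure 1. The NVML example

This folder contains an NVML example. My system is 64-bit, and the NVML lib and header files can be found as follows:

Figure 2. The NVML lib file

Figure 3. The NVML header file

Next, include NVML in the project. I created a new CUDA 8.0 Runtime project: because NVML ships with CUDA, a CUDA 8.0 Runtime project saves the CUDA configuration work. For how to create the project, see the article on calling functions in a CUDA .cu file from a VS2013 VC++ .cpp file. CUDA 8.0 is installed with default settings, and the system is 64-bit Windows 10.

In the program, simply include the NVML header and link against the lib: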

#include "nvml.h"

#pragma  comment(lib,"nvml.lib")

Note that on a 64-bit system you should create an x64 project, because the installed CUDA does not include a win32 nvml.lib.

2. Querying GPU information with NVML

Commonly used functions:

·nvmlInit() initializes NVML;

·nvmlDeviceGetCount(unsigned int *deviceCount) returns the number of GPUs;

·nvmlDeviceGetHandleByIndex(unsigned int index, nvmlDevice_t *device) gets a device handle;

·nvmlDeviceGetName(nvmlDevice_t device, char *name, unsigned int length) queries the device name;

·nvmlDeviceGetPciInfo(nvmlDevice_t device, nvmlPciInfo_t *pci) gets the PCI information. On why this function matters, the example says:

// pci.busId is very useful to know which device physically you're talking to
// Using PCI identifier you can also match nvmlDevice handle to CUDA device.

·nvmlDeviceGetComputeMode(nvmlDevice_t device, nvmlComputeMode_t *mode) returns the compute mode the GPU is currently in. The modes are:

typedef enum nvmlComputeMode_enum
{
    NVML_COMPUTEMODE_DEFAULT           = 0,  //!< Default compute mode -- multiple contexts per device
    NVML_COMPUTEMODE_EXCLUSIVE_THREAD  = 1,  //!< Support Removed
    NVML_COMPUTEMODE_PROHIBITED        = 2,  //!< Compute-prohibited mode -- no contexts per device
    NVML_COMPUTEMODE_EXCLUSIVE_PROCESS = 3,  //!< Compute-exclusive-process mode -- only one context per device, usable from multiple threads at a time
   
    // Keep this last
    NVML_COMPUTEMODE_COUNT
} nvmlComputeMode_t;

·nvmlDeviceSetComputeMode(nvmlDevice_t device, nvmlComputeMode_t mode) changes the compute mode of the GPU;

·nvmlDeviceGetTemperatureThreshold(nvmlDevice_t device, nvmlTemperatureThresholds_t thresholdType, unsigned int *temp) queries a temperature threshold. There are two kinds:

typedef enum nvmlTemperatureThresholds_enum
{
    NVML_TEMPERATURE_THRESHOLD_SHUTDOWN = 0,    // Temperature at which the GPU will shut down for HW protection
    NVML_TEMPERATURE_THRESHOLD_SLOWDOWN = 1,    // Temperature at which the GPU will begin slowdown
    // Keep this last
    NVML_TEMPERATURE_THRESHOLD_COUNT
} nvmlTemperatureThresholds_t;

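When the temperature reaches the value returned for NVML_TEMPERATURE_THRESHOLD_SHUTDOWN, the GPU shuts itself down to protect the hardware; when it reaches the value returned for NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, the GPU begins to throttle and its performance drops.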

·nvmlDeviceGetTemperature(nvmlDevice_t device, nvmlTemperatureSensors_t sensorType, unsigned int *temp) gets the current GPU temperature;

·nvmlDeviceGetUtilizationRates(nvmlDevice_t device, nvmlUtilization_t *utilization) gets the utilization of the device (original comment: "Retrieves the current utilization rates for the device's major subsystems"; I may have misunderstood it). The utilization consists of:

typedef struct nvmlUtilization_st
{
    unsigned int gpu;                //!< Percent of time over the past sample period during which one or more kernels was executing on the GPU
    unsigned int memory;             //!< Percent of time over the past sample period during which global (device) memory was being read or written
} nvmlUtilization_t;

·nvmlDeviceGetMemoryInfo(nvmlDevice_t device, nvmlMemory_t *memory)    Retrieves the amount of used, free and total memory available on the device, in bytes.

·nvmlDeviceGetBAR1MemoryInfo(nvmlDevice_t device, nvmlBAR1Memory_t *bar1Memory)   Gets Total, Available and Used size of BAR1 memory. (I do not yet know how this differs from the previous function; to be studied later.)

·nvmlDeviceGetComputeRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_t *infos)    Get information about processes with a compute context on a device. This presumably returns information about the programs currently using the GPU.

·nvmlDeviceGetMaxClockInfo(nvmlDevice_t device, nvmlClockType_t type, unsigned int *clock)   Retrieves the maximum clock speeds for the device. The clock types are:

typedef enum nvmlClockType_enum
{
    NVML_CLOCK_GRAPHICS  = 0,        //!< Graphics clock domain
    NVML_CLOCK_SM        = 1,        //!< SM clock domain
    NVML_CLOCK_MEM       = 2,        //!< Memory clock domain
    NVML_CLOCK_VIDEO     = 3,        //!< Video encoder/decoder clock domain
   
    // Keep this last
    NVML_CLOCK_COUNT //<! Count of clock types
} nvmlClockType_t;

·nvmlDeviceGetClockInfo(nvmlDevice_t device, nvmlClockType_t type, unsigned int *clock)   Retrieves the current clock speeds for the device. The previous function returns the maximum; this one returns the current value.
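The remark quoted for nvmlDeviceGetPciInfo says that the PCI identifier can be used to match an NVML handle to a CUDA device. One way this could be done is to compare pci.busId with the string that the CUDA runtime call cudaDeviceGetPCIBusId returns for each CUDA device. The sketch below only covers the comparison step and does not call either library. It assumes both strings follow the domain:bus:device.function hexadecimal layout and that they may differ in letter case or zero padding, so it compares the parsed numbers rather than the raw text. PciAddress, parsePciBusId and samePciDevice are hypothetical names.

```cpp
#include <cstdio>

// Hypothetical parsed form of a PCI bus id such as "0000:01:00.0".
struct PciAddress
{
    unsigned int domain;
    unsigned int bus;
    unsigned int device;
    unsigned int function;
};

// Parses "domain:bus:device.function" (all fields hexadecimal).
// Returns false if the string does not have that shape.
bool parsePciBusId(const char *busId, PciAddress *out)
{
    if (busId == nullptr || out == nullptr) return false;
    PciAddress parsed = {0, 0, 0, 0};
    int fields = std::sscanf(busId, "%x:%x:%x.%x",
                             &parsed.domain, &parsed.bus,
                             &parsed.device, &parsed.function);
    if (fields != 4) return false;
    *out = parsed;
    return true;
}

// True when both strings parse and name the same PCI location,
// regardless of letter case or leading zeros.
bool samePciDevice(const char *lhs, const char *rhs)
{
    PciAddress a, b;
    if (!parsePciBusId(lhs, &a) || !parsePciBusId(rhs, &b)) return false;
    return a.domain == b.domain && a.bus == b.bus &&
           a.device == b.device && a.function == b.function;
}
```

In a real program, one would loop over the CUDA device indices, fetch each bus id string, and pick the index for which samePciDevice returns true against the NVML pci.busId.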

Code example:

#include "cuda_kernels.h"
#include "nvml.h"

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <psapi.h>

#pragma comment(lib, "kernel32.lib")
#pragma comment(lib, "advapi32.lib")
#pragma comment(lib, "nvml.lib")

const char * convertToComputeModeString(nvmlComputeMode_t mode)
{
    switch (mode)
    {
    case NVML_COMPUTEMODE_DEFAULT:
        return "Default";
    case NVML_COMPUTEMODE_EXCLUSIVE_THREAD:
        return "Exclusive_Thread";
    case NVML_COMPUTEMODE_PROHIBITED:
        return "Prohibited";
    case NVML_COMPUTEMODE_EXCLUSIVE_PROCESS:
        return "Exclusive Process";
    default:
        return "Unknown";
    }
}

int main()
{
    cuAdd();

    nvmlReturn_t result;
    unsigned int device_count, i;

    // First initialize NVML library
    result = nvmlInit();
    if (NVML_SUCCESS != result)
    {
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));
        printf("Press ENTER to continue...\n");
        getchar();
        return 1;
    }

    result = nvmlDeviceGetCount(&device_count);
    if (NVML_SUCCESS != result)
    {
        printf("Failed to query device count: %s\n", nvmlErrorString(result));
        goto Error;
    }
    printf("Found %u device%s\n\n", device_count, device_count != 1 ? "s" : "");

    printf("Listing devices:\n");

    // Query every device once per second until a fatal error occurs
    while (true)
    {
        for (i = 0; i < device_count; i++)
        {
            nvmlDevice_t device;
            char name[NVML_DEVICE_NAME_BUFFER_SIZE];
            nvmlPciInfo_t pci;
            nvmlComputeMode_t compute_mode;

            // Query for device handle to perform operations on a device
            // You can also query device handle by other features like:
            // nvmlDeviceGetHandleBySerial
            // nvmlDeviceGetHandleByPciBusId
            result = nvmlDeviceGetHandleByIndex(i, &device);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get handle for device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            }

            result = nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get name of device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            }

            // pci.busId is very useful to know which device physically you're talking to
            // Using PCI identifier you can also match nvmlDevice handle to CUDA device.
            result = nvmlDeviceGetPciInfo(device, &pci);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get pci info for device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            }

            printf("%u. %s [%s]\n", i, name, pci.busId);

            // This is a simple example on how you can modify GPU's state
            result = nvmlDeviceGetComputeMode(device, &compute_mode);
            if (NVML_ERROR_NOT_SUPPORTED == result)
                printf("\t This is not CUDA capable device\n");
            else if (NVML_SUCCESS != result)
            {
                printf("Failed to get compute mode for device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            }
            else
            {
                // try to change compute mode
                printf("\t Changing device's compute mode from '%s' to '%s'\n",
                    convertToComputeModeString(compute_mode),
                    convertToComputeModeString(NVML_COMPUTEMODE_PROHIBITED));

                result = nvmlDeviceSetComputeMode(device, NVML_COMPUTEMODE_PROHIBITED);
                if (NVML_ERROR_NO_PERMISSION == result)
                    printf("\t\t Need root privileges to do that: %s\n", nvmlErrorString(result));
                else if (NVML_ERROR_NOT_SUPPORTED == result)
                    printf("\t\t Compute mode prohibited not supported. You might be running on\n"
                        "\t\t windows in WDDM driver model or on non-CUDA capable GPU.\n");
                else if (NVML_SUCCESS != result)
                {
                    printf("\t\t Failed to set compute mode for device %u: %s\n", i, nvmlErrorString(result));
                    goto Error;
                }
                else
                {
                    printf("\t Restoring device's compute mode back to '%s'\n",
                        convertToComputeModeString(compute_mode));
                    result = nvmlDeviceSetComputeMode(device, compute_mode);
                    if (NVML_SUCCESS != result)
                    {
                        printf("\t\t Failed to restore compute mode for device %u: %s\n", i, nvmlErrorString(result));
                        goto Error;
                    }
                }
            }

            // Temperature
            printf("\n");
            printf("----- Temperature ----- \n");
            unsigned int temperature_threshold = 100;
            result = nvmlDeviceGetTemperatureThreshold(device, NVML_TEMPERATURE_THRESHOLD_SHUTDOWN, &temperature_threshold);
            if (NVML_SUCCESS != result)
            {
                printf("device %u Failed to get NVML_TEMPERATURE_THRESHOLD_SHUTDOWN: %s\n", i, nvmlErrorString(result));
            }
            else
                printf("Shutdown temperature: %u C (Temperature at which the GPU will shut down for HW protection)\n", temperature_threshold);

            result = nvmlDeviceGetTemperatureThreshold(device, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, &temperature_threshold);
            if (NVML_SUCCESS != result)
            {
                printf("device %u Failed NVML_TEMPERATURE_THRESHOLD_SLOWDOWN: %s\n", i, nvmlErrorString(result));
            }
            else
                printf("Slowdown temperature: %u C (Temperature at which the GPU will begin slowdown)\n", temperature_threshold);

            unsigned int temperature = 0;
            result = nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature);
            if (NVML_SUCCESS != result)
            {
                printf("device %u NVML_TEMPERATURE_GPU Failed: %s\n", i, nvmlErrorString(result));
            }
            else
                printf("Current temperature: %u C \n", temperature);

            // Utilization (both fields are unsigned int, so print with %u)
            printf("\n");
            nvmlUtilization_t utilization;
            result = nvmlDeviceGetUtilizationRates(device, &utilization);
            if (NVML_SUCCESS != result)
            {
                printf(" device %u nvmlDeviceGetUtilizationRates Failed : %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf("----- Utilization ----- \n");
                printf("GPU utilization: %u %% \n", utilization.gpu);
                printf("Memory utilization: %u %% \n", utilization.memory);
            }

            // FB memory (fields are unsigned long long, so print with %llu)
            printf("\n");
            nvmlMemory_t memory;
            result = nvmlDeviceGetMemoryInfo(device, &memory);
            if (NVML_SUCCESS != result)
            {
                printf("device %u nvmlDeviceGetMemoryInfo Failed : %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf("------ FB memory ------- \n");
                printf("Total installed FB memory: %llu bytes \n", memory.total);
                printf("Unallocated FB memory: %llu bytes \n", memory.free);
                printf("Allocated FB memory: %llu bytes \n", memory.used);
            }

            // BAR1 memory
            printf("\n");
            nvmlBAR1Memory_t bar1Memory;
            result = nvmlDeviceGetBAR1MemoryInfo(device, &bar1Memory);
            if (NVML_SUCCESS != result)
            {
                printf("device %u nvmlDeviceGetBAR1MemoryInfo Failed : %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf("------ BAR1 memory ------- \n");
                printf("Total BAR1 memory: %llu bytes \n", bar1Memory.bar1Total);
                printf("Unallocated BAR1 memory: %llu bytes \n", bar1Memory.bar1Free);
                printf("Allocated BAR1 memory: %llu bytes \n", bar1Memory.bar1Used);
            }

            // Information about running compute processes on the GPU
            printf("\n");
            // infoCount is in/out: capacity of infos on entry, number of entries on return
            unsigned int infoCount = 999;
            nvmlProcessInfo_t infos[999];
            result = nvmlDeviceGetComputeRunningProcesses(device, &infoCount, infos);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get ComputeRunningProcesses for device %u: %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf("------ Information about running compute processes on the GPU ------- \n");
                for (unsigned int j = 0; j < infoCount; j++)
                {
                    printf("PID: %u  GPU memory used: %llu bytes ", infos[j].pid, infos[j].usedGpuMemory);

                    // Open the target process by PID to look up the path of its executable
                    HANDLE hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, infos[j].pid);
                    if (hProcess == NULL)
                    {
                        printf("\nOpenProcess failed: %lu\n", GetLastError());
                        continue;
                    }

                    char strFilePath[MAX_PATH];
                    if (GetModuleFileNameExA(hProcess, NULL, strFilePath, MAX_PATH) != 0)
                        printf(" %s\n", strFilePath);
                    else
                        printf("\n");

                    CloseHandle(hProcess);
                }
            }

            // Clocks
            printf("\n");
            printf("------ Clocks ------- \n");
            unsigned int max_clock = 0;
            result = nvmlDeviceGetMaxClockInfo(device, NVML_CLOCK_GRAPHICS, &max_clock);
            if (NVML_SUCCESS != result)
            {
                printf("device %u nvmlDeviceGetMaxClockInfo Failed : %s\n", i, nvmlErrorString(result));
            }

            unsigned int clock = 0;
            result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_GRAPHICS, &clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get NVML_CLOCK_GRAPHICS info for device %u: %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf("GRAPHICS: %6u MHz max clock :%u \n", clock, max_clock);
            }

            result = nvmlDeviceGetMaxClockInfo(device, NVML_CLOCK_SM, &max_clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get max NVML_CLOCK_SM for device %u: %s\n", i, nvmlErrorString(result));
            }

            result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get current NVML_CLOCK_SM for device %u: %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf(" SM: %6u MHz max clock :%u \n", clock, max_clock);
            }

            result = nvmlDeviceGetMaxClockInfo(device, NVML_CLOCK_MEM, &max_clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get max NVML_CLOCK_MEM for device %u: %s\n", i, nvmlErrorString(result));
            }

            result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_MEM, &clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get current NVML_CLOCK_MEM for device %u: %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf(" MEM: %6u MHz max clock :%u \n", clock, max_clock);
            }

            result = nvmlDeviceGetMaxClockInfo(device, NVML_CLOCK_VIDEO, &max_clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get max NVML_CLOCK_VIDEO for device %u: %s\n", i, nvmlErrorString(result));
            }

            result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_VIDEO, &clock);
            if (NVML_SUCCESS != result)
            {
                printf("Failed to get current NVML_CLOCK_VIDEO for device %u: %s\n", i, nvmlErrorString(result));
            }
            else
            {
                printf(" VIDEO: %6u MHz max clock :%u \n", clock, max_clock);
            }
        }

        printf("-------------------------------------------------------------------- \n");
        Sleep(1000);
    }

Error:
    result = nvmlShutdown();
    if (NVML_SUCCESS != result)
        printf("Failed to shutdown NVML: %s\n", nvmlErrorString(result));

    system("pause");
    return 0;
}

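I have already copied nvml.dll into the run directory, so the program should run normally. Even so, I also set up the environment for nvidia-smi, following the article "NVIDIA GPU information (viewing CUDA information)". I have copied its content below:

1. Viewing GPU information with nvidia-smi

nvidia-smi stands for NVIDIA System Management Interface.

After the NVIDIA graphics driver is installed, the cmd command line on Windows does not yet recognize the nvidia-smi command; its location has to be added to the environment variables first. If the NVIDIA driver is installed in the default location, the full path of the directory containing nvidia-smi should be: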

C:\Program Files\NVIDIA Corporation\NVSMI

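That is, add the path above to the Path system environment variable.

2. Viewing CUDA information

  • CUDA version:

    • On the command line: nvcc -V

3. Results

Figure 4. Query results for the GeForce 940M

Figure 5. Query results for the Tesla P4

NVML support for the GeForce 940M is not very good, while its support for the Tesla P4 is considerably better.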

Project source code: http://download.csdn.net/download/qq_33892166/9841800
