5 Things You Should Know About the New Maxwell GPU Architecture
The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture.
This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning and pointers to more resources.
1. The Heart of Maxwell: More Efficient Multiprocessors
Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.
Improved Instruction Scheduling
The number of CUDA Cores per SM has been reduced to a power of two (128 in SMM, down from 192 in SMX). However, with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means that CUDA Cores per GPU will be substantially higher than in comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.
As with SMX, each SMM has four warp schedulers, but unlike SMX, all core SMM functional units are assigned to a particular scheduler, with no shared units. The power-of-two number of CUDA Cores per partition simplifies scheduling, as each of SMM’s warp schedulers issues to a dedicated set of CUDA Cores equal to the warp width. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.
Increased Occupancy for Existing Code
In terms of CUDA compute capability, Maxwell’s SM is CC 5.0. SMM is similar in many respects to the Kepler architecture’s SMX, with key enhancements geared toward improving efficiency without requiring significant increases in available parallelism per SM from the application. The register file size and the maximum number of concurrent warps in SMM are the same as in SMX (64K 32-bit registers and 64 warps, respectively), as is the maximum number of registers per thread (255). However, the maximum number of active thread blocks per multiprocessor has been doubled over SMX to 32, which should result in an automatic occupancy improvement for kernels that use small thread blocks of 64 or fewer threads (assuming available registers and shared memory are not the occupancy limiter). Table 1 provides a comparison between key characteristics of Maxwell GM107 and its predecessor Kepler GK107.
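The effect of the doubled block limit can be checked with a little arithmetic. The following is a minimal sketch in Python, not an official occupancy calculator: the `occupancy` function is a hypothetical helper, and it assumes registers and shared memory are not the limiter, exactly as the paragraph above does.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64  # same on SMX and SMM, per the text above


def occupancy(threads_per_block, max_blocks_per_sm):
    """Fraction of an SM's warp slots that are filled.

    Assumes registers and shared memory are not the limiting resource.
    """
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    resident_blocks = min(max_blocks_per_sm, blocks_by_warps)
    return resident_blocks * warps_per_block / MAX_WARPS_PER_SM


for threads in (32, 64, 128, 256):
    smx = occupancy(threads, max_blocks_per_sm=16)  # Kepler SMX block limit
    smm = occupancy(threads, max_blocks_per_sm=32)  # Maxwell SMM block limit
    print(f"{threads:4d} threads/block: SMX {smx:.0%}, SMM {smm:.0%}")
```

Under these assumptions, blocks of 32 and 64 threads are capped by the block limit on SMX and reach twice the occupancy on SMM, while blocks of 128 threads or more already fill all 64 warp slots on both.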
Reduced Arithmetic Instruction Latency
Another major improvement of SMM is that dependent arithmetic
instruction latencies have been significantly reduced. Because occupancy
(which translates to available warp-level parallelism) is the same or
better on SMM than on SMX, these reduced latencies improve utilization
and throughput.
Table 1. Key characteristics of Kepler GK107 and Maxwell GM107.
GPU | GK107 (Kepler) | GM107 (Maxwell) |
CUDA Cores | 384 | 640 |
Base Clock | 1058 MHz | 1020 MHz |
GPU Boost Clock | N/A | 1085 MHz |
GFLOP/s | 812.5 | 1305.6 |
Compute Capability | 3.0 | 5.0 |
Shared Memory / SM | 16 KB / 48 KB | 64 KB |
Register File Size / SM | 256 KB | 256 KB |
Active Blocks / SM | 16 | 32 |
Memory Clock | 5000 MHz | 5400 MHz |
Memory Bandwidth | 80 GB/s | 86.4 GB/s |
L2 Cache Size | 256 KB | 2048 KB |
TDP | 64 W | 60 W |
Transistors | 1.3 Billion | 1.87 Billion |
Die Size | 118 mm² | 148 mm² |
Manufacturing Process | 28 nm | 28 nm |
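The GFLOP/s and memory bandwidth rows follow from the other rows of the table. The sketch below reproduces them in Python under two assumptions that the table does not state: peak throughput counts one fused multiply-add (two floating-point operations) per CUDA Core per clock at the base clock, and both GPUs use a 128-bit memory interface. The function names are illustrative only.

```python
def peak_gflops(cuda_cores, base_clock_mhz):
    """Peak single-precision GFLOP/s, assuming one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * base_clock_mhz / 1000.0


def bandwidth_gb_s(memory_clock_mhz, bus_width_bits=128):
    """Peak memory bandwidth in GB/s.

    bus_width_bits=128 is an assumption; the table does not list the bus width.
    """
    return memory_clock_mhz * (bus_width_bits / 8) / 1000.0


for name, cores, core_mhz, mem_mhz in (
    ("GK107", 384, 1058, 5000),
    ("GM107", 640, 1020, 5400),
):
    print(
        f"{name}: {peak_gflops(cores, core_mhz):.1f} GFLOP/s, "
        f"{bandwidth_gb_s(mem_mhz):.1f} GB/s"
    )
```

With these assumptions the results agree with the table to one decimal place.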
2. Larger, Dedicated Shared Memory
A significant improvement in SMM is that it provides 64 KB of dedicated shared memory per SM, unlike Fermi and Kepler, which partitioned 64 KB of on-chip memory between L1 cache and shared memory. The per-thread-block limit remains 48 KB on Maxwell, but the increase in total available shared memory can lead to occupancy improvements. Dedicated shared memory is made possible in Maxwell by combining the functionality of the L1 and texture caches into a single unit.
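To see how the larger pool can raise occupancy, consider a kernel whose resident block count is limited only by shared memory. This Python sketch is a simplification: `blocks_by_shared_memory` is a hypothetical helper, it ignores allocation granularity and every other occupancy limit, and it compares against Kepler's largest 48 KB shared memory configuration.

```python
MAX_SHARED_PER_BLOCK_KB = 48  # per-thread-block limit stated in the text


def blocks_by_shared_memory(shared_per_block_kb, shared_per_sm_kb):
    """Thread blocks that fit on one SM when shared memory is the only limiter."""
    if not 0 < shared_per_block_kb <= MAX_SHARED_PER_BLOCK_KB:
        raise ValueError("a block must use more than 0 and at most 48 KB")
    return shared_per_sm_kb // shared_per_block_kb


for per_block_kb in (8, 12, 16, 24):
    kepler = blocks_by_shared_memory(per_block_kb, 48)   # largest Kepler split
    maxwell = blocks_by_shared_memory(per_block_kb, 64)  # dedicated on SMM
    print(f"{per_block_kb:2d} KB/block: Kepler {kepler} blocks, Maxwell {maxwell} blocks")
```

A block that uses 16 KB, for example, fits three times per SM in Kepler's 48 KB configuration and four times on Maxwell. At 24 KB per block both architectures hold two blocks, so the gain depends on how the per-block footprint divides into the total.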
3. Fast Shared Memory Atomics
Maxwell provides native shared memory atomic operations for 32-bit
integers and native shared memory 32-bit and 64-bit compare-and-swap
(CAS), which can be used to implement other atomic functions. In
contrast, the Fermi and Kepler architectures implemented shared memory
atomics using a lock/update/unlock pattern that could be expensive in
the presence of high contention for updates to particular locations in
shared memory.
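The idea of building other atomic functions from compare-and-swap can be illustrated with a retry loop. The sketch below is a host-side Python model, not GPU code: the `Word` class is a hypothetical stand-in for one word of shared memory, and its internal lock only models the hardware making a single CAS indivisible.

```python
import threading


class Word:
    """Toy stand-in for one word of shared memory (hypothetical, for illustration)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # models the indivisibility of one CAS

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`; return the value seen."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def atomic_max(word, candidate):
    """Atomic max built from CAS; returns the value observed before any update."""
    old = word.load()
    while old < candidate:
        seen = word.compare_and_swap(old, candidate)
        if seen == old:
            break  # our swap succeeded
        old = seen  # another thread changed the word; retry against the new value
    return old


word = Word(0)
workers = [threading.Thread(target=atomic_max, args=(word, v)) for v in (3, 9, 4, 7)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(word.load())
```

Each caller retries only when another thread changed the word between its read and its swap, which is the same pattern a CUDA kernel would follow with a CAS on a shared memory location.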
4. Support for Dynamic Parallelism
Kepler GK110 introduced a new architectural feature called Dynamic
Parallelism, which allows the GPU to create additional work for itself. A
programming model enhancement leveraging this feature was introduced in
CUDA 5.0 to enable threads running on GK110 to launch additional
kernels onto the same GPU.
SMM brings Dynamic Parallelism into the mainstream by supporting it
across the product line, even in lower-power chips such as GM107. This
will benefit developers, because it means that applications will no
longer need special-case algorithm implementations for high-end GPUs
that differ from those usable in more power-constrained environments.
5. Learn More about Programming Maxwell
For more architecture details and guidance on optimizing your code for Maxwell, I encourage you to check out the Maxwell Tuning Guide and Maxwell Compatibility Guide, which are available now to CUDA Registered Developers.