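Programming Massively Parallel Processors, Chapter 5: Exercise Solutions
These are my own solutions to some of the exercises. I was short on time, so they are somewhat rough and incomplete. Discussion and corrections are welcome.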
5.1 Consider the matrix addition in Exercise 3.1. Can one use shared memory to reduce the global memory bandwidth consumption?
Hint: analyze the elements accessed by each thread and see if there is any commonality between threads.
Answer: Shared memory is not needed for Exercise 3.1. Each thread reads its input elements exactly once, and no element is read by more than one thread, so there is no reuse that shared memory could exploit to reduce global memory traffic.
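The commonality check suggested by the hint can be sketched in Python. This is a host-side model rather than CUDA code: it assumes the kernel assigns one output element to each thread, and the name `count_input_reads` is made up for illustration.

```python
from collections import Counter

def count_input_reads(n):
    """Count how many simulated threads read each input element
    of an n*n matrix addition C = A + B."""
    reads = Counter()
    for row in range(n):              # one simulated thread per output element
        for col in range(n):
            reads[("A", row, col)] += 1   # thread reads A[row][col]
            reads[("B", row, col)] += 1   # thread reads B[row][col]
    return reads

reads = count_input_reads(4)
print(max(reads.values()))  # 1: no element is read by more than one thread
```

Since the maximum read count is 1, there is nothing for the threads to share.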
5.2 Draw the equivalent of Figure 5.6 for an 8*8 matrix multiplication with 2*2 tiling and 4*4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.
Answer:
1. An 8*8 matrix multiplication with 2*2 tiling

Block0,0, Phases 1 and 2:

| Thread | Step | Phase 1 | Phase 2 |
| --- | --- | --- | --- |
| thread0,0 | Load M | M0,0 ↓ Mds0,0 | M0,2 ↓ Mds0,0 |
| | Load N | N0,0 ↓ Nds0,0 | N2,0 ↓ Nds0,0 |
| | Compute | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,1 | Load M | M0,1 ↓ Mds0,1 | M0,3 ↓ Mds0,1 |
| | Load N | N0,1 ↓ Nds0,1 | N2,1 ↓ Nds0,1 |
| | Compute | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread1,0 | Load M | M1,0 ↓ Mds1,0 | M1,2 ↓ Mds1,0 |
| | Load N | N1,0 ↓ Nds1,0 | N3,0 ↓ Nds1,0 |
| | Compute | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,1 | Load M | M1,1 ↓ Mds1,1 | M1,3 ↓ Mds1,1 |
| | Load N | N1,1 ↓ Nds1,1 | N3,1 ↓ Nds1,1 |
| | Compute | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |

Block0,0, Phases 3 and 4:

| Thread | Step | Phase 3 | Phase 4 |
| --- | --- | --- | --- |
| thread0,0 | Load M | M0,4 ↓ Mds0,0 | M0,6 ↓ Mds0,0 |
| | Load N | N4,0 ↓ Nds0,0 | N6,0 ↓ Nds0,0 |
| | Compute | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,1 | Load M | M0,5 ↓ Mds0,1 | M0,7 ↓ Mds0,1 |
| | Load N | N4,1 ↓ Nds0,1 | N6,1 ↓ Nds0,1 |
| | Compute | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread1,0 | Load M | M1,4 ↓ Mds1,0 | M1,6 ↓ Mds1,0 |
| | Load N | N5,0 ↓ Nds1,0 | N7,0 ↓ Nds1,0 |
| | Compute | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,1 | Load M | M1,5 ↓ Mds1,1 | M1,7 ↓ Mds1,1 |
| | Load N | N5,1 ↓ Nds1,1 | N7,1 ↓ Nds1,1 |
| | Compute | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
2. An 8*8 matrix multiplication with 4*4 tiling

Block0,0:

| Thread | Step | Phase 1 | Phase 2 |
| --- | --- | --- | --- |
| thread0,0 | Load M | M0,0 ↓ Mds0,0 | M0,4 ↓ Mds0,0 |
| | Load N | N0,0 ↓ Nds0,0 | N4,0 ↓ Nds0,0 |
| | Compute | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 |
| thread0,1 | Load M | M0,1 ↓ Mds0,1 | M0,5 ↓ Mds0,1 |
| | Load N | N0,1 ↓ Nds0,1 | N4,1 ↓ Nds0,1 |
| | Compute | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 |
| thread0,2 | Load M | M0,2 ↓ Mds0,2 | M0,6 ↓ Mds0,2 |
| | Load N | N0,2 ↓ Nds0,2 | N4,2 ↓ Nds0,2 |
| | Compute | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 |
| thread0,3 | Load M | M0,3 ↓ Mds0,3 | M0,7 ↓ Mds0,3 |
| | Load N | N0,3 ↓ Nds0,3 | N4,3 ↓ Nds0,3 |
| | Compute | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 |

The rows for thread1,x through thread3,x follow the same pattern and are omitted.
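As the tables show, the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles. Without tiling, each thread loads 8 elements of M and 8 elements of N from global memory, 16 loads in total. With 2*2 tiling, each thread loads one M element and one N element in each of 4 phases, 8 loads in total, a reduction by a factor of 2. With 4*4 tiling there are only 2 phases, so each thread performs 4 loads, a reduction by a factor of 4. In general, a larger tile lets more threads share each loaded element, and the number of phases that read from global memory shrinks in proportion.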
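The per-thread load pattern can also be generated with a short Python sketch. It is a host-side model rather than the kernel itself: it assumes the usual tiled indexing in which thread (ty, tx) loads M[row][ph*tile + tx] and N[ph*tile + ty][col] in phase ph, and the helper names `thread_loads` and `loads_per_thread` are made up for illustration. A tile width of 1 stands in for the untiled kernel.

```python
def thread_loads(width, tile, ty, tx, by=0, bx=0):
    """List, phase by phase, the global-memory elements that one thread loads."""
    row = by * tile + ty
    col = bx * tile + tx
    phases = []
    for ph in range(width // tile):
        m = ("M", row, ph * tile + tx)   # stored into Mds[ty][tx]
        n = ("N", ph * tile + ty, col)   # stored into Nds[ty][tx]
        phases.append((m, n))
    return phases

def loads_per_thread(width, tile):
    """Total global loads per thread: one M and one N element per phase."""
    return 2 * len(thread_loads(width, tile, 0, 0))

# Phase 2 of thread0,1 with 2*2 tiling: M0,3 and N2,1
print(thread_loads(8, 2, 0, 1)[1])  # (('M', 0, 3), ('N', 2, 1))

untiled = loads_per_thread(8, 1)    # 16 loads per thread without tiling
for tile in (2, 4):
    tiled = loads_per_thread(8, tile)
    print(tile, tiled, untiled // tiled)  # prints "2 8 2" then "4 4 4"
```

The last column is the reduction factor, which equals the tile width in both cases.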
5.3 What type of incorrect execution behavior can happen if one forgot to use __syncthreads() in the kernel of Figure 5.12?
Answer: The __syncthreads() barrier in line 11 ensures that all threads have finished loading the tiles of d_M and d_N into Mds and Nds before any of them moves forward. The __syncthreads() barrier in line 14 ensures that all threads have finished using the d_M and d_N elements in shared memory before any of them moves on to the next iteration and loads elements of the next tiles. Without the first barrier, a thread could read Mds or Nds elements that other threads have not loaded yet. Without the second barrier, a thread could overwrite tile elements that other threads are still using. In both cases some threads compute with wrong input values, so the result is incorrect.
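Both failure modes can be reproduced with a small sequential simulation in Python. This is only a sketch under simplifying assumptions: it uses a one-dimensional tile, each simulated thread sums all of `data` through a shared buffer, and the barrier-free case models just one possible schedule (each thread runs to completion before the next one starts). The function name `run` is made up for illustration.

```python
def run(data, tile, use_barrier):
    """Each of `tile` simulated threads sums all of `data`, one tile per phase."""
    phases = len(data) // tile
    shared = [0] * tile      # stands in for a __shared__ array
    sums = [0] * tile        # one private accumulator per thread

    def load(t, ph):
        shared[t] = data[ph * tile + t]   # thread t loads its element of the tile

    def compute(t):
        sums[t] += sum(shared)            # thread t consumes the whole tile

    if use_barrier:
        for ph in range(phases):
            for t in range(tile):         # every thread loads ...
                load(t, ph)
            # first barrier: the tile is complete before anyone reads it
            for t in range(tile):         # ... then every thread computes
                compute(t)
            # second barrier: everyone is done before the tile is overwritten
    else:
        for t in range(tile):             # thread t races through all phases alone
            for ph in range(phases):
                load(t, ph)
                compute(t)
    return sums

data = [1, 2, 3, 4]
print(run(data, 2, use_barrier=True))    # [10, 10]
print(run(data, 2, use_barrier=False))   # [4, 12]
```

In the barrier-free run, thread 0 reads a slot that thread 1 has not loaded yet, and thread 1 reads a slot that thread 0 has already overwritten with the next tile, so both sums are wrong.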
5.4 Assuming capacity was not an issue for registers or shared memory, give one case in which it would be valuable to use shared memory instead of registers to hold values fetched from global memory. Explain your answer.
Answer: Setting capacity aside, the biggest difference between the two is that a register is private to a single thread, whereas shared memory can be accessed by all threads in a block.
Tiled matrix multiplication is therefore a good example: a value that one thread fetches from global memory is also needed by other threads in the block, so holding it in shared memory lets them reuse it instead of fetching it again.
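5.5 For our tiled matrix-matrix multiplication kernel, if we use a 32*32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
a. 1/8 of the original usage
b. 1/16 of the original usage
c. 1/32 of the original usage
d. 1/64 of the original usage
Answer: c. With a T*T tile, each element loaded into shared memory is reused by T threads, so global memory traffic drops to 1/T of the original, here 1/32.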
5.6 Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
a. 1
b. 1000
c. 512
d. 512000
Answer: d. A local variable is private to each thread, so there is one version per thread: 1000 * 512 = 512000.
5.7 In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
a. 1
b. 1000
c. 512
d. 51200
Answer: b. A shared memory variable is created once per thread block, so there are 1000 versions.
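The arithmetic behind the answers to 5.6 and 5.7 can be written down as a tiny Python helper. The function name `variable_versions` is made up for illustration; it only encodes the scoping rules (one copy per thread for a local variable, one copy per block for a shared memory variable).

```python
def variable_versions(num_blocks, threads_per_block):
    """Number of copies of a variable created over the kernel's lifetime."""
    return {
        "local": num_blocks * threads_per_block,  # private to each thread
        "shared": num_blocks,                     # one copy per thread block
    }

versions = variable_versions(1000, 512)
print(versions)  # {'local': 512000, 'shared': 1000}
```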
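5.9 Consider performing a matrix multiplication of two input matrices with dimensions N*N. How many times is each element in the input matrices requested from global memory when:
a. There is no tiling?
b. Tiles of size T*T are used?
Answer:
a. N times. Each element of the first input matrix is needed by the N threads that compute one row of the output, each element of the second by the N threads that compute one column, and without tiling every thread fetches it separately.
b. N/T times. Each element is loaded once by every block that needs it, and N/T blocks need it.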
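These counts can be checked by brute force with a Python sketch that walks over every block, phase and thread and tallies each global load. It assumes the same tiled indexing as the standard kernel (thread (ty, tx) loads M[row][ph*tile + tx] and N[ph*tile + ty][col] in phase ph), uses a tile width of 1 to stand in for the untiled kernel, and the name `requests_per_element` is made up for illustration.

```python
from collections import Counter

def requests_per_element(n, tile):
    """Count global-memory requests for every element of M and N."""
    counts = Counter()
    blocks = n // tile
    for by in range(blocks):
        for bx in range(blocks):
            for ph in range(n // tile):
                for ty in range(tile):
                    for tx in range(tile):
                        row = by * tile + ty
                        col = bx * tile + tx
                        counts[("M", row, ph * tile + tx)] += 1
                        counts[("N", ph * tile + ty, col)] += 1
    return counts

n = 8
for tile in (1, 2, 4):
    counts = requests_per_element(n, tile)
    # every element is requested the same number of times: n / tile
    print(tile, sorted(set(counts.values())))  # [8], then [4], then [2]
```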