《Programming Massively Parallel Processors》 Chapter 5 Exercise Solutions

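These are my own solutions to some of the exercises. Because of time constraints they are a bit rough and not complete; discussion and corrections are welcome.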

5.1 Consider the matrix addition in Exercise 3.1. Can one use shared memory to reduce the global memory bandwidth consumption?

Hint: analyze the elements accessed by each thread and see if there is any commonality between threads.

Answer: I think there is no need to use shared memory in Exercise 3.1. Each thread reads its input elements from global memory exactly once, and no element is read by more than one thread, so there is no reuse for shared memory to exploit.
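The claim can be checked with a small host-side sketch in Python. It assumes the one-thread-per-output-element variant of the matrix addition kernel; the function name `matrix_add_reads` is made up for this illustration and does not come from the book.

```python
from collections import Counter


def matrix_add_reads(n):
    """Count how often each input element is read in an n x n matrix addition.

    Assumption: one thread per output element, computing
    C[row][col] = A[row][col] + B[row][col].
    """
    reads = Counter()
    for row in range(n):          # one simulated thread per (row, col)
        for col in range(n):
            reads[("A", row, col)] += 1
            reads[("B", row, col)] += 1
    return reads


reads = matrix_add_reads(8)
print(max(reads.values()))  # prints 1: no element is read more than once
```

Since the maximum read count is 1, staging the data in shared memory would only add an extra copy without saving any global memory traffic.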

5.2 Draw the equivalent of Figure 5.6 for an 8*8 matrix multiplication with 2*2 tiling and 4*4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.

Answer:

1. An 8*8 matrix multiplication with 2*2 tiling (Block0,0)

| Thread | Phase | Load M | Load N | Computation |
|---|---|---|---|---|
| thread0,0 | 1 | M0,0 → Mds0,0 | N0,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,0 | 2 | M0,2 → Mds0,0 | N2,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,0 | 3 | M0,4 → Mds0,0 | N4,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,0 | 4 | M0,6 → Mds0,0 | N6,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 |
| thread0,1 | 1 | M0,1 → Mds0,1 | N0,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread0,1 | 2 | M0,3 → Mds0,1 | N2,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread0,1 | 3 | M0,5 → Mds0,1 | N4,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread0,1 | 4 | M0,7 → Mds0,1 | N6,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 |
| thread1,0 | 1 | M1,0 → Mds1,0 | N1,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,0 | 2 | M1,2 → Mds1,0 | N3,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,0 | 3 | M1,4 → Mds1,0 | N5,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,0 | 4 | M1,6 → Mds1,0 | N7,0 → Nds1,0 | Pvalue1,0 += Mds1,0*Nds0,0 + Mds1,1*Nds1,0 |
| thread1,1 | 1 | M1,1 → Mds1,1 | N1,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
| thread1,1 | 2 | M1,3 → Mds1,1 | N3,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
| thread1,1 | 3 | M1,5 → Mds1,1 | N5,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |
| thread1,1 | 4 | M1,7 → Mds1,1 | N7,1 → Nds1,1 | Pvalue1,1 += Mds1,0*Nds0,1 + Mds1,1*Nds1,1 |

2. An 8*8 matrix multiplication with 4*4 tiling (Block0,0)

| Thread | Phase | Load M | Load N | Computation |
|---|---|---|---|---|
| thread0,0 | 1 | M0,0 → Mds0,0 | N0,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 |
| thread0,0 | 2 | M0,4 → Mds0,0 | N4,0 → Nds0,0 | Pvalue0,0 += Mds0,0*Nds0,0 + Mds0,1*Nds1,0 + Mds0,2*Nds2,0 + Mds0,3*Nds3,0 |
| thread0,1 | 1 | M0,1 → Mds0,1 | N0,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 |
| thread0,1 | 2 | M0,5 → Mds0,1 | N4,1 → Nds0,1 | Pvalue0,1 += Mds0,0*Nds0,1 + Mds0,1*Nds1,1 + Mds0,2*Nds2,1 + Mds0,3*Nds3,1 |
| thread0,2 | 1 | M0,2 → Mds0,2 | N0,2 → Nds0,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 |
| thread0,2 | 2 | M0,6 → Mds0,2 | N4,2 → Nds0,2 | Pvalue0,2 += Mds0,0*Nds0,2 + Mds0,1*Nds1,2 + Mds0,2*Nds2,2 + Mds0,3*Nds3,2 |
| thread0,3 | 1 | M0,3 → Mds0,3 | N0,3 → Nds0,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 |
| thread0,3 | 2 | M0,7 → Mds0,3 | N4,3 → Nds0,3 | Pvalue0,3 += Mds0,0*Nds0,3 + Mds0,1*Nds1,3 + Mds0,2*Nds2,3 + Mds0,3*Nds3,3 |

The rows for thread1,x through thread3,x follow the same pattern and are omitted.

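As the tables show, the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles. Without tiling, each thread loads 8 elements of M and 8 elements of N, 16 global loads in total. With 2*2 tiles, each thread loads one M element and one N element in each of the 4 phases, 8 loads in total, a reduction by a factor of 2. With 4*4 tiles there are only 2 phases, so each thread makes 4 loads, a reduction by a factor of 4. In general, with T*T tiles on N*N matrices each thread makes 2N/T global loads instead of 2N, so the traffic shrinks by a factor of T.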
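The counting argument can be reproduced with a short Python sketch that walks over blocks, phases, and threads and tallies global loads. It is a host-side model, not the CUDA kernel itself; the function names `loads_untiled` and `loads_tiled` are invented for this illustration, and the matrix size is assumed to be a multiple of the tile size.

```python
def loads_untiled(n):
    """Global loads for an n x n product when every thread reads its own
    row of M and column of N straight from global memory."""
    threads = n * n
    return threads * 2 * n  # n elements of M plus n elements of N per thread


def loads_tiled(n, tile):
    """Global loads when tile x tile blocks stage data through shared memory.

    Assumes n is a multiple of tile. In every phase each thread loads exactly
    one M element and one N element.
    """
    assert n % tile == 0
    blocks = (n // tile) ** 2
    phases = n // tile
    loads = 0
    for _ in range(blocks):
        for _ in range(phases):
            for _ in range(tile * tile):  # one iteration per thread in the block
                loads += 2
    return loads


for tile in (2, 4):
    base, tiled = loads_untiled(8), loads_tiled(8, tile)
    print(f"tile {tile}: {tiled} loads, reduction factor {base // tiled}")
# prints:
# tile 2: 512 loads, reduction factor 2
# tile 4: 256 loads, reduction factor 4
```

The reduction factor equals the tile width in both cases, which matches the 2N/T versus 2N loads-per-thread argument.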

5.3 What type of incorrect execution behavior can happen if one forgot to use __syncthreads() in the kernel of Figure 5.12?

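Answer: The barrier __syncthreads() in line 11 ensures that all threads have finished loading the tiles of d_M and d_N into Mds and Nds before any of them moves forward. The barrier __syncthreads() in line 14 ensures that all threads have finished using the d_M and d_N elements in shared memory before any of them moves on to the next iteration and loads the elements of the next tiles. Without the first barrier, a thread could start its dot product before other threads have written their elements, and so read stale or uninitialized values. Without the second barrier, a fast thread could overwrite Mds and Nds with the next tile while slower threads are still reading the current one. Either way the kernel has a race condition: the results may be wrong, and the error can vary from run to run.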
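The effect can be illustrated with a deterministic Python model of one thread block. Each simulated thread is a generator that yields where the kernel would call __syncthreads(). With barriers, all threads advance in lockstep from one yield to the next; without them, the scheduler is free to run one thread to completion before the next, which is one interleaving a real GPU could legally produce. The names `tile_thread`, `run_block`, and `reference_block` are made up for this sketch, and only block (0,0) is modelled.

```python
def tile_thread(ty, tx, M, N, P, Mds, Nds, tile):
    """One simulated thread of block (0,0); each yield marks a barrier."""
    pvalue = 0
    for ph in range(len(M) // tile):
        Mds[ty][tx] = M[ty][ph * tile + tx]   # load one M element
        Nds[ty][tx] = N[ph * tile + ty][tx]   # load one N element
        yield                                 # barrier 1: tiles fully loaded
        for k in range(tile):
            pvalue += Mds[ty][k] * Nds[k][tx]
        yield                                 # barrier 2: tiles fully consumed
    P[ty][tx] = pvalue


def run_block(M, N, tile, use_barriers):
    P = [[0] * tile for _ in range(tile)]
    Mds = [[0] * tile for _ in range(tile)]
    Nds = [[0] * tile for _ in range(tile)]
    threads = [tile_thread(ty, tx, M, N, P, Mds, Nds, tile)
               for ty in range(tile) for tx in range(tile)]
    if use_barriers:
        # Lockstep: every thread reaches a barrier before any thread passes it.
        active = threads
        while active:
            still_running = []
            for t in active:
                try:
                    next(t)
                    still_running.append(t)
                except StopIteration:
                    pass
            active = still_running
    else:
        # No barriers: each thread runs through all phases on its own.
        for t in threads:
            for _ in t:
                pass
    return P


def reference_block(M, N, tile):
    """Direct computation of the top-left tile x tile block of M * N."""
    n = len(M)
    return [[sum(M[r][k] * N[k][c] for k in range(n)) for c in range(tile)]
            for r in range(tile)]


M = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
expected = reference_block(M, M, 2)
print(run_block(M, M, 2, use_barriers=True) == expected)    # prints True
print(run_block(M, M, 2, use_barriers=False) == expected)   # prints False
```

In the barrier-free run, thread (0,0) multiplies against shared-memory slots that the other threads have not filled yet, so its Pvalue misses most of the products.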

5.4 Assuming capacity was not an issue for registers or shared memory, give one case in which it would be valuable to use shared memory instead of registers to hold values fetched from global memory. Explain your answer.

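Answer: Leaving capacity aside, the biggest difference between the two is that a register is private to a single thread, while shared memory is visible to all threads in a block. Shared memory is therefore the better choice whenever a value fetched by one thread is also needed by other threads in the same block.

Tiled matrix multiplication is a good example: each M and N element that a thread loads is reused by other threads of the block, which would be impossible if the value sat in a register.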

5.5 For our tiled matrix-matrix multiplication kernel, if we use a 32*32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?

a. 1/8 of the original usage

b. 1/16 of the original usage

c. 1/32 of the original usage

d. 1/64 of the original usage

Answer: c

5.6 Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 512000

Answer: d

5.7 In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 512000

Answer: b
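The arithmetic behind the answers to 5.6 and 5.7 fits in a few lines of Python. The helper name `variable_instances` and its `scope` strings are made up for this sketch; the only facts used are that a local (automatic) variable is private to each thread and a shared memory variable is created once per block.

```python
def variable_instances(num_blocks, threads_per_block, scope):
    """Number of instances of a kernel variable created over a whole launch."""
    if scope == "local":      # one private copy per thread
        return num_blocks * threads_per_block
    if scope == "shared":     # one copy per thread block
        return num_blocks
    raise ValueError(f"unknown scope: {scope}")


print(variable_instances(1000, 512, "local"))   # prints 512000 (5.6, answer d)
print(variable_instances(1000, 512, "shared"))  # prints 1000   (5.7, answer b)
```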

5.9 Consider performing a matrix multiplication of two input matrices with dimensions N*N. How many times is each element in the input matrices requested from global memory when:

a. There is no tiling?

b. Tiles of size T*T are used?

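Answer: a. N times. Each element of the first matrix is needed by the N threads that compute one row of the output, and each element of the second matrix by the N threads that compute one column.

b. N/T times. Each element is loaded once by each of the N/T blocks that need it and is then reused from shared memory.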
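Both counts can be verified by tallying, per element, how often it is requested. The following Python sketch models the access pattern on the host; `requests_untiled` and `requests_tiled` are hypothetical names, and N is assumed to be a multiple of T.

```python
from collections import Counter


def requests_untiled(n):
    """Per-element request counts when each thread reads directly from global memory."""
    m_reads, n_reads = Counter(), Counter()
    for row in range(n):                 # thread (row, col) computes P[row][col]
        for col in range(n):
            for k in range(n):
                m_reads[(row, k)] += 1
                n_reads[(k, col)] += 1
    return m_reads, n_reads


def requests_tiled(n, tile):
    """Per-element request counts with tile x tile shared-memory tiling."""
    m_reads, n_reads = Counter(), Counter()
    steps = n // tile
    for by in range(steps):              # block row
        for bx in range(steps):          # block column
            for ph in range(steps):      # phase
                for ty in range(tile):
                    for tx in range(tile):
                        m_reads[(by * tile + ty, ph * tile + tx)] += 1
                        n_reads[(ph * tile + ty, bx * tile + tx)] += 1
    return m_reads, n_reads


m_u, n_u = requests_untiled(8)
m_t, n_t = requests_tiled(8, 4)
print(set(m_u.values()), set(n_u.values()))  # prints {8} {8}: N requests each
print(set(m_t.values()), set(n_t.values()))  # prints {2} {2}: N/T requests each
```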
