vulkan asynchronous compute
https://www.youtube.com/watch?v=XOGIDMJThto
https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf
https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization
https://gpuopen.com/concurrent-execution-asynchronous-queues/
通过queue的并行 增加GPU的并行
并发性 concurrency
Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each of those containing 4 Single-Instruction-Multiple-Data units (SIMD) and each SIMD executes blocks of 64 threads, which we call a “wavefront”.
Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.
GPU有64个CU
每个CU 4个SIMD
每个SIMD 64blocks ----- 一个wavefront
ps的计算在里面
GPU提升并发性 减小GPU idel
async compute
- Copy Queue(DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
- Compute queue (DirectX 12 and Vulkan): execute compute shaders or copy data, preferably within local memory
- Direct Queue (DirectX 12) / Graphics Queue (Vulkan): this queue can do anything, so it is similar to the main device in legacy APIs
这三种queue对应metal里面三种encoder 是为了增加上文所述并发性
对GPU底层的 操作这种可行性是通过这里的queue体现的
vulkan对queue的个数有限制 可以query
dx12没有这种个数限制
更多部分拿出来用cs做异步计算
看图--技能点还没点
problem shooting
- If resources are located in system memory accessing those from Graphics or Compute queues will have an impact on DMA queue performance and vice versa.
- Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations 带宽限制 数据onchip
- Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads to execute on the same CU
- Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance
Due to the reasons above it is recommended to determine bottlenecks for each pass and place passes with complementary bottlenecks next to each other:
- Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
- Depth only rendering passes are usually good candidates to have some compute tasks run next to it
- A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1
- Porting as much of the frame to compute will result in more flexibility when experimenting which tasks can be scheduled next to each other
- Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of “for each light clear shadow map, render shadow, compute VSM” do “clear all shadow maps, render all shadow maps, compute VSM for all shadow maps”)
然后给异步计算的功能加上开关
看vulkan这个意思 它似乎没有metal2 那种persistent thread group 维持数据cs ps之间传递时还可以 on tile
vulkan asynchronous compute的更多相关文章
- Vulkan在Android使用Compute shader
oeip 相关功能只能运行在window平台,想移植到android平台,暂时选择vulkan做为图像处理,主要一是里面有单独的计算管线且支持好,二是熟悉下最新的渲染技术思路. 这个 demo(git ...
- android下vulkan与opengles纹理互通
先放demo源码地址:https://github.com/xxxzhou/aoce 06_mediaplayer 效果图: 主要几个点: 用ffmpeg打开rtmp流. 使用vulkan Compu ...
- 剖析虚幻渲染体系(13)- RHI补充篇:现代图形API之奥义与指南
目录 13.1 本篇概述 13.1.1 本篇内容 13.1.2 概念总览 13.1.3 现代图形API特点 13.2 设备上下文 13.2.1 启动流程 13.2.2 Device 13.2.3 Sw ...
- GPUImage移植总结
项目github地址: aoce 我是去年年底才知道有GPUImage这个项目,以前也一直没有在移动平台开发过,但是我在win平台有编写一个类似的项目oeip(不要关注了,所有功能都移植或快移植到ao ...
- Compute Resource Consolidation Pattern 计算资源整合模式
Consolidate multiple tasks or operations into a single computational unit. This pattern can increase ...
- 论文笔记之:Asynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement Learning ICML 2016 深度强化学习最近被人发现貌似不太稳定,有人提出很多改善的方法,这些方法有很 ...
- Vulkan Tutorial 13 Render passes
操作系统:Windows8.1 显卡:Nivida GTX965M 开发工具:Visual Studio 2017 Setup 在我们完成管线的创建工作,我们接下来需要告诉Vulkan渲染时候使用的f ...
- Vulkan Tutorial 16 Command buffers
操作系统:Windows8.1 显卡:Nivida GTX965M 开发工具:Visual Studio 2017 诸如绘制和内存操作相关命令,在Vulkan中不是通过函数直接调用的.我们需要在命令缓 ...
- Vulkan Tutorial 29 Loading models
操作系统:Windows8.1 显卡:Nivida GTX965M 开发工具:Visual Studio 2017 Introduction 应用程序现在已经可以渲染纹理3D模型,但是 vertice ...
随机推荐
- idea把maven依赖树输出到控制台
第一步 选中红色方框 第二步 点进去 输入命令:mvn dependency:tree 如果要输出到文件,找到pom文件的位置 进入命令行 输入: mvn dependency:tree >d: ...
- GrapeCity Documents (服务端文档API组件) V3.0 正式发布
近日,葡萄城GrapeCity Documents(服务端文档API组件)V3.0 正式发布! 该版本针对 Excel 文档.PDF 文档和 Word 文档的 API 全面更新,加入了用于生成 Exc ...
- Httpwatch教程
启动Httpwatch 从IE的“查看”—“浏览器栏”—“HttpWatch”启动HttpWatch.如下图所示: 以下是HttpWatch程序界面 以下用登录我的邮箱mail.163.com例子来展 ...
- Django中的图片加载不出来解决方式记录
背景:Python3.6 + Django2.2 在模板中的html文件中引用图片时,在浏览器中图片总是显示不出来,上网查了很多解决方式,但是都没有解决问题,最终尝试了多次后得以解决,但不清楚原理: ...
- 【51nod】2591 最终讨伐
[51nod]2591 最终讨伐 敲51nod是啥评测机啊,好几次都编译超时然后同一份代码莫名奇妙在众多0ms中忽然超时 这道题很简单就是\(M\)名既被诅咒也有石头的人,要么就把石头给没有石头被诅咒 ...
- 【AC自动机】Censoring
[题目链接] https://loj.ac/problem/10059 [题意] 有一个长度不超过 1e5 的字符串 .Farmer John 希望在 T 中删掉 n 个屏蔽词(一个屏蔽词可能出现多 ...
- Codeforces 1097E. Egor and an RPG game
传送门 首先考虑怎么算 $f(n)$ (就是题目里面那个 $f(n)$) 发现可以构造一组序列大概长这样: ${1,3,2,6,5,4,10,9,8,7,15,14,13,12,11,...,n(n+ ...
- 牛客 P21336 和与或 (数位dp)
大意: 给定数组$R$, 求有多少个数组$A$, 满足$0\le A_i \le R_i$且$A_0+...+A_{N-1}=A_0\space or ...\space or \space A_{N ...
- .Net下二进制形式的文件存储与读取
.Net下图片的常见存储与读取凡是有以下几种:存储图片:以二进制的形式存储图片时,要把数据库中的字段设置为Image数据类型(SQL Server),存储的数据是Byte[].1.参数是图片路径:返回 ...
- python selenium5 模拟点击+拖动+按照指定相对坐标拖动 58同城验证码
#!/usr/bin/python # -*- coding: UTF-8 -*- # @Time : 2019年12月9日11:41:08 # @Author : shenghao/10347899 ...