Visionworks OpenVX
heterogeneous computation framework
- Intel Computer Vision SDK
- AMD OVX
- NVIDIA VisionWorks
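The vendors above have passed the conformance test. ARM also has a similar SDK (Compute Library), whose architecture was modeled on OpenVX during early development.
Although OpenVX was originally designed as a software framework for computer vision, the programming model and popularity of neural networks led the Khronos OpenVX working group to define a Neural Network Extension, which brings OpenVX into the deep learning arena as well.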
NVIDIA VisionWorks toolkit is a software development package for computer vision (CV) and image processing. VisionWorks™ implements and extends the Khronos OpenVX standard, and it is optimized for CUDA-capable GPUs and SOCs enabling developers to realize CV applications on a scalable and flexible platform.
VisionWorks includes the following primitives:
- Absolute Difference
- Accumulate Image
- Accumulate Squared
- Accumulate Weighted
- Add / Subtract / Multiply +
- Channel Combine
- Channel Extract
- Color Convert +
- CopyImage
- Convert Depth
- Magnitude
- MultiplyByScalar
- Not / Or / And / Xor
- Phase
- Table Lookup
- Threshold
- Median Flow
- Optical Flow (LK) +
- Semi-Global Matching
- Stereo Block Matching
- IME Create Motion Field
- IME Refine Motion Field
- IME Partition Motion Field
- Affine Warp +
- Warp Perspective +
- Flip Image
- Remap
- Scale Image +
- BoxFilter
- Convolution
- Dilation Filter
- Erosion Filter
- Gaussian Filter
- Gaussian Pyramid
- Laplacian3x3
- Median Filter
- Scharr3x3
- Sobel 3x3
- Canny Edge Detector
- FAST Corners +
- FAST Track +
- Harris Corners +
- Harris Track
- Hough Circles
- Hough Lines
- Histogram
- Histogram Equalization
- Integral Image
- Mean Std Deviation
- Min Max Locations
OpenVX for us
- [x] Support user defined processing
- [ ] Support optimization of duplicate processing
- [ ] Open source framework (if available)
User defined processing
Yes. Implement it as a user node, based on the Advanced Tiling Extensions (see the "Intel's Extensions to the OpenVX* API: Advanced Tiling" chapter).
Support optimization of duplicate processing
- Use virtual images whenever possible, as this unlocks many graph compiler optimizations.
- Whenever possible, prefer standard nodes and/or extensions over user kernel nodes (which serve as memory and execution barriers, hindering performance). This gives the Pipeline Manager much more flexibility to optimize the graph execution.
- If you still need to implement a user node, base it on the Advanced Tiling Extensions (see the Intel's Extensions to the OpenVX* API: Advanced Tiling chapter)
- If the application has independent graphs, run these graphs in parallel using the asynchronous graph-scheduling API call.
- Provide enough parallel slack to the scheduler: do not break work (for example, images) into too many tiny pieces. Consider kernel fusion.
- For images, use the smallest data type that fits the application's accuracy needs (for example, 32->16->8 bits).
- Consider heterogeneous execution (see the Heterogeneous Computing with OpenVINO™ toolkit chapter).
- You can create an OpenVX image object that references memory that was externally allocated. To enable zero-copy with the GPU, the externally allocated memory should be aligned.
- Beware of the (often prohibitive) graph verification latency costs. For example, construct the graph so that it does not require re-verification when parameters are updated. Note that unlike Map/Unmap for the input images (see the Map/Unmap for OpenVX* Images section), setting new images with different metadata (size, type, etc.) almost certainly triggers verification, potentially adding significant overhead.
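The alignment requirement for externally allocated memory can be checked before a buffer is handed to the runtime. The following is a minimal sketch in plain C++, not Intel's or Khronos' code: `AlignedBuffer` and `is_aligned` are hypothetical helper names, the 4096-byte alignment used in the usage note is only a placeholder (the required value is not stated above), and the actual call that wraps the pointer in an OpenVX image is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Returns true when ptr is a multiple of alignment.
// Assumption: alignment is a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: owns an over-allocated byte buffer and exposes an
// aligned window of `size` bytes inside it.
class AlignedBuffer {
public:
    AlignedBuffer(std::size_t size, std::size_t alignment)
        : storage_(size + alignment), size_(size), data_(nullptr) {
        void* p = storage_.data();
        std::size_t space = storage_.size();
        // std::align moves p forward to the next aligned address that
        // still leaves room for `size` bytes.
        data_ = static_cast<unsigned char*>(std::align(alignment, size, p, space));
    }
    // Copying would leave data_ pointing into the other object's storage.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;

    unsigned char* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t size_;
    unsigned char* data_;
};
```

For a 480x360 RGB image, `AlignedBuffer buf(480 * 360 * 3, 4096);` yields a pointer for which `is_aligned(buf.data(), 4096)` holds. Over-allocating by one alignment unit keeps the sketch portable; a platform-specific aligned allocator would avoid the extra bytes.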
Open source framework (if available)
Software Requirements
A Windows build environment needs these components:
- Intel® HD Graphics Driver (latest version)†
- OpenCV 3.4 or higher
- Intel® C++ Compiler 2017 Update 4
- CMake* 2.8 or higher
- Python* 3.5 or higher
- Visual Studio* 2015 or 2017
Get the Software
Your license includes the full version of the product. To access the toolkit:
- Make sure your system meets the minimum requirements listed on this page.
- Complete the registration form.
- Download the product.
- The code is highly optimized for both x86 CPU and OpenCL for GPU
- Supported hardware spans the range from low power embedded APUs (like the new G series) to laptop, desktop and workstation graphics
- Supports Windows, Linux, and OS X
- Includes a “graph optimizer” that looks at the entire processing pipeline and removes/replaces/merges functions to improve performance and minimize bandwidth at runtime
- Scripting support allows for rapid prototyping, without re-compiling at production performance levels
CPU: SSE4.1 or above CPU, 64-bit.
GPU: Radeon Professional Graphics Cards or Vega Family of Products (16GB required for vx_loomsl and vx_nn libraries)
OpenCV 3 (optional) for RunVX
- Set OpenCV_DIR environment variable to OpenCV/build folder
Build Instructions
Build this project to generate AMD OpenVX library and RunVX executable.
- Refer to openvx/include/VX for Khronos OpenVX standard header files.
- Refer to openvx/include/vx_ext_amd.h for vendor extensions in AMD OpenVX library.
- Refer to runvx/ for RunVX details.
- Refer to runcl/ for RunCL details.
Build using Visual Studio Professional 2013 on 64-bit Windows 10/8.1/7
- Install OpenCV 3 with contrib download for RunVX tool to support camera capture and image display (optional)
- OpenCV_DIR environment variable should point to OpenCV/build folder
- Use amdovx-core/amdovx.sln to build for x64 platform
- If AMD GPU (or OpenCL) is not available, set build flag ENABLE_OPENCL=0 in openvx/openvx.vcxproj and runvx/runvx.vcxproj.
Download to C:\Users\aeejshe\Downloads
- C:\Users\aeejshe\Downloads\amdovx-core-0.9-beta2
- C:\Users\aeejshe\Downloads\opencv
Build SW according to guidelines, especially
- modify lib to C:\Users\aeejshe\Downloads\opencv\build\x64\vc12\lib\opencv_world310d.lib
C:\Users\aeejshe\Downloads\amdovx-core-0.9-beta2\amdovx-core-0.9-beta2>runvx exa
***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****
runvx.exe 0.9.7
OK: using AMD OpenVX 0.9.7
OK: enabled graph scheduling in separate threads
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms
OK: capturing 480x360 image(s) into 480x360 RGB image buffer
csv,OVERALL, PASS, 1, , 8.60, 8.60, 0.00, 0.00, 0.00, 0.00 (median 8.598)
> total elapsed time: 0.11 sec
Abort: Press any key to exit...
# create input and output images
data input = image:480,360,RGB2
data output = image:480,360,U008
# specify input source for input image and request for displaying input and output images
read input examples/images/face1.jpg
view input inputWindow
view output edgesWindow
# compute luma image channel from input RGB image
data yuv = image-virtual:0,0,IYUV
data luma = image-virtual:0,0,U008
node org.khronos.openvx.color_convert input yuv
node org.khronos.openvx.channel_extract yuv !CHANNEL_Y luma
# compute edges in luma image using Canny edge detector
data hyst = threshold:RANGE,UINT8:INIT,80,100
data gradient_size = scalar:INT32,3
node org.khronos.openvx.canny_edge_detector luma hyst gradient_size !NORM_L1 output
input --> |color_convert| yuv
yuv --> |channel_extract| luma
luma --> |merge| merged
hyst --> merged
gradient_size --> merged
merged --> |canny_edge_detector| output
***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****
runvx.exe 0.9.7
runvx.exe [options] [file] <file.gdf> [argument(s)]
runvx.exe [options] node <kernelName> [argument(s)]
runvx.exe [options] shell [argument(s)]
The argument(s) are data objects created using <data-description> syntax.
These arguments can be accessed from inside GDF as $1, $2, etc.
The available command-line options are:
Show full help.
Turn on verbose logs.
Replace ~ in filenames with <directory> in the command-line and
GDF file. The default value of '~' is current working directory.
Run the graph/node for specified frames or until eof or just as live.
Use live to indicate that input is live until aborted by user.
Set context affinity to CPU or GPU.
Print performance profiling information after graph launch.
use directive VX_DIRECTIVE_AMD_ENABLE_PROFILE_CAPTURE when graph is create
Continue graph processing even if compare mismatches occur.
Replace all virtual data types in GDF with non-virtual data types.
Use of this flag (i.e. for debugging) can make a graph run slower.
dump profile
C:\Users\aeejshe\Downloads\amdovx-core-0.9-beta2\amdovx-core-0.9-beta2>runvx -dump-profile examples\gdf\canny.gdf
***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****
runvx.exe 0.9.7
OK: using AMD OpenVX 0.9.7
OK: enabled graph scheduling in separate threads
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms
OK: capturing 480x360 image(s) into 480x360 RGB image buffer
csv,OVERALL, PASS, 1, , 8.62, 8.62, 0.00, 0.00, 0.00, 0.00 (median 8.621)
> total elapsed time: 0.07 sec
> graph profile:
1, 8.621, 8.621, 8.621, 8.621,CPU,GRAPH
1, 1.196, 1.196, 1.196, 1.196,CPU,com.amd.openvx.ColorConvert_Y_RGB
1, 4.905, 4.905, 4.905, 4.905,CPU,com.amd.openvx.CannySobel_U16_U8_3x3_
1, 2.305, 2.305, 2.305, 2.305,CPU,com.amd.openvx.CannySuppThreshold_U8X
1, 0.208, 0.208, 0.208, 0.208,CPU,com.amd.openvx.CannyEdgeTrace_U8_U8XY
Abort: Press any key to exit...
Test if CSE works
# create input and output images
data input = image:480,360,RGB2
data output = image:480,360,U008
data output2 = image:480,360,U008
# specify input source for input image and request for displaying input and output images
read input examples/images/face1.jpg
view input inputWindow
view output edgesWindow
# compute luma image channel from input RGB image
data yuv = image-virtual:0,0,IYUV
data yuv2 = image-virtual:0,0,IYUV
data luma = image-virtual:0,0,U008
data luma2 = image-virtual:0,0,U008
node org.khronos.openvx.color_convert input yuv
node org.khronos.openvx.color_convert input yuv2
node org.khronos.openvx.channel_extract yuv !CHANNEL_Y luma
node org.khronos.openvx.channel_extract yuv2 !CHANNEL_Y luma2
# compute edges in luma image using Canny edge detector
data hyst = threshold:RANGE,UINT8:INIT,80,100
data gradient_size = scalar:INT32,3
node org.khronos.openvx.canny_edge_detector luma hyst gradient_size !NORM_L1 output
node org.khronos.openvx.canny_edge_detector luma2 hyst gradient_size !NORM_L1 output2
C:\Users\aeejshe\Downloads\amdovx-core-0.9-beta2\amdovx-core-0.9-beta2>runvx -dump-profile examples\gdf\canny.gdf
***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****
runvx.exe 0.9.7
OK: using AMD OpenVX 0.9.7
OK: enabled graph scheduling in separate threads
csv,HEADER ,STATUS, COUNT,cur-ms,avg-ms,min-ms,clenqueue-ms,clwait-ms,clwrite-ms
OK: capturing 480x360 image(s) into 480x360 RGB image buffer
csv,OVERALL, PASS, 1, , 17.13, 17.13, 0.00, 0.00, 0.00, 0.00 (median 17.127)
> total elapsed time: 0.07 sec
> graph profile:
1, 17.127, 17.127, 17.127, 17.127,CPU,GRAPH
1, 1.202, 1.202, 1.202, 1.202,CPU,com.amd.openvx.ColorConvert_Y_RGB
1, 1.192, 1.192, 1.192, 1.192,CPU,com.amd.openvx.ColorConvert_Y_RGB
1, 4.857, 4.857, 4.857, 4.857,CPU,com.amd.openvx.CannySobel_U16_U8_3x3_
1, 4.838, 4.838, 4.838, 4.838,CPU,com.amd.openvx.CannySobel_U16_U8_3x3_
1, 2.312, 2.312, 2.312, 2.312,CPU,com.amd.openvx.CannySuppThreshold_U8X
1, 2.302, 2.302, 2.302, 2.302,CPU,com.amd.openvx.CannySuppThreshold_U8X
1, 0.209, 0.209, 0.209, 0.209,CPU,com.amd.openvx.CannyEdgeTrace_U8_U8XY
1, 0.207, 0.207, 0.207, 0.207,CPU,com.amd.openvx.CannyEdgeTrace_U8_U8XY
Abort: Press any key to exit...
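Q: Why does CSE not work? The profile shows every kernel running twice, and the graph time doubling from 8.621 ms to 17.127 ms.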
VX_API_ENTRY vx_graph VX_API_CALL vxCreateGraph(vx_context context);
VX_API_ENTRY vx_status VX_API_CALL vxVerifyGraph(vx_graph graph);
VX_API_ENTRY vx_status VX_API_CALL vxProcessGraph(vx_graph graph);
VX_API_ENTRY vx_image VX_API_CALL vxCreateVirtualImage(vx_graph graph, vx_uint32 width, vx_uint32 height, vx_df_image color);
VX_API_ENTRY vx_node VX_API_CALL vxColorConvertNode(vx_graph graph, vx_image input, vx_image output);
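For reference, this is roughly what a common-subexpression elimination pass would have to do with the duplicated GDF graph. It is a toy sketch in plain C++, not AMD OpenVX code: `ToyNode`, `CseResult` and `eliminate_common_subexpressions` are hypothetical names, data objects are identified by their GDF names, and the pass assumes nodes are listed producers-first. It also ignores a real complication of the test graph, namely that `output` and `output2` are non-virtual images that both have to be written.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy node: a kernel name, the names of its input data objects
// (including enum arguments such as "!CHANNEL_Y"), and one output name.
struct ToyNode {
    std::string kernel;
    std::vector<std::string> inputs;
    std::string output;
};

struct CseResult {
    std::vector<ToyNode> nodes;                // nodes that survive the pass
    std::map<std::string, std::string> alias;  // eliminated output -> surviving output
};

// Assumption: nodes are in topological order (producers before consumers).
inline CseResult eliminate_common_subexpressions(const std::vector<ToyNode>& nodes) {
    CseResult result;
    // (kernel, canonical inputs) -> output of the first node that computed it
    std::map<std::pair<std::string, std::vector<std::string>>, std::string> seen;
    for (const ToyNode& node : nodes) {
        ToyNode canonical = node;
        // Rewrite inputs that were produced by an already eliminated node.
        for (std::string& in : canonical.inputs) {
            auto it = result.alias.find(in);
            if (it != result.alias.end()) in = it->second;
        }
        auto key = std::make_pair(canonical.kernel, canonical.inputs);
        auto hit = seen.find(key);
        if (hit != seen.end()) {
            // Same kernel on the same inputs: reuse the earlier result.
            result.alias[node.output] = hit->second;
        } else {
            seen.emplace(key, canonical.output);
            result.nodes.push_back(canonical);
        }
    }
    return result;
}
```

Fed with the six nodes of the test GDF, the pass keeps three nodes and records `yuv2 -> yuv`, `luma2 -> luma` and `output2 -> output`. Whether AMD OpenVX attempts anything like this during `vxVerifyGraph` is exactly the open question above.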
[G-API Intro](file:///C:/Users/aeejshe/Downloads/2018-12-24-GAPI_Overview.pdf)
GAPI_EXPORTS GMat resize(const GMat& src, const Size& dsize, double fx = 0, double fy = 0, int interpolation = INTER_LINEAR);
class GComputation {
public:
    GComputation(GProtoInputArgs &&ins,
                 GProtoOutputArgs &&outs); // Arg-to-arg overload
    void apply(GRunArgs &&ins, GRunArgsP &&outs, GCompileArgs &&args = {});
    // (other members omitted in this excerpt)
};
Call sequence of the G-API apply function:
GComputation -> GComputation2: apply
GComputation2 -> GCompiler: compile
GCompiler -> Graph: build graph
Graph --> GComputation2: return ade::Graph
GComputation2 -> Graph: exec the graph
Vision grab post processing
Study whether OpenVINO or OpenCV supports:
- CSE (common-subexpression elimination)
- feeding only part of the inputs
| Lib | CSE | Partial inputs |
|---|---|---|
| OpenVINO | x | x |
| AMDOVX | x | x |
| OpenCV G-API | x | x |
| Intel TBB | x | v (behavior: the ready nodes are called, then it exits; code: C:\jshe\codes\lua\src\tbbtest\test_tbb_behavior.cpp) |
| TensorFlow | v | |
Test whether the graph can be called multiple times, as in the following:
while true
modify input
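The loop above is cheap only if the graph is verified once and then re-run. The sketch below models that behavior with a toy class in plain C++; it is not an OpenVX implementation. `ToyGraph` and `ToyImageMeta` are hypothetical names, and the verification counter merely stands in for the expensive verification step: updating the input with the same size keeps the graph verified, while a different size invalidates it.

```cpp
#include <cstdint>

// Hypothetical stand-in for image metadata (size only, for brevity).
struct ToyImageMeta {
    std::uint32_t width;
    std::uint32_t height;
};

// Toy graph that counts how often it had to be (re-)verified.
class ToyGraph {
public:
    explicit ToyGraph(ToyImageMeta meta) : meta_(meta) {}

    // New pixel data with identical metadata does not invalidate the graph;
    // different metadata does.
    void set_input(ToyImageMeta meta) {
        if (meta.width != meta_.width || meta.height != meta_.height) {
            meta_ = meta;
            verified_ = false;
        }
    }

    // Verifies lazily, then "executes" the graph.
    void process() {
        if (!verified_) {
            ++verify_count_;  // stands in for the costly verification
            verified_ = true;
        }
        ++process_count_;
    }

    int verify_count() const { return verify_count_; }
    int process_count() const { return process_count_; }

private:
    ToyImageMeta meta_;
    bool verified_ = false;
    int verify_count_ = 0;
    int process_count_ = 0;
};
```

Calling `set_input` with a 480x360 image followed by `process()` three times in a row verifies once and processes three times; switching to a 640x480 input before the next `process()` raises the verification count to two.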
| Lib | high level | low level |
|---|---|---|
| ovx | strongly typed, e.g. VX_API_ENTRY vx_node VX_API_CALL vxColorConvertNode(vx_graph graph, vx_image input, vx_image output); | weakly typed, e.g. OpenVX.dll!agoCreateNode(_vx_graph * graph, int kernel_id) |
| tbb | strongly typed, e.g. make_edge(tbb::flow::output_port<1>(gpu_slm_split_n), tbb::flow::input_port<1>(gpu_slm_mat_mult_n)) tbb::flow::function_node< validation_args_type > mat_validation_n(g, tbb::flow::unlimited, [](const validation_args_type& result) { // Get references to matrixes const tbb::flow::gfx_buffer& GPU_SLM_MAT = std::get<0>(result); const tbb::flow::gfx_buffer& CPU_SLM_MAT = std::get<1>(result); const tbb::flow::gfx_buffer& CPU_NAIVE_MAT = std::get<2>(result); // Verify results | Not sure |
| G-API | strongly typed | TODO |
// ovx: \vis_bep_12\C\Users\aeejshe\Downloads\amdovx-core-0.9-beta2\amdovx-core-0.9-beta2
// tbb: C:\Users\aeejshe\Downloads\tbb2017_20170604oss_win\tbb2017_20170604oss
How to register Kernel
Define an enum entry:
OVX_KERNEL_ENTRY( VX_KERNEL_COLOR_CONVERT , ColorConvert, "color_convert", AIN_AOUT, ATYPE_II , false ),
Meaning of the parameters:
#define OVX_KERNEL_ENTRY(kernel_id,name,kname,argCfg,argType,validRectReset) \
- AIN_AOUT: 1 in, 1 out
- ATYPE_II: 2 image types
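A table of entries like this is commonly built with the X-macro pattern, where one list expands into both an enum and a lookup table. The sketch below illustrates the pattern only; it is not the definition of `OVX_KERNEL_ENTRY`, whose body is cut off above. All `TOY_*` names, the `ToyKernelInfo` struct, `find_toy_kernel` and the input/output counts are assumptions made for illustration.

```cpp
#include <cstring>

// Hypothetical entry list: X(enum id, kernel name, input count, output count)
#define TOY_KERNEL_LIST(X)                                  \
    X(TOY_KERNEL_COLOR_CONVERT,   "color_convert",   1, 1)  \
    X(TOY_KERNEL_CHANNEL_EXTRACT, "channel_extract", 2, 1)

// Expansion 1: the enum of kernel ids.
#define TOY_AS_ENUM(id, name, nin, nout) id,
enum ToyKernelId { TOY_KERNEL_LIST(TOY_AS_ENUM) TOY_KERNEL_COUNT };

// Expansion 2: a table describing each kernel.
struct ToyKernelInfo {
    ToyKernelId id;
    const char* name;
    int inputs;
    int outputs;
};
#define TOY_AS_ROW(id, name, nin, nout) { id, name, nin, nout },
static const ToyKernelInfo kToyKernels[] = { TOY_KERNEL_LIST(TOY_AS_ROW) };

// Looks a kernel up by name; returns nullptr when it is not registered.
inline const ToyKernelInfo* find_toy_kernel(const char* name) {
    for (const ToyKernelInfo& k : kToyKernels) {
        if (std::strcmp(k.name, name) == 0) return &k;
    }
    return nullptr;
}
```

Adding a kernel then means adding one line to the list; the enum value and the table row cannot drift apart. `find_toy_kernel("color_convert")` returns the row whose id is `TOY_KERNEL_COLOR_CONVERT`.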
Implement the "DramaDivideNode" operation, which replaces a node with the implementation best suited to this PC architecture:
int agoDramaDivideNode(AgoNodeList * nodeList, AgoNode * anode)
{
    // save parameter list
    vx_uint32 paramCount = anode->paramCount;
    AgoData * paramList[AGO_MAX_PARAMS]; memcpy(paramList, anode->paramList, sizeof(paramList));
    // divide the node depending on the type
    int status = -1;
    switch (anode->akernel->id)
    {
    case VX_KERNEL_COLOR_CONVERT:
        status = agoDramaDivideColorConvertNode(nodeList, anode);
        break;
    // (cases for the other kernels omitted in this excerpt)
    }
    // (rest of the function omitted in this excerpt)
}
The function is called from the optimize step during graph verification, as this call stack shows:
> OpenVX.dll!agoCreateNode(_vx_graph * graph, int kernel_id) Line 2699 C++
OpenVX.dll!agoDramaDivideAppend(AgoNodeList * nodeList, _vx_node * anode, int new_kernel_id, _vx_reference * * paramList, unsigned int paramCount) Line 37 C++
OpenVX.dll!agoDramaDivideAppend(AgoNodeList * nodeList, _vx_node * anode, int new_kernel_id) Line 56 C++
OpenVX.dll!agoDramaDivideColorConvertNode(AgoNodeList * nodeList, _vx_node * anode) Line 244 C++
OpenVX.dll!agoDramaDivideNode(AgoNodeList * nodeList, _vx_node * anode) Line 1818 C++
OpenVX.dll!agoOptimizeDramaDivide(_vx_graph * agraph) Line 1962 C++
OpenVX.dll!agoOptimizeDrama(_vx_graph * agraph) Line 522 C++
OpenVX.dll!agoOptimizeGraph(_vx_graph * agraph) Line 209 C++
OpenVX.dll!vxVerifyGraph(_vx_graph * graph) Line 2450 C++
runvx.exe!CVxEngine::ProcessGraph(std::vector<char const *,std::allocator<char const *> > * graphNameList, unsigned __int64 beginIndex) Line 285 C++
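The effect of the divide step is visible in the profile earlier: a single canny_edge_detector node runs as separate Sobel, suppression/threshold and edge-trace kernels. The sketch below mimics that rewrite on a toy node list in plain C++. It is not the AMD implementation: `ToyGraphNode`, `toy_divide_node`, `toy_optimize_divide` and every `toy.*` kernel name are hypothetical, and the real code also rewires parameters, which is skipped here.

```cpp
#include <string>
#include <vector>

// Toy node identified only by its kernel name.
struct ToyGraphNode {
    std::string kernel;
};

// Appends the specialized replacement(s) for a generic node to `out`.
// Returns 0 when the node was divided, -1 when no rule applies.
inline int toy_divide_node(const ToyGraphNode& node, std::vector<ToyGraphNode>& out) {
    if (node.kernel == "canny_edge_detector") {
        out.push_back({"toy.canny_sobel"});
        out.push_back({"toy.canny_supp_threshold"});
        out.push_back({"toy.canny_edge_trace"});
        return 0;
    }
    if (node.kernel == "color_convert") {
        out.push_back({"toy.color_convert_rgb_to_iyuv"});
        return 0;
    }
    return -1;
}

// Runs the divide step over a whole node list; nodes without a rule are kept.
inline std::vector<ToyGraphNode> toy_optimize_divide(const std::vector<ToyGraphNode>& nodes) {
    std::vector<ToyGraphNode> result;
    for (const ToyGraphNode& node : nodes) {
        if (toy_divide_node(node, result) != 0) result.push_back(node);
    }
    return result;
}
```

Applied to the three nodes color_convert, channel_extract and canny_edge_detector, the pass returns five nodes: one specialized color conversion, the untouched channel_extract, and the three canny stages.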
How to schedule graph?
What optimization is done in optimize()?
Choose the best