This post is based mainly on an English-language article, which is quoted below without translation; it is easy to follow.

　　First, my platform:

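　　1. IDE: Visual Studio 2012, C#, .NET Framework 4.5, installed to the default path;

　　2. GPU: NVIDIA GeForce GT 755M (a laptop mobile GPU); CUDA Toolkit version: cuda_6.5.14_windows_general_64, installed to the default path.

　　3. managedCUDA version and download link: managedCUDA. Author: kunzmi, version 15. Please note that the copyright belongs to the original author; many thanks to kunzmi.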

——————————————————————————————————————————————————————————————

　　Configuring and using managedCUDA in C# with .NET Framework 4.5

　　I. About managedCuda

　　　ManagedCuda provides intuitive access to the Cuda driver API for any .NET language. It is roughly the equivalent of the runtime API (= a comfortable wrapper of the driver API for C/C++), but written entirely in C# for .NET. In contrast to the runtime API, managedCUDA takes a different approach to representing CUDA specifics: managedCuda is object oriented. In general you will find a C# class for each Cuda handle in the driver API. For example, instead of a CUcontext handle, managedCUDA provides a CudaContext class. This design gives intuitive and simple access to all API calls by providing the corresponding methods on each class. A good example of this wrapping approach is a device variable. In the original Cuda driver API these are plain C pointers. In managedCuda they are represented by the class Cuda[Pitched]DeviceVariable<T>, a generic class that allows type-safe and object-oriented access to the Cuda driver API. Because a CudaDeviceVariable instance knows its wrapped data type, array size, dimensions and possibly a memory alignment pitch, a simple call to CopyToHost(hostArray) is enough. The user doesn't need to supply all the C-style function arguments; this is done automatically. Furthermore, managedCuda throws specific exceptions when something goes wrong, so you don't need to check API call return values; you only need to catch a CudaException just like any other exception.

　　　　But still, as a developer using managedCuda you need to know Cuda. You must know how to use contexts, set kernel launch grid configurations etc.

　　　　In the following I briefly describe the main classes used to implement a fully functional Cuda application in C#:

　　　　The CudaContext class: This is one of the three main classes and represents a Cuda context. From Cuda 4.0 on, the Cuda API requires (at least) one context per process per device, so for each device you want to use, you need to create a CudaContext instance. The different constructors let you define several properties, e.g. the device ID to use. Like nearly all managedCuda classes, CudaContext implements IDisposable, and the wrapped Cuda context is valid until Dispose() is called. CudaContext also defines a number of static methods to retrieve general information about (possible) Cuda devices. Important for multithreaded applications: to use any Cuda object related to a context, you must activate that context by calling its SetCurrent() method from the current thread. This applies to every thread switch (see the Cuda programming guide for more information).
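A minimal sketch of the context rules above, assuming the managedCUDA assembly is referenced and at least one CUDA device is present: one CudaContext per device, and a worker thread that calls SetCurrent() before it touches any object belonging to that context. CudaContext.GetDeviceCount(), the CudaContext(int) constructor, SetCurrent() and CudaDeviceVariable<T> are managedCUDA members; the class name ContextDemo and the tiny test allocation are illustrative only.

```csharp
using System;
using System.Threading;
using ManagedCuda;

static class ContextDemo
{
    // Creates one context per device and uses each one from a separate worker thread.
    // Returns the number of devices that were exercised.
    public static int UseAllDevices()
    {
        int count = CudaContext.GetDeviceCount();
        for (int dev = 0; dev < count; dev++)
        {
            // The wrapped context stays valid until Dispose(), i.e. the end of this block.
            using (CudaContext ctx = new CudaContext(dev))
            {
                Exception failure = null;
                Thread worker = new Thread(() =>
                {
                    try
                    {
                        // A new thread has no current context: bind this one first.
                        ctx.SetCurrent();
                        using (CudaDeviceVariable<int> d = new CudaDeviceVariable<int>(4))
                        {
                            d.CopyToDevice(new int[] { 1, 2, 3, 4 });
                        }
                    }
                    catch (Exception e)
                    {
                        failure = e; // hand the error back to the calling thread
                    }
                });
                worker.Start();
                worker.Join();
                if (failure != null)
                    throw new InvalidOperationException("Device " + dev + " failed", failure);
            }
        }
        return count;
    }
}
```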

　　　　CudaKernel: Cuda kernels are loaded from cubin or ptx files. You can load a kernel with the LoadKernel…() methods of a CudaContext, either from a byte array representation of the kernel file (e.g. an embedded resource) or by specifying the name of the file where the kernel is stored. You also need the kernel name as defined in the *.cu source file. The LoadKernel methods return a CudaKernel object bound to the given context. CudaKernel does not implement IDisposable, because kernels are destroyed automatically as soon as the corresponding context is destroyed.

　　　　CudaDeviceVariable and its variants: A CudaDeviceVariable object represents memory allocated on the device. The class knows the exact memory layout (such as array length, array dimensions, memory pitch, etc.). Because the class is generic, it also knows its element type and type size. All this dramatically simplifies data copying, since no size parameters are needed: only the source or destination must be given (either an ordinary C# host array or another device variable). The device memory is freed as soon as the CudaDeviceVariable object is disposed.

　　　　With these three main classes you can create an entire Cuda-accelerated application in C# with only a few lines of code.
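The three classes can be combined into a complete host program. This is a sketch under stated assumptions, not the library's own sample: it assumes the *.cu template in the setup steps below contains, inside its extern "C" block, a hypothetical kernel __global__ void VecAdd(const float* A, const float* B, float* C, int N) computing C[i] = A[i] + B[i] for i < N, and that vectorAdd_x64.ptx / vectorAdd_x86.ptx are copied to the output directory. LoadKernel, BlockDimensions, GridDimensions, Run, DevicePointer, CopyToDevice and CopyToHost are managedCUDA members; the class, file and kernel names are assumptions.

```csharp
using System;
using ManagedCuda;

static class VectorAddDemo
{
    // Adds two random vectors on the GPU and verifies the result on the CPU.
    public static bool Run(int n)
    {
        Random rnd = new Random(42);
        float[] hA = new float[n], hB = new float[n], hC = new float[n];
        for (int i = 0; i < n; i++)
        {
            hA[i] = (float)rnd.NextDouble();
            hB[i] = (float)rnd.NextDouble();
        }

        // Pick the PTX file that matches the bitness of the running process.
        string ptxFile = IntPtr.Size == 8 ? "vectorAdd_x64.ptx" : "vectorAdd_x86.ptx";

        try
        {
            // using blocks dispose in reverse order: device memory first, then the context.
            using (CudaContext ctx = new CudaContext(0))
            using (CudaDeviceVariable<float> dA = new CudaDeviceVariable<float>(n))
            using (CudaDeviceVariable<float> dB = new CudaDeviceVariable<float>(n))
            using (CudaDeviceVariable<float> dC = new CudaDeviceVariable<float>(n))
            {
                CudaKernel vecAdd = ctx.LoadKernel(ptxFile, "VecAdd");
                const int threadsPerBlock = 256;
                // Assumes dim3 converts implicitly from int in this managedCUDA version.
                vecAdd.BlockDimensions = threadsPerBlock;
                vecAdd.GridDimensions = (n + threadsPerBlock - 1) / threadsPerBlock;

                dA.CopyToDevice(hA); // no size argument: the variable knows its length
                dB.CopyToDevice(hB);
                vecAdd.Run(dA.DevicePointer, dB.DevicePointer, dC.DevicePointer, n);
                dC.CopyToHost(hC);
            }
        }
        catch (CudaException e)
        {
            Console.WriteLine("CUDA error: " + e.Message);
            return false;
        }

        for (int i = 0; i < n; i++)
            if (Math.Abs(hC[i] - (hA[i] + hB[i])) > 1e-6f)
                return false;
        return true;
    }
}
```

Errors from any driver call surface as a CudaException, so the only explicit error handling is the single catch block.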

　　　　Other managedCuda classes:

　　　　CudaPagelockedHostMemory: To use the asynchronous copy methods (host to device or device to host), the host array must be allocated as pinned (page-locked) memory. For this purpose, CudaPagelockedHostMemory[2D,3D] allocates the memory using CUDA's cuMemHostAlloc. To simplify per-element access, the class provides an indexer to get or set single values. When working with large datasets, be aware that every single per-element access crosses the managed/unmanaged memory boundary and must be marshaled, so access is not really fast. For large amounts of data, copying a managed array to the unmanaged memory in one block is faster.

　　　　CudaPagelockedHostMemory_[Type]: As the previous approach using generics and marshalling was not satisfying in terms of speed and direct pointer arithmetic with generics is not possible in C#, I tried something new, what I would call "templates with C#" using T4: A T4 template creates all possible variants like 'float', 'int4', etc. which then access memory directly via pointers. The achieved performance of this approach is close to native arrays. In case you want to use CudaPagelockedHostMemory with your own datatypes, simply copy the tt-file to your project and modify the list of types to process (but be aware of the license: managedCUDA is LGPL!).

　　　　CudaManagedMemory_[Type]: Using the same approach as for page-locked memory, CudaManagedMemory gives .NET access to the full feature set of managed memory, introduced with Cuda 6.0.

　　　　CudaRegisteredHostMemory: In C++, registered host memory is ordinarily allocated memory that becomes usable for asynchronous copies once it is registered. In the .NET world, however, this doesn't work as expected: although CudaRegisteredHostMemory is part of managedCUDA, it shouldn't be used. Use CudaPagelockedHostMemory instead.

　　　　CudaArray[1D,2D,3D]: Represents a CUArray. Either you specify an already existing CUArray as the storage location, e.g. from graphics interop, or a new CUArray is created internally. The inner CUArray is freed on disposal only if it was allocated by the constructor.

　　　　CudaTextureFoo: Represents a Cuda texture reference. The device memory to bind the texture to can either be created internally by the constructor or passed as an argument. The memory is freed on disposal only if it was allocated by the constructor.

　　　　GraphicsInterop: Several graphics interop resource classes exist, one for every graphics API (DirectX or OpenGL). All these resources must be registered and can be mapped to cuda variables, cuda textures or cuda arrays, depending on their type. For efficient mapping, all resources can be grouped in a CudaGraphicsInteropResourceCollection, so that one single Map() call is enough to finish the task. Have a look at the sample applications to see how to use the collection.　　

　　II. Additional libraries:

CudaFFT: Managed access to cufft*.dll
CudaRand: Managed access to curand*.dll
CudaSparse: Managed access to cusparse*.dll
CudaBlas: Managed access to cublas*.dll
CudaSolve: Managed access to cusolver*.dll
NPP: Managed access to npp*.dll
NVRTC: Managed access to nvrtc*.dll

　　　　All libraries have in common that they compile either to 32 or to 64 bit, in order to handle the different wrapped DLL names for 32 and 64 bit. Each includes a basic representation called *NativeMethods for calling the API functions directly, and wraps the handles in C# classes.

　　　　CudaBitmapSource is a simple attempt to use Cuda device memory as a BitmapSource in WPF. It is more a proof of concept than a ready-to-use library; in particular, the fact that BitmapSource is a sealed class makes a proper implementation difficult. If you have ideas for improvements or a better design, please let me know ;-)

　　　　III. How To: Set up a C# Cuda project using Visual Studio 2010 (Solution 1):

　　(My Visual Studio is a German edition, so some “translated” menu entries might therefore differ slightly from the original English menu entries.)

　　You need: Microsoft Visual Studio 201x, Nvidia Cuda Toolkit 7.0, Nvidia Parallel Nsight 4.0 for debugging and, of course, managedCuda. (Note: this article uses CUDA 6.5; everywhere below, 7.0 has been replaced with 6.5.)

Create a normal C# project (ConsoleApplication, Library, WinForms, WPF, etc.); a C# console application is used here.

　　　　Steps: open the Visual Studio IDE, choose File > New > Project > Visual C# > Console Application, enter “vectorAdd” as the name, and click OK.

Add a new CudaRuntime 6.5.0 project to the same solution.

　　　　Steps: in Solution Explorer, right-click the solution “vectorAdd” and choose Add > New > Project > NVIDIA > CUDA 6.5 > CUDA 6.5 Runtime, enter “vectorAddKernel” as the name, and click OK. You can rename the automatically generated CUDA source file kernel.cu in the new vectorAddKernel project to vectorAdd.cu.

Delete the Cuda sample code. To enable proper IntelliSense, include the following header files (from the toolkit's include folder) in your *.cu file:
#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include <builtin_types.h>
#include <vector_functions.h>
#include "float.h"

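So that the IDE can find these header files, add the include and library paths: right-click the project “vectorAddKernel” > Properties > Configuration Properties > VC++ Directories, and set the following:

Include Directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include

Library Directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\lib\x64

Alternatively, you can solve this once and for all with environment variables, so that the include and library directories no longer have to be added to each project separately. Set up the environment variables as follows: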

　　After installation, the system has two new environment variables, CUDA_PATH and CUDA_PATH_V6_5. Next, add the following environment variables:

　　CUDA_SDK_PATH = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5

　　CUDA_LIB_PATH = %CUDA_PATH%\lib\x64

　　CUDA_BIN_PATH = %CUDA_PATH%\bin

　　CUDA_SDK_BIN_PATH = %CUDA_SDK_PATH%\bin\x64

　　CUDA_SDK_LIB_PATH = %CUDA_SDK_PATH%\common\lib\x64

　　Then append the following to the end of the system variable PATH:

　　;%CUDA_LIB_PATH%;%CUDA_BIN_PATH%;%CUDA_SDK_LIB_PATH%;%CUDA_SDK_BIN_PATH%;

Also add the following defines:
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif
Write your kernel code in an extern "C" {} scope:

//Includes for IntelliSense
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif

#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include "float.h"
#include <builtin_types.h>
#include <vector_functions.h>

// Texture reference: 2D texture with float2 elements
texture<float2, 2> texref;

extern "C"
{
    //kernel code
    __global__ void kernel(/* parameters */)
    {
    }
}

You can also omit extern "C" in order to use templated kernels. But then the kernel names get mangled (“_Z18GMMReductionKernelILi4ELb1EEviPfiPK6uchar4iPhiiiPj” instead of “GMMReductionKernel”; to look up the right mangled name, open the compiled ptx file in a text editor). To load such a kernel you need the full mangled name.
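Instead of searching the PTX file by hand, the entry names can also be listed programmatically. This sketch relies only on the fact that PTX declares kernels with the .entry directive (typically as .visible .entry name(...)); the regular expression is a simplification of the PTX identifier rules, and PtxEntries is a hypothetical helper, not part of managedCUDA.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class PtxEntries
{
    // Kernels are declared in PTX with ".entry <name>(", usually preceded by ".visible".
    // Simplified identifier rule: first char is a letter, '_', '$' or '%',
    // followed by letters, digits, '_' or '$'.
    static readonly Regex EntryPattern = new Regex(@"\.entry\s+([A-Za-z_$%][A-Za-z0-9_$]*)");

    // Returns the (possibly mangled) kernel names in the order they appear.
    public static List<string> FromText(string ptx)
    {
        List<string> names = new List<string>();
        foreach (Match m in EntryPattern.Matches(ptx))
            names.Add(m.Groups[1].Value);
        return names;
    }

    public static List<string> FromFile(string path)
    {
        return FromText(File.ReadAllText(path));
    }
}
```

For example, PtxEntries.FromFile("vectorAdd_x64.ptx") returns the exact strings that can be passed as the kernel name when loading.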
Change the following project properties of the CudaRuntime 6.5 project:

General:
* Output Directory: set it to the source directory of the C# project, i.e. vectorAdd\vectorAdd, where the first vectorAdd is the solution name and the second is the name of the C# console application created by default.
* Configuration Type: Utility. This avoids a call to the Visual C++ compiler, so no C++ output is created.

　　　　　　　CUDA C/C++:

　　　　　　　　　　* Compiler Output: $(OutDir)%(FileName)_x64.ptx (or .cubin). Note: the _x64 suffix must be specified explicitly, otherwise the build fails. To build for the 32-bit platform, set the compiler output to $(OutDir)%(FileName)_x86.ptx (or .cubin).
* NVCC Compilation Type: “Generate .ptx file (-ptx)” or “Generate .cubin file (-cubin)” respectively; this must match the previous setting.

　　　　　　　　　　* Target Machine Platform: 64-bit (--machine 64).

　　　　　　　You need to set these properties for all possible targets and configurations (x86/x64, Debug/Release). To handle mixed-mode platform kernels, give the x86 and x64 kernel files different names, for example $(OutDir)%(FileName)_x86.ptx and $(OutDir)%(FileName)_x64.ptx.
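Following that naming scheme, the host code can pick the right kernel file at run time from the bitness of the current process. A minimal sketch; KernelFile is a hypothetical helper and assumes the _x86/_x64 suffixes configured above.

```csharp
using System;

static class KernelFile
{
    // Builds the name produced by $(OutDir)%(FileName)_x64.ptx or _x86.ptx.
    public static string PtxName(string baseName, bool is64Bit)
    {
        return baseName + (is64Bit ? "_x64.ptx" : "_x86.ptx");
    }

    // Chooses the file for the process that is actually running:
    // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit process.
    public static string ForCurrentProcess(string baseName)
    {
        return PtxName(baseName, IntPtr.Size == 8);
    }
}
```

Keying on the process rather than the operating system matters because a 32-bit process on 64-bit Windows needs the _x86 file.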

- Delete the post-build event: the CUDA runtime libraries don't need to be copied.

Build the Cuda project once for each platform. Setting required to build the CUDA project: right-click the project “vectorAddKernel” > Build Customizations, check CUDA (.targets, .props), and click OK.

In the C# project, add the newly built kernel files (located in the C# project's source directory) to the project.

Set the file properties either to Embedded Resource (the files are then accessed as a stream (byte[]) when loading kernel images), or set “Copy to Output Directory” to “always” and load the kernel image from file.

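Note: besides adding the vectorAdd_x64.ptx file generated in the previous step to the project vectorAdd (right-click the project “vectorAdd” > Add > Existing Item, select vectorAdd_x64.ptx and add it), you also need to set the Build Action of vectorAdd_x64.ptx to “Embedded Resource”, so that the kernel can be read from the resource as a file stream (right-click the file “vectorAdd_x64.ptx” > Properties > Build Action > Embedded Resource). Alternatively, set Copy to Output Directory to “Copy always”.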
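Once the PTX file is an embedded resource, it can be handed to managedCUDA as a stream. A minimal sketch, assuming a resource name such as vectorAdd.vectorAdd_x64.ptx (default namespace vectorAdd, file in the project root) and a kernel named VecAdd; LoadKernelPTX(Stream, string) is the managedCUDA overload used here, and EmbeddedKernel is a hypothetical helper. When the stream is null, the exception lists the resource names that actually exist.

```csharp
using System;
using System.IO;
using System.Reflection;
using ManagedCuda;

static class EmbeddedKernel
{
    // Loads a kernel from a PTX file embedded in this assembly.
    // resourceName is the full manifest name, e.g. "vectorAdd.vectorAdd_x64.ptx" (assumed).
    public static CudaKernel Load(CudaContext ctx, string resourceName, string kernelName)
    {
        Assembly asm = Assembly.GetExecutingAssembly();
        using (Stream stream = asm.GetManifestResourceStream(resourceName))
        {
            if (stream == null)
            {
                // Usually the namespace prefix is wrong or the Build Action
                // is not "Embedded Resource": show what is really embedded.
                throw new FileNotFoundException(
                    "Embedded kernel '" + resourceName + "' not found. Available resources: " +
                    string.Join(", ", asm.GetManifestResourceNames()));
            }
            return ctx.LoadKernelPTX(stream, kernelName);
        }
    }
}
```

Typical call: CudaKernel k = EmbeddedKernel.Load(ctx, "vectorAdd.vectorAdd_x64.ptx", "VecAdd");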

Add a reference to the managedCuda assembly.

　　　　　　IV. How To: Set up a C# Cuda project using Visual Studio 2010 (Solution 2 from Brian Jimdar)

　　　　　　Using pre-build events:

　　　　　　In the project properties-page of your C# project, add the following pre-build event:

call "%VS100COMNTOOLS%vsvars32.bat"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 64 -o "$(ProjectDir)PTX\%%~na_64.ptx" "$(ProjectDir)Kernels\%%~na.cu"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 32 -o "$(ProjectDir)PTX\%%~na.ptx" "$(ProjectDir)Kernels\%%~na.cu"

　　　　　　This builds an x86 and an x64 version of each .cu file in the .\Kernels directory (named <name>.ptx and <name>_64.ptx respectively) and writes them to the .\PTX directory.

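　　　　V. Solutions to common problems

　　1. Assembly.GetManifestResourceStream always returns null.

　　When running or debugging the code, Assembly.GetManifestResourceStream always returned null.

　　The resource files were clearly there. It turned out that I had merely included the file in the project, whereas Assembly.GetManifestResourceStream searches the assembly's embedded resources. The fix is to right-click the file and, in its Properties, set Build Action to “Embedded Resource”.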

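　　Also: with Assembly.GetManifestResourceStream(type, name), the namespace of type must match the namespace of the resource given by name (the resource's namespace is in fact determined by where the resource is located, i.e. the project's default namespace plus its folder path).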
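The (Type, string) overload makes the namespace rule visible: the namespace of the given type is prepended to the name. A minimal sketch, assuming the project's default namespace is vectorAdd and vectorAdd_x64.ptx sits in the project root, so its manifest name is vectorAdd.vectorAdd_x64.ptx; ResourceLookup is a hypothetical class.

```csharp
using System;
using System.IO;
using System.Reflection;

namespace vectorAdd
{
    static class ResourceLookup
    {
        // Looks up "vectorAdd." + fileName, because typeof(ResourceLookup) lives in namespace vectorAdd.
        // Returns null if no resource with that name is embedded.
        public static Stream OpenScoped(string fileName)
        {
            Assembly asm = typeof(ResourceLookup).Assembly;
            return asm.GetManifestResourceStream(typeof(ResourceLookup), fileName);
        }

        // Prints every embedded resource name; useful when a lookup returns null.
        public static void PrintAll()
        {
            foreach (string name in typeof(ResourceLookup).Assembly.GetManifestResourceNames())
                Console.WriteLine(name);
        }
    }
}
```

If the .ptx file were in a subfolder such as Kernels, its manifest name would gain that folder segment and the lookup type would have to live in the matching namespace.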

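　　Reference: “How to fix GetManifestResourceStream returning a null Stream”

　　2. Exception: System.BadImageFormatException, could not load the correct assembly XXX.

　　　　This is usually caused by the target platform of the application differing from the platform one of its dependencies was compiled for. Building all projects for the same target platform (x86, x64 or AnyCPU) usually solves the problem. Mismatched x86/x64 DLLs, as well as mismatched Debug/Release builds, are especially likely to cause it.

　　Reference: “Exception: System.BadImageFormatException, could not load the correct assembly XXX”

　　3. How to tell in C# whether the operating system is 32-bit or 64-bit.

　　There are many ways; the following code is one of them: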

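// IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process
// (MessageBox requires a reference to System.Windows.Forms)
if (System.IntPtr.Size == 4)
    MessageBox.Show("32-bit operating system");
else if (System.IntPtr.Size == 8)
    MessageBox.Show("64-bit operating system");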

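　　Note that if your operating system is 64-bit Windows 7 and you still get IntPtr.Size == 4, the reason is that your C# project has “Prefer 32-bit” enabled. To turn it off, right-click the project “vectorAdd” > Properties > Build and clear the “Prefer 32-bit” check box.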
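IntPtr.Size reports the bitness of the process, not of the operating system, which is why “Prefer 32-bit” yields 4 on 64-bit Windows. Since .NET Framework 4.0 both values can be queried directly through Environment.Is64BitOperatingSystem and Environment.Is64BitProcess. A small sketch; Bitness is a hypothetical helper.

```csharp
using System;

static class Bitness
{
    // Reports OS and process bitness separately.
    public static string Describe()
    {
        string os = Environment.Is64BitOperatingSystem ? "64-bit OS" : "32-bit OS";
        string process = Environment.Is64BitProcess ? "64-bit process" : "32-bit process";
        return os + ", " + process;
    }

    // True when a 32-bit process runs on a 64-bit OS (WOW64), e.g. because
    // "Prefer 32-bit" is checked or the platform target is x86.
    public static bool RunningUnderWow64()
    {
        return Environment.Is64BitOperatingSystem && !Environment.Is64BitProcess;
    }
}
```

When RunningUnderWow64() returns true, a 64-bit kernel file (_x64.ptx) and 64-bit native DLLs will not match the process.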

　Reference: “How to tell in C# whether the operating system is 32-bit or 64-bit”
