
CUDA Study Notes

February 16, 2014

In CUDA, the CPU and system memory are referred to as the host; the GPU and its memory are referred to as the device.

 

The __global__ qualifier tells the compiler that the function should be compiled to run on the device rather than on the host; CUDA C needed a linguistic method for marking a function as device code.
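A minimal sketch of this marking, in the spirit of the "hello world" example these notes are drawn from (untested here; assumes an NVIDIA toolchain with nvcc):

```cuda
#include <cstdio>

// __global__ marks kernel() as device code: it is compiled for the GPU
// and may only be invoked from host code via the <<< >>> launch syntax.
__global__ void kernel(void) {
}

int main(void) {
    kernel<<<1, 1>>>();        // launch one block of one thread on the device
    cudaDeviceSynchronize();   // wait for the device to finish
    printf("Hello, World!\n");
    return 0;
}
```

The host compiler never sees the body of kernel(); nvcc hands it to the device compiler because of the __global__ qualifier.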


cudaMalloc() is similar to malloc() in standard C, but it tells the CUDA runtime to allocate the memory on the device.
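A short sketch of the allocate/launch/copy-back pattern (untested here; the add() kernel is just an illustrative example):

```cuda
#include <cstdio>

__global__ void add(int a, int b, int *c) {
    *c = a + b;   // runs on the device; c points to device memory
}

int main(void) {
    int c;
    int *dev_c;

    // cudaMalloc() allocates on the device, not the host; dev_c must
    // therefore never be dereferenced in host code.
    cudaMalloc((void **)&dev_c, sizeof(int));

    add<<<1, 1>>>(2, 7, dev_c);

    // Copy the result from device memory back into host memory.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_c);   // device allocations are released with cudaFree()
    return 0;
}
```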


blockIdx

This variable is of type uint3 (see Section B.3.1) and contains the block index within the grid. It contains the value of the block index for whichever block is currently running the device code.

 

Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional index, accessible within the kernel through the built-in blockIdx variable.
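A sketch showing each block reading its own index (untested here; a one-dimensional launch, so only blockIdx.x is meaningful):

```cuda
#include <cstdio>

#define N 8

// Each of the N single-thread blocks fills in one element, using its
// built-in blockIdx.x as the index into the output array.
__global__ void fill(int *out) {
    out[blockIdx.x] = blockIdx.x;
}

int main(void) {
    int host_out[N];
    int *dev_out;

    cudaMalloc((void **)&dev_out, N * sizeof(int));
    fill<<<N, 1>>>(dev_out);   // a grid of N one-thread blocks
    cudaMemcpy(host_out, dev_out, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("out[%d] = %d\n", i, host_out[i]);

    cudaFree(dev_out);
    return 0;
}
```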

 

For example, if we launch with kernel<<<2,1>>>(), you can think of the runtime creating two copies of the kernel and running them in parallel.

 

When we launched the kernel, we specified N as the number of parallel blocks. We call the collection of parallel blocks a grid.
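The block-parallel vector addition that this discussion refers to can be sketched as follows (untested here; one block per element):

```cuda
#include <cstdio>

#define N 10

// One block per element: the grid is a collection of N parallel blocks,
// and each block uses blockIdx.x to pick the element it works on.
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;          // this block's index within the grid
    if (tid < N)                   // guard against out-of-range indices
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * i; }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);   // launch a grid of N blocks

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```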

 

The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:

- Dg is of type dim3 (see Section B.3.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched; Dg.z must be equal to 1 for devices of compute capability 1.x;

- Db is of type dim3 (see Section B.3.2) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;

- Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in Section B.2.3; Ns is an optional argument which defaults to 0.

 

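A sketch of the Ns parameter in use (untested here; the in-place reverse is just an illustrative example of an extern __shared__ array sized at launch time):

```cuda
#include <cstdio>

// The dynamically allocated shared memory appears inside the kernel as
// an unsized extern array (Section B.2.3); its actual size in bytes is
// the Ns argument of the execution configuration.
__global__ void reverse(int *data, int n) {
    extern __shared__ int tmp[];   // sized by Ns at launch time
    int t = threadIdx.x;
    tmp[t] = data[t];
    __syncthreads();               // wait until tmp[] is fully written
    data[t] = tmp[n - 1 - t];
}

int main(void) {
    const int n = 8;
    int host[n] = {0, 1, 2, 3, 4, 5, 6, 7};
    int *dev;

    cudaMalloc((void **)&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // <<<Dg, Db, Ns>>>: 1 block, n threads, and n * sizeof(int) bytes
    // of dynamically allocated shared memory per block.
    reverse<<<1, n, n * sizeof(int)>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%d ", host[i]);    // the array comes back reversed
    printf("\n");

    cudaFree(dev);
    return 0;
}
```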
