CUDA Study Notes 1

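The purpose of CUDA (.cu files) is parallel computing. A function declared in C/C++ with the __global__ qualifier, for example __global__ void function(type *inputArrayA, type *inputArrayB, type *outputArrayA), is called a kernel function. When nvcc compiles the source and recognizes a kernel function, it compiles that function into GPU code. To call a kernel, put the launch configuration in triple angle brackets after the function name: function<<<blocksPerGrid, threadsPerBlock>>>(inputArrayA, inputArrayB, outputArrayA). The GPU can only operate on data that lives in GPU memory, so before calling a __global__ function, first allocate device memory with cudaMalloc for the kernel's arguments (the input and output arrays), then copy the input arrays to the device with cudaMemcpy (cudaMemcpyHostToDevice). Once the kernel has finished, the results are in the output array; copy them from GPU memory back to the CPU with cudaMemcpy (cudaMemcpyDeviceToHost). The parallel computation is then complete, and cudaFree releases the device memory allocated earlier. That is the full flow for calling the GPU from CPU code.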
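The flow described above can be sketched with the smallest possible case, a kernel that processes a single int. This is only an illustration under stated assumptions: the CHECK macro, the addOne kernel, and the addOneOnGpu helper are hypothetical names introduced here, not part of the original notes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original notes): abort with a message
// if a CUDA runtime call returns an error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s at %s:%d\n", cudaGetErrorString(err_),  \
                    __FILE__, __LINE__);                                \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Kernel: runs on the GPU and may only dereference device pointers.
__global__ void addOne(const int *in, int *out) {
    *out = *in + 1;
}

// Runs the whole host-side workflow for one int and returns the result.
int addOneOnGpu(int value) {
    int *dev_in = nullptr;
    int *dev_out = nullptr;
    int result = 0;

    // 1. Allocate device memory for the kernel's arguments.
    CHECK(cudaMalloc((void **)&dev_in, sizeof(int)));
    CHECK(cudaMalloc((void **)&dev_out, sizeof(int)));

    // 2. Copy the input from host memory to device memory.
    CHECK(cudaMemcpy(dev_in, &value, sizeof(int), cudaMemcpyHostToDevice));

    // 3. Launch the kernel (1 block, 1 thread) and check that the launch succeeded.
    addOne<<<1, 1>>>(dev_in, dev_out);
    CHECK(cudaGetLastError());

    // 4. Copy the result back; this blocking cudaMemcpy waits for the kernel to finish.
    CHECK(cudaMemcpy(&result, dev_out, sizeof(int), cudaMemcpyDeviceToHost));

    // 5. Release the device memory.
    CHECK(cudaFree(dev_in));
    CHECK(cudaFree(dev_out));
    return result;
}

int main() {
    printf("41 + 1 = %d\n", addOneOnGpu(41));
    return 0;
}
```

Assuming the file is saved as add_one.cu (a made-up name), it can be built with nvcc add_one.cu -o add_one. Checking cudaGetLastError() right after the launch is one common way to catch a bad launch configuration, since the launch itself returns no status.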

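Launching a CUDA kernel requires specifying how many blocks to launch and how many threads each block contains. The blocks together make up a grid. In total, blocksPerGrid * threadsPerBlock threads run in parallel, so the kernel must explicitly work out which array index each thread is responsible for. Note: the thread and block indices are multi-dimensional (.x, .y, and also .z), but often only the .x dimension is used. The simple demo below computes c = a + b, that is, c[i] = a[i] + b[i].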

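#include <stdio.h>
#include "../common/book.h"   // provides the HANDLE_ERROR macro used below

#define N   10

__global__ void add( int *a, int *b, int *c ) {
    // one thread per block, so the block index selects the element
    int tid = blockIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill 'a' and 'b' on the CPU (sample values; the original initialization was lost)
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );

    // N blocks, 1 thread per block: each block adds one element of the N-length arrays
    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

    // display the results (output format is an assumption)
    for (int i = 0; i < N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}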
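The demo uses one thread per block, so blockIdx.x alone identifies the element. When each block has several threads, a common way to give every one of the blocksPerGrid * threadsPerBlock threads its own array index is blockIdx.x * blockDim.x + threadIdx.x, with a bounds check for the threads past the end of the array. The sketch below illustrates that pattern; the names addIndexed and runAdd, the array length of 1000, and the block size of 256 are assumptions made for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1000   // assumed length, deliberately not a multiple of the block size

// Each thread computes its global index from its block and thread indices.
__global__ void addIndexed(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // the last block may contain threads beyond the array end
        c[i] = a[i] + b[i];
}

// Runs the addition on the GPU and returns the number of wrong elements.
int runAdd() {
    static int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Round up so that blocksPerGrid * threadsPerBlock >= N.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    addIndexed<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    // Compare against the same sum computed on the CPU.
    int mismatches = 0;
    for (int i = 0; i < N; i++) {
        if (c[i] != a[i] + b[i]) mismatches++;
    }
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", runAdd());
    return 0;
}
```

With these assumed sizes the launch uses 4 blocks of 256 threads, which is 1024 threads for 1000 elements; the bounds check keeps the 24 surplus threads from writing outside the arrays.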
