CUDA Beginner Study Notes (Part 1)

High-performance parallel computing with CUDA

Running your first CUDA program

A few concepts to learn first

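Kernel function: a kernel is a special kind of function. Launching a kernel looks much like an ordinary function call, with a launch configuration added between triple angle brackets. The format is (kernelName is a placeholder):

kernelName<<<Dg, Db>>>(args)

Dg: the number of thread blocks in the grid; Db: the number of threads in each thread block; args: the arguments passed to the kernel.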
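As a rough illustration of how Dg is usually chosen, the sketch below computes the number of blocks needed to cover n elements with a given block size, rounding up so that a partial last block is still launched. The helper name blocksFor is hypothetical and is not part of CUDA; it is plain host-side C++ arithmetic.

```cpp
#include <cstddef>

// Hypothetical helper (not a CUDA API): number of thread blocks (Dg)
// needed so that Dg * threadsPerBlock >= n.
// Assumes threadsPerBlock > 0.
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock)
{
    // Integer ceiling division: rounds up when n is not a multiple
    // of threadsPerBlock.
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, blocksFor(64, 32) gives 2 blocks, while blocksFor(65, 32) gives 3. In the second case the kernel needs a bounds check, because the last block contains threads that have no element to process.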

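Function qualifiers

__global__ is the qualifier that marks a kernel (launched from host code, executed on the device).
__host__ marks a function that is called from the host and runs on the host.
__device__ marks a function that is called from the device and runs on the device.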

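The CUDA runtime API transfers input data to the device and copies results back to the host.

cudaMalloc() allocates device memory.
cudaMemcpy() copies data to or from the device.
cudaFree() releases device memory that is no longer in use.
__syncthreads() synchronizes the threads within one thread block.
cudaDeviceSynchronize() blocks the host until all previously launched device work has finished, which effectively synchronizes all threads in a grid.
atomicAdd() prevents conflicts when multiple threads update the same variable concurrently.
size_t: the dedicated type for memory sizes.
cudaError_t: the dedicated type for error handling.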

Converting the .cpp source program into a .cu kernel file

main.cpp in dist_v1

#include <math.h> // Include standard math library containing sqrt.
#define N 64 // Specify a constant value for array length.

// A scaling function to convert integers 0,1,...,N-1
// to evenly spaced floats ranging from 0 to 1.
float scale(int i, int n)
{
  // Assumed body, following the comment above.
  return ((float)i) / (n - 1);
}

// Compute the distance between 2 points on a line.
float distance(float x1, float x2)
{
  // Assumed body, following the comment above.
  return sqrt((x2 - x1) * (x2 - x1));
}

int main()
{
  // Output array (assumed declaration; it is not shown in the source).
  float out[N] = {0.0f};

  // Choose a reference value from which distances are measured.
  const float ref = 0.5f;

  /* for loop to scale the index to obtain coordinate value,
   * compute the distance from the reference point,
   * and store the result in the corresponding entry in out. */
  for (int i = 0; i < N; ++i)
  {
    out[i] = distance(scale(i, N), ref);
  }

  return 0;
}

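Rewritten as kernel.cu

Replace the for loop in the .cpp file with a __global__ kernel that calls the __device__ functions. The loop over i becomes a kernel launch in which each thread handles one index:

distanceKernel<<<N/TPB, TPB>>>(d_out, ref, N)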

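#include <stdio.h> // Assumed header (for printf); the source omits the name.
#define N 64
#define TPB 32
// Function bodies below are assumed; the source shows only the signatures.

__device__ float scale(int i, int n)
{
  return ((float)i) / (n - 1);
}

__device__ float distance(float x1, float x2)
{
  return sqrtf((x2 - x1) * (x2 - x1));
}

__global__ void distanceKernel(float *d_out, float ref, int len)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per element
  if (i >= len) return; // guard against threads past the end of the array
  const float x = scale(i, len);
  d_out[i] = distance(x, ref);
  printf("i = %2d: dist from %f to %f is %f.\n", i, ref, x, d_out[i]);
}

int main()
{
  const float ref = 0.5f;
  float *d_out = 0;
  cudaMalloc(&d_out, N * sizeof(float));         // allocate the device output array
  distanceKernel<<<N/TPB, TPB>>>(d_out, ref, N); // N/TPB blocks of TPB threads
  cudaDeviceSynchronize();                       // wait for the kernel to finish
  cudaFree(d_out);                               // release device memory
  return 0;
}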
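To make the loop-to-kernel mapping concrete, the sketch below emulates a launch on the CPU in plain C++. It walks over every (block, thread) pair and computes the global index, block index times block size plus thread index, that a thread in a 1D grid would typically compute for itself. The names globalIndex and emulateLaunch are hypothetical; this is a serial illustration of the indexing, not CUDA code.

```cpp
#include <vector>

// Hypothetical helper: the global index a thread in a 1D grid would
// compute from its block index, block size, and thread index.
int globalIndex(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}

// Serial emulation of a 1D launch with numBlocks blocks of
// threadsPerBlock threads. Returns how many times each of the len
// output slots was visited. Assumes len >= 0.
std::vector<int> emulateLaunch(int numBlocks, int threadsPerBlock, int len)
{
    std::vector<int> visits(len, 0);
    for (int b = 0; b < numBlocks; ++b) {
        for (int t = 0; t < threadsPerBlock; ++t) {
            const int i = globalIndex(b, threadsPerBlock, t);
            if (i < len) {  // bounds guard, as a kernel would need
                ++visits[i];
            }
        }
    }
    return visits;
}
```

With 2 blocks of 32 threads and len = 64, every entry of the returned vector is 1: each index from 0 to 63 is handled by exactly one thread, which is what lets the launch stand in for the for loop.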

