CUDA Optimization Examples (4): Texture Memory

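This section may not quite live up to its title: it mainly covers how to use texture memory. How much speedup it actually brings, and where the optimization shows up, is left for the next section; this one walks through a single texture memory example.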

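Texture memory can be used in two ways: through a texture object or through a texture reference. The former can be created dynamically at run time, while parts of the latter are fixed statically at compile time, somewhat like the difference between dynamic and static arrays. When the type of the data to be processed is not known in advance, for example, a texture object is a good choice. This article uses a texture reference.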

Using texture memory takes several steps. In brief:

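1. Declare the texture reference:

// 8-bit unsigned texels, 2D, values read back unconverted
texture<unsigned char, 2, cudaReadModeElementType> texRef;

2. Allocate device memory, which can be either linear memory or a CUDA array; the differences are covered in [CUDA texture notes]. This example uses a 2D CUDA array: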

// Allocate a CUDA array and copy the data into it
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
cudaMemcpyToArray(cuArray, 0, 0, h_data, size, cudaMemcpyHostToDevice);

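cudaCreateChannelDesc returns a channel format descriptor (cudaChannelFormatDesc). Its first four arguments give the number of bits in each component of a texel: 32 for float or int, 8 for char or unsigned char. The last argument, cudaChannelFormatKindUnsigned, is the kind of data stored; it has to match the data type declared in the texture reference.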

cudaMemcpyToArray copies the host-side data into the CUDA array.

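3. Set the texture reference parameters:

// Configure the texture reference
// Reads beyond the edge in the first dimension return 0
texRef.addressMode[0] = cudaAddressModeBorder;
// Reads beyond the edge in the second dimension return 0
texRef.addressMode[1] = cudaAddressModeBorder;
// No interpolation; the coordinate is rounded down to a texel index
texRef.filterMode = cudaFilterModePoint;
// Do not use normalized texture coordinates
texRef.normalized = 0;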

4. Bind the texture memory:

// Bind the texture reference to the CUDA array
cudaBindTextureToArray(texRef, cuArray, channelDesc);

5. Run the kernel.
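The addressing and filtering settings from step 3 can be mimicked on the CPU to see what a fetch returns. The following is a minimal sketch, not part of the CUDA API: fetchBorderPoint is a hypothetical helper that assumes a row-major 8-bit image and reproduces border addressing plus point filtering with unnormalized coordinates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a 2D texture fetch; not a CUDA function.
// Emulates cudaAddressModeBorder + cudaFilterModePoint with unnormalized
// coordinates on a row-major 8-bit image of size width x height.
unsigned char fetchBorderPoint(const std::vector<unsigned char>& image,
                               int width, int height, float x, float y)
{
    assert(width > 0 && height > 0);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    // Point filtering: the texel index is the floor of the coordinate.
    const int ix = static_cast<int>(std::floor(x));
    const int iy = static_cast<int>(std::floor(y));
    // Border addressing: anything outside the array reads as 0.
    if (ix < 0 || ix >= width || iy < 0 || iy >= height) {
        return 0;
    }
    return image[static_cast<std::size_t>(iy) * width + ix];
}
```

For a 3 x 2 image holding 10, 20, 30, 40, 50, 60, a fetch at (2.9, 1.2) returns 60, while a fetch at (-0.001, 0.0) returns 0 because the floor of the x coordinate is -1.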

texture memory.cu:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#define DEGRE_TO_RADIAN(x) ((x) * 3.1416f / 180)
#define CEIL(x,y) ((((x) + (y) - 1)) / (y))
typedef unsigned char uchar;
// Declare the texture reference
texture<uchar, 2, cudaReadModeElementType> texRef;
// A simple linear transformation
__global__ void transformKernel(uchar* output, int width, int height, float theta)
{
    // [kernel body missing from the source]
}
int main()
{
    // [declarations and initialization of width, height, size, angle, h_data, d_data and i missing from the source]
    // Allocate a CUDA array and copy the data into it
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);
    cudaMemcpyToArray(cuArray, 0, 0, h_data, size, cudaMemcpyHostToDevice);
    // Configure the texture reference; note that this differs from how a texture object is set up
    texRef.addressMode[0] = cudaAddressModeBorder;
    texRef.addressMode[1] = cudaAddressModeBorder;
    texRef.filterMode = cudaFilterModePoint;
    texRef.normalized = 0;
    // Bind the texture reference to the CUDA array
    cudaBindTextureToArray(texRef, cuArray, channelDesc);
    // Run the kernel
    dim3 dimBlock(16, 16);
    dim3 dimGrid(CEIL(width, dimBlock.x), CEIL(height, dimBlock.y));
    printf("x:  %d , y:  %d  \n", dimGrid.x, dimGrid.y);
    transformKernel<<<dimGrid, dimBlock>>>(d_data, width, height, angle);
    cudaDeviceSynchronize();
    // Copy the result back and check it
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    printf("\n\n");
    for (i = 0; i < width * height; i++)
    {
        // [loop body missing from the source]
    }
    // Clean up
    cudaFreeArray(cuArray);
    cudaFree(d_data);
    return 0;
}
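The grid in the listing is sized with a rounding-up division, so it still covers the whole image when width or height is not a multiple of the 16 x 16 block. A small host-side sketch of that arithmetic follows. divUp and countActiveThreads are hypothetical names, and the bounds check is the usual pattern for such launches rather than something the source confirms.

```cpp
#include <cassert>

// Hypothetical helper mirroring the rounding-up division macro: the smallest
// number of blocks of blockSize threads that covers n elements.
constexpr int divUp(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Counts how many threads of a 2D launch fall inside a width x height image
// when every thread applies the bounds check (x < width && y < height).
int countActiveThreads(int width, int height, int blockX, int blockY)
{
    assert(width > 0 && height > 0 && blockX > 0 && blockY > 0);
    const int gridX = divUp(width, blockX);
    const int gridY = divUp(height, blockY);
    int active = 0;
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            for (int ty = 0; ty < blockY; ++ty) {
                for (int tx = 0; tx < blockX; ++tx) {
                    const int x = bx * blockX + tx;  // global x index
                    const int y = by * blockY + ty;  // global y index
                    if (x < width && y < height) {
                        ++active;
                    }
                }
            }
        }
    }
    return active;
}
```

For a 33 x 17 image with 16 x 16 blocks, divUp gives a 3 x 2 grid, and exactly 33 * 17 of the launched threads pass the bounds check; the remaining threads should do no work.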

When angle = 180:

When angle = 30:

When angle = 0:
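Since the kernel body is not shown, a CPU reference can help in checking results for angles like these. The sketch below is an assumed form of the transformation, not the author's kernel: rotateReference is a hypothetical function that rotates about the image center, reads the nearest source pixel, and leaves 0 wherever the source position falls outside the image, as border addressing would.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for a rotation about the image center. This is an assumed
// form of the transformation, not the kernel from the listing.
std::vector<unsigned char> rotateReference(const std::vector<unsigned char>& input,
                                           int width, int height, double theta)
{
    assert(width > 0 && height > 0);
    assert(input.size() == static_cast<std::size_t>(width) * height);
    std::vector<unsigned char> output(input.size(), 0);
    const double cx = (width - 1) / 2.0;   // center of the image in x
    const double cy = (height - 1) / 2.0;  // center of the image in y
    const double c = std::cos(theta);
    const double s = std::sin(theta);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Rotate the output position by theta to find where to read.
            const double sx = (x - cx) * c - (y - cy) * s + cx;
            const double sy = (x - cx) * s + (y - cy) * c + cy;
            // Adding 0.5 before the floor picks the nearest pixel, which
            // tolerates small rounding errors in sx and sy.
            const int ix = static_cast<int>(std::floor(sx + 0.5));
            const int iy = static_cast<int>(std::floor(sy + 0.5));
            // Border behaviour: positions outside the image stay 0.
            if (ix >= 0 && ix < width && iy >= 0 && iy < height) {
                output[static_cast<std::size_t>(y) * width + x] =
                    input[static_cast<std::size_t>(iy) * width + ix];
            }
        }
    }
    return output;
}
```

With theta = 0 the function returns the input unchanged, and with theta = acos(-1.0) it returns the input in reverse order, which is the 180-degree rotation. Rounding to the nearest pixel is a design choice made here so that tiny errors in the angle do not push a lookup across a pixel boundary.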

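The zeros in the 180-degree rotation come from the combination of texRef.addressMode[1] = cudaAddressModeBorder; and the limited precision of pi in #define DEGRE_TO_RADIAN(x) ((x) * 3.1415926 / 180): a coordinate that should land exactly on a texel can end up slightly outside the array, and border addressing then returns 0.

#define DEGRE_TO_RADIAN(x) ((x) * acos(-1) / 180) gives somewhat better results.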
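The effect of the pi constant can be checked on the host. This is a rough sketch under the assumption that the rotation is taken about a center point (cx, cy). rotatedX and texelIndex are hypothetical helpers, and the arithmetic is done in double rather than the float used on the device.

```cpp
#include <cassert>
#include <cmath>

// Degree-to-radian conversion with the two constants quoted in the text.
inline double toRadiansCoarse(double degrees) { return degrees * 3.1415926 / 180.0; }
inline double toRadiansFine(double degrees)   { return degrees * std::acos(-1.0) / 180.0; }

// Hypothetical helper: x coordinate of (x, y) after rotating by theta
// about the point (cx, cy).
inline double rotatedX(double x, double y, double cx, double cy, double theta)
{
    return (x - cx) * std::cos(theta) - (y - cy) * std::sin(theta) + cx;
}

// Texel index chosen by point filtering: the floor of the coordinate.
inline int texelIndex(double coordinate)
{
    assert(std::isfinite(coordinate));
    return static_cast<int>(std::floor(coordinate));
}
```

Rotating the point (4, 3) by 180 degrees about (2, 2) should give x = 0. With the coarse constant the result is a small negative number, so texelIndex returns -1 and a border-mode fetch would read 0. With acos(-1) the result stays within rounding error of 0, which is consistent with the text calling it only somewhat better.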

This section walked through a texture memory example; the next one will cover the advantages of texture memory over global memory.
