Matrix multiplication speed test across different tools

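Today I spent some time benchmarking matrix multiplication with the tools above (CUDA, ArrayFire and MATLAB) and comparing the timings, mainly to check a few ideas I had.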

First, the C (CPU) multiplication test. This version is a bit long-winded; it is consolidated into the CUDA program later on.

c.matrix.cpp

void matrix_print(double **a, long nl, long nh)
{
    /* print the matrix */
    /* ... */
}

#include "stdio.h"
#include "stdlib.h"
#include
#include
void main()
// each thread computes one element of p
pd[row * widthn + col] = pvalue;
}
// matrix multiplication
void matrixmul(double* m, double* n, double* p, int widthm, int widthn, int heightm, int heightn)
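
The body of `matrixmul` is not shown above. As a rough CPU reference for what each output element should contain, here is a minimal sketch that reuses the same parameter order. The name `matrixmul_cpu`, the row-major flat layout and the meaning assigned to each width/height argument are assumptions made for illustration, not the author's code.

```cpp
#include <cassert>

// Hypothetical CPU reference (assumed layout, not the original implementation).
// m is heightm x widthm, n is heightn x widthn, p is heightm x widthn,
// all stored row-major in flat arrays.
void matrixmul_cpu(const double* m, const double* n, double* p,
                   int widthm, int widthn, int heightm, int heightn)
{
    assert(widthm == heightn);  // inner dimensions must agree
    for (int row = 0; row < heightm; ++row) {
        for (int col = 0; col < widthn; ++col) {
            // Same per-element work that one GPU thread would do.
            double pvalue = 0.0;
            for (int k = 0; k < widthm; ++k) {
                pvalue += m[row * widthm + k] * n[k * widthn + col];
            }
            p[row * widthn + col] = pvalue;
        }
    }
}
```

A result computed this way can serve as the baseline when checking the output of the GPU versions.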

matrixmul.h

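#ifndef MATRIXMUL_H
#define MATRIXMUL_H
// earlier square-matrix version, kept for reference:
//extern void matrixmul(double* m, double* n, double* p, int width);
// general version: widths and heights of m and n are passed separately
extern void matrixmul(double* m, double* n, double* p, int widthm, int widthn, int heightm, int heightn);
#endif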

cuda_array_ctest.cpp

#include "stdio.h"
#include "stdlib.h"
#include
#include "time.h"
#include "matrixmul.h"
#include "arrayfire.h"
#include
#include
using namespace std;
using namespace af;
/* generate a random matrix with m rows and n columns, elements in 0~1 */
void matrixran(double *data, int m, int n)
catch (af::exception& e)
system("pause");
return 0;
}
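
Only fragments of the test driver are shown above, so here is a minimal sketch of the two pieces it relies on: filling a matrix with random values in 0~1 and timing a piece of work. The names `matrixran_sketch` and `time_ms`, the fixed seed and the use of `<random>` and `<chrono>` are assumptions for illustration; the original most likely used `rand()` and `time.h`.

```cpp
#include <cassert>
#include <chrono>
#include <random>
#include <vector>

// Hypothetical helper: fill an m-by-n row-major matrix with uniform
// random values in the range 0~1. The seed is fixed so runs are repeatable.
void matrixran_sketch(double* data, int m, int n, unsigned seed = 42)
{
    assert(m >= 0 && n >= 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (int i = 0; i < m * n; ++i) {
        data[i] = dist(gen);
    }
}

// Hypothetical helper: run a callable once and return the elapsed
// wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

When timing GPU work this way, the callable should wait for the device to finish before returning (for example with `cudaDeviceSynchronize()` or `af::sync()`), since kernel launches return before the computation completes.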

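C-MEX mode (MATLAB & GPU)

This mode mainly involves matrixcuda.cu, matrixmul.h, matrixmulcuda.cpp and matrixmul.m. Of these, matrixcuda.cu and matrixmul.h have already been listed above and can be copied directly; the other two files are shown below: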

matrixmulcuda.cpp

#include "mex.h"
#include "matrixmul.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

matrixmul.m

% runmatrixmul
clear
clc
disp('1. nvcc mulmat.cu compiling ...');
%system('nvcc -c matrixmul.cu -ccbin "c:\program files\microsoft visual studio 10.0\vc\bin"');
system('nvcc -c matrixmul.cu  -gencode arch=compute_50,code=sm_50 -ccbin "d:\program files (x86)\microsoft visual studio 10.0\vc\bin"')
mex matrixmulcuda.cpp matrixmul.obj -lcudart -L"c:\program files\nvidia gpu computing toolkit\cuda\v7.5\lib\x64";
disp('2.input two matrix:')
a=rand(2048);
b=rand(2048);
%a=ones(16,16);
%b=ones(16,16);
disp('3.compare with the cpu and gpu result:')
disp('gpu result:')
tic
c_gpu = matrixmulcuda(b,a);
tg=toc;
disp('cpu result:')
tic
c_cpu = a*b;
tc=toc;

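That covers the code for all four approaches, which should help beginners get familiar with the different styles of GPU programming. Next come the test results.

The timings are shown in the chart and table below: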

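From the chart and table above, the following points stand out:

1. As the data size grows, the GPU speedup becomes more and more pronounced, while ArrayFire's run time levels off;

2. As the data size grows, the processing time of every tool except ArrayFire increases to some degree, most visibly on the CPU, so the GPU gives some speedup under every approach;

3. The CUDA program here has no memory optimization. Using shared memory might well make it faster, and changing the block size also has an effect; I was lazy and fixed the block size at 8 without testing other values;

4. Judging from the tests in the blog post "The idea of turning for loops into matrix multiplication in CBF (ArrayFire)", if the for loops in a program are not dealt with, the run time stays relatively high, so the key point is still to write programs with a parallel mindset whichever tool is used.

5. A quick plug for ArrayFire: its run time is stable and it ships a large set of matrix operations; see the blog post "ArrayFire's most-used moves (quoted from the Sunflower Manual)".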
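
Point 3 suggests that shared memory might speed up the CUDA version. The usual idea behind that optimization is tiling: the product is computed block by block so that each small block of the inputs is reused many times while it sits in fast memory. The sketch below shows only that loop structure on the CPU; the name `matrixmul_tiled` and the `tile` parameter (playing the role of the block size, defaulting to the value 8 used in the post) are assumptions for illustration. On the GPU each tile would be staged in `__shared__` memory instead.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical blocked (tiled) multiply on the CPU, row-major flat arrays.
// m is heightm x widthm, n is widthm x widthn, p is heightm x widthn.
void matrixmul_tiled(const double* m, const double* n, double* p,
                     int widthm, int widthn, int heightm, int tile = 8)
{
    assert(tile > 0);
    std::fill(p, p + heightm * widthn, 0.0);  // accumulate into zeros
    for (int r0 = 0; r0 < heightm; r0 += tile) {
        for (int k0 = 0; k0 < widthm; k0 += tile) {
            for (int c0 = 0; c0 < widthn; c0 += tile) {
                // Clamp the tile edges for sizes not divisible by tile.
                int r1 = std::min(r0 + tile, heightm);
                int k1 = std::min(k0 + tile, widthm);
                int c1 = std::min(c0 + tile, widthn);
                for (int r = r0; r < r1; ++r) {
                    for (int k = k0; k < k1; ++k) {
                        double mv = m[r * widthm + k];
                        for (int c = c0; c < c1; ++c) {
                            p[r * widthn + c] += mv * n[k * widthn + c];
                        }
                    }
                }
            }
        }
    }
}
```

Timing this with a few different `tile` values is a cheap way to see how sensitive the result is to the block size before repeating the experiment on the GPU.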
