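I recently needed to learn about Horovod, so I am writing my notes down here for future reference:
Horovod installation:
Install CUDA 9.0;
Build and install NCCL against CUDA 9.0;
Install GCC 4.9;
Python version: Python 3.6.9 (adapt this to your own environment);
Install Open MPI 4.0;
Install the Horovod framework with pip: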
HOROVOD_NCCL_HOME=<NCCL home directory> HOROVOD_NCCL_LIB=<NCCL lib directory> HOROVOD_NCCL_INCLUDE=<NCCL include directory> HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
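As a rough sketch, the same variables can also be set from Python before calling pip, which avoids typing one very long shell line. The helper names `build_horovod_env`, `pip_install_command`, and `install_horovod` are hypothetical and exist only for this illustration. The sketch assumes NCCL was built with `lib` and `include` directly under a single build directory, as in the example path used in this post.

```python
import os
import subprocess
import sys


def build_horovod_env(nccl_home, base_env=None):
    """Return a copy of the environment with the Horovod NCCL variables set.

    Assumes <nccl_home>/lib and <nccl_home>/include exist (hypothetical layout).
    """
    env = dict(os.environ if base_env is None else base_env)
    nccl_home = nccl_home.rstrip("/")
    env["HOROVOD_NCCL_HOME"] = nccl_home
    env["HOROVOD_NCCL_LIB"] = os.path.join(nccl_home, "lib")
    env["HOROVOD_NCCL_INCLUDE"] = os.path.join(nccl_home, "include")
    env["HOROVOD_GPU_ALLREDUCE"] = "NCCL"
    return env


def pip_install_command():
    """Build the pip command for the interpreter that is currently running."""
    return [sys.executable, "-m", "pip", "install", "--no-cache-dir", "horovod"]


def install_horovod(nccl_home):
    """Run pip with the NCCL variables set; raises CalledProcessError on failure."""
    subprocess.run(pip_install_command(), env=build_horovod_env(nccl_home), check=True)
```

Calling `install_horovod("/home/name/nccl/build")` would then start the build. Using `sys.executable -m pip` keeps the install inside the active conda or virtualenv interpreter instead of whichever `pip` happens to be first on `PATH`.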
For example:
HOROVOD_NCCL_HOME=/home/name/nccl/build/ HOROVOD_NCCL_LIB=/home/name/nccl/build/lib/ HOROVOD_NCCL_INCLUDE=/home/name/nccl/build/include/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
After installation, run python -c "import horovod.tensorflow as hvd;" to test it. If the command prints no errors, the installation succeeded, and you can then follow the official manual to use Horovod.
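"No error output" can be hard to judge by eye, because TensorFlow may print NumPy FutureWarning lines on import even when everything works. A minimal sketch of a stricter check is to run the import in a fresh interpreter and look at the exit code instead of the text. The function name `import_succeeds` is hypothetical. `stdout=PIPE` and `universal_newlines=True` are used rather than `capture_output` and `text` so that the sketch also runs on Python 3.6.

```python
import subprocess
import sys


def import_succeeds(statement, python=sys.executable):
    """Run `statement` in a fresh interpreter.

    Returns (ok, stderr_text). `ok` is True when the interpreter exits with
    status 0, so warnings written to stderr do not count as a failure.
    """
    result = subprocess.run(
        [python, "-c", statement],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (Python 3.6 compatible)
    )
    return result.returncode == 0, result.stderr
```

For the check in this post, the call would be `import_succeeds("import horovod.tensorflow as hvd")`. If `ok` is False, the returned stderr text contains the traceback, typically an ImportError or a missing shared library.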
Installation test result:
➜  openmpi python -c "import horovod.tensorflow as hvd;"
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
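References:
(Official documentation; refer to it for installation and usage)
(Explains the basics of distributed multi-GPU training)
Distributed multi-GPU: PyTorch, TensorFlow tutorial series (a fairly detailed tutorial that covers the characteristics and usage of the better existing frameworks)
(Installation and usage reference; the installation steps in this post follow this tutorial)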