Notes on the Horovod distributed deep learning framework

2022-05-22 11:48:11

I recently needed to get up to speed on Horovod, so I am recording what I learned here for future reference:

Installing Horovod:

Install CUDA 9.0;

Build and install NCCL from source against CUDA 9.0;

Install gcc 4.9;

Python version: Python 3.6.9 (adapt the details to your own environment);

Install Open MPI 4.0;
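
Before moving on to the pip step, it can help to confirm that the prerequisites above are actually visible from the shell. The following is a minimal sketch, not part of the original procedure; the executable names (nvcc, gcc, mpirun) are assumptions about how CUDA, gcc and Open MPI are exposed on PATH in your environment.

```python
import shutil
import subprocess
import sys

# Executable names are assumptions: nvcc for CUDA, gcc for the compiler,
# mpirun for Open MPI. Adjust them if your installation differs.
REQUIRED_TOOLS = ["nvcc", "gcc", "mpirun"]


def find_tools(names):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


def version_output(path):
    """Return whatever `<path> --version` prints, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # Python 3.6 spelling of text=True
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print("python:", sys.version.split()[0])
    for name, path in find_tools(REQUIRED_TOOLS).items():
        if path is None:
            print(name + ": not found on PATH")
        else:
            print(name + " (" + path + "):")
            print(version_output(path))
```

The script only reports what it finds; comparing the printed versions against CUDA 9.0, gcc 4.9 and Open MPI 4.0 is left to the reader.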

Install the Horovod framework with pip:

HOROVOD_NCCL_HOME=<NCCL home directory> HOROVOD_NCCL_LIB=<NCCL lib directory> HOROVOD_NCCL_INCLUDE=<NCCL include directory> HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

HOROVOD_NCCL_HOME=/home/name/nccl/build/ HOROVOD_NCCL_LIB=/home/name/nccl/build/lib/ HOROVOD_NCCL_INCLUDE=/home/name/nccl/build/include/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
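
The same install can be driven from Python, which makes the variable names harder to mistype. This is only a sketch: the helper names horovod_build_env and install_horovod are hypothetical, and the code assumes the lib and include directories sit directly under the NCCL home directory, as they do in the command above. The variable names are written in upper case, which is how Horovod's build reads them.

```python
import os
import subprocess
import sys


def horovod_build_env(nccl_home, base_env=None):
    """Return a copy of the environment with the Horovod NCCL variables set.

    Assumes <nccl_home>/lib and <nccl_home>/include exist, matching the
    layout used in this post.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HOROVOD_NCCL_HOME"] = nccl_home
    env["HOROVOD_NCCL_LIB"] = os.path.join(nccl_home, "lib")
    env["HOROVOD_NCCL_INCLUDE"] = os.path.join(nccl_home, "include")
    env["HOROVOD_GPU_ALLREDUCE"] = "NCCL"
    return env


def install_horovod(nccl_home):
    """Run `pip install --no-cache-dir horovod` with the variables above.

    Returns pip's exit code (0 on success).
    """
    cmd = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "horovod"]
    return subprocess.call(cmd, env=horovod_build_env(nccl_home))


# Example call (not executed here), using the path from this post:
#     install_horovod("/home/name/nccl/build")
```

Using sys.executable with -m pip keeps the install inside the currently active Python environment, which matters when several conda environments are present.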
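After installation, test it with the command python -c "import horovod.tensorflow as hvd;". If no error is printed, the installation succeeded, and you can then follow the official manual to use Horovod.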

➜  openmpi python -c "import horovod.tensorflow as hvd;"
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

Installation test result
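
The FutureWarning lines in that output are NumPy deprecation notices raised from TensorFlow's dtypes.py, not Horovod errors, but they make it harder to see whether the import itself worked. A small sketch of the same check with those warnings filtered out follows; the helper name can_import is hypothetical, and the filter should hide the warnings only for the duration of the import.

```python
import importlib
import warnings


def can_import(module_name):
    """Return True if module_name imports, ignoring FutureWarning noise."""
    with warnings.catch_warnings():
        # Silence FutureWarning only inside this block; other warnings
        # and all exceptions are still reported.
        warnings.simplefilter("ignore", FutureWarning)
        try:
            importlib.import_module(module_name)
        except Exception as exc:
            print("import of " + module_name + " failed: " + repr(exc))
            return False
    return True


if __name__ == "__main__":
    ok = can_import("horovod.tensorflow")
    print("horovod.tensorflow import:", "ok" if ok else "failed")
```

Catching Exception rather than only ImportError is deliberate: a broken native build can surface as other error types when the compiled extension is loaded.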

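(Official documentation; a reference for installation and usage)

(Explains the basics of distributed multi-GPU training)

Distributed multi-GPU tutorial series for PyTorch and TensorFlow (a fairly detailed tutorial covering the characteristics and usage of the better existing frameworks)

(Installation and usage reference; the installation steps in this post follow this tutorial)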

