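PyTorch multi-GPU training: problems encountered and their solutions

1. A model wrapped for multi-GPU training has 'module.' prepended to every parameter name, so the checkpoint keys must be given the same prefix when loading

The checkpoint-loading function below serves as an example; the essential part is the block marked 'changed'.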

import os

import torch


def load_params_from_file_v2(model, filename, opt=None, to_cpu=False):
    # Load a checkpoint whose keys lack 'module.' into a wrapped (multi-GPU) model.
    if not os.path.isfile(filename):
        raise FileNotFoundError(filename)
    print('==> loading parameters from checkpoint %s to %s' % (filename, 'cpu' if to_cpu else 'gpu'))
    loc_type = torch.device('cpu') if to_cpu else None
    checkpoint = torch.load(filename, map_location=loc_type)
    model_state_disk = checkpoint['model_state']
    if 'version' in checkpoint:
        print('==> checkpoint trained from version: %s' % checkpoint['version'])
    update_model_state = {}
    for key, val in model_state_disk.items():
        ########### changed ###########
        # The wrapped model's keys start with 'module.', so add the prefix.
        update_model_state['module.' + key] = val
        ########### changed ###########
    state_dict = model.state_dict()
    state_dict.update(update_model_state)
    model.load_state_dict(state_dict)
    # Report model weights that the checkpoint did not provide.
    for key in state_dict:
        if key not in update_model_state:
            print('not updated weight %s: %s' % (key, str(state_dict[key].shape)))
    if opt is not None:
        opt.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch']
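
The key renaming can also be isolated into small helpers that work on any dictionary. The following is only a minimal sketch: the names add_module_prefix and strip_module_prefix are hypothetical and not part of PyTorch, and plain integers stand in for tensors so that the example runs without a GPU.

```python
def add_module_prefix(state, prefix='module.'):
    """Return a copy of `state` with `prefix` added to keys that lack it."""
    return {k if k.startswith(prefix) else prefix + k: v for k, v in state.items()}


def strip_module_prefix(state, prefix='module.'):
    """Return a copy of `state` with a leading `prefix` removed from each key."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v for k, v in state.items()}


# Integers stand in for tensors; only the key names matter here.
plain = {'conv1.weight': 1, 'conv1.bias': 2}
wrapped = add_module_prefix(plain)
print(sorted(wrapped))
```

Because both helpers check for the prefix first, applying either of them twice gives the same result as applying it once, so they are safe to call when it is unclear how the checkpoint was saved.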

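2. Multi-GPU training: GPU memory usage

Problem: when a 4-GPU job is trained from scratch (GTX 1080 cards in this example), the main GPU uses 10 GB of memory and each of the other GPUs uses 8 GB. After an interruption, however, loading the model can exhaust GPU memory, so training cannot resume.

The cause is the way the checkpoint is loaded. Without an explicit map_location, torch.load restores each tensor to the device it was saved from, so every process can end up allocating memory on the same GPU.

After a multi-GPU run is interrupted, first load the checkpoint into the plain, unwrapped model (the same way as for single-GPU training), and only then wrap the model for distributed training. GPU memory usage is then the same as when training from scratch.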

1) Load the model with load_params_from_file (see the code below)

Call: load_params_from_file(model, cfg.resume_from, opt=optimizer, to_cpu=False, rank=torch.cuda.current_device())

2) Then wrap the model: model = MMDistributedDataParallel(model, device_ids=[torch.cuda.current_device()], broadcast_buffers=False)

Code for loading the model:

import os

import torch


def load_params_from_file(model, filename, opt=None, to_cpu=False, rank='0'):
    # Load a checkpoint into the plain (unwrapped) model on this process's GPU.
    if not os.path.isfile(filename):
        raise FileNotFoundError(filename)
    print('==> loading parameters from checkpoint %s to %s' % (filename, 'cpu' if to_cpu else 'gpu'))
    ############## changed ##############
    # Map tensors to this process's GPU instead of the device they were saved from.
    loc_type = torch.device('cpu') if to_cpu else 'cuda:{}'.format(rank)
    checkpoint = torch.load(filename, map_location=loc_type)
    ############## changed ##############
    model_state_disk = checkpoint['model_state']
    if 'version' in checkpoint:
        print('==> checkpoint trained from version: %s' % checkpoint['version'])
    update_model_state = {}
    for key, val in model_state_disk.items():
        ########### changed ###########
        # The model is not wrapped yet, so the keys are used without 'module.'.
        update_model_state[key] = val
    state_dict = model.state_dict()
    state_dict.update(update_model_state)
    model.load_state_dict(state_dict)
    # Report model weights that the checkpoint did not provide.
    for key in state_dict:
        if key not in update_model_state:
            print('not updated weight %s: %s' % (key, str(state_dict[key].shape)))
    print('==> done (loaded %d/%d)' % (len(update_model_state), len(model.state_dict())))
    if opt is not None:
        opt.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch']
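
The function above only reports model weights that the checkpoint did not supply. Checkpoint keys that the model does not have, for example keys that still carry a 'module.' prefix, would instead make load_state_dict fail. A small check before loading can show both cases. This is a minimal sketch: diff_state_keys is a hypothetical helper, and the key names are made up for illustration.

```python
def diff_state_keys(model_keys, checkpoint_keys):
    """Compare two collections of state-dict key names.

    Returns (missing, unexpected): keys the model has but the checkpoint
    lacks, and keys the checkpoint has but the model lacks.
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Illustrative names: an unwrapped model and a checkpoint saved from a wrapped one.
model_keys = ['backbone.weight', 'backbone.bias', 'head.weight']
checkpoint_keys = ['module.backbone.weight', 'module.backbone.bias']
missing, unexpected = diff_state_keys(model_keys, checkpoint_keys)
print('missing:', missing)
print('unexpected:', unexpected)
```

In practice the two arguments would be model.state_dict().keys() and checkpoint['model_state'].keys(). If every unexpected key starts with 'module.', the checkpoint was most likely saved from a wrapped model.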

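3. Multi-GPU training aborts when a batch contains no samples for the classification task

model = MMDistributedDataParallel(model, device_ids=[torch.cuda.current_device()], broadcast_buffers=False, find_unused_parameters=True)

Compared with the call in problem 2, the only difference is the added find_unused_parameters=True. When a batch has no samples for some branch, that branch's parameters receive no gradient, while the distributed wrapper by default expects a gradient for every parameter in every iteration.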
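
To see which parameters end up without a gradient in such a batch, the effect can be reproduced on a single CPU process without any distributed wrapper. This is only an illustrative sketch: the TwoHead model and the has_cls_samples flag are hypothetical and are not taken from the original training code.

```python
import torch
import torch.nn as nn


class TwoHead(nn.Module):
    """Toy model with a shared backbone, a regression head and a classification head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.cls_head = nn.Linear(4, 2)
        self.reg_head = nn.Linear(4, 1)

    def forward(self, x, has_cls_samples):
        feat = self.backbone(x)
        loss = self.reg_head(feat).sum()
        # The classification head only contributes when the batch has samples for it.
        if has_cls_samples:
            loss = loss + self.cls_head(feat).sum()
        return loss


model = TwoHead()
model(torch.randn(3, 4), has_cls_samples=False).backward()

# Parameters that took no part in the forward pass have no gradient.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)
```

The names printed here are the parameters that a distributed wrapper would treat as unused in that iteration, which is the situation find_unused_parameters=True is meant to handle.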
