HDFS Load Balancing


Data in HDFS is not always spread evenly across all datanodes. The most common cause is adding a new node to an existing cluster.

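When new blocks are placed (a file's data is stored as a series of blocks), the namenode weighs several factors before choosing which datanodes will store them. Some of the considerations are: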

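The policy keeps one replica of a block on the node that is writing the block.

Replicas of a block must be spread across different racks, so that the cluster survives the loss of an entire rack.

One of the replicas is usually placed on the same rack as the node writing the file, which reduces cross-rack network I/O.

HDFS data should be spread uniformly across all nodes in the cluster.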

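Because these considerations compete with one another, data may not end up evenly placed across the datanodes. HDFS gives administrators a tool that analyzes block placement across the datanodes and rebalances the data.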

How Hadoop HDFS data balancing works

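At the core of the balancing process is a data balancing algorithm that repeats its balancing logic until the data in the cluster is balanced. Each iteration goes through the following steps: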

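1. The data balancing service (rebalancing server) first asks the namenode for a report on how data is distributed across the datanodes, obtaining the disk usage of each datanode.

2. The rebalancing server works out which data needs to move and computes a block migration plan that uses the shortest paths within the network.

3. The block migration task starts: the proxy source data node copies a block that needs to be moved.

4. The copied block is written to the target datanode.

5. The original block is deleted.

6. The target datanode confirms to the proxy source data node that the block migration is complete.

7. The proxy source data node confirms to the rebalancing server that this block migration is complete. The process then repeats until the cluster meets the balance criterion.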
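The iteration described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the balancer's actual logic: it ignores racks, network paths, proxy sources, and concurrency, and it moves one fixed-size block per iteration from the fullest node to the emptiest one. The names `rebalance`, `block_size`, and `max_moves` are hypothetical.

```python
def rebalance(used, capacity, threshold, block_size, max_moves=10_000):
    """Toy model: move blocks until every node is within `threshold`
    percentage points of the cluster-wide usage, or max_moves is reached."""
    used = dict(used)  # work on a copy so the caller's data is untouched
    # Cluster usage never changes, because moves only shift data around.
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    moves = []
    for _ in range(max_moves):
        # Step 1: collect the usage of every node.
        pct = {name: 100.0 * used[name] / capacity[name] for name in used}
        # Step 2: pick the fullest node as source, the emptiest as target.
        source = max(pct, key=pct.get)
        target = min(pct, key=pct.get)
        balanced = (pct[source] - cluster_pct <= threshold
                    and cluster_pct - pct[target] <= threshold)
        if balanced:
            break
        # Steps 3-5: copy the block to the target, then drop the original.
        used[target] += block_size
        used[source] -= block_size
        moves.append((source, target))
    return used, moves


final_used, moves = rebalance(
    used={"dn1": 90, "dn2": 50, "dn3": 10},
    capacity={"dn1": 100, "dn2": 100, "dn3": 100},
    threshold=10,
    block_size=10,
)
print(final_used, len(moves))
```

The `max_moves` guard is there because a block size that is large relative to the threshold could otherwise make this naive loop oscillate.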

Datanode classification in detail

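In step 2, HDFS divides the current datanodes into four groups (over, above, below, and under) according to the configured threshold. When blocks are moved, blocks from the over and above groups go to the below and under groups. The four groups are defined as follows: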

over group: every datanode in this group satisfies

datanode_usedspace_percent > cluster_usedspace_percent + threshold

above group: every datanode in this group satisfies

cluster_usedspace_percent + threshold > datanode_usedspace_percent > cluster_usedspace_percent

below group: every datanode in this group satisfies

cluster_usedspace_percent > datanode_usedspace_percent > cluster_usedspace_percent - threshold

under group: every datanode in this group satisfies

cluster_usedspace_percent - threshold > datanode_usedspace_percent
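The four definitions can be expressed as a small function. This is a minimal sketch, not Hadoop's code. The strict inequalities above leave the exact boundary values unassigned, so the sketch makes an assumption: a node that sits exactly on a boundary goes to the inner group (above or below). The function name and sample values are hypothetical.

```python
def classify_datanode(node_used_pct, cluster_used_pct, threshold):
    """Return 'over', 'above', 'below', or 'under' for one datanode.

    All three arguments are percentages (for example 62.0 for 62%).
    Boundary values are assigned to the inner groups (an assumption).
    """
    diff = node_used_pct - cluster_used_pct
    if diff > threshold:
        return "over"
    if diff > 0:
        return "above"
    if diff >= -threshold:
        return "below"
    return "under"


# Sample cluster at 60% usage with a threshold of 10 percentage points.
nodes = {"dn1": 85.0, "dn2": 62.0, "dn3": 55.0, "dn4": 30.0}
groups = {name: classify_datanode(pct, 60.0, 10.0) for name, pct in nodes.items()}
print(groups)
```

With these sample values, dn1 lands in over, dn2 in above, dn3 in below, and dn4 in under, so blocks would flow from dn1 and dn2 toward dn3 and dn4.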

Using the Hadoop HDFS automatic balancing script

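Hadoop ships with a start-balancer.sh script; running it starts the HDFS data balancing service. The tool can be run on a live cluster, with no need to restart machines or Hadoop services. The script lives in the $HADOOP_HOME/sbin/ directory, and the start command is: $HADOOP_HOME/sbin/start-balancer.sh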

Looking at the shell script shows that it simply launches the following command:

$HADOOP_HOME/bin/hdfs balancer

more $HADOOP_HOME/sbin/start-balancer.sh
... (omitted) ...
"$HADOOP_PREFIX"/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script "$bin"/hdfs start balancer $@

The parameters are as follows:

Usage: hdfs balancer
    [-policy <policy>]    the balancing policy: datanode or blockpool
    [-threshold <threshold>]    Percentage of disk capacity
    [-exclude [-f <hosts-file> | <comma-separated list of hosts>]]    Excludes the specified datanodes.
    [-include [-f <hosts-file> | <comma-separated list of hosts>]]    Includes only the specified datanodes.
    [-idleiterations <idleiterations>]    Number of consecutive idle iterations (-1 for infinite) before exit.

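The most commonly used parameter is -threshold, which sets the threshold.

Note: in practice, the balancer copies blocks very slowly with the default settings. When the cluster is under light load, consider adjusting the following setting to speed it up: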

hdfs-site.xml

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>
  <description>
    Specifies the maximum bandwidth that each datanode can utilize for
    the balancing purpose in term of the number of bytes per second.
  </description>
</property>

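The default, 1048576, is 1 MB/s. In Hadoop 2.x and later this key is deprecated in favor of dfs.datanode.balance.bandwidthPerSec.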
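To get a feel for why the default is slow, the arithmetic can be sketched as below. This is a rough lower-bound estimate only: it assumes a single sending datanode running at exactly the configured limit, with no protocol overhead and no other bottleneck. The function name and the 100 GiB figure are hypothetical.

```python
def estimated_transfer_hours(bytes_to_move, bandwidth_bytes_per_sec):
    """Rough time for one datanode to send `bytes_to_move` at the given cap."""
    seconds = bytes_to_move / bandwidth_bytes_per_sec
    return seconds / 3600.0


GIB = 1024 ** 3
data = 100 * GIB  # hypothetical amount of data that one node must shed

# Default cap of 1048576 bytes/s versus a cap ten times higher.
print(round(estimated_transfer_hours(data, 1048576), 1))
print(round(estimated_transfer_hours(data, 10 * 1048576), 1))
```

At the default cap, shedding 100 GiB from a single node takes more than a day, which is why raising the limit on a lightly loaded cluster makes a noticeable difference.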
