The Simplest K-means Algorithm: Principles and Practice Tutorial

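At the end of the previous tutorial, "The Simplest K-means Algorithm: Principles and Practice", I raised the following issue: a few experiments are enough to show that the final clustering produced by k-means depends heavily on where the k initial centers are placed.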
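The dependence on the starting centers can be reproduced on a tiny example. The sketch below assumes scikit-learn's `KMeans` with explicitly supplied initial centers; the four-point dataset and both initializations are made up for illustration and are not from the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points on the corners of a 4 x 1 rectangle (toy data, assumed for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Two hand-picked initializations for k = 2.
init_good = np.array([[0.0, 0.5], [4.0, 0.5]])  # one center per short side
init_bad = np.array([[2.0, 0.0], [2.0, 1.0]])   # one center per long side

# n_init=1 so that each run uses exactly the initial centers given above.
inertia_good = KMeans(n_clusters=2, init=init_good, n_init=1).fit(X).inertia_
inertia_bad = KMeans(n_clusters=2, init=init_bad, n_init=1).fit(X).inertia_

# Sum of squared distances to the nearest center for each run.
print(inertia_good, inertia_bad)
```

Both runs converge, but the second one gets stuck grouping the points by row instead of by column, so its sum of squared distances is far larger (about 16 versus about 1).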

The previous post also listed several initialization methods (and used the first of them). The k-means++ algorithm covered here actually uses the third.

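Recall that the standard k-means algorithm works as follows:

(1) Choose k initial centers $C = \{c_1, \dots, c_k\}$.

(2) For each $i \in \{1, \dots, k\}$, set the cluster $C_i$ to be the set of points in $X$ that are closer to $c_i$ than to $c_j$ for all $j \neq i$.

(3) For each $i \in \{1, \dots, k\}$, set $c_i$ to be the centroid of all points in $C_i$: $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.

(4) Repeat (2) and (3) until the change in $C$ falls below a given threshold or the maximum number of iterations is reached.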
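The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code from the previous post: the function name `lloyd_kmeans`, its parameters, and the toy data are assumptions, and an empty cluster simply keeps its old center.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, max_iter=100, tol=1e-6):
    """Run plain k-means iterations starting from the given centers."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Step (2): assign each point to its nearest center.
        dist_sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist_sq.argmin(axis=1)
        # Step (3): move each center to the centroid of its cluster.
        new_centers = centers.copy()
        for i in range(len(centers)):
            members = X[labels == i]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[i] = members.mean(axis=0)
        # Step (4): stop once no center coordinate moves by more than tol.
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

# Toy data: two pairs of points; step (1) is done by hand here.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
centers, labels = lloyd_kmeans(X, init_centers=X[[0, 2]])
print(centers)
print(labels)
```

Step (1) is deliberately left to the caller, since the choice of initial centers is exactly what k-means++ changes.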

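The k-means++ algorithm works as follows:

(1) Choose one center $c_1$ uniformly at random from $X$.

(2) For each sample point $x \in X$, compute $D(x)$, the distance from $x$ to the nearest center chosen so far. Choose the next center $c_i = x' \in X$ at random with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$, so that points far from the existing centers are more likely to be picked.

(3) Repeat step (2) until all k centers have been chosen.

(4) Starting from these k initial centers, run the k-means algorithm to compute the final centers.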
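Steps (1) to (3) can be sketched directly in NumPy. This is a simplified illustration under stated assumptions: the name `kmeans_pp_seeds` is hypothetical, only one candidate is drawn per step, and the data must not consist of identical points (otherwise the probabilities are undefined).

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng):
    """Pick k initial centers from the rows of X by D^2-weighted sampling."""
    n_samples = X.shape[0]
    # Step (1): the first center is chosen uniformly at random.
    centers = [X[rng.integers(n_samples)]]
    closest_dist_sq = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Step (2): sample the next center with probability D(x)^2 / sum D(x)^2.
        probs = closest_dist_sq / closest_dist_sq.sum()
        next_id = rng.choice(n_samples, p=probs)
        centers.append(X[next_id])
        # Update D(x)^2 to account for the newly added center.
        new_dist_sq = ((X - X[next_id]) ** 2).sum(axis=1)
        closest_dist_sq = np.minimum(closest_dist_sq, new_dist_sq)
    return np.array(centers)

rng = np.random.default_rng(0)
# Toy data: three tight blobs around (0, 0), (5, 5) and (10, 10).
X = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])
seeds = kmeans_pp_seeds(X, k=3, rng=rng)
print(seeds)
```

A point that is already a center has $D(x) = 0$ and therefore zero probability of being drawn again, so the seeds are always distinct data points.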

Following the steps above, the corresponding code is straightforward to write:

import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import euclidean_distances
# The location of stable_cumsum may differ between scikit-learn versions.
from sklearn.utils.extmath import stable_cumsum


def _k_init(self, x_squared_norms, random_state, n_local_trials=None):
    '''
    Initialize the centroids with k-means++.

    self.data : input data, shape (n_samples, n_features)
    self.k : number of centers
    x_squared_norms : squared Euclidean (L2) norm of each data point
    random_state : random number generator (np.random.RandomState)
        used to pick the centers
    n_local_trials : number of candidate centers sampled at each step;
        defaults to 2 + log(k)
    '''
    n_samples, n_features = self.data.shape
    centers = np.empty((self.k, n_features), dtype=self.data.dtype)
    assert x_squared_norms is not None, 'x_squared_norms None in _k_init'

    if n_local_trials is None:
        # This is what Arthur/Vassilvitskii tried, but did not report
        # specific results for other than mentioning in the conclusion
        # that it helped.
        n_local_trials = 2 + int(np.log(self.k))

    # Pick the first center uniformly at random.
    center_id = random_state.randint(n_samples)
    if sp.issparse(self.data):
        centers[0] = self.data[center_id].toarray()
    else:
        centers[0] = self.data[center_id]

    # Squared distance from every point to the first center,
    # and their sum (the current potential).
    closest_dist_sq = euclidean_distances(
        centers[0, np.newaxis], self.data, Y_norm_squared=x_squared_norms,
        squared=True)
    current_pot = closest_dist_sq.sum()

    # Pick the remaining k - 1 centers.
    for c in range(1, self.k):
        # Draw candidate centers with probability proportional to the
        # squared distance to the closest existing center.
        rand_vals = random_state.random_sample(n_local_trials) * current_pot
        # Locate each random value in the cumulative sum of the squared
        # distances; the insertion index is the id of the sampled point.
        candidate_ids = np.searchsorted(stable_cumsum(closest_dist_sq),
                                        rand_vals)
        # Numerical imprecision can push an index out of range.
        np.clip(candidate_ids, None, closest_dist_sq.size - 1,
                out=candidate_ids)

        # Squared distance from every point to each candidate.
        distance_to_candidates = euclidean_distances(
            self.data[candidate_ids], self.data,
            Y_norm_squared=x_squared_norms, squared=True)

        # Decide which candidate is the best.
        best_candidate = None
        best_pot = None
        best_dist_sq = None
        for trial in range(n_local_trials):
            # Compute the potential when including this candidate.
            new_dist_sq = np.minimum(closest_dist_sq,
                                     distance_to_candidates[trial])
            new_pot = new_dist_sq.sum()

            # Keep the result if it is the best trial so far.
            if (best_candidate is None) or (new_pot < best_pot):
                best_candidate = candidate_ids[trial]
                best_pot = new_pot
                best_dist_sq = new_dist_sq

        # Permanently add the best candidate found in the local trials.
        if sp.issparse(self.data):
            centers[c] = self.data[best_candidate].toarray()
        else:
            centers[c] = self.data[best_candidate]
        current_pot = best_pot
        closest_dist_sq = best_dist_sq

    return centers
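The least obvious part of the function is how `np.searchsorted` on a cumulative sum draws indices with probability proportional to the squared distances. The sketch below isolates that trick; the weights are made up, and plain `np.cumsum` is assumed to be accurate enough here in place of `stable_cumsum`.

```python
import numpy as np

# Hypothetical squared distances; zero entries stand for already chosen centers.
weights = np.array([0.0, 1.0, 3.0, 0.0, 6.0])
cumulative = np.cumsum(weights)  # [0, 1, 4, 4, 10]

rng = np.random.default_rng(0)
# Uniform values in [0, total weight), one per draw.
rand_vals = rng.random(10000) * cumulative[-1]

# Each value falls into the interval owned by one index; wider intervals
# (larger weights) are hit proportionally more often.
ids = np.searchsorted(cumulative, rand_vals)
ids = np.clip(ids, None, weights.size - 1)  # guard against rounding at the top end

# Empirical frequency of each index over all draws.
freq = np.bincount(ids, minlength=weights.size) / ids.size
print(freq)
```

The empirical frequencies should come out close to `weights / weights.sum()`, that is roughly 0.1, 0.3 and 0.6 for indices 1, 2 and 4, while index 3 owns an empty interval and is never drawn.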

The code above is the initialization step of k-means and follows the approach used in scikit-learn.

Using k-means in practice is therefore simple: once scikit-learn is installed, a few lines of code are enough.
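The code the original post pointed to at this spot is not available, so the following is a minimal sketch of what such usage might look like, assuming scikit-learn's `KMeans` estimator with `init="k-means++"`. The synthetic data, `n_init=10`, and `random_state=0` are choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three blobs of 50 points around (0, 0), (5, 5) and (10, 10).
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

# init="k-means++" selects the initialization described in this post.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_)  # final centers, one row per cluster
print(model.inertia_)          # sum of squared distances to the closest center
```

Setting `init="random"` instead makes it easy to compare both initializations on the same data.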
