Machine Learning: Spectral Clustering

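In spectral clustering, the "spectrum" is the set of all eigenvalues of a matrix, and the matrix in question is the Laplacian matrix of the graph built from all the data points. Spectral clustering therefore computes the eigenvectors of the data's Laplacian matrix and then runs k-means on a subset of those eigenvectors.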
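To make "spectrum" concrete, here is a minimal sketch on a hypothetical toy graph (two triangles with no edge between them; the graph is invented for illustration and is not from the original post). It builds the unnormalized Laplacian and inspects its eigenvalues. The number of zero eigenvalues equals the number of connected components, and the matching eigenvectors are constant within each component, which is what makes them useful for clustering.

```python
import numpy as np

# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5}, no edge between them.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=0))   # degree matrix: column sums on the diagonal
L = D - W                    # unnormalized graph Laplacian

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Count the (numerically) zero eigenvalues: one per connected component.
n_components = int(np.sum(np.abs(eigenvalues) < 1e-9))
print("spectrum:", eigenvalues)
print("connected components:", n_components)
```

Rows of `eigenvectors[:, :n_components]` that belong to the same triangle are identical, so k-means on those rows recovers the two groups trivially.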

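But why the Laplacian matrix? And why not run k-means directly on the raw data? These questions are probably why spectral clustering is simple to implement yet comparatively hard to understand in principle.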

1. Why the Laplacian matrix?
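One standard way to motivate the Laplacian, offered here as a sketch and not necessarily the argument the author had in mind, is its quadratic form. Assume a symmetric weight matrix $W$ with entries $w_{ij} \ge 0$, degrees $d_i = \sum_j w_{ij}$, and $L = D - W$. Then for any vector $f \in \mathbb{R}^n$:

```latex
f^{\top} L f
  = \sum_{i} d_i f_i^2 - \sum_{i,j} w_{ij} f_i f_j
  = \frac{1}{2} \sum_{i,j} w_{ij} \,(f_i - f_j)^2 .
```

If $f$ is the indicator vector of a subset $A$ of the nodes ($f_i = 1$ for $i \in A$, $0$ otherwise), the right-hand side reduces to $\sum_{i \in A,\, j \notin A} w_{ij}$, the total weight of the edges cut when $A$ is separated from the rest. Minimizing $f^{\top} L f$ over real-valued $f$ is a relaxation of that graph-cut problem, and its solutions are the eigenvectors of $L$ with the smallest eigenvalues.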

2. Why not run k-means directly on the raw data?
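A minimal sketch that illustrates the question empirically, assuming scikit-learn is available. It uses the same `make_circles` settings as the sklearn example later in this post; the `SpectralClustering` settings (`affinity="nearest_neighbors"`, `n_neighbors=10`) and the use of the adjusted Rand index as a score are illustrative choices, not taken from the original. k-means partitions the plane with a straight boundary, so it cannot separate two concentric rings, while clustering in the Laplacian eigenvector space can.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings; y_true marks which ring each sample belongs to.
X, y_true = make_circles(n_samples=500, shuffle=True, noise=0.03,
                         random_state=1, factor=0.618)

# k-means on the raw 2-D coordinates.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Spectral clustering on a k-nearest-neighbour graph (assumed settings).
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=1
).fit_predict(X)

# Adjusted Rand index: 1.0 is a perfect match, values near 0 are chance level.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_spectral = adjusted_rand_score(y_true, spectral_labels)
print(f"k-means ARI:  {ari_kmeans:.3f}")
print(f"spectral ARI: {ari_spectral:.3f}")
```

scikit-learn may warn that the neighbour graph is not fully connected; for this data that is expected, since the two rings share no edges.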

3. Implementation

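Algorithm steps:

1) Turn the data into a graph; each node represents one sample.

2) Compute the similarity between samples to form the adjacency matrix.

3) Compute the degree matrix: the diagonal matrix whose entries are the column sums of the adjacency matrix.

4) Compute the Laplacian matrix.

5) Compute the eigenvalues and eigenvectors of the Laplacian matrix.

6) Stack the eigenvectors belonging to the k smallest eigenvalues into a matrix and cluster its rows with k-means.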
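The steps above can be written compactly as follows. The Gaussian kernel in the first line is one common similarity and is an assumption here, since the post does not name a specific one; $\sigma$ is its bandwidth.

```latex
\begin{aligned}
w_{ij} &= \exp\!\left(-\frac{\lVert x_i - x_j\rVert^2}{2\sigma^2}\right)
  && \text{(step 2, one common similarity)}\\
D &= \operatorname{diag}(d_1,\dots,d_n), \quad d_i = \sum_{j} w_{ij}
  && \text{(step 3)}\\
L &= D - W
  && \text{(step 4)}\\
L_{\mathrm{sym}} &= D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}
  && \text{(step 4, normalized)}\\
L_{\mathrm{sym}} u_m &= \lambda_m u_m, \quad \lambda_1 \le \dots \le \lambda_n
  && \text{(step 5)}\\
U &= [\,u_1 \ \cdots \ u_k\,] \in \mathbb{R}^{n \times k}
  && \text{(step 6: k-means on the rows of } U\text{)}
\end{aligned}
```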

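# -*- coding: utf-8 -*-
"""Spectral clustering
1. Turn the data into a graph; each node represents one sample.
2. Compute the similarity between samples to form the adjacency matrix.
3. Compute the degree matrix: the diagonal matrix of the adjacency matrix's column sums.
4. Compute the Laplacian matrix.
5. Compute the eigenvectors of the Laplacian matrix.
6. Stack the eigenvectors of the k smallest eigenvalues into a matrix and cluster its rows with k-means.
"""
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
import matplotlib.pyplot as plt


class spectral:
    def __init__(self):
        pass

    def _cal_adjacentmatrix(self, train_x, sigma=0.05):
        """ Compute the graph's adjacency (similarity) matrix.
        The extracted code returned raw Euclidean distances, which gives the
        largest weights to the most distant samples. A Gaussian kernel is
        applied here instead; the kernel and sigma=0.05 are assumptions, not
        values from the original post.
        """
        samplecnt = len(train_x)
        distance = np.zeros((samplecnt, samplecnt))
        for i in range(samplecnt):
            distance[i, :] = np.sqrt(np.sum((train_x - train_x[i, :])**2, axis=1))
        return np.exp(-distance**2 / (2 * sigma**2))

    def _cal_degreematrix(self, adjacentmat):
        """ Compute the degree matrix.
        adjacentmat:    adjacency matrix of the graph
        return:         degree matrix
        """
        return np.diag(np.sum(adjacentmat, axis=0))

    def _cal_laplacianmat(self, adjacentmat, degreemat):
        """ Compute the normalized Laplacian I - D^(-1/2) W D^(-1/2).
        adjacentmat:    adjacency matrix of the graph
        degreemat:      degree matrix
        """
        # D^(-1/2), built without modifying degreemat in place
        d_inv_sqrt = np.diag(np.diag(degreemat)**(-1/2))
        return np.identity(len(adjacentmat)) - np.dot(np.dot(d_inv_sqrt, adjacentmat), d_inv_sqrt)

    def _eig_for_laplacianmat(self, laplacianmatrix):
        """ Eigendecomposition of the Laplacian matrix. """
        # eigh is for symmetric matrices and returns eigenvalues in ascending
        # order, so the first k columns belong to the k smallest eigenvalues
        featureval, featurevector = np.linalg.eigh(laplacianmatrix)
        return featureval, featurevector

    def clustering(self, train_x, k, **params):
        adjacentmat = self._cal_adjacentmatrix(train_x)
        degreemat = self._cal_degreematrix(adjacentmat)
        laplacianmat = self._cal_laplacianmat(adjacentmat, degreemat)
        featureval, featurevector = self._eig_for_laplacianmat(laplacianmat)
        # Take the first k eigenvectors of the Laplacian and cluster with k-means
        clf = KMeans(**params)
        clf.fit(featurevector[:, 0:k])
        return clf.labels_


def load_data():
    """ Generate the circles dataset with sklearn. """
    # The original dict was lost in extraction; these values are copied from
    # the make_circles call in the sklearn example in section 4.
    params = {"n_samples": 500, "shuffle": True, "noise": 0.03,
              "random_state": 1, "factor": 0.618}
    circles = datasets.make_circles(**params)
    return circles[0]


def draw_result(train_x, clusters):
    plt.figure()
    colors = ["red", "orange", "purple", "yellow", "blue"]
    # One colour per cluster label (supports up to five clusters)
    for i in np.unique(clusters):
        data = train_x[clusters == i]
        plt.scatter(data[:, 0], data[:, 1], color=colors[i], s=20)
    plt.show()


if __name__ == "__main__":
    train_x = load_data()
    obj = spectral()
    # The original dict was lost in extraction; n_clusters=2 is an assumption
    # (make_circles produces two rings).
    params = {"n_clusters": 2, "n_init": 10, "random_state": 1}
    # The choice of k matters a great deal
    k = 2
    clusters = obj.clustering(train_x, k, **params)
    draw_result(train_x, clusters)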
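The comment about k at the end of the script says the choice matters but does not say how to make it. One common heuristic, sketched here as an assumption and not as the author's method, is the eigengap: sort the eigenvalues of the normalized Laplacian and pick the k after which the largest jump occurs. The helper `eigengap_k` and the three-blob toy data are hypothetical and exist only for this illustration.

```python
import numpy as np

def eigengap_k(W, max_k=10):
    """Suggest a cluster count from the largest gap in the Laplacian spectrum.

    W is a symmetric similarity matrix with positive row sums.
    """
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=0))
    # Normalized Laplacian I - D^(-1/2) W D^(-1/2), using broadcasting
    L_sym = np.identity(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    eigenvalues = np.linalg.eigvalsh(L_sym)[:max_k + 1]
    gaps = np.diff(eigenvalues)
    # A gap between the k-th and (k+1)-th eigenvalue suggests k clusters
    return int(np.argmax(gaps)) + 1

# Hypothetical toy data: three tight, well-separated blobs of five points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in centers])

# Gaussian similarity with sigma = 1 (an assumed bandwidth)
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
W = np.exp(-sq_dist / 2.0)

k = eigengap_k(W)
print("suggested k:", k)
```

The heuristic works well when clusters are clearly separated, as in this toy data; on noisy or overlapping data the gaps are less pronounced and the suggestion should be treated as a starting point only.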
4. Spectral clustering with sklearn
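# -*- coding: utf-8 -*-
from sklearn.cluster import SpectralClustering
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the data
train = datasets.make_circles(n_samples=500, shuffle=True, noise=0.03, random_state=1, factor=0.618)
train_x = train[0]

# SpectralClustering model
# The original dict was lost in extraction; these values are assumptions.
params = {"n_clusters": 2, "affinity": "nearest_neighbors", "n_neighbors": 10, "random_state": 1}
clf = SpectralClustering(**params)
clusters = clf.fit_predict(train_x)

# Plot the clustering result
plt.figure()
plt.scatter(train_x[:, 0], train_x[:, 1], c=clusters, s=20)
plt.show()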
The result matches that of the hand-written implementation.
5. Pros and cons of spectral clustering