Three Ways to Implement the PCA Algorithm in Python

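Principal component analysis (PCA) is an important topic in multivariate statistics and is widely used in machine learning and other fields. Its main purpose is to reduce the dimensionality of high-dimensional data. PCA replaces the original n features with a smaller number k of new features. Each new feature is a linear combination of the old ones, chosen to maximize the sample variance while keeping the k new features as uncorrelated with one another as possible. For more background on PCA, see the references at the end of this article.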
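To make the "uncorrelated new features" claim concrete, here is a minimal sketch in NumPy. The data set is randomly generated and purely hypothetical; the point is only that the covariance matrix of the projected data comes out diagonal, with the largest eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 3 correlated features built from 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 1.0, 0.5],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center the data and decompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep k new features, each a linear combination of the old ones.
k = 2
Z = Xc @ eigvecs[:, :k]

# The covariance of the new features is diagonal: they are uncorrelated.
cov_Z = np.cov(Z, rowvar=False)
print(np.round(cov_Z, 6))
```

The diagonal entries of `cov_Z` are the variances captured by each new feature, which is why the components with the largest eigenvalues are the ones kept.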

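The main steps of the PCA algorithm are as follows:

- organize the data into a form the model can use;
- compute the mean of each feature;
- subtract the corresponding feature mean from every sample;
- compute the covariance matrix;
- find the eigenvalues and eigenvectors of the covariance matrix;
- sort the eigenvalues in descending order and rearrange the eigenvectors to match;
- compute the cumulative contribution rate of the eigenvalues;
- select the leading eigenvectors, either by a fixed number of components or by a threshold on the cumulative contribution rate;
- project the centered data onto the selected eigenvectors.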
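The basic procedure (center the data, form the covariance matrix, decompose it, sort, and project) can be sketched compactly in NumPy. This is a minimal sketch rather than the article's full program; the helper name `pca_steps` and its return values are illustrative assumptions.

```python
import numpy as np


def pca_steps(data, k):
    """Hypothetical helper: project `data` (samples in rows) onto its top-k components.

    Returns the projected data and the cumulative contribution rate of the eigenvalues.
    """
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)                      # subtract each feature's mean
    cov = centered.T @ centered / (X.shape[0] - 1)     # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()         # cumulative contribution rate
    return centered @ eigvecs[:, :k], ratio


# Same small test matrix as the article's program: 3 samples, 5 features.
data = [[-1, -1, 0, 2, 1], [2, 0, 0, -1, -1], [2, 0, 1, 1, 0]]
scores, ratio = pca_steps(data, k=2)
print(scores.shape)
print(np.round(ratio, 3))
```

Because this data set has only three samples, the centered matrix has rank at most two, so the first two components already account for essentially all of the variance.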

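Of these steps, the decomposition of the covariance matrix can be carried out either through the eigendecomposition of a symmetric matrix or through the singular value decomposition (SVD) of a matrix. scikit-learn also uses SVD to implement PCA. For an introduction to SVD and its underlying theory, see the blog post "Singular Value Decomposition of a Matrix (SVD) (Theory)".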
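One way to see why SVD can stand in for the eigendecomposition is to decompose the centered data matrix directly. The sketch below assumes that reading (the article's own program applies SVD to the covariance matrix instead). It checks that the squared singular values divided by p - 1 equal the eigenvalues of the covariance matrix, and that the right singular vectors are its eigenvectors.

```python
import numpy as np

X = np.array([[-1, -1, 0, 2, 1],
              [2, 0, 0, -1, -1],
              [2, 0, 1, 1, 0]], dtype=float)
centered = X - X.mean(axis=0)
p = centered.shape[0]                       # number of samples

# Route 1: eigenvalues of the covariance matrix, reversed to descending order.
cov = centered.T @ centered / (p - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Route 2: SVD of the centered data, centered = U @ diag(s) @ Vt.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
var_from_svd = s ** 2 / (p - 1)

print(np.round(eigvals[:len(s)], 6))
print(np.round(var_from_svd, 6))

# The first right singular vector is an eigenvector of the covariance matrix.
print(np.allclose(cov @ vt[0], var_from_svd[0] * vt[0]))
```

Working on the data matrix avoids forming the covariance matrix explicitly, which can matter when the number of features is large.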

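This article implements PCA in three ways. The first is the original algorithm, that is, the procedure described above; for the detailed calculations, see A Tutorial on Principal Components Analysis by Lindsay I Smith. The second is the original algorithm with SVD: Python's NumPy module already implements SVD and returns the singular values sorted from largest to smallest, which removes the step of rearranging the eigenvalues and eigenvectors. The third uses the PCA class from Python's scikit-learn module directly, to check that the first two methods are correct.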
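The difference in ordering between the two NumPy routines is easy to check. The sketch below uses a small hypothetical symmetric positive definite matrix in place of a real covariance matrix: `np.linalg.eigh` returns eigenvalues in ascending order, while `np.linalg.svd` returns singular values in descending order, so only the former needs rearranging.

```python
import numpy as np

# Hypothetical symmetric positive definite matrix standing in for a covariance matrix.
A = np.array([[4.0, 2.0, 0.6],
              [2.0, 2.0, 0.5],
              [0.6, 0.5, 1.0]])

eigvals, eigvecs = np.linalg.eigh(A)   # ascending eigenvalues, eigenvectors in columns
u, d, vt = np.linalg.svd(A)            # descending singular values, vectors in columns of u

print(eigvals)
print(d)

# For such a matrix the reversed eigenvalues equal the singular values,
# and the matching vectors agree up to sign.
print(np.allclose(eigvals[::-1], d))
print(np.allclose(np.abs(eigvecs[:, ::-1]), np.abs(u)))
```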

The complete Python code implementing PCA with these three methods is as follows:

import sys

import numpy as np
from sklearn.decomposition import PCA


# Return how many principal components to keep.
def index_lst(lst, component=0, rate=0):
    # component: number of principal components to keep
    # rate: required ratio sum(kept eigenvalues) / sum(all eigenvalues)
    # suggested range for rate: (0.8, 1)
    # when choosing by rate, the result is 0 or an index less than len(lst)
    if component and rate:
        print('component and rate must choose only one!')
        sys.exit(0)
    if not component and not rate:
        print('invalid parameter for numbers of components!')
        sys.exit(0)
    elif component:
        print('choosing by component, components are %s......' % component)
        return component
    else:
        print('choosing by rate, rate is %s ......' % rate)
        for i in range(1, len(lst)):
            if sum(lst[:i]) / sum(lst) >= rate:
                return i
        return 0


def main():
    # test data: 3 samples (rows), 5 features (columns)
    mat = [[-1, -1, 0, 2, 1], [2, 0, 0, -1, -1], [2, 0, 1, 1, 0]]
    # convert the test data to a float array
    mat = np.array(mat, dtype='float64')
    print('before PCA transformation, data is:\n', mat)

    print('\nmethod 1: PCA by original algorithm:')
    p, n = np.shape(mat)  # p samples, n features
    t = np.mean(mat, 0)  # mean of each column
    # subtract the mean of each column
    mat = mat - t
    # covariance matrix
    cov_mat = np.dot(mat.T, mat) / (p - 1)
    # eigenvalues and eigenvectors of the covariance matrix
    # (np.linalg.eigh returns the eigenvalues in ascending order)
    u, v = np.linalg.eigh(cov_mat)
    # rearrange so the eigenvalues are descending, with matching eigenvector columns
    u = u[::-1]
    v = v[:, ::-1]
    # choose by component or by rate; the two must not both be 0
    index = index_lst(u, component=2)  # choose how many principal components
    if index:
        v = v[:, :index]  # keep the leading eigenvectors
    else:  # an improper rate choice may return index=0
        print('invalid rate choice.\nplease adjust the rate.')
        print('rate distribution follows:')
        print([sum(u[:i]) / sum(u) for i in range(1, len(u) + 1)])
        sys.exit(0)
    # data transformation
    t1 = np.dot(mat, v)
    # print the transformed data
    print('we choose %d main factors.' % index)
    print('after PCA transformation, data becomes:\n', t1)

    # PCA by original algorithm using SVD
    print('\nmethod 2: PCA by original algorithm using SVD:')
    # u: unitary matrix, eigenvectors in columns
    # d: singular values, sorted in descending order
    u, d, v = np.linalg.svd(cov_mat)
    index = index_lst(d, rate=0.95)  # choose how many principal components
    t2 = np.dot(mat, u[:, :index])  # transformed data
    print('we choose %d main factors.' % index)
    print('after PCA transformation, data becomes:\n', t2)

    # PCA by scikit-learn
    pca = PCA(n_components=2)  # n_components can be an integer or a float in (0, 1)
    pca.fit(mat)  # fit the model
    print('\nmethod 3: PCA by scikit-learn:')
    print('after PCA transformation, data becomes:')
    print(pca.fit_transform(mat))  # transformed data


if __name__ == '__main__':
    main()

Running the code above produces the following output:

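This shows that all three methods are workable ways to implement PCA, and working through them makes the concrete steps of PCA easier to understand.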
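When comparing results from different implementations, keep in mind that an eigenvector is only defined up to sign, so a whole column of the transformed data may be flipped between methods. A minimal sketch of such a comparison follows; the helper `same_up_to_sign` is a hypothetical name introduced here, and only the eigendecomposition and SVD routes are compared.

```python
import numpy as np


def same_up_to_sign(a, b, tol=1e-8):
    """Hypothetical helper: True if each column of `a` equals the matching
    column of `b` or its negation, within `tol`."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        return False
    return all(
        np.allclose(a[:, j], b[:, j], atol=tol)
        or np.allclose(a[:, j], -b[:, j], atol=tol)
        for j in range(a.shape[1])
    )


X = np.array([[-1, -1, 0, 2, 1],
              [2, 0, 0, -1, -1],
              [2, 0, 1, 1, 0]], dtype=float)
centered = X - X.mean(axis=0)
cov = centered.T @ centered / (X.shape[0] - 1)

# Method 1: eigendecomposition, with columns reversed into descending order.
_, v = np.linalg.eigh(cov)
t1 = centered @ v[:, ::-1][:, :2]

# Method 2: SVD of the covariance matrix, already in descending order.
u, _, _ = np.linalg.svd(cov)
t2 = centered @ u[:, :2]

print(same_up_to_sign(t1, t2))
```

The same helper can be applied to the array returned by scikit-learn's `fit_transform` to compare the third method against the first two.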

Interested readers may want to try implementing it in another language.

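PCA on Wikipedia:

A detailed and thorough PCA tutorial: A Tutorial on Principal Components Analysis, Lindsay I Smith.

Blog post: Singular Value Decomposition of a Matrix (SVD) (Theory):

Blog post: Principal Component Analysis (PCA):

scikit-learn's introduction to PCA: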
