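Principal component analysis (PCA) is an important topic in multivariate statistics and is widely used in machine learning and other fields. Its main purpose is to reduce the dimensionality of high-dimensional data. PCA replaces the original n features with a smaller number k of new features. Each new feature is a linear combination of the old ones, chosen to maximize the sample variance while keeping the k new features uncorrelated with one another.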
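The main PCA algorithm, as implemented in method 1 of the code below, is: subtract each column's mean, compute the covariance matrix, find its eigenvalues and eigenvectors, sort them by descending eigenvalue, keep the eigenvectors of the k largest eigenvalues, and project the centered data onto them.
The covariance matrix can be decomposed either through the eigendecomposition of a symmetric matrix or through a singular value decomposition (SVD). scikit-learn also uses SVD to implement PCA.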
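To make the link between the two decompositions concrete, here is a minimal sketch. It assumes randomly generated toy data (not from this post) and applies the SVD to the centered data matrix rather than to the covariance matrix. The eigenvalues of the covariance matrix should equal the squared singular values divided by (number of samples - 1), and the directions should agree up to sign.

```python
import numpy as np

# Hypothetical toy data for illustration: 6 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Xc = X - X.mean(axis=0)                  # center each column
n_samples = Xc.shape[0]

# Route 1: eigendecomposition of the symmetric covariance matrix.
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data; s comes back in descending order.
_, s, vt = np.linalg.svd(Xc, full_matrices=False)

# Compare eigenvalues with scaled squared singular values.
print(np.allclose(eigvals, s**2 / (n_samples - 1)))
# Compare directions, ignoring the sign of each vector.
print(np.allclose(np.abs(eigvecs), np.abs(vt.T)))
```

The absolute values are only a quick way to ignore sign differences in this sketch. They are not something PCA itself requires.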
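Below are three ways to implement PCA dimensionality reduction in Python:
1. The direct method, i.e. the original algorithm described above.
2. SVD (singular value decomposition): the original algorithm with SVD. Python's numpy module already implements SVD and returns the singular values sorted from largest to smallest, which removes the step of reordering the eigenvalues and eigenvectors.
3. The scikit-learn module: use its PCA class to compute the result directly and check that the first two methods are correct.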
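Because the singular values come back already sorted, the number of components can be read straight off their cumulative share of the total. The sketch below is one way to do that, assuming the same test data and the same 0.95 threshold that method 2 uses in the code further down. The names `ratios`, `rate` and `k` are chosen here for illustration.

```python
import numpy as np

# Test data from this post: 3 samples (rows), 5 features (columns).
X = np.array([[-1, -1, 0, 2, 1],
              [2, 0, 0, -1, -1],
              [2, 0, 1, 1, 0]], dtype=float)
Xc = X - X.mean(axis=0)                      # center each column
cov = Xc.T @ Xc / (X.shape[0] - 1)           # 5 x 5 covariance matrix

# Singular values of the covariance matrix, largest first.
_, d, _ = np.linalg.svd(cov)

# Cumulative share of the total for the first 1, 2, ... components.
ratios = np.cumsum(d) / np.sum(d)

rate = 0.95
# Smallest k whose cumulative share reaches the threshold.
k = int(np.searchsorted(ratios, rate) + 1)
print(k)
```

With only 3 samples the centered data has rank at most 2, so two components already carry essentially all of the variance.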
PCA dimensionality reduction relies on covariance. For background, see my other post: Understanding Covariance and Implementing It in Python.
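As a quick refresher, the sample covariance matrix is the centered data multiplied by its own transpose and divided by (number of samples - 1). The following sketch uses made-up numbers (not from this post) and checks the manual formula against `np.cov`.

```python
import numpy as np

# Hypothetical sample: 4 observations (rows) of 2 features (columns).
X = np.array([[1.0, 2.0],
              [2.0, 4.5],
              [3.0, 5.5],
              [4.0, 8.0]])

# Manual formula: center the columns, then X^T X / (n - 1).
Xc = X - X.mean(axis=0)
cov_manual = Xc.T @ Xc / (X.shape[0] - 1)

# np.cov treats rows as variables by default, so pass rowvar=False
# when each column is a feature.
cov_numpy = np.cov(X, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
```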
import sys

import numpy as np
from sklearn.decomposition import PCA


# return how many principal components to keep
def index_lst(lst, component=0, rate=0):
    # component: number of principal components
    # rate: sum(kept eigenvalues) / sum(all eigenvalues)
    # suggested range for rate: (0.8, 1)
    # with the rate parameter, the result is 0 (rate never reached) or at most len(lst)
    if component and rate:
        print('component and rate must choose only one!')
        sys.exit(0)
    if not component and not rate:
        print('invalid parameter for numbers of components!')
        sys.exit(0)
    elif component:
        print('choosing by component, components are %s......'%component)
        return component
    else:
        print('choosing by rate, rate is %s ......'%rate)
        # len(lst) + 1 so that keeping every component is also a valid answer
        for i in range(1, len(lst) + 1):
            if sum(lst[:i])/sum(lst) >= rate:
                return i
        return 0


def main():
    # test data
    mat = [[-1,-1,0,2,1],[2,0,0,-1,-1],[2,0,1,1,0]]

    # convert the test data to a float array
    mat = np.array(mat, dtype='float64')
    print('before pca transformation, data is:\n', mat)
    print('\nmethod 1: pca by original algorithm:')
    p,n = np.shape(mat) # p samples (rows), n features (columns)
    t = np.mean(mat, 0) # mean of each column

    # subtract the mean of each column
    mat = mat - t

    # covariance matrix
    cov_mat = np.dot(mat.T, mat)/(p-1)

    # pca by original algorithm
    # eigenvalues and eigenvectors of the covariance matrix;
    # np.linalg.eigh returns the eigenvalues in ascending order
    u,v = np.linalg.eigh(cov_mat)
    # rearrange the eigenvalues and eigenvectors into descending order
    u = u[::-1]
    v = v[:, ::-1]

    # choose eigenvalues by component or rate, not both of them equal to 0
    index = index_lst(u, component=2) # choose how many main factors
    if index:
        v = v[:,:index] # subset of unitary matrix
    else: # improper rate choice may return index=0
        print('invalid rate choice.\nplease adjust the rate.')
        print('rate distribute follows:')
        print([sum(u[:i])/sum(u) for i in range(1, len(u)+1)])
        sys.exit(0)

    # data transformation
    t1 = np.dot(mat, v)
    # print the transformed data
    print('we choose %d main factors.'%index)
    print('after pca transformation, data becomes:\n',t1)

    # pca by original algorithm using svd
    print('\nmethod 2: pca by original algorithm using svd:')
    # u: unitary matrix, eigenvectors in columns
    # d: list of the singular values, sorted in descending order
    u,d,v = np.linalg.svd(cov_mat)
    index = index_lst(d, rate=0.95) # choose how many main factors
    t2 = np.dot(mat, u[:,:index]) # transformed data
    print('we choose %d main factors.'%index)
    print('after pca transformation, data becomes:\n',t2)

    # pca by scikit-learn
    pca = PCA(n_components=2) # n_components can be integer or float in (0,1)
    print('\nmethod 3: pca by scikit-learn:')
    print('after pca transformation, data becomes:')
    print(pca.fit_transform(mat)) # fit the model and return the transformed data


if __name__ == '__main__':
    main()
The output is as follows:
before pca transformation, data is:
[[-1. -1. 0. 2. 1.]
[ 2. 0. 0. -1. -1.]
[ 2. 0. 1. 1. 0.]]
method 1: pca by original algorithm:
choosing by component, components are 2......
we choose 2 main factors.
after pca transformation, data becomes:
[[ 2.6838453 -0.36098161]
[-2.09303664 -0.78689112]
[-0.59080867 1.14787272]]
method 2: pca by original algorithm using svd:
choosing by rate, rate is 0.95 ......
we choose 2 main factors.
after pca transformation, data becomes:
[[ 2.6838453 0.36098161]
[-2.09303664 0.78689112]
[-0.59080867 -1.14787272]]
method 3: pca by scikit-learn:
after pca transformation, data becomes:
[[ 2.6838453 -0.36098161]
[-2.09303664 -0.78689112]
[-0.59080867 1.14787272]]
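The second column of method 2 has the opposite sign from methods 1 and 3. This is expected rather than an error: if v is a unit eigenvector then so is -v, so each principal component is only determined up to sign, and different routines may pick different signs. The sketch below shows one possible way to compare such results. `align_signs` is a hypothetical helper written for this illustration (not part of the code above), and the convention it uses, making the largest-magnitude entry of each column positive, is only one of several reasonable choices.

```python
import numpy as np

def align_signs(columns):
    """Flip each column so that its largest-magnitude entry is positive.

    Hypothetical helper for illustration only.
    """
    columns = np.asarray(columns, dtype=float)
    # Row index of the largest-magnitude entry in each column.
    idx = np.argmax(np.abs(columns), axis=0)
    signs = np.sign(columns[idx, np.arange(columns.shape[1])])
    return columns * signs

# Transformed data copied from the output above (method 1 and method 2).
t1 = np.array([[ 2.6838453,  -0.36098161],
               [-2.09303664, -0.78689112],
               [-0.59080867,  1.14787272]])
t2 = np.array([[ 2.6838453,   0.36098161],
               [-2.09303664,  0.78689112],
               [-0.59080867, -1.14787272]])

# After applying the same sign convention, the two results coincide.
print(np.allclose(align_signs(t1), align_signs(t2)))
```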