Machine Learning Data Preprocessing: Shuffling and Splitting

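When preprocessing data for logistic regression, the label values are sometimes unevenly distributed across the dataset, for example grouped by class. Before splitting such a dataset we need to randomize the order of the samples, which is also called shuffling. The shuffled data is then split into a training set and a test set.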
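To make the motivation concrete, here is a minimal sketch of what can go wrong without shuffling. The label array is hypothetical toy data (not from the example used later), sorted by class, and the 70/30 split point is computed the same way as later in this post.

```python
import numpy as np

# Hypothetical toy labels, sorted by class: seven 1s followed by three 0s
y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

a = int(0.7 * len(y))            # split point: the first 70% goes to training
train_y, test_y = y[:a], y[a:]   # split WITHOUT shuffling

print(np.unique(train_y))        # classes present in the training set
print(np.unique(test_y))         # classes present in the test set
```

In this toy case the training set contains only class 1 and the test set only class 0, so a model could never see both classes during training.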

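In NumPy, we can call np.random.seed() to set the random seed so that the generated sequence is reproducible. The argument passed to seed() (an integer here) determines the 'rule' by which the random numbers are generated, as shown below: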

import numpy as np

for i in range(3):
    np.random.seed(3)              # reset the random seed on every iteration (an integer is passed here)
    print(np.random.rand(3, 2))    # generate a random 3x2 matrix
----
[[0.5507979  0.70814782]
[0.29090474 0.51082761]
[0.89294695 0.89629309]]
[[0.5507979  0.70814782]
[0.29090474 0.51082761]
[0.89294695 0.89629309]]
[[0.5507979  0.70814782]
[0.29090474 0.51082761]
[0.89294695 0.89629309]]

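However, the 'rule' only fixes where the sequence starts. If a second random operation follows without resetting the seed, it continues the sequence and returns different numbers.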

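np.random.seed(3)              # set the random seed once
print(np.random.rand(3, 2))    # first call: random 3x2 matrix
print(np.random.rand(3, 2))    # second call: continues the sequence, so the values differ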
----
[[0.5507979  0.70814782]
[0.29090474 0.51082761]
[0.89294695 0.89629309]]
[[0.12558531 0.20724288]
[0.0514672  0.44080984]
[0.02987621 0.45683322]]
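The behaviour above can be checked directly. The following is a small sketch (the variable names are only illustrative) showing that setting the same seed again replays the whole sequence, while two consecutive calls after a single seed differ.

```python
import numpy as np

np.random.seed(3)
first = np.random.rand(3, 2)
second = np.random.rand(3, 2)          # continues the sequence started by the seed

np.random.seed(3)                      # setting the same seed restarts the sequence
replay_first = np.random.rand(3, 2)
replay_second = np.random.rand(3, 2)

print(np.array_equal(first, replay_first))     # True: same seed, same first draw
print(np.array_equal(second, replay_second))   # True: the second draw is replayed too
print(np.array_equal(first, second))           # False: consecutive draws differ
```

So the results stay reproducible as long as the seed is set once and the random calls are made in the same order.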

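NumPy also provides np.random.permutation(), which returns a shuffled sequence of indices determined by the current 'rule' (seed). We use that one sequence to shuffle both x and y in the same order.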

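import numpy as np

x = [[0, 0],
     [0, 1],
     [1, 0],
     [1, 1]]
y = [[1],
     [1],
     [0],
     [0]]
# convert the data to arrays; a column of ones is prepended to x
x = np.c_[np.ones(len(x)), x]
y = np.c_[y]
# shuffle
m = len(x)
np.random.seed(3)                  # random seed
order = np.random.permutation(m)   # returns a shuffled sequence of the indices 0..m-1
x = x[order]                       # apply the shuffled order to the features
y = y[order]                       # apply the same order to the labels
print(x)
print(y)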
----
[[1. 1. 1.]
[1. 0. 1.]
[1. 0. 0.]
[1. 1. 0.]]
[[0]
[1]
[1]
[0]]
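The point of indexing both arrays with the same sequence is that every sample keeps its own label. A minimal sketch of that check follows; it uses the same four samples without the column of ones, and relies on the fact that in this toy data the label is 1 exactly when the first feature is 0.

```python
import numpy as np

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[1], [1], [0], [0]])   # label is 1 exactly when the first feature is 0

np.random.seed(3)
order = np.random.permutation(len(x))   # one index sequence for both arrays
x_shuffled = x[order]
y_shuffled = y[order]

# Every shuffled row should still satisfy the original feature/label relation
aligned = np.array_equal(y_shuffled[:, 0], 1 - x_shuffled[:, 0])
print(order)
print(aligned)
```

Shuffling x and y with two separately generated sequences would generally break this pairing.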

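After shuffling, we use np.split() to split the data, taking the first 70% of the dataset as the training set and the remaining 30% as the test set. Because the practice dataset is so small (with m = 4 the split point is int(2.8) = 2, so each set gets two samples), the output itself is not important; the point is to understand the approach.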

# a is the split point; int() truncates it to a whole number
a = int(0.7 * m)
train_x, test_x = np.split(x, [a])
train_y, test_y = np.split(y, [a])
print('Training features\n', train_x)
print('Training labels\n', train_y)
print('Test features\n', test_x)
print('Test labels\n', test_y)
----
Training features
[[1. 1. 1.]
[1. 0. 1.]]
Training labels
[[0]
[1]]
Test features
[[1. 0. 0.]
[1. 1. 0.]]
Test labels
[[1]
[0]]
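The two steps can be wrapped into one function. The sketch below is only one possible arrangement: shuffle_split, train_ratio and seed are hypothetical names, the data is made up for illustration, and it uses np.random.default_rng (NumPy's newer Generator interface) instead of the global np.random.seed so the function does not change global state.

```python
import numpy as np

def shuffle_split(x, y, train_ratio=0.7, seed=3):
    """Hypothetical helper: shuffle x and y together, then split at train_ratio."""
    x = np.asarray(x)
    y = np.asarray(y)
    m = len(x)
    rng = np.random.default_rng(seed)   # local generator, reproducible for a given seed
    order = rng.permutation(m)          # one shuffled index sequence for both arrays
    x, y = x[order], y[order]
    a = int(train_ratio * m)            # split point, truncated to a whole number
    train_x, test_x = np.split(x, [a])
    train_y, test_y = np.split(y, [a])
    return train_x, test_x, train_y, test_y

# Made-up data: 10 samples with 2 features; label is 1 when the first feature is below 10
x = np.arange(20).reshape(10, 2)
y = (x[:, [0]] < 10).astype(int)

train_x, test_x, train_y, test_y = shuffle_split(x, y)
print(train_x.shape, test_x.shape)
print(train_y.shape, test_y.shape)
```

With 10 samples the split point is 7, so the training set has 7 rows and the test set has 3.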
