I found this while tidying up some files. It looks like code from quite a while ago, but on re-reading it still holds up; at least the comments are fairly clear. Back then I really was a pure package-caller...
# Logistic regression - the standard routine
from pyspark.ml.feature import VectorAssembler
import pandas as pd

# 1. Prepare the data - sample dataset
sample_dataset = [
    (0, "male", 37, 10, "no", 3, 18, 7, 4),
    (0, "female", 27, 4, "no", 4, 14, 6, 4),
    (0, "female", 32, 15, "yes", 1, 12, 1, 4),
    (0, "male", 57, 15, "yes", 5, 18, 6, 5),
    (0, "male", 22, 0.75, "no", 2, 17, 6, 3),
    (0, "female", 32, 1.5, "no", 2, 17, 5, 5),
    (0, "female", 22, 0.75, "no", 2, 12, 1, 3),
    (0, "male", 57, 15, "yes", 2, 14, 4, 4),
    (0, "female", 32, 15, "yes", 4, 16, 1, 2),
    (0, "male", 22, 1.5, "no", 4, 14, 4, 5),
    (0, "male", 37, 15, "yes", 2, 20, 7, 2),
    (0, "male", 27, 4, "yes", 4, 18, 6, 4),
    (0, "male", 47, 15, "yes", 5, 17, 6, 4),
    (0, "female", 22, 1.5, "no", 2, 17, 5, 4),
    (0, "female", 27, 4, "no", 4, 14, 5, 4),
    (0, "female", 37, 15, "yes", 1, 17, 5, 5),
    (0, "female", 37, 15, "yes", 2, 18, 4, 3),
    (0, "female", 22, 0.75, "no", 3, 16, 5, 4),
    (0, "female", 22, 1.5, "no", 2, 16, 5, 5),
    (0, "female", 27, 10, "yes", 2, 14, 1, 5),
    (1, "female", 32, 15, "yes", 3, 14, 3, 2),
    (1, "female", 27, 7, "yes", 4, 16, 1, 2),
    (1, "male", 42, 15, "yes", 3, 18, 6, 2),
    (1, "female", 42, 15, "yes", 2, 14, 3, 2),
    (1, "male", 27, 7, "yes", 2, 17, 5, 4),
    (1, "male", 32, 10, "yes", 4, 14, 4, 3),
    (1, "male", 47, 15, "yes", 3, 16, 4, 2),
    (0, "male", 37, 4, "yes", 2, 20, 6, 4)
]
columns = ["affairs", "gender", "age", "label", "children", "religiousness", "education", "occupation", "rating"]
# Build the DataFrame with pandas first - convenient
pdf = pd.DataFrame(sample_dataset, columns=columns)
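# The steps below select from a Spark DataFrame called `df`, which the original
# snippet never constructs; a minimal bridge added here so the code runs end to
# end, assuming a local SparkSession.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pdf)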
# 2. Feature selection: affairs is the target, the rest are features
#    - in real work this is the most painful part: many tables, lots of data cleaning
df2 = df.select("affairs", "age", "religiousness", "education", "occupation", "rating")
# 3. Assemble features - merge the feature columns into a single "features" column;
#    categorical data needs one-hot encoding before assembling, which is fairly
#    tedious (a sketch follows step 3.2)
# 3.1 Columns used to build the feature vector
colArray2 = ["age", "religiousness", "education", "occupation", "rating"]
# 3.2 Compute the feature vector
df3 = VectorAssembler().setInputCols(colArray2).setOutputCol("features").transform(df2)
# 4. Split into training and test sets (randomly)
trainDF, testDF = df3.randomSplit([0.8, 0.2])
# print("Training set:")
# trainDF.show(10)
# print("Test set:")
# testDF.show(10)
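# Note (not in the original): randomSplit is not deterministic across runs;
# pass a seed if a reproducible split is wanted, e.g.
# trainDF, testDF = df3.randomSplit([0.8, 0.2], seed=42)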
# 5. Train the model
from pyspark.ml.classification import LogisticRegression
# 5.1 Create the logistic regression estimator
lr = LogisticRegression()
# 5.2 Fit the model
model = lr.setLabelCol("affairs").setFeaturesCol("features").fit(trainDF)
# 5.3 Predict on the test data
model.transform(testDF).show()
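# --- Optional sketch (not part of the original routine) ---
# The fitted LogisticRegressionModel exposes its learned parameters, which is a
# quick sanity check on what step 5.2 actually produced.
print(model.coefficients)  # one weight per column assembled into "features"
print(model.intercept)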
# TODO
# 6. Evaluation, cross-validation, persistence, packaging...
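# --- Optional sketch of the TODO above (not part of the original routine) ---
# A rough illustration of evaluation, cross-validation and saving, assuming the
# variables defined earlier (lr, trainDF, testDF) and a writable example path.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Area under ROC on the held-out test set
evaluator = BinaryClassificationEvaluator(labelCol="affairs", metricName="areaUnderROC")
print(evaluator.evaluate(model.transform(testDF)))

# 3-fold cross-validation over a small regularisation grid
grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.01, 0.1]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(trainDF)

# Persist the best model (the path is just an example)
cvModel.bestModel.write().overwrite().save("/tmp/affairs_lr_model")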
This is kept mainly as a historical note, and also as a cautionary example: if you just call the packages without understanding the underlying theory, you will find that ML is actually rather boring, at least judging from code patterns like this one.