Python Study Notes 1

Written down to help me remember.

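Descriptive statistics workflow: first determine each variable's type. For one categorical variable, compute summary statistics and frequencies with value_counts and plot the counts as a bar chart. For two categorical variables, build a cross-tabulation with crosstab, plot a standardized stacked bar chart, and test with the chi-square test. For one categorical and one continuous variable, group with groupby, draw a grouped box plot with boxplot, and test with a two-sample t-test (or analysis of variance when there are more than two categories). For two continuous variables, use a pivot table (pivot) and a scatter plot, and test with correlation analysis. Note that correlation analysis and regression analysis are different: correlation analysis checks whether the variables are related at all, while regression analysis, once a relationship has been confirmed, determines what functional relationship holds between them.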
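
The descriptive half of the workflow above can be sketched with pandas alone. This is a minimal illustration on made-up data: the DataFrame `df` and its columns `gender`, `plan`, `age` and `spend` are assumptions invented for the example, not part of the dataset used in these notes, and the statistical tests themselves are not shown.

```python
import pandas as pd

# Toy data invented for illustration; not the rfm_trad_flow.csv data used in these notes.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "plan":   ["basic", "basic", "pro", "pro", "pro", "basic"],
    "age":    [23, 35, 31, 40, 28, 52],
    "spend":  [120.0, 80.0, 200.0, 150.0, 180.0, 60.0],
})

# One categorical variable: frequency table
# (counts.plot(kind="bar") would draw it).
counts = df["plan"].value_counts()

# Two categorical variables: cross-tabulation, normalized per row so that each
# row sums to 1; this is the table behind a standardized stacked bar chart.
share = pd.crosstab(df["gender"], df["plan"], normalize="index")

# One categorical + one continuous variable: per-group mean
# (df.boxplot(column="spend", by="plan") would draw the grouped box plot).
mean_spend = df.groupby("plan")["spend"].mean()

# Two continuous variables: Pearson correlation coefficient
# (df.plot.scatter(x="age", y="spend") would draw the scatter plot).
r = df["age"].corr(df["spend"])

print(counts, share, mean_spend, round(r, 3), sep="\n\n")
```

Normalizing the cross-tabulation by row makes the categories comparable even when the groups differ in size, which is the point of the standardized stacked chart.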

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
os.chdir(r'f:\download\中文名')
# The path contains Chinese characters, so the file cannot be read directly; do it in two steps (chdir, then read_csv)
trad_flow=pd.read_csv("rfm_trad_flow.csv",encoding='gbk')
trad_flow.head(10)
f=trad_flow.groupby(['cumid','type'])[['transid']].count()
f.head()
#%% Pivot table: two categorical variables, one continuous variable
f_trans=pd.pivot_table(f,index='cumid',columns='type',values='transid')
f_trans.head()
#%% Missing-value handling
f_trans['special_offer']=f_trans['special_offer'].fillna(0)
f_trans["interest"]=f_trans['special_offer']/(f_trans['special_offer']+f_trans['normal'])
f_trans.head()
#%%
m=trad_flow.groupby(['cumid','type'])[['amount']].sum()
m.head()
#%%
m_trans=pd.pivot_table(m,index='cumid',columns='type',values='amount')
m_trans.head()
#%%
m_trans['special_offer']=m_trans['special_offer'].fillna(0)
m_trans['returned_goods']=m_trans['returned_goods'].fillna(0)
m_trans["value"]=m_trans['normal']+m_trans['special_offer']+m_trans['returned_goods']
m_trans.head()
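#%% Standardize time
from datetime import datetime
import time
def to_time(t):
    # Parse a timestamp such as "14jun09:17:58:34" into seconds since the epoch (local time)
    out_t=time.mktime(time.strptime(t,'%d%b%y:%H:%M:%S'))
    return out_t
a="14jun09:17:58:34"
print(to_time(a))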
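#%%
# Assumption: the source never creates 'time_new'; this line supposes the raw
# timestamp column is named 'time' and converts it with to_time
trad_flow['time_new']=trad_flow['time'].apply(to_time)
trad_flow.head()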
r=trad_flow.groupby(['cumid'])[['time_new']].max()
r.head()
#%% sklearn preprocessing: equal-frequency (equal-depth) binning
from sklearn import preprocessing
# qcut returns the bin each value falls into; retbins=True additionally returns the bin edges
threshold=pd.qcut(f_trans['interest'],2,retbins=True)[1][1]
#%% Binarize: 1 if above the threshold, otherwise 0
binarizer=preprocessing.Binarizer(threshold=threshold)
interest_q=pd.DataFrame(binarizer.transform(f_trans['interest'].values.reshape(-1,1)))
interest_q.index=f_trans.index
interest_q.columns=["interest"]
interest_q
#%%
threshold=pd.qcut(m_trans['value'],2,retbins=True)[1][1]
binarizer = preprocessing.Binarizer(threshold=threshold)
value_q = pd.DataFrame(binarizer.transform(m_trans['value'].values.reshape(-1,1)))
value_q.index=m_trans.index
value_q.columns=["value"]
value_q
#%%
threshold = pd.qcut(r["time_new"], 2, retbins=True)[1][1]
binarizer = preprocessing.Binarizer(threshold=threshold)
time_new_q = pd.DataFrame(binarizer.transform(r["time_new"].values.reshape(-1,1)))
time_new_q.index = r.index
time_new_q.columns = ["time"]
time_new_q
#%% Labels
analysis=pd.concat([interest_q,value_q,time_new_q],axis=1)
analysis.head()
label = 
#analysis = analysis[['interest','value','time']]
analysis.head()
#%%
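
The median split applied to interest, value and time_new above can be reproduced on a small example with pandas alone. This is only a sketch under stated assumptions: the Series `s` and its values are invented for illustration, and the comparison `s > threshold` stands in for sklearn's Binarizer, which maps values strictly greater than the threshold to 1 and all others to 0.

```python
import pandas as pd

# Toy values invented for illustration (a stand-in for a column such as 'interest').
s = pd.Series([0.1, 0.4, 0.2, 0.9, 0.6, 0.3], index=list("abcdef"))

# qcut with q=2 splits at the median; retbins=True also returns the bin edges
# as [min, median, max], so element 1 is the median.
_, edges = pd.qcut(s, 2, retbins=True)
threshold = edges[1]

# Same rule as Binarizer: values strictly above the threshold become 1, the rest 0.
flags = (s > threshold).astype(int)

print(threshold)
print(flags.tolist())
```

Because the comparison keeps the original index, no separate step is needed to restore it, unlike the Binarizer route, which returns a bare NumPy array.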
