Deduplicating data and filling missing values (Lagrange interpolation)

Is this the moment to shout "I, Hu Hansan, am back!!!"?

In this post I get to scratch the surface of data cleaning......

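1. First things first

This is a dataset I found online, a heart disease dataset (if your English is shaky, quietly open a translator). I quietly tampered with it to turn it into "dirty data".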

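2. Deduplication

(1) Load the text file into Kettle and convert it to an Excel file.

(2) Run the deduplication step. You can see that 4 duplicate rows are removed, and the result is written to the output.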
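I did this step in Kettle, but the same idea can be sketched in pandas. This is only an illustration and not the Kettle transformation itself: the column names and rows below are made up, and it assumes that "duplicate" means a row identical in every column.

```python
import pandas as pd

# Hypothetical toy rows standing in for the heart disease data.
df = pd.DataFrame({
    "age": [63, 67, 67, 37, 63],
    "chol": [233, 286, 286, 250, 233],
})

# Keep the first occurrence of every fully identical row.
deduped = df.drop_duplicates().reset_index(drop=True)
removed = len(df) - len(deduped)
print(f"removed {removed} duplicate rows")
```

Passing subset=[...] to drop_duplicates would compare only selected columns, which is worth considering when some columns are row ids.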

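3. Filling missing values with Lagrange interpolation (for a while I read the name as "Langrage" (๑°ㅁ°๑)‼)

(1) Without further ado, here is the code: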

# coding=utf-8
import pandas as pd
from scipy.interpolate import lagrange  # Lagrange interpolation function

inputfile = 'd:/heart0.xls'
outputfile = 'd:/heart01.xls'  # input and output paths (newer pandas releases may require .xlsx here)


def fillnan(input, output, k=5):  # use k=5 neighbours
    data = pd.read_excel(input, header=None)  # read the data
    title = data[:1]  # first row: the column names
    data = data[1:].copy()  # drop the column names, keep only the values (row labels 1..n)
    n = len(data)  # this sketch assumes n is larger than 2*k + 1
    for i in range(len(data.columns)):
        for j in range(1, n + 1):  # visit every cell
            if j <= k:  # at most k rows from the first row: take every earlier row
                y = data[i][list(range(1, j)) + list(range(j + 1, j + 1 + k))]
            elif j < n - k:  # more than k rows from both the first and the last row
                y = data[i][list(range(j - k, j)) + list(range(j + 1, j + 1 + k))]
            else:  # at most k rows from the last row
                y = data[i][list(range(n - 1 - 2 * k, j)) + list(range(j + 1, n + 1))]
            y = y[y.notnull()]  # drop missing neighbours
            if (data[i].isnull())[j]:  # the cell itself is missing
                # build the interpolating polynomial, evaluate it at j, round the result
                data.loc[j, i] = round(lagrange(y.index, list(y))(j))
    data = pd.concat([title, data])  # put the column names back on top
    data.to_excel(output, header=False, index=False)  # write the result to file


fillnan(inputfile, outputfile)

(2) The missing values have been filled in.
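To see what a single fill does, here is a minimal sketch on a made-up column. The data is hypothetical: y = x**2 sampled at row labels 1 to 7 with the value at label 4 removed, so the interpolating polynomial should recover a value very close to 16.

```python
import pandas as pd
from scipy.interpolate import lagrange

# Hypothetical column: y = x**2 at labels 1..7, with the value at label 4 missing.
s = pd.Series([1, 4, 9, None, 25, 36, 49], index=range(1, 8))

j = 4  # label of the missing row
known = s.drop(j).dropna()  # the neighbours that still have values

# Build the interpolating polynomial through the known points and evaluate it at j.
poly = lagrange(known.index.to_numpy(), known.to_numpy())
filled = round(float(poly(j)))
print(filled)
```

Lagrange polynomials through many points can oscillate badly, which is one reason to use only a small window of neighbours (k=5 in my code) instead of the whole column.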

4. Error analysis

(1) KeyError: 0L

Traceback (most recent call last):
  File "c:/users/dell/desktop/pycharmproject/python/cnki.py", line 54, in <module>
    fillnan(inputfile, outputfile)
  File "c:/users/dell/desktop/pycharmproject/python/cnki.py", line 38, in fillnan
    if (data[i].isnull())[j]: # the cell is missing
  File "c:\users\dell\anaconda\lib\site-packages\pandas\core\series.py", line 521, in __getitem__
    result = self.index.get_value(self, key)
  File "c:\users\dell\anaconda\lib\site-packages\pandas\core\index.py", line 1595, in get_value
    return self._engine.get_value(s, k)
  File "pandas\index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas\index.c:3113)
  File "pandas\index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas\index.c:2844)
  File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)
  File "pandas\hashtable.pyx", line 375, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7224)
  File "pandas\hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7162)
KeyError: 0L

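Analysis: At first I read the message as "OL" (the letter O), so I could not find the error for a long time. It is actually 0L, the integer 0 (the L suffix is how Python 2 prints a long). Because the header row was sliced off, the DataFrame's row labels run from 1 to the last row, so label 0 no longer exists and looking it up raises this error. In other words, I was accessing a value outside the valid range; staying inside the range fixes it.

Fix: I set the loop range to for j in range(1, len(data)+1):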
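The error is easy to reproduce on a tiny frame. The sketch below uses made-up values and assumes, as in my code, that the sheet was read without a header so that row 0 holds the column names.

```python
import pandas as pd

# Hypothetical sheet read with header=None: row 0 holds the column names.
raw = pd.DataFrame([["age", "chol"], [63, 233], [67, 286]])
data = raw[1:]  # drop the header row; row label 0 goes with it

print(list(data.index))  # the remaining labels start at 1

try:
    data[0][0]  # column 0, row label 0: this label no longer exists
    hit_key_error = False
except KeyError:
    hit_key_error = True

first_value = data[0][1]  # the first data row lives at label 1
print(hit_key_error, first_value)
```

An alternative to shifting the loop range would be data.reset_index(drop=True), which renumbers the rows from 0; I kept the labels as they were and changed the loop instead.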

(2) TypeError: 'Int64Index' object is not callable

Traceback (most recent call last):
  File "c:/users/dell/desktop/pycharmproject/python/cnki.py", line 25, in <module>
    data[i][j] = ployinterp_column(data[i], j)
  File "c:/users/dell/desktop/pycharmproject/python/cnki.py", line 20, in ployinterp_column
    return lagrange(y.index(), list(y))(n)
TypeError: 'Int64Index' object is not callable

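Analysis: I first followed the code from a book, which defines a helper function and then fills in the values; many examples online look much the same: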

def ployinterp_column(s, n, k=5):
    y = s[list(range(n-k, n)) + list(range(n+1, n+1+k))]
    y = y[y.notnull()]
    return lagrange(y.index(), list(y))(n)

for i in range(len(data.columns)):
    for j in range(1, len(data)+1):
        if (data[i].isnull())[j]:
            data[i][j] = ployinterp_column(data[i], j)

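At first I had no idea why; I figured it simply ran on other people's data and not on mine (ಥ _ ಥ). The traceback does point at the cause, though: index is an attribute of a Series, not a method, so y.index() tries to call the Int64Index object itself.

Fix: I inlined the function into the loop, replacing s with data[i] and n with j, and passed y.index (without parentheses) to lagrange.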
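For anyone who would rather keep the helper function, here is a minimal sketch of a version that should run. It is my own variation and not the book's code: it uses y.index without parentheses and, as an extra assumption, skips neighbour labels that do not exist in the column. The sample Series is made up (y = 2x with one gap).

```python
import pandas as pd
from scipy.interpolate import lagrange


def ployinterp_column(s, n, k=5):
    """Interpolate the value at label n from up to k known neighbours on each side."""
    wanted = list(range(n - k, n)) + list(range(n + 1, n + 1 + k))
    y = s[s.index.isin(wanted)]  # ignore labels that fall outside the column
    y = y[y.notnull()]           # drop missing neighbours
    # y.index is an attribute, so there are no parentheses after it
    return lagrange(y.index.to_numpy(), list(y))(n)


# Hypothetical column: y = 2x at labels 1..5, with the value at label 3 missing.
s = pd.Series([2.0, 4.0, None, 8.0, 10.0], index=range(1, 6))

print(callable(s.index))  # an index object cannot be called like a function
value = round(float(ployinterp_column(s, 3)))
print(value)
```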

5. A link to my scrappy post on filling with the mean......
