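A dataset may be stored in ARFF format (the format used by Weka). Most machine learning work relies on numpy, pandas and sklearn, none of which can read such a file directly, so scipy.io.arff.loadarff is needed as a bridge.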
from scipy.io import arff
import pandas as pd

file_name = '/users/schillerxu/documents/sourcecode/python/pandas/cm1.arff'

# loadarff returns the records (a NumPy structured array) and the metadata
data, meta = arff.loadarff(file_name)
# print(data)
print(meta)
df = pd.DataFrame(data)
print(df.head())
# print(df)

# Save as a CSV file
# out_file = '/users/schillerxu/documents/sourcecode/python/pandas/cm1.csv'
# output = pd.DataFrame(df)
# output.to_csv(out_file, index=False)
Running the program produces the following output:
[running] python -u "/users/schillerxu/documents/sourcecode/python/pandas/arff_to_csv.py"
dataset: cm1
loc_blank's type is numeric
branch_count's type is numeric
call_pairs's type is numeric
loc_code_and_comment's type is numeric
loc_comments's type is numeric
condition_count's type is numeric
cyclomatic_complexity's type is numeric
cyclomatic_density's type is numeric
decision_count's type is numeric
decision_density's type is numeric
design_complexity's type is numeric
design_density's type is numeric
edge_count's type is numeric
essential_complexity's type is numeric
essential_density's type is numeric
loc_executable's type is numeric
parameter_count's type is numeric
halstead_content's type is numeric
halstead_difficulty's type is numeric
halstead_effort's type is numeric
halstead_error_est's type is numeric
halstead_length's type is numeric
halstead_level's type is numeric
halstead_prog_time's type is numeric
halstead_volume's type is numeric
maintenance_severity's type is numeric
modified_condition_count's type is numeric
multiple_condition_count's type is numeric
node_count's type is numeric
normalized_cylomatic_complexity's type is numeric
num_operands's type is numeric
num_operators's type is numeric
num_unique_operands's type is numeric
num_unique_operators's type is numeric
number_of_lines's type is numeric
percent_comments's type is numeric
loc_total's type is numeric
defective's type is nominal, range is ('y', 'n')

   loc_blank  branch_count  call_pairs  ...  percent_comments  loc_total  defective
0        6.0           9.0         2.0  ...              4.00       25.0       b'n'
1       15.0           7.0         3.0  ...             39.22       32.0       b'y'
2       27.0           9.0         1.0  ...             47.27       33.0       b'y'
3        7.0           3.0         2.0  ...              0.00       12.0       b'n'
4       51.0          25.0        13.0  ...             11.67      106.0       b'n'

[5 rows x 38 columns]
[done] exited with code=0 in 0.664 seconds
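As the output shows, meta holds the dataset's basic information: the dataset name plus the name and type of every attribute.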
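The output also shows the nominal column coming back as byte strings (b'n', b'y'), which is usually not what you want in a CSV. Below is a minimal sketch of one way to decode those columns before exporting. It assumes the nominal values are UTF-8 text. The helper name load_arff_as_dataframe and the two-row in-memory ARFF text are hypothetical stand-ins for illustration, not the real cm1 data.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical two-row ARFF document standing in for a real file such as cm1.arff.
ARFF_TEXT = """@relation demo
@attribute loc_total numeric
@attribute defective {Y,N}
@data
25,N
32,Y
"""


def load_arff_as_dataframe(source):
    """Load an ARFF path or file-like object and decode nominal columns to str."""
    data, meta = arff.loadarff(source)
    df = pd.DataFrame(data)
    # meta.names() and meta.types() list the attributes in file order.
    for name, kind in zip(meta.names(), meta.types()):
        if kind == "nominal":
            # Assumption: nominal values are UTF-8 encoded byte strings.
            df[name] = df[name].str.decode("utf-8")
    return df, meta


df, meta = load_arff_as_dataframe(io.StringIO(ARFF_TEXT))
print(meta.names())
print(df)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To convert a real file, pass its path in place of the StringIO object and give to_csv an output path, as in the commented-out lines of the original script.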