How to Process Large Data Sets Quickly with JDBC

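During my internship I had to process a table holding 2.04 million records. The records were collected from the Internet, so the table contains some less-than-ideal terms: terms with special characters or punctuation mixed in, and some that are nothing but punctuation. The purpose of my program is to find these terms, fix the ones that can be fixed, and simply delete the ones not worth fixing.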

for(int i=0;i<205;i++)

In my tests, the first SQL statement ran noticeably slower than the second.

String best="select * from cat_keyword where id>=(select id from cat_keyword order by id limit "+i*10001+",1)limit 10000";

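This SQL statement is especially well suited to very large offsets. With more than 2 million records, the offset becomes very large in the middle and later batches, and the unoptimized SQL statement gets slower and slower as it goes.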
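A minimal sketch of how the batched query above might be driven from JDBC, assuming a MySQL database and an already open `Connection`. The class name `BatchedReader`, the helper `offsetFor`, and the `RowHandler` interface are hypothetical; only the table `cat_keyword`, its `id` column, and the overall shape of the query come from the text. In this sketch the offset and batch size are bound as parameters, the offset is computed as batch index times batch size so that consecutive batches are contiguous, and an outer `order by id` is added so that batch boundaries are deterministic.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not the author's code: reads cat_keyword in batches.
final class BatchedReader {

    // Same shape as the query in the text: the subquery finds the id at the
    // requested offset, and the outer query reads one batch starting there.
    static final String BATCH_SQL =
        "select * from cat_keyword where id >= "
        + "(select id from cat_keyword order by id limit ?, 1) "
        + "order by id limit ?";

    // Hypothetical callback invoked once per row.
    interface RowHandler {
        void handle(ResultSet rs) throws SQLException;
    }

    // Row offset of the first record in batch number batchIndex (0-based).
    static long offsetFor(int batchIndex, int batchSize) {
        return (long) batchIndex * batchSize;
    }

    // Reads batch after batch until a batch comes back empty.
    static long readAll(Connection conn, int batchSize, RowHandler handler)
            throws SQLException {
        long rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(BATCH_SQL)) {
            for (int i = 0; ; i++) {
                ps.setLong(1, offsetFor(i, batchSize));
                ps.setInt(2, batchSize);
                long before = rows;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        handler.handle(rs);
                        rows++;
                    }
                }
                if (rows == before) {
                    break; // empty batch: the whole table has been read
                }
            }
        }
        return rows;
    }
}
```

Stopping on the first empty batch avoids hard-coding the number of iterations, so the loop does not need to know the row count in advance.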

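The LIMIT value also deserves attention. I tried 1000, 10000, 70000 and 100000, and 10000 turned out to be the fastest. The best value depends on the size of the data, the machine, and how much memory the database is allocated, so adjust it to your own situation. I only mention it here as a point worth watching and tuning.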

What I really want to talk about is another solution I came up with, using JDBC.

Here is how my program is structured. First, a method that inserts the terms that need to be deleted or fixed into another table:

//Insert the offending terms that were found into the cat_garbage table
public void insert(ResultSet rs) throws SQLException
}catch(SQLException|ClassNotFoundException|IOException e)
finally
}

Next, a method that repairs the terms that can be repaired:

public void modify(ResultSet rs,String str,String reg) throws SQLException
else if(str.indexOf(reg)==0)
else
System.out.println(ok);
new good().insert(rs);
//String query="select * from cat_keyword where cid='"+rs.getInt("cid")+"'and keyword='"+ok+"'";
String sql="update cat_garbage1 set new='"+ok+"' where id='"+rs.getInt(1)+"'";
//String sql1="update cat_keyword set keyword='"+ok+"' where id='"+rs.getInt("id")+"'";
stmt.executeUpdate(sql);
}catch(SQLException|ClassNotFoundException|IOException e)
finally
}
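The body of modify() is not shown in full, so the following is only a guess at its shape, based on the `indexOf(reg)==0` test and the update statement above. The class `TermRepair` and both method names are hypothetical. `strip` removes the unwanted symbol from whichever end of the term it sits on, and `saveRepaired` writes the result with bound parameters instead of string concatenation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch, not the author's code.
final class TermRepair {

    // Removes one leading or trailing occurrence of symbol from term.
    // Returns the term unchanged when the symbol is at neither end.
    static String strip(String term, String symbol) {
        if (symbol.isEmpty()) {
            return term;
        }
        if (term.startsWith(symbol)) {
            return term.substring(symbol.length());
        }
        if (term.endsWith(symbol)) {
            return term.substring(0, term.length() - symbol.length());
        }
        return term;
    }

    // Writes the repaired term using bound parameters; the table and column
    // names are the ones in the update statement in the text.
    static int saveRepaired(Connection conn, int id, String repaired)
            throws SQLException {
        String sql = "update cat_garbage1 set new = ? where id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, repaired);
            ps.setInt(2, id);
            return ps.executeUpdate();
        }
    }
}
```

Binding the values matters for this data in particular: a scraped term that contains a single quote would break a concatenated statement, while a bound parameter is passed through unchanged.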

Finally, the core method that filters the terms; it contains many matching rules implemented as regular expressions.

public void filt(ResultSet rs)
//string ends with a comma
if(rule(str,"^\\S+[,]$").matches())
//string ends with a period
if(rule(str,"^\\S+[.]$").matches())
//string starts with a period
if(rule(str,"^[.]\\S+$").matches())}}
}}		//string ends with -
if(rule(str,"^\\S+[-]$").matches())
//string starts with -
if(rule(str,"^[-]\\S+$").matches())
//string starts with a colon
if(rule(str,"^[:]\\S+$").matches())
//string ends with a colon
if(rule(str,"^\\S+[:]$").matches())
//string ends with /
if(rule(str,"^\\S+[/]$").matches())
//string starts with /
if(rule(str,"^[/]\\S+$").matches())
//string ends with ?
if(rule(str,"^\\S+[?]$").matches())
//string starts with ?
if(rule(str,"^[?]\\S+$").matches())
//string starts with |
if(rule(str,"^[|]\\S+$").matches())
//string ends with |
if(rule(str,"^\\S+[|]$").matches())
//string starts with (
if(rule(str,"^[(][\\s\\S]+$").matches())
else
}//string starts with [
if(rule(str,"^[\\[][\\s\\S]+$").matches())
else
}//string ends with ) or contains ()
if(rule(str,"^\\S+\\s*\\S+[)]$").matches())
else if(rule(str,"^\\S+\\s*[(]\\S+[)]$").matches())
else
}		//string ends with ] or contains []
if(rule(str,"^\\S+\\s*\\S+[\\]]$").matches())
else if(rule(str,"^\\S+\\s*[\\[]\\S+[\\]]$").matches())
else
}//string contains book-title marks 《》
if(rule(str,"^[\\s\\S]*[《][\\s\\S]+[》]").matches())
//string contains full-width parentheses （）
if(rule(str,"^[\\s\\S]*[（][\\s\\S]+[）]").matches())
//digits only
if(rule(str,"^[1-9]*|0$").matches())
//single letter
if(rule(str,"^[a-zA-Z]$").matches())
*/String sql="select * from cat_keyword";
rs=stmt.executeQuery(sql);
go.filt(rs);
long b=System.currentTimeMillis();
System.out.println("Elapsed: "+(b-a)/60000+" minutes");
}catch(ClassNotFoundException|SQLException|IOException e)
finally
}
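The filtering method above calls a `rule` helper that is not shown. Since `.matches()` is called on its result, it presumably returns a `java.util.regex.Matcher`. The sketch below is one way such a helper could look; the class name `Rules` and the pattern cache are assumptions, added so that each expression is compiled once rather than once per row across 2.04 million rows.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the rule(...) helper, which the text does not show.
final class Rules {

    // Compiled patterns keyed by their regex source, so that each rule is
    // compiled on first use and reused for every later row.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns a matcher for str against regex; callers then use matches().
    static Matcher rule(String str, String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile).matcher(str);
    }
}
```

With this helper, `Rules.rule("hello,", "^\\S+[,]$").matches()` is true and the same rule applied to `"hello"` is false, which is how the trailing-comma check is meant to behave.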

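A single SQL statement hands the whole table to the filt() method as one result set, and the ResultSet cursor rs then steps down through the rows, processing one term at a time. Run this way, the full table of 2.04 million records takes only 11 minutes, whereas passing the result sets of the optimized LIMIT statement to filt() takes 26 minutes. So using JDBC with a single result set covering all the data is much faster than loading the data in repeated LIMIT batches: more than twice as fast, 11 minutes against 26.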
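The single-pass approach can be sketched as follows. This is not the author's code: the class `SinglePass`, the `RowHandler` interface, and the helper `elapsedMinutes` are hypothetical, and the `setFetchSize(Integer.MIN_VALUE)` call is an assumption that applies to MySQL Connector/J only, where it asks the driver to stream rows one at a time instead of buffering the whole result set in memory. With another driver that line should be removed or replaced with an ordinary positive fetch size.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch, not the author's code.
final class SinglePass {

    // Hypothetical callback invoked once per row.
    interface RowHandler {
        void handle(ResultSet rs) throws SQLException;
    }

    // Whole minutes between two System.currentTimeMillis() readings,
    // the same (b-a)/60000 arithmetic as in the text.
    static long elapsedMinutes(long startMillis, long endMillis) {
        return (endMillis - startMillis) / 60000;
    }

    // Runs one query over the whole table and walks the cursor once.
    static long process(Connection conn, RowHandler handler) throws SQLException {
        long rows = 0;
        try (Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Assumption: MySQL Connector/J, where this fetch size asks the
            // driver to stream rows instead of buffering the full result.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("select * from cat_keyword")) {
                while (rs.next()) {
                    handler.handle(rs);
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

One caveat if streaming is used: while a streaming result set is open, Connector/J does not accept other statements on the same connection, so the inserts and updates issued during filtering would need a separate connection.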
