Computing AUC with SQL

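When building machine learning applications, we often need to show a model's ROC curve and its AUC value. We could of course write a function in application code or call an existing package to compute them, but when the data volume is large, transferring the data over the network can hurt system performance. In that situation it is worth computing the metrics directly in a SQL statement, so the data never has to be sent back to the client, which improves both efficiency and stability.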

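Computing AUC needs two inputs: the model's output score and the true label of each sample. We assume the database has a table that stores these two fields and run the computation against that table. To that end, we first create the table (all examples in this post use PostgreSQL syntax):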

create table score_label (
    score int,
    label boolean
);

insert into score_label (score, label) values (9, true);
insert into score_label (score, label) values (8, false);
insert into score_label (score, label) values (7, true);
insert into score_label (score, label) values (6, false);
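
To make the sample data concrete, here is a minimal sketch in Python (not part of the original post) that treats AUC as the fraction of positive/negative pairs in which the positive sample gets the higher score, a common equivalent definition. The helper name `pairwise_auc` and the `rows` list are illustrative assumptions; `rows` simply repeats the four rows inserted above.

```python
# Hypothetical helper for illustration only: AUC as the share of
# (positive, negative) pairs where the positive sample scores higher.
# A tie between a positive and a negative counts as half a pair.
def pairwise_auc(rows):
    positives = [score for score, label in rows if label]
    negatives = [score for score, label in rows if not label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# The same four (score, label) rows that were inserted into score_label.
rows = [(9, True), (8, False), (7, True), (6, False)]
print(pairwise_auc(rows))  # 0.75
```

Of the four pairs, three rank the positive sample first (9 over 8, 9 over 6, 7 over 6) and one does not (7 under 8), hence 3/4. This quadratic loop is only a check for tiny inputs; it is not meant as a replacement for the SQL approach.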

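This is about the simplest possible example: four samples in total, with an AUC of 0.75. Readers who are not yet fully clear on how AUC is computed can read the article I wrote earlier on the subject. Following the logic in that article, the computation can be written as shown below: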

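-- Build the ROC points and the AUC step by step with CTEs.
with roc as (
    with
    -- r1: per score, count positive (t) and negative (f) samples
    r1 as (
        select score,
               count(1) filter (where label is true)  as t,
               count(1) filter (where label is false) as f
        from score_label
        group by score
        order by score desc
    ),
    -- r2: running totals of t and f from the highest score downward
    r2 as (
        select score, t, f,
               sum(t) over (order by score desc) as tsum,
               sum(f) over (order by score desc) as fsum
        from r1
    ),
    -- r3: normalise into width (FPR step), y (TPR) and x (FPR),
    --     guarding against division by zero, and add the origin (0, 0)
    r3 as (
        select case when (select sum(f) from r2) = 0 then 0
                    else f / (select sum(f) from r2) end as width,
               case when (select sum(t) from r2) = 0 then 0
                    else tsum / (select sum(t) from r2) end as y,
               case when (select sum(f) from r2) = 0 then 0
                    else fsum / (select sum(f) from r2) end as x
        from r2
        union
        select 0, 0, 0
    ),
    r4 as (
        select * from r3 order by x
    ),
    -- r5: trapezoid area between each point and the previous one
    r5 as (
        select cast(x as numeric(18, 3)) x,
               cast(y as numeric(18, 3)) y,
               (y + lag(y, 1, 0.0) over (order by x, y)) * width / 2 as area
        from r4
    )
    -- collect the ROC points as arrays and sum the areas into the AUC
    select array_agg(x) x,
           array_agg(y) y,
           cast(sum(area) as numeric(18, 2)) as auc
    from r5
)
select * from roc;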

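The whole query looks bulky at first glance, but it simply uses the with syntax to unfold the computation step by step: first group and sort by score, then compute the TPR and FPR for each point, and finally integrate y in the calculus sense, here by summing trapezoid areas. This gives us not only the AUC but also the points on the ROC curve, returned as arrays. Running the SQL returns a single row with three columns: x and y, the arrays of ROC points, and auc, which is 0.75 for the sample data.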
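
To trace these steps outside the database, the sketch below mirrors them in plain Python: per-score counts, running sums, normalisation into FPR/TPR, and trapezoid integration. It is an illustration under stated assumptions rather than the post's own code; the function name `roc_and_auc` and the in-memory `rows` list (a copy of the four sample rows) are hypothetical.

```python
from collections import defaultdict


def roc_and_auc(rows):
    """Mirror the CTE steps: group by score, accumulate, normalise, integrate."""
    # Like r1: per-score counts of positives (index 0) and negatives (index 1).
    counts = defaultdict(lambda: [0, 0])
    for score, label in rows:
        counts[score][0 if label else 1] += 1
    total_t = sum(t for t, _ in counts.values())
    total_f = sum(f for _, f in counts.values())

    # Like r2/r3: walk scores from high to low, keep running sums, and turn
    # them into FPR (x) and TPR (y). The curve starts at the origin (0, 0).
    xs, ys, auc = [0.0], [0.0], 0.0
    tsum = fsum = 0
    for score in sorted(counts, reverse=True):
        t, f = counts[score]
        tsum += t
        fsum += f
        x = fsum / total_f if total_f else 0.0
        y = tsum / total_t if total_t else 0.0
        # Like r5: trapezoid between the previous point and this one;
        # x - xs[-1] plays the role of the SQL "width" column.
        auc += (y + ys[-1]) * (x - xs[-1]) / 2
        xs.append(x)
        ys.append(y)
    return xs, ys, auc


# Assumed copy of the four sample rows from score_label.
rows = [(9, True), (8, False), (7, True), (6, False)]
xs, ys, auc = roc_and_auc(rows)
print(xs)   # [0.0, 0.0, 0.5, 0.5, 1.0]
print(ys)   # [0.0, 0.5, 0.5, 1.0, 1.0]
print(auc)  # 0.75
```

Only the two segments where x advances contribute area (0.25 and 0.5), which adds up to the 0.75 reported by the query.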

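With this SQL in hand, we can define it as a stored procedure in the database or run it directly from a client. Either way the AUC is computed inside the database, and the rows of the underlying table never have to be returned to the client machine, which improves system performance.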

2018-12-02
