Clustering Algorithm: DBSCAN

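DBSCAN is a simple, density-based clustering algorithm. This implementation uses the center-based approach, in which the density of a data point is measured by the number of other data points that fall inside a square cell (the neighborhood) centered on that point with side length 2*eps. Based on this density, the data points fall into three categories:

(1) Core point: the point's density within its neighborhood reaches the given threshold minpts.

(2) Border point: the point is not a core point, but its neighborhood contains at least one core point.

(3) Noise point: the point is neither a core point nor a border point.

With the points classified this way, clustering proceeds as follows: each core point is placed in the same cluster as all the core points in its neighborhood, and each border point is placed in the same cluster as one of the core points in its neighborhood.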
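To make the three categories concrete, here is a minimal sketch of the classification step on a handful of hand-picked points. The function name `classify_points` and the sample data are made up for illustration; the neighborhood is the same square cell of side length 2*eps described above, and a point does not count itself as a neighbor.

```python
def classify_points(points, eps, minpts):
    """Label each point 'core', 'border' or 'noise' using a square neighborhood."""
    n = len(points)
    # neighbors[i] lists the other points inside the 2*eps square around point i
    neighbors = [
        [j for j in range(n)
         if j != i
         and abs(points[i][0] - points[j][0]) <= eps
         and abs(points[i][1] - points[j][1]) <= eps]
        for i in range(n)
    ]
    # Core points have at least minpts neighbors
    core = {i for i in range(n) if len(neighbors[i]) >= minpts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")   # not core, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical sample: a tight group of four, one point on its edge, one far away
sample = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 1), (10, 10)]
print(classify_points(sample, eps=2, minpts=3))
```

With eps=2 and minpts=3, the four points of the tight group come out as core points, (3, 1) is a border point because it has only two neighbors but both are core points, and (10, 10) is noise.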

Following this algorithm, the implementation is as follows:

# coding=utf-8
import pylab as pl
from collections import defaultdict, Counter

# Each line of the "points" file is expected to hold one point as "x#y"
with open("points", "r") as f:
    points = [[int(eachpoint.split("#")[0]), int(eachpoint.split("#")[1])] for eachpoint in f]

# Find the neighbors of every data point; the neighborhood is the square cell
# centered on the point with side length 2*eps
eps = 10
surroundpoints = defaultdict(list)
for idx1, point1 in enumerate(points):
    for idx2, point2 in enumerate(points):
        if idx1 < idx2:
            if abs(point1[0] - point2[0]) <= eps and abs(point1[1] - point2[1]) <= eps:
                surroundpoints[idx1].append(idx2)
                surroundpoints[idx2].append(idx1)

# A point with more than 4 neighbors (at least minpts) is a core point
minpts = 5
corepointidx = [pointidx for pointidx, surpointidxs in surroundpoints.items() if len(surpointidxs) >= minpts]

# A non-core point whose neighborhood contains a core point is a border point
borderpointidx = []
for pointidx, surpointidxs in surroundpoints.items():
    if pointidx not in corepointidx:
        for onesurpointidx in surpointidxs:
            if onesurpointidx in corepointidx:
                borderpointidx.append(pointidx)
                break

# A noise point is neither a border point nor a core point
noisepointidx = [pointidx for pointidx in range(len(points)) if pointidx not in corepointidx and pointidx not in borderpointidx]
corepoint = [points[pointidx] for pointidx in corepointidx]
borderpoint = [points[pointidx] for pointidx in borderpointidx]
noisepoint = [points[pointidx] for pointidx in noisepointidx]

# Optional: plot core (red), border (yellow) and noise (black) points
# pl.plot([eachpoint[0] for eachpoint in corepoint], [eachpoint[1] for eachpoint in corepoint], 'or')
# pl.plot([eachpoint[0] for eachpoint in borderpoint], [eachpoint[1] for eachpoint in borderpoint], 'oy')
# pl.plot([eachpoint[0] for eachpoint in noisepoint], [eachpoint[1] for eachpoint in noisepoint], 'ok')

# Every point starts in its own group
groups = [idx for idx in range(len(points))]

# Put each core point in the same cluster as all core points in its neighborhood
for pointidx, surroundidxs in surroundpoints.items():
    for onesurroundidx in surroundidxs:
        if pointidx in corepointidx and onesurroundidx in corepointidx and pointidx < onesurroundidx:
            # Relabel the neighbor's whole group with this point's group
            oldgroup = groups[onesurroundidx]
            for idx in range(len(groups)):
                if groups[idx] == oldgroup:
                    groups[idx] = groups[pointidx]

# Put each border point in the same cluster as one core point in its neighborhood
for pointidx, surroundidxs in surroundpoints.items():
    for onesurroundidx in surroundidxs:
        if pointidx in borderpointidx and onesurroundidx in corepointidx:
            groups[pointidx] = groups[onesurroundidx]
            break

# Keep the 3 largest clusters
wantgroupnum = 3
finalgroup = [onecount[0] for onecount in Counter(groups).most_common(wantgroupnum)]

# Plot the largest clusters in red, yellow and green
for groupid, style in zip(finalgroup, ['or', 'oy', 'og']):
    group = [points[idx] for idx in range(len(points)) if groups[idx] == groupid]
    pl.plot([eachpoint[0] for eachpoint in group], [eachpoint[1] for eachpoint in group], style)

# Plot the noise points in black
pl.plot([eachpoint[0] for eachpoint in noisepoint], [eachpoint[1] for eachpoint in noisepoint], 'ok')
pl.show()
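The merging loop in the implementation above rescans the whole groups list for every pair of neighboring core points. As a possible alternative (a sketch, not the author's method), the same grouping can be done with a union-find structure. The function name `cluster_labels` and its arguments are hypothetical: `neighbors` plays the role of surroundpoints, and `core` is the set of core point indices.

```python
def cluster_labels(neighbors, core):
    """Group points into clusters with union-find.

    neighbors: dict mapping a point index to the list of its neighbor indices
    core: set of core point indices
    Returns a dict mapping each clustered point to its cluster id (the smallest
    core index in that cluster); noise points are left out.
    """
    parent = {i: i for i in core}

    def find(i):
        # Follow parent links to the root, shortening the path along the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Core points that are neighbors end up in the same set
    for i in core:
        for j in neighbors.get(i, []):
            if j in core:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    labels = {i: find(i) for i in core}

    # A border point joins the cluster of the first core point in its neighborhood
    for i, near in neighbors.items():
        if i not in core:
            for j in near:
                if j in core:
                    labels[i] = labels[j]
                    break
    return labels


# Hypothetical toy input: two chains, each with two core points and one border point
toy_neighbors = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3, 5], 5: [4]}
print(cluster_labels(toy_neighbors, core={0, 1, 3, 4}))
```

In the toy input, points 0, 1 and 2 land in cluster 0 and points 3, 4 and 5 land in cluster 3. The design choice is only about cost: each merge touches a few parent links rather than the full list of points.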

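Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of arbitrary shape and size. It runs into trouble, however, when cluster densities vary widely. Consider four clusters A, B, C and D, where A and B are much denser than C and D, and the noise around A and B is about as dense as clusters C and D. When minpts is large, clusters C and D cannot be identified: C, D and the noise around A and B are all treated as noise. When minpts is small, C and D are identified, but A and B are merged with their surrounding noise into a single cluster. This problem can be addressed with clustering based on shared nearest neighbors (SNN).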
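The trade-off can be reproduced on a small made-up data set. The sketch below is an illustration under stated assumptions, not a general-purpose implementation: the function `dbscan_square`, the helper `summarize`, and the one-dimensional sample data are all hypothetical, and the neighborhood is the same square cell of side length 2*eps used earlier. It places a dense run of points, a trail of sparser points next to it, and a separate sparse cluster with the same spacing as the trail.

```python
from collections import Counter


def dbscan_square(points, eps, minpts):
    """Return one label per point: a cluster id (int), or None for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n)
         if j != i
         and abs(points[i][0] - points[j][0]) <= eps
         and abs(points[i][1] - points[j][1]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= minpts}
    labels = [None] * n
    cluster_id = 0
    for start in sorted(core):
        if labels[start] is not None:
            continue
        # Flood-fill: expand only through core points, but label border points too
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    if j in core:
                        stack.append(j)
        cluster_id += 1
    return labels


def summarize(labels):
    """Return (cluster sizes, largest first) and the number of noise points."""
    sizes = Counter(label for label in labels if label is not None)
    return sorted(sizes.values(), reverse=True), labels.count(None)


dense = [(x, 0) for x in range(11)]                        # spacing 1
trail = [(x, 0) for x in (13, 16, 19, 22)]                 # spacing 3, next to the dense run
sparse = [(x, 0) for x in (100, 103, 106, 109, 112)]       # spacing 3, far away
data = dense + trail + sparse

for minpts in (4, 2):
    print(minpts, summarize(dbscan_square(data, eps=4, minpts=minpts)))
```

With eps=4 and minpts=4, the sparse cluster is lost: the result is one cluster of 12 points (the dense run plus the nearest trail point as a border point) and 8 noise points. With minpts=2, the sparse cluster of 5 is found, but the whole trail is absorbed into the dense cluster, which grows to 15 points.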
