NGram Approximate String Matching

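Introduction

An ngram is a subsequence of n units taken from a given sequence. The units can be words, characters, and so on. For example, the 2-grams (bigrams) of the phrase "hello world" are "he", "el", "ll", "lo", "o ", " w", "wo", "or", "rl" and "ld".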
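As a minimal sketch of that definition, the snippet below lists the character ngrams of a string in order, repeats included. ListNGrams is a hypothetical helper name used only for this illustration; it does no trimming, case folding, or padding.

```vbscript
' Hypothetical helper (illustration only): return every character
' ngram of strText, in order of appearance, repeats included.
Function ListNGrams(strText, intN)
    Dim i, arrOut, intCount
    intCount = Len(strText) - intN + 1
    ' No ngrams exist when n is below 1 or longer than the text.
    If intN < 1 Or intCount < 1 Then
        ListNGrams = Array()
        Exit Function
    End If
    ReDim arrOut(intCount - 1)
    For i = 1 To intCount
        arrOut(i - 1) = Mid(strText, i, intN)
    Next
    ListNGrams = arrOut
End Function

' Prints: he|el|ll|lo|o | w|wo|or|rl|ld
WScript.Echo Join(ListNGrams("hello world", 2), "|")
```

An 11-character string yields 10 bigrams; in general a string of length L yields L - n + 1 ngrams of size n.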

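Uses

Ngram models are commonly used to model natural language. They can also be used to predict the next item in a sequence: given an ngram model that records how often each ngram occurs, you can pick the most likely continuation.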
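A rough sketch of frequency-based prediction, assuming a character-level bigram model: count which characters follow the current one in some training text, then return the most frequent. PredictNext and the sample text are invented for this example, and Scripting.Dictionary is assumed to be available (it is under Windows Script Host).

```vbscript
' Hypothetical helper (illustration only): return the character that
' most often follows strCurrent in strTraining, or "" if none does.
Function PredictNext(strTraining, strCurrent)
    Dim dictCounts, i, strNext, strBest, intBest, varKey
    Set dictCounts = CreateObject("Scripting.Dictionary")
    ' Count how often each character follows strCurrent.
    For i = 1 To Len(strTraining) - 1
        If Mid(strTraining, i, 1) = strCurrent Then
            strNext = Mid(strTraining, i + 1, 1)
            If dictCounts.Exists(strNext) Then
                dictCounts(strNext) = dictCounts(strNext) + 1
            Else
                dictCounts.Add strNext, 1
            End If
        End If
    Next
    ' Pick the follower with the highest count.
    strBest = ""
    intBest = 0
    For Each varKey In dictCounts.Keys
        If dictCounts(varKey) > intBest Then
            intBest = dictCounts(varKey)
            strBest = varKey
        End If
    Next
    PredictNext = strBest
End Function

' "h" is followed by "e" three times and by "a" once, so this prints: e
WScript.Echo PredictNext("the theory then thaws", "h")
```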

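Ngrams are also used for approximate string matching, based on the observation that any two similar strings are likely to share many of the same ngrams. For the same reason they can be used to detect plagiarism. One benefit of ngram techniques is that ngrams can be indexed, because the ngrams of a string can be computed ahead of time. This contrasts with edit-distance algorithms such as Levenshtein, where both the string being compared and the string it is compared against are part of the input. Indexing ngrams can make searches faster, but it can also produce a very large index.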
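To make the indexing idea concrete, here is one possible sketch of an inverted index that maps each bigram to the ids of the strings containing it. AddToIndex, FindCandidates, and the sample strings are all assumptions made for this example, not part of the code in this article. A query only touches the index entries for its own bigrams, which is where the speed-up comes from; storing an entry per bigram per string is also why the index grows large.

```vbscript
' Hypothetical sketch: bigram -> dictionary of ids of strings containing it.
Sub AddToIndex(dictIndex, strId, strText)
    Dim i, strGram
    For i = 1 To Len(strText) - 1
        strGram = UCase(Mid(strText, i, 2))
        If Not dictIndex.Exists(strGram) Then
            dictIndex.Add strGram, CreateObject("Scripting.Dictionary")
        End If
        ' Record each id at most once per bigram.
        If Not dictIndex(strGram).Exists(strId) Then
            dictIndex(strGram).Add strId, True
        End If
    Next
End Sub

' Returns a dictionary: id -> number of distinct query bigrams it shares.
Function FindCandidates(dictIndex, strQuery)
    Dim dictHits, dictSeen, i, strGram, varId
    Set dictHits = CreateObject("Scripting.Dictionary")
    Set dictSeen = CreateObject("Scripting.Dictionary")
    For i = 1 To Len(strQuery) - 1
        strGram = UCase(Mid(strQuery, i, 2))
        ' Look each distinct query bigram up only once.
        If Not dictSeen.Exists(strGram) Then
            dictSeen.Add strGram, True
            If dictIndex.Exists(strGram) Then
                For Each varId In dictIndex(strGram).Keys
                    If dictHits.Exists(varId) Then
                        dictHits(varId) = dictHits(varId) + 1
                    Else
                        dictHits.Add varId, 1
                    End If
                Next
            End If
        End If
    Next
    Set FindCandidates = dictHits
End Function

Dim dictIndex, dictHits
Set dictIndex = CreateObject("Scripting.Dictionary")
AddToIndex dictIndex, "a", "hello world"
AddToIndex dictIndex, "b", "help wanted"
Set dictHits = FindCandidates(dictIndex, "hallo")
' "hallo" shares LL and LO with string "a" and nothing with "b".
WScript.Echo dictHits("a")   ' prints 2
```

The candidates returned this way could then be ranked with a full similarity score, so the expensive comparison runs only on strings that share at least one bigram with the query.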

Code

The code below is written for VBScript, but it should port directly to VBA.

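The createngram function takes a string and an integer as input. The integer defines the size n of the ngrams to extract. The string is trimmed, upper-cased, and padded with a Chr(0) character at each end before it is split. The function outputs a two-dimensional array with one row per distinct ngram: the first item in a row is the ngram and the second is the number of times it occurs.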

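The comparengram function takes two ngram arrays and outputs a double that represents how similar they are. The number returned is the Dice coefficient of the two arrays, which ranges from 0 (no ngrams in common) to 1 (identical ngram counts). The higher the coefficient, the more similar the two strings.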
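One way to write the quantity comparengram computes, assuming c_A(g) and c_B(g) denote how many times ngram g occurs in the first and second string (these symbols are introduced here only for illustration):

```latex
D(A, B) = \frac{2 \sum_{g} \min\bigl(c_A(g),\, c_B(g)\bigr)}
               {\sum_{g} c_A(g) + \sum_{g} c_B(g)}
```

As a worked example with n = 2, and writing ^ for the Chr(0) padding character: "night" becomes ^N, NI, IG, GH, HT, T^ and "nacht" becomes ^N, NA, AC, CH, HT, T^. Each string has 6 bigrams and they share 3 (^N, HT, T^), so D = 2 * 3 / (6 + 6) = 0.5.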

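Function createngram(ByVal strinput, intn)
    ' strinput is passed ByVal so the caller's variable is not overwritten
    ' when the string is normalised and padded below.
    Dim arrngram, intbound, i, j, strgram, didinc, arrtemp
    ' An empty input returns Empty rather than an array.
    If Len(strinput) = 0 Then Exit Function
    ' Upper bound on the number of distinct ngrams.
    ReDim arrngram(Len(strinput) + 1, 1)
    ' Trim, upper-case, and pad both ends with Chr(0) so that the first
    ' and last characters also form ngrams with the string boundary.
    strinput = Chr(0) & UCase(Trim(strinput)) & Chr(0)
    intbound = -1
    For i = 1 To Len(strinput) - intn + 1
        strgram = Mid(strinput, i, intn)
        didinc = False
        ' Linear search for an ngram that has already been seen.
        For j = 0 To intbound
            If strgram = arrngram(j, 0) Then
                arrngram(j, 1) = arrngram(j, 1) + 1
                didinc = True
                Exit For
            End If
        Next
        ' First occurrence: append a new row with a count of 1.
        If Not didinc Then
            intbound = intbound + 1
            arrngram(intbound, 0) = strgram
            arrngram(intbound, 1) = 1
        End If
    Next
    ' Copy the used rows into an array of exactly the right size.
    ReDim arrtemp(intbound, 1)
    For i = 0 To intbound
        arrtemp(i, 0) = arrngram(i, 0)
        arrtemp(i, 1) = arrngram(i, 1)
    Next
    createngram = arrtemp
End Function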
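
Function comparengram(arr1, arr2)
    Dim i, j, intmatches, intcount1, intcount2
    ' createngram returns Empty for an empty string; treat that as no similarity.
    If Not IsArray(arr1) Or Not IsArray(arr2) Then
        comparengram = 0
        Exit Function
    End If
    intmatches = 0
    intcount1 = 0
    intcount2 = 0
    ' Total number of ngrams (repeats included) in the second array.
    For j = 0 To UBound(arr2)
        intcount2 = intcount2 + arr2(j, 1)
    Next
    For i = 0 To UBound(arr1)
        intcount1 = intcount1 + arr1(i, 1)
        For j = 0 To UBound(arr2)
            If arr1(i, 0) = arr2(j, 0) Then
                ' A shared ngram counts as often as it occurs in both strings.
                If arr1(i, 1) >= arr2(j, 1) Then
                    intmatches = intmatches + arr2(j, 1)
                Else
                    intmatches = intmatches + arr1(i, 1)
                End If
                ' Each ngram appears at most once per array, so stop searching.
                Exit For
            End If
        Next
    Next
    ' Dice coefficient; guard against division by zero.
    If intcount1 + intcount2 = 0 Then
        comparengram = 0
    Else
        comparengram = 2 * intmatches / (intcount1 + intcount2)
    End If
End Function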
