Writing an Inverted Index and Query Processing in Python for IR

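For my information retrieval course, the instructor asked us to write a program that builds an inverted index and processes queries, so I taught myself Python to give it a try and got it working.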

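The program does not rely on any clever algorithm. The only tricky part is the query processing, where a recursive function searches back and forth in two directions (going down and coming back up) and needs a fair amount of debugging.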

Data structures:

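#!/usr/bin/python
# -*- coding: utf-8 -*-
'''Data structures

Building the index
mydir   list of documents
onedoc  each individual document
mydoc   the document currently being searched
mywords the dictionary used to build the index
myindex 0 document index, 1 word index, 2 count, 3...
wordcnt number of entries in dict; doccnt number of documents
Three dictionaries
mywordsdictindex  word id -> start position
antimywordsdict   word id -> end position
mywordsdict       word -> word id
Querying
mypos is the starting index position of each word
myfindindex is the id of each word
mydocs the document numbers found
'''
mydir = []
mywords = []
myindex = []
mywordsdictindex = {}
antimywordsdict = {}
mywordsdict = {}
wordcnt = 0   # number of entries in dict
doccnt = 0    # number of documents
listcnt = 0   # number of index rows
mypos = []
mydocs = []
myfindindex = []
mydoc = 0
direct = 0
print(id(mydir))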
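
As a rough illustration of the layout described in the docstring, the sketch below fills the structures by hand for a hypothetical two-document corpus. The documents, the helper `postings_for`, the choice to key both range dictionaries by word id, and the exclusive end position are all assumptions made for this example; they are not taken from the original program.

```python
# Hypothetical toy corpus: doc 0 = "apple banana apple", doc 1 = "banana cherry"
mywordsdict = {'apple': 0, 'banana': 1, 'cherry': 2}   # word -> word id

# Each row is [document index, word id, count], sorted by word id
myindex = [
    [0, 0, 2],   # "apple" occurs twice in doc 0
    [0, 1, 1],   # "banana" occurs once in doc 0
    [1, 1, 1],   # "banana" occurs once in doc 1
    [1, 2, 1],   # "cherry" occurs once in doc 1
]

# word id -> first row of its postings, and one past its last row
# (an exclusive end is an assumption of this sketch)
mywordsdictindex = {0: 0, 1: 1, 2: 3}
antimywordsdict = {0: 1, 1: 3, 2: 4}


def postings_for(word):
    """Return the [doc, word id, count] rows for word, or [] if it is unknown."""
    if word not in mywordsdict:
        return []
    wid = mywordsdict[word]
    return myindex[mywordsdictindex[wid]:antimywordsdict[wid]]


print(postings_for('banana'))   # [[0, 1, 1], [1, 1, 1]]
```

Because the rows are sorted by word id, all postings of one word are contiguous, so a start and an end position per word are enough to find them.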

Building the index:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from mydate import *
import sys
import os
import pprint
import pickle


def getmydoc(thepath, onedir):
    # Read one document and return its words as a list
    ans = []
    for line in open(thepath + '/' + onedir):
        line = line.strip('\n')
        # collect the words on this line (assumes whitespace-separated tokens)
        ans.extend(line.split())
    return ans


def createindex(thepath):
    global mydir
    global mywords
    global myindex
    global mywordsdictindex
    global antimywordsdict
    global mywordsdict
    global wordcnt
    global doccnt
    global listcnt
    global mypos
    global mydocs
    global myfindindex
    global mydoc
    global direct
    # keep only regular files; sub-directories are skipped
    mydir = [i for i in os.listdir(thepath)
             if not os.path.isdir(thepath + '/' + i)]
    # print(mydir)
    # hard-coded test documents; this overrides the listing above
    mydir = ['a.txt', 'b.txt', 'c.txt']
    wordcnt = 0   # number of entries in dict
    doccnt = 0    # number of documents
    listcnt = 0   # number of index rows
    print(id(wordcnt))
    for onedoc in mydir:
        mylist = getmydoc(thepath, onedoc)
        onedocword = 0   # position of each word within this text
        docworddict = {}
        for myword in mylist:
            if myword not in mywordsdict:
                # new word: add a [word, start position] row and assign an id
                mywords.append([myword, 0])
                mywordsdict[myword] = wordcnt
                wordcnt += 1
                # print(myword, mywordsdict[myword])
            if myword not in docworddict:
                # first occurrence in this document: add an empty index row
                docworddict[myword] = listcnt
                myindex.append([0, 0, 0])
                listcnt += 1
            ins = docworddict[myword]
            myindex[ins][0] = doccnt
            myindex[ins][1] = mywordsdict[myword]
            myindex[ins][2] += 1
            onedocword += 1
        doccnt += 1
    myindex.sort(key=lambda x: x[1])   # sort the rows by word id
    beg = 0
    fin = 0
    for i in range(len(mywords)):
        mywordsdictindex[mywords[i][0]] = beg
        mywords[i][1] = beg
        while fin
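
The listing breaks off at the `while fin` loop. Judging from the docstring, that loop presumably walks the sorted `myindex` to record where each word's rows start and end. A minimal sketch of that idea follows; `build_ranges` is a hypothetical helper, and keying the ranges by word id with an exclusive end is an assumption, not the author's code.

```python
def build_ranges(index_rows, word_count):
    """Map each word id to the [start, end) range of its rows in index_rows.

    index_rows must already be sorted by word id (column 1).
    """
    start_of = {}
    end_of = {}
    fin = 0
    for wid in range(word_count):
        beg = fin
        # advance fin past every row that belongs to this word id
        while fin < len(index_rows) and index_rows[fin][1] == wid:
            fin += 1
        start_of[wid] = beg
        end_of[wid] = fin
    return start_of, end_of


# Hypothetical rows in the [document index, word id, count] layout
rows = [[0, 1, 1], [0, 0, 2], [1, 1, 1], [1, 2, 1]]
rows.sort(key=lambda r: r[1])   # same sort key as myindex.sort above
starts, ends = build_ranges(rows, 3)
print(starts)   # {0: 0, 1: 1, 2: 3}
print(ends)     # {0: 1, 1: 3, 2: 4}
```

Each row is visited once, so the ranges for all words are computed in a single pass over the sorted index.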
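
Query processing:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Get the list for one text
import sys
import os
import pprint
import pickle
import pdb
from mydate import *

'''Three possible return values:
  1  every query term was found
  0  the terms do not all appear in the same text
 -1  the search is finished, or the term does not exist
mydoc   whether the query terms are all in this document
direct  search direction
direct=0: recursing downward, carrying the marker flag. flag=1 means the
  terms have all been present so far; flag=0 means they are not all in the
  same text, in which case mydoc takes the largest value seen along the way.
  On reaching len(mypos), decide whether to record this result, advance the
  mypos of the last term, reverse the search direction, and return 1.
direct=1: returning from the recursion, with the same operations as for 0;
  on reaching level 0, reverse the search direction again.
'''


def findword(loc, flag):
    global mydir
    global mywords
    global myindex
    global mywordsdictindex
    global antimywordsdict
    global mywordsdict
    global wordcnt
    global doccnt
    global listcnt
    global mypos
    global mydocs
    global myfindindex
    global mydoc
    global direct
    if loc == len(mypos):
        # pdb.set_trace()
        direct = 1   # reverse the search direction
        if flag == 1:
            i = mypos[loc - 1] + 1
            # print(mydocs)
            if(i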
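
The recursive function is cut off at this point, so its exact logic cannot be recovered from the listing. As a simpler alternative, a query that requires all terms to appear in the same document can also be answered iteratively by merging sorted posting lists. The sketch below shows that standard merge-style intersection; it is not the author's recursive approach, and the names `intersect` and `and_query` as well as the word-to-list index are assumptions made for the example.

```python
def intersect(a, b):
    """Intersect two ascending lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, words):
    """Return the ids of documents that contain every word in words."""
    if not words:
        return []
    lists = [index.get(w, []) for w in words]
    lists.sort(key=len)   # start from the shortest list to keep results small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
        if not result:
            break         # no document can match any more
    return result


# Hypothetical index: word -> ascending list of document ids
toy_index = {'apple': [0, 2], 'banana': [0, 1, 2], 'cherry': [1, 2]}
print(and_query(toy_index, ['apple', 'banana', 'cherry']))   # [2]
```

Each pointer in `intersect` only moves forward, so merging two lists takes time proportional to their combined length.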

Main program:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import hwf
from mydate import *
import createindex
import sys
import os
import pprint
import pickle

createindex.createindex('.')   # build the index
hwf.getwords()                 # look up the query words
