Sphinx, continued (4): how coreseek works

Before looking at how Sphinx works, let me first clear up why the name coreseek keeps coming up.

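Out of the box, Sphinx does not support indexing or searching Chinese text. coreseek is a full-text search server built on top of Sphinx. It ships libmmseg, a Chinese word segmentation package designed for Sphinx that includes the mmseg segmenter, and it is currently the most widely used way to do Chinese search with Sphinx.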

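Without Sphinx, searching a huge number of articles in a MySQL database for a word usually means a statement such as: select *** where *** like '%word%'; A like query whose pattern starts with the wildcard % cannot use MySQL's own indexes, so it needs a full table scan, which is extremely slow.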
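The full-scan behaviour can be observed with a small sketch. This is only an analogy: it uses SQLite (through Python's built-in sqlite3 module) as a stand-in for MySQL, and the table name articles and index name idx_title are made up for the illustration. The query planner reports a table scan for the leading-wildcard like, while an exact match on the same column uses the index.

```python
import sqlite3

# SQLite stands in for MySQL here; table and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("CREATE INDEX idx_title ON articles (title)")
conn.executemany(
    "INSERT INTO articles (title, content) VALUES (?, ?)",
    [
        ("test one", "this is my test document number one"),
        ("another doc", "this is another group"),
    ],
)


def query_plan(sql, params=()):
    """Return the planner's description of how SQLite would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


like_plan = query_plan("SELECT * FROM articles WHERE title LIKE ?", ("%test%",))
exact_plan = query_plan("SELECT * FROM articles WHERE title = ?", ("test one",))

print("LIKE '%test%' plan:", like_plan)
print("exact match plan: ", exact_plan)
```

The exact wording of the plan differs between SQLite versions, and MySQL's EXPLAIN output looks different again, but the pattern is the same: the pattern match has to visit every row.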

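With Sphinx, the full-text indexing is handed over to Sphinx: it returns the IDs of the rows that contain the word, and those IDs are then used to locate exactly those rows in the database. The overall flow is: the application sends the keyword to Sphinx, Sphinx returns the matching IDs, and the application fetches those rows from MySQL by ID.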
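The two-step lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not coreseek's actual client API: sphinx_search is a hypothetical stand-in that returns hard-coded IDs, SQLite replaces MySQL, and the documents table is invented for the example.

```python
import sqlite3


def sphinx_search(keyword):
    """Hypothetical stand-in for a Sphinx query: returns matching document IDs."""
    fake_index = {"test": [1, 2, 4], "this": [1, 2, 3, 4]}
    return fake_index.get(keyword, [])


def fetch_rows(conn, ids):
    """Step 2: fetch the full rows by primary key, which the database can index."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title FROM documents WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()


# SQLite stands in for MySQL; the table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "test one", "this is my test document number one"),
        (2, "test two", "this is my test document number two"),
        (3, "another doc", "this is another group"),
        (4, "doc number four", "this is to test groups"),
    ],
)

ids = sphinx_search("test")   # step 1: the search engine returns only IDs
rows = fetch_rows(conn, ids)  # step 2: the database returns the actual data
print(rows)
```

The database never evaluates a like pattern here. It only does primary-key lookups, which is why this split is fast.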

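Sphinx's index files do not store the complete data, only a structure made up of IDs and tokens. The index files cannot be viewed directly, but we can check this with the search tool: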

First, build the index:

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf

coreseek fulltext 4.1 [ sphinx 2.0.2-dev (r2922)]
beijing choice software technologies inc (

Then query the word test with search:

/usr/local/coreseek/bin/search test -c /usr/local/coreseek/etc/sphinx.conf

coreseek fulltext 4.1 [ sphinx 2.0.2-dev (r2922)]
beijing choice software technologies inc (
using config file '/usr/local/coreseek/etc/sphinx.conf'...
index 'test1': query 'test ': returned 3 matches of 3 total in 0.050 sec

displaying matches:
1. document=1, weight=2421, group_id=1, date_added=thu jan 8 21:43:32 2015
id=1
group_id=1
group_id2=5
date_added=2015-01-08 21:43:32
title=test one
content=this is my test document number one. also checking search within phrases.
2. document=2, weight=2421, group_id=1, date_added=thu jan 8 21:43:32 2015
id=2
group_id=1
group_id2=6
date_added=2015-01-08 21:43:32
title=test two
content=this is my test document number two
3. document=4, weight=1442, group_id=2, date_added=thu jan 8 21:43:32 2015
id=4
group_id=2
group_id2=8
date_added=2015-01-08 21:43:32
title=doc number four
content=this is to test groups

words:
1. 'test': 3 documents, 5 hits

Next, query the word this with search:

/usr/local/coreseek/bin/search this -c /usr/local/coreseek/etc/sphinx.conf

coreseek fulltext 4.1 [ sphinx 2.0.2-dev (r2922)]
beijing choice software technologies inc (
using config file '/usr/local/coreseek/etc/sphinx.conf'...
index 'test1': query 'this ': returned 4 matches of 4 total in 0.000 sec

displaying matches:
1. document=1, weight=1304, group_id=1, date_added=thu jan 8 21:43:32 2015
id=1
group_id=1
group_id2=5
date_added=2015-01-08 21:43:32
title=test one
content=this is my test document number one. also checking search within phrases.
2. document=2, weight=1304, group_id=1, date_added=thu jan 8 21:43:32 2015
id=2
group_id=1
group_id2=6
date_added=2015-01-08 21:43:32
title=test two
content=this is my test document number two
3. document=3, weight=1304, group_id=2, date_added=thu jan 8 21:43:32 2015
id=3
group_id=2
group_id2=7
date_added=2015-01-08 21:43:32
title=another doc
content=this is another group
4. document=4, weight=1304, group_id=2, date_added=thu jan 8 21:43:32 2015
id=4
group_id=2
group_id2=8
date_added=2015-01-08 21:43:32
title=doc number four
content=this is to test groups

words:
1. 'this': 4 documents, 4 hits

What search returns is essentially a list of the matching table IDs together with their weights and hit counts.
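To make the idea of an index that holds only IDs and tokens concrete, here is a toy inverted index in Python built from the four test documents shown in the output above. It is a sketch of the general technique, not Sphinx's actual on-disk format. The whitespace-and-punctuation tokenizer is an assumption, and the weight column is not modelled at all.

```python
import re
from collections import defaultdict

# The (title, content) pairs of the four test documents from the search output.
DOCS = {
    1: ("test one", "this is my test document number one. also checking search within phrases."),
    2: ("test two", "this is my test document number two"),
    3: ("another doc", "this is another group"),
    4: ("doc number four", "this is to test groups"),
}


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to {doc_id: hit count}; the document text is not stored."""
    index = defaultdict(dict)
    for doc_id, fields in docs.items():
        for field in fields:
            for token in tokenize(field):
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def lookup(index, word):
    """Return the sorted matching IDs and the total number of hits for a word."""
    postings = index.get(word, {})
    return sorted(postings), sum(postings.values())


index = build_index(DOCS)
for word in ("test", "this"):
    doc_ids, hits = lookup(index, word)
    print(f"'{word}': {len(doc_ids)} documents, {hits} hits, ids={doc_ids}")
```

For test this prints 3 documents and 5 hits (IDs 1, 2, 4), and for this it prints 4 documents and 4 hits, which agrees with the words: lines in the search output.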

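Note: you may already have spotted a serious problem. Once the Sphinx full-text index has been built, rows newly added to MySQL cannot be found through Sphinx until indexer is run again. Even with the --rotate option, reindexing takes a long time when there is a lot of data. How can this be solved? Tomorrow I will cover main and delta (incremental) indexes, and how to use cron to add new data to the delta index automatically.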
