Introduction to Full-Text Search in KingbaseES

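The default text search parser built into KingbaseES splits text on whitespace. Because Chinese words are not separated by spaces, this approach does not work for Chinese. Chinese full-text search requires an additional word-segmentation plugin: zhparser or sys_jieba. zhparser supports the GBK and UTF8 character sets; sys_jieba supports UTF8.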

I. Default whitespace tokenization

1. tsvector

test=# SELECT to_tsvector('English','Try not to become a man of success, but rather try to become a man of value');

                             to_tsvector

----------------------------------------------------------------------

 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17

(1 row)

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');

                                                     to_tsvector

---------------------------------------------------------------------------------------------------------------------

 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17

(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');

                                                     to_tsvector

---------------------------------------------------------------------------------------------------------------------

 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17

(1 row)

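As the output shows, the english configuration reduces each word to its stem and drops stop words such as "a" and "to", while simple only converts words to lowercase. The default in this database is simple.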

test=# show default_text_search_config;

 default_text_search_config

----------------------------

 pg_catalog.simple

(1 row)
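
To compare configurations without passing one to every call, the default can be changed for the current session. This is a minimal sketch, assuming KingbaseES accepts the standard PostgreSQL SET and RESET syntax for this parameter:

```sql
-- Switch the session default to the english configuration.
SET default_text_search_config = 'pg_catalog.english';

-- to_tsvector without an explicit configuration now uses english.
SELECT to_tsvector('Try not to become a man of success');

-- Restore the server default.
RESET default_text_search_config;
```

The change applies only to the current session, so other connections keep the configured default.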

2. Normalization

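Normalization, as performed by the english configuration, does the following:

Uppercase letters are always converted to lowercase.
Suffixes are often removed (such as s, es, and ing in English), so a search matches every variant of a word without having to list each one.
The numbers give each lexeme's position in the original string; for example, "man" appears at positions 6 and 15.
With the english configuration, to_tsvector also drops English stop words (such as am, is, are, a, and an). The default configuration in this database is simple, which keeps stop words and does not stem.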
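
These steps can be inspected one word at a time. The sketch below assumes that the PostgreSQL helpers ts_lexize and ts_debug, and the english_stem dictionary, are available under the same names in KingbaseES:

```sql
-- Run a single word through the English stemmer dictionary.
SELECT ts_lexize('english_stem', 'Becoming');

-- A stop word; the dictionary is expected to return an empty array.
SELECT ts_lexize('english_stem', 'a');

-- Show, token by token, what the english configuration keeps or drops.
SELECT alias, token, lexemes
FROM ts_debug('english', 'Try not to become a man of success');
```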

3. Searching a tsvector

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';

 ?column?

----------

 t

(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'becom';

 ?column?

----------

 f

(1 row)


test=# select 'become'::tsquery,to_tsquery('become'),to_tsquery('english','become');

tsquery | to_tsquery | to_tsquery

----------+------------+------------

'become' | 'become' | 'becom'

(1 row)

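to_tsquery also normalizes its input, whereas a plain cast to tsquery does not. Build search terms with to_tsquery, using the same configuration as the tsvector, so that normalization does not cause matches to be missed.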
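
A minimal sketch of the difference, naming the english configuration explicitly on both sides (the shortened sample sentence is only for illustration):

```sql
-- Both sides use english, so 'become' is reduced to the stored stem 'becom'.
SELECT to_tsvector('english', 'Try not to become a man of success')
       @@ to_tsquery('english', 'become');

-- A plain cast performs no normalization: the literal 'become' is compared
-- as written against the stemmed lexeme 'becom'.
SELECT to_tsvector('english', 'Try not to become a man of success')
       @@ 'become'::tsquery;
```

Passing the configuration to both functions keeps the result independent of default_text_search_config.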

4. Logical operators

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('become');

 ?column?

----------

 t

(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become');

 ?column?

----------

 f

(1 row)

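This query names the english configuration explicitly: under the simple default shown earlier the stored lexeme is 'try', so 'tri' would not match.

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('english','tri & become');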

 ?column?

----------

 t

(1 row)

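With english named explicitly, 'becom' is the stored stem, so the negation fails (under simple, 'becom' is not a stored lexeme and the result would be t):

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('english','Try & !becom');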

 ?column?

----------

 f

(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try | !become');

 ?column?

----------

 t

(1 row)

5. Prefix matching: :* matches any word that starts with the given characters

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');

 ?column?

----------

 t

(1 row)

6. Support for other languages

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');

                                                     to_tsvector

---------------------------------------------------------------------------------------------------------------------

 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17

(1 row)

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') ;

                             to_tsvector

----------------------------------------------------------------------

 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17

(1 row)

test=# SELECT to_tsvector('french','Try not to become a man of success, but rather try to become a man of value') ;

                                                   to_tsvector

-----------------------------------------------------------------------------------------------------------------

 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17

(1 row)

test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value') ;

                                                   to_tsvector

-----------------------------------------------------------------------------------------------------------------

 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17

(1 row)

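simple does not remove stop words, and it does not try to find word stems. With simple, each whitespace-delimited group of characters becomes a lexeme, and the only transformation applied is lowercasing. This makes the simple configuration a practical choice for data that should be matched as written rather than analyzed linguistically.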

II. Chinese full-text search

Before turning to Chinese search, consider this example:

test=# select to_tsvector('人大金仓致力于提供高可靠的数据库产品');

               to_tsvector

------------------------------------------

 '人大金仓致力于提供高可靠的数据库产品':1

Because the built-in parser splits on whitespace and Chinese text contains no spaces, the entire sentence is treated as a single token.

1. Create the Chinese search extension and configuration

create extension zhparser;

create text search configuration zhongwen_parser (parser = zhparser);

alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;

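The letters after FOR are token types. The mapping above covers only seven of them: noun (n), verb (v), adjective (a), idiom (i), exclamation (e), abbreviation (j), and idiomatic phrase (l, listed as tmp below). Tokens of any other type are discarded. The dictionary used is the built-in simple dictionary. The full list of token types is: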

test=# select ts_token_type('zhparser');

     ts_token_type

------------------------

 (97,a,adjective)

 (98,b,differentiation)

 (99,c,conjunction)

 (100,d,adverb)

 (101,e,exclamation)

 (102,f,position)

 (103,g,root)

 (104,h,head)

 (105,i,idiom)

 (106,j,abbreviation)

 (107,k,tail)

 (108,l,tmp)

 (109,m,numeral)

 (110,n,noun)

 (111,o,onomatopoeia)

 (112,p,prepositional)

 (113,q,quantity)

 (114,r,pronoun)

 (115,s,space)

 (116,t,time)

 (117,u,auxiliary)

 (118,v,verb)

 (119,w,punctuation)

 (120,x,unknown)

 (121,y,modal)

 (122,z,status)

(26 rows)
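
To see which token type each segment of a sentence receives, and which segments the mapping drops, the parser and the configuration can be queried directly. This sketch assumes the PostgreSQL functions ts_parse and ts_debug work with zhparser in KingbaseES as they do in PostgreSQL:

```sql
-- Raw tokens and their numeric type ids, before any mapping is applied.
SELECT * FROM ts_parse('zhparser', '人大金仓致力于提供高可靠的数据库产品');

-- Per-token view through the configuration: tokens whose type has no
-- mapping show no dictionary and produce no lexemes.
SELECT alias, token, lexemes
FROM ts_debug('zhongwen_parser', '人大金仓致力于提供高可靠的数据库产品');
```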

2. Inspect pg_ts_config

After the text search configuration is created, it appears in the pg_ts_config system catalog:

test=# select * from pg_ts_config;

  oid  |     cfgname     | cfgnamespace | cfgowner | cfgparser

-------+-----------------+--------------+----------+-----------

  3748 | simple          |           11 |       10 |      3722

 13265 | arabic          |           11 |       10 |      3722

 13267 | danish          |           11 |       10 |      3722

 13269 | dutch           |           11 |       10 |      3722

 13271 | english         |           11 |       10 |      3722

 13273 | finnish         |           11 |       10 |      3722

 13275 | french          |           11 |       10 |      3722

 13277 | german          |           11 |       10 |      3722

 13279 | hungarian       |           11 |       10 |      3722

 13281 | indonesian      |           11 |       10 |      3722

 13283 | irish           |           11 |       10 |      3722

 13285 | italian         |           11 |       10 |      3722

 13287 | lithuanian      |           11 |       10 |      3722

 13289 | nepali          |           11 |       10 |      3722

 13291 | norwegian       |           11 |       10 |      3722

 13293 | portuguese      |           11 |       10 |      3722

 13295 | romanian        |           11 |       10 |      3722

 13297 | russian         |           11 |       10 |      3722

 13299 | spanish         |           11 |       10 |      3722

 13301 | swedish         |           11 |       10 |      3722

 13303 | tamil           |           11 |       10 |      3722

 13305 | turkish         |           11 |       10 |      3722

 16390 | parser_name     |         2200 |       10 |     16389

 24587 | zhongwen_parser |         2200 |       10 |     16389

3. Use Chinese word segmentation

test=# select to_tsvector('zhongwen_parser','人大金仓致力于提供高可靠的数据库产品');

                           to_tsvector

------------------------------------------------------------------

 '产品':7 '人大':1 '可靠':5 '提供':3 '数据库':6 '致力于':2 '高':4

4. The contains function

test=# \df+ contains

                                                                                           List of functions

 Schema |   Name   | Result data type | Argument data types | Type | Volatility | Parallel | Owner  | Security | Access privileges | Language |               Source code                | Description

--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------

 sys    | contains | boolean          | text, text          | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |

 sys    | contains | boolean          | text, text, integer | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |

 sys    | contains | boolean          | text, tsquery       | func | immutable  | safe     | system | invoker  |                   | sql      | select $1::tsvector @@ $2                |

 sys    | contains | boolean          | tsvector, text      | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2::tsquery                 |

 sys    | contains | boolean          | tsvector, tsquery   | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2                          |

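As the source code column shows, contains calls to_tsvector and to_tsquery without naming a configuration, so it uses the default whitespace-based parser. As a result, contains cannot be used to match Chinese text: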

test=# select contains('人大金仓致力于提供高可靠的数据库产品','产品');

 contains

----------

 f
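
One way around this is to skip contains and name the Chinese configuration on both sides of the match operator. This is a minimal sketch, assuming the zhongwen_parser configuration created earlier; since '产品' appears as a lexeme in the segmented output shown in section 3, the query is expected to match:

```sql
-- Segment both the document and the search term with the Chinese configuration.
SELECT to_tsvector('zhongwen_parser', '人大金仓致力于提供高可靠的数据库产品')
       @@ to_tsquery('zhongwen_parser', '产品');
```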