Recently I needed to tokenize English text and wanted to reduce each word to its base form, for example turning boys into boy.


I found a nice tool for this, the Porter stemmer. It has been implemented in many languages, including C, Java, and Perl.


Stemming, in the parlance of searching and information retrieval, is the
operation of stripping the suffixes from a word, leaving its stem.
Google, for instance, uses stemming to search for web pages containing
the words connected, connecting, connection and connections when
you ask for a web page that contains the word connect.

There are basically two ways to implement stemming. The first approach
is to create a big dictionary that maps words to their stems. The
advantage of this approach is that it works perfectly (insofar as the
stem of a word can be defined perfectly); the disadvantages
are the space required by the dictionary and the investment required to
maintain the dictionary as new words appear. The second approach is to
use a set of rules that extract stems from words. The advantages of this
approach are that the code is typically
small, and it can gracefully handle new words; the disadvantage is that
it occasionally makes mistakes. But, since stemming is imperfectly
defined, anyway, occasional mistakes are tolerable, and the rule-based
approach is the one that is generally chosen.
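
The two approaches described above can be sketched side by side. This is only an illustration: the names STEM_DICT, dict_stem, SUFFIXES, and rule_stem are made up for this example, and the suffix rules are a toy set, not the Porter algorithm.

```python
# Approach 1: a dictionary that maps words to their stems.
# STEM_DICT is a hypothetical, tiny table; a real one would be very large.
STEM_DICT = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "connections": "connect",
}


def dict_stem(word):
    """Exact for known words; unknown words pass through unchanged."""
    return STEM_DICT.get(word, word)


# Approach 2: a small set of suffix-stripping rules, longest suffix first.
# These are toy rules for illustration only, NOT the Porter algorithm.
SUFFIXES = ("ions", "ing", "ion", "ed", "s")


def rule_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("connected", "connections", "reconnected", "nation"):
        print(w, dict_stem(w), rule_stem(w))
```

The dictionary leaves the unseen word reconnected untouched, while the rules reduce it to reconnect without any extra data. The rules also turn nation into nat, which is the kind of occasional mistake the passage mentions.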

In 1979, Martin Porter developed a stemming algorithm that, with minor
modifications, is still in use today; it uses a set of rules to extract
stems from words, and though it makes some mistakes, most common words
seem to work out right. Porter describes his
algorithm and provides a reference implementation in C.


For example, given the inputs "create" and "created", the result is "creat". That was a big disappointment! It does not restore the word to its original form at all.



The purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form.
In other words, the Porter stemmer will not necessarily turn "created" back into "create", but it does reduce both "create" and "created" to the same stem, "creat"!


For example, when I feed it "create" and "created", it returns "creat" for both.

So all that is needed is to apply the same processing at query time! For example, for the query "create created", the database lookup only has to search for "creat"!
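
Here is a minimal sketch of that idea: stem every word when building the index, and stem the query terms the same way before looking them up. The function toy_stem is a hypothetical stand-in for a real Porter stemmer (it only strips a few suffixes), and build_index, search, and the sample documents are likewise made up for illustration.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real Porter stemmer (toy rules only)."""
    word = word.lower()
    for suffix in ("ed", "es", "e", "s"):
        # Keep at least three letters so short words like "the" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query terms the same way, then union the matching ids."""
    results = set()
    for term in {toy_stem(t) for t in query.split()}:
        results |= index.get(term, set())
    return results


# Made-up sample documents.
docs = {
    1: "she created a file",
    2: "please create a folder",
    3: "delete the file",
}
index = build_index(docs)

if __name__ == "__main__":
    # Both query words collapse to the single stem "creat".
    print(sorted(search(index, "create created")))
```

Because the stem is only ever used as a lookup key, it does not matter that "creat" is not a real word; what matters is that indexing and querying agree on it.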




