The Paice/Husk Stemmer


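Reprinting is welcome, but any media outlet, website, or individual that reprints this article must credit the source: China Search Portal (中国搜索门户) http://www.cnsousuo.com/viewnews-1638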

The Paice/Husk Stemmer was developed by Chris Paice at Lancaster University in the late 1980s, and was originally implemented with assistance from Gareth Husk. The stemmer has been implemented in Pascal, C, Perl and Java. Implementations of the stemmer are available at a website maintained by the author. When operating with its standard rule-set, it is a rather 'strong' or 'heavy' stemmer.

The Paice/Husk Stemmer is a simple iterative stemmer: it removes the endings from a word in an indefinite number of steps. The stemmer uses a separate rule file, which is first read into an array or list. This file is divided into a series of sections, each section corresponding to a letter of the alphabet. The section for a given letter, say "e", contains the rules for all endings that end in "e", and the sections are ordered alphabetically. An index can thus be built, leading from the last letter of the word to be stemmed to the first rule for that letter.
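The index described above can be sketched in a few lines of Python. This is only an illustration: the name `build_index` and the list of endings are hypothetical and are not taken from the standard rule file, and each rule is reduced to its ending for brevity.

```python
def build_index(endings):
    """Map each final letter to the position of the first rule for it.

    `endings` must already be grouped by the last letter of each ending,
    with the groups in alphabetical order, as the rule file is described.
    """
    index = {}
    for position, ending in enumerate(endings):
        # setdefault keeps the first position seen for each letter.
        index.setdefault(ending[-1], position)
    return index


# Hypothetical endings, for illustration only (not the standard rule set).
ENDINGS = ["ia", "ance", "e", "ing", "ies", "s"]
INDEX = build_index(ENDINGS)

# Looking up the last letter of a word gives the first rule to try.
first_rule_for_estate = INDEX["estate"[-1]]
```

A word whose last letter has no entry in the index has no applicable rule, so nothing can be removed from it.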

When a word is to be processed, the stemmer takes its last letter and uses the index to find the first rule for that letter. The rule is examined, and is accepted if:

  • It specifies an ending which matches the last letters of the word.
  • Any special conditions for that rule are satisfied (e.g., the so-called 'intact' condition, which ensures that the rule is only fired if no other rules have yet been applied to the word).
  • Application of the rule would not result in a stem shorter than a specified length or without a vowel.

If a rule is accepted then it is applied to the word. If it is not accepted, the rule index is incremented by one and the next rule is tried. However, if the first letter of the next rule does not match the last letter of the word, this implies that no ending can be removed, and so the process terminates.
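The acceptance test and the rule-by-rule search described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the `Rule` fields, the sample rules, the minimum stem length of 2, and the treatment of "y" as a vowel are all hypothetical choices made for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str        # letters the word must end with
    remove: int        # number of final letters to delete
    append: str        # letters added after deletion (may be empty)
    keep_going: bool   # True: continue stemming; False: stop after this rule
    intact_only: bool  # True: fire only if no rule has been applied yet


VOWELS = set("aeiouy")  # assumption: "y" is counted as a vowel


def acceptable(stem, min_length=2):
    """Reject stems that are too short or contain no vowel."""
    return len(stem) >= min_length and any(ch in VOWELS for ch in stem)


def stem_word(word, rules, min_length=2):
    # Index from the last letter of an ending to the first rule for it.
    index = {}
    for pos, rule in enumerate(rules):
        index.setdefault(rule.ending[-1], pos)

    intact = True  # no rule has been applied to the word yet
    while word:
        last = word[-1]
        pos = index.get(last, len(rules))
        applied = None
        # Try each rule in the section for this letter, in order.
        while pos < len(rules) and rules[pos].ending[-1] == last:
            rule = rules[pos]
            candidate = word[:len(word) - rule.remove] + rule.append
            if (word.endswith(rule.ending)
                    and (intact or not rule.intact_only)
                    and acceptable(candidate, min_length)):
                applied = rule
                break
            pos += 1
        if applied is None:
            return word  # section exhausted: no ending can be removed
        word, intact = candidate, False
        if not applied.keep_going:
            return word
    return word


# Hypothetical rules, grouped by final letter (not the standard rule set).
RULES = [
    Rule("ia", 2, "", False, True),
    Rule("e", 1, "", True, False),
    Rule("ing", 3, "", True, False),
    Rule("s", 1, "", True, False),
]
```

With these sample rules, `stem_word("estates", RULES)` strips "s" and then "e", while `stem_word("me", RULES)` leaves the word unchanged because the remaining stem would be too short. The intact-only rule for "ia" fires on "mania" but not on "manias", since the "s" rule has already been applied by then.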

When a rule is applied to a word, this usually means that the ending of the word is removed or replaced. For example, the rule

e1>{ -e - }

means 'if the current word/stem ends with "e" then delete 1 letter and continue' (the curly brackets just contain a comment showing the rule in another form). So this is a simple 'e-removal' rule, which for example would convert "estate" to "estat". After applying this rule, the new final letter (now "t") would be taken and used to access a different section of the rule table. If, however, the final symbol had been "." instead of ">", the process would have terminated, and "estat" would have been returned at once.

Suppose now that the rule had said:

e1i>{ -e -i }

In this case, the "e" would have been removed and then replaced by the letter "i" – giving, in the present case, "estati".
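To make the rule notation concrete, here is a small Python sketch that parses and applies a single rule. It assumes a rule is written as the ending, a digit giving the number of letters to delete, optional letters to append, and then ">" (continue) or "." (stop), with any comment in curly brackets ignored. The names `parse_rule` and `apply_rule` are hypothetical, and the sketch covers only the parts of the notation described in this article.

```python
import re

# ending, number of letters to delete, optional letters to append,
# then ">" (continue) or "." (stop); a trailing {comment} is ignored.
RULE_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)([>.])")


def parse_rule(text):
    """Return (ending, remove_count, append, keep_going) for one rule."""
    match = RULE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a rule in the assumed notation: {text!r}")
    ending, count, append, terminator = match.groups()
    return ending, int(count), append, terminator == ">"


def apply_rule(word, rule):
    """Apply a parsed rule; return (new_word, keep_going), or None if
    the word does not end with the rule's ending."""
    ending, count, append, keep_going = rule
    if not word.endswith(ending):
        return None
    return word[:len(word) - count] + append, keep_going


removal = apply_rule("estate", parse_rule("e1>{ -e - }"))
replacement = apply_rule("estate", parse_rule("e1i>{ -e -i }"))
stopping = apply_rule("estate", parse_rule("e1."))
```

Here `removal` is `("estat", True)`, `replacement` is `("estati", True)`, and `stopping` is `("estat", False)`, matching the three cases discussed in the text. The acceptability check on the resulting stem is deliberately left out of this sketch.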

Once a rule has been found to match, it is not applied at once, but must first be checked to confirm that it would leave an acceptable stem. For example, it would not be sensible to apply the 'e-removal' rule to the word "me", since the remaining stem would be too short, and would not even contain a vowel!

More details about this can be found on the 'How the Stemmer Operates' page.

 

Figure 1. Paice/Husk Stemmer


 


