The Lovins stemming algorithm - 中国搜索技术门户

推荐给好友 上一篇 | 下一篇

The Lovins stemming algorithm

本站欢迎转载,但任何媒体、网站或个人转载使用时请注明来源:中国搜索门户http://www.cnsousuo.com/viewnews-1643

【中国搜索门户讯】

The first ever published stemming algorithm was: Lovins JB (1968) Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22-31. Julie Beth Lovins’ paper was remarkable for the early date at which it was done, and for its seminal influence on later work in this area.

The design of the algorithm was much influenced by the technical vocabulary with which Lovins found herself working (subject term keywords attached to documents in the materials science and engineering field). The subject term list may also have been slightly limiting in that certain common endings are not represented (ements and ents for example, corresponding to the singular forms ement and ent), and also in that the algorithm′s treatment of short words, or words with short stems, can be rather destructive.

The Lovins algorithm is noticeably bigger than the Porter algorithm, because of its very extensive endings list. But in one way that is used to advantage: it is faster. It has effectively traded space for time, and with its large suffix set it needs just two major steps to remove a suffix, compared with the eight of the Porter algorithm.

The Lovins stemmer has 294 endings, 29 conditions and 35 transformation rules. Each ending is associated with one of the conditions. In the first step the longest ending is found which satisfies its associated condition, and is removed. In the second step the 35 rules are applied to transform the ending. The second step is done whether or not an ending is removed in the first step.

For example, nationally has the ending ationally, with associated condition, B, ‘minimum stem length = 3’. Since removing ationally would leave a stem of length 1 this is rejected. But it also has ending ionally with associated condition A. Condition A is ‘no restriction on stem length’, so ionally is removed, leaving nat.

The transformation rules handle features like letter undoubling (sitting -> sitt -> sit), irregular plurals (matrix and matrices), and English morphological oddities ultimately caused by the behaviour of Latin verbs of the second conjugation (assume / assumption, commit / commission etc). Although they are described as being applied in turn, they can be broken into two stages, rule 1 being done in stage 1, and either zero or one of rules 2 to 35 being done in stage 2.

Here is the list of endings as given in Appendix A of Lovins’ paper. They are grouped by length, from 11 characters down to 1. Each ending is followed by its condition code.

 


 
Appendix A. The list of endings
 
    .11. 
    alistically B    arizability A    izationally B 
 
    .10. 
    antialness A    arisations A    arizations A    entialness A 
 
    .09. 
    allically C    antaneous A    antiality A    arisation A 
    arization A    ationally B    ativeness A    eableness E 
    entations A    entiality A    entialize A    entiation A 
    ionalness A    istically A    itousness A    izability A 
    izational A 
 
    .08. 
    ableness A    arizable A    entation A    entially A 
    eousness A    ibleness A    icalness A    ionalism A 
    ionality A    ionalize A    iousness A    izations A 
    lessness A 
 
    .07. 
    ability A    aically A    alistic B    alities A 
    ariness E    aristic A    arizing A    ateness A 
    atingly A    ational B    atively A    ativism A 
    elihood E    encible A    entally A    entials A 
    entiate A    entness A    fulness A    ibility A 
    icalism A    icalist A    icality A    icalize A 
    ication G    icianry A    ination A    ingness A 
    ionally A    isation A    ishness A    istical A 
    iteness A    iveness A    ivistic A    ivities A 
    ization F    izement A    oidally A    ousness A 
 
    .06. 
    aceous A    acious B    action G    alness A 
    ancial A    ancies A    ancing B    ariser A 
    arized A    arizer A    atable A    ations B 
    atives A    eature Z    efully A    encies A 
    encing A    ential A    enting C    entist A 
    eously A    ialist A    iality A    ialize A 
    ically A    icance A    icians A    icists A 
    ifully A    ionals A    ionate D    ioning A 
    ionist A    iously A    istics A    izable E 
    lessly A    nesses A    oidism A 
 
    .05. 
    acies A    acity A    aging B    aical A 
    alist A    alism B    ality A    alize A 
    allic BB    anced B    ances B    antic C 
    arial A    aries A    arily A    arity B 
    arize A    aroid A    ately A    ating I 
    ation B    ative A    ators A    atory A 
    ature E    early Y    ehood A    eless A 
    elity A    ement A    enced A    ences A 
    eness E    ening E    ental A    ented C 
    ently A    fully A    ially A    icant A 
    ician A    icide A    icism A    icist A 
    icity A    idine I    iedly A    ihood A 
    inate A    iness A    ingly B    inism J 
    inity CC    ional A    ioned A    ished A 
    istic A    ities A    itous A    ively A 
    ivity A    izers F    izing F    oidal A 
    oides A    otide A    ously A 
 
    .04. 
    able A    ably A    ages B    ally B 
    ance B    ancy B    ants B    aric A 
    arly K    ated I    ates A    atic B 
    ator A    ealy Y    edly E    eful A 
    eity A    ence A    ency A    ened E 
    enly E    eous A    hood A    ials A 
    ians A    ible A    ibly A    ical A 
    ides L    iers A    iful A    ines M 
    ings N    ions B    ious A    isms B 
    ists A    itic H    ized F    izer F 
    less A    lily A    ness A    ogen A 
    ward A    wise A    ying B    yish A 
 
    .03. 
    acy A    age B    aic A    als BB 
    ant B    ars O    ary F    ata A 
    ate A    eal Y    ear Y    ely E 
    ene E    ent C    ery E    ese A 
    ful A    ial A    ian A    ics A 
    ide L    ied A    ier A    ies P 
    ily A    ine M    ing N    ion Q 
    ish C    ism B    ist A    ite AA 
    ity A    ium A    ive A    ize F 
    oid A    one R    ous A 
 
    .02. 
    ae A    al BB    ar X    as B 
    ed E    en F    es E    ia A 
    ic A    is A    ly B    on S 
    or T    um U    us V    yl R 
    s′ A    ′s A 
 
    .01. 
    a A    e A    i A    o A 
    s W    y B 


 


Here are the 29 conditions, called A to Z, AA, BB and CC (* stands for any letter):

 


 
Appendix B. Codes for context-sensitive rules associated with certain endings
    A    No restrictions on stem 
    B    Minimum stem length = 3 
    C    Minimum stem length = 4 
    D    Minimum stem length = 5 
    E    Do not remove ending after e 
    F    Minimum stem length = 3 and do not remove ending after e 
    G    Minimum stem length = 3 and remove ending only after f 
    H    Remove ending only after t or ll 
    I    Do not remove ending after o or e 
    J    Do not remove ending after a or e 
    K    Minimum stem length = 3 and remove ending only after l, i or u*e 
    L    Do not remove ending after u, x or s, unless s follows o 
    M    Do not remove ending after a, c, e or m 
    N    Minimum stem length = 4 after s**, elsewhere = 3 
    O    Remove ending only after l or i 
    P    Do not remove ending after c 
    Q    Minimum stem length = 3 and do not remove ending after l or n 
    R    Remove ending only after n or r 
    S    Remove ending only after dr or t, unless t follows t 
    T    Remove ending only after s or t, unless t follows o 
    U    Remove ending only after l, m, n or r 
    V    Remove ending only after c 
    W    Do not remove ending after s or u 
    X    Remove ending only after l, i or u*e 
    Y    Remove ending only after in 
    Z    Do not remove ending after f 
    AA    Remove ending only after d, f, ph, th, l, er, or, es or t 
    BB    Minimum stem length = 3 and do not remove ending after met or ryst 
    CC    Remove ending only after l 
 


There is an implicit assumption in each condition, A included, that the minimum stem length is 2.

Finally, here are the 35 transformation rules. 
 


 
Appendix C. Transformation rules used in recoding stem terminations
    1          remove one of double b, d, g, l, m, n, p, r, s, t 
    2    iev   ->   ief 
    3    uct   ->   uc 
    4    umpt   ->   um 
    5    rpt   ->   rb 
    6    urs   ->   ur 
    7    istr   ->   ister 
    7a    metr   ->   meter 
    8    olv   ->   olut 
    9    ul   ->   l except following a, o, i 
    10    bex   ->   bic 
    11    dex   ->   dic 
    12    pex   ->   pic 
    13    tex   ->   tic 
    14    ax   ->   ac 
    15    ex   ->   ec 
    16    ix   ->   ic 
    17    lux   ->   luc 
    18    uad   ->   uas 
    19    vad   ->   vas 
    20    cid   ->   cis 
    21    lid   ->   lis 
    22    erid   ->   eris 
    23    pand   ->   pans 
    24    end   ->   ens except following s 
    25    ond   ->   ons 
    26    lud   ->   lus 
    27    rud   ->   rus 
    28    her   ->   hes except following p, t 
    29    mit   ->   mis 
    30    ent   ->   ens except following m 
    31    ert   ->   ers 
    32    et   ->   es except following n 
    33    yt   ->   ys 
    34    yz   ->   ys 
 
(Rule 30 as given here corrects a typographical error in the published paper of 1968.)

The following examples show the intentions behind these rules.


    1          rubb[ing] -> rub, embedd[ed] -> embed etc 
    2    believ[e] -> belief 
    3    induct[ion] -> induc[e] 
    4    consumpt[ion] -> consum[e] 
    5    absorpt[ion] -> absorb 
    6    recurs[ive] -> recur 
    7    administr[ate] -> administ[er] 
    7a    parametr[ic] -> paramet[er] 
    8    dissolv[ed] -> dissolut[ion] 
    9    angul[ar] -> angl[e] 
    10    vibex -> vibic[es] 
    11    index -> indic[es] 
    12    apex -> apic[es] 
    13    cortex -> cortic[al] 
    14    anthrax -> anthrac[ite] 
    15    ? 
    16    matrix 
 



TAG: Algorithm algorithm Stemming Lovins stemming
 

评分:0

我来说两句

seccode