Class SnoballStemmer

Description

Takes a word, or list of words, and reduces them to their English stems.

This is a fairly faithful implementation of the Porter stemming algorithm, which reduces English words to their stems. It was originally adapted from the ANSI-C code on the official Porter Stemming Algorithm website (http://www.tartarus.org/~martin/PorterStemmer) and later changed to conform more closely to the algorithm itself.

There is a deviation in the way compound words are stemmed, such as hyphenated words and words starting with certain prefixes. For instance, "international" should be reduced to "internation" and not "intern", but an unmodified version of the algorithm produces "intern". Currently, only hyphenated words are accounted for.

Thanks to Mike Boone (http://www.boonedocks.net/) for finding a fatal error in the is_consonant() function dealing with short word stems beginning with "Y".

Additional thanks to Mark Plumbley for finding an additional problem with short words beginning with "Y"--the word "yves" for example. I fixed the _o() and is_consonant() functions to appropriately sanity check the values being passed around. Updated 3/12/04.

Thanks to Andrew Jeffries (http://www.nextgendevelopment.co.uk/) for discovering a bug for words beginning with "yy"--this would cause the is_consonant() method checking either of these first "y"s to fall into a recursive infinite loop and crash the program. Updated 9/23/05.

11/09/05, big update. Prompted by an email from Richard Shelquist, I went back over the class and fixed some errors in the algorithm; in particular I made sure to conform EXACTLY to the written algorithm found at the Stemmer website. This class now takes the test vocabulary file found at http://tartarus.org/~martin/PorterStemmer/voc.txt and stems every single word exactly as shown in the output file found at http://tartarus.org/~martin/PorterStemmer/output.txt, with two exceptions: "ycleped" and "ycliped", which I believe my version stems correctly, due to assuming the "Y" at the beginning of a word followed by a consonant-- as in "Yvette"--is to be treated as a vowel and NOT a consonant. Yeah, that's arrogant; allow me some, okay? Of course, should someone find an exception after boasting of my arrogance, please let me know. I'm only human, after all.

Thanks to Damon Sauve (http://www.shopping.com/) for suggesting a better fix to the handling of hyphenated words (in his case, multi-hyphenated words). His fix used a regular expression to extract the final part of the hyphenated word, while mine does a substr() split instead. Also, his version allows dots and apostrophes in words, such as URLs and contractions, and I realize this is a real-world scenario that I didn't account for, so it's been incorporated.

Located in /lib/core/Search/Common/Stemmer/SnoballStemmer.php (line 88)


Method Summary
SnoballStemmer __construct ()
integer count_vc (string $word)
boolean is_consonant (string $word, integer $pos)
string stem (string $word, [ $lang = 'en'])
array stem_list (mixed $words)
Methods
Constructor __construct (line 92)
  • access: public
SnoballStemmer __construct ()
count_vc (line 606)

Counts (measures) the number of vowel-consonant occurrences.

Based on the algorithm, this function counts the number of occurrences of vowels (1 or more) followed by consonants (1 or more), ignoring any beginning consonants or trailing vowels. Each legitimate VC combination counts as 1 (i.e. VCVC = 2, VCVCVC = 3, etc.).

  • access: public
integer count_vc (string $word)
  • string $word: Word to measure
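
The measure described above can be sketched as a single pass over the word. This is only an illustration, not the class's own code: sketch_count_vc() is a hypothetical stand-alone function, and it uses the classic Porter rule for "y" (a vowel only when it follows a consonant), which may differ from how this class treats a leading "y".

```php
<?php
// Hypothetical helper, not the class's count_vc(): computes the Porter measure.
function sketch_count_vc(string $word): int
{
    $word = strtolower($word);
    $count = 0;
    $prevIsVowel = false; // there is no previous letter yet

    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $ch = $word[$i];
        if (strpos('aeiou', $ch) !== false) {
            $isVowel = true;
        } elseif ($ch === 'y') {
            // Classic Porter rule (assumption): "y" is a vowel only after a consonant.
            $isVowel = $i > 0 && !$prevIsVowel;
        } else {
            $isVowel = false;
        }

        // A consonant directly after a vowel closes one VC group.
        if (!$isVowel && $prevIsVowel) {
            $count++;
        }
        $prevIsVowel = $isVowel;
    }

    return $count;
}
```

With this sketch, "tree" measures 0 (leading consonants and trailing vowels are ignored), "trouble" measures 1, and "troubles" measures 2.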
is_consonant (line 558)

Checks that the specified letter (position) in the word is a consonant.

Handy check adapted from the ANSI C program. Regular vowels always return FALSE, while "y" is a special case: if the preceding character is a vowel, "y" is a consonant; otherwise it is a vowel.

If the "y" being checked is in the first position and the word starts with "yy", the method returns TRUE even though this is not a legitimate word (it crashes otherwise).

  • access: public
boolean is_consonant (string $word, integer $pos)
  • string $word: Word to check
  • integer $pos: Position in the string to check
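
The rules above might look roughly like the following. This is a hedged sketch rather than the class's implementation: sketch_is_consonant() is a hypothetical stand-alone function, and its handling of a leading "y" (consonant before a vowel, vowel before a consonant), of a lone "y", and of out-of-range positions is assumed from the class description.

```php
<?php
// Hypothetical stand-alone sketch of the rules described above; not the class's code.
function sketch_is_consonant(string $word, int $pos): bool
{
    $word = strtolower($word);
    $len = strlen($word);
    if ($pos < 0 || $pos >= $len) {
        // Assumption: out-of-range positions are rejected.
        throw new OutOfRangeException("Position $pos is outside the word");
    }

    $ch = $word[$pos];
    if (strpos('aeiou', $ch) !== false) {
        return false; // regular vowels are never consonants
    }
    if ($ch !== 'y') {
        return true; // every other character counts as a consonant
    }

    if ($pos === 0) {
        // A "yy" prefix (or a lone "y") returns true without recursing,
        // which is what prevents the endless loop.
        if ($len === 1 || $word[1] === 'y') {
            return true;
        }
        // Assumption: a leading "y" is a vowel when a consonant follows it.
        return !sketch_is_consonant($word, 1);
    }

    // Non-leading "y": consonant after a vowel, vowel after a consonant.
    return !sketch_is_consonant($word, $pos - 1);
}
```

Under these assumptions the "y" in "toy" is a consonant, the "y" in "happy" is a vowel, and the leading "y" in "yvette" is a vowel. The recursion always moves toward a non-"y" character or the guarded first position, so it terminates.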
stem (line 113)

Takes a word and returns it reduced to its stem.

Non-alphanumerics and hyphens are removed, except for dots and apostrophes. If the word is more than two characters long, it is then stemmed according to the five-step Porter stemming algorithm; words of one or two characters are not stemmed.

Note special cases here: hyphenated words (such as half-life) will only have the base after the last hyphen stemmed (so half-life would only have "life" subject to stemming). Handles multi-hyphenated words, too.

  • return: Stemmed word
  • access: public
string stem (string $word, [ $lang = 'en'])
  • string $word: Word to reduce
  • $lang: Optional; defaults to 'en'
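
As an illustration of the clean-up and hyphen handling described for stem(), the sketch below isolates the part of a word that would be passed to the stemming steps. sketch_stemmable_base() is a hypothetical helper, not a method of this class, and the lower-casing is an assumption. What the class does with the text before the last hyphen is not shown here.

```php
<?php
// Hypothetical helper: returns the part of a word that would be stemmed.
function sketch_stemmable_base(string $word): string
{
    $word = strtolower($word); // assumption: stemming is case-insensitive

    // Keep only what follows the last hyphen, if there is one.
    $pos = strrpos($word, '-');
    $base = ($pos === false) ? $word : substr($word, $pos + 1);

    // Drop everything except letters, digits, dots and apostrophes.
    return preg_replace("/[^a-z0-9'.]/", '', $base);
}
```

For "half-life" this yields "life", and for the multi-hyphenated "state-of-the-art" it yields "art". Contractions keep their apostrophe, so "Don't!" becomes "don't".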
stem_list (line 168)

Takes a list of words and returns them reduced to their stems.

$words can be either a string or an array. If it is a string, it will be split into separate words on whitespace, commas, or semicolons. If an array, it assumes one word per element.

  • return: List of word stems
  • access: public
array stem_list (mixed $words)
  • mixed $words: String or array of word(s) to reduce
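
The input handling described for stem_list() could be sketched as follows. sketch_words_to_list() is a hypothetical helper that only normalises the argument into a list of words. In the class, each resulting word would presumably then be passed to stem().

```php
<?php
// Hypothetical helper: normalises the $words argument into a flat list of words.
function sketch_words_to_list($words): array
{
    if (is_array($words)) {
        return array_values($words); // assumed: one word per element
    }

    // Split on runs of whitespace, commas, or semicolons; drop empty pieces.
    return preg_split('/[\s,;]+/', (string) $words, -1, PREG_SPLIT_NO_EMPTY);
}
```

Splitting on runs of separators with PREG_SPLIT_NO_EMPTY means that input such as "running, jumps;  swimming" produces three words and no empty strings.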

Documentation generated on Sun, 06 Mar 2011 00:25:00 -0500 by phpDocumentor 1.4.3