The Lazy Programmer

March 14, 2008

IFilter Part 3: Using Word Stemmers

Filed under: Windows — ferruccio @ 6:08 pm
Tags: , , ,

In the last two articles, we learned how to use the Windows IFilter interface to extract text from documents and to break that text into individual words. Now we are going to take those words and use the IStemmer interface to generate alternative forms of the words. e.g. Given the word be, the stemmer will give us back words like is, am and been.


The first time I looked at this interface, it seemed kind of backwards to me. I had previously written my own stemmer using the Porter stemming algorithm, which like many stemmers, would take an input word and produce a word root. e.g. If you give one of these stemmers the word running, it will spit out run. The same holds true for runner, but not for ran. The reason for this behavior is that these types of stemmers look at the end of a word and apply a set of fixed rules to transform the word into its root form. I don’t know what sort of stemming the IStemmer interface uses, but it seems to know about irregular word forms. I’ve only tested this on English documents but I assume the same holds true for other languages.

So, why would you want to generate alternate word forms? The obvious use is for search engines. A search engine will take the user’s input, break it up into search terms and the search not only for the terms entered, but the alternate forms as well. Simply generating root forms can result in smaller indexes, but generating alternate forms leads to more flexibility in search capabilities. e.g. If you search for running, maybe documents about runners should not be ranked as highly but should still appear in the search results.

Added stemming capabilities to our little extraction example is really simple. Most of the added code is simply OO plumbing. You can download the updated project here.

The code to load the stemmer is almost identical to the code that loads the word breaker. It simply looks at “StemmerClass” in the registry instead of “WBreakerClass” and then instantiates the appropriate stemmer object. The stemmer works in a similar manner as a word breaker too. The GenerateWordForms() method takes a word and an WordFormSink object and calls the PutWord and PutAltWord methods on that WordFormSink object. All we do is take the output from WordSink.PutWord and WordSink.PutAltWord and use those words as input to the stemmer. As a test we’ll run the same sample.txt file as before though it.

This is a sample document.

Number: 1234
Date: 2/19/2008
Time: 9:28PM
Amount: $123.50

Thank You.

And the result is:

$ extract ..\sample.txt
Document: ..\sample.txt
----------------------------------------
This
is be been being am are was were
a a's
sample sample's samples samples' sampled sampling
document document's documents documents' documented documenting
Number Number's Numbers Numbers' Numbered Numbering Numb Number Numbest
1234
NN1234
Date Date's Dates Dates' Dated Dating
2/19/2008
DD20080219
Time Time's Times Times' Timed Timing
9:28PM
TT2128
Amount Amount's Amounts Amounts' Amounted Amounting
$123.50
NN123D5$
Thank Thanked Thanking Thanks
You

In case you’re thinking that Numbest is a mistake. I thought that too, but realized it was just a quirk of the English language. Numb, Number, Numbest. šŸ™‚

Advertisements

Blog at WordPress.com.

%d bloggers like this: