The Lazy Programmer

February 19, 2008

IFilters Part 2: Using Word Breakers

Filed under: Windows — ferruccio @ 10:57 pm

In the last article, I described how to use the IFilter interface to extract raw textual data from any document for which IFilter support is available. In this article, I am going to discuss how to take that raw text, or raw text from any source, and break it up into words using word breakers. I am going to take the sample project I presented in the previous article and modify it to use word breakers to break the document down into its constituent words.

You can download the source code for this article here.

The first thing we have to do is locate a suitable word breaker. Windows comes with a number of them pre-installed. If you look in the registry under:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\Language

you will find a number of language entries. The following screenshot shows the language support installed on XP.

[Screenshot: registry editor showing installed language support]

You can see there are a number of languages already supported out of the box. In addition there is a “Neutral” language you can use if the language of your choice is not available. We are going to use “English_US” for our example. The registry value we are interested in is WBreakerClass. This will give us the class id of the word breaker we want.

We’re going to add a new function to our document extractor called LoadWordBreaker() which will locate the word breaker we need in the registry and instantiate the word breaker COM object.

bool DocExtractor::LoadWordBreaker(void)
{
   const wchar_t* keyName = L"SYSTEM\\CurrentControlSet\\Control\\ContentIndex\\Language\\English_US";
   HKEY hKey = 0;
   if (RegOpenKeyEx(HKEY_LOCAL_MACHINE, keyName, 0, KEY_READ, &hKey) == ERROR_SUCCESS)
   {
      wchar_t wordBreakerClass[MAX_PATH] = { 0 };
      // leave room for a terminator: the registry does not guarantee one
      DWORD size = sizeof(wordBreakerClass) - sizeof(wchar_t);
      DWORD type = 0;
      LSTATUS status = RegQueryValueEx(hKey, L"WBreakerClass", 0, &type,
                                       (BYTE *) wordBreakerClass, &size);
      RegCloseKey(hKey);

      if (status == ERROR_SUCCESS && type == REG_SZ)
      {
         // locate word breaker
         CLSID clsidWordBreak;
         if (FAILED(CLSIDFromString(wordBreakerClass, &clsidWordBreak)))
            return false;

         // create word breaker
         if (FAILED(CoCreateInstance(clsidWordBreak, 0, CLSCTX_ALL,
                                     IID_IWordBreaker, (void **) &pWordBreaker_)))
            return false;

         // initialize word breaker
         BOOL bLic = 0;
         if (FAILED(pWordBreaker_->Init(FALSE, 100, &bLic)))
         {
            pWordBreaker_->Release();
            pWordBreaker_ = 0;   // don't leave a dangling pointer behind
            return false;
         }

         return true;
      }
   }

   return false;
}

We read the WBreakerClass value and convert it to a CLSID using CLSIDFromString(). We then instantiate a word breaker object and call its Init() function to initialize the word breaker. The parameters to Init() are:

  • fQuery: A BOOL which indicates whether to do query-time or index-time word breaking. Most of the time you will want to do index-time word breaking (FALSE) unless you are using the word breaker to process user input directly in preparation for doing a search.
  • ulMaxTokenSize: The maximum word length that the word breaker will return. Words longer than this length will be truncated.
  • pfLicense: A pointer to a BOOL which indicates whether or not there are licensing restrictions attached to the use of the word breaker. This seems to always return FALSE. The idea is, if it returns TRUE, you are then supposed to use IWordBreaker::GetLicenseToUse() to get the actual text of the license.

Once the word breaker has been set up, using it is very straightforward. We will modify the DocExtractor::ProcessTextChunk() function to use the word breaker instead of simply saving the text for later use.

//
// process a text chunk
//
void DocExtractor::ProcessTextChunk(IFilter *pFilter, STAT_CHUNK& stat)
{
   const size_t RBUFSIZE = 8 * 1024;

   bool done = false;
   while (!done)
   {
      wchar_t rbuf[RBUFSIZE];
      ULONG bufsize = RBUFSIZE;
      HRESULT hr = pFilter->GetText(&bufsize, rbuf);
      switch (hr)
      {
      case FILTER_E_NO_MORE_TEXT :
         done = true;
         break;

      case FILTER_S_LAST_TEXT :
         done = true;
         // fall through
      default :
         if (SUCCEEDED(hr) && bufsize > 0)
         {
            TEXT_SOURCE ts;
            ts.pfnFillTextBuffer = FillTextBuffer;
            ts.awcBuffer = rbuf;
            ts.iCur = 0;
            ts.iEnd = bufsize;
            MyWordSink ws(*this);
            pWordBreaker_->BreakText(&ts, &ws, 0);
         }
      }
   }
}

The IWordBreaker::BreakText() function takes a pointer to a word sink as a parameter. BreakText() will call member functions of the word sink as necessary to process words. The IWordSink interface defines a number of callback functions. The only ones we are interested in right now are PutWord() and PutAltWord().

class MyWordSink : public WordSink
{
public :
   MyWordSink(DocExtractor& docex) : WordSink(), docex_(docex) {}

private :
   HRESULT STDMETHODCALLTYPE PutAltWord(ULONG cwc, WCHAR const *pwcInBuf, ULONG cwcSrcLen, ULONG cwcSrcPos);
   HRESULT STDMETHODCALLTYPE PutWord(ULONG cwc, WCHAR const *pwcInBuf, ULONG cwcSrcLen, ULONG cwcSrcPos);

   DocExtractor&  docex_;
};
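
Both callbacks receive the word as a pointer plus a character count (cwc). Here is a minimal sketch of how a sink might turn that pair into a line of output, assuming the buffer is not guaranteed to be null-terminated (which the explicit count suggests). FormatSinkLine is a hypothetical helper, not part of the sample project, and plain C++ types stand in for ULONG and WCHAR so that it compiles without the Windows headers.

```cpp
#include <string>

// Hypothetical helper: builds one line of output from the counted buffer
// that a word sink callback receives.
std::wstring FormatSinkLine(const wchar_t* label,
                            const wchar_t* pwcInBuf,
                            unsigned long cwc)
{
   std::wstring line(label);
   line += L": ";
   line.append(pwcInBuf, cwc);   // copy exactly cwc characters, no more
   return line;
}
```

With something like this in place, PutWord() could print FormatSinkLine(L"word", pwcInBuf, cwc) and return S_OK, and PutAltWord() could do the same with an L"alt" label.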

The PutWord() function spits out the next word in the document. The PutAltWord() function spits out the same word in an alternate format. What do I mean by that? I think the best way to explain it is through an example. I put together a sample document called sample.txt which has the following content:

This is a sample document.

Number: 1234
Date: 2/19/2008
Time: 9:28PM
Amount: $123.50

Thank You.

I then run our new extractor against it and this is what comes out:

$ extract ..\sample.txt
Document: ..\sample.txt
----------------------------------------
word: This
word: is
word: a
word: sample
word: document
word: Number
alt: 1234
word: NN1234
word: Date
alt: 2/19/2008
word: DD20080219
word: Time
alt: 9:28PM
word: TT2128
word: Amount
alt: $123.50
word: NN123D5$
word: Thank
word: You

Notice that the word breaker recognizes certain things like numbers and dates and emits those “words” in a special format. The “alternate” word is the same value as a plain text string, just as it appeared in the document. The supported formats for the data are:

  • NN followed by one or more decimal digits for numbers. A decimal point is represented by a D, and a trailing $ marks the value as currency. (In the output above, $123.50 became NN123D5$, so the trailing zero is dropped.)
  • DD followed by a date in YYYYMMDD format.
  • TT followed by a 24-hour time.
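
As a rough illustration of the formats above, here is a sketch of a function that recognizes these special words and turns them back into something readable. Everything in it is hypothetical (TokenKind, Token, ClassifyToken), and it rests on assumptions drawn only from the sample output: dates are DD plus exactly eight digits, and times are TT plus exactly four digits (HHMM). Word breakers may emit variations that this does not handle.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Kinds of token this sketch distinguishes (names are hypothetical).
enum class TokenKind { Text, Number, Currency, Date, Time };

struct Token
{
   TokenKind kind;
   std::wstring value;   // normalized text, e.g. L"2008-02-19"
};

// True if [first, last) of w is non-empty and all decimal digits.
static bool AllDigits(const std::wstring& w, std::size_t first, std::size_t last)
{
   if (first >= last)
      return false;
   for (std::size_t i = first; i < last; ++i)
      if (!std::iswdigit(w[i]))
         return false;
   return true;
}

Token ClassifyToken(const std::wstring& w)
{
   const std::size_t n = w.size();

   // DD + YYYYMMDD
   if (n == 10 && w.compare(0, 2, L"DD") == 0 && AllDigits(w, 2, n))
      return { TokenKind::Date,
               w.substr(2, 4) + L"-" + w.substr(6, 2) + L"-" + w.substr(8, 2) };

   // TT + HHMM (assumed to be exactly four digits)
   if (n == 6 && w.compare(0, 2, L"TT") == 0 && AllDigits(w, 2, n))
      return { TokenKind::Time, w.substr(2, 2) + L":" + w.substr(4, 2) };

   // NN + digits, with D as the decimal point and an optional trailing $
   if (n > 2 && w.compare(0, 2, L"NN") == 0)
   {
      const bool currency = (w[n - 1] == L'$');
      const std::size_t last = currency ? n - 1 : n;
      std::wstring value;
      bool sawDigit = false, sawPoint = false, ok = true;
      for (std::size_t i = 2; i < last && ok; ++i)
      {
         if (std::iswdigit(w[i])) { value += w[i]; sawDigit = true; }
         else if (w[i] == L'D' && !sawPoint) { value += L'.'; sawPoint = true; }
         else ok = false;   // e.g. an ordinary word such as "NNTP"
      }
      if (ok && sawDigit)
         return { currency ? TokenKind::Currency : TokenKind::Number, value };
   }

   // Anything else is treated as an ordinary word.
   return { TokenKind::Text, w };
}
```

Checking the characters after the prefix, rather than the prefix alone, keeps ordinary words that happen to start with NN, DD or TT from being mistaken for data.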

The nice thing about word breakers is that whichever one you select handles this type of data automatically, according to the conventions of its language. So if you are scanning documents for dates, you can simply look for words that start with DD.

In a future article we’ll look at using word stemmers to find the different variations of a word.
