The Lazy Programmer

February 2, 2008

Using the IFilter interface to extract text from documents

Filed under: Windows — ferruccio @ 6:25 am

A long time ago, I wrote search engines.

One of the most tedious tasks that a search engine developer had to do was write parsers for every document format they wanted to support. And, naturally, some of the formats were undocumented, so I spent some time reverse-engineering them. I became intimately familiar with the internals of Word, WordPerfect and even WordStar documents. HTML, as far as I knew, didn’t exist. I told you it was a long time ago.

Today, Windows has a nice little interface called IFilter that you can use to extract text and other properties from just about any document as long as there is an appropriate IFilter parser installed for it. Windows 2000 and up come with IFilter parsers for Microsoft Office documents already installed. You can easily find and download IFilter parsers for other document formats such as PDF.

To demonstrate the IFilter interface, I wrote a simple command line tool which dumps the contents of any file (as long as it has an IFilter parser installed) to the console. You can download the project files here.

I created a class called DocExtractor which hides the details of the IFilter interface. Here is the source to the sample program:

class MyExtractor : public DocExtractor
{
public :
   MyExtractor(void) : DocExtractor() { text_ = L""; }

   const wstring& GetText(void) { return text_; }

private :
   void Start(void)
   {
      text_ = L"";
   }

   void Text(const wstring& text)
   {
      text_ += text;
   }

   void Property(const wstring& name, const wstring& value)
   {
      wcout << L"property: " << name << L"=" << value << endl;
   }

   wstring   text_;
};

int wmain(int argc, wchar_t** argv)
{
   for (int i = 1; i < argc; ++i)
   {
      try
      {
         wcout << L"Document: " << argv[i] << endl;
         wcout << L"----------------------------------------" << endl;
         MyExtractor ex;
         ex.Extract(argv[i]);
         wcout << L"Text:" << endl;
         wcout << L"----------------------------------------" << endl;
         wcout << ex.GetText() << endl << endl;
      }
      catch (const exception& e)
      {
         wcout << L"exception thrown: " << e.what() << endl;
      }
   }

   return 0;
}

MyExtractor works by overriding the virtual functions defined by DocExtractor. Start() is used to signal the beginning of parsing. Text() is called when the next chunk of text is available, and Property() reports when certain property values embedded within the document are available, such as “Title” and “Author”. Finish() signals that parsing is complete.
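The DocExtractor declaration itself isn’t listed in this post (it lives in the project download), but its shape can be inferred from how MyExtractor and Extract() use it. Here is a plausible sketch — treat the exact signatures as my assumptions, not the project’s actual header:

```cpp
#include <string>

using namespace std;

// Reconstructed sketch of the DocExtractor base class; the real one
// ships with the downloadable project files.
class DocExtractor
{
public :
   DocExtractor(void) {}
   virtual ~DocExtractor(void) {}

   // Loads the IFilter for the file and drives the chunk loop,
   // firing the virtual hooks below. (The real code takes a TCHAR*;
   // with UNICODE defined that is wchar_t.)
   void Extract(const wchar_t* filename);

protected :
   virtual void Start(void) {}                  // parsing begins
   virtual void Text(const wstring& text) {}    // next run of text
   virtual void Property(const wstring& name,   // embedded property
                         const wstring& value) {}
   virtual void Finish(void) {}                 // parsing complete
};
```

A subclass only overrides the hooks it cares about; the defaults do nothing.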

The DocExtractor::Extract() function is fairly straightforward:

void DocExtractor::Extract(const TCHAR* filename)
{
   IFilter *pFilter = 0;
   HRESULT hr = LoadIFilter(filename, 0, (void **) &pFilter);
   if (SUCCEEDED(hr))
   {
      DWORD flags = 0;
      hr = pFilter->Init(IFILTER_INIT_INDEXING_ONLY |
                         IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
                         IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES |
                         IFILTER_INIT_FILTER_OWNED_VALUE_OK |
                         IFILTER_INIT_APPLY_OTHER_ATTRIBUTES,
                         0, 0, &flags);
      if (FAILED(hr))
      {
         pFilter->Release();
         throw exception("IFilter::Init() failed");
      }

      Start();

      STAT_CHUNK stat;
      while (SUCCEEDED(hr = pFilter->GetChunk(&stat)))
      {
         if ((stat.flags & CHUNK_TEXT) != 0)
            ProcessTextChunk(pFilter, stat);

         if ((stat.flags & CHUNK_VALUE) != 0)
            ProcessValueChunk(pFilter, stat);
      }

      Finish();

      pFilter->Release();
   }
   else
   {
      throw exception("LoadIFilter() failed");
   }
}

The call to LoadIFilter() causes Windows to search for the appropriate IFilter parser for the filename given and returns an IFilter COM object for it. The Init() function initializes the IFilter object and tells it how we plan on using it. You can get a description of the various settings that are available on MSDN.

The document is divided by the parser into multiple “chunks”. Each chunk can contain text or values (or both). The call to GetChunk() retrieves the next available chunk in the document.
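One detail about chunks worth knowing (a commenter points it out below for .pptx files): STAT_CHUNK also carries a breakType member that says what kind of break separates a chunk from its predecessor. Extract() above simply concatenates chunks, which can glue the last word of one text element to the first word of the next. A sketch of a separator table — the enum values mirror CHUNK_BREAKTYPE from filter.h (repeated here only to keep the snippet self-contained), and the particular separator strings are just one reasonable mapping:

```cpp
#include <string>

using namespace std;

// These values mirror the CHUNK_BREAKTYPE enum in <filter.h>.
enum ChunkBreakType
{
   CHUNK_NO_BREAK = 0,   // no break between chunks
   CHUNK_EOW      = 1,   // end of word
   CHUNK_EOS      = 2,   // end of sentence
   CHUNK_EOP      = 3,   // end of paragraph
   CHUNK_EOC      = 4    // end of chapter
};

// Separator to insert *before* a chunk, based on stat.breakType.
wstring SeparatorFor(int breakType)
{
   switch (breakType)
   {
   case CHUNK_EOW :
   case CHUNK_EOS : return L" ";
   case CHUNK_EOP : return L"\n";
   case CHUNK_EOC : return L"\n\n";
   default :        return L"";     // CHUNK_NO_BREAK
   }
}
```

Inside the GetChunk() loop you would emit SeparatorFor(stat.breakType) before processing each text chunk.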

The next function we define is DocExtractor::ProcessTextChunk(). As you may have guessed, this takes a chunk and processes it so that our program can more easily digest it.

void DocExtractor::ProcessTextChunk(IFilter *pFilter, STAT_CHUNK& stat)
{
   const size_t RBUFSIZE = 8 * 1024;
   wstring t = L"";

   bool done = false;
   while (!done)
   {
      wchar_t rbuf[RBUFSIZE];
      ULONG bufsize = RBUFSIZE;
      HRESULT hr = pFilter->GetText(&bufsize, rbuf);
      switch (hr)
      {
      case FILTER_E_NO_MORE_TEXT :
         done = true;
         break;

      case FILTER_S_LAST_TEXT :
         done = true;
         // fall through
      default :
         if (SUCCEEDED(hr))
            t.append(rbuf, bufsize);   // append exactly bufsize characters
      }
   }

   Text(t);
}

This code asks for the text in the current chunk in pieces of up to 8K characters and collects it all in an STL wstring object. Keep in mind that the text GetText() puts into rbuf is always Unicode and never has a null terminator on the end, so it must always be appended by count.
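Because the buffer comes back with a character count and no terminator, every use of it has to be length-based — calling wcslen() on rbuf would read past the valid data. A trivial helper (mine, not part of the project) to make the point:

```cpp
#include <string>

using namespace std;

// Append a counted, non-terminated buffer -- the only safe way to
// consume what GetText() hands back.
wstring AppendCounted(wstring t, const wchar_t* buf, unsigned long count)
{
   t.append(buf, count);   // copies exactly `count` characters
   return t;
}
```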

The code to handle property values is a little more involved. Property names can be either strings or numeric IDs.

void DocExtractor::ProcessValueChunk(IFilter *pFilter, STAT_CHUNK& stat)
{
   wstring propName = L"";

   // get property name
   switch (stat.attribute.psProperty.ulKind)
   {
   case PRSPEC_LPWSTR :
      propName = stat.attribute.psProperty.lpwstr;
      break;

   case PRSPEC_PROPID :
      switch (stat.attribute.psProperty.propid)
      {
      case PIDSI_TITLE :         propName = L"Title";          break;
      case PIDSI_SUBJECT :       propName = L"Subject";        break;
      case PIDSI_AUTHOR :        propName = L"Author";         break;
      case PIDSI_KEYWORDS :      propName = L"Keywords";       break;
      case PIDSI_COMMENTS :      propName = L"Comments";       break;
      case PIDSI_TEMPLATE :      propName = L"Template";       break;
      case PIDSI_LASTAUTHOR :    propName = L"LastAuthor";     break;
      case PIDSI_REVNUMBER :     propName = L"RevNumber";      break;
      case PIDSI_EDITTIME :      propName = L"EditTime";       break;
      case PIDSI_LASTPRINTED :   propName = L"LastPrinted";    break;
      case PIDSI_CREATE_DTM :    propName = L"Created";        break;
      case PIDSI_LASTSAVE_DTM :  propName = L"LastSaved";      break;
      case PIDSI_PAGECOUNT :     propName = L"PageCount";      break;
      case PIDSI_WORDCOUNT :     propName = L"WordCount";      break;
      case PIDSI_CHARCOUNT :     propName = L"CharCount";      break;
      case PIDSI_APPNAME :       propName = L"AppName";        break;
      default :                  propName = L"?";
      }
      break;
   }

   // get property value
   wstring propValue = L"";
   HRESULT hr = 0;
   PROPVARIANT *pv = 0;
   while (SUCCEEDED(hr = pFilter->GetValue(&pv)))
   {
      wstring prop;
      if (pv != 0)
      {
         switch (pv->vt)
         {
         case VT_LPWSTR :
            prop = pv->pwszVal;
            break;

         case VT_I4 :
            // TODO: convert pv->intVal to string
            break;

         case VT_FILETIME :
            // TODO: convert pv->filetime to string
            break;
         }
         CoTaskMemFree(pv);
         pv = 0;   // let GetValue() allocate a fresh PROPVARIANT
      }
      // if there is more than one value for this property,
      // turn it into a comma-delimited list
      if (propValue.length() != 0)
         propValue += L", ";
      propValue += prop;
   }

   Property(propName, propValue);
}
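The two TODOs in ProcessValueChunk() can be filled in roughly like this. This is a portable sketch, not the project’s code: FILETIME is redeclared so the snippet stands alone, and the FromI4/FromFileTime names are mine. On Windows you would format the date properly via FileTimeToSystemTime() instead of reporting raw seconds.

```cpp
#include <cstdint>
#include <string>

using namespace std;

// Portable stand-in for the Windows FILETIME struct, redeclared here
// so the sketch compiles on its own.
struct FILETIME
{
   uint32_t dwLowDateTime;
   uint32_t dwHighDateTime;
};

// VT_I4 holds a signed 32-bit integer.
wstring FromI4(int32_t v)
{
   return to_wstring(v);
}

// VT_FILETIME holds 100-nanosecond ticks since 1601-01-01 (UTC).
// A real Windows build would call FileTimeToSystemTime() and format
// a proper date; here we just surface the raw seconds count.
wstring FromFileTime(const FILETIME& ft)
{
   uint64_t ticks = (uint64_t(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
   return to_wstring(ticks / 10000000ULL) + L" s since 1601-01-01";
}
```

In the switch above, the VT_I4 case becomes prop = FromI4(pv->lVal) and the VT_FILETIME case becomes prop = FromFileTime(pv->filetime).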

That’s it! Just point the extract tool at a file and it will dump out all the text and properties in that file. In a future post, I’ll discuss how to use word breakers to split all that text into individual words.

Update: There’s a freely downloadable utility called IFilter Explorer that you can use to see what file formats have parsers installed for them.


7 Comments

  1. You seem to have overlooked one detail when joining the text hunks together. You should check the breakType member of the STAT_CHUNK object — it may indicate that you need to insert a word/sentence/paragraph/chapter separator between the hunks. This turns out to be especially important for .pptx files; otherwise words from adjacent text elements will be joined together (e.g. the last word of the title merged with the first word of the body).

    Comment by Bill Dimm — May 20, 2008 @ 9:12 am

  2. Bill,

    I never ran across that issue, probably because most of my work was with PDF files, which seemed to break text chunks on page boundaries. Great tip, though. Thanks.

    – Ferruccio

    Comment by Ferruccio — May 20, 2008 @ 9:24 pm

  3. Hi,

    The sample project extracts text and properties from MS Office 2007 (*.docx) documents, but for MS Office 2003 (*.doc) it only extracts the text content; it does not extract any properties. Why does this happen with MS Office 2003?

    Can we read the properties of MS Office 2003 documents using this IFilter?

    Thanks
    Prakash

    Comment by Prakash Tandukar — January 12, 2009 @ 1:33 am

  4. My pc is Windows 2000 SP4. Using Adobe PDF Ifilter 6.0

    Goal- to extract from a pdf file.

    To compile and link in C I had to change TCHAR’s to wchar_t’s and add ntquery.lib to the library list.

    I also changed the Platform from 0x0600 to 0x0502.

    In C Debug on LoadIfilter I get-

    First-chance exception in lazystandalone.exe (ACE.DLL) 0xC000001D: Illegal Instruction

    (ACE.DLL is in Adobe PDF IFilter 6.0)

    Other similar errors occur as filter is used and released.

    Oddly enough the extraction works ok.

    On completion Debug ends with

    Unhandled exception in lazystandalone.exe (PDFL60.DLL) 0xc0000005: Access Violation

    If I run in batch the program ends with
    Font Capture: lazystandalone.exe – Application Error
    The instruction at “0x00de61b3” referenced memory at “0x017b44d8”. The memory could not be read. Click OK to terminate the program.

    Any suggestions would be appreciated.

    Howard

    Comment by Howard — July 22, 2009 @ 12:12 pm

  5. […] Part 2: Using Word Breakers By ferruccio In the last article, I described how to use the IFilter interface to extract raw textual data from any document which […]

    Pingback by IFilters Part 2: Using Word Breakers « The Lazy Programmer — November 29, 2009 @ 11:59 am

  6. Hi,
    I am using your code to implement IFilter, but my application crashes because of the DocExtractor class. When I am not using that class it works perfectly (no crash), and I am not able to figure out why this happens.

    Please help me.
    Thanks

    Comment by prabhat — December 22, 2009 @ 5:43 am

  7. prabhat, (and Howard, sorry for the delayed response, but I had no idea what the problem was until recently)

    Are you using the 6.0 version of the Adobe PDF IFilter? If so, that may be the source of your problem. The 6.0 IFilter has a documented bug which causes its host (your program) to crash upon exit. You need to use a newer version.

    BTW. Adobe no longer ships a separate IFilter. It comes bundled with both Reader and Acrobat. So installing the latest version of either of those will give you their latest IFilter.

    If you already have Reader (or Acrobat) installed, it is not enough to uninstall the 6.0 IFilter. You will also need to repair the Reader (or Acrobat) installation. Or you can uninstall and reinstall it.

    – Ferruccio

    Comment by Ferruccio — December 22, 2009 @ 7:10 am

