A while back, I started building a PDF parser in C++. I had been using the Adobe PDF IFilter to extract text from PDF files in order to index the content, but I wanted to be able to be able to also extract formatting information so I dug into the PDF format. The PDF format itself is fairly easy to parse, but the contents can be quite complex.
The PDF format consists of a series of objects, expressed in a simple syntax based on PostScript. There are primitives such as strings and numbers, and there are collections (arrays and dictionaries) which can contain both primitives and containers. You can see how things quickly become complicated when you have dictionaries containing arrays containing other complex objects.