The Lazy Programmer

April 22, 2008

Using Boost to tokenize strings

Filed under: C#,Programming — ferruccio @ 9:55 pm

Like most C++ programmers who came from a C background, whenever I needed to parse a string into a bunch of tokens, I reached for strtok(). Sure, it’s not thread-safe. But I can take care of that later if it becomes an issue, right?

Eventually, I ran across the STL and learned to put my strings and other objects into containers. But strtok would eventually rear its ugly head and break the abstraction of STL strings, just so I could split up a string without putting in too much effort.
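To make that abstraction break concrete, here is a rough sketch of what the strtok() approach tends to look like once std::string is involved. The function name split_with_strtok is made up for this illustration. The point is that strtok() writes into its input, so the std::string has to be copied into a writable buffer first:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any character in `delims` via strtok().
// strtok() overwrites delimiters with '\0', so it gets a private, writable
// copy of the characters rather than the std::string's own storage.
std::vector<std::string> split_with_strtok(const std::string& text,
                                           const char* delims)
{
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');                 // strtok() expects a C string

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(&buffer[0], delims);
         tok != NULL;
         tok = std::strtok(NULL, delims))   // NULL means "continue the same scan"
    {
        tokens.push_back(tok);
    }
    return tokens;
}
```

It works, but the hidden state inside strtok() is exactly why it isn’t thread-safe, and the copy-to-a-buffer step is the kind of busywork the rest of this post tries to get rid of.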

At some point, I started playing with Boost. If you’re a C++ programmer and you don’t know what Boost is, you need to go to boost.org and take a look at it right now because I guarantee that you are doing way too much unnecessary work without it. Go ahead, I’ll wait until you’re done looking around.

Great! Now that you know what Boost is, I’d like to show you several ways you can use Boost to break strings up into tokens and maybe introduce a few Boost features along the way. We’ll start with a class designed to do just that: boost::tokenizer.

Here’s a short sample program that breaks a string using spaces and commas as token delimiters:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
   string text = "token, test   string";

   char_separator<char> sep(", ");
   tokenizer<char_separator<char>> tokens(text, sep);
   for ( tokenizer<char_separator<char>>::iterator it = tokens.begin();
         it != tokens.end();
         ++it)
   {
      cout << *it << "." << endl;
   }
}

Well, that works, but it’s a little too wordy for my taste. The tokenizer code itself is pretty short and sweet. We create a char_separator object named sep that defines our token delimiters, and a tokenizer that takes a string and a char_separator. The tokenizer behaves like a read-only container: its begin() and end() iterators let us walk through all our tokens. The wordiness is mostly in the for loop, where we have the usual iterator boilerplate code. Fortunately, we can fix that using BOOST_FOREACH. Our code now becomes:

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
   string text = "token, test   string";

   char_separator<char> sep(", ");
   tokenizer<char_separator<char>> tokens(text, sep);
   BOOST_FOREACH(string t, tokens)
   {
      cout << t << "." << endl;
   }
}

The BOOST_FOREACH macro lets you iterate over every object in a container or array. In our example it declares a local string t, which receives a copy of each token in turn. Be careful using it, though. It *is* a macro, so don’t put any expression with unparenthesized commas in it (a template type such as std::pair<int, int> is the classic offender) or the compiler will get seriously confused and start spewing a torrent of error messages.

If you’ve been following the development of the C++0x standard, you may know that a foreach construct has been approved for inclusion in the language. With any luck, we’ll start seeing it in C++ compilers within the next decade 🙂

One of the cool features of boost::tokenizer is that you can break up strings based on more complex criteria than just a list of delimiters. If you need to parse data in CSV format, you can use the escaped_list_separator to make the job really simple.

   string csv = "simple field,\"quoted field, with commas\",another field";
   tokenizer<escaped_list_separator<char>> esc_tokens(csv);
   BOOST_FOREACH(string t, esc_tokens)
   {
      cout << t << "." << endl;
   }


That’s all well and good, but what if you need to put all the tokens you’ve collected into a container? You could just do that in the BOOST_FOREACH loop, or you could let the Boost string algorithms library do all the work for you:

#include <iostream>
#include <string>
#include <list>
#include <boost/foreach.hpp>
#include <boost/algorithm/string.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
   string text = "token, test   string";

   list<string> tokenList;
   split(tokenList, text, is_any_of(", "), token_compress_on);
   BOOST_FOREACH(string t, tokenList)
   {
      cout << t << "." << endl;
   }
}

The split function takes your string and your token-delimiting criteria (in this example, a delimiter is either a space or a comma) and fills the container of your choice with the resulting tokens. The token_compress_on constant tells split() to treat a run of adjacent delimiters as a single delimiter. Without it, split() would return an empty string between each pair of consecutive delimiters. Note that we don’t have to give it a list to fill. A vector, a deque, or any other standard sequence container of strings would have worked as well.


2 Comments

  1. Thanks, just what I needed! 🙂

    Got tired of using stringstream hacks or strtok as you mentioned for such a simple thing as splitting a string by some separator. I’m using Boost in my project and today I thought “hey, I’m sure Boost has something for tokenizing” and I went googling. Your blog was a first hit, congrats 🙂

    Comment by Paweł Paprota — April 15, 2009 @ 4:24 am

  2. Thx by the help dude

    Comment by Anonymous — January 31, 2010 @ 5:34 pm

