I’ve been experimenting with Python to read data from binary files and have started to notice the following pattern in my code:
- Read a block of binary data.
- Use struct.unpack() to break out individual fields.
- Create a dictionary from those fields using the appropriate key names.
So, let’s suppose I needed to read a block which contained a name (20 characters), an age (unsigned integer) and a salary (float). The code might look something like this:
    import struct

    (name, age, salary) = struct.unpack("< 20s I f", src.read(28))
    record = {'name': name, 'age': age, 'salary': salary}
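To try this without a real file, here is a minimal sketch that builds one record in memory. The sample values and the use of `io.BytesIO` as a stand-in for `src` are assumptions made for illustration, and the code assumes Python 3, where a `20s` field comes back as `bytes`.

```python
import io
import struct

FORMAT = "< 20s I f"

# Build a sample record; the values are made up for illustration.
raw = struct.pack(FORMAT, b"Alice", 30, 1500.5)

# io.BytesIO stands in for the open binary file `src`.
src = io.BytesIO(raw)

# calcsize confirms the hand-computed block size: 20 + 4 + 4 = 28 bytes.
size = struct.calcsize(FORMAT)

name, age, salary = struct.unpack(FORMAT, src.read(size))

# In Python 3 a "20s" field is NUL-padded bytes, so strip the padding and decode.
record = {
    "name": name.rstrip(b"\x00").decode("ascii"),
    "age": age,
    "salary": salary,
}
print(size, record)
```

Letting `struct.calcsize()` work out the size is what removes the hard-coded 28.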
That’s not bad, but it starts getting tedious to have to calculate the size of the block and construct the dictionary manually each time, so I wrote a small function to do all that automatically:
    import struct

    def readStruct(src, format, names):
        s = {}
        # Read exactly as many bytes as the format needs, then pair each name with its value.
        for nv in zip(names, struct.unpack(format, src.read(struct.calcsize(format)))):
            s[nv[0]] = nv[1]
        return s
The code to read the record from the previous example then becomes:
    record = readStruct(src, "< 20s I f", ('name', 'age', 'salary'))
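As a quick check of the helper, the sketch below reads two records back to back from an in-memory stream. The function body is a compact restatement of the helper so the example runs on its own; the sample data and the `io.BytesIO` stand-in for `src` are assumptions, and Python 3 is assumed.

```python
import io
import struct


def readStruct(src, fmt, names):
    # Same idea as the helper above: read calcsize(fmt) bytes, pair names with values.
    values = struct.unpack(fmt, src.read(struct.calcsize(fmt)))
    return dict(zip(names, values))


# Two made-up records written back to back, as they might appear in a file.
fmt = "< 20s I f"
data = struct.pack(fmt, b"Alice", 30, 1500.5) + struct.pack(fmt, b"Bob", 41, 2000.0)
src = io.BytesIO(data)

# Each call consumes one record, so repeated calls walk through the stream.
first = readStruct(src, fmt, ("name", "age", "salary"))
second = readStruct(src, fmt, ("name", "age", "salary"))
print(first["age"], second["age"])
```

Because the read size comes from the format string, changing the record layout only means changing the format and the names.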
It’s a small change, but it has made working with binary files easier and less error-prone. The next step will be to combine the format and names parameters into a single format string that has the names embedded in it. We’re going to use regular expressions to achieve that goal.
The updated code is:
    import struct
    import re

    def readStruct(src, format):
        # Field names are written as {name} inside the format string.
        rex = re.compile(r"\{([a-zA-Z_]+)\}")
        names = rex.findall(format)
        # Strip the names out, leaving a plain struct format string.
        format = rex.sub("", format)
        s = {}
        for nv in zip(names, struct.unpack(format, src.read(struct.calcsize(format)))):
            s[nv[0]] = nv[1]
        return s
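To make the regular-expression step concrete, this small sketch runs just the two `re` calls on a sample format string. The variable names `spec` and `fmt` are chosen here for illustration.

```python
import re
import struct

rex = re.compile(r"\{([a-zA-Z_]+)\}")
spec = "< {name} 20s {age} I {salary} f"

# findall returns the captured group for every match, in order of appearance.
names = rex.findall(spec)

# sub removes each {name} marker; struct ignores the whitespace left behind.
fmt = rex.sub("", spec)

print(names)
print(fmt.split())
print(struct.calcsize(fmt))
```

One limitation of this pattern: it only matches letters and underscores, so a name containing a digit (such as `{field1}`) would not be recognised and would be left in the format string for `struct` to reject.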
The previous example now becomes:

    record = readStruct(src, "< {name} 20s {age} I {salary} f")
I think this is a lot easier to write and read. Note that you don’t have to interleave the format codes and the names. You could just as easily have written the format string as “<20sIf{name}{age}{salary}”, but I think interleaving them makes their use much clearer.
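The equivalence of the two layouts can be checked with a short sketch. The function is a compact restatement of the version with embedded names, and the sample record and `io.BytesIO` streams are assumptions for illustration.

```python
import io
import re
import struct


def readStruct(src, fmt):
    # Pull out the {name} markers, then unpack using what remains of the format.
    rex = re.compile(r"\{([a-zA-Z_]+)\}")
    names = rex.findall(fmt)
    fmt = rex.sub("", fmt)
    values = struct.unpack(fmt, src.read(struct.calcsize(fmt)))
    return dict(zip(names, values))


# One made-up record, read twice with differently laid out format strings.
raw = struct.pack("<20sIf", b"Alice", 30, 1500.5)

interleaved = readStruct(io.BytesIO(raw), "< {name} 20s {age} I {salary} f")
grouped = readStruct(io.BytesIO(raw), "<20sIf{name}{age}{salary}")

print(interleaved == grouped)
```

Only the order of the names matters, since they are matched to the unpacked values by position.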


Nice code. My only comment is that I would add a check that the format has the same number of fields as names, since zip() silently stops at the end of the shorter sequence, and you could end up with names paired with the wrong fields. For example:
    import struct
    import re

    def readStruct(src, format):
        rex = re.compile(r"\{([a-zA-Z_]+)\}")
        names = rex.findall(format)
        format = rex.sub("", format)
        fields = struct.unpack(format, src.read(struct.calcsize(format)))
        # Refuse to build a record if names and fields do not line up one to one.
        if len(fields) != len(names):
            raise ValueError("Mismatch in lengths of struct format and field names")
        s = {}
        for name, field in zip(names, fields):
            s[name] = field
        return s

Comment by Jared.Grubb, April 6, 2009 @ 3:10 pm
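To illustrate the failure mode that this check guards against, the sketch below drops one name from the format string and compares the unchecked and checked behaviour. The names `read_struct_checked`, `NAME_RE`, and `bad_format`, along with the sample data, are hypothetical and used only for this example.

```python
import io
import re
import struct

NAME_RE = re.compile(r"\{([a-zA-Z_]+)\}")


def read_struct_checked(src, fmt):
    # Hypothetical name; follows the same logic as the checked version above.
    names = NAME_RE.findall(fmt)
    fmt = NAME_RE.sub("", fmt)
    fields = struct.unpack(fmt, src.read(struct.calcsize(fmt)))
    if len(fields) != len(names):
        raise ValueError("Mismatch in lengths of struct format and field names")
    return dict(zip(names, fields))


raw = struct.pack("<20sIf", b"Alice", 30, 1500.5)

# The {age} marker is missing: three fields, but only two names.
bad_format = "< {name} 20s I {salary} f"

# Without the check, zip() pairs 'salary' with the second field, which is the age.
names = NAME_RE.findall(bad_format)
fields = struct.unpack(NAME_RE.sub("", bad_format), raw)
unchecked = dict(zip(names, fields))
print(unchecked["salary"])

# With the check, the mismatch is reported instead of producing a wrong record.
try:
    read_struct_checked(io.BytesIO(raw), bad_format)
except ValueError as exc:
    print("rejected:", exc)
```

On Python 3.10 or later, `zip(names, fields, strict=True)` would raise a `ValueError` for the same situation, which may be an alternative to the explicit length comparison.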