Lately, I’ve been working on a multiplayer game protocol for a game I’m writing in Odin. It’s in its early stages, as I’m going step by step through Glenn Fiedler’s excellent Building a Game Network Protocol blog series. Currently, I have a working implementation that covers the Serialization Strategies post, and I’ve already encountered several interesting topics. As the title of this post suggests, we’re going to take a look at bit packing here. Many of the points I’m making come from that blog series, and my implementation is mostly an Odin port of what Glenn shows in C++. The main purpose of this blog post is to dig deeper into some of the code from that blog series. You can find the current implementation of my project here: game-networking.
Just to be clear, I am not an expert in networking, game networking protocols, or bit packing. I’m simply trying to explain what I know and have learned through this process. If you discover any errors in my explanations or anywhere else, then please let me know!
Motivation
You might be wondering what bit packing has to do with building a multiplayer game protocol. The short answer is latency, but let’s dive deeper to understand the connection.
For networking in games and many other real-time applications, only the latest state of the “world” or “simulation” is relevant. This is inherently different from information transferred over an HTTP request, where it’s critical that every bit of information is received by the receiver in the correct order.
HTTP runs on top of another protocol called TCP, which is designed to guarantee that every bit of the request you send will be received on the other end in the intended order. Imagine sending a TCP packet every second for one minute, where the payload contains an incrementing number. The receiving side should receive 60 packets with the same incrementing numbers in order. TCP guarantees this even if the network between the sender and receiver is unstable or drops out. While this is great for reliably downloading files, browsing the web and so on, it’s not ideal for games because it introduces unwanted latency.
Why UDP?
So, what’s the alternative? UDP (User Datagram Protocol). In contrast to TCP, UDP provides no guarantees about whether a packet will be received or in what order it will arrive. It does offer some basic error detection through a checksum, similar to TCP, but it’s not very strong. Many developers append their own error detection codes, such as a CRC32. Why is this better than TCP? While it’s true that we have to handle more ourselves, UDP gives us the flexibility we need to minimize latency as much as possible.
The Importance of Latency
Every gamer knows that those precious milliseconds matter, especially in highly competitive games and for the overall smoothness and quality of the multiplayer experience. Few things ruin a multiplayer game as quickly as bad network lag and desynchronization.
To reduce latency, there are two core principles: send as little data as possible, and send the freshest data possible. TCP suffers most in the “freshness” department. To uphold its guarantees, TCP might have to retransmit old packets containing information about game states that occurred many frames, or even seconds ago if the network is unstable. This is far from ideal for a multiplayer gaming experience.
That said, some games have successfully implemented multiplayer network protocols over TCP, so it’s not entirely inapplicable. However, it’s important to understand the trade-offs involved.
Bit Packing
Now, let’s address the “sending as little data as possible” part. This is where bit packing comes into play. Sending less data is important for several reasons. It enables us to send fewer packets over the network, reducing the chances of packet loss. It also puts less pressure on the network bandwidth required between the sender and receiver.
By bit packing, we can ensure that we pack the information we need to send and receive as tightly as possible, with no wasted bits. To begin outlining the bit packing, we need to consider its main purpose and our requirements. For game networking, we know that we’ll need to write bytes onto a UDP socket. We also know that we want those bytes to be as packed with information as possible, meaning we want as few dead bits (useless 0s) as we can manage. I highly recommend reading the first two posts in the Building a Game Network Protocol series before continuing. They go into more detail in some areas that I won’t cover, while my focus here will be to deeply explain the details of bit packing and serialization code, with the twist that it’s written in Odin instead of C++.
So, what do we need to write something onto a UDP socket? The first thing to look at is the socket API. If we check the man page for socket, we’ll see that it mentions different socket types and the send syscall. Looking at the send man page, we can see that there are send and sendto syscalls, both of which take a void buf[len] as one of the arguments. This is essentially a byte buffer. There are more details to this, but this is what we need to know for now.
Since we’re writing this bit packer in Odin, we won’t directly call syscalls or use C for this. Let’s take a look at the Odin net package instead. The relevant procedures in the net package are:
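The two we care about are send_udp and recv_udp. Paraphrased from memory, so the exact parameter names and error types may differ from the current core:net, they look roughly like this:

```odin
// Paraphrased from core:net; exact names and error types may differ.
send_udp :: proc(skt: UDP_Socket, buf: []u8, to: Endpoint) -> (bytes_written: int, err: Network_Error)
recv_udp :: proc(skt: UDP_Socket, buf: []u8) -> (bytes_read: int, remote_endpoint: Endpoint, err: Network_Error)
```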
As we can see, both the send and receive UDP procedures take a byte slice []u8 as an argument. You can think of the send procedure as the writing side, and the recv procedure as the reading side. These are exactly the two sides our bit packing will have as well: we’ll have a byte buffer that we can write to and read from.
One important thing to mention about the bit packing implementation we’re writing is that we’re sending and receiving packets and data to and from potentially malicious actors. This is crucial context to keep in mind, as it places restrictions on what we can do in our implementation. In many cases, the trivial solution would be to just do a memcpy and call it a day. But in our case, where we can’t necessarily trust the bytes we would be copying, that’s a bad idea. We need to ensure that when reading, the data format is what we expect and is legal. If it’s not, we need to handle those cases as they arise. One last thing before we start implementing stuff: we’re gonna split the bit packing into two parts, a Bit Writer and a Bit Reader.
The Bit Writer
Finally we’re at what this blog post was supposed to be about. I hope it hasn’t been too bad so far, but I think the context is important. Let’s take a look at our Bit Writer struct then.
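Here’s a sketch of the Bit_Writer struct based on the fields we’ll discuss below; the exact field types are my assumption, and the real implementation in the repo may differ slightly.

```odin
Bit_Writer :: struct {
	buffer:       []u32, // the words we write into
	scratch:      u64,   // temporary storage for bits before a full word is flushed
	scratch_bits: u32,   // how many bits are currently sitting in scratch
	word_index:   u32,   // index of the word in buffer we're currently writing to
	num_bits:     u32,   // total capacity of the buffer in bits, len(buffer) * 32
	bits_written: u32,   // total number of bits written so far
}
```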
Alright, not too bad, but there are some weird things going on here. The buffer field makes sense, but why is it a slice of type []u32? We’re supposed to write bytes, not unsigned integers!
The main idea behind this is that modern CPUs like it when things are 4-byte / 32-bit (word) aligned. So when we write to the buffer, we’ll write 4 bytes / 32 bits at a time instead of a single byte at a time.
So we’re doing this in the hope of better performance, but we’ll pay for it with some added complexity. Now, I’ve done no benchmarking of this solution compared to one that just writes a byte at a time, so I am relying on Glenn Fiedler’s word here that it’s a good idea. It makes for an interesting implementation at least! Okay, we’re assuming it’s a good idea to write a word at a time, and we can always transmute the []u32 into a []u8 when we need to hand it to the socket, so that should be fine.
But how do we actually do the write one word at a time part? Let’s look at some different basic types and think a bit about it.
- u32 - This is the trivial case for us; it fits perfectly into the u32 values we’re writing.
- i32 - This is a signed integer, but that doesn’t really matter, since we’re not interested in the difference between the types, just that we can pack the bits tightly, which we can easily do in this case.
- u8 - Now this becomes a bit more problematic. There’s room for four u8s in a u32; if we only write one, then we’re only utilizing 25% of the word, not exactly fulfilling our goal of having as few dead bits as possible.
- u64 - Well, this doesn’t fit, does it? We’ll have to split it up into two separate u32 values.
- A Monster struct (sketched just below), which not only doesn’t fit into one u32, but also doesn’t align with a multiple of a u32. And we can’t just write the first 32 bits, then the last 8 bits, and then waste 24 bits.
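Since the Monster struct shows up throughout the rest of this post, here’s an illustrative sketch of it. The field names match what I use later; the exact definition is just for this example.

```odin
// 16 + 8 + 8 + 8 = 40 bits of payload: doesn't fit in one u32
// and doesn't line up with a 32-bit boundary.
Monster :: struct {
	id:      u16,
	health:  u8,
	stamina: u8,
	size:    u8,
}
```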
Alright, obviously the other fields in the Bit_Writer struct have something to do with this. The two fields that stick out the most are probably scratch and scratch_bits. The others can make sense without knowing a lot more about how the Bit_Writer works. The word_index is exactly what it sounds like: the index of the current word we’re writing to, important for keeping track of where we are. num_bits is the total number of bits in the buffer, essentially len(buffer) * 32, but for convenience we’ll keep it as a field. And at the end there is bits_written, which is used to keep track of how many bits we’ve written; this will be used in a later post when ensuring alignment and so on. We’ll see how the scratch and scratch_bits fields are used through examples.
We need to think a bit more about what it means to pack things as tightly as we can. This essentially means that we want to write only the bits that we need to represent the value we want. Think of the number 14; in binary that’s 1110, which means that we need 4 bits to represent it. Note that it cannot be represented in 3 bits, even though the 0th bit is 0, in the same way that we need the 0 in the decimal number 10 (ten) to represent it correctly.
So for packing purposes we only want to write these 4 bits, no more, since that’s all that is needed. This gives us a hint about how the write procedure for the Bit_Writer should look. It needs to take the value it should write, and how many bits it should write. Here’s the signature of the write_bits procedure for the Bit_Writer.
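The parameter names here are my choice, but the shape follows what’s described above.

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool
```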
For those of you that are unfamiliar with Odin, the ^ before the Bit_Writer means that writer is a pointer to a Bit_Writer. Also, we’re returning a bool here as part of the error handling.
Alright, this looks good: we can give it a u32 value, specify how many bits to use, and it returns a bool indicating whether it succeeded. Seems good.
Now, how do we deal with the cases we looked at earlier? Let’s take the case of writing a u8 to begin with.
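A minimal sketch of how that could look, assuming a hypothetical helper named write_u8 that simply delegates to write_bits:

```odin
write_u8 :: proc(writer: ^Bit_Writer, value: u8) -> bool {
	// Upcasting to u32 is always safe, and a u8 needs at most 8 bits.
	return write_bits(writer, u32(value), 8)
}
```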
This turns out to be pretty easy. Remember, we always know which type we’re writing, so in the case of a u8 we can just cast it to a u32, which is safe since we’re upcasting, and then we just write 8 bits. Pretty nice. But what if the value of the u8 does not need all 8 bits? Well, that requires a bit more finicky handling that we’ll deal with in a later post; for now we need to get the foundations down. In general, though, we’ll follow this method: anything that’s smaller than a u32 we’ll cast to a u32 and write with the bits that are needed. If it’s larger than 32 bits, we’ll have to break it up into chunks that can each fit into a u32. Also, it’s not that we’ll always be perfectly efficient and never waste a single bit; that can be pretty difficult sometimes.
Let’s take a look back at the case of the Monster struct. If we wanted to write that using the write_bits procedure, this is how I’d like it to look.
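A sketch of that, with the bit counts following the Monster fields from earlier; failures from write_bits are simply propagated.

```odin
write_monster :: proc(writer: ^Bit_Writer, monster: Monster) -> bool {
	if !write_bits(writer, u32(monster.id), 16)     { return false }
	if !write_bits(writer, u32(monster.health), 8)  { return false }
	if !write_bits(writer, u32(monster.stamina), 8) { return false }
	if !write_bits(writer, u32(monster.size), 8)    { return false }
	return true
}
```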
I can just pass each of the fields as the value parameter, since they can each fit into a u32. The problem is that together they do not fit into a single u32 in the buffer. The thing that makes sense is to write the first 32 bits of the struct into one word in the buffer, then the last 8 bits into the next word in the buffer. But it’s not as easy as it sounds: we need to keep track of the bits we’ve written, and if we’ve gone past the 32-bit boundary of a word we need to write those bits into the next word. Keeping track of how many bits we’ve written is the purpose of the scratch_bits field, and the scratch is meant to be temporary storage where we can collect the bits before we write them all together into a word in the buffer. In addition, we’re propagating the boolean value from write_bits so that whatever called the write_monster procedure knows whether we were able to write it successfully or not.
This part includes some bit masking, so if you don’t know what that is, I suggest quickly reading up on it. It’s a pretty simple case for us here, and I’ll try to explain it as well as I can.
The first issue we have when implementing the write_bits procedure is: how can we actually get the n bits from the value that’s passed into the procedure? In the example of the monster_id, we’re passing it as a u32 but it’s only a u16, and we’re specifying that we’re writing 16 bits. Which means we want to get the 16 lowest bits of the 32-bit value. We can do this with something called a bitmask, and to create the mask we’re gonna do a bit shift trick. We’re working on a little-endian system, and we’ll write binary numbers with the Least Significant Bit furthest to the right. The trick we’ll do to create the mask is the following.
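A sketch of the trick; I’m computing the mask in u64 here so that the bits == 32 case also works without relying on shift-overflow behaviour, which is an assumption on my part about how to keep the example safe.

```odin
mask := u32((u64(1) << u64(bits)) - 1) // e.g. bits = 16 gives 0x0000_FFFF
masked_value := value & mask           // keep only the lowest `bits` bits
```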
Let’s start with the inner 1 << bits part. This means that we’re shifting the value 1 left bits number of times; in our case of writing the monster_id, bits = 16. The value 1 can be represented in binary as the 32-bit value:
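```
0b0000_0000_0000_0000_0000_0000_0000_0001
```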
Let’s then left shift 1 by 16 bits:
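```
1 << 16 = 0b0000_0000_0000_0001_0000_0000_0000_0000
```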
Now this alone does not help us with getting the 16 lowest bits. For that we would like the bitmask, so that we can easily bitwise AND (&) it with the monster_id value to get the lowest 16 bits. This is where subtracting 1 comes in.
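```
(1 << 16) - 1 = 0b0000_0000_0000_0000_1111_1111_1111_1111
```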
If you’re not familiar with binary this might not make sense, but think of it like having the value 1000 in decimal, subtracting 1, and getting 999; in binary it’s all ones instead of nines. This binary value is stored in the mask variable. To actually get the 16 bits we want, we need to & the value variable that’s passed into write_bits with the mask.
The nice thing about what we did here is that it will work for all values of the bits parameter as long as it’s between 0 and 32, something we’ll have to ensure is checked later.
Our implementation of the write_bits procedure now looks like this.
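A sketch of where we are so far; the details are my own choices and the real code in the repo may differ slightly.

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool {
	// Keep only the lowest `bits` bits of the value.
	mask := u32((u64(1) << u64(bits)) - 1)
	masked_value := value & mask

	_ = masked_value // masked_value will actually be used in the next step

	return true
}
```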
The next step is to ensure that we can accumulate values up until we have 32 bits ready to be written into the buffer. To make this happen we’ll finally start using our scratch and scratch_bits fields.
Again, the idea of the scratch is precisely to accumulate these values until we’re ready to write them to the buffer, so let’s see how we are going to do that.
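Extending the sketch with the scratch accumulation:

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool {
	mask := u32((u64(1) << u64(bits)) - 1)
	masked_value := value & mask

	// OR the masked bits into the scratch, above whatever has been written so far.
	writer.scratch |= u64(masked_value) << writer.scratch_bits

	return true
}
```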
Here we’ve added the line writer.scratch |= u64(masked_value) << writer.scratch_bits, and there are a couple of things happening. The |= operator can be thought of as +=, except that we bitwise OR the value on the left with the value on the right and then store the result in the value on the left. The purpose of doing this bitwise OR is that we want to write the bits set in u64(masked_value) << writer.scratch_bits into the writer.scratch field.
This looks kind of like what we want. We could maybe think that all we need to do is writer.scratch |= u64(masked_value), but that’s not the case. The reason we need to left-shift the masked_value by writer.scratch_bits is that that’s how many bits we’ve written into the scratch so far. Think about the case of our Monster struct and the write_monster procedure we came up with earlier; let’s take a look at it here again for clarity.
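```odin
// The write_monster sketch from earlier, repeated for reference.
write_monster :: proc(writer: ^Bit_Writer, monster: Monster) -> bool {
	if !write_bits(writer, u32(monster.id), 16)     { return false }
	if !write_bits(writer, u32(monster.health), 8)  { return false }
	if !write_bits(writer, u32(monster.stamina), 8) { return false }
	if !write_bits(writer, u32(monster.size), 8)    { return false }
	return true
}
```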
If we start out with the first write_bits call here, we’d write the monster.id value and 16 bits. For the next call to write_bits we’d write the monster.health value and 8 bits. If we did not left-shift by the number of bits we’ve written so far, we’d just overwrite however many bits in the scratch we’re specifying in the write_bits call. This shows that we now need to keep track of those bits written in scratch_bits, so we extend the procedure.
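Extending the sketch to track the bits:

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool {
	mask := u32((u64(1) << u64(bits)) - 1)
	masked_value := value & mask

	writer.scratch |= u64(masked_value) << writer.scratch_bits

	// Track both how many bits sit in the scratch and how many we've written in total.
	writer.scratch_bits += bits
	writer.bits_written += bits

	return true
}
```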
So far this makes sense, except: why do we need both the scratch_bits and the bits_written fields? Well, it’s because we need to do some resetting of scratch_bits for every word, while bits_written will keep track of the total number of bits we’ve written so far. We’ll use bits_written in a later post.
So let’s think through this, again with the write_monster procedure as reference. We write the monster.id and 16 bits, then the scratch should keep whatever value that was in its lower 16 bits, and scratch_bits and bits_written should both be 16. Next we write the monster.health value. No problem: the scratch will contain the monster.health value in bits [16, 23], since we start at bit 0. The scratch_bits will have the value 16 + 8 = 24, and bits_written will have the value 16 + 8 = 24. Alright, let’s do the stamina pretty quickly. It’s stored in bits [24, 31] of the scratch, scratch_bits = 32 and bits_written = 32.
Now we’re perfectly aligned with a word (32-bit) boundary, so we should write this to the buffer. So let’s do exactly that before we continue writing the last monster.size field in the write_monster procedure.
At a high level, what we need to do is check whether we’ve reached the 32-bit boundary, and then somehow write the bits in the scratch into the buffer.
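Here’s the sketch with the word flush added; the check and the 0xFFFF_FFFF mask are what the next paragraphs walk through.

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool {
	mask := u32((u64(1) << u64(bits)) - 1)
	masked_value := value & mask

	writer.scratch |= u64(masked_value) << writer.scratch_bits
	writer.scratch_bits += bits
	writer.bits_written += bits

	if writer.scratch_bits >= 32 {
		// Take the lower 32 bits of the scratch and store them in the current word.
		writer.buffer[writer.word_index] = u32(writer.scratch & 0xFFFF_FFFF)
	}

	return true
}
```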
In this implementation we check if the scratch_bits is equal to or larger than 32, which means that we’re either perfectly on the boundary, or we have written past it, which obviously can happen. Think about a hypothetical case where the scratch_bits is 28 and we write another 8 bits, making scratch_bits = 28 + 8 = 36. This is essentially 4 bits into the next word, so that’s something our implementation also needs to deal with, but the newly added lines don’t totally cover it yet. We can see that we take the lower 32 bits of the scratch by ANDing it with the hex mask 0xFFFF_FFFF, which is the same as 0b1111_1111_1111_1111_1111_1111_1111_1111. This is actually the reason for my 4-spaced formatting of binary numbers: one hex digit is 4 bits, which makes it easy to see. Then we take those lower 32 bits and write them into the buffer at the index writer.word_index. This is the field that keeps track of which word we’re supposed to write to.
Okay, we’re almost at the end of implementing write_bits. The last things that remain are to ensure that we reset the state of the scratch and scratch_bits, and of course to increment the word_index. This is how we’ll do it.
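The sketch with the reset added:

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool {
	mask := u32((u64(1) << u64(bits)) - 1)
	masked_value := value & mask

	writer.scratch |= u64(masked_value) << writer.scratch_bits
	writer.scratch_bits += bits
	writer.bits_written += bits

	if writer.scratch_bits >= 32 {
		writer.buffer[writer.word_index] = u32(writer.scratch & 0xFFFF_FFFF)
		writer.word_index += 1
		// Keep any bits that spilled past the 32-bit boundary for the next word.
		writer.scratch >>= 32
		writer.scratch_bits -= 32
	}

	return true
}
```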
This addition is not too complicated. Like I said, we increment the word_index, then we right-shift the scratch by 32 bits. There is a pretty subtle but important detail about that: we can’t just set the scratch to 0, because remember, if we write past the 32-bit boundary, we need to keep those bits around for the next word. That is what this does. In the case we looked at, where we had written 36 bits before the if writer.scratch_bits >= 32 would trigger, this makes sure that we keep those 4 bits. This is the same reason we’re subtracting 32 from scratch_bits instead of setting it to 0: we need to know that we have written 4 bits into the next word. And that’s pretty much it! Well, except error handling of course, let’s add that.
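And finally, a sketch with the checks added; the exact conditions in my repo may be stricter, but these follow the description below.

```odin
write_bits :: proc(writer: ^Bit_Writer, value: u32, bits: u32) -> bool {
	// We can never write more than 32 bits at a time.
	if bits > 32 {
		return false
	}
	// Writing 0 bits is always allowed and does nothing.
	if bits == 0 {
		return true
	}
	// Don't write beyond the number of bits the buffer can hold.
	if writer.bits_written + bits > writer.num_bits {
		return false
	}

	mask := u32((u64(1) << u64(bits)) - 1)
	masked_value := value & mask

	writer.scratch |= u64(masked_value) << writer.scratch_bits
	writer.scratch_bits += bits
	writer.bits_written += bits

	if writer.scratch_bits >= 32 {
		writer.buffer[writer.word_index] = u32(writer.scratch & 0xFFFF_FFFF)
		writer.word_index += 1
		writer.scratch >>= 32
		writer.scratch_bits -= 32
	}

	return true
}
```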
These checks mostly verify that we’re not trying to write an illegal number of bits. If we’re writing 0 bits, we can just return true immediately; the idea is that we should always be allowed to write 0 bits. It doesn’t really make sense, but returning false isn’t really correct either. The last check is the most complicated one: it essentially checks that we’re not gonna write beyond the number of bits that the buffer can hold.
This is it for the write_bits procedure; we’ll probably not touch this for a really long time now, if at all. It’s also a really important piece to understand, since it’s a major workhorse in the bit packing and eventually the game network protocol. It’s worth noting that we should write a lot of tests for this, which I have in my implementation, but that’s too tedious to deal with here. Let’s continue with the read_bits procedure; we should be able to do that one pretty quickly now, since it’s basically doing the reverse of writing. At the end we’ll look at the flush_bits procedure, something that’s needed to make sure that we’ve written all the bits that might be left in the scratch to the buffer.
The Bit Reader
The Bit Reader is very similar to the Bit Writer. In fact, it has essentially the same fields in its structure, and it’s used pretty much the same way, except we’re reading from the buffer instead of writing into it.
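A sketch of the Bit_Reader struct; I’m assuming a bits_read counter in place of the writer’s bits_written, matching how it’s used below.

```odin
Bit_Reader :: struct {
	buffer:       []u32, // the words we read from
	scratch:      u64,   // bits pulled in from the buffer, not yet handed out
	scratch_bits: u32,   // how many unread bits are left in scratch
	word_index:   u32,   // index of the next word in buffer to read from
	num_bits:     u32,   // total capacity of the buffer in bits
	bits_read:    u32,   // total number of bits read so far
}
```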
Let’s take a look at the implementation, which is also quite similar but has some important distinctions, naturally.
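A sketch of read_bits, following the walkthrough below; as with the writer, the parameter names and the exact checks are my own choices.

```odin
read_bits :: proc(reader: ^Bit_Reader, bits: u32) -> (value: u32, ok: bool) {
	// The same kind of sanity checks as the writer.
	if bits > 32 {
		return 0, false
	}
	if bits == 0 {
		return 0, true
	}
	if reader.bits_read + bits > reader.num_bits {
		return 0, false
	}

	// Not enough unread bits in the scratch: pull in the next word from the buffer.
	if reader.scratch_bits < bits {
		reader.scratch |= u64(reader.buffer[reader.word_index]) << reader.scratch_bits
		reader.scratch_bits += 32
		reader.word_index += 1
	}

	// Read out the lowest `bits` bits, then shift them out of the scratch.
	mask := (u64(1) << u64(bits)) - 1
	value = u32(reader.scratch & mask)
	reader.scratch >>= bits
	reader.scratch_bits -= bits
	reader.bits_read += bits

	return value, true
}
```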
The signature looks very similar, except now we’re returning multiple values: the value we’re reading from the buffer and whether that read was successful. The bits here is now how many bits we’re gonna read from the buffer.
We have a lot of the same checks to ensure that the Bit_Reader is in a state where it makes sense to read from. For the details, it’s easier if we use a concrete case like we did with the Monster struct and write_monster for the Bit_Writer. So let’s make a read_monster procedure now, and think through how it would go.
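Here’s a sketch of read_monster, mirroring write_monster; the helper shape and struct literal are my own choices.

```odin
read_monster :: proc(reader: ^Bit_Reader) -> (monster: Monster, ok: bool) {
	id,      id_ok      := read_bits(reader, 16)
	health,  health_ok  := read_bits(reader, 8)
	stamina, stamina_ok := read_bits(reader, 8)
	size,    size_ok    := read_bits(reader, 8)

	if !(id_ok && health_ok && stamina_ok && size_ok) {
		return Monster{}, false
	}

	monster = Monster{
		id      = u16(id),
		health  = u8(health),
		stamina = u8(stamina),
		size    = u8(size),
	}
	return monster, true
}
```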
When we read the id field in the read_monster procedure, our scratch_bits, scratch and word_index are all initialized to 0. So scratch_bits is definitely smaller than bits, and we immediately go into that part of the code. There we fill the scratch with the current word from the buffer, left-shifted by the amount of scratch_bits; for the case of the id field, this shift will be 0. We’ll understand better why this left-shift is needed soon.
Next, we add 32 to the scratch_bits, so in our case scratch_bits now holds the value 32. Then we increment the word_index by 1, since we have read a whole word into the scratch, so that makes sense.
After the if section we create the bit mask, the same way we did for the writer, but now we use it to retrieve the lowest bits (the number we asked for) from the scratch instead of writing into it.
Then we right-shift the scratch by bits and store that into the scratch again. This is done so that the lower bits we’ve already read are now overwritten by the remaining upper bits; this is important.
Next we subtract the number of bits we’ve read from the scratch_bits, so in our case of reading the id field, scratch_bits will now hold the value 16, since 32 - 16 = 16. For the case of the reader, we can think of scratch_bits as a value that tells us how many bits we have left to read before we reach the next word, where 0 means that we have reached the next word boundary. Finally we increment bits_read, and return the value and the boolean true for success.
Let’s quickly do a couple more iterations to get to another interesting case. The next field in the Monster struct is the health field, which is a u8, so we need to read 8 bits. This time scratch_bits is larger than bits, so we skip the if and just get the value out as normal, subtract the bits, increment bits_read, and return true. Next up is the stamina field, also a u8, so we read 8 bits. This time scratch_bits and bits are exactly equal, so we skip the if block this time around as well; it’s also worth noting that this is exactly at the next word boundary.
Finally we read the size field, which is also a u8, but now we’re back at scratch_bits being smaller than bits, so we need to do that whole thing again. We read the word from the current word_index, and this is important to note: the word_index points at the next word to be read from. At the beginning it was 0, then we read all of that, now it’s 1, and we’ll increment it to 2 at the end of this if block, and add 32 to the scratch_bits. Then we do the usual end-of-procedure stuff.
The final thing we still don’t fully understand is what happens when things don’t align so well. Imagine a situation where scratch_bits is 8, so we’re 8 bits away from a word boundary, but we then read 16 bits: 8 bits are in the current scratch from the word we’ve been processing, and the next 8 bits are in the next word. In that situation we read in the whole word at the boundary and left-shift it by 8 bits. The purpose of the left-shift now is to place the incoming word above the lower scratch_bits bits we haven’t processed yet, so that they don’t get overwritten. Also, it’s worth noting that the scratch_bits in this case will hold the value 32 + 8 = 40. Then we read out the value using the mask, which will take the lower 16 bits. Then we right-shift, and we subtract bits from the scratch_bits again, making sure that we keep track of how many bits are left until the next word boundary, which is 40 - 16 = 24. Which makes sense, right? We were 8 bits from the boundary before beginning and we read 8 bits into the next word, so we should be 24 bits away from the word after that.
That’s it, that’s all we’ll say about the Bit_Reader, and it’s quite a lot for a small procedure, but as you can see, there is a lot going on with those bit manipulation tricks. I hope I have explained it well. Naturally this procedure also requires a lot of testing to build up trust in its correctness, which we’ll leave out from here.
Flushing
The last thing we have to do before both the writing and reading of bits is complete is to ensure that all the bits in the writer have been flushed. This is only a problem for the Bit_Writer and not the Bit_Reader. The reason is that what’s left in the scratch in the case of the Bit_Writer is bits we’ve specifically written into it, while for the Bit_Reader those bits are just whatever is left of the final word we’ve read from. Luckily for us, the flush_bits procedure is pretty simple.
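A sketch of flush_bits, assuming the same kind of legal-state check as write_bits; the real procedure in the repo may differ slightly.

```odin
flush_bits :: proc(writer: ^Bit_Writer) -> bool {
	if writer.scratch_bits > 0 {
		// Make sure there is still a word left in the buffer to flush into.
		if int(writer.word_index) >= len(writer.buffer) {
			return false
		}
		// Write whatever is left in the scratch into the current word.
		writer.buffer[writer.word_index] = u32(writer.scratch & 0xFFFF_FFFF)
		writer.word_index += 1
	}

	writer.scratch = 0
	writer.scratch_bits = 0
	return true
}
```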
Let’s briefly discuss this before packing things up (hehe). Firstly, we check if the writer is in a legal state, as usual. Then we check if scratch_bits is larger than 0; if it is, it means that we have something left in the scratch that we haven’t written to the buffer. If there’s something left, we mask out the lower 32 bits of the scratch and write them into the word in the buffer we’re currently writing to, and then increment the word_index. Lastly we set the scratch and scratch_bits to 0 and return true.
Now, I am a bit torn about the necessity of incrementing the word_index and resetting the scratch and scratch_bits, because after we have flushed we have introduced a synchronization issue between the Bit_Writer and the Bit_Reader. This is an important thing to note: we must always make sure that the writer side and the reader side expect the exact same number of bits in the same order. Which means that if only the Bit_Writer flushes, then we’ve broken that synchronization. Just imagine if the writer wrote the first 8 bits as the health, then the next 16 as the id, and the reader did the opposite. That’s gonna be chaos, but write_bits and read_bits won’t complain. This is why we’ll only use these as primitive procedures and build more advanced serialization procedures on top of them, which we can then write good tests for to ensure everything works as expected. A little foreshadowing to the next blog post maybe…