Information Representation

It’s important from time to time to go back to the basics. You most probably already know that computers use binary digits (bits) to represent everything they operate on. And you might be wondering why I’m considering this topic to be “advanced” and “mysterious”. Unfortunately, any simple topic has some mysterious corners that are hard to see and so people will easily omit them. This lack of knowledge will later appear in the form of bugs and security vulnerabilities that are not only difficult to detect, understand, and fix, but also exponentially costlier with respect to time.

Let’s start at the beginning. Any finite sequence of bits, say 10111001, can be represented and manipulated by a computer. However, this sequence of bits per se does not actually represent information because it does not have a specific meaning. It could represent an integer, a real number, a character, a machine-level instruction, a character string, or anything. That’s what we call data. It is just a sequence of symbols that does not have a specific meaning. Even if we know that the two sequences of bits 1010 and 1000 represent numbers, we still cannot add them. That’s because we don’t know how to interpret these bits to construct a number.

This leads to the conclusion that it’s not enough to build a machine that can represent sequences of bits because it will not be able to correctly operate on these bits unless we  build into it a way to interpret them. From this, we can define digital information as bits + interpretation. For the processor to execute an instruction, it must first get the operands from memory then execute the instruction according to the built-in interpretation. In order to get the operands, the processor must have a way to “address” them so that the memory module can locate them. I will focus the rest of the article on addressing and talk about interpretations in later articles.

If you are an expert, things will seem to be very trivial at the beginning, but I promise you that it will quickly get insanely mysterious. Many things seem to be simple. But after careful deliberation, we discover that they are not quite simple. If you already know these things but you have few gaps in your understanding, then you will also be enjoying. However, if you are a beginner, then be warned that the difficulty level will escalate quickly and so you might not be able to catch up.

A bit can have only two values (0 or 1) and therefore we cannot use a single bit to represent any useful information. So it’s more convenient for humans and machines to work with groups of bits. Hardware designers and computer scientists chose group size of 8 bits because they believe this is the minimum number of bits that is most convenient to work with.  A sequence of 8 bits is called a byte. Mortals use hexadecimal notation because it is more compact and they can easily covert between the two.

For computers, this means that the minimum number of bits that they can transfer and manipulate is 8. So a byte is the smallest addressable unit of memory. In most operating systems, the array of bytes that is accessible by a machine-level program is called virtual memory. Now since every byte of this array is addressable, then there must be a unique address for each. The address by itself could be anything, what matter is that it’s unique. The set of all addresses of the virtual memory of a program is called the virtual address space. The term “virtual” is used because of the fact that the addresses used by the program are not the actual addresses that will be used by the memory module to access the required byte. Also the byte itself could be stored in any memory module of the computer. The mapping from virtual addresses to physical addresses is performed by the operating system and the processor. I will talk more about this in another article.

The earliest processors are able to operate on operands of sizes no more than a byte. For example, the add instruction can only add two bytes. The reason for this is that the data bus was thin and storage locations in the register file were small and cannot store more than one byte. Processors quickly evolved and became capable of operating on operands of larger sizes. Most processors today can operate on 32-bit or 64-bit operands. Slowly but shortly the latter will dominate the former.

The size of a general-purpose register determines the maximum number of bits that the standard machine-level instructions can operate on. In particular, the accumulator register (which usually also determines the size of the data bus). The size of this register determines what’s known as the word size which we will denote as w. Typically, but not always, the word size is the same as the size of the address bus, meaning that addresses range from 0 to 2^w-1. So if w = 32, then the size of the virtual address space is 4 GB or approximately 4*(10^9) bytes. This also means that a program has access to at most 4 GB at any instant in its lifetime. I will assume this assumption to be true throughout this article.

Consider 32-bit processor and a 32-bit value stored in memory. The value is stored in 4 bytes and each of these bytes has an address. How will the processor address such multibyte objects? And how will the bytes be ordered in memory? Fortunately, most processors store multibyte objects contiguously in memory and consider the address of the object to be the smallest address of its bytes. For example, if we have a 4-byte variable x and the address of the variable &x is 10 then the addresses of its bytes would be 10, 11, 12, and 13.

Unfortunately, processors from different companies and even from the same company follow different byte orderings (byte ordering is also known as endianness denoting which end of the object comes first). There are two common conventions. Some processors store an object from the least significant byte to the most significant byte, which means that the address of the object is the address of the least significant byte. This convention is called little-endian and followed by most Intel machines. Other processors store an object from the most significant byte to the least significant byte. This convention is called big-endian and followed by most machines from IBM and Oracle. Many recent processors are bi-endian, meaning that they can be configured to operate either way. For example, ARM, PowerPC, and IA-64 are bi-endian architectures.

For example, suppose we want to store the value 5A0193FD at memory addresses 10 to 13. No matter what the ordering is, the address of the value will be the smallest address which is 10. The ordering determines what is stored at that address. A little-endian machine will store the least significant bye FD at 10 and the most significant byte 5A at 13. So it will appear in memory in this order: FD 93 01 5A, the reversed order we are used to. On the other hand, a big-endian machine will store that most significant byte 5A at 10 and the least significant byte at 13. So it will appear in memory in this order: 5A 01 93 FD, the same order we are used to.

The little-endian ordering has the property that the same value can be read from memory at different lengths using the same address. For example, consider a 32-bit memory location with bytes 8D 00 00 00. The same value can be read at the same address as 8-bit (8D), 16-bit (008D), 24-bit (00008D), and 32-bit (0000008D). In rare circumstances, this property can be used to optimize machine-level code and it is actually exploited by some code optimizers. On the other hand, big-endian ordering enables us to obtain an approximation of the value by reading few bytes at the specified address. Endianness also affects hardware design. In most cases, these properties have tiny value and so both orderings are technically equal.

Endianness is problem in two common cases: when transferring a bit stream from one computer to another through a network and when transferring files using some media. The first case is solved by a standard network byte order that all networking protocols must adhere to. The standard byte order is big-endian. Unfortunately, not all networking protocols use big-endian ordering. The second case depends on the format of the file. Text files, for example, can optionally start with a byte order mark (BOM) to determine the endianness of the file (also the encoding of the text). If BOM is missing then the application that will read the file must assume an endianness, probably the same as the machine’s. For binary files, some of them also have some form of BOM while others follow a standard ordering independently of the machine. You might have come across the problem where a text file on one machine is properly displayed but not on another machine. Byte ordering might be one of the reasons.

There are other less common byte ordering conventions collectively known as middle-endian or mixed-endian, where the size of the end is other than one byte, such as 16-bit. Also when transmitting a bit stream using a wire, bit-endian ordering makes sense.

The following C function takes an address to an object and the size of that object in bytes and prints these bytes in hexadecimal notation. It helps us recognize the byte ordering being used.

void print_bytes(unsigned char *address, size_t size)
    size_t i;
    for (i = 0; i < size; i++)
        printf(" %.2x", address[i]);

For example, if we have a variable x of type int then we can call the function as follows:

 print_bytes((unsigned char *)&x, sizeof(x)); 

In (hypothetical) future articles, I will discuss interpretations. In particular, signed and unsigned integers, floating-point numbers, and how computers do arithmetic on them.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s