Pippijn - Programming / C / Strings

String representations in C

What is a string?

That is a very important question. Most programmers think that strings represent text. This is not the real meaning of the word 'string'. Text is usually represented by a byte string, but does not have to be. It may as well be represented by an octet string or a 32 bit integer string.

Different representations

NUL terminated

This is the representation most commonly used in the C programming language. All string literals are NUL terminated. C has two kinds of string literals: character and wide character string literals. The latter is written L"text" and is of type wchar_t const*. The representation in memory of "text" would be equivalent to:

{ 't', 'e', 'x', 't', 0 }
// In ASCII values:
{ 0x74, 0x65, 0x78, 0x74, 0 }
// In EBCDIC values:
{ 0xa3, 0x85, 0xa7, 0xa3, 0 }

L"text"'s representation is even more platform dependent and could be:

{ L't', L'e', L'x', L't', 0 }
// In ASCII values:
{ 0x0074, 0x0065, 0x0078, 0x0074, 0 }

This is not surprising. More surprising would be the representation of "☺" and L"☺":

"☺"
{ 0x3f, 0x98, 0xba, 0 }
L"☺"
{ 0x263a, 0 }

The former is the UTF-8 octet-representation of the smiley face, the latter is the Unicode code point in a wchar_t string. This is all highly platform dependent and could be all different for you. This text, however, is not about platform dependentness of string literals. It is about how strings can be represented in memory.

Advantages

Low memory consumption
There is no need to store the length of a string anywhere, because that can be found easily by iterating over the characters and increment a counter until a NUL character is found.
C standard library compatible
The C standard library functions operate on such NUL terminated strings, so conversion is not required.

Disadvantages

Cannot hold NUL characters
Due to this string depending on the special NUL character denoting its end, it cannot contain NUL characters itself. A string like "Hello\0World" would be interpreted as "Hello".

struct string

A more sophisticated string representation is the following struct, frequently used by C++'s std::string implementations:

struct string
{
  char *rep;
  size_t len;
  size_t max;
};

string.rep is the internal representation of the string, which is just a data pointer and not necessarily a NUL terminated C string. len holds the current length and max is the number of bytes this string can hold before it needs a reallocation.

Advantages

Can hold any binary data
This string does not depend on special characters marking its end, so it can hold such characters without any problem.
Constant time strlen
The strlen equivalent for this string runs in constant time, because the string maintains a variable holding the length.

Disadvantages

Not standard library compatible
In order to use C standard library functions such as strcat or printf, this string needs to be converted to a NUL terminated C string. That can be done either by copying the string and omitting NUL characters or by making sure the string is at least NUL terminated and ignoring inner NUL characters. C++'s std::string chose to do the latter.

Netstring

Netstrings are strings in the form

[length]":"[string]","

They are stored in octet strings.

Advantages

Can hold any binary data
Just like the struct string above, this string stores its size along with the data.
Almost constant time strlen
Finding the length of a Netstring is as easy as finding the ':' in the string and converting the text before it to an integer.
Can be sent over network without conversions
This string does not use byte order dependent data, so it can be sent and received verbatim over network.

Disadvantages

High memory consumption
This string stores the string representation of the length, which may need up to 11 bytes of memory in addition to the ':' and ',', so up to 13 bytes are spent on the length only.

String with size prepended

This is a char* object that looks just like any normal C string, except that it has a hidden size prepended to it. In memory, a "Hello" string would look like this:

// Big endian ASCII
0x00, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f
// Little endian ASCII
0x05, 0x00, 0x48, 0x65, 0x6c, 0x6c, 0x6f

The pointer actually points at the third character, so the length is hidden from normal functions and only the special functions that destructively operate on the string's representation know about its existence.

Advantages

Low memory consumption
This string type only takes 2 extra octets of memory to be able to represent any string up to a length of 65535 characters.
Can hold any binary data
Like the Netstring, this string prepends its length to its actual data, which allows the string to hold any string of characters, including NUL.
Constant time strlen
The length is known and already in host integer representation, so finding the length is a constant time operation.
Standard library compatible
As long as no NUL characters exist within the string, it can be passed to standard library functions such as strcmp, making many operations on this type of string trivial.

Disadvantages

Length stored in host byte order
If you want to send this string over a network channel, the size needs to be converted to network byte order, unless the host is big endian. The string is therefore slightly harder to handle.

Note that you can prepend anything to the actual string data, as long as you remember the offset. This technique is commonly used in hashed strings that contain their own hash code, making getting the hash value a constant time operation.

If you do it smartly, you can minimise storage overhead by storing up to 0x7f in the last byte and look at the previous byte if it is greater than 0x7f. If the previous byte is also greater than 0x7f, proceed with the next previous byte, and so on. For further details, see the wikipedia article on BER.

Pippijn van Steenhoven

Menu

String representations in C

What is a string?

Different representations

NUL terminated

Advantages

Disadvantages

struct string

Advantages

Disadvantages

Netstring

Advantages

Disadvantages

String with size prepended

Advantages

Disadvantages