String representations in C
What is a string?
That is a very important question. Most programmers think that strings represent text. This is not the real meaning of the word 'string'. Text is usually represented by a byte string, but does not have to be. It may as well be represented by an octet string or a 32 bit integer string.
Different representations
NUL terminated
This is the representation most commonly used in the C programming language.
All string literals are NUL terminated. C has two kinds of string literals:
character and wide character string literals. The latter is written
L"text"
and is of type
wchar_t const*
. The representation in memory of
"text"
would be equivalent to:
{ 't', 'e', 'x', 't', 0 } // In ASCII values: { 0x74, 0x65, 0x78, 0x74, 0 } // In EBCDIC values: { 0xa3, 0x85, 0xa7, 0xa3, 0 }
L"text"
's representation is even more platform dependent and could be:
{ L't', L'e', L'x', L't', 0 } // In ASCII values: { 0x0074, 0x0065, 0x0078, 0x0074, 0 }
This is not surprising. More surprising would be the representation of
"☺"
and
L"☺"
:
"☺" { 0x3f, 0x98, 0xba, 0 } L"☺" { 0x263a, 0 }
The former is the UTF-8 octet-representation of the smiley face, the latter is
the Unicode code point in a
wchar_t
string. This is all highly platform
dependent and could be all different for you. This text, however, is not about
platform dependentness of string literals. It is about how strings can be
represented in memory.
Advantages
- Low memory consumption
There is no need to store the length of a string anywhere, because that can be found easily by iterating over the characters and increment a counter until a NUL character is found.
- C standard library compatible
The C standard library functions operate on such NUL terminated strings, so conversion is not required.
Disadvantages
- Cannot hold NUL characters
Due to this string depending on the special NUL character denoting its end, it cannot contain NUL characters itself. A string like
"Hello\0World"
would be interpreted as"Hello"
.
struct string
A more sophisticated string representation is the following struct, frequently
used by C++'s
std::string
implementations:
struct string { char *rep; size_t len; size_t max; };
string.rep
is the internal representation of the string, which is just a
data pointer and not necessarily a NUL terminated C string.
len
holds the
current length and
max
is the number of bytes this string can hold before
it needs a reallocation.
Advantages
- Can hold any binary data
This string does not depend on special characters marking its end, so it can hold such characters without any problem.
- Constant time strlen
The
strlen
equivalent for this string runs in constant time, because the string maintains a variable holding the length.
Disadvantages
- Not standard library compatible
In order to use C standard library functions such as
strcat
orprintf
, this string needs to be converted to a NUL terminated C string. That can be done either by copying the string and omitting NUL characters or by making sure the string is at least NUL terminated and ignoring inner NUL characters. C++'sstd::string
chose to do the latter.
Netstring
Netstrings are strings in the form
[length]":"[string]","
They are stored in octet strings.
Advantages
- Can hold any binary data
Just like the
struct string
above, this string stores its size along with the data. - Almost constant time strlen
Finding the length of a Netstring is as easy as finding the ':' in the string and converting the text before it to an integer.
- Can be sent over network without conversions
This string does not use byte order dependent data, so it can be sent and received verbatim over network.
Disadvantages
- High memory consumption
This string stores the string representation of the length, which may need up to 11 bytes of memory in addition to the ':' and ',', so up to 13 bytes are spent on the length only.
String with size prepended
This is a
char*
object that looks just like any normal C string, except
that it has a hidden size prepended to it. In memory, a "Hello" string would
look like this:
// Big endian ASCII 0x00, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f // Little endian ASCII 0x05, 0x00, 0x48, 0x65, 0x6c, 0x6c, 0x6f
The pointer actually points at the third character, so the length is hidden from normal functions and only the special functions that destructively operate on the string's representation know about its existence.
Advantages
- Low memory consumption
This string type only takes 2 extra octets of memory to be able to represent any string up to a length of 65535 characters.
- Can hold any binary data
Like the Netstring, this string prepends its length to its actual data, which allows the string to hold any string of characters, including NUL.
- Constant time strlen
The length is known and already in host integer representation, so finding the length is a constant time operation.
- Standard library compatible
As long as no NUL characters exist within the string, it can be passed to standard library functions such as strcmp, making many operations on this type of string trivial.
Disadvantages
- Length stored in host byte order
If you want to send this string over a network channel, the size needs to be converted to network byte order, unless the host is big endian. The string is therefore slightly harder to handle.
Note that you can prepend anything to the actual string data, as long as you remember the offset. This technique is commonly used in hashed strings that contain their own hash code, making getting the hash value a constant time operation.
If you do it smartly, you can minimise storage overhead by storing up to 0x7f in the last byte and look at the previous byte if it is greater than 0x7f. If the previous byte is also greater than 0x7f, proceed with the next previous byte, and so on. For further details, see the wikipedia article on BER.