Pippijn - Programming / Data-Hiding

Information hiding

This text is about the principle known as information hiding, data hiding or (not entirely correctly) encapsulation. In our examples, we use the C programming language, because it fits the principle best of all. This is a bit ironic, because C of all languages widely used today is, assembly language not counted, said to be the least object oriented language.

Introduction

Client code
is all the code that uses the data types and operations defined on them but does not operate on the internal structures of those types.
Information hiding
is a principle of hiding implementation details from client code. This is achieved by writing accessor functions as required.

A simple example of information hiding would be a string object:

struct string
{
  char *rep;
  size_t len;
  size_t capacity.
};

This struct knows its length and has a data pointer pointing at the in-memory string representation. It also knows its maximum capacity, so it can allocate more memory on demand. In order to avoid corruption of the internal state (in this case, the len member), we write accessor functions, for example:

size_t strLength (struct string const *str);
char const *strToCStr (struct string const *str);
char strCharAt (struct string const *str, size_t pos);
void strSetCharAt (struct string *str, size_t pos, char c);
void strAddChars (struct string *str, char const *chars);

These functions operate on the data structure declared above. Functions that take a pointer-to-const promise not to modify the structure.

Information hiding means to hide as much knowledge as possible. As part of this, we could define new type aliases to hide the fact that a string is represented by a pointer to a string structure:

typedef struct string *String;
typedef struct string const *ConstString; // Immutable string

By doing this, we allow the representation to change at any time. For example, we could choose to represent String objects by a simple data pointer not knowing its size or a data pointer prepended by its size.

See <a href="c/strings">C string representations</a> if you are interested in other string representations.

Now, our accessor functions look like this:

size_t strLength (ConstString str);
char const *strToCStr (ConstString str);
char strCharAt (ConstString str, size_t pos);
void strSetCharAt (String str, size_t pos, char c);
void strAddChars (String str, char const *chars);

We did not only abstract away some knowledge, we also reduced our required typing.

Violations of the principle

There are various more or less obvious ways to violate the principle of information hiding. An obvious example would be directly accessing the structure's members:

// Wrong:
void
OnAdded (String str)
{
  strncat (str->rep, " has been added", str->capacity - str->len);
  str->len = strlen (str->rep);
}
// Right:
void
OnAdded (String str)
{
  strAddChars (str, " has been added");
}

The function strAddChars hides the knowledge about String's internal structure and abstracts away the logic needed to add characters to the string. The resulting code is more descriptive and less error prone. Instead of kludging our own string concatenation operation every time, we write a single testable function that does it for us. Our own version in "Wrong" truncates the string, if it does not have enough room for the added text.

A less obvious violation of information hiding would be pointer operations on String objects. Consider our String to be defined as char*. This type allows several operations that are built into C:

typedef char *String;
typedef char const *ConstString;

void
someFunc (String str)
{
  printf ("The second character is `%d'\n", str[1]); // Index
  str += 20; // Addition
  printf ("The twenty-second character is `%d'\n", str[1]); // Index
  printf ("The twenty-first character is `%d'\n", *str); // Dereference
  String newStr = str - 10; // Subtracting integer
  printf ("The difference between newStr and str is %d\n", newStr - str); // Pointer difference
}

All these operations are defined on char*, and therefore also on String. This does not have to be the case, though. If String was defined as the struct string* earlier, all of these operations would cause a compilation error. Using these operations requires knowledge about the internal representation of String and is therefore in violation of the information hiding principle.

Enforcing encapsulation in C

C provides a very good way to enforce information hiding. We call this opaque pointers. Opaque in this sense means, we cannot touch or even see the internal structure. C allows us to forward declare structures and pass around pointers to them:

struct string;
typedef struct string *String;
typedef struct string const *ConstString;
// All of the above accessor functions still work
// The OnAdded function marked with "Right" also still works, but the one
//   marked "Wrong" will cause a compilation error

String is now what we call a pointer to an incomplete type. Incomplete means that it does exist as a type, but its representation is not known to the compiler. Using opaque pointers is a good way of preventing direct member access in client code.

Advantages

There are definite advantages of information hiding and those are good reasons to apply it to your code.

Less changes in client code
If we change a name in a structure, for example if we wanted to rename our len to length, because we think it is more descriptive, and client code depended on the variable being called len, we will have to modify all that code. Automatic refactoring tools help, but none of them are perfect. If no client code depended on this fact, there is a lot less to modify with such a change. We inofficially call renamings like this API breakage.
Faster recompilation and incremental builds
If you use a compiler cache such as ccache or if you use incremental build tools such as make, you will have noticable speedups when using opaque pointers.
Better binary compatibility
If you decide to add another member to the string structure containing information about the number of reallocations, because you want to find the optimum initial string size for a certain application, you can do so, but if client code directly accessed the size and offset information of the structure, all such code needs to be recompiled. This is inofficially called ABI breakage.

Disadvantages

This all sounds great and we may ask ourselves, why doesn't everybody encapsulate everything. There are also disadvantages to be considered when designing software.

Speed
This is probably the most considered and least important disadvantage. I say most considered, because people often tend to ignore good practice and prematurely optimise their code by not encapsulating data properly. Consider the code required to access a member of a structure directly and the code required to call a function doing the same thing. A member access is a full word addition and a dereference on x86 platforms. A function call is at least a full word decrement, four unconditional moves, two stack pushes, a dereference, an addition, an increment, two stack pops, an unconditional jump and a call return. A lot more to do for our poor little Pentium with 3 GHz.

But our poor little Pentium with 3 GHz will do all that within 70 clock cycles, which is about 10 cycles more than a nop instruction. This is why I say, it is the least important disadvantage. If you are writing speed critical, embedded software or real-time applications, you will ignore this, but then you will also ignore any other principles known to be good practice.
Code bloat
The above argument indirectly contains the code bloat argument. All those instructions needed to call a function returning or setting the member's value exist in object code after compilation. If you have a large structure with accessors for each member, you will substantially increase code size. Modern computers will gladly handle large binaries, so this is an argument for embedded developers only.
Dynamic allocation
In the case of opaque pointers, all objects need to be allocated on the free store (using malloc), There are situations in which this is not viable, for example in embedded systems such as mp3 players, In certain environments, the use of malloc is either prohibited due to its nondeterministic nature with regards to operation time or simply not implemented. In this case, the structure needs to be visible in all locations where it is allocated, so a macro such as the following might help:
```
#define MEW_LOCAL(TYPE, NAME) TYPE NAME##_; TYPE *NAME = NAME##_
```
Or one could just declare the variable locally and pass its address to its operations,

In addition to that, one could name all members " membername_" (trailing underscore) in order to be able to easily spot violations of the principle, visually. It also makes the author of client code think twice.

The compromise

In C, there are a few ways to speed up and shrink the code without violating information hiding. One of these is the use of macros. Instead of defining a function:

size_t strLength (ConstString str);

we define a macro doing the same:

#define strLength(s)  (s)->len

This has disadvantages, as well. Now, you cannot use opaque pointers anymore, no longer avoiding ABI breakages and speeding up incremental builds. It is no longer type-safe, as you could pass any structure with a len member to this macro. It still avoids API breakages, though. This way, you can have some form of encapsulation even in speed critical applications.

Information hiding in other languages

I chose C, as I said, because it is most suitable as example for encapsulation. This does not mean that other languages do not have equally good or even better ways to encapsulate data.

C++
has the pimpl idiom. This idiom is basically the same as using opaque pointers in C, except that the accessor functions are wrapped in a class, making the code look more object oriented.
Java
tends to make all variables private and provide getters and setters for each of them. C has no notion of private, protected or public. These modifiers make the code look even more object oriented. Due to the fact that Java has no references in the C++ sense, accessors cannot be made as natural as in C# or C++.
C#
has properties, which syntactically behave exactly like public member variables, but in fact execute small (or large) portions of code to validate input or invoke events on modification.
Perl
Perl's objects are references blessed into a package. This means that all methods called on the object are looked up in that package. Most common references are hash references, allowing named data members. $ob->{member} is a way to access this data, but it is discouraged. Instead, methods are used: $ob->method.

Pippijn van Steenhoven

Menu