Saturday, May 19, 2007

Casting Pointer Types to Integral Types in Beautiful C/C++

The somewhat intuitive question I should answer before talking about this issue is: why do we need to cast a pointer type to an integral type?
Ok, there is an answer. A good one too. And I am not writing this article just to talk about unthinkable things that C/C++ let us do :-D
The need for casting a pointer type to an integral type is in a hashing function where the pointer is the key. This is the short answer, to make it more clear I will describe what a hashing function and a hash table are. Then I will answer the question that some readers may ask: Again, why do we need to use a pointer as the key to a hash?

The Hash:
A hash table is simply an array that you can access with indexes that are not integers (e.g. strings, floats, user-defined types, etc...)
Lets say we have an addressbook with two attributes; name and email. We want to access the address book by giving it a name, and expecting and email for return value.
One way we could do this is by adding all address book entries to a linked-list. But then every time we want to do a "find" operation on the linked-list to get an email corresponding to a certain name, we may have to traverse the whole list to find our goal. This is time consuming, especially for large linked-lists [complexity O(n)].
Hashtables provide a solution for this problem, they reduce the complexity of the "find", "search", "insert" operations to "ideally" O(1).
The idea is to make a function that will map the key value (the name, in our case) to an integral index for the array. This function is called the hash-function. So now whenever we want to get an email of someone, we could use something like:
email e = myHashTable.get("Ahmad");

or alternatively if we have the [] operator overloaded:
email e = myHashTable["Ahmad"];
Internally the get() function (or the operator[]) would look like this:
email get(char* name) {
int index = hash(name);
return hashStore[index];
}
where hashStore is an array of email, and hash() is the hashing function.
It is worth-mentioning that the STL has a hash table implementation under the name map.

Alright, but why use a pointer as a key to a hash table? If we already have a pointer to an object, then why we need to search for it?
Let's say we have a class Rectangle, we create an array of Rectangles as follows:
Rectangle myRectangles[] = new Rectangle[10];
and then we make a for loop to initialize our array:
for (int i=0; i<10;>
//new Rectangle of random width and height
int height = rand();
int width = rand();
myRectangles[i] = new Rectangle(width, height);
if (width == height){
//if it is a square, display it
myRectangle[i].display();
//hash it, to know what do we have on display
displayedRectsHash.add(&myRectangle[i]);
}//end if
}//end for
//Now that you have displayed all Squares, display all the rectangles
for (i=0; i<10;>
Rectangle * aRect = displayedRectsHash.find(&myRectangle[i]);
if (aRect == NULL) {
//it is not in the hash, i.e. it is not displayed, so we display it
myRectangles[i].display();
//according to need, we could hash this pointer, too, here
}//end if
}//end for
The example above is trivial of course, but it is only meant to explain the idea. Because we could have used the index of the array as the key to the hash. However, sometimes the situation is not as easy as that.

Ok now that I explained when and why one might need to cast a pointer type to an integral type. Let's see how do we do it:
unsigned long hashPointer(void* key) {
return hashLong((unsigned long)key));
}
unsigned long hashLong(unsigned long key) {
unsigned long result;
//...real hashing algorithm here
return result;
}
Note the use of unsigned long variables and casts, not just long. Because if we used long casts, and the pointer held address larger than 2^31, the result of the cast will be a negative value, which of course is undesirable (unless of course if you design the hash function in a way that makes it aware of that fact, which is unusual).
Long type is 32-bit size, according to the C standards. The address space of a modern Intel x86 processor is 32-bit size as well. A unsigned long variable can hold values from 0 to 2^32-1. Large enough to contain any memory address on a x86 processor. On the other hand, signed long variables can contain values from -1*(2^31-1) to 2^31-1. Which is enough for mapping the address space of x86, too. But half of it will be mapped into negative values. And thus the negative result of the cast.

No comments: