Thursday, October 30, 2008

Exploring the topics least spoken about - 1

There are certain topics in Java that are not discussed in detail in a lot of places. As a new grad student, the chances are quite high that the candidate wouldn't have heard about these topics. So I would like to talk about these topics in the following set of blogs.

The first thing that I am planning to write about is Garbage Collection, shortly called as GC. So what is this GC? To answer this, lets see how the memory management in traditional programming languages like C and C++ works. In C and C++, variables, methods, classes and objects all require some definite amout of memory space. Since memory is a precious resource and is limited, memory that is allocated to an object has to be reclaimed back once the object is destroyed. In case of C and C++, the developer has the freedom to allocate memory to an object and at the same time has the responsibility to reclaim the unused memory back. In this way, memory is made available for the object that is being created. But life is not always beautiful. What happens if the developer doesn't claim the unused memory back. Then it results in memory leak and the memory usage of the application keeps growing because of this memory leak. This is not all. What happens if a developer claims back the memory space allocated to an object while object is still being used. This results in corrupt memory pointers, also known as dangling pointers.

In an effort to relieve the developer of all these nightmares, automated memory management system was incorporated into Java runtime environment. As a part of this system, developer doesn't need to take the responsibility of claiming back the unused space. So how does this system work? Well, the memory allocation to the object happens when "new" keyword is used to create a new object. Java has a special area for allocating memory to the object and this area is called "Heap". But in Java programs there are variables that hold the reference to the objects. So where is the variable allocated? Variables referencing the objects are allocated on the method stack and the variable goes out of scope when the stack frame is removed from the stack or in other words, when the execution exits the method, the variables inside the method scope is destroyed.
So back to the heap and allocation, each object is allocated a space in the Java heap and stays in the heap as long as the object is used and once the object is no longer used by anyone, the memory space allocated for the object is claimed back by the memory management system aka GC. So how do you know if an object is used or not. The simplest way will be something like keeping track of all the references to the object and whenever a reference goes out of scope, the reference count is decreased. So this type of reference counting can be assigned to each of the object and whenever a reference points to this object, the reference count is increased and when a variable is destroyed, the reference count of the object is decreased and when the object 's reference count is 0, this object is no longer used and can be claimed back. Ok, this method looks simple and probably reference count increment/decrement and all those things might add some overhead. But still it looks simple and doable. But unfortunately if we use "Reference counting" garbage collection scheme, Java performance might be bad and of course, the science of garbage collection has grown tremendously over the years and we do have some advanced garbage collection techniques. This reference counting system is not used in any JVM implementation that I know of.
Well, thats brings us to defining the relationship between the variables on the stack and the objects on the heap. Well, variables on the stack reference the object on the heap. So if we analyze all the variables on the heap and reference that each of the variables point to on the heap, we can identify all the objects in the heap that are active. So we can mark all these objects as active objects and we can reclaim the heap space occupied by the objects that are not active i.e., inactive objects. so, this particular scheme is called "Mark and Sweep" algorithm. The phase in which the variable references are traced down and marked as active is called mark phase and the phase in which the space is claimed back is called sweep phase. So what about the disadvantages of this scheme? well, during the mark phase, in order to ensure that active object tracing is done without any discrepancy, the execution of the main application program is stopped and active object tracing or mark algorithm is run. So this is called "Stop the world" Garbage collection and it causes pause in the execution of the application program. This is a typical disadvantage of mark and sweep algorithm. So what is the starting point of this mark or tracing operation. Its the active stack frame on the method stack thats the starting point. Each of the variable on the active stack frame is traversed and so on.
More about other garbage collection schemes in the following posts....

No comments:

Post a Comment