Thursday, October 30, 2008

Exploring the topics least spoken about - 2

So when I say in the sweep phase, the memory space is reclaimed there are more than one way of reclaiming the memory space. You can mark the memory as free space and can use that to allocate new objects. This is what typically happens in Mark and Sweep algorithm. There is another variant of Mark and Sweep algorithm and it is called Mark and Compact algorithm. Instead of just marking the space as free space, during the compact phase, all the active objects are moved together and all the free space is grouped together. So what is the advantage of doing this? Well, obvious advantage is improved performance and improved object allocation time. So if all the free space is grouped together, then memory space for objects can be allocated using contiguous blocks of memory units and therefore, leads to better object allocation time as there is no fragmentation.

Ok, so is this the best GC Algorithm available. The answer is no. There are some other variants that use the concept of generational garbage collection. The common observation in object lifetimes is typically an application creates a huge number of short lived objects of small size. Therefore, only a few of the objects live longer and only a few objects are bigger. So how can we take advantage of this observation? Well, the whole heap space is divided into three generations - Young generation, Old generation and Permanent generation. Young generation as the name suggests holds the newly created objects. When the object stays in young generation heap for a long time, then they are moved/tenured to old generation heap space. So what is this permanent generation. Permanent generation holds the objects that are not meant to be Garbage collected very often. Say for example, each compiled class has a corresponding class object representing the class in the Java runtime environment. These objects are not meant to be garbage collected often. So they are present in the permanent generation.

Young generation heap space is further divided into eden space and survivor space. Eden space is the first point of entry into the heap space for any new object. If the object is active during the garbage collection, then they are moved to the survivor space. When the object stays in survivor space for a certain number of garbage collection cycles, then they are tenured to the old generation. If the object is big enough to be accomodated in the eden space or survivor space, then they are directly allocated on the old generation heap space. Typically, young generation heap space is smaller but often garbage collected. In case of old generation heap space, it is allocated larger space but not often garbage collected.

Exploring the topics least spoken about - 1

There are certain topics in Java that are not discussed in detail in a lot of places. As a new grad student, the chances are quite high that the candidate wouldn't have heard about these topics. So I would like to talk about these topics in the following set of blogs.

The first thing that I am planning to write about is Garbage Collection, shortly called as GC. So what is this GC? To answer this, lets see how the memory management in traditional programming languages like C and C++ works. In C and C++, variables, methods, classes and objects all require some definite amout of memory space. Since memory is a precious resource and is limited, memory that is allocated to an object has to be reclaimed back once the object is destroyed. In case of C and C++, the developer has the freedom to allocate memory to an object and at the same time has the responsibility to reclaim the unused memory back. In this way, memory is made available for the object that is being created. But life is not always beautiful. What happens if the developer doesn't claim the unused memory back. Then it results in memory leak and the memory usage of the application keeps growing because of this memory leak. This is not all. What happens if a developer claims back the memory space allocated to an object while object is still being used. This results in corrupt memory pointers, also known as dangling pointers.

In an effort to relieve the developer of all these nightmares, automated memory management system was incorporated into Java runtime environment. As a part of this system, developer doesn't need to take the responsibility of claiming back the unused space. So how does this system work? Well, the memory allocation to the object happens when "new" keyword is used to create a new object. Java has a special area for allocating memory to the object and this area is called "Heap". But in Java programs there are variables that hold the reference to the objects. So where is the variable allocated? Variables referencing the objects are allocated on the method stack and the variable goes out of scope when the stack frame is removed from the stack or in other words, when the execution exits the method, the variables inside the method scope is destroyed.
So back to the heap and allocation, each object is allocated a space in the Java heap and stays in the heap as long as the object is used and once the object is no longer used by anyone, the memory space allocated for the object is claimed back by the memory management system aka GC. So how do you know if an object is used or not. The simplest way will be something like keeping track of all the references to the object and whenever a reference goes out of scope, the reference count is decreased. So this type of reference counting can be assigned to each of the object and whenever a reference points to this object, the reference count is increased and when a variable is destroyed, the reference count of the object is decreased and when the object 's reference count is 0, this object is no longer used and can be claimed back. Ok, this method looks simple and probably reference count increment/decrement and all those things might add some overhead. But still it looks simple and doable. But unfortunately if we use "Reference counting" garbage collection scheme, Java performance might be bad and of course, the science of garbage collection has grown tremendously over the years and we do have some advanced garbage collection techniques. This reference counting system is not used in any JVM implementation that I know of.
Well, thats brings us to defining the relationship between the variables on the stack and the objects on the heap. Well, variables on the stack reference the object on the heap. So if we analyze all the variables on the heap and reference that each of the variables point to on the heap, we can identify all the objects in the heap that are active. So we can mark all these objects as active objects and we can reclaim the heap space occupied by the objects that are not active i.e., inactive objects. so, this particular scheme is called "Mark and Sweep" algorithm. The phase in which the variable references are traced down and marked as active is called mark phase and the phase in which the space is claimed back is called sweep phase. So what about the disadvantages of this scheme? well, during the mark phase, in order to ensure that active object tracing is done without any discrepancy, the execution of the main application program is stopped and active object tracing or mark algorithm is run. So this is called "Stop the world" Garbage collection and it causes pause in the execution of the application program. This is a typical disadvantage of mark and sweep algorithm. So what is the starting point of this mark or tracing operation. Its the active stack frame on the method stack thats the starting point. Each of the variable on the active stack frame is traversed and so on.
More about other garbage collection schemes in the following posts....