Saturday, November 08, 2008

git - a stupid content tracker

To start with, Source Code Configuration Management is basically tracking your source code and the changes made to it. Each change is recorded and kept as history. So there is no single copy of a file; it is maintained as a series of changes. This is how source code is maintained in any project, open source or closed source.
"Maintaining code is equally important to writing good code"
Some of the version control systems that come to mind are CVS, SVN, Perforce and a few other commercial ones. I have worked with CVS and SVN extensively. :) In fact, I was introduced to SVN first and then asked to move to CVS; that's when I realised what a pain CVS was. So why do we need another version control system now? Is there any major flaw in CVS or SVN? I wouldn't say "Yes", but a few things can be improved. Like....
  • SVN and CVS have a single central repository. In simple terms, all developers on the code base commit their changes to one server, say 34.56.7.67, so it's a single point of failure. If that server goes down, how will you check in or update your code base?
  • Branching is not easy, and merging is definitely not. I was once asked to create a new branch (called a tag) in my CVS repository to check in code for a future release feature. It was such a painful thing.
  • Sharing code: say your peer is working on the same feature and you need his code merged into yours to do integration testing before your check-in. Today the only way is to ask him to check in first and then update from the repository.
  • My source tree is so big that maintaining one or two branches drinks up my hard disk space, since each branch is a full physical copy of every file.
  • Oh! You work on multiple projects, one on CVS and the other on SVN, and you also find it tough to remember the commands for both.
  • Yeah! The Eclipse CVS plugin on Linux sucks. It never showed me a diff correctly or updated/merged code with ease.
  • Thinking of more... (welcome and contribute...)
Git to the rescue! Git is a distributed source control management (SCM) tool. So what is "distributed" in terms of SCM? The source tree is replicated in numerous places rather than living on a single server, and each replica is a full repository. Anyone can check in anywhere and check out anywhere. :) Boy! Isn't that messy? I would say "No", it's more flexible now. So who has the master copy of the project? Whoever the project treats as the master. Take Linux kernel development: Linus has the master copy. Gregkh is a kernel maintainer under Linus, and he in turn has his own copy. Say I'm a kernel developer and I need to send a patch for a kernel bug. I commit to my repository and ask Gregkh to pull from it and see if things are fine in his branch. If they are, Linus pulls from Gregkh, or Gregkh collects a week's worth of commits and asks Linus to pull on Friday evening so Linus can merge them over the weekend. This forms a network of trust. If there was an issue with my fix, Gregkh throws the code back to me saying the fix was bad. And if tomorrow Linus's repository is unreachable because of a bad network, I can pull the latest changes from Gregkh instead.
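The peer-to-peer flow above can be sketched with plain git commands. A hypothetical example, with all names and paths made up for illustration: two repositories exchange a fix directly, with no central server involved.

```shell
# Hypothetical example: two developers sharing work with no central server.
set -e
cd "$(mktemp -d)"

# "greg" has a repository with one commit
git init -q greg
cd greg
git config user.email greg@example.com
git config user.name greg
echo "core code" > kernel.c
git add kernel.c
git commit -qm "initial import"
cd ..

# "dev" clones greg's tree -- a full repository of his own, history and all
git clone -q greg dev
cd dev
git config user.email dev@example.com
git config user.name dev
echo "bug fix" >> kernel.c
git commit -qam "fix: correct the bug"
cd ..

# greg pulls the fix straight from dev's repository -- peer to peer
cd greg
git pull -q ../dev
```

Either side could equally have pulled from the other; there is nothing special about which repository came first.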

Branching and merging is one of git's most impressive features. In git a branch is created with zero bytes of copied content (technically about 40 bytes of internal housekeeping). Only the changes the branch makes relative to master (the main branch) are stored. In CVS or SVN, by contrast, a branch is a separate physical copy of every file on disk. Say your project has 6000 files and you create a new branch for a feature being added; how many files will actually be edited? I could guess some 100. So why maintain a copy of all the other unchanged files in the branch? git also has an intelligent merge algorithm that finishes in seconds; Linus claims it is one of the fastest. Auto-merge is beautiful and makes conflicts easy to resolve.
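Here is a small sketch of how cheap this is in practice (repository and branch names are made up):

```shell
# Sketch of how cheap a git branch is; names are invented for illustration.
set -e
cd "$(mktemp -d)"
git init -q project
cd project
git config user.email me@example.com
git config user.name me
echo "v1" > app.c
git add app.c
git commit -qm "base"

# Creating a branch writes a single tiny ref file -- no files are copied
git branch feature
wc -c .git/refs/heads/feature   # typically 41 bytes: the commit hash plus a newline

git checkout -q feature
echo "new feature" >> app.c
git commit -qam "add feature"

# Back on the original branch, merging is a single fast operation
git checkout -q -
git merge -q feature
```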

Each check-in is saved as the difference from the previous one, and a SHA-1 hash is created for each commit made to the source code. This saves a lot of disk space, and git uses compression to keep the repository small where a CVS or SVN repository would bloat. Say the source code is on your laptop and you spend 2 hours a day commuting to your office: you can sit in the car, commit to local branches on your own machine, and then push all the changes together to the main repository once you reach the office. ;)
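That offline workflow might look like this (all paths are invented; the bare repository stands in for the one at the office):

```shell
# Sketch of the offline commit-now, push-later workflow; paths are made up.
set -e
cd "$(mktemp -d)"
git init -q --bare office.git       # stands in for the repository at the office
git clone -q office.git laptop 2>/dev/null
cd laptop
git config user.email me@example.com
git config user.name me

# Commits made with no network at all -- they only touch the local .git
for n in 1 2 3; do
  echo "change $n" >> notes.txt
  git add notes.txt
  git commit -qm "commit $n, written in the car"
done

# Back at the office: publish all three commits with one push
git push -q origin HEAD
```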

git also comes with a nice GUI, gitk, that shows the tree structure of the branches and the commits made on each. It is also handy for searching commit comments and looking at diffs.

Oh wait, I'm already using SVN at my office, and I cannot ask my CTO to adopt git for upcoming projects. Hold your peace: git can work over CVS and SVN like a wrapper. You make all the changes to your source code on your desktop and maintain them with git. Everything happens through git, and your CVS or SVN back end gets updated with normal commits. :)
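For the SVN side, the bridge is the git-svn command. A hypothetical transcript (the server URL and project names are made up, and this obviously needs a real SVN server and the git-svn bridge installed, so it is shown only as an illustration):

```shell
# Hypothetical git-svn session; URL and names are invented.
# Mirror the SVN trunk as a local git repository:
git svn clone http://svn.example.com/repo/trunk myproject
cd myproject

# ...edit files, then commit locally with plain git, as often as you like...
git commit -am "fix the widget"

# Replay the local git commits onto the SVN server as normal SVN commits:
git svn dcommit

# Pull new SVN revisions down into the git history:
git svn rebase
```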

References:
Google Tech Talk by Linus Torvalds. I should admit I got inspired by this.
#git on freenode
Git Community Book - really comprehensive.

Thursday, October 30, 2008

Exploring the topics least spoken about - 2

So when I say the memory space is reclaimed in the sweep phase, there is more than one way of reclaiming it. You can mark the memory as free space and use it to allocate new objects; this is what typically happens in the Mark and Sweep algorithm. There is another variant called the Mark and Compact algorithm. Instead of just marking the space as free, during the compact phase all the active objects are moved together and all the free space is grouped together. So what is the advantage of doing this? The obvious one is improved object allocation time: when all the free space is grouped together, memory for new objects can be allocated from contiguous blocks, and with no fragmentation, allocation is faster.

Ok, so is this the best GC algorithm available? The answer is no. There are other variants that use the concept of generational garbage collection. The common observation about object lifetimes is that an application typically creates a huge number of short-lived objects of small size; only a few objects live longer and only a few are big. So how can we take advantage of this observation? Well, the whole heap space is divided into three generations: the young generation, the old generation and the permanent generation. The young generation, as the name suggests, holds newly created objects. When an object stays in the young generation heap for a long time, it is moved (tenured) to the old generation heap space. So what is the permanent generation? It holds objects that are not meant to be garbage collected very often. For example, each compiled class has a corresponding class object representing it in the Java runtime environment. These objects are not meant to be garbage collected often, so they live in the permanent generation.

The young generation heap space is further divided into an eden space and survivor spaces. Eden is the first point of entry into the heap for any new object. If an object is still active during a garbage collection, it is moved to a survivor space, and when it survives a certain number of garbage collection cycles there, it is tenured to the old generation. If an object is too big to be accommodated in the eden or survivor space, it is allocated directly on the old generation heap. Typically, the young generation heap space is smaller but garbage collected often, while the old generation is allocated a larger space but garbage collected less often.
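On HotSpot JVMs the generation sizes described above are tunable from the command line. A hypothetical invocation (the sizes and the class name `MyApp` are made-up illustrations, not recommendations):

```shell
# Young generation (eden + two survivor spaces) fixed at 128 MB:
#   -XX:NewSize / -XX:MaxNewSize
# Eden sized at 8x each survivor space:
#   -XX:SurvivorRatio=8
# Objects survive 15 minor collections before tenuring to the old generation:
#   -XX:MaxTenuringThreshold=15
java -Xms512m -Xmx512m -XX:NewSize=128m -XX:MaxNewSize=128m \
     -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 MyApp
```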

Exploring the topics least spoken about - 1

There are certain topics in Java that are not discussed in detail in many places. For a new grad, the chances are quite high that he or she wouldn't have heard about these topics. So I would like to talk about them in the following set of posts.

The first thing that I am planning to write about is Garbage Collection, called GC for short. So what is this GC? To answer this, let's see how memory management works in traditional programming languages like C and C++. In C and C++, variables, methods, classes and objects all require some definite amount of memory space. Since memory is a precious and limited resource, the memory allocated to an object has to be reclaimed once the object is destroyed. In C and C++, the developer has the freedom to allocate memory for an object and, at the same time, the responsibility to reclaim unused memory, which makes it available for the next object being created. But life is not always beautiful. What happens if the developer doesn't claim the unused memory back? It results in a memory leak, and the memory usage of the application keeps growing because of it. This is not all. What happens if a developer claims back the memory allocated to an object while the object is still being used? This results in corrupt memory pointers, also known as dangling pointers.

In an effort to relieve the developer of all these nightmares, an automated memory management system was incorporated into the Java runtime environment. With this system, the developer doesn't need to take on the responsibility of claiming back unused space. So how does it work? Well, memory is allocated for an object when the "new" keyword is used to create it. Java has a special area for allocating objects, and this area is called the "Heap". But Java programs also have variables that hold references to objects, so where are the variables allocated? Variables referencing objects are allocated on the method stack, and a variable goes out of scope when its stack frame is removed from the stack; in other words, when execution exits a method, the variables inside the method's scope are destroyed.
So, back to the heap and allocation: each object is allocated space in the Java heap and stays there as long as it is used; once the object is no longer used by anyone, the memory allocated for it is claimed back by the memory management system, aka the GC. So how do you know if an object is used or not? The simplest way would be to keep track of all the references to the object. A reference count is assigned to each object; whenever a reference points to the object the count is incremented, and whenever such a variable is destroyed the count is decremented. When an object's reference count reaches 0, the object is no longer used and its memory can be claimed back. Ok, this method looks simple; the increments and decrements add some overhead, but it still looks simple and doable. Unfortunately, with a "reference counting" garbage collection scheme, Java performance might be bad (and it cannot reclaim cycles of objects that reference each other). Of course, the science of garbage collection has grown tremendously over the years, and we have more advanced techniques today. This reference counting scheme is not used in any JVM implementation that I know of.
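To make the bookkeeping concrete, here is a toy sketch of reference counting in Java. The class and method names are invented, and, as noted above, no real JVM works this way:

```java
// Toy sketch of reference counting -- not how any real JVM works.
final class RefCounted {
    private int refCount = 0;

    void addRef() { refCount++; }      // a new reference now points at this object

    void release() {                   // a reference went out of scope
        if (--refCount == 0) {
            // count hit zero: nothing references this object any more,
            // so its memory could be reclaimed here
            System.out.println("reclaimed");
        }
    }

    int count() { return refCount; }
}

public class RefCountDemo {
    public static void main(String[] args) {
        RefCounted obj = new RefCounted();
        obj.addRef();   // variable a starts referencing obj
        obj.addRef();   // variable b starts referencing obj
        obj.release();  // a goes out of scope
        obj.release();  // b goes out of scope -> count is 0, prints "reclaimed"
    }
}
```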
Well, that brings us to the relationship between the variables on the stack and the objects on the heap: variables on the stack reference objects on the heap. So if we analyze all the variables on the stack and follow the reference each of them holds, we can identify all the objects in the heap that are active. We mark all these objects as active, and we can then reclaim the heap space occupied by the objects that were not marked, i.e., the inactive objects. This particular scheme is called the "Mark and Sweep" algorithm. The phase in which the references are traced and active objects are marked is called the mark phase, and the phase in which the space is claimed back is called the sweep phase. So what are the disadvantages of this scheme? Well, during the mark phase, in order to ensure that active object tracing is done without any discrepancy, the execution of the application program is stopped while the tracing runs. This is called "stop the world" garbage collection, and the pauses it causes in the application are the typical disadvantage of mark and sweep. So what is the starting point of this mark or tracing operation? It's the active stack frames on the method stack: each variable on an active frame is traversed, and so on.
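A toy sketch of the mark phase in Java (class names invented): trace from the roots and flag every reachable object; whatever stays unmarked is garbage for the sweep phase.

```java
import java.util.ArrayList;
import java.util.List;

// Each Node stands in for a heap object with outgoing references.
final class Node {
    boolean marked = false;
    final List<Node> references = new ArrayList<>();
}

public class MarkSweepDemo {
    // Mark phase: depth-first trace from a root, flagging reachable objects.
    static void mark(Node root) {
        if (root == null || root.marked) return;   // already visited
        root.marked = true;
        for (Node ref : root.references) mark(ref); // follow outgoing references
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node(), orphan = new Node();
        a.references.add(b);   // a -> b

        // 'a' plays the role of a root (a variable on an active stack frame)
        mark(a);

        // The sweep phase would reclaim everything left unmarked:
        System.out.println(a.marked);       // true
        System.out.println(b.marked);       // true  (reachable through a)
        System.out.println(orphan.marked);  // false (unreachable -> garbage)
    }
}
```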
More about other garbage collection schemes in the following posts....

Monday, August 04, 2008

How to secure wireless network

I'm mainly writing down the points for securing a "HOME" wireless network.

1. Change SSID
Many, almost all, wireless routers come with a default SSID (Service Set Identifier) that is kind of unique in itself. For example, Linksys routers use "linksys" as the default SSID. Using the default SSID gives away information about your wireless network. It's something like broadcasting "Hey, I am a wireless router and you can connect to me. And yeah, by the way, I am a Linksys router". That may not seem like much, but it lets a potential hacker know which router you are using, maybe narrow down which one to attack, get on the net to learn the known attacks against that model, and try them.

Better to set the SSID to some random junk characters. Copy the SSID and save it for later use.

This may not add much security, but it's good not to offer low-hanging fruit, encouraging someone to actually grab it.

2. Hide SSID
So that other people can't see it. Hide the SSID by disabling SSID broadcast; then they will not see your network and will not try to connect to it.

3. Change router's password
Again, most routers are shipped with a default password. For example, Linksys routers have "admin" as the default password. And it's very easy to find the default password for these boxes: just google "default password" along with the vendor and model. So if you keep the default password, someone can log in to your router and change the configuration, maybe steal all your bandwidth, or do much more than that.

4. Disable remote administration
By disabling remote administration, you are actually saying "No one can change my router's settings without connecting to it through a wire." That's pretty cool: no one out there can change your wireless router's settings over the air. That's very important.

5. Carefully upgrade firmware
Upgrade the firmware only if you really need to. Check the change log and find out if there is a really critical fix that you should install. Secondly, never upgrade the firmware over wireless; always connect to the router physically and then upgrade.

6. Enable encryption
The most important point, and a must. Enable encryption on your router so that no one can sniff your packets and see what you are doing, or play man in the middle. And it's good to use an intelligent password: a long string mixing upper case, lower case and numbers, so that it's practically impossible, or at least hard enough, to break. Maybe use some kind of random key generator for this (and save the password somewhere safe and secure).


----
That will do a lot to make your network secure. There is other stuff like "MAC address filtering", but that may not be of much use because the MAC address is actually sent unencrypted with every frame. So...... anyway, do that too if you wish.


Wednesday, July 23, 2008

Cannot detect USB drives on Windows?

Recently on my home PC (Windows XP), none of the USB drives were getting detected: external hard drives, an iPod (which got detected via iTunes but not through Explorer), and even a plain USB drive. Googling around landed me on this Microsoft support link that helped me resolve the exact issue; it also lists the possible places where detection can fail and workarounds for them. Another trick I learned is to connect the USB drive directly to the USB ports on the motherboard rather than through a USB extension cable.

Hope this will help, next time around when you face this.

Another known trick in the book for reducing startup time is to configure your options in the 'Startup' tab of 'msconfig': Start -> Run -> msconfig.

If you know more tricks of this sort, send them in.

Saturday, February 16, 2008

S#%t! I broke the build

A usual scenario with big projects, where code is written and maintained by a big crowd of geeks and nerds, is that the build tends to break once in a while. These are usually termed 'build breaks'. It could be because of compilation, or it could be because of linking. A common reason is "it compiled well on my system but broke when committed". In the end, beyond the hours we put into developing a feature, we need more hours to fix these build breaks.

Can this be avoided? No. But it can be identified and rectified quickly. Here is a tool from Apache that can help with this: 'Apache Continuum' is one such tool that triggers automated builds for your project. It provides a simple configuration portal that is easy to deploy, so you can start your timely builds. Try it!

Some features that have impressed me are,
  • Support for various source repositories.
  • It works with Maven projects and can launch a command-line job.
  • Notifications, from mail to IRC to chat clients, seem comprehensive.
  • It keeps track of check-ins and shows some intelligence in triggering builds.