Saturday, November 08, 2008

git - a stupid content tracker

To start with, source code configuration management is basically tracking your source code along with the changes made to it. Each consistent change is recorded and kept as history. That way there is no single copy of a file; it is maintained as a chain of changes. This is how source code is maintained in any project, open source or closed source.
"Maintaining code is as important as writing good code"
Some of the version control systems I can think of are CVS, SVN, Perforce and some other commercial ones. I have worked with CVS and SVN extensively. :) In fact I was introduced to SVN first and then asked to move to CVS. Then I realised what a pain CVS was. Why do we need another version control system now? Are there any major flaws in CVS or SVN? I wouldn't say "Yes". But a few things can be improved. Like...
  • SVN and CVS have one single source repository! Simply put, all developers on the code base commit their changes to one server, so it's a single point of failure. What if this server goes down? How will you check in or update your code base?
  • Branching is not so easy, and merging is definitely not so easy. I was asked to create a new branch in my CVS repository (a branch tag, in CVS terms) to check in code for a future release feature. It was such a painful thing.
  • Sharing code. Say your peer is working on the same feature and you need his code to merge and do integration testing before your checkin. Today this is possible only by asking him to check in first and then updating from the repository.
  • My source code is so big that maintaining one or two extra copies eats up my hard disk space, since they are separate physical copies of the same files.
  • Oh! You work on multiple projects, one on CVS and another on SVN, and find it tough to remember the commands for both.
  • Yeah! The Eclipse plugin for CVS on Linux sucks. It has never shown me a diff correctly or updated/merged code with ease.
  • Thinking of more... (contributions welcome...)
Git to the rescue! Git is distributed source control management (SCM) software. So what is "distributed" in terms of SCM? The source tree is replicated in many places rather than on a single server, and each replica is a full repository. Anyone can check in anywhere and check out anywhere. :) Boy! Isn't that messy? I would say "No", it's more flexible now. So who finally has the master copy of the project? Maybe the master guy has it. Take Linux kernel development: Linus has the master source copy. Gregkh is a kernel maintainer under Linus, and he in turn has his own copy. Say I'm a kernel developer and I need to send a patch for a bug in the kernel source. I do a checkin and ask Gregkh to pull from my repository and see if things are fine in his branch. If they are, Linus in turn pulls from Gregkh, or Gregkh can collect all commits for a week and ask Linus to pull on Friday evening so that Linus can merge over the weekend. This forms a network of trust. If there is an issue with my fix, Gregkh throws the code back to me saying the fix was bad. And if tomorrow Linus's repository is unreachable and I cannot update from it, I can pull the latest changes from Gregkh too.
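The pull-based flow above can be pictured with a toy model. This is only a sketch and not how git is implemented: the `Repo` class and the commit names are made up for illustration, and a repository is reduced to a plain set of commit ids.

```python
# Toy model, not git itself: a repository is just a set of commit ids,
# and "pull" copies over whatever commits the other repository has.

class Repo:
    def __init__(self, owner):
        self.owner = owner
        self.commits = set()

    def commit(self, commit_id):
        """Record a new commit locally; no server is involved."""
        self.commits.add(commit_id)

    def pull(self, other):
        """Fetch every commit we are missing from another repository."""
        new = other.commits - self.commits
        self.commits |= new
        return new


linus = Repo("linus")
gregkh = Repo("gregkh")
me = Repo("me")

linus.commit("v1")
gregkh.pull(linus)       # the maintainer syncs with the master copy
me.pull(gregkh)          # I sync from Gregkh; Linus's repository is not needed

me.commit("bugfix")      # my fix exists only in my repository so far
gregkh.pull(me)          # Gregkh pulls and reviews it
linus.pull(gregkh)       # Linus pulls from Gregkh later

print(sorted(linus.commits))
```

Every repository here is a peer. Which one counts as the "master copy" is a matter of convention among the people involved, not something the tool enforces.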

Branching and merging are among the most impressive features of git. In git a branch is created with 0 bytes. Yes, zero bytes (technically about 40 bytes, the SHA-1 of the commit the branch points to). From then on only what changes on the branch takes up new space; everything unchanged is shared with master (the main branch). With CVS or SVN, working on a branch typically means another physical copy of the files on your disk. Say your project has 6000 files and you create a new branch for a new feature. How many files will be edited? I would guess some 100. So why maintain a copy of the other unchanged files for this branch? git has an intelligent merge algorithm that merges in seconds. Linus claims it to be one of the fastest. Auto-merge is beautiful and helps you resolve conflicts easily.
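To see why a branch costs almost nothing, here is a minimal sketch that imitates the way git keeps a loose branch ref: a small file whose content is the 40-character hex id of a commit. The helper `create_branch`, the temporary directory and the stand-in commit id are assumptions made for the example; it does not touch a real git repository.

```python
import hashlib
import os
import tempfile


def create_branch(repo_dir, name, commit_id):
    """Create a branch the cheap way: one small file holding a commit id."""
    ref_path = os.path.join(repo_dir, "refs", "heads", name)
    os.makedirs(os.path.dirname(ref_path), exist_ok=True)
    with open(ref_path, "w", newline="\n") as f:
        f.write(commit_id + "\n")
    return ref_path


repo_dir = tempfile.mkdtemp()

# Stand-in for a real commit id: any 40-character SHA-1 hex digest will do.
commit_id = hashlib.sha1(b"stand-in for a real commit").hexdigest()

master = create_branch(repo_dir, "master", commit_id)
feature = create_branch(repo_dir, "new-feature", commit_id)

# Size of the new branch: 40 hex characters plus a newline.
print(os.path.getsize(feature))
```

None of the 6000 project files is copied when the branch is created. Both branches simply point at the same commit until one of them moves on.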

Each checkin is stored as compressed objects, and a SHA-1 hash is computed for every commit made to the source code. Unchanged files are shared between commits and similar content is delta-compressed when git packs the repository, so this saves a lot of disk space and keeps a git repository small even where the CVS or SVN repository bloats. Say you have the source code on your laptop and you spend 2 hours daily commuting to your office. You can sit in the car, commit to the local branches on your system, and then push all the changes together to the main repository once you reach the office. ;)
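The hashing can be tried out without git installed. The sketch below computes the id git assigns to a file's content (a "blob"): the SHA-1 of a short header followed by the bytes. The function name `git_blob_id` is made up for the example; commits and directory trees are hashed along the same lines but are left out here.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 of b"blob <size>\\0" + content, as 40 hex characters."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))               # id of an empty file
print(git_blob_id(b"hello world\n"))  # id of a one-line file

# Identical content always gets the identical id, so a file that did not
# change between two commits is stored only once.
print(git_blob_id(b"hello world\n") == git_blob_id(b"hello world\n"))
```

Because the id depends only on the content, the same file in two branches or two commits is one object on disk.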

git also comes with a nice GUI, gitk, that shows the tree structure of the branches and what each commit changed. It also comes in handy for searching commit comments and looking at diffs.

Oh wait, I'm already using SVN at my office. I cannot ask my CTO to adopt git for upcoming projects. Hold your peace, git can work over CVS and SVN like a wrapper (git-svn for SVN; git-cvsimport and git-cvsexportcommit for CVS). You make all the changes to your source code on your desktop and maintain it using git. Everything happens in git, and your back end in CVS or SVN gets updated with normal commits. :)

Google Tech Talk by Linus Torvalds. I should admit I got inspired by this.
#git on freenode
Git Community Book - really comprehensive.


  1. >>> Gregkh can collect all commits for a week and send it over to Linus on Friday evening so that Linuz would merge it over the weekend.

    No one needs to send anything. You can just pull changes to a .git repo from another git repo.

    >>> In git a branch is created with 0 bytes. Yes zero bytes.

    I guess a branch takes 40 bytes.

    gitcasts is also a nice resource page.

    Good write-up.

  2. Thanks Sankar! Rephrased them.

  3. Should the content be strictly stupid to be managed with git? ;) nice write up. Also using it as a versioning file-system is a nice use-case!