Git vs. Subversion – Which to Use for Your Next Project

I recently did some research to support using Git or Subversion for a new project, and decided to include that in my blog (with permission).  While I don’t formally give attributions, any items in quotes came from other resources on the web.


Git Advantages

  1. Just as Subversion is the next evolution of open source code control from CVS, Git is the next open source code control evolutionary step
  2. Git offers distributed and federated branching as opposed to Subversion’s limitation of a single server with multiple clients.  A Git client can check out from a remote Git repo.  The user can make changes in units of work, and then commit those changes locally.  They can repeat this cycle of local work and local commits.  When ready, they can push their changes back to the remote repo, where everyone else can later pull them.  This gives the local developer the safety of making small changes and committing each as a unit, as part of a much larger change, and then submitting that larger change as a single unit when ready (while constantly pulling the latest changes from the remote depot)
  3. In addition, other users can be set up to check out from the user’s local depot, resulting in a federated model.  Nothing in Git itself designates which depot is the official master; that is decided by convention and agreement.  Git uses a peer-to-peer model, while Subversion is client-server.  This becomes even more important should the official repo be lost for some reason
  4. Because it is a distributed model, the workflow is established by the developer, not by the centralized repository owner.  “Git does not depend on a centralized server, but does have the ability to synchronize with other Git repositories – to push and pull changes between them. This means that you can add multiple remote repositories to your project, some read-only and some possibly with write access as well, meaning you can have nearly any type of workflow you can think of”
  5. “Due to being distributed, you inherently do not have to give commit access to other people in order for them to use the versioning features. Instead, you decide when to merge what from whom.  That is, because Subversion controls access, in order for daily checkins to be allowed – for example – the user requires commit access. In Git, users are able to have version control of their own work while the source is controlled by the repo owner.”
  6. Creating a new branch in Git is much quicker (i.e., 5 seconds), easier, and less centralized than in Subversion.  A developer can make that decision locally without having to consult or impact anyone else.  Until all parties agree, that branch can remain invisible to the rest of the team.   This allows more experimentation, parallel development, and rollback of failed prototypes or Scrum spiking with no impact to the rest of the team.  A new branch requires only 41 bytes (a file holding a 40-character hash and a newline), and deleting a branch means just deleting that single file (though there are commands to do it).  “Creating a repository is a trivial operation: mkdir foo; cd foo; git init. That’s it”
  7. Branches and labels are not just copies that can be altered, but are true first class citizens in Git.  Git has a strong audit trail and real tags, as opposed to using branches to simulate tags.  At each point in history, Git generates a SHA-1 hash that identifies the state of the code.  This makes it easy to track the history if someone tries to tamper with the code or mistakenly deploys the wrong code into production environments
  8. Integrating branches and merging is far easier and less conflict ridden (with fewer chances of accidents or problems) in Git.  Git has very strong merge algorithms.  Developers can do full merges locally before having to push the merge back into the main branch
  9. “Branch merging is simpler and more automatic in Git. In Subversion you need to remember what was the last revision you merged from so you can generate the correct merge command. Git does this automatically, and always does it right. Which means there’s less chance of making a mistake when merging two branches together”
  10. “Branch merges are recorded as part of the proper history of the repository. If I merge two branches together, or if I merge a branch back into the trunk it came from, that merge operation is recorded as part of the repository history as having been performed by me, and when. It’s hard to dispute who performed the merge when it’s right there in the log”
  11. “If you have partial merges for a work in progress, you will take advantage of the Git staging area (index) to commit only what you need [break it up and check in what you want now], stash the rest, and move on to another branch.”  By stash, he means that if you are working on a project and a bug comes in from production, you can stash your current work using a built-in function of Git, seamlessly switch to the production branch, make the code change, check it in, and then unstash your work and continue just as you were before
  12. When you check out (clone) with Git, you get the entirety of the repo, not just the one branch.  You get the full history, branches, merges, versions, and everything in your local version.  This is what allows you to work fully offline without needing a network connection.  In addition, each new branch created carries forward the pre-branch history
  13. It is faster.  “Since all operations (except for push and fetch) are local there is no network latency involved to a) perform a diff, b) view file history, c) commit changes, d) merge branches, e) obtain any other revision of a file (not just the prior committed revision), or f) switch branches”
  14. Git stores its information in a more compressed manner than Subversion, which offsets the size cost of keeping the full history locally (as noted above).  “Git’s file format is very good at compressing data, despite it’s a very simple format. The Mozilla project’s CVS repository is about 3 GB; it’s about 12 GB in Subversion’s fsfs format. In Git it’s around 300 MB”
  15. “The repository’s internal file formats are incredibly simple. This means repair is very easy to do, but even better because it’s so simple it’s very hard to get corrupted. I don’t think anyone has ever had a Git repository get corrupted. I’ve seen Subversion with fsfs corrupt itself. And I’ve seen Berkeley DB corrupt itself too many times to trust my code to the bdb backend of Subversion”
  16. Git does not require little .svn folders in each of the subdirectories as SVN does, which can cause minor problems sometimes.  All the Git information is stored in a .git folder at the top level of the depot.  In SVN, I’ve dealt with developers from novice to expert, and the novices and intermediates tend to introduce file conflicts when they copy a folder from another SVN project in order to re-use it.  In Git, I think you just copy the folder and it works, because Git doesn’t put .git folders in all its subfolders (as SVN does).
  17. “SVN is the third implementation of a revision control: RCS, then CVS and finally SVN manage directories of versioned data. SVN offers VCS features (labeling and merging), but its tag is just a directory copy (like a branch, except you are not “supposed” to touch anything in a tag directory), and its merge is still complicated, currently based on meta-data added to remember what has already been merged.  Git is a file content management (a tool made to merge files), evolved into a true Version Control System, based on a DAG (Directed Acyclic Graph) of commits, where branches are part of the history of datas (and not a data itself), and where tags are a true meta-data.”  In other words, having started as a tool to merge files and then evolved into a true VCS is what makes Git so much more powerful than Subversion
  18. You have to go with a DVCS, it is like a quantum leap in source management
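
The points above about hashes identifying the state of the code, and about a branch being a tiny file, can be illustrated with a minimal Ruby sketch. It assumes Git's documented object naming (the SHA-1 of a short header plus the content); the method name git_blob_id and the sample file contents are made up for illustration, and a real branch file holds a commit hash rather than a blob hash, though both are the same length.

```ruby
require 'digest/sha1'

# Git names a file's content (a "blob") by the SHA-1 of a short header
# followed by the content itself. git_blob_id is a hypothetical helper name.
def git_blob_id(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")
end

original = git_blob_id("puts 'hello'\n")
tampered = git_blob_id("puts 'hell0'\n")

# Any change to the content produces a different id, which is what makes
# tampering visible in the history.
puts original
puts tampered

# A branch is stored as a file holding one 40-character hash plus a newline.
# (A real branch file holds a commit id; a blob id is used here only because
# it has the same length.)
branch_file_contents = "#{original}\n"
puts branch_file_contents.bytesize
```

Because every id is derived from content, two depots that hold the same hash hold the same code, without needing a central server to vouch for it.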


Subversion Advantages

 The following are the advantages of Subversion:

  1. You can check out part of a branch instead of having to check out the entire thing
  2. Subversion is stronger in storing and managing very large binary files.  “SVN is the only VCS (distributed or not) that doesn’t choke on my TrueCrypt files (please correct me if there’s another VCS that handles 500MB+ files effectively). This is because diff comparisons are streamed (this is a very essential point). Rsync is unacceptable because it’s not 2-way.”
  3. There were earlier problems with using Git on Windows back in 2008 due to lack of support, but that has been addressed at this point
  4. If your development is linear and simpler (without requiring branches and parallel work), you should stick with Subversion
  5. Because Subversion has been around longer, it may have better tool support.  This was more a problem around 5 years ago, but Git has mainstream tool adoption at this point
  6. Most people already know how to use Subversion instead of Git.  To use Git, some internal training (which I can do) will be involved to not only use Git, but to use Git as it was intended (and not to use as if one were using Subversion)
  7. “Walking through versions is simpler in Subversion because it uses sequential revision numbers (1,2,3,..); Git uses unpredictable SHA-1 hashes. Walking backwards in Git is easy using the “^” syntax, but there is no easy way to walk forward”
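
The last point follows from how a commit graph is linked: each commit records its parents, and nothing records its children. Here is a rough Ruby sketch of that idea. The Commit struct, the one-letter ids and the helper names are all hypothetical; this is a toy model, not Git's actual storage.

```ruby
# Toy model: a commit knows only its own id and the ids of its parents.
Commit = Struct.new(:id, :parents)

# Hypothetical history: a <- b <- c, with d also branching from b.
history = {
  'a' => Commit.new('a', []),
  'b' => Commit.new('b', ['a']),
  'c' => Commit.new('c', ['b']),
  'd' => Commit.new('d', ['b'])
}

# Walking backward is a direct lookup, which is roughly what "^" does.
def first_parent(history, id)
  history[id].parents.first
end

# Walking forward means scanning every commit, and the answer may be
# ambiguous because a commit can have several children.
def children(history, id)
  history.values.select { |commit| commit.parents.include?(id) }.map(&:id)
end

parent_of_c = first_parent(history, 'c')
children_of_b = children(history, 'b')

puts parent_of_c
puts children_of_b.inspect
```

With sequential revision numbers, "the next revision" is always well defined; in a graph like this one, "the next commit after b" has two equally valid answers.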

Information Architecture – Part 2

I read about half of the “Information Architecture for the World Wide Web”, and then stopped at that.   Not because it is not a good book; I just don’t plan to become a professional Information Architect.  If I need to go deeper at some point, I’ll read the rest.  All in all, I definitely recommend the book.

Nonetheless, it is really interesting to think about information architecture, organization, structure and search as abstract concepts independent of an actual application, as well as how they apply to the real world of grocery stores, department stores, libraries, etc.

A couple of things really stuck out for me in addition to what I learned in “Don’t Make Me Think”:

There are two main ways to organize based upon the needs of your users:

  • Exact organizational scheme. When the user knows exactly what they are looking for (e.g., white pages)
  • Ambiguous organizational scheme. When users don’t know what they are looking for.  Some types include organizing by task or topic, among other things

Hierarchical organizational structures can be tricky as many things do not neatly fit into a strict taxonomy.  A taxonomy that allows cross-listing is referred to as polyhierarchical.  However, if too much of this takes place, the value of the organization is reduced.

In addition, there is tension between the breadth and depth of the hierarchies.  It is generally better to go for greater breadth, particularly as your site grows.

My knowledge grew the most in the area of search.  I have a friend whose site needs search capability.  I had thought of plugging something in for him for the entire site, and until now did not realize the complexities involved in matching the users’ needs to the search capabilities.

For one, your users may need recall or precision, but they can’t have the most of both simultaneously, as improving one generally comes at the expense of the other:

  • Recall. Recall is oriented toward finding a greater number of relevant matches (e.g., doing due diligence on a company you are considering joining)
  • Precision. Precision is oriented toward finding just a few very high quality matches (e.g., instructions on deck staining)
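
To make the trade-off concrete, here is a small Ruby sketch using the standard definitions: precision is the share of returned results that are relevant, and recall is the share of relevant documents that were returned. The document ids and the two result sets are invented for illustration.

```ruby
# Returns [precision, recall] for a list of retrieved ids against the list
# of ids that are actually relevant.
def precision_and_recall(retrieved, relevant)
  hits = (retrieved & relevant).size
  precision = retrieved.empty? ? 0.0 : hits.to_f / retrieved.size
  recall = relevant.empty? ? 0.0 : hits.to_f / relevant.size
  [precision, recall]
end

# Hypothetical site with four documents relevant to the query.
relevant = %w[d1 d2 d3 d4]

# A strict search returns little, but everything it returns is relevant.
narrow_results = %w[d1 d2]
# A loose search finds every relevant document, along with a lot of noise.
broad_results = %w[d1 d2 d3 d4 d5 d6 d7 d8]

narrow_precision, narrow_recall = precision_and_recall(narrow_results, relevant)
broad_precision, broad_recall = precision_and_recall(broad_results, relevant)

puts "narrow: precision #{narrow_precision}, recall #{narrow_recall}"
puts "broad:  precision #{broad_precision}, recall #{broad_recall}"
```

The strict search scores 1.0 on precision and 0.5 on recall; the loose search scores the reverse.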

You will need to configure or choose your search engine accordingly.  In addition, choosing to expand queries automatically (e.g., through stemming or thesauri) will result in greater recall at the expense of precision.  Your users’ goals need to play an important part in this configuration.

In addition, you can choose to index your entire site, or break it up into different search zones.  Once again, the former increases recall, while the latter increases precision.

Finally, there are numerous ways and levels of detail to provide around search results.  Once again, this will be based upon your users’ goals.

In the end, you need to really understand your users’ goals to be able to create the appropriate Information Architecture.

Information Architecture – Part 1

Taking a break from development and Agile activities, I have started reading “Information Architecture for the World Wide Web” by Peter Morville & Louis Rosenfeld.  This is part of my attempt to increase the user-oriented side of my skills in addition to more technical web development skills.

The book begins by using an analogy of physical building architecture, which builds nicely on the mental model I started to build up from Steve Krug’s “Don’t Make Me Think”.  Different building architectural styles serve different user purposes, labeling and classification enable users to navigate effectively, and search plays an important role.  While there are also similarities with physical libraries, the multi-dimensionalities associated with the web present a different set of problems.

What I found interesting is their discussion on Information Needs and Information Seeking Behaviors.  They discussed four types of Information Needs:

  • The Perfect Catch. You know exactly what you are looking for – someone’s telephone number, a fact about the population of the state of Louisiana, etc.  Basically, you are looking for “the right answer”
  • Information Exploration. You might be looking for the best apartment swapping services in Paris, or different investment options in your online 401K service (as I was recently).  There are multiple good matches.
  • Exhaustive Research. You might be doing research for your thesis, or conducting medical research about a disease a friend may have acquired.  You want to leave no stone unturned.
  • Refinding. This is where services like come in handy, or the “Favorites”  link in YouTube

The point is, how you design search capability and organize your site is going to differ vastly for these purposes, and an understanding of how your users will want to use your site will play a major role in your information architecture.  How you organize search, links, content and navigation will either enable or befuddle your users in their goals.  In other words, you want to set things up in such a way that your users do not need to think.

Another thing I found interesting (I am only through Part 1) is the Berry Picking Model by Dr. Marcia Bates of USC:

  1. Start with an information need
  2. Formulate an information request (query)
  3. Move through the site(s) in different ways
  4. Pick up important bits of information (berries)
  5. Refine your query based upon what you already found and repeat

This stuck out because I was just doing this this morning before reading this:

  1. Searching for help on using a particular technique
  2. Finding some helpful articles (and either adding to or Evernote) while browsing then rejecting unhelpful ones
  3. Altering the search query in the hopes that it would offer more links that better fit my need

Finally, even though the product I work on is a web-based application as opposed to a site, an understanding of Information Architecture can also be helpful in terms of how we present information and work with user requests.

Using WATIR for Browser Based Testing

I had wanted to learn to use Selenium to automate browser-based testing, but a QA person at work gave a brown bag on WATIR.  Given the number of browsers I could use it with, I decided to play around with it.

The WATIR site has a page on installation which is pretty straightforward (I didn’t bother with supporting Safari on my Mac, and the Windows install was pretty straightforward as well).  Installing the plugins for Firefox referenced on the install page was also straightforward.

You do need to start Firefox up initially using the -jssh option (for IE on Windows, it comes up automatically).  Here is how I did it for Mac:

cd into /applications/
./firefox-bin -jssh

For Windows 7:

cd into Program Files (x86)\Mozilla Firefox
firefox.exe -jssh

I brought up Ruby’s IRB to start playing around.  On my Mac, I could execute commands in the IRB, but in my ruby script file, it was failing on the following:

require 'watir'

For the scripts, I needed to add require 'rubygems' first.  On Windows, I needed to do this for both IRB and ruby scripts:

require 'rubygems'
require 'watir'

To bring up Firefox, I did the following (you can just comment out the first line to bring up IE):

Watir::Browser.default = "firefox"
b = Watir::Browser.start ""

The browser popped up to this site.  I needed to login, so I specified the name and password by finding the element by id and then specifying the text to be typed in:

b.text_field(:id, "user").set("name_of_user")
b.text_field(:id, "password").set("the_password")

In the browser, it was almost as if an invisible person typed in the text.  Next, I needed to click the “Sign In” button, but there was no id associated with it, so I had to click it after finding it by its value:

b.button(:value, 'Sign In').click

Now the home page came up.  I wanted to create a new Foo, so I needed to get to the Foo page, which is referenced by a link and is called “New Foo” on the page:, ‘New Foo’).click

I was now on the new page, and started following the steps to fill out the fields to create the new Foo.  However, I was getting the following error:

C:/Ruby/lib/ruby/gems/1.8/gems/watir-1.6.5/lib/watir/element.rb:56:in `assert_exists': Unable to locate element, using :id, "username" (Watir::Exception::UnknownObjectException)
    from C:/Ruby/lib/ruby/gems/1.8/gems/watir-1.6.5/lib/watir/element.rb:288:in `enabled?'
    from C:/Ruby/lib/ruby/gems/1.8/gems/watir-1.6.5/lib/watir/element.rb:60:in `assert_enabled'
    from C:/Ruby/lib/ruby/gems/1.8/gems/watir-1.6.5/lib/watir/input_elements.rb:327:in `set'
    from watir_fun.rb:6

I double-checked the ids; all looked well.  It turned out that the page was running some additional JavaScript after loading, so these fields were not yet ready for me to access.  Adding ‘sleep 2’ to the script before that step gave the page time to finish loading, and I was able to follow the steps to create the entity.
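
A fixed sleep either wastes time or is too short on a slow day. One alternative is to poll until the field shows up. Below is a minimal sketch in plain Ruby; wait_for is a hypothetical helper written for this post, not part of WATIR, and the counter stands in for a page that becomes ready on the third check.

```ruby
# Hypothetical helper (not part of WATIR): calls the block repeatedly until
# it returns a truthy value, and raises if the timeout passes first.
def wait_for(timeout: 5, interval: 0.1)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result
    raise "condition not met within #{timeout} seconds" if Time.now >= deadline
    sleep interval
  end
end

# Stand-in for a page whose fields only appear after some JavaScript runs:
# the condition becomes true on the third check.
attempts = 0
ready = wait_for(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3
end

puts "ready after #{attempts} checks"
```

In a real script the block would ask the browser whether the element is present yet, for example by checking the text field with id "username" before calling set on it.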

But was the entity created successfully?  Because none of the displayed fields had a unique id (and this was only a quick experiment), I simply checked with something like the following:

b.text.include?("Hot Stocks")

Obviously, for production readiness this is not acceptable, so we would likely add keys or some easier way to access elements via XPath and so forth.   The point is, WATIR gives you an easy way to automate interacting with a browser and seeing what the results are.

In the end, I found picking up WATIR to be quite straightforward.  During my earlier stint at my current company, I used to go through a short manual test script before checking in, to make sure my changes didn’t break anything.  This weekend (for fun), I hope to code up this former script in WATIR in just a few hours and start having us use it in Development next week.

Additionally, the WATIR web-site is helpful, well-organized, and I hear the help and mailing lists are quite responsive and friendly.

Choosing Clojure Over Scala

I saw Venkat Subramaniam give a presentation on Scala at one of the NOVAJUG meetings. Venkat is an excellent presenter and a very smart guy.

Two weeks or so ago Stuart Halloway gave a presentation on Clojure at another NOVAJUG meeting held at Oracle. Stuart is one of those guys who makes you marvel how fast his brain works. Unfortunately, his one hour presentation probably could have used 90 minutes to two hours to fully absorb. I went by the book store the next night, and all copies of his Clojure book were gone.

For the last five months or so, I have been trying to figure out whether to study Clojure or Scala. Given the fact that I am in chapter 2 of Stuart’s Clojure (as well as the subtle title of this post), I have decided to go with Clojure.

Scala looked very promising, but Clojure seemed far more different from Java. I view that as a good thing. Getting experience in a variety of language styles, as opposed to those that seem closer to Java, can only be a good thing. No, I am not saying that Scala is just like Java.

Clojure does have a lot in common with LISP, which I enjoyed programming in in college. But it appears to have made some improvements over standard LISP.

Its functional and transactional approach to concurrency (as opposed to manual locking) seems interesting.

Finally, as Ruby is a more expressive language than Java, Clojure appears to be even more expressive than Ruby. Clojure may be very different than Java, but it is easier to program in (as claimed by Stuart).

Will Clojure be the next big language? Possibly. Will I be a much better developer by learning a language such as Clojure? Definitely.

Ruby Strings – How to Put String With Single Quotes in Single Quotes

I set up a series of Ruby unit tests to better explore SED.  Basically, I invoke it as follows:

def sed_result(input_str, sed_str)
  %x[echo #{input_str} | #{sed_str}]
end

It worked well, until I did this:

assert_equal("howdy\n", sed_result("howdy from Texas", "sed 's/\([a-zA-Z]*\).*/\1/'"))

Assigning the latter string to a variable in Ruby produces a different string, because escape sequences in double-quoted strings are processed: the backslashes before the parentheses are dropped, and \1 is read as an octal escape (the character with code 1), giving “sed ‘s/([a-zA-Z]*).*/\001/'”

For example, when I want to do string interpolation, I use double quotes instead of single quotes as follows:  “The value of foo is #{foo}” evaluates to the string “The value of foo is bar”.  This time, I don’t want any processing of the string, but want to pass it as is.  Thus, I need to use single quotes, except my string contains single quotes. What to do?

Well, the following use of %q[] is the same as using single quotes, and worked well:

assert_equal("howdy\n", sed_result("howdy from Texas", %q[sed 's/\([a-zA-Z]*\).*/\1/']))

Additionally, you can use %Q[] in place of double quotes.
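
Here is a small side-by-side sketch of the three forms discussed above. The variable names are just for illustration, and foo is set to 'bar' to match the interpolation example earlier.

```ruby
foo = 'bar'

# Double quotes process escapes: \( loses its backslash and \1 becomes the
# single character with code 1, so sed never sees the intended script.
double_quoted = "sed 's/\([a-zA-Z]*\).*/\1/'"

# %q[] behaves like single quotes, so the backslashes survive, and the
# embedded single quotes need no escaping. Balanced brackets inside are fine.
raw = %q[sed 's/\([a-zA-Z]*\).*/\1/']

# %Q[] behaves like double quotes, so interpolation still works.
interpolated = %Q[The value of foo is #{foo}]

puts double_quoted.inspect
puts raw
puts interpolated
```

Printing with inspect makes the stray control character in the double-quoted version visible, which is handy when a string looks right on screen but fails in the shell.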

Spawning a Process in Ruby

A good way for me to learn code-related technologies is to write unit tests.  It helps to work through the nuances, and gives me something I can go back to refresh myself as needed.

Right now, I have been going through SED, which does not have an associated unit test framework.  To test this, I want to use Ruby and then spawn a process to execute SED and then get the results back for the assert_equal call of all my tests.  I want to do this while avoiding the overhead of using files and file redirection (where possible).

I obviously can’t use Kernel::exec, as that will replace my currently running process and end my testing real quick.  I want to do the following:

     system("echo #{input_str} | #{sed_str}")

Here input_str is the input SED is processing, and sed_str is the SED command string.

Of course, this does not give me what I need: system hands the pipeline to the shell and runs it, but it returns true or false rather than the command’s output.

If I read from and write to files, I can do the following:

     system("#{sed_str} output.txt")

In the end, I got it to work as follows:

def sed_result(input_str, sed_str)
  %x[echo #{input_str} | #{sed_str}]
end

There are other options to look into – IO.pipe, IO.popen and Open3 (among others).  But I have a lot of SED unit tests to write first!
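
As a rough idea of what the Open3 route might look like, here is a minimal sketch using Open3.capture2 from Ruby's standard library. It assumes sed is on the PATH. The method name sed_result_open3 is hypothetical, and unlike sed_result above it takes only the sed script rather than the whole command string.

```ruby
require 'open3'

# Hypothetical variant of sed_result: feeds input_str to sed on stdin and
# returns whatever sed writes to stdout. No shell and no echo are involved,
# so quoting inside the input cannot be misread by the shell.
def sed_result_open3(input_str, sed_script)
  output, status = Open3.capture2('sed', sed_script, stdin_data: "#{input_str}\n")
  raise "sed failed with status #{status.exitstatus}" unless status.success?
  output
end

result = sed_result_open3('howdy from Texas', 's/\([a-zA-Z]*\).*/\1/')
puts result
```

Passing the program and its argument separately avoids the shell entirely, which sidesteps most of the quoting trouble described in the post above.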