git ready

learn git one commit at a time
by Nick Quaranto

how git stores your data

committed 17 Feb 2009

This is a very broad overview of how the Git object model works, based mostly on the Git Community Book. Future posts will look into the object model in more depth, but this information is essential to anyone learning Git. The images in this post were taken from my presentation on Open Source Collaboration with Git and GitHub.

The most basic unit of data storage is the blob. To track history, Git stores the full contents of each file, not the differences between one version of a file and the next. The contents are then referenced by a 40-character SHA-1 hash (computed over the contents plus a short header), which means it’s pretty much guaranteed to be unique. Pretty much every object, be it a commit, tree, or blob, has a SHA, so learn to love them. Luckily, they can usually be referenced by just the first 7 characters, which are normally enough to identify the whole string.
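To make the hashing a little less magical, here is a minimal Python sketch of how a blob's SHA can be computed. The function name `git_blob_sha1` is made up for this example; the "blob <size>" header followed by a null byte is the part Git prepends before hashing.

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that identifies a blob holding `content`."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes,
    # not the bytes alone.
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

sha = git_blob_sha1(b"hello, git\n")
print(sha)      # the full 40-character hex id
print(sha[:7])  # the abbreviated form you usually see in logs
```

If you have Git installed, running `git hash-object` on a file with the same bytes should give you the same value in a SHA-1 repository.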

One awesome advantage of storing only the content is that if you have two or more copies of the same file in your repository, Git will only store one copy internally. A blob can be represented like so:
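In code terms, the "only one copy" behavior falls out of keying objects by their hash. The sketch below uses a hypothetical in-memory `ObjectStore` class as a stand-in for Git's object database. It is not how Git lays things out on disk, just an illustration of the idea.

```python
import hashlib

class ObjectStore:
    """Hypothetical in-memory stand-in for Git's object database."""

    def __init__(self):
        self.objects = {}  # maps sha -> content

    def put_blob(self, content: bytes) -> str:
        header = f"blob {len(content)}\0".encode()
        sha = hashlib.sha1(header + content).hexdigest()
        # Identical content always hashes to the same key,
        # so a second copy never takes up a second slot.
        self.objects.setdefault(sha, content)
        return sha

store = ObjectStore()
first = store.put_blob(b"same bytes\n")    # say, README
second = store.put_blob(b"same bytes\n")   # say, a copy in docs/README
other = store.put_blob(b"different bytes\n")

print(first == second)     # True
print(len(store.objects))  # 2
```

Notice that the file name never enters the picture: the blob only knows about content, which is why two differently named copies collapse into one object.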

The next object is a tree. These can be thought of as folders or directories: they contain blobs and other trees, each listed by name.
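A rough sketch of that nesting in Python: each tree entry is a mode, a name, and the SHA of a blob or another tree. The helper names `hash_object` and `tree_sha1` are invented for this example, the file names are made up, and the sorting is simplified (Git's real ordering has a special case for directory names), so treat it as an illustration rather than a drop-in reimplementation.

```python
import hashlib

def hash_object(kind: str, body: bytes) -> str:
    """Hash a body the way Git does: '<kind> <size>\\0' header plus body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def tree_sha1(entries) -> str:
    """entries: iterable of (mode, name, sha_hex) tuples."""
    body = b""
    # Simplified ordering: plain sort by name.
    for mode, name, sha_hex in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>\0" followed by the 20 raw SHA bytes.
        body += f"{mode} {name}\0".encode() + bytes.fromhex(sha_hex)
    return hash_object("tree", body)

# Two blobs (made-up file contents)
readme = hash_object("blob", b"hello\n")
helper = hash_object("blob", b"print('hi')\n")

# A subdirectory tree, then a root tree that points at it
lib = tree_sha1([("100644", "helper.py", helper)])
root = tree_sha1([("100644", "README", readme), ("40000", "lib", lib)])
print(root)
```

Because the root tree's hash depends on the hashes of everything beneath it, changing a single file deep in the hierarchy gives you a new SHA all the way up.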

Finally, this brings us to the most important object: the commit. Commits can be thought of as snapshots: each one points to the tree as it looked at one point in time. They also have some other information associated with them, such as the author, date, and a message, plus a pointer to their parent commit or commits.
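Here is a small Python sketch of what goes into a commit and why its SHA changes whenever any of that information does. The function name `commit_sha1`, the author, and the timestamp are all made up for illustration. The layout (a tree line, optional parent lines, author and committer lines, a blank line, then the message) mirrors how a commit is laid out, but this is a simplified sketch that always uses a +0000 timezone and the same person as author and committer.

```python
import hashlib

def commit_sha1(tree, parents, author, timestamp, message):
    """Build the text of a commit and return its SHA-1."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# The well-known SHA of an empty tree, used here as a placeholder snapshot
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
AUTHOR = "A U Thor <author@example.com>"  # hypothetical author

first = commit_sha1(EMPTY_TREE, [], AUTHOR, 1234567890, "first commit")
second = commit_sha1(EMPTY_TREE, [first], AUTHOR, 1234567900, "second commit")
print(first)
print(second)
```

The second commit records the first one's SHA as its parent, which is the link that ties snapshots together into a history.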

Commits are organized in a Directed Acyclic Graph. For those who missed that lecture in Data Structures, it basically means that the commits “flow” in one direction and never loop back on themselves. Usually this direction is simply the path of history for your repository, which could be very simple or quite complex if you have branches. From a broad standpoint it will look something like this:
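If the graph talk is too abstract, a toy version in Python may help. The single-letter commit ids and the shape of this history are invented for the example: each commit maps to the list of commits it came from, and walking those links backwards gives you everything reachable from a given commit.

```python
# Hypothetical history: each commit id maps to its list of parent ids.
parents = {
    "a": [],           # the very first commit has no parent
    "b": ["a"],
    "c": ["b"],        # work on the main line
    "d": ["b"],        # work on a branch
    "e": ["c", "d"],   # a merge has two parents
}

def history(commit, parents):
    """Return the set of commits reachable from `commit`, itself included."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Links only ever point backwards in time, so this walk cannot loop.
        stack.extend(parents[current])
    return seen

print(sorted(history("e", parents)))  # ['a', 'b', 'c', 'd', 'e']
print(sorted(history("d", parents)))  # ['a', 'b', 'd']
```

The branch commit "d" knows nothing about "c" or "e": history is only ever what you can reach by following parents.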

This all becomes much more apparent when using a tool like GitX. You can clearly see the commit objects and their associated data, and then drill down from there to see the commit’s tree.

So now hopefully when you see some of the strange terminology in your commits, you’ll understand a little more of how it works. Check out the awesome guide at Git for Computer Scientists and Scott Chacon’s talk on Getting Git.