Git Forklore: The Object Storage Model

If you’re a developer, whether a professional or a hobbyist, chances are you’ve heard of Git, the ubiquitous VCS (Version Control System) that supposedly the entire world depends on (except, for instance, companies with humongous monorepos like Meta). Some people see it as a time-travelling wizard, one that can create parallel timelines, merge conflicting ones, prune them or undo your mistakes. Sounds a lot like the Time Variance Authority, yeah?

In this blog series by Delta Force, we’ll delve deeper into the wizard’s trickery and unravel the magic tricks. Disclaimer: readers are expected to have some experience of working with Git. Familiarity with the terminal helps as well - especially in appreciating the subheading titles.

History of Git | `git checkout 2002-2005`

In the early years of Linux kernel development, there was a need for a proper VCS. One of the most popular VCSs of that era was CVS, but it was never used for the kernel, partly because of critical issues with how it handled race conditions. In 2002, Linus eventually settled on a closed-source distributed VCS, BitKeeper. BitMover, the company behind BitKeeper, offered the kernel developers a free license at a cost - they were not allowed to work on any competing VCS. FOSS zealots among the kernel developers were unhappy that BitKeeper was closed-source, and an effort to reverse engineer it began. BitMover got wind of it and revoked the free license in 2005. And thus, the developers found themselves without a VCS once again, and with a backlog of patches waiting to be merged.

Enter Linus Torvalds, to build something monumental once again: within 5 days, the first version of Git was out. What is Git again? Git is a content-addressable file system with a dedicated VCS interface. Back when it first came out, the VCS interface part was a bit lackluster. And thus, people began to ponder what this bizarre creation was - the so-called information manager from hell.

In many ways you can just see git as a filesystem—it’s content-addressable, and it has a notion of versioning, but I really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have absolutely zero interest in creating a traditional SCM system.

- Linus Torvalds

Exploring `.git` | `cd .git && ls -al`

The core Git is often called “plumbing”, with the prettier user interfaces on top of it called “porcelain”. You may not want to use the plumbing directly very often, but it can be good to know what the plumbing does when the porcelain isn’t flushing.

Essentially everything that you or I have done in Git is execute Git porcelain commands - which in fact are mere abstractions that execute a chain of Git plumbing commands sequentially. Throughout this series, we’ll go over the basics - make a Git commit, make a separate branch, push our local repository to a remote repository, all without using git commit or git push. In other words, we’re gonna perform all the magic without any magic wands or super-secret-tools. Sounds exciting?

Fire. I’ll start with jumping into the repository for my website.

Nothing too fancy here, a standard Astro repository. Next destination? .git.

Alright, there we go. branches/hooks/HEAD/FETCH_HEAD should be self explanatory, but don’t worry. By the end of this article, you’ll have a decent idea about all of these files and how everything works under the hood. First and foremost, I’ll break down everything in this folder.

- `COMMIT_EDITMSG`: holds the commit message of the last commit made.
- `config`: literally, config. Contains the project-specific configuration: the URLs of your remotes, branch tracking settings, default merging strategies and so on.
- `description`: used only by the GitWeb visualizer as a description for your project.
- `FETCH_HEAD`: records the hash of the commit last fetched from your remote.
- `HEAD`: points to what is currently checked out. Usually this is a reference to a branch (for example `ref: refs/heads/main`); it holds a raw commit hash only when HEAD is detached.
- `hooks`: contains the client-side hook scripts.
- `info`: contains an `exclude` file, which lets you ignore patterns that you don’t want to put in a tracked .gitignore file.
- `logs`: the time-travelling back-end (the reflog). It records where HEAD and your branches have pointed over time, which is what lets you go back to earlier states.
- `ORIG_HEAD`: records the previous position of HEAD before an operation that moves it drastically, such as a merge, rebase or reset.
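If you would like to poke at this layout without touching a real project, a throwaway repository is enough. The sketch below assumes `git` is on your PATH and uses a temporary directory purely for illustration; the exact set of entries you see depends on your Git version.

```shell
# Create a throwaway repository (the temporary directory is only for illustration).
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# List the top-level entries of .git; the exact set depends on your Git version.
ls -A .git

# In a fresh repository, HEAD holds a symbolic reference to a branch, not a commit hash.
head_content=$(cat .git/HEAD)
echo "$head_content"
```

Files such as COMMIT_EDITMSG, FETCH_HEAD and ORIG_HEAD only appear once you have committed, fetched or run an operation that moves HEAD.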

Poof. Too many buzzwords? I agree. You don’t have to remember all of these. If you noticed, I skipped a few folders/files out there - namely objects, refs, index and branches.

Okay, what are Git Objects? | `tree .git/objects`

What does the objects folder contain, exactly? Again, back to my website repo.

By themselves, these objects probably do not make much sense to you, but fret not - for we have a Swiss Army knife to examine such objects: `git cat-file`. Refer to the man pages for details; in short, `git cat-file -p object_name` prints the contents and `git cat-file -t object_name` prints the type, both to stdout. Running it on one of the objects, we get these outputs.

The output is a file in the repo, an .md file for one of the blog posts hosted on the website, to be precise. And it has the type blob. Similarly, running it on another object, we get a different kind of output.

This time the type is tree, and the object refers to the directory src in the repository. Note that it is itself composed of other trees (more directories) and blobs!

Lastly, running it on a third object, the type is commit. It refers to a parent commit and a tree (the state of the project at that commit), along with the commit message and other commit details.

To summarize what we just learnt - Git stores everything in the form of objects. blob objects contain file data, tree objects represent directories (or, more precisely, the state of a directory - see the section on trees below for more), and commit objects represent our Git commits, each linked to a tree and, except for the very first commit, to a parent commit. The entire object model might be beginning to take shape in your mind by now. If the answer is a no - continue reading :)
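To try the three object types on something small, here is a minimal sketch. It assumes `git` is installed; the file name `src/post.md`, its content and the identity passed with `-c` are placeholders of mine, not anything from the website repository above.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Make one ordinary commit so that there is something to inspect.
mkdir src
echo "hello" > src/post.md
git add src/post.md
git -c user.name="Example" -c user.email="example@example.com" \
  commit --quiet -m "first commit"

# Resolve one object of each kind using revision syntax.
commit_id=$(git rev-parse HEAD)
tree_id=$(git rev-parse 'HEAD^{tree}')
blob_id=$(git rev-parse HEAD:src/post.md)

# -t prints the type, -p pretty-prints the content.
git cat-file -t "$blob_id"    # blob
git cat-file -p "$blob_id"    # the file's content
git cat-file -t "$tree_id"    # tree
git cat-file -p "$tree_id"    # one line per entry: mode, type, hash, name
git cat-file -t "$commit_id"  # commit
git cat-file -p "$commit_id"  # tree, author, committer and message
```

Since this is the first commit in the repository, its output has no `parent` line; any later commit would show one.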

Getting Hands Dirty | `git hash-object time-to-make-our-own-objects.md`

Congratulations on making it this far. This part and beyond is where we get our hands dirty. Starting with our original quote -

Git is a content-addressable file system with a dedicated VCS interface.

A content-addressable file system (or content-addressable storage, CAS) is, at its core, a way of storing files not under filenames but under a hash that reflects their content. A single change in the content of a file results in a different hash, and identical content always results in the same hash, which is how a CAS makes it easy to track changes within files and to detect duplicate files.
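A quick way to see content addressing in action is to hash a few strings with `git hash-object --stdin`, which works even outside a repository. This is only a sketch; the sample strings are arbitrary.

```shell
# The same bytes always produce the same object ID...
a=$(echo 'test content' | git hash-object --stdin)
b=$(echo 'test content' | git hash-object --stdin)

# ...while changing a single character produces a completely different one.
c=$(echo 'test contenT' | git hash-object --stdin)

echo "$a"   # d670460b4b4aece5915caf5c68d12f560a9fe3e4 with the default SHA-1 format
echo "$b"   # identical to the line above
echo "$c"   # a different hash
```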

Initialize an empty Git repository in an experimental folder and `cd .git/objects`.

The objects folder is (as expected) empty, apart from the empty info and pack subdirectories. I’ll create a .md file (you can use any file) to demonstrate how Git objects are made. We’ll use the command `git hash-object`. As always, before using any command, make sure to go through the man pages :)

The `-w` flag writes the object to the object database in addition to printing its hash to stdout. The output you see is a SHA-1 hash - a checksum of a short header (the object type and the content size) followed by the content. SHA-1 is still the default; SHA-256 repositories are opt-in and can be created with `git init --object-format=sha256`. Going back to our objects directory, we have a new folder there!
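To convince yourself that the hash really covers a header plus the content, you can compute it by hand. The sketch below assumes a repository format of SHA-1 (the default) and that the `sha1sum` tool from GNU coreutils is available; on other systems the equivalent tool may have a different name.

```shell
content='test content'

# What Git reports for this content (printf appends the trailing newline).
git_id=$(printf '%s\n' "$content" | git hash-object --stdin)

# The same ID by hand: SHA-1 over "blob <size>", a NUL byte, then the content.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')
manual_id=$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)

echo "$git_id"
echo "$manual_id"
```

Both lines should print the same hash, since `git hash-object` does exactly this header-plus-content hashing for blobs.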

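Compare the folder and the file name with the earlier computed hash, and you’ll understand how Git stores its objects in the object database: the first two characters of the hash become the folder name, and the remaining characters become the file name. Onto the next step - what happens when there is a revision in a file?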
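The mapping from hash to path can be checked with a few lines of shell. This is a sketch in a throwaway repository; the content written to `delta-force.md` is a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

echo "Delta Force" > delta-force.md   # placeholder content
id=$(git hash-object -w delta-force.md)
echo "$id"

# The first two hex characters name the directory, the rest name the file.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
ls -l ".git/objects/$dir/$file"

# The file on disk is zlib-compressed, so read it back through Git instead of cat.
git cat-file -p "$id"
```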

A very important point of distinction - blobs do not store differences between versions of a file; they’re exact snapshots of the file at a particular instant in time. Note that if a file has no changes (remember, we’re hashing the content!) - the existing blob object will be reused. But what are these blobs, exactly? They are blobs. Blobs are just.. blobs. Binary large objects: structure-less, feature-less objects that hold neither a filename nor any other metadata. But these blobs are fundamental to how Git manages and stores everything under the hood. Remember, it is the contents that are stored. Two files with the same content but different names will be stored only once.
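Both claims, deduplication and full snapshots, are easy to observe by counting the files under `.git/objects`. The file names and contents below are placeholders in a throwaway repository.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Two differently named files with identical content...
echo "same bytes" > one.md
cp one.md two.md
id_one=$(git hash-object -w one.md)
id_two=$(git hash-object -w two.md)

# ...map to a single blob in the object database.
count_before=$(find .git/objects -type f | wc -l | tr -d ' ')

# Revising a file yields a brand new blob holding the full new content.
echo "one more line" >> one.md
id_rev=$(git hash-object -w one.md)
count_after=$(find .git/objects -type f | wc -l | tr -d ' ')

echo "$id_one $id_two $id_rev"
echo "$count_before -> $count_after"   # 1 -> 2
```

The old blob is still there after the revision, which is exactly what lets Git give you back the earlier version later.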

Remembering the hashes of every file you index is impossible. And how do we group files together? For instance, the src folder in our repository may have dozens of files representing the different pages of the website. Any guesses? Trees.

Git Trees & Commits | `git write-tree`

Similar to UNIX, or perhaps most operating systems out there, Git organizes content into directories and files. Tree objects correspond to directories, and blobs correspond to files. A tree object is simply a list of entries sorted by name, where each entry records the mode, type, hash and name of an object it contains (note: objects, not blobs. Remember, we may also have sub-directories).

To create a tree object for our files, we need an index (or, a staging area). Remember index? Yes, exactly that.

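The command is `git update-index --add --cacheinfo 100644 db3455... delta-force.md`. `--add` is needed because the file is not in the index yet. `--cacheinfo` tells Git to take the entry from an object that is already in .git/objects rather than from a file in the working directory. 100644 is the file mode, written the way Unix writes it; this one means a regular, non-executable file. db3455... is the hash of our blob object, and delta-force.md is the name the entry gets in the index.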
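Put together, staging a blob and writing the index out as a tree looks roughly like this. It is a sketch in a throwaway repository, so the blob hash will differ from the db3455... shown above, and the file content is a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

echo "Delta Force" > delta-force.md   # placeholder content
blob_id=$(git hash-object -w delta-force.md)

# Stage the blob under a path; 100644 is the mode of a regular, non-executable file.
git update-index --add --cacheinfo 100644 "$blob_id" delta-force.md

# Write the current index out as a tree object and inspect it.
tree_id=$(git write-tree)
git cat-file -t "$tree_id"   # tree
git cat-file -p "$tree_id"   # one entry: mode 100644, type blob, the hash, delta-force.md
```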

Poof! Once `git write-tree` has written the index out as a tree object, we finally have our tree in the database, and it represents our delta-force.md file. Lastly, we’ll make a commit object to record our changes. A commit object needs a commit message, which we can just pipe into the `git commit-tree tree_hash` command.
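Here is a minimal sketch of that commit step, rebuilt from scratch in a throwaway repository. The identity values exported below are placeholders; `git commit-tree` needs some author and committer identity, either from your Git config or from these environment variables.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Rebuild the one-file tree from the previous steps.
echo "Delta Force" > delta-force.md   # placeholder content
blob_id=$(git hash-object -w delta-force.md)
git update-index --add --cacheinfo 100644 "$blob_id" delta-force.md
tree_id=$(git write-tree)

# Placeholder identity, used only for this illustration.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# The commit message is read from stdin; the new commit's hash is printed.
commit_id=$(echo "first commit" | git commit-tree "$tree_id")
git cat-file -p "$commit_id"
```

Note that nothing points at this commit yet: no branch has moved, so the commit exists in the object database but a plain `git log` will not find it.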

Now that we’re done with the basics, let’s kick it up a notch and add more files and make more commits.

After making blob objects for new-file.md and time-to-go-nuts.md, adding them to the index and making a tree for both of them, and then making a tree that holds this tree together with our initial delta-force.md, our .git/objects looks a little like this.
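One possible way to build such a nested tree is sketched below. The inner tree goes through the index as before; for the outer tree I use `git mktree`, which builds a tree straight from `ls-tree`-style lines. That choice, the directory name `more` and the file contents are my assumptions, and the original steps may well have used a different route.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Blobs for the three files (contents are placeholders).
first=$(echo "Delta Force" | git hash-object -w --stdin)
second=$(echo "new file" | git hash-object -w --stdin)
third=$(echo "time to go nuts" | git hash-object -w --stdin)

# A tree holding the two new files, built through the index as before.
git update-index --add --cacheinfo 100644 "$second" new-file.md
git update-index --add --cacheinfo 100644 "$third" time-to-go-nuts.md
inner_tree=$(git write-tree)

# An outer tree holding delta-force.md plus the inner tree as a subdirectory.
# "more" is a hypothetical directory name; 040000 is the mode of a tree entry.
outer_tree=$(printf '100644 blob %s\tdelta-force.md\n040000 tree %s\tmore\n' \
  "$first" "$inner_tree" | git mktree)

# Recursively list every file reachable from the outer tree.
git ls-tree -r "$outer_tree"
```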

Running `git commit-tree` once again, this time with the previous commit as the parent, and then finally running `git log` on our latest commit, here’s the output! (Alternatively, you can `echo "last_commit_hash" > .git/HEAD` and run a plain `git log`; note that this leaves HEAD detached, pointing directly at the commit rather than at a branch.)
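A compact sketch of a two-commit history built only with plumbing follows. The identity values and file contents are placeholders, and the last step uses `git update-ref`, which is my suggestion for moving the current branch without writing into .git by hand.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# First commit: a tree with a single file.
blob_one=$(echo "Delta Force" | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob_one" delta-force.md
first_commit=$(echo "first commit" | git commit-tree "$(git write-tree)")

# Second commit: one more file, with the first commit as its parent (-p).
blob_two=$(echo "new file" | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob_two" new-file.md
second_commit=$(echo "second commit" | git commit-tree "$(git write-tree)" -p "$first_commit")

# Walk the history starting from the newest commit.
git --no-pager log --oneline "$second_commit"

# Point the current branch at the newest commit so that a plain `git log` works.
git update-ref HEAD "$second_commit"
```

Because HEAD is a symbolic reference to a branch, `git update-ref HEAD` updates that branch, so HEAD stays attached.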

Congratulations, you have a successful Git history - built from scratch without the use of `git add` or `git commit`. By now, you have probably gained a new-found love for `git add` and `git commit` - they do a lot under the hood. We’re still just scratching the surface of Git Forklore; there are still branches, remote branches and merges left to be discussed. See you there in part 2 of this blog series :).

Git Forklore - Part 1
