Git under the hood

Foundations, structures, and useful features of Git.

2020-02-02

Introduction

Git is a regular part of many developers' modern-day routines. It allows us to collaborate with our teammates at work and collaborate with the world at home. One of the great things about Git is that it's fairly easy to get started with just a few commands. On the flip side, it can get really confusing really quickly when things don't work how we expect them to.

In this blog post, we're going to Git under the hood of Git. (Git it? Okay, sorry.) We'll start by learning computer science concepts that are foundational to Git. Armed with this understanding, we'll explore the structures underlying Git. Finally, we'll explore a handful of useful features that I wish more Git users knew about.

I'm going to intentionally keep things simple. The goal here won't be to learn everything about Git, but to remove some of the mystery and add a few more tools to your toolbox. My hope is that by the end of this blog post, Git will be something you understand rather than memorise.

What is Git anyway?

Before we start, we should clear some things up. What exactly is Git, and why does it exist?

The birth of Git

Git was created by Linus Torvalds in 2005 to help with Linux kernel development. There's some juicy drama there if you care enough to look.

"Git" is also a noun that means "a foolish or worthless person". Pretty weird thing to name a program after, right? There's no way that was intentional...

Fun fact, it was! If you read the man page, you'll notice it describes Git as "the stupid content tracker". Linus himself says "I'm an egotistical bastard, and I name all my projects after myself". You can connect the dots.

Many people use the terms "Git" and "GitHub" interchangeably. This is understandable given that many people interact with Git solely in the context of GitHub. However, they're not actually the same!

Git is just another command-line program, like find. Specifically, it's a version control system. GitHub is a service (now owned by Microsoft) that lets you host your repositories and provides other features. Similar services include GitLab and Bitbucket.

Git is distributed, not centralised

Git is a popular version control system among developers, but it wasn't the first. One of the things that set it apart at the time of its creation was that it was distributed instead of centralised. This distinction has a huge impact on the way its users are able to work.

Centralised systems use a client-server architecture. The current version of a project and its history are stored on a server. Developers check out a copy of the project's current version (without its full history), make the changes they need to, and then check those changes back into the server. I hope you can appreciate just how restrictive this workflow is when compared to Git!

Distributed systems like Git use a peer-to-peer architecture instead. Every developer has a local copy of the project including the history on their own computer. Developers keep their project copies in sync with one another by transferring patches to each other directly.

In practice, teams usually have a central repository that developers try to keep in sync with. This repository might be hosted on a service like GitHub. To get their local copy, developers clone this "main" repository. They can then integrate changes by pushing directly to the main repository or opening a pull request.

Relevant concepts in computer science

In order to understand how Git works the way it does, there are a few computer science concepts we need to know first. You might wonder what this stuff has to do with Git, but trust me, it'll make sense in the end.

Hash functions

We know that functions map inputs to outputs, right? Hash functions specifically map input data of any size ("keys") to output values of a fixed size ("hashes").

A good hash function should have the following properties:

It should compute hashes very quickly. You don't want to be waiting forever to get the output value.
It should result in hashes that are uniformly distributed over the output space. That means that hashes shouldn't cluster around any particular subset of the possible output values. Simply put, it should seem totally random which hash you get for a certain key. In particular, a small change in the input should produce a large, unpredictable change in the output.
It should minimise the number of identical hashes resulting from different keys. When the exact same hash value can be created from different keys, this is known as a "hash collision".

Hash functions are super useful for data access because they allow you to refer to data by their hashes instead of their contents. For example, if you store a user's password as a hash, you don't need to store their password. Instead, you can compare the hash of the password they give you the next time they try to log in to the hash you've previously stored. Password hashing is a little more complicated than that when done properly, but that's the basic idea.
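To make the password example a little more concrete, here's a minimal sketch using Python's standard library. The function names, the example passwords, and the iteration count are all made up for illustration, and this is not a recommendation for production settings. It only shows the basic idea: store a salt and a hash, then compare hashes at login.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch only, not a security recommendation.
ITERATIONS = 100_000


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a fixed-size hash from a password and a random salt."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def verify_password(attempt: str, salt: bytes, stored: bytes) -> bool:
    """Hash the login attempt and compare it to the stored hash."""
    return hmac.compare_digest(hash_password(attempt, salt), stored)


# At sign-up we keep only the salt and the hash, never the password itself.
salt = os.urandom(16)
stored_hash = hash_password("correct horse", salt)

right = verify_password("correct horse", salt, stored_hash)
wrong = verify_password("wrong guess", salt, stored_hash)
print(right, wrong)  # True False
```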

As you can imagine, building a good hash function is very difficult! If you ever need a hash function for something you're building, go with an existing and established hash function that suits your needs. Building your own hash function makes a fun coding project though, as long as it's just for fun!

SHA-1 hash function

There are many hash functions out there, but the one Git uses is called the SHA-1 hash function. To see how it works, we'll hash a few different keys.

First off, let's use my name, Kiyan:

SHA-1(Kiyan) =
    b277135fb277598b94a3ef261e938d4ddc38a0d1

Cool! Now let's use my name again, but with the first letter changed to lowercase:

SHA-1(kiyan) =
    af90ce4e58131e82f21721c97bd24f33feb92220

You'll notice that this hash is completely different to the previous hash, even though we only changed the case of the first letter. This is an example of how a small change in the key produces a completely different hash, which helps keep hashes uniformly distributed over the output space.

Okay, now let's input the entire Bee Movie script:

SHA-1(According to all known laws of aviation...) =
    8d23eda2ef6252cc4def3402841133311b6300ad

Notice how even though the Bee Movie script is really, really long, the resulting hash is the exact same length as the hashes of my 5-letter name. This is because hash functions map keys of any size to hashes of a fixed size.
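If you'd like to try this yourself, here's a small sketch using Python's built-in hashlib module. The helper name sha1_hex and the sample inputs are my own choices. Keep in mind that the exact digest depends on the exact bytes you feed in (encoding, trailing newlines, and so on), so your results for the examples above may differ depending on how you pass the text in.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of `text` (encoded as UTF-8) as hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


short = sha1_hex("abc")
changed = sha1_hex("Abc")               # only the case of the first letter differs
long_input = sha1_hex("abc" * 100_000)  # a much, much longer key

print(short)  # a9993e364706816aba3e25717850c26c9cd0d89d
print(changed)
print(long_input)

# Every digest has the same length, whatever the size of the input.
print(len(short), len(changed), len(long_input))  # 40 40 40
```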

Arrays and linked lists

For the most part, arrays are just pieces of data stored contiguously in memory. This means that each piece of data is stored one after the other. They might seem similar to "lists" and "slices" in many higher-level languages, but they're much simpler.

Linked lists are a different way to implement the same concept of "a bunch of data in some order". The key difference is that linked lists store a reference along with each piece of data. Each of these references points to the next piece of data. We'll call each of these "data plus reference" things a "node". Because each node stores a reference to the next node, each node can be stored in a completely different spot in memory.

This means that you can modify the linked list to skip a node, add a new node, and so on without moving the rest of the linked list. This is because all you need to do is update the relevant references! Meanwhile, doing the same with an array would require moving each piece of data around until all pieces are in the right order. However, traversing the linked list requires moving from node to node until you find the one you want, while array indices can be accessed in constant time.

Typically, nodes in a linked list only store references to the following nodes in the sequence. In a variation called a "doubly linked list", nodes also store references to the previous nodes.
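Here's a minimal sketch of a singly linked list in Python. The Node class and the walk helper are hypothetical names chosen for this example. It shows how adding or skipping a node only means updating a reference, with no need to move the other nodes.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    value: str
    next: Optional["Node"] = None  # reference to the following node, if any


def walk(head: Optional[Node]) -> Iterator[str]:
    """Traverse the list node by node, yielding each value."""
    node = head
    while node is not None:
        yield node.value
        node = node.next


# Build a -> b -> c
c = Node("c")
b = Node("b", c)
a = Node("a", b)
original = list(walk(a))
print(original)  # ['a', 'b', 'c']

# Add a new node x after a: only a's reference changes.
a.next = Node("x", a.next)
after_insert = list(walk(a))
print(after_insert)  # ['a', 'x', 'b', 'c']

# Skip b: x now refers straight to c.
a.next.next = c
after_skip = list(walk(a))
print(after_skip)  # ['a', 'x', 'c']
```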

Graphs

Simply put, a graph is a bunch of vertices. Pairs of these vertices are connected by edges. Edges can be one-way connections (directed) or two-way connections (undirected). If you think about it, graphs are kind of like linked lists whose nodes can refer to more than one other node.

A directed acyclic graph (DAG) is a specific type of graph. DAGs have two restrictions: they must be directed and acyclic, as their name implies. We already know what it means for a graph to be directed. The "acyclic" part means that the graph can't have any cycles, which you can think of as loops. Because DAGs are both directed and acyclic, you can't reach the same node twice while traversing a path. That's because there can't be any paths like A <-> B or A -> B -> C -> A.
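As a rough illustration, a directed graph can be written as a dictionary that maps each vertex to the vertices it points to. The sketch below (has_cycle is a name I made up) uses a depth-first search to check whether such a graph contains a cycle, which tells us whether it qualifies as a DAG.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # vertex -> vertices it has an edge to


def has_cycle(graph: Graph) -> bool:
    """Return True if following edges can lead back to an already-open vertex."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {vertex: UNVISITED for vertex in graph}

    def visit(vertex: str) -> bool:
        state[vertex] = IN_PROGRESS
        for nxt in graph.get(vertex, []):
            if state.get(nxt, UNVISITED) == IN_PROGRESS:
                return True  # we came back to a vertex on the current path
            if state.get(nxt, UNVISITED) == UNVISITED and visit(nxt):
                return True
        state[vertex] = DONE
        return False

    return any(state[v] == UNVISITED and visit(v) for v in graph)


dag = {"A": ["B", "C"], "B": ["C"], "C": []}
loop = {"A": ["B"], "B": ["C"], "C": ["A"]}  # A -> B -> C -> A
two_way = {"A": ["B"], "B": ["A"]}           # A <-> B

print(has_cycle(dag), has_cycle(loop), has_cycle(two_way))  # False True True
```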

A tree is another type of graph. Trees also have two restrictions: they must be minimally connected and they can't have any cycles. We just learned what it means for a graph to be acyclic. Graphs that are "minimally connected" don't have more connections than they need to. As a result, each node in a tree can't have more than one parent. Otherwise, we'd have more than one path to that node.

Because the restrictions on what defines a tree mean they end up looking like, well, a tree, there are a few special terms for specific kinds of nodes. The "top-level" node of a tree, from which the rest of the nodes "grow", is called the "root". The "lowest-level" nodes of all subtrees, where no more nodes grow from, are called "leaves".

Trees can be either directed or undirected in mathematics. However, computer scientists normally assume that a "tree" with no other qualifications is directed and rooted. When a tree is rooted, it just means that one node has been designated the "root" as described above. We'll be using the common computer science definition of trees for this blog post.

Merkle trees

Now let's bring together our understandings of trees and hashes! Merkle trees, also known as hash trees, are basically the foundation of how Git manages versions.

Merkle trees are different from regular trees because of how their nodes are labelled. The data they store is split into "blocks". Each data block has an associated leaf node which is labelled with the hash of its block. Then each parent node is labelled with the hash of its children's labels, all the way up to the root.

Remember that good hash functions minimise the number of hash collisions? And remember that, because it's unlikely for different keys to result in the same hash, we can check if multiple data are the same by comparing their hashes?

Well, because each node is labelled with a hash, we can test if multiple subtrees are identical by comparing their hashes! This also allows us to avoid storing the same data multiple times. Because we know it's the same, we can simply share references to the identical data.
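Here's a minimal sketch of the idea in Python. It assumes a binary Merkle tree, where nodes are hashed in pairs, which is a simplification I chose for the example. The function names are hypothetical too. The point is just that changing any block changes the root, while identical blocks always produce identical hashes.

```python
import hashlib
from typing import List


def h(data: bytes) -> str:
    """Label for a node: the SHA-1 hash of its contents."""
    return hashlib.sha1(data).hexdigest()


def merkle_root(blocks: List[bytes]) -> str:
    """Hash each block, then hash pairs of labels until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(block) for block in blocks]  # leaf labels
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                parents.append(h("".join(pair).encode()))
            else:
                parents.append(pair[0])  # odd one out is carried up unchanged
        level = parents
    return level[0]


original = merkle_root([b"a", b"b", b"c", b"d"])
same = merkle_root([b"a", b"b", b"c", b"d"])
edited = merkle_root([b"a", b"b", b"c", b"D"])  # one block changed

print(original == same)    # True
print(original == edited)  # False
```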

How Git works

Now that we understand hashes, graphs, and Merkle trees, let's dive into Git's internals.

An overview

There are a few main structures in Git that are necessary to understand.

When you're working on a project using Git, you're probably editing a directory with files in it. While it's common to think of this directory as your repository, it's not! This directory is known as the working tree, and it has a repository associated with it.

Whenever you make changes to your working tree, Git can tell what has changed. You register the changes you want to keep in a mutable cache called the index, also known as the staging area.

You don't just keep your changes registered in the index forever. At some point you make a commit, which is a snapshot of the working tree in a particular state. If you check out a different commit, your working tree changes to the state of the working tree at the checked out commit.

When you initialise a Git repository with git init, Git creates a .git directory inside your project folder. This directory is the repository, and it contains a collection of commits along with a bunch of other data. Hopefully the distinction between the repository and working tree is clear now!

You might have noticed that the working tree, index, and repository have a circular relationship. You're probably familiar with this relationship already, although you might not have known it.

When you use git add, you're registering your changes to the working tree in the index. With git commit, you're committing your changes to the repository based on the state of the index. Finally, git checkout allows you to check out previously committed states of the working tree from the repository. And so the cycle continues.
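To make the cycle concrete, here's a toy model in Python. It's only a sketch: the three dictionaries and lists stand in for the working tree, index, and repository, and the functions are simplified imitations that I named after the Git commands. Real Git does far more than this.

```python
from typing import Dict, List

working_tree: Dict[str, str] = {}      # path -> contents, the files you edit
index: Dict[str, str] = {}             # staged contents
repository: List[Dict[str, str]] = []  # committed snapshots, oldest first


def add(path: str) -> None:
    """Register the working tree's version of `path` in the index."""
    index[path] = working_tree[path]


def commit() -> int:
    """Snapshot the index into the repository and return its position."""
    repository.append(dict(index))
    return len(repository) - 1


def checkout(snapshot: int) -> None:
    """Make the working tree and index match a committed snapshot."""
    for area in (working_tree, index):
        area.clear()
        area.update(repository[snapshot])


working_tree["notes.txt"] = "first draft"
add("notes.txt")
first = commit()

working_tree["notes.txt"] = "second draft"
add("notes.txt")
second = commit()

checkout(first)
print(working_tree)  # {'notes.txt': 'first draft'}
```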

Objects

The repository keeps track of the different working tree states using objects. It stores these objects as files in .git/objects. Let's walk through the steps Git takes when we make a commit.

First, Git stores the contents of each changed file as a blob object, identified by the SHA-1 hash of those contents plus a small header. (Strictly speaking, blobs are written when you stage a file with git add.) It then creates tree objects that reference blobs by their hashes, and hashes the trees too. While each blob corresponds to the contents of one file, each tree references the blobs and subtrees that are within its corresponding directory, along with their names. Git creates more tree objects until the entire working tree associated with the repository is hashed and recorded.

Do you see what's happening here? Git is using a Merkle tree to hash the working tree contents. Because each hash effectively uniquely identifies its contents, Git can easily determine if there are any differences between multiple hashed working trees. If even one file has its contents changed, its hash will change and so will the hashes of objects above it in the Merkle tree, up to and including the root node.
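We can reproduce these object hashes with a few lines of Python. Git hashes a short header (the object type and the size of the body) followed by the body itself. The helper names below are my own, and the tree function is deliberately limited: it assumes a flat directory containing only regular, non-executable files.

```python
import hashlib
from typing import Dict


def object_id(kind: str, body: bytes) -> str:
    """SHA-1 of '<kind> <size>' + NUL byte + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    return object_id("blob", content)


def tree_id(files: Dict[str, bytes]) -> str:
    """Hash a flat directory of regular files (assumed mode 100644)."""
    body = b""
    for name in sorted(files):  # entries are ordered by name
        raw_hash = bytes.fromhex(blob_id(files[name]))
        body += b"100644 " + name.encode() + b"\0" + raw_hash
    return object_id("tree", body)


print(blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(tree_id({}))                # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Changing one file's contents changes the hash of the tree above it.
before = tree_id({"a.txt": b"one\n", "b.txt": b"two\n"})
after = tree_id({"a.txt": b"one\n", "b.txt": b"TWO\n"})
print(before != after)  # True
```

If you have Git installed, you can compare the blob hashes against the output of git hash-object for files with the same contents.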

Okay, now what about the commit itself? Git creates a commit object which stores a reference to the top-level tree object for the current snapshot of the working tree. Remember that with a directed rooted tree, if we have a reference to the root node we can traverse the whole tree.

The commit object also stores the parent commits, the author/committer information (user.name, user.email, and timestamps), and the commit message. Like the other objects, the commit object is hashed. A change in any of these inputs (including the working tree hash) would change the hash of the resulting commit object, creating a new one. Because commit objects reference their parents, they form a DAG called the commit graph.
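Here's a sketch of how a commit's hash could be computed from those inputs. It follows the basic layout of a commit object but ignores optional extras such as signatures, and the identity and timestamps are invented for the example. Notice how changing the message or the parent produces a different hash.

```python
import hashlib
from typing import List

# The well-known hash of an empty tree, used here as a stand-in snapshot.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
# Hypothetical identity: name, email, Unix timestamp, and timezone offset.
WHO = "A U Thor <author@example.com> 1580601600 +0000"


def commit_id(tree: str, parents: List[str], message: str,
              author: str = WHO, committer: str = WHO) -> str:
    """Hash a simplified commit object (no signatures or extra headers)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]
    lines += [f"author {author}", f"committer {committer}"]
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


root = commit_id(EMPTY_TREE, [], "Initial commit")
child = commit_id(EMPTY_TREE, [root], "Second commit")
reworded = commit_id(EMPTY_TREE, [root], "Second commit, reworded")
reparented = commit_id(EMPTY_TREE, [child], "Second commit")

print(child != reworded)    # True: the message is part of the hash
print(child != reparented)  # True: so is the parent
```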

It's a common misconception that the commit graph is a Merkle tree, but that doesn't make sense:

Tree nodes can only have one parent, but commits that merge multiple commits must necessarily have multiple parents.
Trees can only have one root, but commits don't have to share a common ancestor.
Trees are connected, but it's possible to have multiple disconnected subgraphs of commits.

References

Repositories use several references to keep track of commits in the graph. Git stores branch and tag references in .git/refs, while HEAD lives in its own file at .git/HEAD.

A branch is a reference to a commit that's used to keep track of a commit's lineage. You don't usually need to update branch references yourself. For example, if you make a commit to your branch named dev, the dev reference will update to refer to your new commit. It's common to think of branches as sequences of commits, but they're not! They're just references that help you keep track of the last commit in a certain lineage of commits.

A tag is also a reference to a commit. Unlike branches, tags are meant to always refer to the same commit. They can also have descriptions. Tags are handy when you want to keep track of a specific commit in your graph. People often use them to keep track of project releases.

Finally, HEAD is also a reference to a commit. (Do you see a pattern?) Instead of a category of references like branches and tags, HEAD itself is a reference. HEAD is used by the repository to define which working tree state is currently checked out. Like branches, HEAD can be automatically updated. When you make a commit following the one you've checked out, HEAD updates its reference to the new commit.

If you ever check out a commit directly instead of checking out a branch, Git will tell you that HEAD is "detached". As scary as a "detached HEAD" sounds (I like my own head right where it is!), don't be alarmed. All this means is that HEAD is now referring directly to a commit rather than to a branch.

If you make a commit while HEAD is detached, the new commit will be a child of the commit you've checked out. But because HEAD isn't attached to a branch, there is no branch that will update to track this new commit. This means that if you make the new commit and then check out something else, you'll lose track of it. Don't worry, we'll learn how to recover from this situation later.
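The toy model below sketches how branches and HEAD move when you commit. Everything in it (the Repo class, its methods, the made-up commit ids like c1) is hypothetical and heavily simplified. It only exists to show why a commit made on a detached HEAD isn't tracked by any branch.

```python
from typing import Dict, Optional, Tuple


class Repo:
    def __init__(self) -> None:
        self.branches: Dict[str, str] = {}               # branch name -> commit id
        self.head: Tuple[str, str] = ("branch", "main")  # or ("commit", id)
        self._count = 0

    def head_commit(self) -> Optional[str]:
        kind, target = self.head
        return self.branches.get(target) if kind == "branch" else target

    def commit(self) -> str:
        self._count += 1
        new = f"c{self._count}"  # made-up commit id
        kind, target = self.head
        if kind == "branch":
            self.branches[target] = new  # the checked-out branch moves forward
        else:
            self.head = ("commit", new)  # detached: only HEAD moves
        return new

    def checkout_branch(self, name: str) -> None:
        self.head = ("branch", name)

    def checkout_commit(self, commit_id: str) -> None:
        self.head = ("commit", commit_id)  # this detaches HEAD


repo = Repo()
repo.commit()               # c1 on main
repo.commit()               # c2 on main
repo.checkout_commit("c1")  # detached HEAD
lost = repo.commit()        # c3, tracked by HEAD only
repo.checkout_branch("main")

print(repo.head_commit())               # c2
print(lost in repo.branches.values())   # False: no branch refers to c3
```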

Doing stuff with Git

Phew! We could discuss Git internals forever. But now that we have the basics down, how can we up our Git game? Next up is a quick tour of some lesser-known features of Git. We won't cover everything, just things that I've found extra useful.

Referencing commits

In order to do stuff with commits, you typically need to be able to reference them. Otherwise, it wouldn't be clear which commit you want to do stuff with.

There are a ton of ways to reference commits in Git. Here are a few:

By its reference where applicable. Why come up with your own reference when Git is already keeping one? You can use branch names, tag names, and HEAD to refer to whatever those labels are currently referencing.
By its 40-character SHA-1 hash. Remember how every commit has an associated hash? Because we're pretty much guaranteed to have no hash collisions, they work as unique references.
By the first N characters of its SHA-1 hash, where N is at least the minimum required for a unique reference within the repository. Unless you have a ton of commits, you'll probably only need a handful of characters to uniquely identify a hash.

Now say you have a reference to a commit. You can use that reference to reference other commits!

With {commit}~N to reference the commit N generations before {commit}.
With {commit}^N to reference the Nth parent of {commit}.

Parents and generations

As just described, ~ specifies generations while ^ specifies parents. But what does that mean?

When you merge commits A and B, you end up with a new commit C. Merging is basically a way of taking two different working tree states and mashing them together to form a new one. Because C results from both A and B, it has two parents.

Parents are numbered in the order they were recorded when the child was created. For a merge, the first parent is the commit you were on when you merged, and the second parent is the commit you merged in. This means that {commit}^1 refers to the first parent while {commit}^2 refers to the second, and so on. Generations are measured by following the first parent. For example, {commit}~3 is the same thing as {commit}^1^1^1. If you leave out N, the default value is 1.

Let's assume that you were on B when you merged A into it. Then B is the first parent of C and A is the second. Say that C can be referenced with HEAD~1. Then B can be referenced with HEAD~1^1 or HEAD~2, while A must be referenced with HEAD~1^2. The first parent of B can be referenced with HEAD~3 and the first parent of A can be referenced by HEAD~1^2~1.

In short, ~ usually does what you want, but ^ is needed when commits in the graph have multiple parents.
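To check the example above, here's a small sketch of a resolver for ~ and ^ over a hand-written commit graph. The graph matches the example: H is the commit HEAD refers to, C is the merge of B and A, and P and Q are names I invented for the first parents of B and A. The resolve function is a toy and not how Git parses revisions.

```python
import re
from typing import Dict, List

# commit -> its parents, in order (first parent first)
PARENTS: Dict[str, List[str]] = {
    "H": ["C"],
    "C": ["B", "A"],  # merge commit: B is the first parent, A the second
    "B": ["P"],
    "A": ["Q"],
    "P": [],
    "Q": [],
}
REFS = {"HEAD": "H"}


def resolve(expr: str) -> str:
    """Resolve expressions such as HEAD~2 or HEAD~1^2~1 to a commit name."""
    match = re.fullmatch(r"([A-Za-z]\w*)((?:[~^]\d*)*)", expr)
    if match is None:
        raise ValueError(f"cannot parse {expr!r}")
    name, suffix = match.groups()
    commit = REFS.get(name, name)
    for op, digits in re.findall(r"([~^])(\d*)", suffix):
        n = int(digits) if digits else 1  # N defaults to 1
        if op == "~":
            for _ in range(n):            # follow the first parent n times
                commit = PARENTS[commit][0]
        elif n > 0:                       # ^0 means the commit itself
            commit = PARENTS[commit][n - 1]
    return commit


print(resolve("HEAD~1"))      # C
print(resolve("HEAD~2"))      # B
print(resolve("HEAD~1^2"))    # A
print(resolve("HEAD~3"))      # P
print(resolve("HEAD~1^2~1"))  # Q
```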

It's important to note that children maintain references to their parents, but parents do not maintain references to their children. This makes a ton of things easier in Git because commits just need to know which commits they follow. Also, it wouldn't be possible for parents to reference their children: a commit's hash depends on everything it stores, so a commit can't be updated after the fact to refer to children that didn't exist when it was created.

Stashing commits

You've probably had Git tell you that you should stash your code if you want to blah blah blah. If you're like many people, you've probably just ignored this.

Stashing saves your uncommitted changes, both in the working tree and in the index, and gives you a clean working tree back. So not only does it record your changes to the working tree, it also records the state of your cache! Some developers make a habit of stashing their changes at the end of each day.

Importantly, the stash is not branch-specific. This means that things can get messy if you stash changes from multiple branches. However, it also means that you can stash changes from one branch and apply them to another.

Patching commits

The index is useful because it lets us choose exactly which changes we want to commit. You probably know that we can add changes to only specific files instead of the entire working tree by using git add {file}.

Git allows us to get a lot more granular with our adding. We can choose to only add specific hunks of changes from given files instead of adding the changes from entire files. All we have to do is specify the patch option: git add --patch {file}. It's also possible to select hunks across multiple files.

Patching is really useful when we've made multiple changes in a file which are fundamentally about different things and don't belong in the same commit. You can also specify the patch option for other commands such as git reset, git checkout, and git stash, for example if you only want to undo some of your changes.

When patching, Git will iterate through all hunks in the file and ask if you want to stage each one. It will spit out a lot of single-character options at you. The final option is ?, which will explain the meaning of each option. These aren't even all the options! If you run git add --help, you'll find the full list.

To start, these are the important ones:

y - stage this hunk
n - do not stage this hunk
e - manually edit the current hunk

I find myself using the e option whenever my changes are too close in proximity to other changes for Git to tell them apart. It's obviously not as simple as the y or n options, but the hunk editor has comments telling you how to use it. You can get by with just these three, but the other options will save you time once you know what they are. For example, you can leave hunks undecided, stage or skip all remaining hunks, and move between hunks.

Mixed, soft, and hard resets

Resets are a common thing to come across on Stack Overflow while searching how to fix your mistakes. Unless you've always been a Git superstar and have never been in that situation, you've probably wondered what the difference is between the reset types.

The command has the following format: git reset [--mixed | --soft | --hard] {commit}. That's because there are three different types of resets:

A mixed reset reverts HEAD and the changes in the index to the specified commit. If you don't specify what type of reset you want to do, Git will do a mixed reset by default. Basically, this means HEAD will be updated and your staging area will be updated, but your working tree will stay the same.
A soft reset reverts HEAD to the specified commit but does not change the index or the working tree. Because all you're really doing is changing which commit HEAD (and the branch it's on) points to, a soft reset is roughly equivalent to git update-ref HEAD {commit}.
A hard reset reverts HEAD, the index, and the working tree to the specified commit. Be very careful with hard resets! Because it will overwrite your uncommitted changes, only use a hard reset if you're sure it's what you want to do.
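The differences between the three reset types can be sketched with a toy model. Here, each of HEAD, the index, and the working tree is reduced to a single string, and the SNAPSHOTS table and reset function are hypothetical stand-ins, not Git's implementation.

```python
from dataclasses import dataclass

# Made-up commits and the contents they recorded.
SNAPSHOTS = {"c1": "version 1", "c2": "version 2"}


@dataclass
class State:
    head: str          # the commit HEAD refers to
    index: str         # what's staged
    working_tree: str  # what's on disk


def reset(state: State, commit: str, mode: str = "mixed") -> State:
    """Return the state after a soft, mixed (default), or hard reset."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown reset type: {mode}")
    new = State(head=commit, index=state.index, working_tree=state.working_tree)
    if mode in ("mixed", "hard"):
        new.index = SNAPSHOTS[commit]         # staging area is reverted
    if mode == "hard":
        new.working_tree = SNAPSHOTS[commit]  # uncommitted edits are overwritten
    return new


start = State(head="c2", index="staged edits", working_tree="unstaged edits")

soft = reset(start, "c1", "soft")
mixed = reset(start, "c1")
hard = reset(start, "c1", "hard")

print(soft.index, "|", soft.working_tree)    # staged edits | unstaged edits
print(mixed.index, "|", mixed.working_tree)  # version 1 | unstaged edits
print(hard.index, "|", hard.working_tree)    # version 1 | version 1
```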

Rebasing commits

A rebase is similar to a merge in that it incorporates changes from one branch into another, but it does so by replaying one branch's commits on top of the other. As a result, it changes the commit lineage along the rebased commits. This is because every replayed commit gets a new parent (and usually a new tree), so its hash must change too.

For example, say you have branch foo with graph root <- A <- B <- C <- D and branch bar with graph root <- A <- E <- F <- G. You'll notice that foo and bar share the common path root <- A. However, A has two children B and E, each of which has its own descendants.

Let's write out the commit graph more clearly:

         <- B <- C <- D  :: foo
root <- A
         <- E <- F <- G  :: bar

Let's say you want to combine those two branches. Specifically, you want to incorporate bar into foo. You have two options: merge and rebase. In both cases, you'll end up with a new commit we'll call Z. Z is the combination of D and G, and we could say that Z = D + G.

When merging, the differences between foo and bar are only incorporated into the new commit Z. Z will be the direct child of both D and G, which preserves the existing commit graph. The resulting shape of the graph will show two paths that diverged after A and merged with Z.

The commit graph after merging:

         <- B <- C <- D
root <- A              <- Z  :: foo
         <- E <- F <- G

When rebasing, the differences between foo and bar are incorporated in every commit involved in the rebase. This will create a new commit for each commit in the path E to G. Let's call them X, Y, and Z. In this case, X = D + E, Y = D + F, and as we already know, Z = D + G. The resulting shape of the graph will show one path from A to Z because bar had its base changed from A to D. Once foo is moved forward to Z, it contains everything from bar.

The commit graph after rebasing:

root <- A <- B <- C <- D <- X <- Y <- Z  :: foo

If you're the only person who had access to the commits E to G, no one will be any the wiser. For all they'd know, bar never existed and you'd based commits X to Z on D since the very beginning. However, if anyone else had access to E to G, you'd need to make sure that they sync up with the new graph properly. Otherwise it could cause trouble because you've chosen to ignore that path in favour of the modified commits X to Z. Unless a merge isn't sufficient for your needs, just merge the commits.
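Here's a toy simulation of the rebase above. It's a sketch under big assumptions: each commit is reduced to a parent plus a one-word "change", commit ids are just hashes of those two things, and the function names are mine. Still, it shows why replayed commits get new ids while the originals stay untouched.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# commit id -> (parent id, change); a stand-in for the object store
commits: Dict[str, Tuple[Optional[str], str]] = {}


def commit(parent: Optional[str], change: str) -> str:
    """Create a commit whose id depends on its parent and its change."""
    commit_id = hashlib.sha1(f"{parent}|{change}".encode()).hexdigest()
    commits[commit_id] = (parent, change)
    return commit_id


def history(tip: Optional[str]) -> List[str]:
    """List the changes from the root up to `tip`."""
    changes = []
    while tip is not None:
        tip, change = commits[tip]
        changes.append(change)
    return changes[::-1]


def rebase(tip: str, base: str, onto: str) -> str:
    """Replay the commits after `base` up to `tip` on top of `onto`."""
    to_replay = []
    while tip != base:
        parent, change = commits[tip]
        to_replay.append(change)
        tip = parent
    new_tip = onto
    for change in reversed(to_replay):
        new_tip = commit(new_tip, change)  # new parent, so a new id
    return new_tip


a = commit(commit(None, "root"), "A")
foo = commit(commit(commit(a, "B"), "C"), "D")
bar = commit(commit(commit(a, "E"), "F"), "G")

rebased = rebase(bar, base=a, onto=foo)
print(history(rebased))  # ['root', 'A', 'B', 'C', 'D', 'E', 'F', 'G']
print(rebased != bar)    # True: same change, different commit
print(bar in commits)    # True: the old commits still exist
```

In the article's terms, the replayed versions of E, F, and G are X, Y, and Z.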

By default, rebasing preserves the author date but updates the committer date. The --reset-author-date or --ignore-date flags will set the author date to the committer date value, and the --committer-date-is-author-date flag will set the committer date to the author date value.

Rebase will save your life

I know I just warned you not to rebase unless your commits are local or you really need to... But sometimes it's the right thing to do.

You might want to rebase if:

A password was accidentally committed a long time ago and needs to be removed from the commit graph.
You made a mistake two commits ago but haven't pushed your commits yet.
The commit history is messy. Maybe there are 6 commits that should really be just 1.
A teammate wants to remove their old name from a project, and the team is on board.

When rebasing, I recommend using it in interactive mode, which allows you to specify exactly how the commit graph will be written. It lets you collapse commits, reorder commits, remove commits, modify commits, and more. It's awesome.

The reflog will save your life

The reflog is short for the "reference log". It records an entry every time a reference is updated. This could be HEAD, a branch, or something else. As a result, anything you do through Git that results in a reference update will be recorded in the reflog. This allows you to search through and recover previous states.

The reflog can help you with:

Recovering or investigating previously stashed working trees.
Recovering commits that you no longer have a reference to. No, you don't have to make those changes all over again!
Removing sensitive information in commits that were removed from the graph but not garbage collected.

Since the reflog is never committed to the repository, it's effectively local to your computer. The nice thing about this is it's impossible for it to get clogged up with other people's reference changes. The not so nice thing is that if you re-clone the repository for whatever reason, you won't have your old reflog any more.

By default, the reflog's entries expire 90 days after entry (30 days for entries that are no longer reachable from the current tip of the reference) and are pruned the next time garbage collection runs. You can change the default expiry time if 90 days doesn't suit your fancy. You can also manually expire contents using any expiry time, which might be useful if you need to immediately clear up space.
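To see why the reflog makes recovery possible, here's a toy model of a reference that logs every update. The Ref class, the lookup helper, and the commit ids are all hypothetical, and expiry is left out entirely. The idea is simply that old values stay findable even after the reference has moved away from them.

```python
from typing import List, Tuple


class Ref:
    def __init__(self, name: str, commit: str) -> None:
        self.name = name
        self.commit = commit
        self.log: List[Tuple[str, str, str]] = []  # (old, new, reason)

    def update(self, new: str, reason: str) -> None:
        """Move the reference and record the move in its log."""
        self.log.append((self.commit, new, reason))
        self.commit = new


def lookup(ref: Ref, n: int) -> str:
    """Value the reference had n updates ago (0 is the current value)."""
    return ref.log[-1 - n][1]


head = Ref("HEAD", "c1")
head.update("c2", "commit: add feature")
head.update("c3", "commit: fix typo")
head.update("c1", "reset: moving to c1")  # oops, c2 and c3 look lost

print(head.commit)      # c1
print(lookup(head, 1))  # c3, the commit we were on before the reset
print(lookup(head, 2))  # c2
```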

Conclusion

That's all for now! I hope you learned something new about Git that you didn't know before. Still having Git trouble? Don't be afraid to reach out! I'm always happy to help friends and coworkers.