Distributed Version Control
In the previous article, where we looked at the history of version control systems, we talked about SCCS, RCS, CVS and SVN, four of the most popular version control systems of the past. All four of these use a central code repository model: one central place stores the master copy of your code. When you're working with the code, you check out a copy from that master repository, make your changes, and then submit those changes back to the central repository. Other users can also work with that repository, submitting their own changes, and it's up to us as users to keep up to date with whatever's happening in the central repository by pulling down any changes that other people have made.
Git Workflow
Git doesn't work that way. Git is distributed version control. Different users each maintain their own repositories instead of working from a central repository. It helps to think of the changes as being stored as change sets, or patches: we're focused on tracking changes, not the versions of the documents. Now that's a subtle difference. You may think, well, CVS and SVN track changes too. They do, but they track the changes that it takes to get from version to version of each file, or between the different states of a directory. Git instead focuses on these change sets, encapsulating each change set as a discrete unit, and those change sets can then be exchanged between repositories. We're not trying to keep up to date with the latest version of something. Instead the question is: do we have a change set applied or not? So you might say that you merge in change sets, or apply patches, between the different repositories. There's no single master repository. There are just many working copies, each with its own combination of change sets.
Example
Imagine that we have changes to a single document as change sets A, B, C, D, E, and F. The letter names are arbitrary; they just make the example easier to follow. We could have a first repository that has all six of those change sets in it. We can have repository two that has only four of them, A through D. It's not that it's behind repository one, or that it needs to be brought up to date. It simply doesn't have the same change sets. We can have repository three that has sets A, B, C, and E, and repository four that has A, B, E, and F.
- Repo 1: A, B, C, D, E, F
- Repo 2: A, B, C, D
- Repo 3: A, B, C, E
- Repo 4: A, B, E, F
None of these repositories is right, and none of them is wrong. None of them is the master repository, with the others somehow out of date or out of sync with it. They're all just different repositories that happen to have different change sets in them. We could just as easily add change set G to repository three and then share it with repository four without ever having to go to any kind of central server at all. With CVS and SVN, by contrast, you would need to submit those changes to a central server, and then other people would need to pull down those changes to update their versions of the file.
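To make the example concrete, here is a minimal sketch that treats each repository as nothing more than a set of change-set labels. This is a toy model for illustration only, not how Git stores data internally, and the names `repos`, `missing_from`, and `share` are hypothetical helpers invented for this sketch.

```python
# Toy model only: each repository is just a named set of change-set labels.
# This mirrors the example above; it is not Git's internal storage format.

repos = {
    "repo1": {"A", "B", "C", "D", "E", "F"},
    "repo2": {"A", "B", "C", "D"},
    "repo3": {"A", "B", "C", "E"},
    "repo4": {"A", "B", "E", "F"},
}


def missing_from(target, source):
    """Return the change sets that `source` has and `target` lacks."""
    return source - target


def share(source, target, change_set):
    """Apply one change set from `source` to `target`, peer to peer."""
    if change_set not in source:
        raise ValueError(f"source does not have change set {change_set!r}")
    target.add(change_set)


# Add change set G to repository three, then share it directly with
# repository four. No other repository is involved.
repos["repo3"].add("G")
share(repos["repo3"], repos["repo4"], "G")

print(sorted(repos["repo4"]))                                # ['A', 'B', 'E', 'F', 'G']
print(sorted(missing_from(repos["repo4"], repos["repo3"])))  # ['C']
print("G" in repos["repo1"])                                 # False
```

Notice that the only question the sketch ever asks is whether a given change set is present in a repository. Repository one never hears about G, and nothing is broken because of it.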
Now by convention, we often do designate a repository as being the master repository, but that's not built into Git. It's not part of the Git architecture. It's just a convention, that we say okay, this is going to be the master repository and everyone is going to submit their changes to this repository, and we're all going to stay in sync from that one, but we don't have to. We can actually have three or four different master repositories that have different versions in them, and we could all be contributing to those equally and just swapping changes between them.
Advantages
Now because it's distributed, Git has a few advantages.
- There's no need to communicate with a central server. That makes things faster, and it means we don't need network access to commit our changes. We can work on an airplane, for example.
- There's no single point of failure. With CVS and SVN, if something goes wrong with that central repository, that can be a real show stopper for everyone else who's working off of that central repository. With Git we don't have that problem. Everyone can keep working. They've each got their own repository that they're working from, not just a copy that they're trying to keep in sync with some central repository.
- It also encourages participation and the forking of projects, and this is really important for the open source community because developers can work independently. They can make changes, bug fixes, and feature improvements, and then submit those back to the project for either inclusion or rejection. And if you're working on an open source project and you don't like the way that it's going, you can fork that project, create your own version, and take it in a completely different direction. That becomes a really powerful and flexible feature that's well suited to collaboration between teams, especially loose groups of distributed developers like you have in the open source world.
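The three advantages above can be sketched with the same kind of toy model. The `Repo` class, its `online` flag, and the `commit` and `pull_from` methods are all hypothetical names made up for this illustration; they are not Git's API, and real Git does far more than copy labels between sets.

```python
# Toy model only: a repository is a set of change-set labels plus a flag for
# whether its host can be reached. All names here are hypothetical.

class Repo:
    def __init__(self, name, change_sets=()):
        self.name = name
        self.change_sets = set(change_sets)
        self.online = True

    def commit(self, change_set):
        """Record a change locally. No network access is needed."""
        self.change_sets.add(change_set)

    def pull_from(self, other):
        """Copy over any change sets `other` has that we lack."""
        if not other.online:
            raise ConnectionError(f"{other.name} is unreachable")
        new = other.change_sets - self.change_sets
        self.change_sets |= new
        return new


shared = Repo("shared", "AB")    # the repository a team treats as master
alice = Repo("alice", "AB")
bob = Repo("bob", "AB")

shared.online = False            # the conventional master goes down

alice.commit("C")                # work continues locally
bob.commit("D")

pulled = bob.pull_from(alice)    # peers can still exchange changes
print(sorted(pulled))            # ['C']
print(sorted(bob.change_sets))   # ['A', 'B', 'C', 'D']

# A fork is simply another repository that starts from an existing one and
# then goes its own way.
fork = Repo("fork", bob.change_sets)
fork.commit("X")
print(sorted(fork.change_sets))  # ['A', 'B', 'C', 'D', 'X']
```

In this sketch, losing `shared` stops nobody: Alice and Bob keep committing and can still swap change sets with each other, and the fork diverges without affecting the repository it started from.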
Distributed version control is an important part of the Git architecture, and it's important to learn about it, especially if you have previous experience with other version control systems like CVS or SVN. For now, just make sure that you understand that there is no central repository that we all work from. All repositories are considered equal by Git.