Version Control System
Data Model
The following pseudocode describes the git data model.
First, we will look at the three data models in git.
// a file is a sequence of bytes
type blob = array<byte>
// a directory, which contains files and other directories
type tree = map<string, tree | blob>
// every commit contains its parents, metadata and the top-level tree
type commit = struct {
    parents: array<commit>
    author: string
    message: string
    snapshot: tree
}
How should we understand blob, tree and commit?
In my opinion, the best way to understand these three data models is to follow the tutorial (experiment) on the official website (Chinese edition).
I will summarize this tutorial from a high-level perspective.
The experiment is divided into three parts.
- Introduce the concept of blob
1.1 Use the command git hash-object to understand the object id of a blob
1.2 Edit test.txt step by step to get a first feeling for version control
- Create test.txt, write some content (such as "version 1") to it, and create an object id for it
- Edit the content of test.txt (for example, change it to "version 2") and create a new object id for it
- Delete the original file test.txt; you will find that you can use the object id to restore its content
- Introduce the concept of tree
2.1 Add test.txt (with the content "version 1") to the index (staging area)
2.2 Create an object id for the current staging area
2.3 Create a new file named new.txt
2.4 Add test.txt (with the content "version 2") and new.txt to the index
2.5 Create an object id for the current staging area
2.6 Add the first staging area object (tree) to the second staging area object (tree)
- Introduce the concept of commit
3.1 Commit the first tree
3.2 Commit the second tree and specify the first commit as its parent
3.3 Commit the third tree and specify the second commit as its parent
The commands used in the experiment are listed in the Comments at the end of the text.
After the experiment, we will have a deeper understanding of these three data models.
The premise is that git needs to meet the requirements of version control.
- At an abstract level, blob, tree and commit are the three data models in git.
- At a physical level, a blob, tree or commit is an ordinary binary file that saves the content and other information of a file or directory. What makes it special is its name: the file is named by the hash value of its content and other information, which is called the object id.
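As a minimal sketch of how an object gets its name, the Python below hashes a blob the way `git hash-object` is documented to: SHA-1 over a short header (`blob <size>` plus a NUL byte) followed by the content. The function name `git_object_id` is made up for illustration, and the sketch assumes the default SHA-1 object format.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 id of an object with the given type and content."""
    # Git hashes a small header ("<type> <size>" + NUL byte) followed by the
    # content; this is the "content and other information" mentioned above.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes always give the same id, so the id can serve as the file name.
id_v1 = git_object_id("blob", b"version 1\n")
id_v2 = git_object_id("blob", b"version 2\n")
print(id_v1)
print(id_v2)
```

If the sketch is right, the printed ids should match what `git hash-object` reports for files with the same content.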
To get a deeper understanding of these statements, we need to look at the actual files behind these models.
- When we create a blob for a file, a new file shows up in .git/objects. Even if we delete the original file, we can use the blob to restore it. In other words, the blob saves the content of the file.
- Creating a tree has two steps. The first is to put some files or directories into the staging area; after this operation, a new file named index shows up in .git. The second is to create a tree for the staging area; a new file shows up in .git/objects, and its format is the same as a blob's.
- When we create a commit for a tree (a commit can only be created for a tree), a new file shows up in .git/objects.
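The files that appear under .git/objects can be imitated in a few lines. This is a sketch under the assumption of git's loose-object layout (zlib-compressed data, first two hex digits of the id as a subdirectory); the helper names `write_loose_object` and `read_loose_object` are hypothetical, and the code writes to a scratch directory rather than a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(git_dir: Path, obj_type: str, content: bytes) -> str:
    """Store an object in the loose-object layout and return its id."""
    data = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    object_id = hashlib.sha1(data).hexdigest()
    # The first two hex digits name a subdirectory, the remaining 38 the file.
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))
    return object_id


def read_loose_object(git_dir: Path, object_id: str) -> tuple[str, bytes]:
    """Load an object by id and return (type, content)."""
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    data = zlib.decompress(path.read_bytes())
    header, _, content = data.partition(b"\x00")
    obj_type, _size = header.decode().split(" ")
    return obj_type, content


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"  # scratch directory, not a real repository
    oid = write_loose_object(git_dir, "blob", b"version 1\n")
    # The "original file" never existed on disk here; the blob alone restores it.
    restored = read_loose_object(git_dir, oid)
    print(oid, restored)
```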
Let us sort out the logical relationship between blob, tree and commit.
A blob saves the content of a file, but it cannot save the filename or the directory structure. So git introduces the tree, which saves filenames and the directory structure. But a tree cannot save other information such as the committer and the commit time, and a bare hash value tells us nothing about who did what. So git introduces the commit, which builds on the tree and adds this useful information.
Secondly, let us look at the concept of object in git.
// an object in git may be a blob (file), a tree (directory) or a commit (a snapshot, the state at a certain moment)
type object = blob | tree | commit
// git maintains a big dictionary, which uses a string as the key to index an object
objects = map<string, object>
// how git stores an object
def store(object):
    id = sha1(object)
    objects[id] = object
// how git loads an object
def load(id):
    return objects[id]
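The pseudocode above can be turned into a runnable toy. This is only a sketch: it keeps objects in an in-memory dictionary instead of files, and it hashes the raw bytes directly, without the header that real git adds.

```python
import hashlib

# In-memory stand-in for .git/objects: object id -> raw bytes.
objects: dict[str, bytes] = {}


def store(data: bytes) -> str:
    """Save the data under the SHA-1 of its content and return that id."""
    object_id = hashlib.sha1(data).hexdigest()
    objects[object_id] = data
    return object_id


def load(object_id: str) -> bytes:
    """Look an object up by its id."""
    return objects[object_id]


first = store(b"version 1\n")
second = store(b"version 1\n")  # identical content maps to the same key
third = store(b"version 2\n")
print(first == second, len(objects), load(third))  # True 2 b'version 2\n'
```

Because the key is derived from the content, storing the same content twice does not create a second entry.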
Although a commit carries useful information that helps people understand it, so far we can only find a commit by its long and meaningless hash string. So git introduces the reference, which is stored in two kinds of file: the first is .git/HEAD and the second is .git/refs/xxx. Simply put, a reference is a short, meaningful string that refers to a commit. People can use this string to find the commit.
A branch is just a kind of reference. The difficulty is that the high-level view of a branch differs from the low-level one. To avoid confusion, remember that a reference is only a pointer: a short string that stands in for a long hash string. At the high level we say that we commit the current changes "on this branch", but a branch (reference) is not a real path that contains commits.
If the relationship between different branches puzzles you, a good way to sort it out is to ignore the existing branches: all you are facing is the current working directory. When we say that we create a new branch and commit on it, what actually happens is that we create a new reference and a new commit, make the new reference point to this commit, and leave the original reference unchanged.
Thirdly, let us look at the concept of reference in git.
// git uses a long and meaningless string to index an object, which is difficult for humans to memorize.
// So it introduces a new concept, the reference. A reference is a simple, custom string that can substitute for the hash string.
references = map<string, string>
def update_reference(name, id):
    references[name] = id
def read_reference(name):
    return references[name]
def load_reference(name_or_id):
    if name_or_id in references:
        return load(references[name_or_id])
    else:
        return load(name_or_id)
To record the current position, git introduces HEAD, which can point to a reference (branch) or directly to a commit. When HEAD does not point to a reference, this is called the 'detached HEAD' state.
The high-level command git checkout edits HEAD, that is, the .git/HEAD file. The high-level command git branch edits a branch (reference), that is, a file under .git/refs/heads.
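The two states of HEAD can be sketched as follows. The `files` dictionary is a hypothetical stand-in for the files under .git (path to file content), the commit ids are placeholders, and the sketch ignores details of real repositories such as packed refs.

```python
from typing import Optional


def resolve_head(files: dict[str, str]) -> tuple[Optional[str], str]:
    """Return (branch reference or None, commit id) for a toy .git directory."""
    head = files["HEAD"].strip()
    prefix = "ref: "
    if head.startswith(prefix):
        # HEAD names a reference; the reference file holds the commit id.
        ref = head[len(prefix):]
        return ref, files[ref].strip()
    # Detached HEAD: the HEAD file holds a commit id directly.
    return None, head


COMMIT_A = "a" * 40  # placeholder ids, not real commits
COMMIT_B = "b" * 40

on_branch = {"HEAD": "ref: refs/heads/main\n", "refs/heads/main": COMMIT_A + "\n"}
detached = {"HEAD": COMMIT_B + "\n", "refs/heads/main": COMMIT_A + "\n"}

print(resolve_head(on_branch))
print(resolve_head(detached))
```

In this model, git checkout rewrites the `HEAD` entry, while git branch adds or changes an entry under `refs/heads/`.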
Staging area
Note: the staging area is still hard for me to understand, especially how its content changes and how the tree should be understood.
The staging area is an intermediate state between the current working state and a snapshot (commit).
Suppose you have made many modifications and only want some of them in the next snapshot. Creating a snapshot directly from the current state would be inconvenient, so you first select the changes you want in the staging area and then commit only those.
Every staging area corresponds to a tree. It is like a directory that contains files and/or other directories.
A tree is also an object in git, so a tree has a hash string and a file in .git/objects. If you use the command git cat-file -p to print a tree object, you will find that its content is a list of files and/or directories.
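One way to picture the step from staging area to tree is the sketch below. It assumes the staging area can be modelled as a flat mapping from path to blob id and nests it into directories; the function name `build_tree` and the ids are made up, and real git additionally stores file modes and hashes every subtree.

```python
def build_tree(index: dict[str, str]) -> dict:
    """Turn a flat index (path -> blob id) into a nested tree.

    Each tree maps a name to either a blob id (file) or another tree (directory).
    """
    root: dict = {}
    for path, blob_id in sorted(index.items()):
        *dirs, name = path.split("/")
        node = root
        for directory in dirs:
            # Create the subdirectory tree on first use, then descend into it.
            node = node.setdefault(directory, {})
        node[name] = blob_id
    return root


# Hypothetical staging area with two files at the top level and one in bak/.
index = {
    "test.txt": "id-of-version-2",
    "new.txt": "id-of-new-file",
    "bak/test.txt": "id-of-version-1",
}
tree = build_tree(index)
print(tree)
```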
Movement
HEAD movement
Using relative references
You can use the ^ operator to move to the previous commit, as in main^ and HEAD^. You can chain the operator to move back several commits.
To simplify moving back several commits, you can use the ~ operator. For example, HEAD~2 and HEAD^^ are equivalent.
^<number> is not the same as ~<number>: it selects the <number>-th parent of a commit that has several parents. Its usage is shown in the following figure.
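The difference between ^ and ~ can be checked on a toy history. The commit names and the `resolve` helper below are hypothetical; the sketch only models the ^ and ~ suffixes with optional numbers, which is a small part of what git accepts.

```python
# Toy history: commit -> list of parents, first parent first.
parents = {
    "c0": [],
    "c1": ["c0"],
    "c2": ["c1"],
    "c3": ["c1"],
    "c4": ["c2", "c3"],  # a merge commit with two parents
}


def resolve(start: str, suffix: str) -> str:
    """Apply a suffix such as "^", "^2", "~3" or "^2~1" to a commit."""
    commit = start
    i = 0
    while i < len(suffix):
        op = suffix[i]
        i += 1
        digits = ""
        while i < len(suffix) and suffix[i].isdigit():
            digits += suffix[i]
            i += 1
        n = int(digits) if digits else 1
        if op == "^":
            if n > 0:  # ^0 means the commit itself
                commit = parents[commit][n - 1]  # n-th parent
        elif op == "~":
            for _ in range(n):  # n steps along first parents
                commit = parents[commit][0]
        else:
            raise ValueError(f"unsupported operator: {op!r}")
    return commit


print(resolve("c4", "~2"), resolve("c4", "^^"))  # c1 c1
print(resolve("c4", "^2"))                       # c3
```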
Branch/Commit movement
Merge branches
The beginning state is shown in the following figure:
- git merge xxx
Assuming the current branch is main and you want to merge the bugFix branch into main, you can execute git merge bugFix. The result is shown in the following figure.
- git rebase xxx
Assuming the current branch is bugFix and you want to combine bugFix with main while keeping a single line of commit history, you can execute git rebase main. The result is shown in the following figure.
git rebase xxx means moving the current branch onto the xxx branch; git rebase yyy xxx means moving the xxx branch onto the yyy branch.
An important and easily overlooked point is that rebasing does not move just one commit onto the target. Git finds the lowest common ancestor (LCA) of the source branch and the target branch, then replays the source commits that come after the LCA on top of the latest commit of the target branch, in their original order.
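The rebase rule described above can be sketched on a toy history. Everything here is an assumption for illustration: commits are plain names, only first parents are followed, and a replayed commit is modelled as a copy with a `'` suffix. Conflicts and commit contents are ignored.

```python
def history(parents: dict[str, list[str]], tip: str) -> list[str]:
    """Return the commits from tip back to the root, following first parents."""
    chain = []
    current = tip
    while True:
        chain.append(current)
        if not parents[current]:
            return chain
        current = parents[current][0]


def rebase(parents: dict[str, list[str]], source_tip: str, target_tip: str) -> str:
    """Replay the source commits after the common ancestor onto target_tip."""
    target_history = set(history(parents, target_tip))
    to_replay = []
    for commit in history(parents, source_tip):
        if commit in target_history:
            break  # reached the lowest common ancestor
        to_replay.append(commit)
    to_replay.reverse()  # oldest first, so the original order is kept
    new_tip = target_tip
    for commit in to_replay:
        copy = commit + "'"  # a new commit; the original stays in place
        parents[copy] = [new_tip]
        new_tip = copy
    return new_tip


# main points to c2; bugFix points to c4 and forked from c1.
parents = {"c0": [], "c1": ["c0"], "c2": ["c1"], "c3": ["c1"], "c4": ["c3"]}
new_bugfix_tip = rebase(parents, source_tip="c4", target_tip="c2")
print(history(parents, new_bugfix_tip))  # ["c4'", "c3'", 'c2', 'c1', 'c0']
```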
If adjusting the branch with git rebase gets too complex, you can use the interactive version of rebase: just add the extra parameter -i (interactive).
Revoke changes
The beginning state is shown in the following figure:
git reset xxx
After you execute the command git reset HEAD^, main moves back to the previous commit.
Note that the changes of c2 do not disappear; they stay in the working directory but are no longer in the staging area.
git revert xxx
If we are working on a remote branch that is shared with other people, we need to use this command.
After we execute the command git revert HEAD, a new commit that undoes the changes of the current commit is added on top of it.
The difference between the two commands is that the latter creates a new commit, which can be pushed to the remote so that other people can see the change.
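The contrast between the two commands can be sketched with a toy model in which a commit is a parent plus a snapshot. The names `reset_to_parent` and `revert_tip` are hypothetical, and the revert only handles the simple case of undoing the tip commit, where the new snapshot equals the parent's snapshot.

```python
import copy

commits = {
    "c1": {"parent": None, "snapshot": {"a.txt": "v1"}},
    "c2": {"parent": "c1", "snapshot": {"a.txt": "v2"}},
}
refs = {"main": "c2"}


def reset_to_parent(refs: dict, commits: dict, branch: str) -> None:
    """Like git reset HEAD^: only the branch pointer moves back."""
    refs[branch] = commits[refs[branch]]["parent"]


def revert_tip(refs: dict, commits: dict, branch: str) -> str:
    """Like git revert HEAD: add a new commit that undoes the tip commit."""
    tip = refs[branch]
    parent = commits[tip]["parent"]
    new_id = tip + "'"  # made-up name for the reverting commit
    commits[new_id] = {"parent": tip, "snapshot": dict(commits[parent]["snapshot"])}
    refs[branch] = new_id
    return new_id


# Try both operations on independent copies of the same starting state.
reset_refs = dict(refs)
reset_to_parent(reset_refs, commits, "main")

revert_refs, revert_commits = dict(refs), copy.deepcopy(commits)
revert_tip(revert_refs, revert_commits, "main")

print(reset_refs)   # history is rewritten: main points to c1 again
print(revert_refs)  # history grows: main points to a new commit after c2
```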
Move specific commits
git cherry-pick commit...
In software development, the term cherry-pick describes choosing specific commits from one branch and applying them to the current branch. It comes from the analogy of a tree whose branches carry many cherries (commits).
The beginning state is shown in the following figure.
Now we are on the main branch. After we execute the command git cherry-pick bugFix, the commit that bugFix points to is applied to the current main branch. The result is shown in the following figure.
If you select multiple commits with this command, it picks them one by one; its role is similar to git rebase -i.
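A rough way to think about cherry-pick is "take the change a commit made and apply it to the current commit". The sketch below models that with snapshots as dictionaries; the helpers `diff`, `apply_changes` and `cherry_pick` are hypothetical, and conflicts are not handled.

```python
def diff(old: dict, new: dict) -> dict:
    """Return path -> new content (None means the path was deleted)."""
    changes = {}
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):
            changes[path] = new.get(path)
    return changes


def apply_changes(snapshot: dict, changes: dict) -> dict:
    """Return a copy of the snapshot with the changes applied."""
    result = dict(snapshot)
    for path, content in changes.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def cherry_pick(commits: dict, current: str, picked: str) -> str:
    """Apply the change introduced by `picked` on top of `current`."""
    base = commits[commits[picked]["parent"]]["snapshot"]
    changes = diff(base, commits[picked]["snapshot"])
    new_id = picked + "'"  # made-up name for the copied commit
    commits[new_id] = {
        "parent": current,
        "snapshot": apply_changes(commits[current]["snapshot"], changes),
    }
    return new_id


commits = {
    "c1": {"parent": None, "snapshot": {"a.txt": "buggy"}},
    "c2": {"parent": "c1", "snapshot": {"a.txt": "buggy", "b.txt": "new"}},  # main
    "c3": {"parent": "c1", "snapshot": {"a.txt": "fixed"}},                  # bugFix
}
new_main = cherry_pick(commits, current="c2", picked="c3")
print(new_main, commits[new_main])
```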
Realistic scenarios
- select a specific historical commit to commit
The scenario is that you only want to bring the bugFix commit into main.
You can switch to the main branch, then use git cherry-pick bugFix to select the bugFix commit. The result is shown in the following figure.
- edit a previous commit
Note: I still don't know how the data changes when we use git rebase -i to edit many commits.
Your task is to edit the newImage commit.
Firstly, execute git rebase -i HEAD~2 and adjust the order of the commits; the result is shown in the following figure.
Secondly, make the edit on the newImage commit, then execute git commit --amend.
Thirdly, execute git rebase -i HEAD~2 again and adjust the order of the commits.
At last, switch to the main branch and execute git rebase caption to move main to caption.
You can also use git cherry-pick to solve this problem.
Tag
git tag <tag_name> <commit>: create a tag named tag_name at the specified commit
Remote repository
git fetch
- download the commit records that do not exist locally from the remote
- update the remote references
git fetch only downloads data from the remote; it does not edit local data such as local references.
Parameters for git fetch
- git fetch origin foo: download the data of the remote foo branch
- git fetch origin <source>:<destination>: download the remote source branch to the local destination branch
If the source is empty, the command creates a new local destination branch.
git pull
Because git fetch only downloads the remote data, we still need to update the local data manually. git pull does both jobs: it downloads the remote data and merges the remote reference into the local one.
The local and remote structure is shown in the following figure.
After we execute the command git fetch, the remote reference moves one step ahead.
Then we execute the command git merge o/main.
If we execute git pull, we get the same result. All in all, git pull = git fetch + git merge
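The equation git pull = git fetch + git merge can be sketched with two toy repositories. This is an illustration only: the dictionaries and helper names are made up, and the merge step handles just the fast-forward case, whereas a diverged history would need a real merge commit.

```python
def is_ancestor(commits: dict, ancestor: str, descendant: str) -> bool:
    """Return True if `ancestor` is reachable from `descendant` via parents."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(commits[commit])
    return False


def fetch(local: dict, remote: dict, branch: str) -> None:
    """Download missing commits and move only the remote-tracking reference."""
    local["commits"].update(remote["commits"])
    local["refs"][f"origin/{branch}"] = remote["refs"][branch]


def merge_fast_forward(local: dict, branch: str, other: str) -> None:
    """Move `branch` forward to `other`; refuse if the histories diverged."""
    target = local["refs"][other]
    if not is_ancestor(local["commits"], local["refs"][branch], target):
        raise ValueError("not a fast-forward; a merge commit would be needed")
    local["refs"][branch] = target


def pull(local: dict, remote: dict, branch: str) -> None:
    fetch(local, remote, branch)
    merge_fast_forward(local, branch, f"origin/{branch}")


remote = {"commits": {"c0": [], "c1": ["c0"], "c2": ["c1"]}, "refs": {"main": "c2"}}
local = {"commits": {"c0": [], "c1": ["c0"]}, "refs": {"main": "c1", "origin/main": "c1"}}

fetch(local, remote, "main")
refs_after_fetch = dict(local["refs"])  # local main has not moved yet
merge_fast_forward(local, "main", "origin/main")
print(refs_after_fetch, local["refs"])
```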
git push
git pull --rebase = git fetch + git rebase
Before pushing local commits, we may need to handle some special situations.
c3 is based on c1, but the remote branch has already moved on to c2, so a direct git push fails.
We need git fetch to get the data of the remote repository, then rebase our work onto it, and finally push to the remote.
We can also use git merge in place of git rebase.
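Why the direct push fails can be sketched as a fast-forward check: the remote only accepts a new tip if its current tip is an ancestor of it. The model below is an assumption for illustration (commit names, the `push` helper, and `c3'` as the rebased copy of c3 are all made up), and it ignores forced pushes.

```python
def is_ancestor(parents: dict, ancestor: str, descendant: str) -> bool:
    """Return True if `ancestor` is reachable from `descendant` via parents."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False


def push(remote_refs: dict, parents: dict, branch: str, local_tip: str) -> bool:
    """Update the remote branch only if the update is a fast-forward."""
    if not is_ancestor(parents, remote_refs[branch], local_tip):
        return False  # rejected: the remote has commits the local tip lacks
    remote_refs[branch] = local_tip
    return True


# The remote main is at c2, while the local commit c3 is still based on c1.
parents = {"c0": [], "c1": ["c0"], "c2": ["c1"], "c3": ["c1"]}
remote_refs = {"main": "c2"}

first_try = push(remote_refs, parents, "main", "c3")

# After fetching and rebasing, the work is replayed on top of c2 as c3'.
parents["c3'"] = ["c2"]
second_try = push(remote_refs, parents, "main", "c3'")

print(first_try, second_try, remote_refs)  # False True {'main': "c3'"}
```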
Parameters for git push
- git push <remote> <place>
After setting up the mapping, you can use origin as the remote directly. Place is the source of the commits that are pushed to the remote repository.
- git push origin <source>:<destination>
If the source is empty, the command deletes the remote destination branch.
Remote tracking
Remote tracking means that you can make a local branch track a remote branch.
Think about why pushing from the main branch goes to the main branch of the remote repository: the reason is that the local main branch is mapped to origin/main.
You can use git branch -u origin/main side to make the side branch track the remote main branch. After that, you can run git push directly on the side branch. The result of changing the mapping is shown in the following figure.
Some questions
.gitignore has no effect
Cause: .gitignore can only ignore files that were not tracked before. If a file is already under version control, editing .gitignore has no effect on it.
[root@kevin ~]# git rm -r --cached .
[root@kevin ~]# git add .
[root@kevin ~]# git commit -m 'update .gitignore'
[root@kevin ~]# git push -u origin master
Comments
The commands used in the experiment
—————————————————————————————————
control ordinary files
// create the hash string (object id) of an object from standard input or from a file
git hash-object -w --stdin
git hash-object -w test.txt
// get the content of an object by its hash string
git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
// get the type of an object
git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
—————————————————————————————————
control the staging area
// the index is an equivalent concept to the staging area
// select a file and put it into the staging area
git update-index --add test.txt
// what is the meaning of the parameter --cacheinfo?
// Suppose the working area has just one file named test.txt, but this file has two versions with different content, such as version 1 and version 2.
// The file on disk is version 2 now, but you only want to use version 1 to update the index, so you can use --cacheinfo to specify version 1.
git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 test.txt
// create an object id for the staging area, which contains the selected files and directories
git write-tree
—————————————————————————————————
// control the commit object
echo 'first commit' | git commit-tree d8329f
echo 'second commit' | git commit-tree 0155eb -p fdf4fc3
—————————————————————————————————
// control the reference
// create a new reference; the object must be a commit
git update-ref refs/heads/main <object id>
// show which branch HEAD currently points to
git symbolic-ref HEAD
// edit the HEAD
git symbolic-ref HEAD refs/heads/other