Git - Brief Introduction

Overview

This article will introduce you to Git and how it can be used to do source control on a project. The article starts with an explanation of the conceptual framework for Git. A demonstration of using Git on a project is provided using the Git application GitAhead.

Conceptual overview of Git

Git is a source control tool that allows you to track the history of your project. A project's history is created by archiving the project when it reaches a state that you may need to come back to. You decide when to archive the project. Some examples of when you may wish to archive a project are when you are ready to publish a paper, share some key results with a collaborator or mentor, you have completed a step (major or minor) of the project, or you want to embark on a change in direction. This archiving of a project is called a commit in Git parlance. When you commit, You decide which project files to archive. This maybe all or only some of the files you have updated. Typically the entire set of changes to a file are committed together. Committing part of a file requires using interactive staging.

Git provides an organizational structure for a project's commits. This organization is based on a projects work being done serially. That is a commit builds on the work achieved by a prior commit. A serially set of commits is referred to as a branch in git parlance. Parallel development work is done on separate branches. The figure below shows an example history of a project with two branches. Each commit in the graph (or tree) has an identifier. A commit is linked to the prior commit on the branch. These links are depicted in the figure by arrows pointing to the prior commit. For example, the work done in commit 4 builds on the work archived at commit 2. At commit 2 the project is split into two branches. The dashed arrow between commit 3 and commit x represents that there may be many commits on the branch. The two branches are mergered into a single branch at the id z commit. A display of commits is called a log. Git GUI (Graphical User Interface) applications will display a graph similar to this with the log. The branches of a tree are named. The default name for the first branch in a tree is master. You choose the names of the branches you create.

Git history graph
Git history graph

Prior saved versions of your project can be checked out if they are ever needed. The ability to demonstrate prior results is a key element of reproducible research. The ability to access earlier versions also gives you confidence to try bold new ideas without fear or having many of backup directories, if bugs are introduced it allows you to see exactly when, who introduced the error (and with commit comments, what you/they were thinking), facilitates collaboratively editing code--even same files at the same time.

Git stores your commits in what is called a repository. A local repository has a working dirrectory associated with it. The working directory is the folder (and possibly sub folders) that contain the files of the project. This is where you do your work. The set of files in the working directory are associated with a commit. This associated commit is identifies by the head pointer in Git. The head pointer is typical the most recent (last) commit of the branch being worked on. When files in the working directory are committed, git archives the changes that have been made since the commit of the head pointer. The head pointer is updated to point to the new commit. This is shown the figure below. Project updates made after commit id 4 are recorded in commit id 5 and the head pointer is changed from commit 4 to commit 5. This serial approach results in only one branch being available in the working directory at a time.

local Repository
local Repository

A staging index is shown in the repository in the figure above. Files are moved from the working directory to the staging index prior to being committed. committing moves only the files of the staging index into the commit. You can stage a file and then continue to edit the file. When a commit is done only the file version in the staging index is committed. Thus your project files can possibly be at three different stages, working directory, staging index, and as represented by the commits through the head pointer. When Git is run from the command line, there are separated commands to stage files and commit them. Most Git GUIs can stage and commit together and we will use that approach in this article.

Work can be shared between collaborators through Git repositories. This is typically done by creating a repository where a researcher can publish their work and get the work of others. This repository for sharing work is called a remote repository. An example of a project with two collaborators is shown in the figure below. In this figure, collaborator 1 has committed some work to their local repository with commit id y. This work is not available to be shared with the other repositories of the project after the commit. The work is made available to the other collaborators by pushing the branch in repository 1 to the remote repository. A push moves all new commits of a branch to the remote repository. The work through commit y can be pulled by the other repositories of the project when they need what has been shared.

multiple Repositories
multiple Repositories

When a push is done, Git checks if other collaborators have pushed any commits since your last push or pull. If no other collaborators have made changes, Git can merge your work with the remote branch by doing a fast forward merge, applying your commits to the remote branch and moving the head pointer forward. If there are new commits from your collaborators, you will need to merge your work with these commits. This merge is done in your local repository. You can start this merge by pulling from the remote. If there are no conflicts, the same line of a file changed by the new commits and your work, Git will put the changes made by the remote commits and your commits together. You will have the opportunity to review these combined changes. If you accept the changes, The updated files are placed in your working directory. (Note, you should not have any uncommitted work in your working directory when doing a merge.) If there are conflicts, you will need to resolve them by updating the code. Git has tools to assist with the resolution of any line level conflicts. When the files are merged. A new commit is done to apply your updates to the repository.

This conceptual overview was intended to provide a framework to understand what Git does. You will not need to know any of the details of the inter-workings of a repository.

Example using GitAhead

Setting up GitAhead

GitAhead is free to use for non-commercial work. We believe that most researchers here are eligible for a free use version. You are still responsible for reviewing the license or contact GitAhead to confirm your non-commercial status. The application can be down load at the GitAhead website.

When you start GitAhead for the first time, it will ask you to choose a theme. This is a personal preference and there is no recommended choice. We used the light background theme for the images in this article. After selecting the theme, GitAhead will display the "Repository Chooser" window.

local Repository
local Repository

Creating the Git repository

We start by creating a new repository that will be used in the example.

  • Click the "+" button at the bottom of the Repository pane in the GitAhead window.

  • Select "Initialize New Repository" from the drop down menu that appears.

  • In the "Initialize Repository" dialog window, enter a name for the project folder where the repository will be located and the directory for this folder. GitAhead will create the project folder for you if it is not already created.

    We will used GitExmp for the name and we will put this folder on the U Drive.

  • Click the "Initialize" button in the bottom right corner of the "Initialize Repository" dialog window.

  • The "Initialize Repository" dialog window closes and GitAhead will display a log for the new repository. Since there are no files or commits in this repository, the log will be empty.

GitAhead will display a screen similar to below.

Empty Repository
Empty Repository

You will need to enter your name and email address so that the information can be recorded with each commit you do. This is done by

  • Clicking on the cog button in the top right set of buttons in GitAhead. You can also select Options in the Tools drop down menu.

  • A dialog box is opened. Select the General button from the row of buttons at the top of the dialog window.

  • Enter your name and email in the corresponding text boxes.

  • Click the OK button in the lower right of the dialog window.

GitAhead now knows how you are.

Example overview

This example tracks changes made to a project that uses Stata. The code for this project is from the SSCC's Knowledge Base article "Stata for Researchers: Working With Data". This code will a few lines at a time. This is done to simulate what would occur on a larger project that is developed over a longer period of time. This committing a few lines at a time is not meant to reflect the typical amount of code that is included in a commit. Smaller amounts are used here for ease of use in the example.

If you do not have Stata on your computer, you can follow this Git example by editing the code in an editor such as notepad.

The choice of programming language and the code used is not important to this example. The example could have been done just as easily in many other programming languages, these include document generating languages such as Markdown and Latex.

Inital commit

We will create a new Stata .do file in the GitExmp folder. Open Stata and create a new do file. Save this do file as "workWithData". The workWithData do file will now be seen in the GitAhead window.

This example uses the auto data set. The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to 2015 dollars. To do so you need to multiply the prices by a conversion factor which is the CPI in 2015 divided by the CPI in 1978, or about 3.6.

Copy and paste the following code into your do file.

clear all
capture log close

set more off
log using data.log, replace

webuse auto

gen price2015 = price * 3.6

l make price*

save autoVersion2, replace
log close

Run this code in Stata. You will now see in GitAhead a .log and .dta file. These files were created by running the workWithData do file. We do not want to place these files in the repository, since these files can be generated again from the do file. This practice of excluding from source control files that can be recreated by project code is common practice.

The following step will ignore these files.

  • In GitAhead's File drop down menu, select New file.

  • A new window will open. The file will be untitled at this point.

  • add the following lines to the new file.

    *.dta

    *.log

    The "*" indicates that any files with any name that end with either .dta or .log will be ignored.

  • Save the new file as ".gitignore"

  • Close the ".gitignore" file

Let's commit this work.

  • Save your do file in the folder for the example project. We named this file "workWithData".

  • Click the check box to the left of the workWithData and gitignore file names in the GitAhead window,

  • In the commit message box, we will enter "initial commit". It is common practice for the first commit message to be this.

  • Click the "Commit" button to the right of the commit Message box.

The commit message is a reminder to you, and possibly others, of what changed in the project with this commit. It should be short (think tweet) since it will most often be viewed as a single line in a log of commits. The convention we will use in this article is to start the commit message with the action, something like "added", "changed", or "removed". The action will be followed with what the action is applied to. Examples would be "Added x feature" and "Changed y function's parameters".

Additional commits

Let's assume a little more precision is needed for price2015. We will use 3.57 as the conversion factor. Change the 3.6 to 3.57 in your do file.

gen price2015 = price * 3.57

This is possible an example of when you might save the file but not commit. The use of Git should not change when and how you save your files as you work.

The change to the price into 2015 dollars is assumed to now have the accuracy needed. We can now use that as the price in our project. Change the gen of price2015 to a replace of price.

replace price = price * 3.57

Now that we have the price variable worked out, commit the changes. This is done as was done above. You will use a different commit message. I will use, "corrected price calculation".

Note, you may need to select the uncommitted changes node in the log to see the list of files to commit.

Branching

The next step in our project is to collapse the five-point scale of the rep78 variable into a three-point scale. This work will be done on a new branch. This allows us to work on the repair variable without impacting other work that might being done on the main branch.

The following steps create a new branch.

  • From GitAhead's Branch drop down menu, Select New Branch.

  • A window opens that is used to create the new branch.

  • In this new window, enter "repair" in the Name text box. The starting point will be left master.

  • Click the "Create Branch" button at the bottom of this window.

In GitAhead, you will see that there is a head pointer for the repair branch. We can now continue are work on the code for the repair variable.

Add the following code to your do file.

gen rep3 = 1 if rep78 < 3
replace rep3 = 2 if rep78 == 3
replace rep3 = 3 if rep78 > 3 & rep78 < .
rename rep3 repairRecord

Run the code in Stata. Commit this change using "create 3 level rep3" for the commit message.

Save and close the do file.

We are now going to go back to the master branch and work on a gasGuzzler variable.

Check out master branch.

  • From GitAhead's Branch drop down menu, Select Checkout.

  • A window opens that is used to select a branch to switch to.

  • Use the drop down button on the References text box to select master.

  • Click the "Checkout" button at the bottom of the window.

Open the workWithData do file. Add the following code for the gasGuzzler variable to the do file.

gen gasGuzzler = (mpg < 20)

l make mpg if gasGuzzler

drop gasGuzzler
gen gasGuzzler = (mpg < 20) if mpg < .

drop if gasGuzzler

Commit this change. I used "added gasGuzzler" for the commit message.

Save and close your do file. Switch back to the repair branch. Open your do file.

Enter the following code in your do file after the prior code for the repair variable.

recode rep78 (1 2 = 1) (3 = 2) (4/5 = 3), gen(rep3b)

assert repairRecord == rep3b

label variable foreign "Car Origin"

label define rep 1 "Bad" 2 "Average" 3 "Good"

label values repairRecord rep

list repairRecord

Run the do file in Stata. Commit this change. I used "added labels to rep" for the commit message.

We are now done with our work on the repair variable and we can merge it with the master branch.

Save and close your do file. Switch to Master branch.

Merge the repair branch with the master branch.

  • From GitAhead's Branch drop down menu, Select Merge.

  • A window opens that is used to merge a branch with the current branch (which is master.)

  • Use the drop down button on the References text box to select repair.

  • Click the "Merge" button at the bottom of the window.

  • Git will identify conflicts.

  • Use the Edit Hunk to resolve the conflicts. Since we want the code from both branches, delete the text used to identify which branch the code is from.

  • Commit the edited do file with "Merge branch repair".

The GitAhead window will be displaying a log which should look similar to the screen shot below.

branch log
branch log

Git verse GitHub, Bitbucket, etc.

Git is both an application program and a syntax. As noted above, there are other applications besides the Git application that support the syntax defined by Git. For example, we used the GitAhead application in the example above. GitHub and Bitbucket are also applications that uses the Git syntax as well as being a site that hosts repositories. GitHub has a GUI application as well as support for command line Git syntax.

GitHub repositories typically are public to the world. (Repositories can also be private to a group of contributors.) The work in these public repositories is available to all. Only the owner of the repository can make changes to the repository. To allow the owner control of what changes get made to their repository, GitHub introduced additional commands. (These commands are now part of Git.) A collaborator forks the shared repository. The fork creates a copy of the shared repository for you to work in. This is a copy that is private and only you can modify. When you are ready to request your work be added to the main shared repository, you make a pull request to the owner. So the main differences between gitHub an Git is where the repositories are stored and how sharing is done between repositories.

References to other material

https://marklodato.github.io/visual-git-guide/index-en.html

Last Revised: 3/21/2018