By Mike Schilli
The Git version control system easily outperforms old-timers such as CVS, Subversion, or Perforce. Having the entire project's history available offline and Git's branch strategy are such powerful features that many programmers wonder how they ever managed to develop software before the invention of decentralized version control systems.
That said, Git typically focuses on a single project, and its support for subprojects is rudimentary at best. Because of this, active developers tend to create or clone dozens of Git repositories in the course of time, and their authoritative copies often reside on different servers. Keeping your local copies up to date then becomes a pain as the number of repositories grows, and all hell breaks loose if you happen to change machines and have to rediscover and reclone everything all at once. After buying a new laptop, or moving to a new development desktop, it would be really useful to have a copy of all your projects waiting for you.
Additionally, in many cases, computers will be assigned to groups that require different repositories. For example, you might not want to keep a Git repository with large images on your laptop for space reasons, and you would probably want to avoid storing private content on your computer at work. A configuration file, stored somewhere on the Internet, could store the locations of the repositories you use. These values tend to change quickly as you add new projects and delete or move others. This scenario sounds like a task for a version control system: How about using Git to maintain your metadata (meta) repository?
The configuration data will be stored in a plain text file in YAML format, because it's easily read by both humans and machines. This specific new dialect shall be called GMF (Git Meta Format), and my configuration files will have a .gmf
file extension.
Figure 1 gives an example. The first entry points to a privately hosted repository that resides on a fictitious server, private.server.com, which supports SSH access.
gitmeta.gmf
file in the Git meta repository somewhere on the Internet.The second entry points to the official Git repository for the Perl 5 kernel, which stores the entire commit history ever since Larry Wall released the first version of Perl back in 1987. Both repository locators in the configuration can be used directly by the git clone
command, which creates a local directory with a copy of the remote repository.
Of course, the GMF file could simply list all the active repositories, but busy developers would find the effort of continually adding new projects and removing defunct ones strenuous in the long term.
For example, if you launch a dozen projects on Github.com, or on a server with SSH access, you could save yourself the trouble of adding these locations if the meta repository understood how to interpret these collections without human interaction. The meta repository would thus need to understand instructions such as "Grab all the repositories in this directory on this server over SSH" or "All repositories by this user stored on GitHub."
The YAML blocks assigned to the two lower dashes in Figure 1 each represent two hashes (see Figure 2 for the Perl format to which the YAML configuration is converted) that designate collections of repositories with specific properties in the meta format.
gitmeta.gmf
YAML file gets transformed into a Perl data structure.The first hash has a value of Github
in its type
field, and the user
entry, with its value of mschilli
, indicates that all repositories belonging to user mschilli
on Github.com should be copied to the local machine or updated.
Instead of dozens of separate entries, there are just two lines, and if the user were to create some new repositories on Github.com, they would automatically become part of the meta repository without requiring modification of the configuration file.
If the user deleted a project on GitHub, the updater would not explicitly delete the local project. But if the user removed the local copy, cloning would no longer occur.
The YAML entry next to the last dash in Figure 1 (or the final data structure in Figure 2) references a collection of Git repositories that reside in a directory on the specified server with SSH access. Again, the updater automatically picks up new entries without needing user interaction: To do this, the processing script lists the subdirectories and then clones the individual repositories it finds in this way.
The gitmeta-update
script in Listing 1 handles the original cloning and later updating procedures of local repositories based on the data stored in the meta repository. The meta repository will typically reside on a server with SSH access to restrict the use of this potentially confidential meta information to the authorized user.
Listing 1: gitmeta-update |
01 #!/usr/local/bin/perl -w 02 use strict; 03 use GitMeta::GMF; 04 use Sysadm::Install qw(:all); 05 use File::Basename; 06 use Getopt::Std; 07 use Log::Log4perl qw(:easy); 08 09 getopts("vn", \my %opts); 10 11 if ($opts{v}) { 12 Log::Log4perl->easy_init( 13 $DEBUG); 14 } 15 16 my ($gmf_repo, $gmf_path, 17 $local_dir) 18 = @ARGV; 19 20 die "usage: $0 gmf-repo ", 21 "gmf-path local-dir" 22 unless defined $local_dir; 23 24 main(); 25 26 ############################# 27 sub main { 28 ############################# 29 my $gm = GitMeta::GMF->new( 30 repo => $gmf_repo, 31 gmf_path => $gmf_path 32 ); 33 34 my @urls = $gm->expand(); 35 36 if ($opts{n}) { 37 for my $url (@urls) { 38 print "$url\n"; 39 } 40 return 1; 41 } 42 43 cd $local_dir; 44 45 for my $url (@urls) { 46 my $repo_dir = 47 basename $url; 48 $repo_dir =~ s/\.git$//g; 49 if (-d $repo_dir) { 50 cd $repo_dir; 51 tap "git", "fetch", 52 "origin"; 53 cdback; 54 } else { 55 tap "git", "clone", $url; 56 } 57 } 58 return 1; 59 } |
The script expects three command-line parameters: the location of the meta repository, the path to the GMF file within it, and the local directory in which the mirrored repositories will be stored. The following command line
gitmeta-update -v user@secret.server.com:git/gitmeta gitmeta.gmf /path/to/local/repo/directory
contacts the server at secret.server.com, logs in via SSH as user
, changes to the remote git/gitmeta
directory below the home directory belonging to user
and mirrors the Git repository it finds there to a temporary directory on the local disk. Then it loads the current version of the gitmeta.gmf
file, runs it through the YAML parser, and processes the array entries one after another.
Incidentally, the -v
option in the previous command line sends verbose output from the commands being processed to stderr, with some help from the Log4perl API.
The Perl code wraps the retrieval and processing of the YAML file in the GitMeta::GMF class, but more of that later. Lines 29 through 32 in Listing 1 call the new()
constructor and pass in the repository locator, $gmf_repo
, along with the path $gmf_path
to the GMF file. The call to the expand()
method in line 34 resolves direct and indirect references in the YAML file and returns a list of repository locators that point to the repositories that need to be mirrored.
If the -n
option is set, the script performs a dry run, and line 36 branches off to a for
loop that only outputs the identified locators for test purposes and then terminates without actually mirroring anything. In production use, line 43 would use the Sysadm::Install module's cd
command to change to the local mirror directory and start cranking.
The for
loop in lines 45-56 iterates over all the found repository locators, removes any .git
extensions from the names, and checks whether the corresponding directory already exists (i.e., whether the repository has already been mirrored). If so, it uses the command git fetch
to retrieve the changes that happened in the remote location. It does not merge them with the local git branch, like a call to git pull
would, because this could cause conflicts that the user would painstakingly have to resolve. The gitmeta-update
command aims to create a fast mirror while the Internet connection is up. Once you have retrieved the changes, you can always use Git to merge them offline.
If no local directory for the repository exists yet, git clone
in line 55 of Listing 1 creates one and then fetches the data from the remote repository to create a full clone.
The whole magic of the script is contained in the GitMeta::GMF class and its expand()
method, which is called in line 34 and doesn't just fetch a GMF file but recursively interprets its entries.
Listing 3 implements the GitMeta::GMF class, which inherits from the GitMeta.pm
base class in Listing 2. Its expand()
method expects two parameters: the repository locator, repo
, and the relative path to the remote GMF file, gmf_path
. The almost virtual base class in Listing 2 provides the standard constructor, new()
, which is inherited by derived classes. This saves typing and avoids code duplication.
Listing 2: GitMeta.pm |
01 ############################# 02 package GitMeta; 03 ############################# 04 05 ############################# 06 sub new { 07 ############################# 08 my ($class, %options) = @_; 09 10 my $self = {%options}; 11 bless $self, $class; 12 } 13 14 ############################# 15 sub expand { 16 ############################# 17 die "You need to ", 18 "implement 'expand'"; 19 } 20 21 ############################# 22 sub param_check { 23 ############################# 24 my ($self, @params) = @_; 25 26 for my $param (@param) { 27 if ( 28 !exists $self->{$param}) 29 { 30 die "Parameter $param ", 31 " missing"; 32 } 33 } 34 } 35 36 1; |
Listing 3: GMF.pm |
01 ############################# 02 package GitMeta::GMF; 03 ############################# 04 use strict; 05 use warnings; 06 use base qw(GitMeta); 07 use File::Temp qw(tempdir); 08 use Log::Log4perl qw(:easy); 09 use YAML qw(Load); 10 use Sysadm::Install qw(:all); 11 use File::Basename; 12 13 ############################# 14 sub expand { 15 ############################# 16 my ($self) = @_; 17 18 $self->param_check("repo", 19 "gmf_path"); 20 21 my $yml = 22 $self->_fetch( 23 $self->{repo}, 24 $self->{gmf_path}); 25 26 my @locs = (); 27 28 for my $entry (@$yml) { 29 my $type = ref($entry); 30 31 if ($type eq "") { 32 33 # plain git url 34 push @locs, $entry; 35 } else { 36 my $class = 37 "GitMeta::" 38 . ucfirst( 39 $entry->{type}); 40 eval "require $class;" 41 or LOGDIE 42 "Class $class missing"; 43 my $expander = 44 $class->new(%$entry); 45 push @locs, 46 $expander->expand(); 47 } 48 } 49 50 return @locs; 51 } 52 53 ############################# 54 sub _fetch { 55 ############################# 56 my ($self, $git_repo, 57 $gmf_path) 58 = @_; 59 60 my ($tempdir) = 61 tempdir(CLEANUP => 1); 62 63 cd $tempdir; 64 tap "git", "clone", 65 $git_repo; 66 my $data = 67 slurp(basename($git_repo) 68 . "/$gmf_path"); 69 cdback; 70 my $yml = Load($data); 71 return $yml; 72 } 73 74 1; |
Additionally, the GitMeta.pm
base class defines the param_check()
method called by the subclasses to check whether their constructors have been handed the parameters they expect. The method terminates the program if any of them are missing. All subclasses refer to their base class GitMeta
by a use base qw(GitMeta)
statement, as in line 6 of Listing 3, for example.
The instance of the expand()
method defined in the base class (line 15, Listing 2) simply contains an instruction that terminates the program and is never executed if the subclass defines its own expand()
method. The die
instruction serves as a reminder to subclass programmers to implement this virtual base class method in the subclass.
The _fetch()
method defined in lines 54-72 (Listing 3) clones the specified Gitmeta repository into a temporary directory and slurps the YAML data provided by the GMF file into a Perl structure, which it returns as a result. The underscore in the method name indicates that this is an internal, private method that does not belong to the API exported by the class.
The exported expand()
method first calls _fetch()
and then iterates in the for
loop starting in line 28 over all the YAML array elements found in the GMF file. If these elements are normal repository locators without a type
entry, line 34 appends them to the @locs
array without any modification. However, if the current YAML element contains a structure with an entry in its type
field, GMF.pm
delegates processing to a subclass of the corresponding type.
Supported values for type
are github
and sshdir
, which pass on processing of the entry to the subclasses GitMeta::Github and GitMeta::SshDir, respectively. To allow this to happen, the eval
command in line 40 finds and loads the required class to the active program; line 44 calls the class's constructor new()
with the remaining parameters found in the YAML entry.
Following polymorphic tradition, all subclasses have an expand()
method that returns lists of repository locators. Values returned, regardless of which instance of expand()
found them, are sent to the end of the @locs
array and contribute to the total list of repositories in the main
script.
If the script comes across a github
entry in the type
field when interpreting a GMF file, it activates GitMeta::Github (Listing 4). This class also inherits from the GitMeta
base class and simply overwrites its expand()
method by fetching the names of all repositories residing on GitHub for the defined user. To do so, it uses GitHub's simple XML API, which is freely available under the /api/v1/xml/username
path on Github.com and doesn't even require you to register or submit a token. Calling decoded_content()
on the response coming back from the web server ensures that project descriptions encoded in UTF8 will still return valid XML.
The XML returned by the web query is then grabbed by the XMLin()
function of the CPAN XML::Simple module, which converts it into a deeply nested hash data structure. Line 35 dives into the hash using the {repositories}->{repository}
key and receives a hash whose keys represent the repository names.
Lines 43 and 44 create a typical GitHub-style repository locator from the name. Local users will have read and write access, assuming they identify themselves correctly with a valid SSH key on Github.com.
Listing 4: Github.pm |
01 ############################# 02 package GitMeta::Github; 03 ############################# 04 use strict; 05 use warnings; 06 use base qw(GitMeta); 07 use LWP::UserAgent; 08 use XML::Simple; 09 10 ############################# 11 sub expand { 12 ############################# 13 my ($self) = @_; 14 15 $self->param_check("user"); 16 17 my $user = $self->{user}; 18 my @repos = (); 19 20 my $ua = 21 LWP::UserAgent->new(); 22 my $resp = $ua->get( 23 "http://github.com" . 24 "/api/v1/xml/$user" 25 ); 26 27 if ($resp->is_error) { 28 die "API fetch failed: ", 29 $resp->message(); 30 } 31 32 my $xml = XMLin( 33 $resp->decoded_content()); 34 35 my $by_repo = 36 $xml->{repositories} 37 ->{repository}; 38 39 for 40 my $repo (keys %$by_repo) 41 { 42 push @repos, 43 "git\@github.com" . 44 ":$user/$repo.git"; 45 } 46 47 return @repos; 48 } 49 50 1; |
Listing 5 contains another specialized class. The Gitmeta::SshDir package defined there, which also inherits from GitMeta
, is responsible for repositories that reside as subdirectories in a directory on a server with SSH-protected access. This is perfect for private repositories, because neither the content nor the names are published anywhere.
Listing 5: SshDir.pm |
01 ############################# 02 package GitMeta::SshDir; 03 ############################# 04 use strict; 05 use warnings; 06 use base qw(GitMeta); 07 use Sysadm::Install qw(:all); 08 use Log::Log4perl qw(:easy); 09 10 ############################# 11 sub expand { 12 ############################# 13 my ($self) = @_; 14 15 $self->param_check("host", 16 "dir"); 17 18 INFO "Retrieving repos ", 19 "from $self->{host}"; 20 21 my ($stdout) = tap "ssh", 22 $self->{host}, 23 "ls", $self->{dir}; 24 25 my @repos = (); 26 27 while ($stdout =~ /(.*)\n/g) 28 { 29 push @repos, 30 "$self->{host}:" . 31 "$self->{dir}/$1"; 32 } 33 34 return @repos; 35 } 36 37 1; |
To parse a list of the directories available on the server and to pass it to the updater later, line 21 (Listing 5) uses the SSH protocol to run an ls
command on the server, thus obtaining the subdirectories of the given path. The output is separated by newlines, because the Unix shell sends it this way.
The while
loop in lines 27-32 creates a repository locator for the Git-via-SSH protocol from each line and appends it to the resulting @repos
array, which the method then passes back to the caller as a long list.
Meta repositories can also reference other meta repositories, as in Figure 3. The type
field in the entry shown here has a value of GMF
; the processing code thus hands over responsibility to the GitMeta::GMF class, which in turn fetches the remote repository and obtains and processes its GMF file.
The script resolves entries recursively and creates a long list of repositories that need updating. With this information, programmers then can define groups of repositories and assign a tailored collection of repositories to each system by cleverly combining different meta repositories, with no need to reference repositories multiple times in multiple configurations. The configuration shown in Figure 3 precisely matches
gitmeta-update user@devhost.com:git/gitmeta privdev.gmf ...
apart from the fact that the command line is followed by the name of the directory that is supposed to receive the local clones. The GMF files in the meta repository can also be stored in subdirectories to improve the structure. I could imagine creating a meta repository with two GMF files, priv/free.gmf
and priv/commerce.gmf
, to separate free software from commercial software. To reference either one instead of privdev.gmf, you would simply adapt the gmf_path
in the GMF configuration or the second gitmeta-update
parameter on the command line.
To avoid repeatedly typing your password for SSH access, create private/public key pairs and install the public parts on all SSH servers involved. Otherwise, when the servers request a password, you won't see the prompt because the tap()
commands eat them, which leaves you wondering what's going on. GitHub doesn't support passwords for Git access, anyway, and requires users to deposit their public keys on the website.
To run the gitmeta-update
script on a newly installed machine, install the Perl and the CPAN modules it and its supporting classes use. The four classes I have mentioned must be stored in the following directory tree on your filesystem below a path the script can find:
GitMeta.pm GitMeta/GMF.pm GitMeta/Github.pm GitMeta/Sshdir.pm
To make new GMF files, create a new Git repository on a server with SSH access, edit the GMF file, and run a commit
(Figure 4). After creating the meta repository on the server, access it via a locator, like user@some.host.com:repodir/gitmeta
. The cloning starts when you run gitmeta-update
with this locator and a local target directory. If you don't have a new laptop to experiment on, this gives you a perfect excuse to buy one.
INFO |
[1] Listings for this article: http://www.linuxpromagazine.com/Resources/Article-Code
|
THE AUTHOR |
Mike Schilli works as a software engineer with Yahoo! in Sunnyvale, California. He can be contacted at mschilli@perlmeister.com. Mike's homepage can be found at http://perlmeister.com. |