I've been hearing about and reading about Gearman for a couple years now, but, due
to the nature of my work, it's
never really been something I needed to investigate; when you're writing
backend code, scalability is something you leave to the end-users, right?
Wrong! But perhaps an explanation is in order.
Background
We're looking at migrating to Git for our
primary development on the framework. There are a lot of use cases we need
to accommodate:
- We want to support atomic changesets (i.e., changesets that include all
changes related to a single issue: usually, unit tests, code, and often
documentation).
- At the same time, developers want the ability to pull in just the
library code as a git submodule,
or a single language of the manual, etc.
- Users want a read-only Subversion
respository so that they can continue using svn:externals.
Just because we're migrating doesn't mean our users should need to.
- Of course, lots of folks like to keep on top of commits via RSS feeds
- And then there's masochists like myself who like having commit
emails. (Is it a wonder I never hit inbox zero?)
The first two items are hard to accomplish at the same time, as it turns
out. If you make every distinct sub-tree you want discretely cloneable, and
then build a repository consisting of a bunch of git submodules, you lose
atomicity. (You end up having a commit for each submodule you touch, plus
one for the master repository. Eww.)
I found a way to do it, however, using subtree merges. However,
since this post is about writing Gearman workers, I'll leave that for
another day. The important thing, however, is that I discovered something
else that was interesting.
Git allows you to define "hooks", scripts that are run at various points of
the git commit lifecycle. One hook that can run on the server is called
"post-receive". What I discovered is that even though
post-receive runs after a commit is accepted to the repository,
if you perform operations on a git repository while the hook is still
running, you can get some strange behavior. In my example, I was having the
script trigger a "git pull" in a working tree. While the working tree
received the changesets, it couldn't apply them cleanly, since the server
actually hadn't finalized its state. The only way I could get a clean pull
was if I pulled after the hook was complete. Which foiled my
attempts at automation.
And now we get to Gearman. I realized I could have my
post-receive script queue a background task to Gearman. Since
this is almost an instantaneous operation, it meant that my hook was
completed before Gearman started a worker; if it wasn't, I could always do a
sleep() within my worker to ensure it was.
Writing Gearman Tasks
So, now I was able to do my task, I started thinking about what other things
I could do, and suddenly Gearman looked like an excellent solution for the
architecture. Basically, it prevents the end-user who is committing changes
from having any lag based on the hook scripts, while simultaneously allowing
me do perform the task automation we need.
So I wrote two tasks as a proof-of-concept, using a mixture of straight PHP
and Zend Framework; these are for the subtree merge I mentioned earlier (the
actual work is done in a bash script, actually), and also one for RSS feeds.
The Gearman client: a post-receive hook
First, let's look at my hook script, which uses a Gearman client. I'm using
ext/gearman, from PECL. My post-receive hook
script looks like this:
#!/usr/bin/env php<?php$workload =
implode(':',
$argv);
$client =
new GearmanClient
();
$client->
addServer();
$client->
doBackground('post-receive',
$workload);
$client->
doBackground('rss',
$workload);
The above should be pretty straight-forward: I create a
GearmanClient, tell it to use the default server (localhost),
and trigger two Gearman functions, "post-receive" and "rss," using the
arguments my script received as a payload. I use the
doBackground() method so that the tasks can execute
asynchronously; the hook script doesn't need to be blocked on the execution
of any given task, and can continue merrily on its way.
The tasks
I wrote two classes, one for each Gearman job I wanted to create. I could
have done these as lambdas,
plain old functions, etc. I chose objects so that I could test them, as well
as consume them from other scripts if I want. These classes implement a
Command interface, which simply defines an
execute() method that accepts a GearmanJob
instance.
The first is the job that triggers my subtree merge:
<?phpnamespace ZF\Git;
class MergeSubtree implements Command
{ protected
$_logger;
protected
$_wd =
'/var/spool/gearman';
public
function setWorkingDir
($path) { if (!
is_dir($path)) { throw
new \Exception
('Invalid path provided for working directory');
} $this->_wd =
$path;
} public
function getLogger
() { if (null ===
$this->_logger
) { $this->
setLogger(new \Zend_Log
(new \Zend_Log_Writer_Stream
($this->_wd .
'/merge_subtree_error.log')));
} return $this->_logger;
} public
function setLogger
(\Zend_Log
$logger) { $this->_logger =
$logger;
} public
function executeMerge
() { chdir($_ENV['HOME'] .
'/working/zf-master');
$return =
shell_exec($this->_wd .
'/update-master.sh');
return $return;
} public
function execute
(\GearmanJob
$job) { $this->
getLogger()->
info('Received merge request');
$return =
$this->
executeMerge();
if (strstr($return,
'Failed')) { $this->
getLogger()->
err('Failed pull: ' .
$return);
$job->
sendFail();
return;
} $this->
getLogger()->
info('Merge complete');
}}
(Note the backslashes in front of the ZF class names; since I'm using
namespaces, I need to fully-qualify my classes.)
The above class is probably overkill. But it has some nice features,
particularly for a Gearman environment: it logs anytime it sees failures in
my merge script. This way I can go look through my logs anytime I start
seeing discrepancies between my repositories.
My next class is a bit more complex, and yet for many, probably more
useful. It takes the most recent 15 git log entries, and
creates an RSS feed:
<?phpnamespace ZF\Git;
class Log2RSS implements Command
{ protected
$_repo;
protected
$_feedDir =
'/var/spool/gearman/feeds';
protected
$_feedName =
'rss';
protected
$_baseLink =
'http://some.viewgit.repo/?a=commit&p=zf&h=';
public
function setRepo
($repo) { if (!
is_dir($repo) || !
is_dir($repo .
'/.git')) { throw
new \Exception
('Invalid repository specified; not a Git repository');
} $this->_repo =
$repo;
return $this;
} public
function getRepo
() { if (null ===
$this->_repo
) { throw
new \Exception
('No repository directory specified');
} return $this->_repo;
} public
function setBaseLink
($url) { $this->_baseLink =
$url;
return $this;
} public
function getBaseLink
() { return $this->_baseLink;
} public
function setFeedDir
($path) { if (!
is_dir($path) || !
is_writable($path)) { throw
new \Exception
('Invalid feed directory specified, or not writeable');
} $this->_feedDir =
$path;
return $this;
} public
function getFeedDir
() { return $this->_feedDir;
} public
function setFeedName
($feedName) { $this->_feedName =
(string
) $feedName;
return $this;
} public
function getFeedName
() { return $this->_feedName;
} public
function generateFeed
() { $feed =
new \Zend_Feed_Writer_Feed;
$feed->
setTitle('git commits');
$feed->
setLink('http://some.viewgit.repo/');
$feed->
setFeedLink('http://some.viewgit.repo/feeds/' .
$this->
getFeedName() .
'.xml',
'rss');
$feed->
addAuthor(array( 'name' =>
'Name of this feed',
'email' =>
'git@somedomain',
'uri' =>
'http://some.viewgit.repo/',
));
$feed->
setDateModified(time());
$feed->
setDescription('git commits');
$logs =
$this->_parseLogs
();
foreach ($logs as $log) { $date =
strtotime($log['date']);
$entry =
$feed->
createEntry();
$entry->
setTitle($log['commit']);
$entry->
setLink($this->
getBaseLink() .
$log['commit']);
$entry->
setDateModified($date);
$entry->
setDateCreated($date);
$entry->
setDescription($log['subject']);
$entry->
setContent($log['subject'] .
"\n\n" .
$log['notes']);
$feed->
addEntry($entry);
} $output =
$feed->
export('rss');
file_put_contents
($this->
getFeedDir() .
'/' .
$this->
getFeedName() .
'.xml',
$output);
} public
function execute
(\GearmanJob
$job) { $this->
generateFeed();
} protected
function _parseLogs
() { $repoPath =
$this->
getRepo();
$command =
'git --git-dir=' .
$repoPath .
'/.git --work-tree=' .
$repoPath .
' log --max-count=15 --format=\'Commit: %H%nAuthor: %an%nDate: %cD%nSubject: %s%nNotes: %N%n\' -p';
$log =
shell_exec($command);
$lines =
preg_split('/[\r\n?|\n]/',
$log);
$logs =
array();
$index =
0;
$current =
false;
foreach ($lines as $line) { if (preg_match('/^(Commit|Author|Date|Subject|Notes): (.*)$/',
$line,
$matches)) { $current =
strtolower($matches[1]);
$value =
$matches[2];
if ('commit' ==
$current) { $index++;
$logs[$index] =
array();
} $logs[$index][$current] =
$value;
} elseif (false !==
$current) { $logs[$index][$current] .=
"\n" .
$line;
} } return $logs;
}}
The above object could use a few more customization vectors -- ways to
inject the RSS feed name, URL, etc., and some threshold for the description
limit so it can truncate past a certain number of lines. However, it gets
the job done -- it creates an RSS feed with entries based on each commit.
The Gearman Worker
Now, for the worker. Since I'm using some Zend Framework classes, and
relying on autoloading, I need to setup some autoloading. I also need to
instantiate these classes, configure the instances, and attach them to the
Gearman worker.
#!/usr/bin/env php<?phpini_set('memory_limit', -
1);
$autoloader =
function($class) { $file =
str_replace(array('\\',
'_'), DIRECTORY_SEPARATOR,
$class) .
'.php';
return include_once $file;
};
spl_autoload_register
($autoloader);
$mergeSubtree =
new ZF\Git\MergeSubtree
();
$mergeSubtree->
setWorkingDir(__DIR__
);
$log2rss =
new ZF\Git\Log2RSS
();
$log2rss->
setRepo('/home/gitolite/working/zf-master') ->
setFeedName('zf');
$worker =
new GearmanWorker
();
$worker->
addServer();
$worker->
addFunction('post-receive',
array($mergeSubtree,
'execute'));
$worker->
addFunction('rss',
array($log2rss,
'execute'));
while ($worker->
work()) { if (GEARMAN_SUCCESS !=
$worker->
returnCode()) { echo "Worker failed: " .
$worker->
error() .
"\n";
}}
To tie it all together, I'm using supervisord to manage this script, so
that I don't have to worry about it dying on me; it will always be available
when Gearman needs it. I won't go into the setup here; it's incredibly
straight-forward. (Many thanks to Sean Coates for his
2009 PHP Advent article on using supervisord with PHP, and to Mike Naberezny for introducing
me to supervisord many years ago.)
Conclusions
Gearman is a great tool for parallelizing tasks, as well as creating
asynchronous processes. Coupled with supervisord and the scripting language
of your choice, you can achieve some incredible results with very little
effort.
This is also a nice example of cherry-picking ZF components for use in
simple tasks -- I'm using Zend_Log to do reporting on the
status of jobs, and Zend_Feed_Writer to generate the RSS feed.
These are two components that work very well standalone, and which are also
ideally suited for long-running environments, where you don't need to worry
about how long the task takes.
I highly encourage you to investigate using tools for asynchronous
processing -- there are a variety of messaging systems, queues, and more
that you can leverage, and which can help you offload resource intensive
tasks from your main application.
For those of you curious about the subtree merge workflow I'm
developing, I'll be writing additional posts this month on that
subject.