Sean M. Burke
Find a safe part and use it as an anchor
-- Brian Eno and Peter Schmidt, Oblique Strategies
In my previous The Perl Journal article, Resource Locking with Semaphore Files, I showed some of the problems that can appear when simultaneously running processes try to access the same file (or other scarce resource); as a solution, I proposed semaphore files based on Unix flock. However, that depended on all the tasks in question being on the same machine, and on that machine basically supporting a Unix file-locking model. In this article, I'll deal with how to do resource locking when we can't assume any of these things: where the processes in question might be on different machines with no common filesystem, and where possibly none of them are Unixes.
Many Ways To Do It
The thing that makes file locking work correctly is that no two processes can succeed at it at the same time: if several try, then only one is allowed to succeed. That's not a characteristic of just file locking, though. Some database operations behave the same way (inserting a record with a unique key, for example, fails if such a record already exists), and so does the humble mkdir: if several processes try to create the same directory, exactly one of them succeeds.
When doing resource locking over a network, many people use some variation of the database approach. That approach is inconvenient unless you have a network-accessible database system already set up; and it's not possible unless you know which of the database system's operations behave in this all-or-nothing way. The mkdir approach is the one I prefer instead, and it works like this:
There could be several concurrent processes that want to manipulate some resource (like reading in a file called /stuff/big_config.txt, making some changes, and then re-writing it). If it would be bad for more than one of them to be manipulating that resource at the same time, then before any process deals with that resource, it must try to mkdir /stuff/big_config_lock. If that mkdir operation succeeds, then that process has permission to manipulate the /stuff/big_config.txt file. But if a process calls mkdir /stuff/big_config_lock and it doesn't succeed, then we take this to mean that the directory already exists, and that some other process must be in the middle of accessing the /stuff/big_config.txt resource. Or in Perl terms:
sub wait_until_lock_dir {
  my $dir = $_[0];
  until( mkdir $dir ) { sleep 2 }
  return;
}

sub unlock_dir {
  my $dir = $_[0];
  warn "Couldn't rmdir $dir: $!" unless rmdir $dir;
  return;
}

wait_until_lock_dir("/stuff/big_config_lock");
print "OK, I've got the lock!\n";

... do whatever with /stuff/big_config.txt ...

unlock_dir("/stuff/big_config_lock");
Compared to the Unix file-locking model, this model has the following disadvantage. Suppose, under the Unix file-locking model, that a process opens and locks a semaphore file, and then dies while manipulating the resource. The operating system will close the file, and thereby end the lock; so under that model, it's impossible for a file to be locked unless there is a running process that holds the lock. But suppose that instead of a semaphore file, we're using the existence of a directory to mean that the resource is in use. If a process starts up, successfully makes the directory, and then dies while manipulating the resource, then there's nothing to automatically make that directory go away. Unless something comes along and deletes it, every process that tries to deal with that resource will end up waiting forever for it to be available. This problem can be coped with in various ways, as I'll describe later in this article.
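The property this whole scheme leans on is that mkdir is atomic: of any number of processes racing to create the same directory, exactly one wins. Here's a quick standalone sanity check of that behavior (not part of the article's scripts; the lock-directory name is just an illustration, and File::Temp gives us a throwaway place to work in):

```perl
#!/usr/bin/perl
# Sanity check: a second mkdir on the same path must fail,
# because the first one already created the directory.
use strict;
use warnings;
use File::Temp qw(tempdir);

my $scratch = tempdir( CLEANUP => 1 );     # throwaway work area
my $lock    = "$scratch/big_config_lock";  # illustrative lock dir

my $first  = mkdir $lock;   # should succeed: nothing there yet
my $second = mkdir $lock;   # should fail: it already exists

print "first mkdir: ",  ($first  ? "won" : "lost"), "\n";
print "second mkdir: ", ($second ? "won" : "lost"), "\n";

rmdir $lock or warn "Couldn't rmdir $lock: $!";
```

Only the winner of that race gets to touch the shared resource; everyone else loops and retries, which is exactly what wait_until_lock_dir above does.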
This use of mkdir as a locking mechanism has two advantages over normal Unix-ish flock. First off, it works under OSes other than Unix! Second off, it works across ftp! That is, instead of making the directory on the local filesystem with mkdir($dirname), you'd log into an ftp server and call $ftp->mkdir($dirname). The point of doing this is that the ftp server can then serve as a lock server for any processes that can access it.
Using ftp mkdir for Resource Locking
So let's use a concrete example here, one that should be familiar to all our readers. You're the Chief Technical Officer for KAOS, the worldwide consortium of supervillains. Among your many evil duties is coordinating the exchange of the staff phone number directory, phone.tsv. In an ideally evil world, you would have all the evil resources needed to build a solid network. But for years now, all available funds have been diverted to the construction of the giant orbiting death-ray, leaving you with only a ramshackle mess of legacy systems.
Notably, the chief of Henchman Resources, Generalissima Mok, personally updates the phone directory on a daily basis, as old staff members are killed for their incompetence and others are promoted. She has learned to do this using FileMaker on an old Apple laptop running MacOS 7. You once meekly suggested that she use a Web interface for maintaining the phone database, and the Generalissima reacted by torturing you with electric shocks until you promised to forget the whole idea.
She did, however, let you write and install a MacPerl program on her laptop to ftp the phone.tsv file from her Mac to the main server, and she dutifully runs the program after every evil update to the database. The meat of that program looks like this:
use Net::FTP;   # The class for FTP connections

my($host, $user, $pass, $file) = (
  'ftp.kaos.int', 'phoneacc', 'dmb4ever', 'phone.tsv'
);

my $conn = Net::FTP->new($host)
 || die "Can't ftp-connect to $host: $@";
$conn->login($user, $pass)
 || die "Can't log in as $user\@$host";
$conn->put($file)
 || die "Can't ftp-put $file";
$conn->quit();
undef $conn;
And that program does a fine job of ftping up the file when run. The problem you quickly run into is that others might be downloading phone.tsv from the server just as she's uploading it, resulting in their getting a file that's truncated or garbled. This upsets people, because if they fail to find some friend's name in the phone directory, they assume he has been killed for incompetence. This leads to a tense work environment. You solve the problem of unnerving partial transfers by using a lock directory, like so:
...
$conn->login($user, $pass)
 || die "Can't log in as $user\@$host";

my $lock_dir = 'phone_lock';
until( $conn->mkdir($lock_dir) ) { sleep 3 }

# Send it to the remote server
$conn->put($file) || die "Can't put ftp file $file";

$conn->rmdir($lock_dir);
$conn->quit();
undef $conn;
The until line waits until this is the one and only process that succeeds in making the directory phone_lock, which signals that it alone is writing to the file. The only other new command that we've added here is the rmdir, which deletes phone_lock, signaling that we're no longer staking a claim on this resource, now that we're done with it.
On various hosts that need their own copy of phone.tsv, you write Perl programs to do the downloading. They look exactly like the above program, except that the put line is replaced with this:
# Get it from the remote server
$conn->get($file) || die "Can't get ftp file $file";
So if one of those processes logs in while Generalissima Mok is uploading her phone.tsv, it'll get as far as the until line that tries to create phone_lock. But that mkdir command will fail, causing the loop body of sleep 3 to be executed; then it'll try again, and keep trying until the Generalissima has finished the upload and rmdir'd the phone_lock directory. Then, when this downloader process tries to call mkdir again, it will succeed, and the download can continue.
Or, going the other way, if the Generalissima's uploader process tries to log in while some other process is downloading phone.tsv, it will go into a sleep loop until that process is finished and rmdirs the directory.
There are three problems with this system:
First, the system dutifully stops Generalissima Mok from uploading a phone.tsv at the same moment that anyone is downloading phone.tsv. But it does this by stopping any two processes, of any kind, from accessing the file at the same time. This is a bit broad: two processes downloading the same file at the same time is harmless, and stopping that from happening is a problem only in that it slows things down somewhat. But I see no way around it.
Second, what if a process can't rmdir the lock directory? I've seen this happen only in one particular case, where a server would let a user mkdir a directory under an a+rwx directory that was owned by someone else, but wouldn't let the same user rmdir the directory that he'd just created! The solution is to run an interactive ftp session before you set up your lock-dir system, to make sure that your ftp host doesn't behave that way. If it does, then put the lock directory in a directory the account owns, so that the account can rmdir it as needed. This should just be a simple matter of setting $lock_dir to something different, like this:
my $lock_dir = '/evilusers/phoneacc/phone_lock';
And third, what if a process successfully makes the directory phone_lock, then starts downloading or uploading phone.tsv, but then the ftp connection dies because of a network outage? In that case, phone_lock would stick around forever. As I pointed out, this isn't a problem with Unix flock, but it's sure a problem here. More and more processes would accumulate, one by one connecting to the remote server and waiting literally forever for phone_lock to be removed. I see no general solution to this problem; but in these circumstances, I have used two makeshift solutions. First, I make the clients give up after a while, by changing the until loop to this:
my $give_up_after = time() + 10 * 60;
  # Give up ten minutes from now
until( $conn->mkdir($lock_dir) ) {
  sleep 3;
  die "Tired of waiting for $lock_dir!"
   if time() > $give_up_after;
}
Second, I decide that a lock should not be any good if it's beyond a certain age. The simplest way to do this is to crontab a process to run every few minutes on the ftp server and to forcibly remove that directory if it exists but is old. This is done on the presumption that any lock directory that old must be the leavings of a process that didn't clean up after itself, since no legitimate operation could take that long [1]. The program to run on the ftp server is a simple one:
#!/usr/bin/perl
# Delete phone_lock if it's from earlier than six minutes ago
use strict;
my $dir = "/evil/ftp/phone_lock";
my $six_mins_ago = time() - 6 * 60;
if( -e $dir and (stat(_))[9] < $six_mins_ago ) {
  rmdir $dir or warn "Couldn't rmdir $dir: $!";
}
exit;
I arbitrarily chose ten minutes as how long any process would wait for a lock, and six minutes as the maximum lifetime of a lock directory. You'd need to come up with figures of your own, which may be much longer or shorter, depending on the time it could reasonably take any process to transfer phone.tsv, and on the number of processes that are ever likely to be queued up waiting for a lock at any one time.
Incidentally, you can write a program like the above that removes old lock directories, and have it run not locally, but instead over ftp. But the mechanics of asking the age of a directory over ftp are very implementation-dependent, and this is left as an exercise for the reader.
Locksmith Processes
The real trouble in the above scenario is the possibility of processes creating a phone_lock directory but failing to remove it. The program I showed previously, which removes too-old phone_lock directories, will eventually catch all such cases. But in some situations we can do better. A program going away without cleaning up its phone_lock directory might be the result of network trouble, and we can't do much about that. But often it's the result of the program throwing an exception. That's not very likely in the above program, because the only thing happening between the mkdir and the rmdir is a single get or put operation. But suppose that the program were more complex, pulling down one file, running some sort of elaborate conversion on it, and then sending it back up; and suppose that that conversion could potentially run out of system memory. That isn't a trappable exception; the OS will simply kill the process. In that case, there's no way for this process to be sure to rmdir that lock directory.
However, if we have mostly Unix-like [2] process control where the program is running, we can delegate the whole business of obtaining and releasing the lock to a very simple independent subprocess, which I call a locksmith. All that a locksmith process needs to do is get the lock (or report that it can't), and then wait around for one of two things to happen: either the parent process's signaling that it wants the lock to be released, or the parent process's untimely demise. When either happens, the locksmith process tidily removes the lock directory.
In very Perlish fashion, there's more than one way to do it, but this is the way I implement it: the parent process opens a one-way pipe from netlock, which is what I've named my locksmith program. The parent passes a profile name (which identifies the login information to use) as a command-line parameter, and then netlock prints one line: either something starting with OK to signal that it got the lock, or anything else (consisting of an error message) to signal that it couldn't get a lock. Then it just waits either for the parent process to die (which can be discerned with getppid() == 1) or for the parent process to close the child's pipe to the parent (which can be detected by a SIGPIPE signal being thrown when the child tries to write to that pipe).
The netlock program is simple enough, introducing only a few genuinely new constructs:
use strict;
# netlock.  Call me like: "netlock phonestuff"

my %profile = (
  'phonestuff' => [ 'phoneacc', 'dmb4ever',
                    'ftp.kaos.int', 'phone_lock' ],
);

# - - - - - - - - - - - - - - - -
use Net::FTP;
my $profile = $ARGV[0];
$| = 1;  # Don't buffer my STDOUT!

sub bye { print @_, "\n"; exit }
 # Routine to complain that we couldn't get a lock

bye "What profile?" unless defined $profile;
my $auth = $profile{$profile};
bye "I don't know the profile $profile!" unless $auth;

my($user, $pass, $host, $dir) = @$auth;
my $ftp = Net::FTP->new($host)
 or bye "Can't connect to $host";
$ftp->login($user, $pass)
 or bye "Can't log into account $user\@$host";

my $quit_flag;
$SIG{'PIPE'} = sub { $quit_flag = 1 };

my $locked;
until( $quit_flag or 1 == getppid ) {
  if( $ftp->mkdir($dir) ) { $locked = 1; last; }
  sleep 4;  # Don't totally hammer the ftp server!
}
exit unless $locked;  # If we aborted early somehow.

print "OK\n";  # Tell the parent that we got the lock!

# Wait around until the parent says to release, or dies.
until( $quit_flag or 1 == getppid ) {
  $ftp->pwd();  # Don't let the connection time out.
  sleep 5;
  print STDOUT '.';
}

$ftp->rmdir($dir);
$ftp->quit();
exit;
This may seem a bit mysterious, but there are only three unusual things here. First, there's getppid, which returns the process ID of the parent process (i.e., the process for which netlock is acting as locksmith). A basic fact to note is that, under Unix, it returns 1 if and only if the original parent process has died. Second, there is this $SIG{'PIPE'} = sub {$quit_flag = 1} line. This simply tells Perl: from now on, in this process, if you get a SIGPIPE signal, call this routine, which consists just of setting $quit_flag to 1. And third, we have this mysterious loop that prints a period to STDOUT every five seconds. Since the parent never reads anything past the first line, you might expect that the dots do no good. However, we are printing them for their side effect: if we print anything to STDOUT (our pipe to our parent process) and there's no SIGPIPE signal, then our channel to our parent is alive and well. But if the parent has closed the pipe, then the printing to that pipe will throw a SIGPIPE, which will cause $quit_flag to be set to 1, which will cause the loop to bail out, and the program to rmdir the directory across FTP, and then quit.
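The SIGPIPE side effect can be seen in isolation with a tiny standalone sketch (not part of netlock): it creates a pipe, widows it by closing the read end, and shows that the next write trips the handler, just as the parent closing its end of the pipe would in netlock:

```perl
#!/usr/bin/perl
# Standalone sketch: writing to a pipe whose read end has closed
# raises SIGPIPE, which our handler turns into a flag.
use strict;
use warnings;
use IO::Handle;

my $quit_flag = 0;
$SIG{'PIPE'} = sub { $quit_flag = 1 };

pipe(my $reader, my $writer) or die "Can't pipe: $!";
$writer->autoflush(1);   # flush so each print really write()s

print $writer '.';       # read end still open: no signal raised
my $before = $quit_flag; # still 0 at this point

close $reader;           # simulate the parent closing its end
print $writer '.';       # now the write raises SIGPIPE

print "flag before: $before, flag after: $quit_flag\n";
```

This is the entire "heartbeat" mechanism: as long as the dots go through silently, the parent is alive; the first dot that bounces tells the locksmith to clean up and exit.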
As large as that netlock program is, it encapsulates the whole business of getting a lock, so that the parent process need only call it like so:
open LOCKER, "/path/to/netlock phonestuff |"
 or die "Couldn't pipe from 'netlock phonestuff' ?!";
my $status = <LOCKER>;
$status = '[netlock startup error]' unless defined $status;
die "$status from netlock phonestuff" unless $status =~ m/^OK/s;
and then when we want to release the lock, we simply call close(LOCKER).
Object Interface to Locksmith Processes
Just as I ended the last article with a class implementing a simple interface to flock semaphore files, I'll end this article with a class implementing a simple interface to ftp lock directories, as accessed and tended by a netlock locksmith process.
package NetLock;

sub new {
  my($class, $tag) = @_;
  die "You forgot to specify a tag!"
   unless defined $tag and length $tag;
  open my $fh, "perl -S netlock $tag |"
   or die "Couldn't open netlock $tag: $!";
  my $status = <$fh>;
  $status = '[netlock startup error]' unless defined $status;
  die "$status from netlock $tag" unless $status =~ m/^OK/s;
  return bless { 'fh' => $fh }, ref($class) || $class;
}

sub unlock {
  my $fh = delete $_[0]{'fh'};
  return 0 unless defined $fh;
  close($fh);
}

1;
An object of class NetLock represents a lock on a resource, as obtained across ftp by a locksmith process that will delete that lock either when this process asks it to, or when this process dies. Such an object is made by calling NetLock->new('tagname'), where tagname is a string, such as phonestuff, that is in netlock's %profile hash of account data. The lock stays around either for as long as the object is around, or until we dismiss the lock by calling $lock->unlock, which will return false if we've already closed it, or true in the case where it was just now closed.
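Pulling it all together, a caller's use of the class might look like this. (This is an illustrative sketch, not code from the article: it assumes the class above has been saved as NetLock.pm somewhere Perl can find it, and that netlock is on your path.)

```perl
use strict;
use NetLock;   # the class above, saved as NetLock.pm

# Block until our locksmith subprocess reports "OK":
my $lock = NetLock->new('phonestuff');

# ... safely get or put phone.tsv here ...

$lock->unlock;   # explicitly release the lock
```

Note that explicitly calling unlock is optional in one sense: if $lock simply goes out of scope, or the whole process dies, the pipe to the locksmith closes, the locksmith gets its SIGPIPE, and the lock directory is removed anyway.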
Sean M. Burke (sburke@cpan.org) quit his job at KAOS to work on a book for O'Reilly. It's called Perl & LWP and should be out this summer. Sean would like to thank Arnar Hrafnkelsson, Uri Guttman, Conrad Heiney, Jim Tranowski, and Ronald Schmidt for help and encouragement on the locksmith part of this article.
[1] Incidentally, this doesn't solve the quite serious
problem of the aborted upload leaving a truncated phone.tsv on the server.
To deal with this, the usual approach is to upload your phone.tsv as
a randomly named temp file, then make the lock directory, delete the remote server's
old phone.tsv, rename your newly uploaded temp file to phone.tsv,
and then rmdir the lock directory. Corresponding solutions apply in the
equally serious case of aborted downloads leaving a truncated local phone.tsv.
[return to text]
[2] And by Unix-like, I do not mean MSWindows.
However, as this article is going to press, Ronald Schmidt has found a brilliantly
simple way to get this working under MSWindows: instead of setting a signal
handler and checking getppid (neither of which is currently implemented
under Perl for MSWindows), you just check the return value of the
print STDOUT '.'; statement. If it's a false value, then the pipe has broken, and the
until loop should be exited!
[return to text]