=for comment
This document is in Perl Plain Old Documentation (POD) format.
Converters are available from POD to many text formats.  Ask me for details.

=head1 Resource Locking with Semaphore Files

=head2 Sean M. Burke

"When is it for?" -- Brian Eno and Peter Schmidt, I<Oblique Strategies>

PRODUCTION: indent the above epigraph

The worst kinds of bugs are the ones that don't appear during development,
but then randomly appear only in real use.  In the case of a complicated
program running on several different platforms, such problems are not too
surprising; but the first time I ran into such a problem was in a very
simple program that ran on the same machine I developed it on.  It was a
simple SSI counter for a web page, and it looked like this:

  open COUNTER, "<counter.dat" or die "Can't read-open: $!";
  $hits = <COUNTER>;
  close(COUNTER);
  ++$hits;
  print "Hits on this page: $hits\n";
  open COUNTER, ">counter.dat" or die "Can't write-open: $!";
  print COUNTER $hits;
  close(COUNTER);

I got it going and it all seemed to work fine:

  % perl -cw counter.pl
  counter.pl syntax OK
  % echo 0 > counter.dat ; chmod a+rw counter.dat
  % perl -w counter.pl
  Hits on this page: 1
  % perl -w counter.pl
  Hits on this page: 2
  % perl -w counter.pl
  Hits on this page: 3

I tested it in an F<.shtml> web page and in the browser it merrily
displayed "Hits on this page: 4", then on reloading displayed "Hits on
this page: 5", and so on.

When the web page was put on a public site, it dutifully started reporting
"Hits on this page: 249", and I'd check back later and see "Hits on this
page: 634", and everything seemed fine.  But then I'd look back later and
see "Hits on this page: 45".  Something was clearly amiss, but I could see
absolutely nothing wrong with the tiny counter program.  So I sought the
advice of others, and they pointed out to me the problem that I will now
explain to you.

We as programmers are used to putting ourselves in the shoes of our
program, and relating to it as an individual: "What file should I open
now?  What do I do if I can't open that file?  What do I do if that other
program went and deleted that file?" and so on.  But this handy metaphor
of ours breaks down where we need to imagine other I<instances> of our
program following the same set of instructions.  And that's just how the
above counter program was getting into trouble.  In testing, I never had
two instances of the program running at once; but once the counter was on
a publicly visible web page, there eventually did get to be two instances
of the counter running at once, with various unfortunate results.

=head2 Problems with Simultaneous Instances

Imagine that two people, at about the same instant, are accessing the web
page with the counter discussed above.  This leads the Web server to start
up an instance of F<counter.pl> for each user, at slightly different
times.  Suppose that the content of F<counter.dat> at the beginning is the
number "1000", and let's trace what each instance does.

PRODUCTION: I use indenting to differentiate the two instances, but it
could be reformatted to not be so wide, and/or color/typeface could be
used to differentiate them.

  Instance 1                          Instance 2
  -----------------                   -----------------
  open COUNTER, "<counter.dat"
    or die "Can't read-open: $!";
  $hits = <COUNTER>;
  close(COUNTER);

So instance 1 has read "1000" into C<$hits>.  Then:

                                      open COUNTER, "<counter.dat"
                                        or die "Can't read-open: $!";
                                      $hits = <COUNTER>;
                                      close(COUNTER);

Instance 2 has read "1000" into C<$hits>.  Then:

  ++$hits;
  print "Hits on this page: $hits\n";

                                      ++$hits;
                                      print "Hits on this page: $hits\n";

Each instance increments its C<$hits> and each gets 1001, and each
displays that figure to its respective user.
Then:

  open COUNTER, ">counter.dat"
    or die "Can't write-open: $!";
  print COUNTER $hits;
  close(COUNTER);

Instance 1 has updated F<counter.dat> to 1001, and then ends.  Then
finally:

                                      open COUNTER, ">counter.dat"
                                        or die "Can't write-open: $!";
                                      print COUNTER $hits;
                                      close(COUNTER);

Instance 2 has updated F<counter.dat> to 1001.  The problem is that this
is incorrect: even though we served the page twice, the counter ends up
only 1 hit greater.  That's beside the fact that we just told two
different users that they were both the 1001st viewer of this page,
whereas one was really the 1002nd.

Here's a more drastic case: imagine that the two instances are a bit more
out of phase.  Suppose that instance 1 is writing the value "1501" to
F<counter.dat> just as instance 2 is starting up and reading it:

  Instance 1                          Instance 2
  -----------------                   -----------------
  open COUNTER, ">counter.dat"
    or die "Can't write-open: $!";

                                      open COUNTER, "<counter.dat"
                                        or die "Can't read-open: $!";
                                      $hits = <COUNTER>;
                                      close(COUNTER);

  print COUNTER $hits;
  close(COUNTER);

There, instance 1 overwrites F<counter.dat> (with a zero-length file), but
just as it's about to write the new value of its C<$hits>, instance 2
opens that 0-length file, and reads from it into its own C<$hits>.
Reading from a 0-length file is just like reading from the end of any
file: it returns undef.  Then instance 1 goes and writes "1501" to
F<counter.dat>, and ends.  But instance 2 is still working:

                                      ++$hits;
                                      print "Hits on this page: $hits\n";
                                      open COUNTER, ">counter.dat"
                                        or die "Can't write-open: $!";
                                      print COUNTER $hits;
                                      close(COUNTER);

It has incremented C<$hits>, and incrementing an undef value gives you 1.
It then tells the user "Hits on this page: 1", and now updates
F<counter.dat> with a new value: 1.  Our counter just went from 1501 to 1!

In each of these scenarios, each program was perfectly following its own
instructions, but together they managed to be wrong: in the first case,
each told its user that it was the 1001st hit on this page, and each
updated the F<counter.dat> file with that same figure.  I had tacitly
assumed that this case, where two instances coincide, would never happen;
but I never actually put anything in place to stop it from happening.  Or
maybe I'd assumed it I<could> happen, but that the chances were
astronomical.  And after all, "it's just a stupid web page counter
anyway".

But anything worth doing is worth doing right, and what needed doing here
was some way to make sure that the above scenarios couldn't happen.
Moreover, the way to keep this counter program from losing its count is
also the way we keep more important data from being lost in other
programs: file locking, a Unix OS feature that's meant to help in just
these sorts of cases.

Now a first hack at using file locking would change the program to read
like this:

  use Fcntl ':flock';  # import LOCK_* constants

  open COUNTER, "<counter.dat" or die "Can't read-open: $!";
  flock COUNTER, LOCK_EX;  # So only one instance gets to access this at a time!
  $hits = <COUNTER>;
  close(COUNTER);

  ++$hits;
  print "Hits on this page: $hits\n";

  open COUNTER, ">counter.dat" or die "Can't write-open: $!";
  flock COUNTER, LOCK_EX;  # So only one instance gets to access this at a time!
  print COUNTER $hits;
  close(COUNTER);

PRODUCTION: bold the two "flock COUNTER, LOCK_EX;" lines, and ital the two
"# So only one instance gets to access this at a time!" lines.

So when a given program instance calls C<flock COUNTER, LOCK_EX> on a
given filehandle, it is signaling, via the operating system, that it wants
exclusive access to that file; and if some other process has just called
C<flock COUNTER, LOCK_EX> first, then our instance will wait around until
that other process is done.  And similarly, once we get a lock on this
file, if any other process calls C<flock COUNTER, LOCK_EX>, the OS will
make it wait until we're done.
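If you want to see that waiting behavior for yourself, a tiny test script
makes it visible.  (This sketch is mine, not part of the counter program;
the names F<locktest.pl> and F<locktest.dat> are made up just for this
demonstration.)  Run two copies of it at about the same time, in two
terminals, and the second copy will sit at its C<flock> call until the
first copy closes the file:

  #!/usr/bin/perl -w
  # locktest.pl -- a made-up demo script, not part of the counter program.
  # Run two copies at about the same time (in two terminals): the second
  # copy's "got the lock" message won't appear until the first one is done.
  use strict;
  use Fcntl ':flock';   # import LOCK_* constants

  open LOCKDEMO, ">locktest.dat" or die "Can't write-open locktest.dat: $!";
  flock LOCKDEMO, LOCK_EX;  # blocks here if another copy already has the lock
  print "Process $$ got the lock; holding it for 10 seconds...\n";
  sleep 10;                 # stand-in for doing real work with the file
  close(LOCKDEMO);          # closing the file releases the lock
  print "Process $$ has released the lock.\n";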
The way these programs signal that they're done is by calling C<close> on
the filehandle.  Although they could have called C<flock COUNTER,
LOCK_UN>, it's enough to just close the file, because of these important
facts about locking in the basic Unix file model:

=over

=item *

You can't lock a file until you've already opened it.

=item *

When you close a file, you give up any lock you have on it.

=item *

If a process dies while it has a file open, the file gets closed.

=item *

So the only way a file can be locked at any moment is if a process has
opened it, and then locked it, and hasn't yet closed it (either
explicitly, or by ending).

=back

Unfortunately, this means trouble for our C<flock>-using code.  Notably,
there can still be a problem with instances being out of phase -- since we
can't lock a file without already having opened it, things can still
happen in the brief moment between opening the file and locking it.
Consider when one instance is updating F<counter.dat> just as another new
instance is about to read it:

  Instance 1                          Instance 2
  -----------------                   -----------------
  open COUNTER, ">counter.dat"
    or die "Can't write-open: $!";

                                      open COUNTER, "<counter.dat"
                                        or die "Can't read-open: $!";
                                      $hits = <COUNTER>;
                                      close(COUNTER);

  flock COUNTER, LOCK_EX;

There, the OS dutifully kept the two instances from both holding an
exclusive lock on the file at once.  But the locking is too late, because
instance 1, just by opening the file, has already overwritten
F<counter.dat> with a zero-length file just as instance 2 was about to
read it.  So we're back to the same problem that we had before we had any
C<flock> calls at all: two processes accessing a file that we wish only
one process at a time could access.

=head2 Semaphore Files

There are various special solutions to problems like the above, but the
most general one is semaphore files.  The line of reasoning behind them
goes like this: Since you can't lock a file until you've already opened
it, any content you have in locked files still isn't safe.  So just don't
have any content at all in a locked file.  However, we I<do> have content
we need to protect, namely the data in F<counter.dat>.  But that just
means we can't use that as the file we go locking.  Instead, we'll use
some other file, never with any content of interest, whose only purpose
will be to be a thing that different instances can lock for as long as
they want access to F<counter.dat>.  The file that we lock but never store
anything in, we call a B<semaphore file>.

The way we actually use a semaphore file is by opening it and locking it
before we access some other real resource (like a counter file), and then
not closing the semaphore file until we're done with the real resource.
So we can go back to our original program and make it safe by just adding
code at the beginning to open a semaphore file, and one line at the end to
close it:

  use Fcntl ':flock';  # import LOCK_* constants
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  open COUNTER, "<counter.dat" or die "Can't read-open counter.dat: $!";
  $hits = <COUNTER>;
  close(COUNTER);
  ++$hits;
  print "Hits on this page: $hits\n";
  open COUNTER, ">counter.dat" or die "Can't write-open counter.dat: $!";
  print COUNTER $hits;
  close(COUNTER);

  close(SEM);

This avoids all the problems we saw earlier: since the above program
doesn't do anything with F<counter.dat> until it has an exclusive lock on
F<counter.sem>, and doesn't give up that lock until it's done, there is
only ever one instance of the above program accessing F<counter.dat> at a
time.  It can still happen that some other program alters F<counter.dat>
without first locking F<counter.sem> -- so don't do that!  As long as
every process locks the appropriate semaphore file while it's working on a
given resource, all is well.
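For instance, suppose some other program -- a purely hypothetical
F<report_hits.pl>, not something from the counter setup above -- only ever
I<reads> the count, say to print a daily report.  Even though it never
writes to F<counter.dat>, it should still lock F<counter.sem> first, so
that it can't catch F<counter.dat> in a half-rewritten (or zero-length)
state.  A sketch of such a well-behaved reader:

  #!/usr/bin/perl -w
  # report_hits.pl -- a hypothetical read-only user of counter.dat, shown
  # only to illustrate that every process touching the counter should lock
  # the same semaphore file first.
  use strict;
  use Fcntl ':flock';   # import LOCK_* constants

  # Take the lock on the semaphore file before touching counter.dat...
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  # ...now it's safe to read the real resource.
  open COUNTER, "<counter.dat" or die "Can't read-open counter.dat: $!";
  my $hits = <COUNTER>;
  close(COUNTER);

  close(SEM);   # done with counter.dat, so give up the lock

  print "The counter currently stands at $hits\n";

The exclusive C<LOCK_EX> lock here is simply the easiest thing that works:
it makes readers and the counter program strictly take turns.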
All that you need to do is settle on some correspondence between file(s)
and the semaphore file that controls access to them.  It's a purely
arbitrary choice; but when naming a semaphore file for a resource
F<thing.dat>, I tend to name the semaphore file F<thing.dat.sem>,
F<thing.sem>, or the like.  As with any arbitrary decision, I advise
picking one style and sticking with it -- clearly the whole purpose of
this is defeated if one program looks to F<thing.dat.sem> as the semaphore
file, while another looks to F<thing.sem>.

=head2 Semaphore Objects

With our simple counter program, our simplistic but effective approach was
to just bracket our program with this code:

  use Fcntl ':flock';  # import LOCK_* constants
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  ...do things...

  close(SEM);

  ...do anything else that doesn't involve counter.sem...

That works quite well when our program is simple and involves just one
semaphore file -- all we need to do is C<close(SEM)> once we're done with
F<counter.dat> or whatever resource the SEM filehandle denotes a lock for.
However, when a given program involves a lot of different files, each of
which requires its own semaphore file, and which are being locked and
unlocked in arbitrary orders, then you can't just have them all in one
global filehandle called "SEM".  You can use lexical filehandles, using
the Perl 5.6 C<open my $fh, ...> syntax, as here:

  {
    use Fcntl ':flock';  # import LOCK_* constants
    open my $sem, ">dodad.sem" or die "Can't write-open dodad.sem: $!";
    flock $sem, LOCK_EX;

    ...things dealing with the resource that dodad.sem denotes a lock on...

    close($sem);
  }

In fact, the C<close> command there isn't strictly necessary -- assuming
you haven't copied the object from C<$sem> into any other variable in
memory, then when the program hits the end of the block where C<$sem> was
declared, Perl will delete that variable's value from memory; and then,
seeing that that was the only copy of that filehandle object, it will
implicitly close the file, releasing the lock.  The benefit of using
C<my>'d filehandles instead of globals is that it avoids namespace
collisions: you could have other C<$sem> variables defined in other scopes
in this program, and they wouldn't interfere with this one.

But creating each semaphore object would still require the same repetitive
C<open> and C<flock> calls, and needless repetition is no friend of
programmers.  We might as well wrap it up in a function:

  sub sem {
    my $filespec = shift(@_) || die "What filespec?";
    open my $fh, ">", $filespec
      or die "Can't open semaphore file $filespec: $!";
    chmod 0666, $filespec;  # assuming you want it a+rw
    use Fcntl 'LOCK_EX';
    flock $fh, LOCK_EX;
    return $fh;
  }

And then whenever you want a semaphore lock on a file, you need only call:

  my $sem = sem('/wherever/locks/thing.sem');

All you would then do with that object in C<$sem> is keep it around as
long as you need the lock on that semaphore file; or you could explicitly
release the lock with just a C<close($sem)>.  If you were an OOP fan, you
could even wrap this up in a proper class, an object of which denotes an
exclusive lock on a given semaphore file.
A minimal class would look like this:

  package Sem;

  sub new {
    my $class = shift(@_);
    use Carp ();
    my $filespec = shift(@_) || Carp::croak("What filespec?");
    open my $fh, ">", $filespec
      or Carp::croak("Can't open semaphore file $filespec: $!");
    chmod 0666, $filespec;  # assuming you want it a+rw
    use Fcntl 'LOCK_EX';
    flock $fh, LOCK_EX;
    return bless {'fh' => $fh}, ref($class) || $class;
  }

  sub unlock {
    close(delete $_[0]{'fh'} or return);
    return 1;
  }

  1;  # End of module

Then you need only create the proper semaphore objects like so:

  use Sem;
  my $sem = Sem->new('/wherever/locks/thing.sem');
  ...later...
  $sem->unlock;

=head2 Conclusion

If you've got a data file that's only ever manipulated by one program, and
you're sure you'll never run multiple simultaneous instances of that
program, then you don't need semaphore files.  But you do need semaphore
files in all other cases -- whenever you have a file or other resource
that is accessed by potentially simultaneous processes (whether different
programs, or instances of the same program), and that resource could
suffer from uncontrolled simultaneous access.

In this article, I've assumed that the programs that you need semaphore
files for are all running on the same machine, that that machine runs Unix
(or something with the same basic locking semantics), and that the
filesystem you're putting the semaphore files on isn't NFS (which often
doesn't implement locking properly).  In my next article, I'll discuss
what to do if you need semaphore files, but either you're not under Unix,
or the processes you're needing to coordinate are running on several
different machines.

__END__

Sean M. Burke lives in New Mexico, where he mostly does data-munging for
Native language preservation projects.