"When is it for?"
    --Brian Eno and Peter Schmidt, Oblique Strategies
The worst kinds of bugs are the ones that don't appear during development, but then randomly appear only in real use. In the case of a complicated program running on several different platforms, such problems are not too surprising; but the first time I ran into one was in a very simple program that ran on the same machine I developed it on. It was a simple SSI counter for a Web page, and it looked like this:
  open COUNTER, "<counter.dat" or die "Can't read-open: $!";
  my $hits = <COUNTER>;
  close(COUNTER);

  ++$hits;
  print "Hits on this page: $hits\n";

  open COUNTER, ">counter.dat" or die "Can't write-open: $!";
  print COUNTER $hits;
  close(COUNTER);
I got it going, and everything seemed to work fine:
  % perl -cw counter.pl
  counter.pl syntax OK
  % echo 0 > counter.dat ; chmod a+rw counter.dat
  % perl -w counter.pl
  Hits on this page: 1
  % perl -w counter.pl
  Hits on this page: 2
  % perl -w counter.pl
  Hits on this page: 3
I tested it in an .shtml Web page and, in the browser, it merrily displayed "Hits on this page: 4", then on reloading displayed "Hits on this page: 5", and so on. When the Web page was put on a public site, it dutifully started reporting "Hits on this page: 249"; I'd check back later and see "Hits on this page: 634", and everything seemed fine. But then I'd look back later and see "Hits on this page: 45". Something was clearly amiss, but I could see absolutely nothing wrong with the tiny counter program. So I sought the advice of others, and they pointed out to me the problem that I will now explain to you.
We as programmers are used to putting ourselves in the shoes of our program, and relating to it as an individual: "What file should I open now? What do I do if I can't open that file? What do I do if that other program went and deleted that file?" and so on. But this handy metaphor breaks down when we need to imagine other simultaneous instances of our program following the same set of instructions. And that's just how the above counter program was getting into trouble. In testing, I never had two instances of the program running at once; but once the counter was on a publicly visible Web page, there were eventually two instances of the counter running at once, with various unfortunate results.
Problems with Simultaneous Instances
Imagine that two people, at about the same instant, are accessing the Web page with the counter discussed above. This leads the Web server to start up an instance of counter.pl for each user, at slightly different times. Let's suppose that the content of counter.dat at the beginning is the number 1000, and trace what each instance does.
  Instance 1                              Instance 2
  -----------------                       -----------------
  open COUNTER, "<counter.dat"
   or die "Can't read-open: $!";
  my $hits = <COUNTER>;
  close(COUNTER);
So instance 1 has read 1000 into $hits. Then:
open COUNTER, "<counter.dat" or die "Can't read-open: $!"; my $hits = <COUNTER>; close(COUNTER);
Instance 2 has read 1000 into $hits. Then:
  Instance 1                              Instance 2
  -----------------                       -----------------
  ++$hits;                                ++$hits;
  print "Hits on this page: $hits\n";     print "Hits on this page: $hits\n";
Each instance increments its $hits and each gets 1001, and each displays that figure to its respective user. Then:
open COUNTER, ">counter.dat" or die "Can't write-open: $!"; print COUNTER $hits; close(COUNTER);
Instance 1 has updated counter.dat to 1001, and then ends. Then finally:
open COUNTER, ">counter.dat" or die "Can't write-open: $!"; print COUNTER $hits; close(COUNTER);
Instance 2 has updated counter.dat to 1001. The problem is that this is incorrect: even though we served the page twice, the counter ends up only 1 hit greater. That's beside the fact that we just told two different users that they were both the 1001st viewer of this page, whereas one was really the 1002nd.
Here's a more drastic case: imagine that the two instances are slightly more out of phase. Suppose instance 1 is writing the value 1501 to counter.dat as instance 2 is starting up and reading it:
  Instance 1                              Instance 2
  -----------------                       -----------------
  open COUNTER, ">counter.dat"
   or die "Can't write-open: $!";
                                          open COUNTER, "<counter.dat"
                                           or die "Can't read-open: $!";
                                          my $hits = <COUNTER>;
  print COUNTER $hits;
  close(COUNTER);
There, instance 1 overwrites counter.dat (making it a zero-length file), but just as it's about to write the new value of its $hits, instance 2 opens that 0-length file and reads from it into its $hits. Reading from a 0-length file is just like reading from the end of any file: it returns undef. Then, instance 1 writes 1501 to counter.dat and ends. But instance 2 is still working:
  Instance 1                              Instance 2
  -----------------                       -----------------
                                          ++$hits;
                                          print "Hits on this page: $hits\n";
                                          open COUNTER, ">counter.dat"
                                           or die "Can't write-open: $!";
                                          print COUNTER $hits;
                                          close(COUNTER);
It has incremented $hits, and incrementing an undef value gives you 1. It then tells the user "Hits on this page: 1", and updates counter.dat with a new value: 1. Our counter just went from 1501 to 1!
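(You can verify that last quirk in isolation; this stand-alone fragment is not part of the counter:)

  my $hits;          # undef, as if read from a zero-length file
  ++$hits;           # Perl treats the undef as 0, so $hits is now 1
  print "$hits\n";   # prints 1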
Each program was perfectly following its own instructions, but together they managed to be wrong. I had tacitly assumed that this case, where two instances coincide, would never happen; but I never actually put anything in place to stop it from happening. Or maybe I'd assumed it could happen, but that the chances were astronomical; after all, it's just a stupid Web page counter anyway. But anything worth doing is worth doing right, and what needed doing here was to make sure that the above scenarios couldn't happen. Moreover, the way to keep this counter program from losing its count is also the way we keep more important data from being lost in other programs: file locking, a UNIX OS feature that's meant to help in just these sorts of cases.
A first hack at using file locking would change the program to read like this:
  use Fcntl ':flock'; # import LOCK_* constants

  open COUNTER, "<counter.dat" or die "Can't read-open: $!";
  flock COUNTER, LOCK_EX;
    # So only one instance gets to access this at a time!
  my $hits = <COUNTER>;
  close(COUNTER);

  ++$hits;
  print "Hits on this page: $hits\n";

  open COUNTER, ">counter.dat" or die "Can't write-open: $!";
  flock COUNTER, LOCK_EX;
    # So only one instance gets to access this at a time!
  print COUNTER $hits;
  close(COUNTER);
So, when a given program instance calls flock FH, LOCK_EX on a given filehandle, it is signaling, via the operating system, that it wants exclusive access to that file; and if some other process has just called flock FH, LOCK_EX first, then our instance will wait around until that process is done. Similarly, once we get a lock on this file, if any other process calls flock FH, LOCK_EX, the OS will make it wait until we're done. The way the above program signals that it's done is by calling close on the filehandle.
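If you want to watch that waiting happen, here's a small stand-alone demonstration (a hypothetical script, not part of the counter). Run it in two terminals at once, and the second instance will sit at the flock until the first one finishes:

  # lockdemo.pl -- run two copies at once to see one of them wait
  use strict;
  use warnings;
  use Fcntl ':flock';

  open my $fh, ">", "demo.lock" or die "Can't write-open demo.lock: $!";
  print "$$: asking for the lock...\n";
  flock $fh, LOCK_EX;     # the second instance blocks right here
  print "$$: got the lock; holding it for 10 seconds.\n";
  sleep 10;
  close $fh;              # releases the lock
  print "$$: done.\n";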
Although it could have called flock COUNTER, LOCK_UN, it's enough to just close it, because of these important facts about locking in the basic UNIX file model:
You can't lock a file until you've already opened it.
When you close a file, you give up any lock you have on it.
If a process ends while it has a file open, the file gets closed.
So the only way a file can be locked at any moment is if a process has opened it, and then locked it, and hasn't yet closed it (either specifically, or by ending).
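As a quick illustration of the second fact (again a stand-alone sketch, not part of the counter):

  use Fcntl ':flock';

  open my $fh, ">", "demo.lock" or die "Can't write-open demo.lock: $!";
  flock $fh, LOCK_EX;   # we now hold the lock
  # ...protected work would go here...
  flock $fh, LOCK_UN;   # explicit unlock -- optional, because...
  close $fh;            # ...closing would have released it anyway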
Unfortunately, this means trouble for our flock-using code. Notably, there can still be a problem with instances being out of phase: since we can't lock a file without having opened it first, things can still happen in that brief moment between opening the file and locking it. Consider when one instance is updating counter.dat just as another new instance is about to read it:
  Instance 1                              Instance 2
  -----------------                       -----------------
  open COUNTER, ">counter.dat"
   or die "Can't write-open: $!";
                                          open COUNTER, "<counter.dat"
                                           or die "Can't read-open: $!";
                                          flock COUNTER, LOCK_EX;
                                          my $hits = <COUNTER>;
                                          close(COUNTER);
  flock COUNTER, LOCK_EX;
There, the OS dutifully kept two instances from having an exclusive lock on the file at once. But the locking came too late: instance 1, just by opening the file for writing, had already overwritten counter.dat with a zero-length file, just as instance 2 was about to read it. So we're back to the same problem that existed before we had any flock calls at all: two processes accessing a file that we wish only one process at a time could access.
Semaphore Files
There are various special solutions to problems like the above, but the most general one involves semaphore files. The line of reasoning behind them goes like this: since you can't lock a file until you've already opened it (and opening a file for writing clobbers its content before you ever get the lock), any content you have in locked files still isn't safe. So just don't have any content at all in a locked file. However, we do have content we need to protect, namely the data in counter.dat; that just means counter.dat can't be the file we go locking. Instead, we'll use some other file, never with any content of interest, whose only purpose is to be a thing that different instances can lock for as long as they want access to counter.dat. The file that we lock but never store anything in, we call a semaphore file.
The way we actually use a semaphore file is by opening it and locking it before we access some other real resource (like a counter file), and then not closing the semaphore file until we're done with the real resource. So, we can go back to our original program and make it safe by just adding code at the beginning to open a semaphore file, and one line at the end to close it:
  use Fcntl ':flock'; # import LOCK_* constants
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  open COUNTER, "<counter.dat" or die "Can't read-open counter.dat: $!";
  my $hits = <COUNTER>;
  close(COUNTER);

  ++$hits;
  print "Hits on this page: $hits\n";

  open COUNTER, ">counter.dat" or die "Can't write-open counter.dat: $!";
  print COUNTER $hits;
  close(COUNTER);

  close(SEM);
This avoids all the problems we saw earlier. Because the above program doesn't do anything with counter.dat until it has an exclusive lock on counter.sem, and doesn't give up that lock until it's done, there can be only one instance of the above program accessing counter.dat at a time.
It can still happen that some other program alters counter.dat without first locking counter.sem -- so don't do that! As long as every process locks the appropriate semaphore file while it's working on a given resource, all is well. All you need to do is settle on some correspondence between file(s) and the semaphore file that controls access to them. It's a purely arbitrary choice, but when naming a semaphore file for a resource file.ext, I tend to name the semaphore file file.sem, file.ext.sem, or file.ext_S. As with any arbitrary decision, I advise picking one style and sticking with it; clearly, the whole purpose of this is defeated if one program looks to counter.sem as the semaphore file while another looks to counter.dat_S.
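If you like, you can pin that choice down in one tiny helper, so every program derives the name the same way (a hypothetical function, shown here using the file.ext.sem style):

  # Hypothetical helper: one agreed-on place for the naming convention.
  sub sem_name_for {
    my $resource = shift;
    return "$resource.sem";   # e.g., "counter.dat" -> "counter.dat.sem"
  }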
Semaphore Objects
With our simple counter program, the simplistic but effective approach was just to bracket the program with this code:
  use Fcntl ':flock'; # import LOCK_* constants
  open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
  flock SEM, LOCK_EX;

  ...do things...

  close(SEM);

  ...do anything else that doesn't involve counter.sem...
That works quite well when our program is simple and involves just one semaphore file: all we need to do is close(SEM) once we're done with counter.sem, or whatever resource the SEM filehandle denotes a lock for. However, when a given program involves a lot of different files (each requiring its own semaphore file, and each being locked and unlocked in arbitrary orders), then you can't just have them all in one global filehandle called SEM. You can use lexical filehandles instead, via the Perl 5.6 open my $fh, ... syntax, as here:
  {
    use Fcntl ':flock'; # import LOCK_* constants
    open my $sem, ">dodad.sem" or die "Can't write-open dodad.sem: $!";
    flock $sem, LOCK_EX;

    ...things dealing with the resource that dodad.sem denotes a lock on...

    close($sem);
  }
In fact, the close($sem) command there isn't strictly necessary: assuming you haven't copied the object from $sem into any other variable, then when the program hits the end of the block where my $sem was declared, Perl will delete that variable's value from memory. Then, seeing that this was the only copy of that filehandle object, it will implicitly close the file, releasing the lock.
The benefit of using my'd filehandles instead of globals is that this prevents namespace collisions: you could have other my $sem variables defined in other scopes in this program, and they wouldn't interfere with this one. But creating each semaphore object would still require the same repetitive open and flock calls, and needless repetition is no friend of programmers. We might as well wrap it up in a function:
  sub sem {
    my $filespec = shift(@_) || die "What filespec?";
    open my $fh, ">", $filespec
     or die "Can't open semaphore file $filespec: $!";
    chmod 0666, $filespec; # assuming you want it a+rw
    use Fcntl 'LOCK_EX';
    flock $fh, LOCK_EX;
    return $fh;
  }
Then, whenever you want a semaphore lock on a file, you need only call:
my $sem = sem('/wherever/locks/thing.sem');
All you would then do with that object in $sem is keep it around as long as you need the lock on that semaphore file; or you could explicitly release the lock with just a close($sem).
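To see how this tidies things up, here is the counter program once more, as a sketch built around that sem() function (assuming sem() from above is in scope):

  # The counter again, sketched around sem() from above.
  my $sem = sem('counter.sem');   # lock first, before touching counter.dat

  open my $counter, "<", "counter.dat" or die "Can't read-open: $!";
  my $hits = <$counter>;
  close($counter);

  ++$hits;
  print "Hits on this page: $hits\n";

  open $counter, ">", "counter.dat" or die "Can't write-open: $!";
  print $counter $hits;
  close($counter);

  close($sem);   # now other instances can have their turn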
If you're an OOP fan, you could even wrap this up in a proper class, an object of which denotes an exclusive lock on a given semaphore file. A minimal class would look like this:
  package Sem;

  sub new {
    my $class = shift(@_);
    use Carp ();
    my $filespec = shift(@_) || Carp::croak("What filespec?");
    open my $fh, ">", $filespec
     or Carp::croak("Can't open semaphore file $filespec: $!");
    chmod 0666, $filespec; # assuming you want it a+rw
    use Fcntl 'LOCK_EX';
    flock $fh, LOCK_EX;
    return bless {'fh' => $fh}, ref($class) || $class;
  }

  sub unlock {
    my $fh = delete $_[0]{'fh'} or return 0;  # no-op if already unlocked
    close($fh);
    return 1;
  }

  1; # End of module
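One refinement you might add (a hypothetical extension, not part of the minimal class above) is a DESTROY method, so the lock is released automatically when the last reference to the object goes away, just as with the plain lexical filehandles earlier:

  # Hypothetical addition to the Sem class:
  sub DESTROY {
    my $self = shift;
    close($self->{'fh'}) if $self->{'fh'};  # release the lock if still held
  }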
Then you need only create the proper semaphore objects, like so:
  use Sem;
  my $sem = Sem->new('/wherever/locks/thing.sem');

  ...later...

  $sem->unlock;

Conclusion
If you've got a data file that's only ever manipulated by one program, and you're sure you'll never run multiple simultaneous instances of that program, then you don't need semaphore files. But you do need them in all other cases; that is, where you have a file or other resource that is accessed by potentially simultaneous processes (whether different programs, or instances of the same program), and that resource could suffer from uncontrolled simultaneous access.
In this article, I've assumed that the programs for which you need semaphore files are all running on the same machine, that that machine runs UNIX (or something with the same basic locking semantics), and that the filesystem you're putting the semaphore files on is not NFS (which often doesn't implement locking properly). In my next The Perl Journal article, I'll discuss what to do if you need semaphore files but either you're not under UNIX, or the processes you need to coordinate are running on several different machines.
Sean M. Burke (sburke@cpan.org) lives in New Mexico, where he mostly does data-munging for Native language preservation projects.